Clustering (K-Means) basic

This experiment groups similar companies into clusters based on the text of their Wikipedia articles. The trained model can also be used to assign a new company to a cluster.

#Clustering: Find similar companies
This experiment demonstrates how to use the K-Means clustering algorithm to segment companies from the Standard & Poor's (S&P) 500 index, based on the text of the Wikipedia article about each company.
![enter image description here][1]
#Data
The articles from Wikipedia were pre-processed outside Azure ML Studio to extract and partially clean text content related to each company. The processing included:
- Removing wiki formatting
- Removing non-alphanumeric characters
- Converting all text to lowercase
- Adding company categories, where known

For some companies, no article could be found, so the data set contains fewer than 500 records.
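
The original pre-processing script is not part of this experiment, so the following is only a rough sketch of what the cleaning steps listed above might look like. The function name `clean_wiki_text` and the specific regular expressions are assumptions for illustration. Real wiki markup has many more cases than these patterns cover, and the step that adds company categories is not shown.

```python
import re

def clean_wiki_text(raw):
    """Approximate the listed steps: strip wiki markup, drop non-alphanumerics, lowercase."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", raw)                     # remove {{templates}}
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"<[^>]+>", " ", text)                           # remove HTML-like tags
    text = re.sub(r"'{2,}", "", text)                              # remove bold/italic quote runs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)                    # drop non-alphanumeric characters
    return re.sub(r"\s+", " ", text).strip().lower()               # collapse whitespace, lowercase

sample = "'''Acme Corp.''' is an [[United States|American]] {{cite web}} company.<br/>"
print(clean_wiki_text(sample))
```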
#Model
First, the contents of each Wikipedia article were passed to the Feature Hashing module, which tokenizes the text string and then transforms the data into a series of numeric features, based on the hash value of each token.
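
To make the idea of feature hashing concrete, here is a minimal sketch in plain Python. It is not the Feature Hashing module itself: the hash function (CRC32 from `zlib`), the whitespace tokenizer, and the default of 10 hash bits are assumptions chosen to keep the example short and deterministic.

```python
import zlib

def hash_features(text, n_bits=10):
    """Map a text to a fixed-length count vector with 2**n_bits buckets."""
    n_buckets = 2 ** n_bits
    vec = [0.0] * n_buckets
    for token in text.split():                     # naive whitespace tokenizer
        h = zlib.crc32(token.encode("utf-8"))      # deterministic 32-bit hash of the token
        vec[h % n_buckets] += 1.0                  # count the token in its bucket
    return vec

vec = hash_features("oil gas oil exploration", n_bits=4)
print(len(vec), sum(vec))
```

Every article ends up with the same number of columns regardless of its vocabulary. Different tokens can collide in one bucket, which is the usual trade-off of this technique.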
Even after this transformation, the data is still too high-dimensional and sparse to be used directly by the K-Means clustering algorithm. Therefore, Principal Component Analysis (PCA) was applied using a custom R script in the Execute R Script module to reduce the dimensionality to 10 variables. You can review the result of PCA by double-clicking the right-hand output of the Execute R Script module.
Through trial and error, we learned that the first variable in the PCA-transformed data had the highest variance and appeared to have a detrimental effect on clustering. Therefore, we removed it from the feature set using Project Columns.
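
The experiment performs PCA in R. The sketch below shows the same two steps (project onto the leading principal components, then discard the first one) in standard-library Python, using power iteration with deflation. The function names and the iteration count are assumptions for illustration. For real data, a numerical library would be a more practical choice than this hand-rolled version.

```python
import math
import random

def pca_scores(rows, n_components, n_iter=200, seed=0):
    """Project rows onto their top principal components (power iteration + deflation)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        # Start from a random unit vector.
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:          # no variance left in the deflated matrix
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflate: remove the component just found so the next one can emerge.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]

def drop_first_component(scores):
    """Discard the highest-variance column, as done with Project Columns."""
    return [row[1:] for row in scores]

rows = [[float(i), float((i * 7) % 5), 0.0] for i in range(20)]
features = drop_first_component(pca_scores(rows, n_components=2))
print(len(features), len(features[0]))
```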
Once the data was prepared, we added a K-Means Clustering module and trained models on the text data. Finally, we used Metadata Editor to change the cluster labels into categorical values.
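
For readers who want to see what training a K-Means model and assigning a new company to a cluster involve, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the K-Means Clustering module: the random initialization, the iteration cap, and the toy two-dimensional points are assumptions for illustration only.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point (the cluster assignment)."""
    return min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # pick k points as initial centroids
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])          # keep an empty cluster's centroid
        if new_centroids == centroids:                      # converged
            break
        centroids = new_centroids
    return centroids, [nearest(p, centroids) for p in points]

# Toy stand-ins for the reduced feature vectors of six companies.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = kmeans(points, k=2)

# Assigning a new company means finding the nearest trained centroid.
new_company = [9.0, 9.0]
print(labels, nearest(new_company, centroids))
```

The cluster indices themselves are arbitrary identifiers rather than quantities, which is why it makes sense to treat them as categorical values afterwards.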
![enter image description here][99]
Thanks to Microsoft - Brandon Rohrer. https://gallery.cortanaintelligence.com/Experiment/Clustering-Find-similar-companies-23
<br><br>
----------
> This ML experiment is for [Microsoft Azure Machine Learning Course][101].<br>
For the complete experiment list [Click here][102].<br>
Laploy | laploy@gmail.com | 084 007 5544 | [www.laploy.com][103]<br>
![enter image description here][104]
----------
[101]: https://notebooks.azure.com/laploy/libraries/loyml/html/00001%20Sessions%20summary.ipynb
[102]: https://gallery.cortanaintelligence.com/Home/Author?authorId=81E333F747E3429B55A3445E6714C36F60B397C13B4D0B07F34DEF1421F64D73
[103]: http://laploy.com
[104]: https://raw.githubusercontent.com/laploy/mli/master//loy-small.jpg
[1]: https://raw.githubusercontent.com/laploy/mli/master//12520-000.PNG
[99]: https://raw.githubusercontent.com/laploy/mli/master//12520-099.PNG