After some research on which datasets I could obtain from the web, I found a women's clothing dataset from a real e-commerce business.

I thought it would be both interesting and useful to automate the extraction of insights from the business's clothing reviews, because reading thousands of reviews by hand is difficult and time-consuming.

This could be valuable for various reasons.

For example:

Understand trends: find out what people are talking about, and what they like or do not like.

Improve your products based on user feedback.

Follow up with users about the products they don’t like, to understand the problem further.

Decrease the return rate: restocking fees are one of the big expenses that an e-commerce business has to control to succeed or even stay alive.

These are just a few of the things you could do with customer reviews.

Problems that I want to solve

For the purpose of this project, I want to explore the following:

Topic modeling: for example, what are the positive and negative things people are saying about the clothing or shoes?

I want to see whether I can find topics by calculating the frequencies of words, or combinations of words, that occur in a topic.

“Separation” of good and bad reviews using clustering: find patterns that separate bad and good reviews for different products, so that they can be sent to the corresponding departments for attention.

This could be very hard, since clustering is an unsupervised machine learning technique: it finds hidden patterns in the data without using labels, so nothing guarantees that the clusters line up with good and bad reviews.

Generate a WordCloud to see which words people use most frequently.

Perform topic modeling to see whether I can find clearly distinct topics that people are talking about.

Use clustering methods to find patterns in my text data and see whether I can isolate the bad reviews (or separate types of reviews).

Use TSNE to visualize my clusters.

Lastly, treat it as a supervised learning problem, using the Rating column from the dataset to classify good and bad reviews.

Data And Technologies I used

The dataset I used can be obtained from Kaggle and consists of 23486 entries of clothing reviews with 11 columns.

Figure: snapshot of the data

The tools that I used in this project are numpy, pandas, matplotlib, seaborn, wordcloud, and sklearn, especially CountVectorizer, TfidfVectorizer, KMeans, TSNE, NMF, TruncatedSVD, silhouette_score, MultinomialNB and LogisticRegression.

Data Cleaning & Exploratory Data Analysis (EDA)

Figure: how many NAs in the dataset

There are some NAs in the dataset, and I simply drop those rows.

The ReviewText column will be my primary column for NLP.

Besides the ReviewText column, I created another column called CombinedText, which joins the Title and ReviewText columns together, because I think the review title may carry some hidden information as well.

Lastly, I pickle my cleaned data for further use.
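The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual code: the sample rows and the pickle file name are assumptions, and only the column names Title, ReviewText, Rating and CombinedText come from the post.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the reviews table (made-up rows; the real data has 11 columns).
df = pd.DataFrame({
    "Title": ["Love it", None, "Too small", "Great buy"],
    "ReviewText": ["Fits perfectly.", "Nice color.", None, "Soft fabric."],
    "Rating": [5, 4, 2, 5],
})

# Count the missing values in each column before dropping anything.
na_counts = df.isna().sum()
print(na_counts)

# Drop every row that contains an NA, then join title and review text.
clean = df.dropna().reset_index(drop=True)
clean["CombinedText"] = clean["Title"] + " " + clean["ReviewText"]

# Pickle the cleaned frame for later steps (hypothetical file name).
pickle_path = os.path.join(tempfile.gettempdir(), "clean_reviews.pkl")
clean.to_pickle(pickle_path)
restored = pd.read_pickle(pickle_path)
```

Dropping every row with any NA is the simplest option; on the real data you might prefer to drop only rows where the text columns are missing.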

WordCloud

The next thing I do is create a WordCloud to see which words people use the most.

Before I do that, I need to:

convert my text to all lower case

remove some of the frequent but less useful words that appear in the reviews, such as dress, dresses, etc.

ReviewTextLower)

The code basically says: vectorize the text into 1-grams and 2-grams (I also tried 3-grams), use the preset ‘english’ stop words from the package, convert everything to lower case, ignore words that appear in more than 0.6 (60%) of the documents, and keep a maximum of 4000 features/dimensions.

Then I use the following code to create a WordCloud:

for_wordcloud = count_vectorizer.

show()

Figure: most frequent words that customers are talking about

Topic Modeling

There is one more step before I can do topic modeling, which is to use LSA and NMF to reduce the dimensionality of my input text data.

fit_transform(cv_data)

Then we can do topic modeling; below is an example of the output:

Figure: example of a few topics

You can generate different numbers of topics: try several values to find the best number, and check whether the resulting topics make sense to you.

Clustering

It is better to standardize your input data to a mean of 0 and a standard deviation of 1 before you run clustering algorithms, because your features might not all be on the same scale: in other words, an increase of 1 unit in feature a might not mean the same thing as an increase of 1 unit in feature b.

fit_transform(nmf_cv_data)

Then you can use an unsupervised machine learning algorithm to form clusters for different topics or different types of reviews.

In this project, I used KMeans, with inertia and silhouette scores as proxies to help me identify the best number of clusters to use.
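One way to run that search is sketched below: standardize the features, fit KMeans for several values of k, and record inertia and silhouette score for each. The synthetic `features` array stands in for the reduced document-topic matrix, and the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the reduced text features: three loose groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 0.0], [0.0, 6.0, 6.0]])
features = np.vstack([c + rng.normal(size=(40, 3)) for c in centers])

# Standardize each feature to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(features)

scores = {}
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    # Inertia: within-cluster sum of squared distances (lower is tighter).
    # Silhouette: between -1 and 1 (higher means better separated).
    scores[k] = (kmeans.inertia_, silhouette_score(scaled, kmeans.labels_))
    print(f"k={k}  inertia={scores[k][0]:.1f}  silhouette={scores[k][1]:.3f}")

# Pick the k with the highest silhouette score as a starting point.
best_k = max(scores, key=lambda k: scores[k][1])
```

Inertia tends to keep falling as k grows, so it is usually read by looking for an elbow in the curve, while the silhouette score can be compared directly across values of k.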

I then used TSNE to help me visualize the generated clusters.
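A minimal sketch of the visualization step, assuming synthetic features in place of the real review vectors: cluster with KMeans, then project the same points to two dimensions with TSNE so the cluster labels can be used as colors. The perplexity value and the data are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Synthetic stand-in for the standardized document features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 0.0, 0.0], [0.0, 5.0, 5.0, 5.0]])
features = np.vstack([c + rng.normal(size=(40, 4)) for c in centers])

# Cluster in the original feature space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Project to 2-D for plotting only; perplexity must be below the sample count.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(features)

print(embedding.shape)
```

With matplotlib, `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)` would then draw one point per document, colored by cluster. TSNE distorts distances, so the plot is a visual aid rather than a measure of cluster quality.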

For example:

Figure: TSNE plots for different numbers of clusters

After you have identified the best number of clusters, you can print out the documents that are closest to the centroid of each cluster for examination.
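One possible way to find those documents, sketched on made-up reviews: `KMeans.transform` returns each document's distance to every centroid, so sorting one column gives the documents nearest to that cluster's center. The TF-IDF features, the two clusters, and the `closest_docs` name are assumptions, and this is not necessarily how the original code did it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews standing in for the ReviewText column.
reviews = [
    "runs small order a size up",
    "size runs small fit is tight",
    "too small had to return for a bigger size",
    "fabric is soft and comfortable",
    "soft fabric very comfortable to wear",
    "comfortable soft material love the fabric",
]

X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every document to every centroid: (n_docs, n_clusters).
distances = kmeans.transform(X)

n_closest = 2
closest_docs = {}
for cluster_id in range(kmeans.n_clusters):
    # Only consider documents assigned to this cluster, nearest first.
    members = np.where(kmeans.labels_ == cluster_id)[0]
    order = np.argsort(distances[members, cluster_id])
    closest_docs[cluster_id] = members[order][:n_closest]
    print(f"Cluster {cluster_id}:")
    for rev_index in closest_docs[cluster_id]:
        print("  ", reviews[rev_index])
```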

For example:

indices_max = [index for index, value in enumerate(kmeans.

ReviewText[rev_index])) print(".")

Figure: example of a few documents

Classification

Another way to separate good and bad reviews by analyzing the text data is to treat it as a classification problem.

Figure: snapshot of the data

In our data, we have a feature named Rating, which is the score that a customer gives to the product, where 1 is the least satisfied and 5 is the most satisfied.

We can set the Rating column as our target variable and our engineered CombinedText column as the independent variable, to see whether we can build a classifier that automatically classifies a comment.

The first thing I did was to group ratings 1 to 4 together as bad reviews (labelled as 1), while rating 5 is our good review (labelled as 5).
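The relabelling can be sketched in a single pandas expression. The sample ratings and the `Label` column name are assumptions; the mapping (1 to 4 become 1, 5 stays 5) is the one described above.

```python
import pandas as pd

# Made-up ratings standing in for the Rating column.
ratings = pd.DataFrame({"Rating": [1, 2, 3, 4, 5, 5, 4, 5]})

# Keep the value where the rating is 5; replace everything else with 1.
ratings["Label"] = ratings["Rating"].where(ratings["Rating"] == 5, 1)

# Share of each class, to check how balanced the two labels are.
class_share = ratings["Label"].value_counts(normalize=True)
print(class_share)
```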

The two classes are not perfectly balanced, but they are within an acceptable range.

I built classification models with Naive Bayes and logistic regression classifiers.

Figure: Before and After modification of the Rating column

I used recall as my evaluation metric, because I care about the cases where I predict that a review is good when it actually is not.

The best recall score that I got is 0.74 without a lot of engineering.

The score could probably be improved with more time and further exploration of the model.
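A minimal sketch of this kind of experiment, assuming TF-IDF features and default model settings: both classifier types named above are trained on toy text and scored with per-class recall. The generated reviews, the train/test split, and the pipeline layout are illustrative assumptions, and the numbers it prints say nothing about the 0.74 reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews: label 5 = good review, label 1 = bad review.
items = ["dress", "top", "sweater", "skirt", "jacket", "blouse", "jeans", "coat"]
good = [f"love the {item} it fits great and feels soft" for item in items]
bad = [f"the {item} runs small and the fabric feels cheap" for item in items]
texts = good + bad
labels = [5] * len(good) + [1] * len(bad)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # average=None returns one recall value per class, in the order of `labels`.
    recalls[name] = recall_score(y_test, predictions, labels=[1, 5], average=None)
    print(name, dict(zip(["bad (1)", "good (5)"], recalls[name])))
```

Recall for the bad class is the share of truly bad reviews that the model flags as bad, so it drops whenever a bad review is predicted as good.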

Figure: recall score for both bad (rank 1) and good (rank 5) reviews

Lessons Learned

Unsupervised learning is very different from supervised learning because of its nature!

Expect to spend a lot of time trying to understand how to cluster your data; there are many clustering methods out there besides KMeans.

When doing text analytics or NLP, expect to spend a lot of your time cleaning your text data for best results.

For example: how to decide which stop words to use based on the context of your data and the problem you want to solve, how to lemmatize, how to vectorize, how to reduce your dimensionality and avoid the curse of dimensionality, etc.

In The Future

If I have a chance to extend the project, I would like to follow up with the following:

Explore more types of clustering algorithms and NLP techniques.

Add new stop words.

Build a Flask prototype application that automatically recommends (separates) different topics from the user comments.

Thank you so much for reading! If you are interested in exploring my code and the resources I used, this project is on my GitHub repo.