Natural Language Processing using Python

Introduction

Let’s walk through a concise demo of Natural Language Processing on newsgroup data for machine learning.

What we will do:

1. Read the newsgroup data.
2. Use TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features.
3. Fit a random forest model and a multinomial model (no cross-validation is used here).
4. Check the accuracy of both models on the test data set.
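Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the exact code behind this demo: the helper names `load_newsgroups` and `build_tfidf`, the four-category subset, and the toy corpus are all assumptions made for the example. The loader relies on scikit-learn's `fetch_20newsgroups`, which downloads the data on first use, so the sketch runs the vectorizing step on a tiny stand-in corpus instead.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed category subset; the post does not say which newsgroups were used.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_newsgroups(subset):
    """Return (texts, labels) for the 'train' or 'test' split.

    Hypothetical helper: downloads the data on first use (needs network access).
    """
    bunch = fetch_20newsgroups(subset=subset, categories=CATEGORIES)
    return bunch.data, bunch.target


def build_tfidf(train_texts, test_texts):
    """Fit TF-IDF on the training texts only, then transform both splits."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
    X_test = vectorizer.transform(test_texts)        # reuses the training vocabulary
    return vectorizer, X_train, X_test


# Tiny stand-in corpus so the sketch runs without a download.
toy_train = [
    "the rocket reached orbit around the moon",
    "nasa launched a new space probe",
    "the gpu renders 3d graphics quickly",
    "image rendering uses polygons and shaders",
]
toy_test = ["a probe in orbit", "shaders for graphics"]

vectorizer, X_train, X_test = build_tfidf(toy_train, toy_test)
print(X_train.shape)  # (number of documents, number of distinct terms)
print(X_test.shape)   # same number of columns as X_train
```

To run it on the real data, replace the toy lists with `load_newsgroups("train")` and `load_newsgroups("test")`. Fitting the vectorizer on the training split only keeps test-set vocabulary from leaking into the features.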

We can see from the above that:

1. Only 2034 observations are used.
2. The number of features (variables) created by TfidfVectorizer is 34118.
3. The number of features is quite large.
4. High dimensionality of the feature set is generally undesirable.
5. There are many ways to handle dimensionality issues (such as dimensionality reduction techniques, adding a regularization term to the models, etc.).
6. We are not focusing on any of these here. This is just a starter.
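Although this demo does not pursue them, one of the options named above, dimensionality reduction, might look like the sketch below. The choice of `TruncatedSVD`, the two components, and the toy documents are assumptions for illustration only; nothing here is part of the demo's actual pipeline.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the newsgroup posts (illustrative only).
docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new space probe",
    "astronauts repaired the space telescope",
    "the gpu renders 3d graphics quickly",
    "image rendering uses polygons and shaders",
    "the graphics card draws textured polygons",
]

# Sparse TF-IDF matrix: one row per document, one column per distinct term.
X = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works directly on sparse matrices and projects each
# document onto a small number of dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

On the real feature matrix, the number of components would be a tuning choice, typically far larger than two.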

Model Fitting & F1-score Metric

Let’s fit a multinomial model and a random forest model on the training data set and check the F1-score of each model on the test data set.
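A minimal sketch of this step is shown below. Several details are assumptions rather than facts from this post: the "multinomial model" is taken to be scikit-learn's `MultinomialNB`, the F1-score is macro-averaged, the random forest uses 100 trees with a fixed seed, and the toy texts and labels stand in for the newsgroup train and test splits.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data: label 0 = space, label 1 = graphics (illustrative only).
train_texts = [
    "the rocket reached orbit around the moon",
    "nasa launched a new space probe",
    "astronauts repaired the space telescope",
    "the gpu renders 3d graphics quickly",
    "image rendering uses polygons and shaders",
    "the graphics card draws textured polygons",
]
y_train = [0, 0, 0, 1, 1, 1]
test_texts = ["a new probe reached orbit", "shaders and polygons for graphics"]
y_test = [0, 1]

# Fit TF-IDF on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Both models use fixed settings; no cross-validation or tuning is done.
models = {
    "multinomial naive Bayes": MultinomialNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Macro averaging weights every class equally, whatever its size.
    scores[name] = f1_score(y_test, predictions, average="macro")
    print(f"{name}: macro F1 = {scores[name]:.3f}")
```

Scores on a toy corpus this small say nothing about the real data; the sketch only shows the fit, predict, and score flow.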
