How To Solve Your First Ever NLP Classification Challenge

A huge portion of the data that exists today is textual, and as a Data Scientist, it is very important to have the skills to process it. Natural Language Processing has been around for a long time and has been growing in popularity. Today, almost all tech devices have some sort of NLP technology that lets them communicate with us.

NLP should be one of the most up-to-date skill sets in a Data Scientist’s toolkit. In this article, we will implement Natural Language Processing for Machine Learning in the simplest way possible to solve MachineHack’s hackathon – Whose Line Is It Anyway: Identify The Author.

About The Data Set

The dataset we are going to use consists of sentences from thousands of books of 10 authors. The idea is to train our machine to predict which author has written a specific sentence. This is an NLP classification problem where the objective is to classify each sentence based on who wrote it.
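The imports block itself is not reproduced in this version of the article; a plausible set, inferred from the libraries used in the later steps, would be:

```python
import re                                                     # regular expressions for text cleaning
import pandas as pd                                           # data loading and manipulation
import nltk                                                   # Natural Language Toolkit
from nltk.corpus import stopwords                             # common words to filter out
from nltk.stem.porter import PorterStemmer                    # Porter stemming algorithm
from sklearn.feature_extraction.text import CountVectorizer   # Bag-of-Words vectors
from sklearn.model_selection import train_test_split          # train/validation split
from sklearn.svm import SVC                                   # Support Vector Classifier
from sklearn.metrics import confusion_matrix                  # model evaluation
```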

The above code block consists of the necessary libraries that we need to implement our NLP classifier. We will look into each of them as we come across various methods.

Importing the dataset

dataset = pd.read_csv('TRAIN.csv')

The above code block reads the data from the csv file and loads it into a pandas data-frame using the read_csv method of the pandas library that we imported earlier.

Let’s have a peek at the dataset:
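The quickest way to peek is pandas’ head method. Since TRAIN.csv is not bundled with this article, the snippet below uses a tiny stand-in frame (the column names text and author are assumptions) purely so it runs on its own:

```python
import pandas as pd

# Tiny stand-in for the real TRAIN.csv (column names are assumptions)
dataset = pd.DataFrame({'text': ['a sample sentence', 'another sentence'],
                        'author': [0, 1]})

print(dataset.head())  # shows the first five rows of the data-frame
```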

Cleaning and preprocessing the data

Cleaning the data is one of the most essential tasks, not just in Natural Language Processing but across the entire Data Science spectrum. In Natural Language Processing, there are various stages of cleaning. Some of the basic stages are listed below:

Cleaning the text of noise (symbols, emojis, special characters, etc.)

Stemming or lemmatization to reduce words to their root forms.

Removing stopwords.

Note:

Stemming is the process of reducing a word to its root form. This helps remove redundancy in words. For example, if the words ‘run’, ‘runs’ and ‘running’ are present in a sentence, each word is reduced to its base or root form ‘run’ and counted as 3 occurrences of the same word instead of counting each word as unique. (Irregular forms such as ‘ran’ are not handled by a stemmer; that requires lemmatization.)

Stopwords are words that occur so frequently in a natural language that they carry little value when comparing documents or sentences. For example, ‘the’, ‘a’, ‘an’, ‘has’, ‘do’ and ‘what’ are some of the stopwords. Such words are removed during preprocessing.
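The stemming behaviour described above can be verified in a couple of lines with NLTK’s PorterStemmer (a minimal sketch, independent of the main pipeline):

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
print(ps.stem('running'))  # -> 'run'
print(ps.stem('runs'))     # -> 'run'
# Irregular inflections are left untouched by a stemmer:
print(ps.stem('ran'))      # -> 'ran'
```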

nltk.download('stopwords')  # downloading the stopwords from nltk

corpus = []  # list for storing the cleaned data
ps = PorterStemmer()  # initializing object for stemming
stop_words = set(stopwords.words('english'))  # build the set once, outside the loop

for i in range(len(dataset)):  # for each observation in the dataset
    # Removing special characters and lower-casing
    text = re.sub('[^a-zA-Z]', ' ', dataset['text'][i]).lower().split()
    # Stemming and removing stop words
    text = [ps.stem(word) for word in text if word not in stop_words]
    # Joining all the cleaned words to form a sentence
    text = ' '.join(text)
    # Adding the cleaned sentence to the list
    corpus.append(text)

The NLTK library comes with a collection of stopwords which we can use to clean the dataset. The PorterStemmer class from nltk.stem.porter is used to perform stemming. In the above code block, we traverse through each observation in the dataset, removing special characters, performing stemming and removing stop words.

Let’s see the cleaned data:

Generating Count Vectors
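The vectorization code block is not shown in this version of the article; a sketch consistent with the description below (a toy corpus stands in for the cleaned sentences built in the previous step, so the snippet runs on its own):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the list of cleaned sentences from the previous step
corpus = ['run fast race', 'run slow race', 'fast race win']

cv = CountVectorizer(max_features=120)   # keep at most the 120 most frequent words
X = cv.fit_transform(corpus).toarray()   # one row per sentence, one column per word
print(X.shape)                           # (sentences, vocabulary size)
```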

With the above code block, we create a Bag-of-Words model. The CountVectorizer class, imported from sklearn.feature_extraction.text, creates a matrix of vectors consisting of the counts of each word in a sentence. The parameter max_features = 120 keeps only the 120 most frequent words. We transform the cleaned data in corpus into the count-vector matrix X, which is the independent variable set for the text classifier that we will build in the coming steps.

Here is what X looks like:

Each row corresponds to an observation in the dataset and each column to one of the 120 selected words.

Splitting the dataset into the Training set and Validation set
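The splitting code block is not shown in this version of the article; a sketch matching the parameters described below (toy arrays stand in for X and y, and the random_state value of 0 is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the count-vector matrix X and the label vector y
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 20% of the data goes to the validation set; random_state fixes the shuffle
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_val.shape)  # (8, 2) (2, 2)
```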

In the above code block, we split the dataset into training and validation sets. The parameter test_size = 0.2 specifies that the validation set (X_val and y_val) should consist of 20% of the overall data in X and y. The random_state parameter sets a seed value so that the exact same split can be reproduced.

Building a classifier

classifier = SVC()
classifier.fit(X_train, y_train)

With the training data ready, we can now train a classifier. The above code block initializes a Support Vector Classifier and fits it to the training data.

Predicting the author

y_pred = classifier.predict(X_val)

After training the classifier with X_train and y_train, we can use it to predict the authors of the texts in the validation set X_val.

Evaluating the model

After predicting for the validation set, we need to check how many of the predictions are actually right. To do this, we will make use of the confusion matrix. Using the confusion matrix, we will compare the predicted values in y_pred with the actual values in y_val. The accuracy from a confusion matrix can be calculated by summing up the diagonal elements and dividing by the total sum of elements in the matrix. We define a method as shown below:
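The method itself is not shown in this version of the article; a minimal sketch of the calculation just described (toy label vectors stand in for y_val and y_pred so the snippet runs on its own):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def accuracy(cm):
    # correct predictions sit on the diagonal of the confusion matrix
    return np.trace(cm) / np.sum(cm)

# Toy stand-ins for the validation labels and the classifier's predictions
y_val_demo = [0, 0, 1, 1, 2]
y_pred_demo = [0, 1, 1, 1, 2]

cm = confusion_matrix(y_val_demo, y_pred_demo)
print(accuracy(cm))  # 0.8 -- 4 of the 5 predictions are correct
```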


A Computer Science Engineer who is passionate about AI and all related technologies. He is someone who loves to stay updated with the Tech-revolutions that AI brings in.
Contact: amal.nair@analyticsindimag.com