For the past couple of months I have been working on a Question Answering System, and in my upcoming blog posts I would like to share some of the things I learnt in the process. I haven’t yet reached a satisfactory accuracy with the answers fetched by the system, but it is a work in progress. The project is Adam QAS on GitHub.

In this post, we are going to focus specifically on the Question Classification part. The goal is to classify a given input question into one of a set of predefined categories. This classification will help us in the Query Construction / Modelling phases. The raw labelled data has one question per line, prefixed with its label:

DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
...
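Each line of the labelled data above starts with a label of the form COARSE:fine, followed by a space and the question. A minimal sketch of splitting such a line could look like this; the function name split_label_and_question is made up for illustration and is not taken from the project code.

```python
def split_label_and_question(line):
    """Split a raw line like 'DESC:manner How did ... ?' into its parts.

    Hypothetical helper: returns (coarse_class, fine_class, question).
    """
    # The label is everything before the first space.
    label, question = line.strip().split(" ", 1)
    # 'DESC:manner' -> coarse class 'DESC', fine class 'manner'.
    coarse, _, fine = label.partition(":")
    return coarse, fine, question


raw = "DESC:manner How did serfdom develop in and then leave Russia ?"
coarse, fine, question = split_label_and_question(raw)
print(coarse, "|", fine, "|", question)
```

Only the coarse class (DESC, ENTY and so on) appears in the Class column of the training CSV shown later in this post.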

Preparing Training Data for the SVM

For this classifier, we will be using a Linear Support Vector Machine. Now let us identify the features in the question which will affect its classification and train our classifier based on these features.

WH-word: The WH-word in a question (What, When, How, Where and so on) holds a lot of information about the intent of the question and what it is trying to find out.

WH-word POS: The part of speech of the WH-word (wh-determiner, wh-pronoun, wh-adverb)

POS of the word next to WH-word: The part of speech of the word that immediately follows the WH-word, i.e. the word at position 1 in the bigram (position 0 being the WH-word).

Root POS: The part of speech of the word at the root of the dependency parse tree.

Note: We will also extract the WH-Bigram (just for reference). A bigram is simply two consecutive words; in this case, the WH-word and the word that follows it (What is, How many, Where do…).

We have to extract these features from our labelled dataset and store them in a CSV file along with the respective label. This is where spaCy comes into action. It enables us to get the part of speech and the dependency relation of each token in the question.

import spacy
import csv

clean_old_data()
en_nlp = spacy.load("en_core_web_md")

First, we clear the old training data from our CSV file and load the English language model. Then we read our raw labelled data, extract the features for each question, and store these features and labels in a CSV file.

read_input_file(fp, en_nlp)

This function splits the raw data into the question and its respective label and passes it on for further NLP processing.

The processing step feeds the question into the NLP pipeline, en_doc = en_nlp(u'' + question), and obtains a Doc object containing the linguistic annotations of the question. The pipeline also performs sentence boundary detection/segmentation, so we obtain the list of sentences from the Doc; these act as the decomposed questions or sub-questions. (Here I am only operating on the first sub-question.)

We then iterate over each token in the sentence to get its part-of-speech tag and dependency label. To extract only the WH-word we look for the WDT, WP, WP$ and WRB tags, and to extract the root token of the sentence we look for the dependency label ROOT. After writing all the records to the training data CSV file, it looks something like this:

#Question|WH|WH-Bigram|WH-POS|WH-NBOR-POS|Root-POS|Class
How did serfdom develop in and then leave Russia ?|How|How did|WRB|VBD|VB|DESC
What films featured the character Popeye Doyle ?|What|What films|WP|NNS|VBD|ENTY
...
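The extraction step described above can be sketched roughly as follows. This is not the project's actual code: extract_features and the Tok stand-in are hypothetical names, and the sample tokens are tagged by hand so that the sketch runs without downloading a spaCy model. Real spaCy tokens expose the same text, tag_ and dep_ attributes.

```python
from collections import namedtuple

# Fine-grained POS tags that mark a WH-word.
WH_TAGS = {"WDT", "WP", "WP$", "WRB"}

# Hand-made stand-in for a spaCy token (same attribute names).
Tok = namedtuple("Tok", ["text", "tag_", "dep_"])


def extract_features(tokens):
    """Collect the WH and root features for one sentence."""
    tokens = list(tokens)
    feats = {"WH": "", "WH-Bigram": "", "WH-POS": "",
             "WH-NBOR-POS": "", "Root-POS": ""}
    for i, tok in enumerate(tokens):
        # Keep only the first WH-word we encounter.
        if not feats["WH"] and tok.tag_ in WH_TAGS:
            feats["WH"] = tok.text
            feats["WH-POS"] = tok.tag_
            if i + 1 < len(tokens):
                nbor = tokens[i + 1]
                feats["WH-Bigram"] = tok.text + " " + nbor.text
                feats["WH-NBOR-POS"] = nbor.tag_
        # The root of the dependency parse carries the label ROOT.
        if tok.dep_ == "ROOT":
            feats["Root-POS"] = tok.tag_
    return feats


# With spaCy, the call would look something like:
#   sentence = list(en_nlp(question).sents)[0]
#   extract_features(sentence)
sample = [Tok("How", "WRB", "advmod"), Tok("did", "VBD", "aux"),
          Tok("serfdom", "NN", "nsubj"), Tok("develop", "VB", "ROOT"),
          Tok("?", ".", "punct")]
features = extract_features(sample)
print(features)
```

The returned dictionary has one key per feature column of the CSV, so it can be written out with csv.writer or csv.DictWriter using "|" as the delimiter.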

Training the SVM and Prediction

from sklearn.svm import LinearSVC
import pandas

I prefer pandas over sklearn.datasets. First, we load our training dataset CSV file into a pandas DataFrame. This data frame holds all the extracted features, with one column per feature and one row per question. To train our classifier we need to separate the feature columns from the class/label column, so we pop the label column from the data frame and store it separately. Along with that, we also pop some unnecessary columns.
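As a rough sketch of this loading step, assuming the pipe-delimited layout shown earlier: here the two sample rows are read from an in-memory string instead of a file, and I am assuming that the "unnecessary" columns are the raw question and the WH-Bigram (which is kept only for reference).

```python
from io import StringIO

import pandas as pd

# Two sample rows in the same layout as the training CSV.
csv_text = """#Question|WH|WH-Bigram|WH-POS|WH-NBOR-POS|Root-POS|Class
How did serfdom develop in and then leave Russia ?|How|How did|WRB|VBD|VB|DESC
What films featured the character Popeye Doyle ?|What|What films|WP|NNS|VBD|ENTY
"""

df = pd.read_csv(StringIO(csv_text), sep="|")

# Separate the label column from the features.
y_train = df.pop("Class")

# Assumption: these columns are reference-only and not used as features.
df.pop("#Question")
df.pop("WH-Bigram")

X_train = df
print(list(X_train.columns))
print(list(y_train))
```

With a real file, StringIO(csv_text) would simply be replaced by the path to the training CSV.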

Here, the get_dummies() function converts the actual categorical values into dummy (binary) values. This means that every distinct value of a feature becomes its own column, with 1 meaning the feature value is present in the record and 0 meaning it is absent.
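To make the conversion concrete, here is a small sketch on a made-up two-row frame (toy data, not the real training set). Passing dtype=int asks pandas for 0/1 values rather than booleans.

```python
import pandas as pd

# Toy feature frame with two categorical columns.
features = pd.DataFrame({"WH": ["How", "What"],
                         "WH-POS": ["WRB", "WP"]})

# Each distinct value becomes its own column named <column>_<value>.
dummies = pd.get_dummies(features, dtype=int)
print(dummies)
```

The first row ends up with a 1 in the WH_How and WH-POS_WRB columns and 0 in the others.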

The problem here is that the number of features (columns) in the prediction data frame differs from that of the training data frame, because some features are absent from the prediction data frame. A single question to be classified will naturally be missing the majority of the features that are present in a training dataset of 5000 questions. So, to make both data frames the same width, we append the feature columns that are present in the training data frame but missing from the prediction data frame, with the value 0 (because these features are not present in the question to classify).
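One possible way to do this alignment, shown here on toy data, is DataFrame.reindex. Note that this is a swapped-in shortcut rather than the column-by-column append described above: besides adding the missing columns with 0, it also drops any column in the prediction frame that the training frame never saw.

```python
import pandas as pd

# Toy training frame: three different WH-words were seen.
train = pd.get_dummies(pd.DataFrame({"WH": ["How", "What", "Where"]}),
                       dtype=int)

# The single question to classify only produces one dummy column.
query = pd.get_dummies(pd.DataFrame({"WH": ["What"]}), dtype=int)

# Give the query frame exactly the training columns, filling gaps with 0.
aligned = query.reindex(columns=train.columns, fill_value=0)
print(aligned)
```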

After we have both the data frames with the same size, we classify the question based on the training dataset using Linear Support Vector Machine. The LinearSVC model is fitted with the training features and respective labels. This fitted object is later used to predict the class with respect to the prediction data. It returns the question class/category.

Note: The DataFrame here contains many zero entries, so it is worth converting it into a sparse matrix representation; csr_matrix() takes care of that.

from scipy.sparse import csr_matrix
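Putting the training and prediction steps together, a minimal sketch might look like this. The four training rows are invented toy data and the variable names are my own; the real classifier is of course trained on the full feature CSV.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# Toy training features and labels (made up for illustration).
X_train = pd.DataFrame({"WH": ["How", "What", "How", "What"],
                        "Root-POS": ["VB", "VBD", "VB", "VBD"]})
y_train = ["DESC", "ENTY", "DESC", "ENTY"]

# Binary-encode the training features.
X_train_bin = pd.get_dummies(X_train, dtype=int)

# Encode the question to classify and align it to the training columns.
X_query = pd.DataFrame({"WH": ["How"], "Root-POS": ["VB"]})
X_query_bin = pd.get_dummies(X_query, dtype=int).reindex(
    columns=X_train_bin.columns, fill_value=0)

# Fit on the sparse representation, then predict the question class.
clf = LinearSVC()
clf.fit(csr_matrix(X_train_bin.values), y_train)
predicted = clf.predict(csr_matrix(X_query_bin.values))
print(predicted[0])
```

predict() returns an array with one label per input row, so predicted[0] is the class of the single question.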
