Text Classification, Part I - Convolutional Networks

Nov 26, 2016
6 minute read

Text classification is a classic problem. The goal is to assign a variable-length body of text to one of a fixed number of predefined categories. It is used widely, from sentiment analysis (classifying IMDB or Yelp reviews) and stock market sentiment analysis to Google's smart email reply. This is a very active research area in both academia and industry. In the following series of posts, I will present a few different approaches and compare their performance. Ultimately, my goal is to implement the paper Hierarchical Attention Networks for Document Classification.

Given the limited data available to me, all exercises are based on Kaggle's IMDB dataset, and all implementations use Keras. The first step is to load the pretrained GloVe word vectors and build an embedding matrix covering the words in our vocabulary:

```python
import os
import numpy as np

GLOVE_DIR = "~/data/glove"

# Map each word to its pretrained GloVe vector.
embeddings_index = {}
f = open(os.path.join(os.path.expanduser(GLOVE_DIR), 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

# Build the embedding matrix for our vocabulary. Words without a
# pretrained vector keep their random initialization.
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
```
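For reference, here is a minimal sketch of how this matrix can be wired into a Keras Embedding layer. The name MAX_SEQUENCE_LENGTH is an assumption, standing in for whatever padded length comes out of the usual Tokenizer/pad_sequences preprocessing; the trainable flag is the switch between frozen and fine-tuned embeddings discussed in the conclusion below.

```python
from keras.layers import Embedding

# A minimal sketch: wrap the pretrained matrix in an Embedding layer.
# MAX_SEQUENCE_LENGTH is assumed to come from the padding step of the
# standard Tokenizer/pad_sequences preprocessing.
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)  # set True to fine-tune the vectors
```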

A Simplified Convolutional Network

First, I will use a very simple convolutional architecture: 128 filters of size 5, with max pooling sizes of 5 and 35, following the sample from this blog. A sketch of the resulting model is shown below.
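The following is a minimal sketch of such a model, not the exact code I ran. The three-block stacking mirrors the well-known Keras pretrained-embeddings example; MAX_SEQUENCE_LENGTH = 1000 (so the final pool of 35 acts as global max pooling) and a single sigmoid output for the binary IMDB labels are assumptions on my part.

```python
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Model

# Input: word indices padded/truncated to a fixed length (assumed 1000).
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# Convolutional blocks, each with 128 filters of size 5.
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # with length-1000 inputs, this pools over the whole sequence

x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)  # binary sentiment label

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])
```

With these shapes, the sequence length shrinks from 1000 to 996, 199, 195, 39, 35, and finally 1 after the last pool, so the model ends with a single 128-dimensional feature vector per document before the dense layers.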

Conclusion

Based on these observations, the complexity of the convolutional network doesn't seem to improve performance, at least on this small dataset. We might see gains on a larger dataset, which I am not able to verify here. One observation I do have is that whether or not the embedding layer is trainable significantly impacts performance, as does using the pretrained GloVe word vectors. In both cases, I saw performance improve from 82% to 90%.