Then you can use the process_wiki.py script to extract the text from the bzipped XML Wikipedia dump. The script is based on Gensim's Wikipedia processing script and uses Python's multiprocessing module to speed up processing:
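(A minimal sketch of what process_wiki.py does, based on gensim's WikiCorpus class; the file names are examples.)

import logging
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)
    # e.g. python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
    inp, outp = sys.argv[1], sys.argv[2]
    # WikiCorpus streams plain-text articles out of the compressed dump,
    # parallelizing the parsing with the multiprocessing module.
    wiki = WikiCorpus(inp, dictionary={})
    with open(outp, 'w') as output:
        for i, text in enumerate(wiki.get_texts()):
            output.write(' '.join(text) + '\n')
            if (i + 1) % 10000 == 0:
                logging.info('Saved %d articles', i + 1)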

The extracted data is a mix of Simplified Chinese, Traditional Chinese, and English text. We can convert the Traditional Chinese to Simplified Chinese with OpenCC; you should install OpenCC first, which depends on the system you use. After installing OpenCC, use the following command to convert Traditional Chinese text to Simplified Chinese:
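(File names here are placeholders; the configuration name depends on your OpenCC version, e.g. OpenCC 1.x ships a t2s.json configuration for Traditional-to-Simplified conversion.)

opencc -i wiki.zh.text -o wiki.zh.simplified.text -c t2s.json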

Dive Into NLTK, Part XI: From Word2Vec to WordNet

This is the eleventh article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.
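A quick taste of NLTK's WordNet interface (assuming the WordNet corpus has been downloaded via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# All synsets (synonym sets) that contain the word "dog".
print(wn.synsets('dog'))
# Inspect one synset: its definition, usage examples, and hypernyms.
dog = wn.synset('dog.n.01')
print(dog.definition())
print(dog.examples())
print(dog.hypernyms())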

Exploiting Wikipedia Word Similarity by Word2Vec

We have written “Training Word2Vec Model on English Wikipedia by Gensim” before, and it got a lot of attention. Recently, I reviewed Word2Vec-related materials again and tested a new method to process the English Wikipedia data and train a Word2Vec model on it with gensim; the model is used to compute word similarity. For word2vec basics, I recommend reading “Getting started with Word2Vec” first.

It took about 2.5 hours to process the Wikipedia dump, and the processed wiki data is divided into many parts under subdirectories:

The content in wiki_00 looks like this:

Anarchism
Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.
Autism
Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
...

Note that word tokenization here uses the Pattern library's English tokenize module; you can also use NLTK's word tokenizer or another English word tokenization module. Now we can use this script to train a word2vec model on the full English Wikipedia data:
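(A minimal training sketch with gensim; the file names are examples, and the parameter names follow the gensim API of that era, i.e. size rather than vector_size as in gensim 4+.)

import logging
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                    level=logging.INFO)

# wiki.en.text: one preprocessed article per line, tokens separated by spaces.
sentences = LineSentence('wiki.en.text')
model = Word2Vec(sentences, size=200, window=5, min_count=5,
                 workers=multiprocessing.cpu_count())
model.save('wiki.en.word2vec.model')

# The trained model can then be queried for word similarity.
print(model.most_similar('king', topn=5))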

Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus

This is the tenth article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:

Accessing text corpora in NLTK is very easy. NLTK provides a corpus package to read and manage corpus data. For example, NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, named the Gutenberg Corpus. About Project Gutenberg:

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to “encourage the creation and distribution of eBooks”. It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.

We can list the ebook file names of the Gutenberg Corpus in NLTK like this:
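(Assuming the Gutenberg corpus has been downloaded via nltk.download('gutenberg').)

from nltk.corpus import gutenberg

print(gutenberg.fileids())
# ['austen-emma.txt', 'austen-persuasion.txt', ..., 'bible-kjv.txt', ...]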

The King James Version (KJV), also known as Authorized [sic] Version (AV) or simply King James Bible (KJB), is an English translation of the Christian Bible for the Church of England begun in 1604 and completed in 1611. The books of the King James Version include the 39 books of the Old Testament, an intertestamental section containing 14 books of the Apocrypha, and the 27 books of the New Testament.

Dive Into TensorFlow, Part VI: Beyond Deep Learning

This is the sixth article in the series “Dive Into TensorFlow”; here is an index of all the articles in the series that have been published to date:

In addition to deep learning, TensorFlow provides a high-level machine learning API: TF Learn, which makes it easy to configure, train, and evaluate machine learning models. TF Learn is a simplified interface for TensorFlow, designed to get people started on predictive analytics and data mining. The API covers a variety of needs: from linear models to deep learning applications like text and image understanding.

Why TensorFlow Learn?

To smooth the transition from the scikit-learn world of one-liner machine learning into the more open world of building different shapes of ML models. You can start by using fit/predict and slide into TensorFlow APIs as you are getting comfortable.

To provide a set of reference models that will be easy to integrate with existing code.

Learning from Small Data

The Iris flower data set is small, but it is perhaps the best-known data set in machine learning history:

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
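A sketch of the classic TF Learn iris example, using the tf.contrib.learn API of the TensorFlow 1.x era (since removed in TensorFlow 2); the hyperparameters are illustrative:

import tensorflow as tf
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# One 4-dimensional real-valued feature column (sepal/petal measurements).
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, hidden_units=[10, 20, 10], n_classes=3)

# scikit-learn style fit/evaluate.
classifier.fit(x=x_train, y=y_train, steps=200)
accuracy = classifier.evaluate(x=x_test, y=y_test)["accuracy"]
print("Accuracy: %f" % accuracy)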

Dive Into TensorFlow, Part V: Deep MNIST

This is the fifth article in the series “Dive Into TensorFlow”; here is an index of all the articles in the series that have been published to date:

Convolutional Neural Networks (CNNs, also called ConvNets) are biologically inspired variants of MLPs, made up of neurons that have learnable weights and biases. According to Wikipedia, a convolutional neural network is described as follows:

In machine learning, a convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex. Individual neurons of the animal cortex are arranged in such a way that they respond to overlapping regions tiling the visual field, which can mathematically be described by a convolution operation. Convolutional networks were inspired by biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing. They have wide applications in image and video recognition, recommender systems and natural language processing.

This chapter will use TensorFlow to build a multilayer convolutional network for the MNIST task. First, load the MNIST data from an existing directory:
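(Using the tutorial helper module shipped with TensorFlow 1.x; the directory name is an example.)

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)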

TensorFlow provides the tf.nn.conv2d operation to compute a 2-D convolution given 4-D input and filter tensors. Internally, it:

1. Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
2. Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
3. For each patch, right-multiplies the filter matrix and the image patch vector.

In detail, with the default NHWC format:

output[b, i, j, k] = sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] * filter[di, dj, q, k]

Args:

input: A Tensor. Must be one of the following types: half, float32, float64.
filter: A Tensor. Must have the same type as input.
strides: A list of ints. 1-D of length 4. The stride of the sliding window for each dimension of input. Must be in the same order as the dimension specified with format.
padding: A string from: “SAME”, “VALID”. The type of padding algorithm to use.
use_cudnn_on_gpu: An optional bool. Defaults to True.
data_format: An optional string from: “NHWC”, “NCHW”. Defaults to “NHWC”. Specify the data format of the input and output data. With the default format “NHWC”, the data is stored in the order of: [batch, in_height, in_width, in_channels]. Alternatively, the format could be “NCHW”, the data storage order of: [batch, in_channels, in_height, in_width].
name: A name for the operation (optional).
Returns:

A Tensor. Has the same type as input.

For the example shown before, we can get the convolution result with TensorFlow's conv2d like this:
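(A toy illustration; the shapes here are invented rather than taken from the original example.)

import tensorflow as tf

# A 5x5 single-channel input and a 3x3 filter, shaped [batch, height, width,
# channels] and [filter_height, filter_width, in_channels, out_channels].
x_in = tf.Variable(tf.random_normal([1, 5, 5, 1]))
w_filter = tf.Variable(tf.random_normal([3, 3, 1, 1]))
conv_op = tf.nn.conv2d(x_in, w_filter, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(conv_op).shape)  # (1, 5, 5, 1) with SAME padding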

Pooling is also called subsampling or downsampling; it reduces the feature map dimension while keeping the important information. There are different types of pooling, such as Max, Average, and Sum. For example, max pooling takes the largest element from the feature map within the window; the following shows an example of Max Pooling with a 2×2 window:

The official TensorFlow Deep MNIST guide uses tf.nn.max_pool for the max pooling operation:

Args:

value: A 4-D Tensor with shape [batch, height, width, channels] and type tf.float32.
ksize: A list of ints that has length >= 4. The size of the window for each dimension of the input tensor.
strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor.
padding: A string, either ‘VALID’ or ‘SAME’. The padding algorithm.
data_format: A string. ‘NHWC’ and ‘NCHW’ are supported.
name: Optional name for the operation.
Returns:

A Tensor with type tf.float32. The max pooled output tensor.

For the example shown before, we can get the max pooling result with TensorFlow's max_pool like this:
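(Continuing the toy example above: pooling with a 2×2 window and stride 2.)

pool_op = tf.nn.max_pool(conv_op, ksize=[1, 2, 2, 1],
                         strides=[1, 2, 2, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(pool_op).shape)  # (1, 3, 3, 1): the 5x5 map halved, rounded up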

The weights are initialized with tf.truncated_normal, which outputs random values from a truncated normal distribution. The generated values follow a normal distribution with specified mean and standard deviation, except that values whose magnitude is more than 2 standard deviations from the mean are dropped and re-picked.

Args:

shape: A 1-D integer Tensor or Python array. The shape of the output tensor.
mean: A 0-D Tensor or Python value of type dtype. The mean of the truncated normal distribution.
stddev: A 0-D Tensor or Python value of type dtype. The standard deviation of the truncated normal distribution.
dtype: The type of the output.
seed: A Python integer. Used to create a random seed for the distribution. See set_random_seed for behavior.
name: A name for the operation (optional).
Returns:

A tensor of the specified shape filled with random truncated normal values.

Based on these TensorFlow operations, we can define and test the two initialization functions as below:
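(The two initialization helpers as defined in the official Deep MNIST tutorial; testing them amounts to evaluating the returned variables in a session.)

def weight_variable(shape):
    # Small noise breaks symmetry and avoids zero gradients.
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    # A slightly positive bias helps avoid "dead" ReLU neurons.
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)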

Now we can define the convolutional filter and the bias variable. The convolution will compute 32 features for each 5×5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.
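In code (from the same tutorial):

W_conv1 = weight_variable([5, 5, 1, 32])  # 5x5 patch, 1 input channel, 32 features
b_conv1 = bias_variable([32])             # one bias per output channel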

To apply it as the input of the first convolutional layer, we reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels:

In [54]: x_image = tf.reshape(x, [-1, 28, 28, 1])

Then convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool:
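(Using the tutorial's thin wrappers around the conv2d and max_pool ops introduced above.)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)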

Now that the image size has been reduced to 7×7, we will add a fully-connected layer with 1024 neurons to allow processing on the entire image. So reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU:
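(A sketch of that layer, assuming a second convolution/pooling stage producing h_pool2 with 64 feature maps was built the same way as the first, as in the official tutorial.)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

# Flatten the 7x7x64 feature maps into vectors for the dense layer.
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)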

FastText for Fast Sentiment Analysis

The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and that of the testing dataset 70,000.

The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 14), title and content. The title and content are escaped using double quotes (“), and any internal double quote is escaped by 2 double quotes (“”). There are no new lines in title or content.

First, we write a script to preprocess the positive and negative movie review data into output_train and output_test files:
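(A hypothetical sketch of such a script; the aclImdb directory layout and the "label,review" output format are assumptions inferred from the samples below.)

import os

def merge_reviews(src_dir, out_path):
    # Walk the pos/ and neg/ subdirectories and emit one "label,review"
    # line per file, with newlines and HTML line breaks stripped.
    with open(out_path, 'w') as out:
        for label in ('pos', 'neg'):
            label_dir = os.path.join(src_dir, label)
            for name in sorted(os.listdir(label_dir)):
                with open(os.path.join(label_dir, name)) as f:
                    review = f.read().replace('<br />', ' ').replace('\n', ' ')
                out.write('%s,%s\n' % (label, review))

merge_reviews('aclImdb/train', 'output_train')
merge_reviews('aclImdb/test', 'output_test')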

head -n 10 output_train

pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as “Teachers”. My 35 years in the teaching profession lead me to believe that Bromwell High’s satire is much closer to reality than is “Teachers”. The scramble to survive financially, the insightful students who can see right through their pathetic teachers’ pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ……… at ………. High. A classic line: INSPECTOR: I’m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn’t!
……

head -n 10 output_test

pos,I went and saw this movie last night after being coaxed to by a few friends of mine. I’ll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
……

Then we copy and modify classification-example.sh to classification-lmdb-example.sh:

Here we used the same normalize_text() for the lmdb train and test data, and added a __label__ prefix for the pos or neg tag:

textminer@textminer:~/text/fastText/lmdbdata$ wc -l lmdb.train
25000 lmdb.train
textminer@textminer:~/text/fastText/lmdbdata$ head -n 1 lmdb.train
__label__neg , a female vampire kills young women and paints with their blood . she has an assistant who doesn ‘ t want to be a vampire , so he has to do what she orders or be turned into a blood sucker . after a few kills , the assistant gets remorse and falls in love with a homeless girl . what can i say about this movie ? that its pacing is over-slow , that it has some strange sound effects ( never a bite sounded so strange ) and ambiance ( new jazz here i come ) and that lights don ‘ t seem to be included on the set . it looks like an auteur horror movie with all the self-sufficiency inside . the plot is completely stupid and as you can guess , it ‘ s the female vampire who explains how to kill her even if she doesn ‘ t have to do it of course , crosses , light , garlic and sticks don ‘ t work . it ‘ s not even a funny lousy movie . perhaps with some friends and a lot of beers , it can ‘ t have its funny sides ( to be honest , it ‘ s funny during 10 – 15 minutes near the end of the movie ) . don ‘ t be fooled by the troma sticker , it ‘ s one the bad movie they present .
textminer@textminer:~/text/fastText/lmdbdata$ wc -l lmdb.test
25000 lmdb.test
textminer@textminer:~/text/fastText/lmdbdata$ head -n 1 lmdb.test
__label__pos , this is the true story of the great pianist and jazz singer/legend ray charles ( oscar , bafta and golden globe winning jamie foxx ) . he was born in a poor african american-town , and he went blind at 7 years old , but with his skills of touch and hearing , this is what would later in life would lead him to stardom . by the 1960 ‘ s he had accomplished his dream , and selling records in millions , and leading the charts with songs and albums . but the story also showed his downfalls , including the separation from his wife and child , because of his affair with a band member , his drug and alcohol use , and going to prison because of this . also starring regina king as margie hendricks , kerry washington as della bea robinson , clifton powell as jeff brown , harry j . lennix as joe adams , bokeem woodbine as fathead newman , aunjanue ellis as mary ann fisher , sharon warren as aretha robinson , c . j . sanders as young ray robinson , curtis armstrong as ahmet ertegun and richard schiff as jerry wexler . it is a great story with a great singer impression , the songs , including hit the road jack , are the highlights . it won the oscar for best sound mixing , and it was nominated for best costume design , best director for taylor hackford , best editing and best motion picture of the year , it won the bafta for best sound , and it was nominated for the anthony asquith award for film music for craig armstrong and best original screenplay , and it was nominated the golden globe for best motion picture – musical or comedy . it was number 99 on 100 years , 100 cheers . very good !
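With the labeled data in place, training and evaluating a fastText classifier takes one command each (the hyperparameters below are illustrative, not the article's exact settings):

./fasttext supervised -input lmdbdata/lmdb.train -output lmdb_model -dim 10 -lr 0.1 -epoch 5
./fasttext test lmdb_model.bin lmdbdata/lmdb.test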

Dive Into TensorFlow, Part IV: Hello MNIST

This is the fourth article in the series “Dive Into TensorFlow”; here is an index of all the articles in the series that have been published to date:

Hello MNIST
Just as programming languages have “Hello World”, machine learning has “Hello MNIST”. The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for machine learning model training and testing.

Every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We call the images “xs” and the labels “ys”. Both the training set and the test set contain xs and ys; for example, the training images are mnist.train.images and the training labels are mnist.train.labels:
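(After loading the data with input_data.read_data_sets, as shown earlier, the shapes can be inspected directly.)

print(mnist.train.images.shape)  # (55000, 784)
print(mnist.train.labels.shape)  # (55000, 10)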

Note that the mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]. The first dimension indexes the images and the second dimension indexes the pixels in each image. Each entry in the tensor is the pixel intensity between 0 and 1, for a particular pixel in a particular image.

The mnist.train.labels is a [55000, 10] array of floats, and each label is represented as a “one-hot vector”: the nth digit is represented as a vector which is 1 in the nth dimension. For example, label 5 would be [0,0,0,0,0,1,0,0,0,0].

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).
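In TensorFlow, this multiclass model is the classic softmax regression from the official MNIST tutorial (TF 1.x API):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted class probabilities

y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot true labels
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)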

We get about 92% accuracy. Is this good? In fact, it's pretty bad. This is because we're using a very simple model, included here only as a TensorFlow case study. You can check MNIST classification results here: who is the best in MNIST?

We will get a better result in the next chapter, based on a convolutional neural network; just wait.

Dive Into NLTK, Part IX: From Text Classification to Sentiment Analysis

This is the ninth article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.

Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

Generally speaking, sentiment analysis can be seen as a text classification task. Based on the movie review data from NLTK, we can train a basic text classification model for sentiment analysis:
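(A minimal sketch in the style of the NLTK book: a bag-of-words Naive Bayes classifier on the movie_reviews corpus, downloadable via nltk.download('movie_reviews'); the 2,000-word feature cutoff is an arbitrary choice.)

import random

import nltk
from nltk.corpus import movie_reviews

# (word list, label) pairs for every review in the corpus.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Use the 2,000 most frequent words as binary presence features.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    words = set(document)
    return {'contains(%s)' % w: (w in words) for w in word_features}

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)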

We get decent sentiment analysis performance in this case, although there are some problems: punctuation and stop words were not discarded. This is just a case study, and we encourage you to test with more data, more features, or better machine learning models such as deep learning methods.