Motivation:

Natural Language Processing (NLP) is a field of study that focuses on enabling “computers to process and understand human languages.” I am working on the capstone project for the Data Science Specialization, with the objective of building a web app that predicts the next word from the words typed before it, just like cell phone SMS apps. The motivation behind this is SwiftKey, a company that writes predictive-text software for smartphone messaging apps.

When building a data product, the first step is to get the data and the second is to examine (explore) it. Exploring the data is necessary to gain a better understanding of its content and to prepare it for modeling and application building. As such, the text mining process follows a specific set of steps to get the text ready for further statistical analysis and for developing data mining applications.

The objective of this blog is to show the steps taken to explore unstructured text. Since the data is made available in zip format, we are not going to look at how to scrape data from the web here, nor will we cover how to develop the final product. The raw text used here was randomly scraped from the web and includes news snippets, blog posts and Twitter messages. To explore the data we will be using several text mining packages in R.

Examining the data and its content

This is a reproducible document that walks through each step: downloading the data from the source, importing the data into R memory, sampling the data, loading it into a corpus, tidying/cleaning the corpus, generating three n-gram models (unigram, bigram, trigram) using text mining packages, exploring the data by looking at the frequency distribution of words, visualizing the data with wordcloud and ggplot, and drawing conclusions from what is observed.

Import the data

The first step is to import the data. A combination of R and UNIX command lines (for sampling) is used to import the data into R and create a “corpus”. A corpus is a container framework for several different documents from various sources, similar to an SQL database that holds several tables, except that the data in a corpus is often unstructured, meaning it does not necessarily fit neatly into a rectangular data frame as required by structured databases: there is no specific definition of variables, rows and columns. In our case we are going to dump the news snippets, the blog text and the Twitter messages into this software container. This provides a common interface to all the documents that reside in the corpus, making it efficient to work with thousands or even millions of documents at once.
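As a minimal sketch of that import and sampling step for one of the three files (the file path, encoding options and 10% sampling fraction are assumptions, not necessarily the values used in the project):

set.seed(1234)

# read one of the three raw files into memory
blog <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# keep a random subset of lines so the corpus fits comfortably in memory
blog_sample <- sample(blog, round(length(blog) * 0.10))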

Next we examine the data to list the number of bytes, characters and total number of words for the entire raw data set. We are starting with 102.4 million words.

library(stringi)
library(stringr)
# -c The number of bytes
# -l The number of lines
# -m The number of characters
# -w The number of words
# system("wc -clmw final/en_US/en_US.*.txt")
stri_stats_general(blog) # Blog stats

Splitting sentence boundaries

Before importing the three files into a corpus we split lines into sentences. For that we use a regular expression that identifies sentences ending with “.”, “!” or “?”. It is supposed to not treat the period in “St. Louis” or “Lyndon B. Johnson” as the end of a sentence.
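A minimal sketch of one way to do this, assuming the sampled text vectors are called blog_sample, news_sample and twitter_sample (those names, and the exact abbreviation handling, are assumptions; the pattern below only special-cases single initials and “St.”):

library(tm)

# split on ".", "!" or "?" followed by whitespace and a capital letter,
# skipping single initials ("B.") and "St." so "St. Louis" is not broken up
split_sentences <- function(lines) {
  unlist(strsplit(lines,
                  "(?<!\\b[A-Z]\\.)(?<!\\bSt\\.)(?<=[.!?])\\s+(?=[A-Z])",
                  perl = TRUE))
}

sentences <- c(split_sentences(blog_sample),
               split_sentences(news_sample),
               split_sentences(twitter_sample))

# dump all three sources into a single tm corpus
data.corpus <- VCorpus(VectorSource(sentences))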

Preprocessing and cleaning the corpus

Now that we have our collection of text neatly sitting in the corpus, we need to do a preliminary cleanup of the data (or tidy the data). This includes removing profane words, extra white space, numbers and punctuation. When cleaning the corpus, it is important to follow a specific order of steps to avoid losing certain key words.

A function that will clean the data

The first transformation removes non-English characters and punctuation, including the characters used to create emoji in tweets. We also import a list of profane words saved in a file so they can be used to match and remove profanity. We then create a function that uses tm_map to remove this content.
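A minimal sketch of such a function, assuming the corpus created earlier is called data.corpus and the profanity list has already been read into a character vector called profanity (both names are assumptions):

library(tm)

# keep only basic Latin letters, apostrophes and spaces; this also strips
# the characters used to build emoji in tweets
to_plain_letters <- content_transformer(function(x) {
  gsub("[^a-zA-Z' ]", " ", iconv(x, "UTF-8", "ASCII", sub = " "))
})

clean_corpus <- function(corpus, profanity) {
  corpus <- tm_map(corpus, to_plain_letters)        # non-English characters, punctuation
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)           # digits
  corpus <- tm_map(corpus, removeWords, profanity)  # profane words
  corpus <- tm_map(corpus, stripWhitespace)         # extra white space
  corpus
}

data.corpus_clean <- clean_corpus(data.corpus, profanity)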

# The following commands can be used to examine the corpus. For brevity, they are commented out.
#inspect(data.corpus_clean)
#summary(data.corpus_clean)
#as.character(data.corpus_clean[[2002]])

Tokenize the data (n-grams)

We start tokenizing the data with the following commands to see the distribution of word frequencies in the corpus. We can also see the sparsity (how much of the term matrix is zero, as a percentage) and the maximal term length in the data set.
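One way to build the three term matrices is sketched below with the RWeka n-gram tokenizer; the tokenizer choice and object names are assumptions, not necessarily what the project used:

library(tm)
library(RWeka)

# n-gram tokenizers applied to the cleaned corpus
unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigram_tdm <- TermDocumentMatrix(data.corpus_clean, control = list(tokenize = unigram_tok))
bigram_tdm  <- TermDocumentMatrix(data.corpus_clean, control = list(tokenize = bigram_tok))
trigram_tdm <- TermDocumentMatrix(data.corpus_clean, control = list(tokenize = trigram_tok))

# printing a term-document matrix reports its sparsity and maximal term length
unigram_tdm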

Histograms of tokens

If we take a look at the histogram for the word data set, we can see that a small set of words is repeated very frequently across news, Twitter and blog text. In other words, a few words account for the vast majority of the total frequency.
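A sketch of how such a histogram can be drawn from the unigram matrix built above (the 30-term cutoff is arbitrary, and the object names carry over from the previous sketch):

library(ggplot2)
library(slam)

# total frequency of each unigram, sorted from most to least common
freq <- sort(row_sums(unigram_tdm), decreasing = TRUE)
top30 <- data.frame(word = names(freq)[1:30], count = as.numeric(freq[1:30]))

ggplot(top30, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Top 30 unigrams")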

Relative term frequency of the unigram tokens

To examine what share of the total word occurrences is captured by the most frequent words, we extract the frequency counts, group them into chunks (top 5%, top 10%, top 15%, etc.) and calculate the cumulative percentage. As we can see in the table, the top 5% of words account for 86% of the total frequency, and the next 5% account for an additional 7.1%. Combined, the top 10% of words account for 93% of the frequency.
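A sketch of that calculation, reusing the sorted frequency vector freq from the histogram step:

# cumulative share of all word occurrences covered by the top-ranked words
coverage <- cumsum(freq) / sum(freq)
n <- length(freq)

# share of the total frequency covered by the top 5% and top 10% of distinct words
round(100 * coverage[ceiling(0.05 * n)], 1)
round(100 * coverage[ceiling(0.10 * n)], 1)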