More than a List of Words

When indexing text by word frequency / relevance, which is applicable to web searches, one of the procedures used is to create a term frequency (tf) array followed by an inverse document frequency (idf) one. You can read more about this here.

In a previous post I experimented with some text in order to build hashmaps with the words of sentences (to keep things in perspective for a blog post). In that post I used a string that I copied from a course I took some years ago. The string was already preprocessed: the text had already been stripped of punctuation marks.

Today I was experimenting with MongoDB using Java and decided to look up this project. I found it in one of my Eclipse workspaces and decided to add some preprocessing and polish the operations. If you are working with Big Data or AI you are probably working with Python. In my next post I will touch on some Python.

To keep things organized I will make some comments on the data, follow with the output of the program, and end with the actual code. That way I only insert one piece of text from the Eclipse console and the Java code from the Eclipse IDE.

The original preprocessed text was obtained from Alice’s Adventures in Wonderland, by Lewis Carroll. For this post I navigated here (http://sabian.org/alice_in_wonderland1.php) and decided on a paragraph that contained some punctuation marks. I chose:

“Well!” thought Alice to herself “After such a fall as this, I shall think nothing of tumbling down-stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!” (which was very likely true.)

For our purpose we need to remove all punctuation marks and convert the text to lower case. As usual, you need to start with a reasonable, well-thought-out approach and then fine-tune it. If your initial approach requires a lot of tweaks, you should consider an alternative one.

We start by displaying some information from the method / function that displays the text that we will be processing. Of course, if this were an actual project, we would have to process the entire book, not just a single paragraph.

After the information for the paragraph and the width we wish to use to display the text, the actual initial text is displayed. You can see why I chose this paragraph. It contains several punctuation marks (e.g., “ ! , - ( ) etc.).

The next part displays the text in lowercase. The idea behind this is to treat different versions of the same word (e.g., She, she, SHE, etc.) as one. As expected, the number of words in the paragraph has not changed.
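This step can be sketched in one call. Note that this is not the original code from my project; the class and variable names here are mine:

```java
public class LowerCaseDemo {
    public static void main(String[] args) {
        String text = "Well! thought Alice";
        // Lowercasing lets "She", "she" and "SHE" collapse to a single token later.
        System.out.println(text.toLowerCase());   // well! thought alice
    }
}
```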

The next step is to replace all punctuation marks with spaces. We could have used an empty string, but in some cases the resulting text might not contain actual words. For example, “down-stairs” was converted to “down stairs”, which is reasonable. Had we encountered “down-stairs-dungeon”, an empty-string replacement would have generated “downstairsdungeon”, which is not a valid word.
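A minimal sketch of this replacement, assuming the text has already been lowercased (the regex and names are my illustration, not the original code; the actual project may use a different character class):

```java
public class PunctuationDemo {
    // Replace every character that is not a lowercase letter with a space,
    // so "down-stairs" becomes "down stairs" rather than "downstairs".
    static String stripPunctuation(String text) {
        return text.replaceAll("[^a-z]", " ");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("down-stairs-dungeon"));   // down stairs dungeon
    }
}
```

Because the replacement is a space rather than an empty string, hyphenated compounds split into separate words instead of being fused together.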

As expected, the number of words went up from 48 to 63. You can easily find and count them and verify that all is well so far.

The next step, which may or may not be taken, is to remove the excess white space. If we do not do it now, we could postpone the task to a later step and simply skip blank entries while building the array of words. This is one of the items one would consider in production; one approach might be more efficient than the other. In our case efficiency is not that important because we are processing a single paragraph and it takes a fraction of a second. In production one might be dealing with hundreds or thousands of paragraphs per second.
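If we do collapse the white space now, a regular expression does the job. Again, this is a sketch with my own names, not the original code:

```java
public class WhitespaceDemo {
    // Collapse runs of whitespace into a single space and trim the ends.
    static String collapseSpaces(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("  down   stairs  "));   // down stairs
    }
}
```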

Now that we have the text in lower case and without punctuation marks, we will start processing it to build our set of sorted words and associated frequencies. You will note that “t” is not really a word, nor is “wouldn”. In production, at this point you might replace / delete such tokens, or use additional software that removes some words based on the language. Such a step falls beyond the scope of this post. That said, I will probably go over such a mechanism in a future Natural Language Processing (NLP) post.

For our next step we will build an array of words with one word per entry. In our example we end up with an array of 51 words.
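Assuming the white space has already been collapsed, a plain split on a single space is enough (my sketch; the original code may split differently or skip blanks here instead):

```java
public class SplitDemo {
    // Split the cleaned text into an array with one word per entry.
    static String[] toWords(String text) {
        return text.split(" ");
    }

    public static void main(String[] args) {
        String[] words = toWords("alice thought to herself");
        System.out.println(words.length);   // 4
    }
}
```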

The next step is to sort the words in ascending alphabetical order. In practice this step is not required, but for a human it is nice and reassuring to see that all is well so far. Not only that, you can easily locate words that have been repeated. We can see that “I”, among others, has been repeated 3 times.
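Sorting the array is a one-liner with the standard library (again, a sketch rather than the original code):

```java
import java.util.Arrays;

public class SortDemo {
    public static void main(String[] args) {
        String[] words = {"well", "alice", "thought", "alice"};
        // Sort in ascending alphabetical order; repeated words end up adjacent.
        Arrays.sort(words);
        System.out.println(Arrays.toString(words));   // [alice, alice, thought, well]
    }
}
```

Having repeats adjacent is what makes it easy to spot duplicated words by eye.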

Now we are ready to build a histogram. In this context a histogram is a data structure that holds unique words; associated with each word is the number of times that word appears in our sample paragraph. As expected, the word “I” has an associated count of 3 and “think” a value of 2. So far things seem to be going well.
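One way to build such a histogram is with a TreeMap, which keeps the keys in alphabetical order as a side effect. This is my sketch of the idea, not the code from the project:

```java
import java.util.Map;
import java.util.TreeMap;

public class HistogramDemo {
    // Count occurrences of each word; TreeMap keeps keys in alphabetical order.
    static Map<String, Integer> histogram(String[] words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);   // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"i", "think", "i", "think", "i"};
        System.out.println(histogram(words));   // {i=3, think=2}
    }
}
```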

The next step is also something that I did to better observe the values of each key. The keys / words remain in alphabetical order, but they are also sorted by frequency. As expected, very few words are used more than once, and those that are appear towards the bottom of the display.
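Re-ordering the histogram by ascending count could look something like the following. The names are mine, and a stable sort preserves the alphabetical order among words with the same count:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class FrequencyOrderDemo {
    // Re-order a word histogram by ascending count; LinkedHashMap
    // preserves the insertion (i.e., sorted) order for display.
    static Map<String, Integer> byFrequency(Map<String, Integer> counts) {
        Map<String, Integer> ordered = new LinkedHashMap<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.comparingByValue())
              .forEach(e -> ordered.put(e.getKey(), e.getValue()));
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("i", 3);
        counts.put("think", 2);
        counts.put("alice", 1);
        System.out.println(byFrequency(counts));   // {alice=1, think=2, i=3}
    }
}
```

With ascending counts, the repeated words show up at the bottom of the display, which is the effect described above.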

The next step was just done for fun. At the time of the initial post I wanted to put the results in a set. That is the reason we end by displaying a set with our data.

Hope you enjoyed reading this post. The main reason I experiment with concepts and then generate a blog post is to make sure that what I learn / observe is correct.

If you have comments / questions regarding this or any other post in this blog or if you need my services to help with software related issues / concerns please leave me a note on the bottom of this post or send me a private message (john.canessa@gmail.com).