
Building Wordclouds in R

In this article, I will show you how to use text data to build word clouds in R. We will use a dataset containing around 200k Jeopardy questions. The dataset can be downloaded here (thanks to reddit user trexmatt for providing the dataset).

We will require three packages for this: tm, SnowballC, and wordcloud.
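As a sketch of the setup (the CSV file name below is an assumption; use whatever name your downloaded file has), loading the packages and reading the data might look like this:

```r
# Load the required packages (install them first with
# install.packages() if you don't have them)
library(tm)
library(SnowballC)
library(wordcloud)

# Read the Jeopardy dataset into a data frame.
# The file name is an assumption -- adjust the path to your download.
jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)
```

Setting stringsAsFactors = FALSE keeps the question text as plain character strings, which is what the tm functions below expect.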

Now, we will perform a series of operations on the text data to simplify it.
First, we need to create a corpus.

jeopCorpus <- Corpus(VectorSource(jeopQ$Question))

Next, we will convert the corpus to lowercase.

jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))

Then, we will remove all punctuation and stopwords, and convert the corpus to a plain text document. Stopwords are commonly used English words such as I, me, and my. You can see the full list using stopwords('english').
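A sketch of this step, using tm's standard transformations (the PlainTextDocument conversion is needed by some versions of tm before further processing):

```r
# Strip punctuation, then remove common English stopwords.
# Note: stopword removal must come AFTER lowercasing, since
# stopwords('english') contains only lowercase words.
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))

# Convert to a plain text document
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
```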

Next, we will perform stemming. This means that all words are converted to their stem (e.g., learning -> learn, walked -> walk). This ensures that different forms of a word are reduced to the same form and plotted only once in the wordcloud.

jeopCorpus <- tm_map(jeopCorpus, stemDocument)

Now, we will plot the wordcloud.

wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

This will produce the following wordcloud:

There are a few ways to customize it.

scale: This is used to indicate the range of sizes of the words.

max.words and min.freq: These parameters limit the number of words plotted. max.words plots at most the specified number of words, discarding the least frequent terms, whereas min.freq discards all terms whose frequency is below the specified value.

random.order: By setting this to FALSE, we make it so that the words with the highest frequency are plotted first. If we don’t set this, it will plot the words in a random order, and the highest frequency words may not necessarily appear in the center.

rot.per: This value determines the fraction of words that are plotted vertically.

colors: The default value is black. If you want to use different colors based on frequency, you can specify a vector of colors, or use one of the pre-defined color palettes. You can find a list here.
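Putting several of these options together, a customized call might look like the following sketch (the specific values and the RColorBrewer palette choice are just examples):

```r
library(RColorBrewer)

# Plot up to 100 words that appear at least 25 times, with the most
# frequent words in the center, 20% of words rotated vertically,
# word sizes ranging from 4 down to 0.5, and colors from the
# 'Dark2' palette assigned by frequency
wordcloud(jeopCorpus, max.words = 100, min.freq = 25,
          random.order = FALSE, rot.per = 0.2,
          scale = c(4, 0.5), colors = brewer.pal(8, 'Dark2'))
```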

That brings us to the end of this article. I hope you enjoyed it! As always, if you have questions, feel free to leave a comment or reach out to me on Twitter.

Edit: After struggling a lot, I resorted to StackOverflow for a fix. I forgot to convert the document to lower case. As explained in the StackOverflow thread, a lot of the words start with “The”, with an uppercase “T”, whereas stopwords has “the” with a lowercase “t”. This is what was causing the words “the” and “this” to appear in the wordcloud.

Note: I learnt this technique in The Analytics Edge course offered by MIT on edX. It is a great course and I highly recommend that you take it if you are interested in Data Science!

Also, why is “the” showing up in the first place? Isn’t it one of the stopwords contained in stopwords(‘english’)?

Teja K

I was just informed that I forgot to convert it to lower case. A lot of the questions start with “The”, with an uppercase T, and “the” in stopwords has lowercase t, which is why it wasn’t removed. I have updated the code to reflect this.

Ryan

Hey Teja,

Is there a way to specify which column you want to perform the word cloud on? For example, if you only wanted to create a wordcloud on the “Answer” column and not the whole dataframe?

Thanks.

Teja K

Hello Ryan, if you look at the first line of code where we create the corpus, you can see that we created it using the question column. To build a wordcloud from the answer column, just use jeopQ$Answer instead of jeopQ$Question and repeat the rest of the code.

141xgc

Nice post…I ran into what seems to be a problem reported by others too though. Running tm under R 3.2.2 the removeWords step results in…

It’s really interesting. The various OS and R updates I ran since I posted this question seem to have taken care of the problem. Teja’s suggestion was to try adding lazy = TRUE to the tm_map statements, but that wasn’t necessary in the end. Thanks!!!

Wayne Gray

Nice. But no matter what I do, I cannot get rid of the words “this” and “the”. I have gone so far as to quit and restart R (so I would have nothing extra loaded in my environment), as well as to break up your line: jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))

Hello Wayne, thank you for bringing this to my attention. It seems that running the commands in the wrong order is causing this to happen. I just ran the Corpus and the removeWords command (with only the word ‘the’) and I managed to make the word ‘the’ go away. I need to do some more tests, and I will update the article with a more concrete solution tomorrow.


Syed Ali

Nice article

Syed Ali

Did you learn how to use the term_score function in the course? Would you recommend it?