What are Word Clouds?

Word Clouds are a popular way of displaying how important words are in a collection of texts. Basically, the more frequent a word is, the more space it occupies in the image. One use of Word Clouds is to give us an intuition of what the collection of texts is about. Here are some classic cases where Word Clouds can be useful:

Take a quick peek at the word distribution of a collection of texts

Spot frequent stopwords that you want to filter out while cleaning the texts

See the differences in frequent words between two or more collections of texts
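To make the "more frequent means bigger" idea concrete, here is a minimal sketch that counts words and scales each count against the most frequent word. The helper names (word_frequencies, relative_sizes) and the sample sentence are made up for illustration, and the scaling is only an approximation of what a renderer does, not the library's exact layout algorithm.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in a string."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def relative_sizes(frequencies):
    """Scale every count to the range (0, 1] relative to the top word."""
    top = max(frequencies.values())
    return {word: count / top for word, count in frequencies.items()}


# Illustrative sample text (not taken from any corpus)
sample = "the cat sat on the mat and the dog sat too"
freqs = word_frequencies(sample)

# The most frequent words are also the first stopword candidates
print(freqs.most_common(2))
print(relative_sizes(freqs)["sat"])
```

A word with a relative size of 1.0 would be drawn largest. Note that the top word here is "the", which is exactly the kind of stopword you would usually want to filter out.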

Let’s suppose you want to build a text classification system. To see which words are frequent in each category, you’d build a Word Cloud for each category and compare the most popular words across them.
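The per-category comparison can be sketched with plain counters before drawing anything. The data below (docs_by_category) and the helper names are hypothetical and only stand in for a real labelled dataset.

```python
from collections import Counter

# Hypothetical toy data: category name -> list of tokenized documents
docs_by_category = {
    "grain": [["wheat", "prices", "rose"], ["wheat", "exports", "fell"]],
    "money": [["bank", "rates", "rose"], ["bank", "lending", "rates"]],
}


def category_frequencies(docs_by_category):
    """Return one Counter of word counts per category."""
    return {
        category: Counter(word for doc in docs for word in doc)
        for category, docs in docs_by_category.items()
    }


def distinctive_words(freqs, category, n=3):
    """Most frequent words of `category` that occur in no other category."""
    others = set()
    for name, counts in freqs.items():
        if name != category:
            others.update(counts)
    own = [(w, c) for w, c in freqs[category].most_common() if w not in others]
    return own[:n]


freqs = category_frequencies(docs_by_category)
print(distinctive_words(freqs, "grain"))
print(distinctive_words(freqs, "money"))
```

Each per-category counter could then be passed to WordCloud().generate_from_frequencies(...) to draw one cloud per category. With a categorized NLTK corpus, a call such as reuters.words(categories='grain') should be able to supply the tokens in place of the toy data.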

As you probably expected, there’s a Python library that makes building Word Clouds very easy: word_cloud (installed with pip install wordcloud and imported as wordcloud)

Build a simple Word Cloud

Let’s build a simple word cloud for the Reuters corpus:

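import matplotlib.pyplot as plt
from nltk.corpus import reuters
from wordcloud import WordCloud

# Requires the corpus data: run nltk.download('reuters') once beforehand
# Join all corpus tokens into one string and build the cloud from it
wc = WordCloud().generate(' '.join(reuters.words()))

# Show the generated image without axes
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()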

Reuters word cloud

As you can see, the Reuters dataset seems to be composed of finance-related articles. Let’s do the same for nltk.corpus.brown:

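import matplotlib.pyplot as plt
from nltk.corpus import brown
from wordcloud import WordCloud

# Requires the corpus data: run nltk.download('brown') once beforehand
# Join all corpus tokens into one string and build the cloud from it
wc = WordCloud().generate(' '.join(brown.words()))

# Show the generated image without axes
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()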

Brown word cloud

As you can see, the Brown corpus is more general. The most frequent words seem to be the ones you’d expect from everyday English.

Let’s check out a few more features:

from nltk.corpus import stopwords

stop_words = stopwords.words('english')

# Set a max number of words, set a list of stopwords and set the max font size