General Questions

Q: What is this?

A semantics-preserving word cloud visualization tool. A word cloud consists of the most important
words in a document. Each word is printed in a given font and scaled by a factor roughly proportional to its importance.
The printed words are arranged without overlap and tightly packed into a rectangular shape. Your word clouds can be tweaked with
various fonts, layouts, and color schemes. Read the description for more details.

Q: How is it different from Wordle?

Our clouds are semantics-aware. In Wordle (and other similar tools), the placement of the words is
completely independent of their context, and the coloring is random. However, when visualizing a text with a word cloud,
it is possible to automatically identify groups of semantically related words,
that is, the major topics in the input text. Our system places similar words close to each other and assigns them the same color.
This simplifies visual analysis of the text.
See an example of a random (left) and semantic (right) word placement.

Q: How do you make a word cloud?

The input text is parsed and tokenized into a collection of words. Common stop-words ("a", "the", "is")
are removed, and the remaining words are grouped by their stems. The words are then ranked in order of their importance
in the text, and the font size for every word is calculated. Next, semantic similarities between pairs of words are
computed based on co-occurrence in the same sentences. Similar words are then grouped together, and the groups are assigned different colors.
Finally, we lay out the words with an algorithm that employs a theoretical model
for computing semantic word clouds.
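The first steps of this pipeline can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (a placeholder stop-word list, no stemming, plain term-frequency ranking, and a linear font-size scale), not the tool's actual implementation:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "is", "of", "and", "to", "in"}  # placeholder list

def word_cloud_sizes(text, min_size=12, max_size=48):
    # 1. Tokenize the input into lowercase words.
    words = re.findall(r"[a-z']+", text.lower())
    # 2. Remove common stop-words.
    words = [w for w in words if w not in STOP_WORDS]
    # 3. Rank the remaining words by term frequency.
    counts = Counter(words)
    if not counts:
        return {}
    ranked = counts.most_common()
    lo, hi = ranked[-1][1], ranked[0][1]
    span = (hi - lo) or 1
    # 4. Map each frequency linearly onto a font-size range.
    return {w: min_size + (c - lo) * (max_size - min_size) / span
            for w, c in ranked}

sizes = word_cloud_sizes("the cat sat on the mat and the cat slept")
# 'cat' occurs most often, so it receives the largest font size
```

The real system additionally computes pairwise similarities from sentence co-occurrence and uses them to drive placement, which this sketch omits.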

Basic usage

Q: Can I save the word cloud as jpeg/gif/png/svg/pdf?

A cloud is generated as an SVG file, which can be downloaded using the link at the top-right corner of the cloud.
You may also download PNG and PDF versions of the file.
To convert it to another format, use a vector graphics editor such as Inkscape.

Q: Is there a way to edit the cloud once it is created?

Yes! You may drag and reposition the words on the screen. You can even remove words
by right-clicking on them and using the popup menu. It is also possible to re-layout the cloud by
pressing the Apply New Options button.

Q: Can you visualize a webpage or a blog?

You may enter any text, or the URL of a blog or webpage. The maximum size for pasted text is 500 kilobytes, but the practical limit
depends on your browser and computer.

Q: What about my document? How about Twitter? Reddit? Google? YouTube?

We can create a cloud using several sources of the input text:

URL of a document - A link to a file posted on the web. Plain text and PDF documents are accepted.

Twitter - Enter "twitter: search_query" (without quotes) to parse and create a word cloud for the 100 most recent tweets returned by
Twitter Search for the given query. You may customize the results by specifying search operators. For example, the query
"twitter: graph drawing size:500 type:popular lang:en include:retweets" builds a word cloud for
the 500 most popular tweets and retweets written in English containing the words 'graph' and 'drawing'.

YouTube video comments - Use a link to a YouTube video to get a cloud for the comments.

Reddit comments - Give us a link to a Reddit thread and we'll parse, extract, and visualize the top 500 comments.

Q: Does it work for non-English languages?

The tool supports Unicode and many languages. You may choose the language of your source text
in the Advanced Options section. The chosen language affects how the input text is parsed: sentence splitting, tokenization
into words, and stop-word removal. Please note that some features are not supported for non-English languages; for example,
TF-IDF ranking only works for English, as it utilizes the Brown corpus.

Q: How do you determine the sizes of the words?

We rank the words in order of their importance in the input text, using one of three
ranking functions.
Term Frequency is the most basic ranking function and the one used in most traditional
word cloud visualizations; however, it tends to rank many semantically meaningless words highly.
Term Frequency-Inverse Document Frequency (TF-IDF) addresses this
problem by normalizing the frequency of a word by its frequency in a larger text collection. The third ranking function
is based on the LexRank algorithm,
which is a graph-based method for computing the relative importance of textual units using eigenvector centrality.

Note that the last two algorithms always produce provably near-optimal results,
though sometimes the resulting clouds are not as pleasing as those generated with heuristics.
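To make the TF-IDF idea concrete, here is a minimal sketch in Python. The tiny background collection below is a stand-in assumption; as noted above, the tool itself uses the Brown corpus for English:

```python
import math
from collections import Counter

def tf_idf(doc_words, background_docs):
    """Score each word by term frequency times inverse document frequency.

    `background_docs` is a list of word sets standing in for a large
    reference collection; words common across the collection get low scores.
    """
    tf = Counter(doc_words)
    n = len(background_docs)
    scores = {}
    for word, count in tf.items():
        # Document frequency: in how many background documents the word occurs.
        df = sum(1 for d in background_docs if word in d)
        idf = math.log((1 + n) / (1 + df)) + 1  # smoothed IDF
        scores[word] = count * idf
    return scores

doc = ["protein", "protein", "cell", "the"]
background = [{"the", "cell"}, {"the"}, {"the", "gene"}]
scores = tf_idf(doc, background)
# "protein" never occurs in the background, so it outranks the ubiquitous "the"
```

The smoothing constants are one common convention, not necessarily the one the tool uses.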

Q: How do you color the words?

We cluster the words according to their semantic meaning,
and then use different colors for the computed clusters. Thus, semantically related groups of words (e.g., gene, protein, disease, metabolism)
are likely to share a color. This is an intuitive way to visually identify the major topics in the input text.
To identify the clusters, we employ a modularity-based algorithm.
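To illustrate what a modularity-based method optimizes, the sketch below computes Newman's modularity score for a partition of a small word graph. The words, edges, and community labels are illustrative assumptions, not output of the tool:

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected graph.

    `edges` is a list of (u, v) pairs; `communities` maps node -> label.
    Q = (fraction of edges inside communities) - (expected fraction
    under a random rewiring that preserves node degrees).
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed fraction of within-community edges.
    q = sum(1.0 for u, v in edges if communities[u] == communities[v]) / m
    # Subtract the expected fraction for each community.
    for label in set(communities.values()):
        deg_sum = sum(d for n, d in degree.items() if communities[n] == label)
        q -= (deg_sum / (2.0 * m)) ** 2
    return q

# Toy co-occurrence graph: two topical word groups plus one cross edge.
edges = [("gene", "protein"), ("protein", "disease"), ("gene", "disease"),
         ("font", "color"), ("color", "layout"), ("font", "layout"),
         ("gene", "font")]
grouped = {"gene": 0, "protein": 0, "disease": 0,
           "font": 1, "color": 1, "layout": 1}
# Grouping related words together yields a positive modularity score,
# while putting every word in one community scores zero.
```

A modularity-based clustering algorithm searches for the partition that maximizes this score; each resulting cluster then receives its own color.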

Q: I want to know more about your algorithms!

The system relies on a geometric problem behind drawing word clouds, which is described in a series of research papers.

Q: May I see the source code?

Yes! Source code for the entire system is available on GitHub.
The system also comes as a command-line tool: download cloudy.jar and invoke "java -jar cloudy.jar [options] [input file]" (without quotes).
The available options can be printed by running "java -jar cloudy.jar -?".

Troubleshooting

Q: Who are the people behind the system?

The tool is supported by a group of researchers and developers at the University of Arizona.