Tag clouds are histograms

The tag cloud is one of the most pleasing new charts to appear in recent years. Here is an example from Flickr, the on-line photo site, which is especially pretty:

When a Flickr user uploads a photo, she has the option of assigning one or more labels ("tags") to it. Flickr then counts how often each tag is used and plots the top 120 (?) tags, with the font size of each tag proportional to its frequency of use. Tagging is hailed as a massively distributed and participative method of classifying information, and I think it works brilliantly.
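To make the mechanics concrete, here is a minimal sketch of that calculation in Python. It is not Flickr's code: the function name, the 40-point maximum and the sample photos are all assumptions made up for illustration.

```python
from collections import Counter


def tag_cloud_sizes(photo_tags, top_n=120, max_pt=40.0):
    """Map each of the top_n most used tags to a font size in points.

    photo_tags is a list of tag lists, one per photo (hypothetical input).
    The size is strictly proportional to the tag's count, so the most
    used tag gets max_pt and a tag used half as often gets max_pt / 2.
    """
    # Frequency table: how many times each tag was assigned.
    counts = Counter(tag for tags in photo_tags for tag in tags)
    top = counts.most_common(top_n)
    if not top:
        return {}
    max_count = top[0][1]
    return {tag: max_pt * n / max_count for tag, n in top}


# Made-up sample data, not real Flickr counts.
photos = [["wedding", "party"], ["wedding"], ["party", "beach"], ["wedding", "beach"]]
sizes = tag_cloud_sizes(photos)

# Display order is alphabetical; the size carries the frequency.
for tag in sorted(sizes):
    print(f"{tag}: {sizes[tag]:.1f}pt")
```

With strict proportionality the rarest tags can become too small to read, so a real display would probably enforce a minimum size.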

The data itself is nothing more than a frequency table ("wedding" 132,356; "party" 120,222; etc.) but this presentation is visually appealing and aptly functional. Compare with this typical histogram presentation:

The Flickr version is ordered alphabetically, whereas the histogram is ordered by frequency. Since font size already encodes frequency, the tag cloud serves both people who are looking for the most popular categories and those who are looking for a specific term.

Flickr uses a clean interface without excessive underlining, highlighting, dots and so on. No chartjunk! To see chartjunk, go here and here.

Here are some ideas for extension:

Be flexible in selecting the underlying population of tags: clicking on "wedding" gives a list of all photos that were labelled "wedding". As the most popular tag overall, it leads to too many results and too little relevance. Flickr has little tag clouds for each user.

Be flexible with the metric being plotted: aside from frequency of use, the size of the words can vary with other measurements, such as recency of use or frequency of clicks.

Introduce a hierarchy of tags: for example, clicking on "wedding" leads to another tag cloud so users can drill down. This can be implemented using a hierarchical clustering algorithm, for example.
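As a rough illustration of the drill-down idea, the sketch below builds the frequency table for a second-level cloud from the tags that appear on the same photos as the selected tag. Note that this is simple co-occurrence counting, a much cruder stand-in for the hierarchical clustering mentioned above, and that the function name and sample data are invented for the example.

```python
from collections import Counter


def drill_down_counts(photo_tags, selected):
    """Count the tags that share a photo with the selected tag.

    photo_tags is a list of tag lists, one per photo (hypothetical input).
    The result is the frequency table for a second-level tag cloud.
    """
    counts = Counter()
    for tags in photo_tags:
        unique = set(tags)  # count each tag at most once per photo
        if selected in unique:
            counts.update(unique - {selected})
    return counts


# Made-up sample data for illustration only.
photos = [
    ["wedding", "cake"],
    ["wedding", "cake", "family"],
    ["party", "cake"],
]
print(drill_down_counts(photos, "wedding").most_common())
```

The same table could be restricted to one user's photos, or weighted by recency or clicks instead of raw counts, which covers the first two ideas as well.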

P.S. The idea to write this post came to me while chatting with Scott Matthews, who has created an interesting browser add-on, found at www.bitty.com.

Indeed, histograms are usually plotted for continuous variables, and then you get into bin widths, kernels and so on.

This is like a histogram for categorical variables. If I removed the white space between the bars and turned the chart 90 degrees, it would look like a histogram.

Bin width is also a relevant concept here if we modify the interpretation a bit. A key issue (which I avoided bringing up in the post) is that of relevance. In all likelihood we have a long-tailed distribution: the most popular keywords are thousands of times more frequent than the rare ones, yet there are thousands of rare keywords. The decision then is how many categories should be combined into one bar so that there is optimal smoothing.
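One crude way to think about combining categories is to lump every tag below some cutoff into a single bucket. The sketch below does just that; the cutoff, the "(other)" label and the two rare tags are assumptions for illustration, and choosing the cutoff well is exactly the smoothing question left open above.

```python
from collections import Counter


def lump_rare_tags(counts, min_count, other_label="(other)"):
    """Combine all tags used fewer than min_count times into one bucket."""
    lumped = Counter()
    for tag, n in counts.items():
        if n >= min_count:
            lumped[tag] = n
        else:
            lumped[other_label] += n
    return lumped


# "wedding" and "party" use the counts quoted in the post;
# the two rare tags are made up to stand in for the long tail.
counts = Counter({"wedding": 132356, "party": 120222, "aunt-mabel": 2, "img0042": 1})
print(lump_rare_tags(counts, min_count=100))
```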

I'll have to do a separate post on this with some histograms to make this clear.

There is another distinction: there is no canonical ordering in the categorical case. Histograms are not ordered by frequency; their bars follow the order of the underlying variable. I'd argue that histograms are a special case of bar charts, and that it's confusing to call the plots above histograms.

I also wonder whether the numbers in the frequency table underlying the Flickr tag cloud are transformed in some way to keep very popular tags from overwhelming the display.
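I don't know what Flickr actually does, but a logarithmic scale is one common candidate for such a transformation. Here is a minimal sketch, assuming font sizes are interpolated between a minimum and a maximum on the log of the count; the function name, the point sizes and the sample counts are all hypothetical.

```python
import math


def log_scaled_sizes(counts, min_pt=10.0, max_pt=40.0):
    """Map tag counts to font sizes on a log scale.

    counts maps each tag to a positive count (math.log fails on zero).
    The rarest tag gets min_pt and the most used tag gets max_pt.
    """
    if not counts:
        return {}
    logs = {tag: math.log(n) for tag, n in counts.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        # All tags are equally frequent, so give them all the same size.
        return {tag: max_pt for tag in counts}
    scale = (max_pt - min_pt) / (hi - lo)
    return {tag: min_pt + scale * (v - lo) for tag, v in logs.items()}


# Made-up counts spanning two orders of magnitude.
print(log_scaled_sizes({"rare": 1, "middling": 10, "popular": 100}))
```

On this scale a tag used 100 times as often as the rarest one ends up four times as large rather than 100 times, so the popular tags no longer crowd out the rest.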