Thursday, May 3rd, 2012...11:08 am

Getting Graphic in the Data Deluge

April was Math Awareness Month with the theme, “Mathematics, Statistics, and the Data Deluge.” Massive amounts of data are collected every day. Decisions inferred from analyzing the data can impact our daily life. What you are wearing today may be such an example. Indeed, you got dressed with math. One particular way this is true relates to the clothes you buy in a store. Your choices are constrained by the retailer’s selections. To get a sense of what is “popular,” opinion groups rate products via surveys and polls. From the data, decisions are made as to what line a designer or retailer will offer.

From a history of credit card purchases to a stream of measurements from a satellite to Twitter activity, our technological world offers amounts of data almost beyond comprehension. The field of data mining searches for information and insight in such troves of bits and bytes. Google uses data to rank web pages, which determines the order of the results returned when you conduct a search. Netflix mines its ratings data to predict how many stars you’ll give a movie.

Note, however, that mining through the data deluge does not necessarily involve considering all the data. At times, an important step is choosing a representative sample. Suppose a company wants to monitor Twitter to gauge the public’s opinion of a new product. Which words must a tweet contain to be counted, and which words mark a comment as positive versus negative? Even if one cannot track every posting, a representative sample can still return meaningful results.
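As a rough illustration of the word-choice question, here is a minimal Python sketch of a keyword-based tweet classifier. The product name ("Gizmo"), the word lists, and the sample tweets are all invented for this example; a real monitoring effort would need far richer vocabularies and methods.

```python
import re

# Hypothetical word lists, chosen only for illustration.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}


def classify(tweet, product):
    """Label a tweet as positive, negative, neutral, or ignored."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    # Tweets that never mention the product are not counted at all.
    if product.lower() not in words:
        return "ignored"
    # Score = distinct positive words minus distinct negative words.
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweets = [
    "I love my new Gizmo, it is great",
    "My Gizmo arrived broken. Awful.",
    "Just got a Gizmo",
    "Great weather today",
]
labels = [classify(t, "Gizmo") for t in tweets]
for tweet, label in zip(tweets, labels):
    print(f"{label:>8}: {tweet}")
```

Changing either word list changes the labels, which is exactly why the choice of words matters.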

Political polling is an example of drawing conclusions from a sample. In fact, Gallup Polls achieved national recognition by correctly predicting that Franklin Roosevelt would defeat Alf Landon in the 1936 presidential election. This prediction directly contradicted that of the Literary Digest, a widely respected magazine. George Gallup polled 50,000 people, whereas the magazine’s conclusions came from over two million responses. Choosing a representative sample was the key to Gallup’s successful prediction.
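To see why a small representative sample can beat a huge unrepresentative one, consider the following toy simulation in Python. It is not a reconstruction of the 1936 polls: the electorate, the "wealthy" subgroup, its voting pattern, and the fivefold oversampling are all assumptions made up for illustration.

```python
import random


def build_population():
    """Hypothetical electorate of 100,000 voters choosing between A and B."""
    people = []
    # Assumed: 30,000 wealthy voters, of whom 30% back candidate A.
    for i in range(30_000):
        people.append({"wealthy": True, "vote": "A" if i < 9_000 else "B"})
    # Assumed: 70,000 other voters, of whom 73% back candidate A.
    for i in range(70_000):
        people.append({"wealthy": False, "vote": "A" if i < 51_100 else "B"})
    return people


def share_for_a(sample):
    """Fraction of the sample that votes for candidate A."""
    return sum(1 for p in sample if p["vote"] == "A") / len(sample)


rng = random.Random(2012)  # fixed seed so the run is repeatable
population = build_population()
true_share = share_for_a(population)

# Small poll: 2,000 voters drawn uniformly at random, without replacement.
small_random = rng.sample(population, 2_000)

# Large poll: 50,000 responses drawn with replacement, but wealthy voters
# are five times as likely to be reached as everyone else.
weights = [5 if p["wealthy"] else 1 for p in population]
large_biased = rng.choices(population, weights=weights, k=50_000)

small_estimate = share_for_a(small_random)
biased_estimate = share_for_a(large_biased)

print(f"True share for A:          {true_share:.3f}")
print(f"Small random sample:       {small_estimate:.3f}")
print(f"Large biased sample:       {biased_estimate:.3f}")
```

Under these assumptions, candidate A wins comfortably, yet the large biased poll points to a loss, while the much smaller random poll lands close to the true share.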

Data can yield a wealth of information. But, first, we must be able to interpret it. An important decision is how to sift through the data. For example, will we group the data by similarities or rank the elements? Taking Netflix user ratings, one could group movies into mathematical genres such that each group contains movies that users tend to rate in a similar way. Alternatively, someone else might want to rank the movies by user ratings to create a list of the top 100 movies on Netflix.
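The difference between ranking and grouping can be sketched with a tiny, made-up ratings table. The movie titles and star ratings below are hypothetical, and the Pearson correlation is just one possible similarity measure; it is not presented here as what Netflix actually uses.

```python
from math import sqrt

# Hypothetical ratings: each list holds the stars given by the same four users.
ratings = {
    "Space Saga":   [5, 4, 1, 2],
    "Robot Wars":   [5, 5, 2, 2],
    "Love Letters": [1, 2, 5, 5],
    "Paris Nights": [2, 1, 4, 4],
}


def mean(values):
    return sum(values) / len(values)


def similarity(a, b):
    """Pearson correlation between two movies' rating lists."""
    mean_a, mean_b = mean(a), mean(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(centered_a, centered_b))
    denominator = sqrt(sum(x * x for x in centered_a)) * sqrt(
        sum(y * y for y in centered_b)
    )
    return numerator / denominator


# Ranking: order the movies by average rating, best first.
ranking = sorted(ratings, key=lambda title: mean(ratings[title]), reverse=True)

# Grouping (very roughly): pair each movie with the one rated most like it.
nearest = {
    title: max(
        (other for other in ratings if other != title),
        key=lambda other: similarity(ratings[title], ratings[other]),
    )
    for title in ratings
}

print("Ranking:", ranking)
print("Most similar:", nearest)
```

The same four lists of numbers answer two different questions: which movie is rated highest, and which movies are rated alike.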

Once the data is processed, visualization often offers added insight. For example, how might you interpret the wind data for the continental United States? Below we see an example from the IVRG-gallery of a 3D graph containing the Kevin Bacon number of a social network.

Visualization is an active field of research. Choosing how to graph a dataset is, in itself, an important decision. One may have meaningful results after computation, and an appropriate visualization may make the associated insight obvious. However, some data is better suited to graphing than other data. For instance, do you see the following pie chart as a helpful display of information?

While this graph is intended to make the point with some humor, choosing how and when to graph data is an important part of many applications of data mining.

Returning to presidential politics, let’s view two word clouds, which scale the size of words proportionally to their frequency in a source text. Below, we see two word clouds created with Wordle from transcripts gathered from a LexisNexis search of “Obama OR McCain” from September 1, 2008 until November 4, 2008 (Election Day). One word cloud comes from MSNBC transcripts and the other from Fox. Do you know which is which?

The images were created by Greg Newman, a political science major at Davidson College, who is using mathematical clustering algorithms to further analyze such data. While the pictures cannot capture all the inferences that Newman hopes to make with his research, they can give helpful information about these datasets prior to the work involved in data mining. (The image on the left is from MSNBC and the one on the right is from Fox.)
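The idea behind a word cloud, sizing each word in proportion to how often it appears, can be sketched in a few lines of Python. The sample text, the short stop-word list, and the 72-point maximum size are assumptions for illustration; this is not Wordle's actual algorithm, which also handles layout and much more.

```python
import re
from collections import Counter

# A tiny, hypothetical list of common words to leave out of the cloud.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is", "on"}


def word_frequencies(text):
    """Count how often each non-stop word appears, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def font_sizes(freqs, max_size=72):
    """Scale sizes so the most frequent word gets max_size points."""
    top = max(freqs.values())
    return {word: max_size * count / top for word, count in freqs.items()}


# Invented text standing in for a news transcript.
sample = (
    "The economy is the issue. Voters worry about the economy and taxes. "
    "Taxes and the economy."
)
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)

for word, count in freqs.most_common(3):
    print(f"{word}: appears {count} times, drawn at {sizes[word]:.0f} pt")
```

Running the same counts on two different sources and comparing the largest words is the kind of quick comparison the two word clouds invite.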

As we march further into May, keep in mind that data surrounds us. While you may or may not have expertise in the field of data mining, graphically displaying data, in itself, can lead to useful insight. Like an artist with a blank canvas, you must choose what tools to use to paint your picture of the data. What choice you make is often both a science and an art.