Data for dummies: 6 data-analysis tools anyone can use

If you care only about the cutting edge of machine learning and how to manage petabytes of big data, you might want to quit reading now and just come to our Structure:Data conference in March. But if you’re a normal person dealing with mere normal data, you’ll probably want to stick around. Although your data might not be that big or complex, that doesn’t mean it isn’t worth looking at in a new light.

With that in mind, here are six of the best free tools I’ve come across for helping us mere mortals analyze our data without having to know too much about, well, anything (I’d keep an eye on the still-under-wraps Datahero, too). I’ve gathered some personal data and tracked down some interesting public datasets to help demonstrate what a novice can do with them. Someone with more skills can certainly do a lot more, and larger datasets will give more statistically reliable results.

BigML

BigML is to machine learning what Blue Moon is to Belgian ales: a simple approach to something generally more complex, but also rather accessible and good enough to do the job in a pinch. I explained the service more thoroughly in a recent post about it being used to generate predictions of Kickstarter success, but here’s how it works, in a nutshell: Users upload and format data (which is actually pretty easy), BigML discovers the myriad relationships between the variables and creates a predictive model, and users enter hypothetical data and receive a prediction.

I’m pretty bad when it comes to entering my data into Fitbit (see disclosure), but I was relatively good for a month this summer as I prepped for the Warrior Dash, and that’s the data I used to demonstrate BigML. This prediction of how many calories I can expect to burn in a day would work a lot better if I had a bigger sample size and hadn’t occasionally forgotten to log calories and hours slept, but you get the point. The first image is the model the service generated; the second is the prediction interface.
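
For the curious, the upload-model-predict loop can be sketched in a few lines of Python. This is only an illustration of the simplest possible decision tree (a single split, sometimes called a "stump"), not BigML's actual algorithm or API. The function names `fit_stump` and `predict` and the step and calorie numbers are all made up for the example.

```python
from statistics import mean


def fit_stump(rows, feature, target):
    """Find the one threshold on `feature` that best splits `target`.

    "Best" here means the lowest total squared error when each side of
    the split is predicted by its own average.
    """
    best = None
    values = sorted({row[feature] for row in rows})
    # Try a threshold halfway between each pair of neighboring values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        below = [r[target] for r in rows if r[feature] <= threshold]
        above = [r[target] for r in rows if r[feature] > threshold]
        below_mean, above_mean = mean(below), mean(above)
        error = sum((y - below_mean) ** 2 for y in below)
        error += sum((y - above_mean) ** 2 for y in above)
        if best is None or error < best[0]:
            best = (error, threshold, below_mean, above_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, below_mean, above_mean = best
    return {"feature": feature, "threshold": threshold,
            "below": below_mean, "above": above_mean}


def predict(stump, row):
    """Return the average target value for the side of the split `row` falls on."""
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["below"]
    return stump["above"]


# Hypothetical daily logs (not my real Fitbit data).
days = [
    {"steps": 2000, "calories": 2000},
    {"steps": 3000, "calories": 2100},
    {"steps": 9000, "calories": 2900},
    {"steps": 10000, "calories": 3000},
]
model = fit_stump(days, "steps", "calories")
print(model)
print(predict(model, {"steps": 8000}))
```

A real tree keeps splitting each side on whichever variable helps most, which is why more rows and fewer missing entries make the predictions better.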

Google Fusion Tables

The user interface for Google Fusion Tables isn’t what I’d call pretty (“sparse” is probably a better description), but the still-in-experimental-mode visualization tool sure is easy if your data is nicely formatted. I created this interactive map simply by uploading a publicly available dataset about gun violence and clicking the button to create a map:

For this simple comparison of gun ownership and gun homicide rates, I just checked the countries by which I wanted to filter the chart. Easy.

Infogram

If you have really simple data — like a few columns and a handful of rows — Infogram might be the easiest to use of the bunch. The company launched last year with a variety of infographic templates, but it has since expanded to include a large number of charts and graphs, too (including line, pie, pictorial, treemap and bubble). Furthermore, it gives sample data, which you can use as an example to enter your own or format the table you want to upload, and the interactive charts embed nicely into web pages (ours, at least).

Here are the top 10 things I ate during the time I was logging food via Fitbit, excluding copious amounts of beer, water, coffee and Diet Pepsi that I didn’t record.

Many Eyes

Many Eyes is a free web service run by IBM that includes a wide variety of visualizations ranging from maps to pie charts to scatter plots. But what makes it stand apart from the others is the suite of text-analysis tools it offers: not only are they fairly novel, but all they require users to do is paste a page of plain text into the web interface and press a button to visualize it. I used it to analyze the last 15 posts I’ve written for GigaOM.

What did I find? For starters, I use the words “data,” “Facebook” and “users” a lot.

When it comes to two-word combinations, “big data,” “data centers” and “hard drives” are among the biggies.

This one is particularly interesting, showing how I tend to form phrases around certain words with common conjunctions, or just a space, in between.

Apparently, out of 10,013 words, I only used “cloud” 20 times. I usually followed it up with “provider,” “servers,” “computing,” “-based” and “providers.”
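
If you'd rather see what's happening under the hood, the same three counts (single words, two-word combinations and the words that follow a given word) can be approximated with Python's standard library. This is a rough sketch and not how Many Eyes works internally; the function names and the sample sentence are made up for illustration, and the tokenizer simply lowercases everything and ignores punctuation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9'’]+", text.lower())


def top_words(text, n=3):
    """Most frequent single words."""
    return Counter(tokenize(text)).most_common(n)


def top_bigrams(text, n=3):
    """Most frequent two-word combinations (adjacent word pairs)."""
    words = tokenize(text)
    return Counter(zip(words, words[1:])).most_common(n)


def words_after(text, target):
    """Count which words immediately follow `target`."""
    words = tokenize(text)
    return Counter(b for a, b in zip(words, words[1:]) if a == target)


# A made-up sentence, not one of my actual posts.
sample = ("Big data needs cloud computing, and cloud servers "
          "fill data centers with big data.")
print(top_words(sample))
print(top_bigrams(sample))
print(words_after(sample, "cloud"))
```

Paste in a few posts' worth of plain text instead of the sample sentence and the counts start to get interesting.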

For fun, I also made a word cloud based on a couple of months’ worth of Fitbit food logs. It turns out, you can take the boy out of Wisconsin, but …

Statwing

Statwing might be my favorite of the bunch, if only because it’s so simple yet actually tries to teach users about statistics. You upload data, check the variables you’re concerned with, and it plots their relationship. (It also can describe the variables by highlighting the sample size, minimum, maximum, mean, median and standard deviation.) Graphs are accompanied by explanations as to how strong the correlation is based on various statistical metrics, as well as the results of a linear regression model.
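
For anyone wondering what those numbers actually are, here is a minimal sketch of the same summary statistics, a Pearson correlation and a straight-line fit, using only Python's standard library. It is not Statwing's code, and the function names and the step and calorie figures are invented for the example.

```python
import statistics
from math import sqrt


def describe(values):
    """Summarize one numeric column."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson_r(xs, ys):
    """Pearson correlation: -1 (inverse) through 0 (none) to 1 (in lockstep).

    Raises ZeroDivisionError if either column never varies.
    """
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Made-up daily numbers, not my real Fitbit export.
steps = [4200, 6100, 7900, 10300, 11800]
calories = [2150, 2280, 2540, 2690, 2910]
print(describe(steps))
print(pearson_r(steps, calories))
print(fit_line(steps, calories))
```

With only a handful of rows, a high correlation can easily be a fluke, which is the kind of caveat Statwing's explanations try to spell out.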

To demonstrate Statwing, I went back to the Fitbit data. Of the variables that Fitbit tracks, some correlations are easy to predict (e.g., steps and calories burned), but I was kind of surprised to see that the 86 minutes a day I spent being fairly active really weren’t that good a use of my time.

Tableau Public

Tableau Public, the only free version of the popular business-intelligence software, was clearly designed with business users in mind. It expects a lot of structure in the data, and although you can edit almost every aspect of it within the application to get it into usable shape, the service doesn’t offer much guidance if you don’t speak the language of BI (it also requires Windows). But the software is very good at deciphering the characteristics of different variables, the drag-and-drop operation makes it fairly easy to experiment, and the wide array of visualizations looks really nice.

Using my Fitbit data (and here’s where you see how lax I am at data entry), I created a line graph comparing the calories I ate each day with the calories I burned. Assuming I didn’t go crazy eating on the days I forgot to make entries, the good news is I never ate more calories than I burned. (Note: Although these are static images, Tableau Public actually lets you embed interactive charts, which I’ve used in the past on several occasions, but they don’t always fit well within our pages.)

Here’s one I played around with a while back, charting Amazon’s “Other” revenue against the number of objects stored in Amazon S3.

There is, however, one disclaimer that applies to all of these tools: I didn’t get into cleaning and formatting data, which can be a somewhat arduous process. Many tools expect some sort of structure to the data (the X axis in columns and the Y axis in rows, measurements without units such as grams, etc.) that just isn’t present if you’re downloading an Excel or CSV file rather than creating it yourself. Sometimes, with comprehensive datasets like your Fitbit Premium data, you’ll have to separate or combine the relevant data into new spreadsheet tables before uploading it to a service. But once you have the data ready to go, these tools can help you analyze it, visualize it and hopefully glean some insights from it.
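
If you're comfortable with a little scripting, the unit-stripping chore is one you can automate. The sketch below uses Python's built-in `csv` module and assumes a hypothetical export with columns named `date`, `calories` and `weight`. Those column names, the sample rows and the helper names are mine, not Fitbit's actual export format. It also assumes commas inside numbers are thousands separators.

```python
import csv
import io
import re


def strip_units(value):
    """Return the leading number in a string such as '150 g'.

    Returns None when there is no number (e.g. 'n/a' or an empty cell).
    Assumes commas are thousands separators, as in '2,345 kcal'.
    """
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", (value or "").replace(",", ""))
    return float(match.group(1)) if match else None


def clean_rows(csv_text, numeric_columns):
    """Parse CSV text and convert the named columns to plain numbers."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned_row = dict(row)
        for column in numeric_columns:
            cleaned_row[column] = strip_units(row.get(column))
        cleaned.append(cleaned_row)
    return cleaned


# A made-up export with units baked into the cells.
raw = '''date,calories,weight
2012-07-01,"2,345 kcal",150 g
2012-07-02,n/a,148.5 g
'''
for row in clean_rows(raw, ["calories", "weight"]):
    print(row)
```

Cells that can't be read as numbers come back as None, so you can decide whether to drop those rows or fill them in before uploading.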

Disclosure: Fitbit is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, founder of Giga Omni Media, is also a venture partner at True.

Reblogged this on Change Meme and commented:
I’ve spent about an hour playing with these tools, I’m loving Statwing, and will use it to analyse some of the data we’ve got on adoption of new technology. The Infogram tool also has potential to help present data in a more appealing way.

Question: from what I gleaned from BigML’s site, the reason “everybody” can use it is simply that they limit themselves (and their users) to a single modeling technique, decision trees, with limited possibilities for adjusting its characteristics. Is limiting choice really the best way to go?

oh if you had only mentioned real analytics tools instead of the drivel you’ve posted. any real user should look into Graylog2, Kibana, or Splunk for real data analysis tools: all of which require no more technical skill than a monkey on crack.

Reblogged this on Marketing Online Updates and commented:
Here are a few data analysis tools as demonstrated by Derrick Harris of GigaOM.
The importance of knowing the statistical data of your websites, blogs, social media activities and your overall social influence cannot be overstated. This information is what will guide you in moving your business forward.
And in the case of anyone looking to sell their Internet Marketing skills to local businesses, this type of information, in regard to their business, could determine whether or not you get a new client!

Reblogged this on Niki.V.all.ways.My.way. and commented:
OK … I’m not giving it a LIKE because it has like 3x the number of words I would EVER read about the iNet, but seriously, the GRAPHIC gives up all the information. Excellent work on that so, Reblogging the picture, really. <3 important 411!