Archive for the ‘data analysis’ Category

I recently saw this announcement for an open source tool and thought it might be interesting to some folks that deal with messy data sets.

Google Refine provides an interesting take on grouping and filtering data and then getting it cleaned up. It also does some pretty interesting stuff using web APIs to transform data (see video 3, in particular).

The tool focuses on the data clean-up side of things, rather than analysis and reporting. You may run into some trouble with larger data sets, though, as, I believe, the processing is performed entirely in memory.

However, for data geeks out there, it’s definitely worth a look and might even be a nice complement for Kirix Strata at times.

If you have a chance to play with it, feel free to let us know what you think in the comments below.

So, I was looking at our sales tax report in QuickBooks and, like a good accountant, quickly checked that it matched the total revenues on the income statement. It didn’t.

Hmm… funny thing about accounting, things really ought to balance.

It was a small discrepancy, but after searching unsuccessfully for the difference, it was clear that the issue involved more than one transaction. And, unfortunately, there were just far too many transactions to try and come up with a solution manually.

So, since I happened to have this data browser lying around, I exported both reports as CSV files and opened them up in Kirix Strata™.

The QuickBooks CSVs were obviously meant for spreadsheet use (they included subtotals and odd record breaks), so I quickly did some clean-up and then a few gymnastics to compare the tables. It turns out there were a few manual journal entries that weren’t mapped to the sales tax codes QuickBooks requires. And here I was hoping to blame QuickBooks… oh well.

Running through this process was a five-minute affair, but it made me wonder about all the other small data manipulation tasks that are out there. There have got to be millions, nay, billions, of these things: five-minute, one-off, ad hoc data tasks that just can’t be solved with the help of a spreadsheet (in this case, grouping and table relationships were needed to do it quickly).
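For the curious, the table gymnastics can be sketched outside of any particular tool: group each export by account, then diff the totals. Here’s a minimal Python sketch of that reconciliation; the column names (`Account`, `Amount`) are hypothetical, since real QuickBooks exports have their own headers and subtotal rows.

```python
import csv
from collections import defaultdict

def totals_by_account(path, account_col, amount_col):
    """Sum amounts per account from a CSV export, skipping blank or
    non-numeric rows (e.g., subtotal lines left over from the export)."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            account = (row.get(account_col) or "").strip()
            amount = (row.get(amount_col) or "").replace(",", "").strip()
            if account and amount:
                try:
                    totals[account] += float(amount)
                except ValueError:
                    pass  # subtotal/header debris
    return totals

def diff_totals(a, b):
    """Return accounts whose totals differ between the two reports."""
    return {k: (a.get(k, 0.0), b.get(k, 0.0))
            for k in set(a) | set(b)
            if abs(a.get(k, 0.0) - b.get(k, 0.0)) > 0.005}
```

Any account that shows up in the diff (here, an unmapped journal entry would) is a candidate for a closer look.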

What do people normally do in these situations? I fear that they probably spend hours working the problem manually. Got a similar story and/or solution? Feel free to share in the comments section below.

How do you access these data sets? Well, there’s a slight hitch: you need to fire up an EC2 instance, hook into the data set and then perform your analysis there. You only pay for the cost of the EC2 service itself. Given how massive these tables are, it seems like a pretty good way to go. A step closer to the supercomputer in the cloud.

We’re devoted users of Amazon S3 here and have also done some work with EC2, which is quite impressive. Overall, this is another example of a nice trend where large data sets are becoming more easily accessible.

This is not a blog that delves into political issues, but I happened to notice that the Obama transition team released the names of all their donors today. However, inexplicably, they don’t have them in a CSV format for easy slicing and dicing in your favorite data analysis software.

Today I came across a survey on data mining by a consulting firm called Rexer Analytics. Their survey took into account 348 responses from data mining professionals around the world. A few interesting tidbits:

* Data miners spend only 20% of their time on actual modeling. More than a third of their time is spent accessing and preparing data.

* In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.

We’ve found these issues to hold true with our clients as well, particularly in various auditing industries. Auditors will get a hold of their client’s data, maybe in some delimited text file. The data set is inevitably too large for Excel to handle easily, so they may try Access (of course, once they are eternally frustrated, they give Kirix Strata™ a shot).

Once they can actually see that data set, they start exploring it to learn about what they’re looking at and then inevitably find out how dirty it is. Multiple fields are mashed together or individual ones are stripped apart. Company names appear multiple times in various forms (“I.B.M” vs. “IBM”). An important bit of numeric information is embedded in a text field. There is no end to the time spent “purifying” the data set to avoid the “garbage in, garbage out” syndrome.

Though often overlooked, data cleansing is as important as the analysis itself. Only once this step is complete can you move on to your data mining or other data analysis.
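As one small illustration of this kind of cleansing, a crude normalizer can fold company-name variants like “I.B.M” and “IBM” together. This is just a sketch under my own assumptions (the suffix list is mine, not exhaustive, and real-world matching usually needs fuzzier logic):

```python
import re

def normalize_name(name):
    """Crude company-name normalizer: drop punctuation, uppercase,
    collapse whitespace, and strip a few common legal suffixes."""
    s = re.sub(r"[^\w\s]", "", name).upper()      # "I.B.M" -> "IBM"
    s = re.sub(r"\s+", " ", s).strip()            # collapse runs of spaces
    s = re.sub(r"\b(INC|CORP|CO|LTD|LLC)$", "", s).strip()
    return s
```

Grouping on the normalized value instead of the raw field is what lets the variant spellings land in the same bucket.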

Check out the survey summary yourself and let us know if it matches your experience.

Historically, business intelligence hasn’t been all that it’s cracked up to be. Very expensive data warehousing systems are put in place. Existing reports are re-created and all kinds of new objects/reports are added. Everyone is thoroughly trained on the system. Pretty 3D graphics are added to the dashboard. The project goes over budget. Users revert to using Excel.

Some would say that BI is just a fancy way to do organizational reporting. There’s a lot of truth to this; why else do people continue to rely on their spreadsheets when they need to do some quick and dirty analysis? I think the answer is that there is a substantial ad hoc component to the “intelligence” part of business intelligence that will never be captured by a large, centralized system.

Having a few BI gurus setting up reports for everyone just isn’t an efficient use of resources. Nor does it capture the collective brain power of the organization. And there is quite a bit of this power ready to be tapped, even in the deepest corners of a company.

For example, we’ve done a lot of work with folks in the accounts payable industry. AP, as keeper of the company checkbook, is not what you’d call a very sexy part of the organization — yet billions of dollars flow through it each year. There are efficiencies to be gained, analyses to be done and, in our experience, a whole slew of people eager to do a bang-up job. However, when an AP manager needs to get something from the legacy system or just wants to create a new type of report, they have one of two options: either go to IT and hope a report can be created within the next couple of weeks, or go to the mattresses with Excel/Access and do what they need to do themselves.

Neil echoes this when comparing BI 1.0 to BI 2.0:

BI 1.0 Fallacy: Most users want to be spoon-fed information and will never take the initiative to create their own environment or investigate the best way to get the answers they need.

BI 2.0 Reality: The Consumer Web invalidates this idea. When given simple tools to do something that is important and/or useful to them, people find a way to “mash up” what they need.

We’ve seen people’s initiative on display time and again and are really happy that Kirix Strata is playing a part in making this type of ad hoc analysis, integration and reporting easier than ever.

So, give those articles a read and see what you think. Also, please consider joining us on Wednesday at 1 PM EDT for our free hour-long web seminar with Snaplogic called “BI 2.0: Data Visualization and Spreadsheets on Steroids.” All the pertinent details can be found here. Hope to see you then!

During our journey from the MySQL Conference last week to the Web 2.0 Expo this week, we’ve continued to see a common pattern. There are a tremendous number of back-end business systems and almost all of them seem to rely on the humble Web browser for the front-end.

Now there is a very good reason for this — the Web is a tremendous distribution platform and companies are taking advantage of the same protocols and tools on their intranet that have already propelled the growth of the Internet. And, bottom line, browsers are really, really easy to use.

Data Integration 2.0

As an example of this situation, we met some folks last week from Snaplogic. Their product makes it really simple to integrate data from back-end systems and enables you to access data from anywhere, mash it up and publish it. However, when they were demoing the software, there wasn’t much to show off, since a normal browser can’t do much with the data except let you “Save As” to your desktop.

Thankfully, this type of front-end work is perfectly suited for a data browser. So, we decided to demo each other’s products at the Web 2.0 Expo this week. Snaplogic is able to easily show the value they bring to the table by actually displaying the data integration they can do. We put together a quick dashboard with Snaplogic data and were able to show off all the cool stuff you can do once you’ve got a data feed to work with. Indeed, this was like the serendipitous combination of chocolate and peanut butter.

The Last Mile

This reminded me of a great metaphor used in a post by Juice Analytics called “The Last Mile of Business Intelligence.” The idea is that businesses have built lots of large back end systems — from data integration and warehousing to BI and reporting systems. But, in the end, it still seems that actual users are left out in the cold. The “last mile” between the heavy machinery and the actual knowledge worker has not been adequately bridged.

So unfortunately, the typical pattern involves working with a back end system via a Web browser and then, ultimately, exporting a CSV file to work with in Excel.

This is not terribly efficient. A data browser enables you to open up those same intranet pages and perform your normal browsing tasks. However, because it’s a data analysis tool at heart, you can actually open up data sets and start working with them directly. Then you can do fun things like bookmark tables, build calculations, create reports and write queries, and have them refresh the next day with the latest data. Because of this, it plays very nicely in the worlds of exploration and reporting per Juice’s “data value framework.”

Why Incorporate the Web?

In the past we’ve shown off some of the fun things Kirix Strata™ does — like opening up HTML tables natively or showing RSS feeds in a table view — but often we’ll get a response like, “that’s cool, but what do I do with it?” There is no question that tools like these can be very useful for some tasks (e.g., integrating data from the web with local tables), but they’re really just icing on the cake.

The important thing is how tightly the web is integrated with the data analysis tool. This opens up all kinds of possibilities for direct interaction with both the Web infrastructure and the database systems that power business intelligence in the enterprise. Add some Web APIs into the mix, and now you could have access to the contact information in your ACT database, the corporate newsletter contacts in a MySQL system and all the recent sales information from your Salesforce.com account. Explore and integrate at will.

Given that the meme “Enterprise 2.0” resounds louder each day, we should only expect to see Web infrastructure reaching deeper and deeper into organizations. In the end, maybe part of the BI “last mile” solution is just giving businesses a browser with a bit more intelligence.

If you haven’t checked out what the fine folks at Many Eyes are doing with “community” data visualization, it is well worth a peek. I took a look at one of their recent blog posts today regarding the new map visualizations they are offering. Very nice stuff indeed.

However, the thing that really caught my [many] eye[s] was the mention of a data set denoting “the regional slang for those odd little bugs that curl into balls.” This is definitely not something that keeps me up at night, but I’ve always wondered about the monikers of these benign little creatures. I grew up calling them “pill bugs.” However, probably due to some deep psychological desire to be accepted in elementary school, I eventually started referring to them as roly-polies, since that is what my friends called them.

I dug up the visualization in question and was pleased to see that I probably wasn’t the only kid in Chicagoland who struggled with these entomological naming conventions — in fact, there are a bunch of other names given to these li’l guys across the US. For posterity, Illinoisans call this thing a Pill Bug, Roly-Poly, Potato Bug, Sow Bug and Doodlebug 15%, 41%, 7%, 6% and 3% of the time, respectively.

The one downside I’ve encountered with Many Eyes is that they only seem to provide the underlying data set as a .txt file (albeit in tab-delimited format). Kirix Strata™ doesn’t recognize the tsv-in-txt’s-clothing just yet, so you need to save the file to the desktop and then open it up in Strata (selecting a text-delimited file type with a tab delimiter). But, once you get it in there, fire up your analytical skills and manipulate away.

This afternoon I was doing some analysis on our web logs and thought it might make for a good screencast and blog post. We currently use a combination of AWStats and Google Analytics for our web stats, but are increasingly using Kirix Strata™ to dig deeper into the raw web logs for the more customized things that aren’t readily available otherwise.

Also, honestly, it is kind of fun to plow through almost a million records on your own. Hmmm, maybe I should get out more.

The topic of the screencast below is the search terms people enter to find things in our phpBB3 support forums. These terms are embedded in the “request” field of the Apache logs, and I couldn’t find a way to get them without digging into the logs themselves. (NOTE: I wouldn’t doubt that there is some way to do this via a mod to phpBB or a filter in Google Analytics… but since I couldn’t find anything via a quick Google search, using Strata just ended up being a lot faster.)

An example of a search string we’re dealing with is:

GET /forums/search.php?keywords=proxy HTTP/1.1

So the trick was to parse the search keywords out of the field and then group them together to see what people were searching for… and in turn give us the chance to improve our support area by targeting some of these search terms and expanding our documentation accordingly.
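The parse-and-group step can be sketched in a few lines of Python. The screencast does this with Strata’s built-in functions, so the helper names below are mine, not Strata’s:

```python
from urllib.parse import urlsplit, parse_qs
from collections import Counter

def search_terms(request_field):
    """Pull the 'keywords' query parameter out of an Apache request
    field like 'GET /forums/search.php?keywords=proxy HTTP/1.1'."""
    try:
        method, path, _proto = request_field.split(" ", 2)
    except ValueError:
        return None  # malformed request field
    if "/forums/search.php" not in path:
        return None
    terms = parse_qs(urlsplit(path).query).get("keywords")
    return terms[0].lower() if terms else None

def tally(request_fields):
    """Group the extracted terms and count how often each was searched."""
    return Counter(t for t in map(search_terms, request_fields) if t)
```

Sorting the resulting counts from most to least frequent gives the list of topics worth expanding in the documentation.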

TECHNICAL NOTE:

I downloaded the Apache logs from the server and, due to the file size, decided to import them into Strata rather than open the file and work with it directly. To import your logs, go to Import, select text-delimited files, and then import as space delimited with quotation marks as the text qualifier. Update: You can now use a handy little log parsing extension to pull in your web log files without having to mess around with a straight text import.
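For reference, Python’s csv module splits log lines the same way that import does, with space as the delimiter and quotation marks as the text qualifier. A small sketch (note that the unquoted bracketed timestamp still spans two columns, just as it would in a straight text import):

```python
import csv

def parse_log_lines(lines):
    """Split Apache log lines: space-delimited, quotation marks as the
    text qualifier, so the quoted "request" field stays in one column."""
    return list(csv.reader(lines, delimiter=" ", quotechar='"'))
```

With the common log layout, the quoted request lands in column 6 and the status code in column 7 (indices 5 and 6), though the exact positions depend on your server’s log format.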

TECHNICAL NOTE 2:

For posterity, here are the functions that were used in this screencast:

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.