
In this blog I try to present analytic material for a non-analytic audience. I focus on point-of-sale and supply-chain analytics: it's a complex area and, frankly, it's far too easy, whether writing for a blog or presenting to a management team, to slip into the same language I would use with an expert.

So, I was inspired by a recent post on Nathan Yau's excellent blog FlowingData to look at the "readability" of my own posts and apply some simple analytics to the results.
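Readability scores like Flesch reading ease are straightforward to compute. Here's a minimal sketch, assuming a crude vowel-group heuristic for syllable counting (real readability tools use dictionary-based counters, so treat the absolute numbers loosely and use them only to compare posts):

```python
import re

def flesch_reading_ease(text):
    """Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier reading."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))

    def syllables(word):
        # Crude heuristic: count groups of consecutive vowels, minimum one.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    total_syllables = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (total_syllables / n_words)
```

Short sentences of short words score high; long, polysyllabic sentences score low (even negative), which is exactly the trap expert-to-expert writing falls into.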

A couple of weeks ago, I ran a somewhat rough benchmark to show just how much faster large database queries can run if you use better tools.

The right tools for (structured) BIG DATA handling

Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle, and you need to do it without the combined resources of your IT department being placed at your disposal. Sound familiar?

I looked at the value of upgrading hard-drives (to make sure the CPU is actually busy) and the benefit of using columnar storage, which lets the database pull back data in larger chunks and with fewer trips to the hard-drive. The results were staggering: a combined 4100% increase in processing speed, so that I could read and aggregate 10 facts from a base table with over 40 million records on my laptop in just 37 seconds.

At the time I promised an update on a significantly larger dataset to see whether the original results scaled well. I also wanted to see whether query times scaled with the number of facts: ideally, querying against 5 facts should take about 50% of the time of the original 10-fact aggregation queries.
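A back-of-the-envelope model shows why columnar storage helps, and why 5 facts ought to cost about half of 10. The row and column counts below come from the benchmark; the 8-byte value size is my own assumption for illustration:

```python
# Toy I/O model for the benchmark table: 40M records, 10 fact columns.
ROWS = 40_000_000
FACT_COLUMNS = 10
VALUE_BYTES = 8  # assumed size of one stored value

def row_store_bytes(facts_queried):
    # A row store scans every column of every record, needed or not,
    # so the number of facts queried doesn't change the bytes read.
    return ROWS * FACT_COLUMNS * VALUE_BYTES

def column_store_bytes(facts_queried):
    # A column store scans only the columns the query touches, so work
    # scales linearly with the number of facts queried.
    return ROWS * facts_queried * VALUE_BYTES
```

Under this model a 5-fact query in the column store scans exactly half the bytes of a 10-fact query, which is the 50% scaling I was hoping to see.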


Let's use point-of-sale data as an example, as POS data can easily generate more data volume than the ERP system. The data is simple and easily organized in conventional relational database tables: you have a number of "facts" (sales revenue, sales units, inventory, etc.) defined by product, store and day going back a few years, and then some additional information about products, stores and time stored in master ("dimension") tables.
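That layout is a classic star schema. Here's a minimal sketch in SQLite; the table and column names are my own illustrative choices, not from any particular system:

```python
import sqlite3

# One fact table keyed by product, store and day, plus dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region   TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, week     INTEGER);
CREATE TABLE fact_pos (
    product_id    INTEGER REFERENCES dim_product(product_id),
    store_id      INTEGER REFERENCES dim_store(store_id),
    date_id       INTEGER REFERENCES dim_date(date_id),
    sales_revenue REAL,
    sales_units   INTEGER,
    inventory     INTEGER
);
""")

# A typical reporting query: revenue by region for one product category.
query = """
SELECT s.region, SUM(f.sales_revenue)
FROM fact_pos f
JOIN dim_store s   ON s.store_id   = f.store_id
JOIN dim_product p ON p.product_id = f.product_id
WHERE p.category = ?
GROUP BY s.region
"""
```

The facts sit in one narrow, very long table; everything descriptive lives in the small dimension tables, which is what keeps the volume manageable.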

The problem is that you have thousands of stores, thousands of products and hundreds (if not thousands) of days - this can very quickly feel like "big data". Use the right tools, and my rough benchmarks suggest you can not only handle the data but see a huge increase in speed.
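To put rough numbers on it (the store, product and day counts below are illustrative, not from any real retailer):

```python
# Potential fact-table size if every product sold in every store every day.
stores, products, days = 2_000, 3_000, 1_000  # illustrative counts
rows = stores * products * days               # product/store/day combinations
```

In practice sparsity helps (most products don't sell in most stores on most days), but even a small fraction of six billion combinations is far beyond what a spreadsheet can hold.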

Whether your analytic tool of choice is Excel or R or Access or SQL Server or... whatever, if you've worked a reasonable range of analytic problems, I will guarantee that at some point you have tried to make your preferred tool do a job it is not intended for or is ill-suited to. The end result is an error-prone maintenance nightmare - and there is a better way.