The Ladybird Guide to Business Intelligence

Recently I have found myself frustrated by the lack of a very simple overview for Business Intelligence explaining what problems it solves, and how.

For example, the Pentaho BI platform FAQ has a promising first question: “What is a business intelligence (BI) platform?” The answer is typical of BI overviews I have seen:

A comprehensive development and runtime environment for building complete solutions to business intelligence problems. The Pentaho BI Platform is the infrastructure and core services that integrate business intelligence components to complete the BI Suite. This includes the infrastructure necessary to build, deploy, execute and support applications.

I don’t know about you, but that gives me more questions than answers. What type of problems are business intelligence problems? What are the core services provided by a BI platform? What are BI components, and a complete BI suite? In short, what does it do?

I think I have figured out what Don Norman calls a conceptual model which is about right, so for those who have struggled as I have recently, here is the Ladybird guide to Business Intelligence.

What problems does BI solve?

Let’s say you are the CEO of a company, and you want to track what the costs of the company are, across payroll, purchasing, marketing and sales, overall and by division. You also want to track revenues by division, product line, market and month. For each variable, you’d like to drill down when you see a figure that looks odd. Payroll in Asia increased 20% this year – did we buy a company? Are there savings to be made?

All of this information spans dozens of different computer systems, applications, databases. What you want is one application to rule them all, from which you can get nice graphical clickable data.

Let’s say you’re a free software project manager or community manager. You have lots of infrastructure for people working on the project – source control, mailing lists, forums, translation infrastructure, documentation, bug tracking, downloads, …

You want to know if your community is growing, shrinking or stagnant. You’d like to know if translators are up, and spot when something is up – we lost 3 Thai translators last cycle, does the Thai translation team have a problem? Is there a problem with wiki spam? A correlation between people active on the forums and commits to the project? Some of these questions span different applications, systems, and databases. What you want is one application to rule them all, where you can get a quick overview of what’s happening in the community, and click on something to drill down into the data, or create complex queries to spot correlations and patterns across different apps.

BI software is ideally suited to helping in both of these situations.

How does it work?

Very simply, a BI platform is a web application that allows you to create queries and visualise the results across a variety of data sources. At its simplest, you bring big lumps of data together and extract some useful numbers from it. If you’ve ever used a pivot table in a spreadsheet application, you’ve written a BI query.

Now we get into the acronyms and the jargon. Here’s a quick lexicon of commonly used BI terms:

ETL

Extract/Transform/Load – An ETL module allows you to script and automate the extraction of data from a funky data source (say, CSV files on a server, an auto-generated spreadsheet, or screen-scraping data from a HTTP query, or just an SQL database), and transform it into some other format (typically basic transformations like joins, mapping inputs to database fields, or applying simple arithmetic to convert to an agreed unit), and then store the result in a database.

OLAP

Online Analytical Processing – a fancy name for “queries”. There is a de facto standard query format called MDX and the database needs to be optimised for “multidimensional queries” (aka joins – like pivot tables in a spreadsheet).

Data Warehouse

A fancy name for database.

Reporting

The presentation of the results of queries in a graphical way.

In brief, then, a BI suite provides you with a way to suck in data from a variety of sources, store the data (if you need to) in a custom database which is optimised for querying across different data sources, a nice way to define the queries in which you are interested, and then present the results of those queries in a nice graphical way.

If you don’t need to do any transformation of data, and you can operate directly on SQL databases, then you can typically provide the BI platform access to them directly. If you have any unusual data sources, or want to transform data, you will need an ETL module. If you are dealing with a lot of data and want to optimise query time, you might need a specific OLAP server. A query editor will help you create queries to get the information you want out of your data. You will need a reporting module to convert query results from raw tabular form to pie charts, bar charts and the like. And the BI server provides hooks for all of these various modules to work together, sucking in, storing, manipulating and presenting data in interesting ways.

Is this all right?

I would love to know if my mental model is flawed – so if I’m missing anything important, or I’ve said something which is a pile of rubbish, please do add a comment and let me know.

I know how hard it can be to cut through the jargon in an area where it’s ubiquitous and the first step in enterprise software is usually the hardest, so hopefully this will be useful to someone other than myself.

6 Responses

* “Reporting” may also mean the input, by users, of data. Well, either manual input or automatic loading.

* Data may sometimes be aggregated. But one point of BI is also more complex association of data than simple sum : different currencies may be converted (at different rates) ; some data may cancel other data ; some data is detail of other data… and so on.

You are 100% correct … in fact you can very often mimic “business intelligence” applications with a spreadsheet that can query RDBMS the problem is these home cooked approaches do not approach the 100k$ cost of “BI solutions” from vendors with “support” and “training” for IT personnel. Even with a very ambitious “Hurdle Rate” the most modest BI projects always are good for the bottom line.

The problem as I see it is if you’re the head of IT in a midsized company what would you do: 1) cook up a few very useful spreadsheets that give your company useful BI; 2) start a big project to review vendor offerings with a travel budget, high profile inside the company, and practically guaranteed success (if you have no BI and then implement something it always feels like success).

I personally like Forrester’s definition: Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.

But that’s the problem with the term “Business Intelligence”, there are many definitions that people like. I think the problem space that BI can handle is so broad that it doesn’t really mean anything beyond “using data to answer questions.” Do the questions really have to be business related? I use Pentaho BI technology to scrape and aggregate grades from my son’s high school web site. It answers questions like what are the best case grade scenarios for this semester? Yeah it sucks to have a dad with BI tools. We use Pentaho internally for things like forum moderation notification and to generate reports of which versions of what third party libraries are checked into SVN for a specific release. Not much Intelligence there.

I think we need articles like this one to give people an intuitive idea of what kinds of problems BI can help solve. After reading enough of these use cases, people figure out where it can help their situation. As far as having a precise, non self referencing definition that doesn’t compare BI to something else, I think it would be too general to be useful.

The next logical question is “why use the term BI?” So companies and products can be identified with a well known market and people looking for that market can find it.

Interesting – I didn’t realise there was a separate problem/solution space for – you have disparate sources of data and want to try and reconcile them coherently.

We’ve been sniffing around at open source ERP systems to support Collabora, and they’re all either a) designed for warehouses/manufacturing/etc and with consultancy as a trivial afterthought, b) utterly terrible, c) both (the most common). None of them have made me want to trust them with our most crucial business processes like invoicing and tracking finances.

We’re getting pretty disheartened and close to giving up on ERP for the moment, even though our business requirements in terms of gathering data and generating invoices are actually quite simple they seem really poorly catered to.

The alternative for us seems to be throwing a bit more money at our timesheet tool so it gathered the additional expenses/vacation/etc information we needed into one place, with appropriate approval steps, and then gluing spreadsheets on top of RDBMS models to give live analysis. Which seems to be what comment #2 is about.

I usually think of data warehouses as highly denormalized databases optimized for read-only queries.

One of the things that crops up when you go to design such a database is that a lot of things we normally model as static actually turn out to be things that change very slowly – things like postal codes, company names, personal names, addresses, and even calendars change over a period of years.

For instance, it’s common in a data warehouse to need to know whether a given date was a holiday or not, because that affects customer behaviour as well as tax due dates and the maturation dates for different kinds of financial instruments. However a government can, at any time, for any reason, declare a new holiday or retire an old one. It happens very infrequently (in most jurisdictions) but it happens. So a data warehouse will typically have a table just for dates, mapping from the Gregorian calendar to any other calendars (such as the Arabic, Chinese, or Julian), along with columns for the day of week, what month and financial quarter it is part of, tax period in various jusrisdictions, whether it is a work day in different jurisdictions, whether it is a holiday in different jursidictions, what week it is considered part of (depends on cultural norms in some parts of the world).

Once you need this level of detail and try and figure out how to write useful queries that account for all of these variables as well as the performance impact of said queries, you begin to understand why there’s value is having a separate system.

As a different issue, exchange rates are a good example of something where for a normal database you just capture the current value at transaction time, but for a data warehouse you need access to historical data so you can correlate data across time periods.

You can look for info on star schemas to get an idea of the kind of data modelling we used to do when SQL databases sat at the heart of most data warehouses; nowadays things like RDF (which gives you a beautiful, extensible data model) and NoSQL (which is typically all about denormalization to achieve its goals) are opening up new kinds of models and analysis.