January 2, 2010

I was reading the article “The Decade of Data: Seven Trends to Watch in 2010” this morning and found it a fitting retrospective and perspective piece. I have been working in data analytics for the past 15 years, so naturally I went searching for similar articles with more of a focus on analytics, but came back empty-handed 😦

I wish I could write a similar post, but I feel the task is too big to take on. A systematic review with a vision of the future would require much more dedication and effort than I can afford at this point. However, I do have a couple of thoughts and went ahead to gather some evidence to share. I’d love to hear your thoughts; please comment and provide your perspectives.

The above chart shows search volume indices for several data analytics related keywords over the last six years. There are many interesting patterns. The one that caught my eye first is the birth of Google Analytics: Nov 14, 2005. Not only did it cause a huge spike in the search trend for “analytics” (the first day “analytics” surpassed “regression”), it also became the driving force behind the growth of web analytics and the analytics discipline in general. Today, more than half of all “analytics” searches are associated with “Google Analytics”. Anyone who writes the history of data analytics will have to study the impact of GA seriously.

I wish I could do a chart on the impact of SAS and SPSS on data analytics in a similar fashion, but unfortunately it is hard to isolate SAS searches for statistics software vs other “SAS” searches. When limited to the “software” category, it seems that SAS has about twice the volume of SPSS, so I used SPSS instead.

Many years ago, before Google Analytics and the “web analyst” generation, statistical analysis and modeling dominated the business applications of data analytics. Statisticians and their predictive modeling practice were sitting in their ivory tower. In the early years of the 21st century, data mining and machine learning became a strong competing discipline to statistics – I remember the many heated debates between statisticians and computer scientists about statistical modeling vs data mining. New jargon came about, such as decision tree, neural network, association rule and sequence mining. To whoever had the newest, smartest, most mathematically sophisticated, efficient and powerful algorithm went the spoils.

Google Analytics changed everything. Along with data democratization came the democratization of data intelligence. Who would’ve guessed that today, for a large crowd of (web) analysts, analytics would become near-synonymous with Google Analytics, and that building dashboards, tracking and reporting the right metrics would become the holy grail of analytics? Those statisticians may still inhabit the ivory tower of data analytics, but the world is already owned by others – the people – as democracy would dictate.

No question about it, data analytics is trending up and flourishing as never before.

Well, what could be wrong with the question? A standard setup for a conversion model is to use conversion as the dependent variable, with banner and search as predictors. The problem here is that we only have converter cases and no non-converter cases. We simply can’t fit a model at all. We need more data, such as the number of users who clicked on a banner but did not convert.
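To make the point concrete, here is a minimal sketch with made-up users (the data, the channel names and the function name are all hypothetical, not taken from any real campaign). It computes the conversion rate among users exposed to each channel. With converters only, every rate comes out as 1.0, so nothing distinguishes banner from search; once non-converters are included, the rates start to differ.

```python
from collections import defaultdict

def conversion_rate_by_channel(users):
    """users: list of (set_of_channels_seen, converted) pairs (hypothetical format)."""
    seen = defaultdict(int)
    converted = defaultdict(int)
    for channels, did_convert in users:
        for channel in channels:
            seen[channel] += 1
            if did_convert:
                converted[channel] += 1
    return {channel: converted[channel] / seen[channel] for channel in seen}

# Made-up data: what we have when only converters are logged.
converters_only = [
    ({"banner", "search"}, True),
    ({"search"}, True),
    ({"banner"}, True),
]

# Made-up data: the same converters plus users who were exposed but did not convert.
with_non_converters = converters_only + [
    ({"banner"}, False),
    ({"banner"}, False),
    ({"banner", "search"}, False),
]

rates_biased = conversion_rate_by_channel(converters_only)
rates_full = conversion_rate_by_channel(with_non_converters)
print(rates_biased)  # every channel shows a rate of 1.0
print(rates_full)    # rates now vary by channel
```

A dependent variable that never varies gives a model nothing to learn from, which is why the non-converter cases are required before any modeling can start.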

The sampling bias issue actually runs deeper than this. We want to know whether the coverage of banner and search is “biased” in the data we are using; an example is when banners ran nationally while search ran regionally. We also need to ask whether future campaigns will be run in ways similar to what happened before – the requirement that the modeling setup mimic the context in which the model will be applied.

2) Encoding sequential patterns

The data for micro attribution is naturally in the form of collection of events/transactions:

User1: banner_id time1.1 search_id time1.2 search_id time1.3 conversion time1.4
User2: banner_id time2.1
User3: search_id time3.1 conversion time3.2

Some may think that this form of data makes predictive modeling infeasible. This is not the case. Much predictive modeling is done with transaction/event data: fraud detection and survival models, to name a couple. In fact, there are sophisticated practices in mining and modeling sequential patterns that go well beyond what is considered in common discussions of the attribution problem. The simple message is: this is an area that is well researched and well practiced, and a great deal of knowledge and expertise already exists.
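As a rough illustration of how such an event log can be turned into a modeling table, here is a small sketch. The feature names, the `encode_user` helper and the numeric timestamps are assumptions made for the example (the timestamps stand in for the time1.1, time1.2 placeholders above); real sequence-mining methods go far beyond these simple counts.

```python
def encode_user(events):
    """Turn one user's (event_type, timestamp) list into a flat feature row.

    Hypothetical encoding: counts of each touch type plus first and last
    touch, using only touches that happened before the first conversion.
    """
    ordered = sorted(events, key=lambda event: event[1])
    converted = any(kind == "conversion" for kind, _ in ordered)
    touches = []
    for kind, _ in ordered:
        if kind == "conversion":
            break  # touches after the conversion are not predictors
        touches.append(kind)
    return {
        "n_banner": touches.count("banner"),
        "n_search": touches.count("search"),
        "first_touch": touches[0] if touches else None,
        "last_touch": touches[-1] if touches else None,
        "converted": int(converted),
    }

# The three example users, with made-up numeric timestamps.
log = {
    "User1": [("banner", 1.1), ("search", 1.2), ("search", 1.3), ("conversion", 1.4)],
    "User2": [("banner", 2.1)],
    "User3": [("search", 3.1), ("conversion", 3.2)],
}

table = {user: encode_user(events) for user, events in log.items()}
for user, row in table.items():
    print(user, row)
```

Each user becomes one row with a dependent variable (`converted`) and a handful of predictors, which is the shape a standard predictive model expects.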

3) Separating model development from implementation processes

Again, common sense from the predictive modeling world can shed some light on how our web analytics industry should approach the attribution problem. All WA vendors are trying to figure out this crucial question: how should we provide data/tool services to help clients solve their attribution problem? Should we provide data, should we provide attribution rules, or should we provide flexible tools so that clients can specify their own attribution rules?

The modeling perspective says that there is no generic conversion model that is right for all clients, very much as in Direct Marketing, where we all know there is no one right response model for all clients – even for clients in the same industry. Discover Card will have a different response model than American Express, partly because of the differences in their targeted populations and their services, and partly because of the availability of data. Web Analytics vendors should provide data sufficient for clients to build their own conversion models, rather than building ONE standard model for all clients (of course, they can provide separate modeling services, which is a different story). Web Analytics vendors should also provide tools so that a client’s model can be specified/implemented once it has been developed. Given the parametric nature of conversion models, none of the tools from current major Web Analytics vendors seem sufficient for this task.
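To illustrate what “parametric” means here, consider a sketch in which the model is a logistic function whose coefficients are developed separately for each client and then handed to a scoring tool. The logistic form, the feature names and every number below are assumptions for illustration only; they are not estimates from any real client.

```python
import math

def conversion_probability(features, coefficients, intercept):
    """Score one user with a logistic model (an assumed model form)."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: two clients, same model form, different coefficients.
client_a = {"intercept": -2.0, "coefficients": {"n_banner": 0.1, "n_search": 0.8}}
client_b = {"intercept": -2.0, "coefficients": {"n_banner": 0.6, "n_search": 0.2}}

# One user with one banner touch and two search touches.
user = {"n_banner": 1, "n_search": 2}

p_a = conversion_probability(user, client_a["coefficients"], client_a["intercept"])
p_b = conversion_probability(user, client_b["coefficients"], client_b["intercept"])
print(p_a, p_b)
```

The same user gets a different score under each client’s parameters, so a tool that only supports fixed attribution rules cannot implement a model like this; it needs to accept the parameters as input.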

That is all for today. Please come back to read the next post: conversion model – not what you want but what you need.

Is there anyone out there as frustrated as me with the many different terms and concepts around “attribution”? For those who haven’t thought about this yet, here’s a sample of the terms related to the discussion:

In this and a few follow-up posts I will discuss a few topics that I hope will bring some clarity to this.

Let me be upfront with my main point: the attribution problem is a data analytics problem. I know that few people would argue with me on this, but I think few people have taken all its implications seriously.

Since it is fundamentally a data analytics problem, we should start with data. What is the underlying business question requiring an attribution analytical solution? What kinds of data do we have, or need to have, to answer the attribution question? How do we translate the business questions into data analytics questions that match the type of data we have? What questions are not answerable given the limitations of the data or the available analytics tools? How rigorous is the proposed data analytics strategy: a heuristic, a rule of thumb, a well-specified model? Are we over- or under-using the data? Are we over-designing the analytics and making it more complex than necessary? What are all the limitations and disclaimers associated with an approach?

My sense is that we have not yet taken a serious look at the attribution problem from a data analytics perspective. We know the business problems, but most of us are not experts in data analytics methodology.

Eric Peterson has been busy with webinars and presentations, all on the attribution problem. He uses a new metric as a foundation for attribution analysis: “Appropriate Attribution Ratio”.

We all know that all the major web analytics vendors are working hard trying to figure out the right attribution modeling tool to offer. My recent meeting with a major web analytics vendor convinced me of this as well: it is all about attribution data services and attribution modeling.

Attribution is also the most talked about topic in SEM/SEO today. I can’t think of a search conference that does not have sessions focused on attribution, and every SEM tool maker is touting its solution for attribution management.

“the common limitation of analytics is that it lacks the data insights of offline behavior …”

The misuse of the term “Web Analytics” is corrupting the communication – it may make sense to some people in their own little corner, but only within that corner.

Web Analytics does not refer to a set of software/tools (which includes Google Analytics); it refers to a subdiscipline of data analytics. Please, stop labeling something “Best Web Analytics” when you really mean just a comparison of web analytics tools.