Simo Ahava

Husband | Father | Analytics developer
simo (at) simoahava.com

The Schema Conspiracy

A schema is something that data processing platforms such as Google Analytics apply to the raw hit data coming in from the data source (usually a website). The most visible aspect of Google Analytics’ schema is how it groups, or stitches, the arbitrary, hit-level data coming in from the website into discrete sessions, which are in turn grouped under yet another aggregate bucket: users.

But you already know this. You’re looking at metrics like Sessions, Bounce Rate, Conversion Rate, and you’re using them or variations of them as KPIs in your dashboards and whatnot. Right?

Take a look at the group of metrics below:

These are some of the go-to metrics people use to assign meaning to the data stream coming in from the website. The thing about these metrics is that they are very heavily sessionized. They are entirely dependent on an arbitrary schema, which many fail to understand or to even question. Change the definition of a session even a little, and every single one of these metrics will have a different value.

And herein lies the problem I now dub the Schema Conspiracy. I know, I know, it’s a tad dramatic. But the implications are dramatic as well.

When you use Google Analytics, or any schema-applying data processing platform, you are subscribing to the schema imposed by the platform. You don’t have a say in it. In GA, you can make minor changes to the definition of a session, using tools like Session timeout and the Referral Exclusion List, but the fact remains that the schema for sessions in Google Analytics is universal, generic, and completely arbitrary: three qualities that should not exist when using data to optimize for business growth.

In Google Analytics, a session can be defined roughly as an uninterrupted browsing experience, which expires after 30 minutes of inactivity. So you enter a website, do stuff there, and 30 minutes after the last interaction the session expires. Naturally, it’s more complex than this, but as a rough description this should suffice.
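To make the rough description above concrete, here is a minimal sketch of timeout-based sessionization for a single user's hits. It is not Google Analytics' actual implementation: the `sessionize` function, the sample timestamps, and the choice to treat a gap of exactly 30 minutes as expired are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's hit timestamps into sessions by inactivity gap.

    Simplified assumption: a hit starts a new session when the gap to the
    previous hit is at least `timeout`. Real platforms apply more rules.
    """
    sessions = []
    for ts in sorted(timestamps):
        if not sessions or ts - sessions[-1][-1] >= timeout:
            sessions.append([ts])    # first hit, or the previous session expired
        else:
            sessions[-1].append(ts)  # still within the inactivity window
    return sessions


# Hypothetical hits from one user on one day.
hits = [
    datetime(2015, 1, 1, 10, 0),
    datetime(2015, 1, 1, 10, 20),
    datetime(2015, 1, 1, 10, 55),  # 35 minutes after the previous hit
    datetime(2015, 1, 1, 11, 0),
]

sessions_30 = sessionize(hits)
sessions_40 = sessionize(hits, timeout=timedelta(minutes=40))
print(len(sessions_30), len(sessions_40))
```

The same four hits form two sessions with a 30 minute timeout and one session with a 40 minute timeout, so every session-based metric shifts when the timeout does.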

Now ask yourself this: does this mirror anything that happens in the real world? Not really, right? Shouldn’t the concept of a session be grounded in something less ephemeral than a completely arbitrary sequence of hits on the website, combined with a strange, inexplicable 30 minute timeout?

You might not see the relevance of any of this, and you might be completely satisfied with Google Analytics’ concept of a session, and you are, of course, entitled to this.

But consider Conversion Rate, for example. Conversion Rate is the ratio of sessions with a conversion to all sessions. Sessions, sessions, sessions. If you’re using Conversion Rate as a KPI, you must realize you’re optimizing against a completely fictitious metric.

Think of it like this. You might need 14 sessions to convert when buying a new boat. You might need only 6 sessions to convert when buying a new computer. But in the end you’re still just one user that converted, regardless of the number of sessions it took to do so. The key here is that you had a singular intent: to buy a boat or a new computer. This intent spanned a number of sessions, highlighting the disconnect between sessions and behavior even more.
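The boat and computer example can be worked through as arithmetic. This is a sketch under stated assumptions: there are only these two users, each converts exactly once in their final session, and the variable names are made up for illustration.

```python
# Sessions each (hypothetical) user needed before converting, from the example above.
sessions_to_convert = {"boat_buyer": 14, "computer_buyer": 6}

total_sessions = sum(sessions_to_convert.values())  # 14 + 6 = 20
converting_sessions = len(sessions_to_convert)      # one converting session per user
total_users = len(sessions_to_convert)
converted_users = len(sessions_to_convert)          # both users converted

# Session-based Conversion Rate: sessions with a conversion / all sessions.
session_conversion_rate = converting_sessions / total_sessions

# User-based rate: users who converted / all users.
user_conversion_rate = converted_users / total_users

print(f"session-based: {session_conversion_rate:.0%}")
print(f"user-based: {user_conversion_rate:.0%}")
```

Under these assumptions the session-based rate is 2 / 20 = 10%, while every single user converted. The two numbers describe the same behavior; only the schema differs.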

I think this is very problematic indeed. Companies optimize against a metric that is very superficial and ephemeral, and completely unrelated to the intent of the visitor. You shouldn’t be interested in the number of sessions that converted; you should be interested in increasing the number of customers you have, by understanding intent and nurturing it into a purchase.

Now, I’m cynical enough to see the justification for this arbitrary sessionization: granularity of attribution. That’s why a change in campaign source initiates a new session, even if the session hasn’t expired yet. Your advertising channels need the attribution for successful conversions, which is why this sessionization logic has been honed to give a nice, big, fat number for your acquisition metrics.

Don’t get me wrong, I think it’s valuable to see all the channels that turned a non-converted user into a new customer. But the reality is that sessions don’t convert, users do. Attribution, too, should be balanced between the touch-points that led me to fulfil some intent I had. Following an ad starts a new session on the website, but my intent might be the same as before. The ad might have made the intent more targeted, more specific, but I’m still very much a single user on the path to conversion.

Overcoming the problem at hand

You’re pretty much out of luck if you want to apply your own schema to your Google Analytics data. Even though Google Analytics Premium boasts hit-level data through BigQuery, it’s still sessionized. The data tables stitch the hit-level data into sessions before you can access the data. This, I think, sucks big time.

(UPDATE: Check Carmen Mardiros’ comment and Pedro Avila’s comment in the comments of this article for workarounds to getting hit-level data through the API and BigQuery.)

I get why the UI shows a sessionized data set, as applying your own, complex sessionization schema would require an astounding amount of processing power. But why not provide raw data through the API?

So, there’s nothing you can do with GA’s schema. That’s just how it is. You can’t even see proper user-level data, either, since that’s sessionized as well. Consider the following Custom Segment:

It looks like it should show data for all users that have converted at some point in the past, right? Right. And wrong.

The segment above shows me a cohort of users who have converted during the selected timeframe. But that’s not what I should be interested in. I should be able to segment between converted and not-converted visitors, regardless of the timeframe!

No, a user-scoped Custom Dimension won’t help either, because if I’m looking at a timeframe before the user converted, it will show me the user as a non-converter.

Things like this drive me crazy. If I had access to raw, hit-level data, and if I could build my own stitching schema on top of that, I would be able to bend the processing and reporting aspects of GA to my will, improving the quality of data for my business alone. That’s what my dashboards should be showing! That’s what should be driving my business!
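With raw, hit-level data, the split between converted and not-converted visitors regardless of timeframe could be sketched roughly like this. The hit tuples, the user IDs, and the `split_by_conversion` function are hypothetical; this shows the idea, not anything GA exposes.

```python
from datetime import date

# Hypothetical raw hits: (user_id, hit_date, is_conversion).
hits = [
    ("u1", date(2015, 1, 5), False),
    ("u1", date(2015, 2, 10), True),   # u1 converts in February
    ("u2", date(2015, 1, 7), False),
    ("u2", date(2015, 1, 20), False),  # u2 never converts
]


def split_by_conversion(hits, start, end):
    """Split users active in [start, end] by whether they have EVER converted."""
    # The converter flag is computed over all hits, ignoring the reporting window.
    converters = {user for user, _, converted in hits if converted}
    # Only activity is restricted to the reporting window.
    active = {user for user, day, _ in hits if start <= day <= end}
    return active & converters, active - converters


converted, not_converted = split_by_conversion(
    hits, date(2015, 1, 1), date(2015, 1, 31)
)
print(converted, not_converted)
```

Looking only at January, u1 still lands in the converted group even though the conversion happened in February, which is the behavior the segment and the user-scoped Custom Dimension cannot deliver.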

Final thoughts

So what is a perfect schema? There’s no such thing. Just as each business is different, each schema should be different as well.

Optimally, the schema should be a living thing, constantly in flux, because your visitors are living things, constantly in flux. An intelligent schema would mirror this, perhaps even learning autonomously along the way.

Optimally, the schema wouldn’t be satisfied with just your website data. Your visitors are multi-dimensional, so the schema should be multi-dimensional as well.

Optimally, the schema would let you optimize against metrics that are relevant for your business, and for your business alone. Your visitors are your business, so the schema should be optimized against visitors as well.

Finally, the schema used by Google Analytics is perfectly fine. Just don’t interpret it as something it’s not. Google Analytics’ sessionization does not reflect the real world, the Conversion Rate metric should not be an indicator of the state of your business, and completely sessionized metrics like Bounce Rate, Session Duration, etc. should never be used as KPIs alone.

Using a single, sessionized, flawed metric as a KPI is like only telling the punchline of a joke.

There are tools out there that bridge the gap between Business Intelligence and web analytics. They let you build the interpretations for your raw data in any way you choose. It does require effort, however. A custom schema requires that you understand your audience behavior on a completely new level.

Do you agree with me? Or do you think I’m making mountains out of molehills? I’m not advocating for an upheaval of how these tools work, but I am campaigning very strongly for critical thinking.

So, the next time you use Google Analytics’ Conversion Rate metric for anything, just pause for a second and think about what this metric means for your business. Try to come up with a sentence like: “The uplift we’re seeing in Conversion Rate means our business is…” and then finish with what the change in Conversion Rate means for your business.