Month: January 2017

Over the last couple of months we have been implementing AMP on many of our sites. I’m not going to discuss the pros and cons of AMP, nor the wider trend of disintermediation of content away from publishers’ own sites and across various platforms and ecosystems. Instead this is about a small (but potentially much bigger) practical pitfall we have noticed in how Google Analytics handles this traffic.

The story begins when our main anomaly detection system (Anodot) recently started picking up spikes and a general increase in referral traffic from “cdn.ampproject.org”.

As we looked into this we could see the following example source, medium and referral path information in Google Analytics:

So it seems the mysterious new referrer “cdn.ampproject.org” is actually just the result of users clicking through to our sites from content served via Google’s AMP cache.

In the examples above we can see that in the first case it’s not really a referral as such, just the result of someone clicking through from one of our own AMP pages.

The other two examples are indeed referrals, but the traffic source of “cdn.ampproject.org” obscures the fact that these are referrals from specific third parties (refinery29.com and gizmodo.com).

So the fact that this traffic really came from different places, and that “cdn.ampproject.org” is not a proper referrer in the traditional sense, is now hidden in the referral path in Google Analytics. All the out-of-the-box reports and dashboards that typically revolve around the traffic source field in GA will miss this new complication.

This is not really a bug in Google Analytics, it’s more of a potential unintended consequence of the way AMP works.

In a world where all publishers use AMP, you can imagine how bad this could get, with “cdn.ampproject.org” becoming one of your main sources in GA and hiding the much more complicated sea of actual third parties who linked to your content beneath it.

Potential workarounds

We will probably create a new field in our data mart that reads the true source from the referral path and so overwrites “cdn.ampproject.org” with the ‘proper’ source. This would not, however, fix things on the front end of GA for business users; it would only fix our internal downstream reporting that builds on the raw backend data we have as a 360 customer.
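A minimal sketch of that mapping in Python (our actual pipeline lives in the data mart, so this is purely illustrative), assuming the AMP cache’s path convention of `/c/` for content (or `/v/` for the viewer) with an optional `s/` marker for https origins:

```python
import re

def true_amp_source(source, referral_path):
    """If the source is the AMP cache, recover the real domain from the
    referral path, e.g. '/c/s/www.refinery29.com/...' -> 'www.refinery29.com'.
    Otherwise return the source unchanged."""
    if source != "cdn.ampproject.org":
        return source
    # AMP cache paths look like /c/<domain>/... or /c/s/<domain>/...
    # ('s' marks an https origin); '/v/' is used by the AMP viewer.
    m = re.match(r"^/(?:c|v)/(?:s/)?([^/]+)", referral_path)
    return m.group(1) if m else source

print(true_amp_source("cdn.ampproject.org", "/c/s/www.refinery29.com/some-article/amp/"))
```

The function names and the exact path pattern are assumptions for illustration; in practice you would validate the pattern against your own referral path data first.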

Another option that would surface a fix on the front end is to use a custom dimension to house this cleaned version of the traffic source. The downside here, however, is that you would then need to create custom reports anywhere you wanted to use the cleaned traffic source.

There may be other workarounds I’ve missed here – if you think of any please add them in the comments.

We have reached out to Google to point this out to them. As it’s not a bug, I’m not sure if or how they might deal with it. It’s a tricky one for sure, as implementing some sort of override for AMP traffic might be a little too ad hoc, and there may very well be other use cases where the current behavior is exactly what someone wants. As a publisher, though, I can’t really see any from our point of view.

Anyway, if you use GA to understand your web traffic, go check it out for yourself to see if you see the same thing. Feel free to share your story in the comments below as we are keen to hear from others affected by this.

Overview

As a data scientist you often come across beliefs, views and opinions in your organisation about how things are. You also usually want to figure out ways to find evidence in the data to either back them up or add a more nuanced interpretation if one might be useful.

I’m always very hesitant to ignore or disregard such views, because the people behind them typically know a lot more than I do about the business and how it works – it’s just often locked away in a gut feeling or an implicit understanding built through deep domain-specific experience. So I love it when I get the chance to find some data to help illustrate, and sometimes expand on, such views and beliefs. This is a little story about one such recent example…

Content lifecycle

In online media we are pretty aware of the ‘lifecycle’ of a piece of content. Most articles tend to get the majority of the pageviews they will ever receive in the couple of weeks after they are published. However we sometimes also see content that gets picked up again and again over a longer time frame and so has a much longer lifecycle.

We wanted to understand these dynamics a bit more and also see if the type of content itself had any bearing on its typical lifecycle.

Below is a stylized picture of the way we will frame this problem before looking at some data (there are of course any number of other ways to approach this, which is one of the great things about doing data science and probably one of the things that will be harder to automate once AI inevitably puts us out of a job too).

Data preparation

As always, deciding what data to use, how to represent it, how to transform it and so on is key to giving the analysis the best chance of finding anything interesting.

To be concrete, the data after pre-processing looks something like the below table. Each % represents the share of lifetime pageviews a piece of content has received up to that week.

So in the example above, post A has already received 70% of the pageviews it ever will within the first week of its publication. Post B, on the other hand, did not reach 70% of its lifetime pageviews until week 6, so it seems to have grown and picked up momentum while post A had fizzled out by week 4. This is not to say post B performed better than post A or vice versa; we are more interested in understanding the different dynamics in the way content is consumed over time.

A few things worth noting:

We looked at posts published in the window of -180 days to -60 days. The idea here is to go far enough back to get more data while making sure each post has been ‘alive’ for at least 60 days. Some posts will have been alive for longer than others, but this is fine for our purposes.

We only look at weeks 0–12 in this analysis. We had looked at longer time frames, but the noise-to-signal ratio in the data increases the further out you go, and adding those additional dimensions did more harm than good to the quality of the clustering. (It’s often best to cluster with as few dimensions as you can reasonably get away with; otherwise your distance measure can become increasingly meaningless, interpretation can get very complicated, and the pretty cool sounding ‘curse of dimensionality’ can kick in – ooohh scary…)

We decided to use a cumulative representation of the data as opposed to the actual share of pageviews that landed each week. The non-cumulative approach made the data noisier and resulted in much messier clustering. The intuition here is that many of the patterns are one- or two-week offsets of each other, and without a cumulative representation these offset but similar-‘looking’ patterns can have very different distance measures. For example, if 50% of the pageviews came in week 4 for one post and week 5 for another, then in a non-cumulative representation the distance measure in the clustering would judge these two posts to be very different. Taking a cumulative % per week smooths this out and gives the clustering a bit more flexibility. Another way to think of this is that the cumulative approach builds in the desired correlation among the variables we want to explore.

We used %’s instead of raw pageviews, because it was the trend or behavior we were more interested in. It could be worthwhile to use the raw pageview counts themselves, but this would in effect be trying to extract both the trends and the different typical levels of traffic in the data. Using %’s is a natural way to normalize the data when primarily concerned with trends.

We only looked at posts that had more than 100 pageviews to disregard any obvious rubbish or dirty data.
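The preprocessing described above was done in R (bigrquery plus tidyr), but the same transformation can be sketched with pandas – toy numbers here, not the real data:

```python
import pandas as pd

# Toy weekly pageview counts in long format: (post, week, pageviews).
raw = pd.DataFrame({
    "post": ["A"] * 4 + ["B"] * 4,
    "week": [0, 1, 2, 3] * 2,
    "pageviews": [700, 200, 50, 50, 100, 100, 100, 700],
})

# Pivot to wide format: one row per post, one column per week.
wide = raw.pivot(index="post", columns="week", values="pageviews").fillna(0)

# Drop low-traffic posts (the analysis used a 100-pageview floor).
wide = wide[wide.sum(axis=1) > 100]

# Cumulative share of lifetime pageviews reached by each week.
cum_share = wide.cumsum(axis=1).div(wide.sum(axis=1), axis=0)

print(cum_share.round(2))
```

Post A here reaches 70% of its lifetime pageviews in week 0 and fizzles out; post B climbs slowly and only finishes in week 3 – the two stylized shapes the analysis is trying to separate.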

So with all the data prep done, some decisions and assumptions made, and some other implicit assumptions made that we might not even have realized we were making, we can look at some of the raw data (note: for everything below we used hollywoodlife.com data).

Each line represents a different piece of content. It looks a bit jumpy and jagged, and we see some lines with different slopes and angles, so it does look like there are at least some differences between these lines.

If we keep going and plot all the data:

Okay great – I can’t see anything and it’s a mess (maybe the only thing we can see is that there appears to be a lot of variation in lifecycle paths – this will be important later).

Anyway, this mess is what I was hoping to show, and it is exactly why it might be useful to use clustering to try to get some sense out of it.

Clustering

From here on in it’s a pretty straightforward application of any standard clustering approach to the data.

As the data was not that big (about 12,000 posts × 13 weeks), I used R on my laptop and the pam function from the cluster package (shout-outs also to bigrquery for getting the data, tidyr for going from long to wide format, and of course ggplot2 for the, hopefully, pretty pictures you see).

Side note: I’ve recently been building bigger models with Google Compute Engine (8 cores, 52GB), RStudio Server, and h2o as the engine to build the models. I’m finding the h2o R package really cool and easy to use – it even cross-validates most hyperparameters for you!

I ended up picking k=3 after a bit of messing around and trial and error. There are much better and more rigorous ways to do this, but the good thing about the time series nature of this data is that it’s easy enough to visualize the results and understand if they make sense or not.
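The actual clustering used pam (k-medoids) in R; as a rough stand-in, here is the same idea in Python with k-means on synthetic cumulative curves. The curves are made up (two obvious groups, rather than the k=3 used on the real data) just to show the mechanics:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic cumulative lifecycle curves over weeks 0-12:
# 'fast' posts saturate early, 'slow' posts climb roughly linearly.
weeks = np.arange(13)
fast = np.clip((1 - 0.3 ** (weeks + 1))[None, :] + rng.normal(0, 0.02, (100, 13)), 0, 1)
slow = np.clip((weeks / 12)[None, :] + rng.normal(0, 0.05, (100, 13)), 0, 1)
X = np.vstack([fast, slow])

# k-means here is a stand-in for pam; each row is one post's curve.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))
```

With real data you would, as in the post, judge the choice of k mostly by plotting the resulting cluster curves and checking they make sense.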

So if we overlay the results of the clustering on our original crazy messy plot we see:

Still pretty messy, but we can see the three clusters trying to cover different parts of the data, although clusters 2 and 3 have a lot of overlap and so could probably be merged.

For the purposes of what we are doing, cluster 1, which covers 23% of the posts, is the most interesting. This seems to be the subset of content that tends to have much longer-lived lifecycles while also being subject to lots of variability in exactly how those lifecycles play out.

To better summarize the clusters we can take their means and medians (as well as various percentile ranges, to keep an eye on the variability within each cluster – is it so variable as to be meaningless?) and plot them as below (in this case the plots for mean and median were very similar, so we just show the median).

Here we can see pretty clearly that cluster 2 (green) and 3 (blue) are very similar. The shaded regions on the plot represent the 25th and 75th percentiles – so this region is generally where the middle 50% of the data sits.

As with any clustering analysis you have to come up with snazzy names for each cluster. The best I could come up with was ‘Long Lived’ for cluster 1. I kinda gave up there.

We can see the shaded area is much wider for the ‘Long Lived’ red cluster which hints at the larger variation within this cluster that we noted earlier.

So, summing up so far, we have found evidence that there do indeed seem to be two distinct types of content lifecycle in this data – one that burns up quickly after publish and another that represents more of a slow burn. The next question is whether the type of the content itself has anything to do with this.

The underlying prior belief was that galleries tend to be more long lived than articles. We were also wondering if articles with video content behaved more like galleries or not in terms of lifecycle dynamics.

To gauge the potential impact of content type on cluster membership, we looked at the cluster distributions within each content type.
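This kind of breakdown is just a normalized crosstab of content type against cluster label. As a sketch with entirely made-up counts (not the real hollywoodlife.com numbers):

```python
import pandas as pd

# Hypothetical post-level table: each row is one post with its
# content type and the cluster it was assigned to.
posts = pd.DataFrame({
    "content_type": ["gallery"] * 4 + ["article"] * 4 + ["article_video"] * 4,
    "cluster": [1, 1, 2, 3, 2, 3, 3, 2, 2, 3, 3, 2],
})

# Share of each content type falling into each cluster (rows sum to 1).
dist = pd.crosstab(posts["content_type"], posts["cluster"], normalize="index")
print(dist.round(2))
```

In this toy table, half the galleries land in cluster 1 and no articles do – a cartoon of the real pattern described below.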

Above we can see that gallery content tends much more than the others to be in our ‘Long Lived’ cluster 1 (about 55% of galleries were in this cluster).

We can also see that articles with videos and articles without videos look pretty similar.

Another look at the data…

To validate this initial finding – that it is indeed galleries that are much more likely to be long lived, while articles with and without video content behave similarly – we took another look at the raw data.

In particular, if we take a look at the distributions of % of total lifetime pageviews reached by week 1 we see again more evidence to back up our findings.

We can see a high peak to the far right for articles (the red and green lines) that suggests most of them have already received around 80% or more of their lifetime pageviews by week 1.

We now also begin to see a more nuanced view for galleries – a large share of them are also at 75% or more by week 1, but there is a much flatter, more uniform, fat tail to the left, which indicates a much higher probability that a gallery’s lifetime pageviews are still below 75% by week 1.

If we look at this picture again but jump forward to week 4, we see galleries starting to follow a similar shape to articles, but still with a very long tail where some have still not even reached 25% of their lifetime pageviews.

Some nuance

The above distribution plots along with the clustering analysis suggest that galleries do indeed have a higher likelihood of being ‘longer lived’ but the majority of galleries still have received most of their lifetime pageviews by week 4. This sounds like a contradiction but it’s not.

So a more nuanced interpretation is that although some galleries end up with a longer lifecycle, this does not mean all galleries are still active beyond 4 weeks – in fact it’s the opposite.

One way to bring this out more clearly is to do the same clustering exercise as above, but first filter the data down to just galleries.

Here we see clearly that 69% (53% + 16%) of galleries follow the more typical lifecycle of getting most of their pageviews in the first couple of weeks, but that about 32% of them (cluster 1, in red here) follow a much steadier lifecycle, receiving an almost constant flow of pageviews per week.

As this is hollywoodlife.com, it could be that these are mega galleries relating to key celebs who continually appear in the site’s content and so get continually linked to (e.g. “Check out more pics of Kimye!!!”).

This raises the question of understanding and interpreting why we see these patterns and what their potential drivers are.

Further work

A logical next step would be to use the cluster labels as the target of a classification problem, where we see which features are predictive of cluster membership (example features could relate to content topic, how the content is promoted, gallery-specific features, maybe even some semantic understanding of what’s in the pictures themselves using tools like the Google Vision API, and pretty much any other features we can dream up and try to quantify).
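As a sketch of that next step – with entirely hypothetical features and a synthetic target standing in for the real cluster labels – the shape of the problem would be something like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical post-level features: is the post a gallery, and how
# many images it contains. In reality these would come from the CMS.
n = 500
is_gallery = rng.integers(0, 2, n)
n_images = rng.poisson(5 + 20 * is_gallery)
X = np.column_stack([is_gallery, n_images])

# Synthetic target: membership of the 'Long Lived' cluster, made
# loosely dependent on being a gallery to mimic the finding above.
y = (rng.random(n) < 0.2 + 0.35 * is_gallery).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(dict(zip(["is_gallery", "n_images"], clf.feature_importances_.round(2))))
```

The point would be to inspect feature importances (or a more careful interpretability method) to see which properties of a post predict a long-lived lifecycle.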

This could help generate insights to feed back to editorial and the wider business. Examples could relate to how we handle content once it’s created, or even actions that could be taken at creation time to improve the chances of the content becoming long lived and ultimately attracting more eyeballs. Being able to predict the most likely lifecycle of content after a week or so could also be useful in helping to decide which content to promote over others and how best to place it.

So, as is typical with this sort of work, it both raises more questions and opens additional avenues to explore. Back to work I guess.