With all the fanfare and triumph both deep learning and artificial intelligence get these days, one aspect that often gets overlooked in popular accounts is the central role embeddings play. This is probably because they can be a little abstract and hard to explain to someone encountering them for the first time. Regardless, embeddings help turn a sea of unstructured text into the raw numbers that fuel deep learning and many other approaches to machine learning and AI.

What’s an embedding, you say? Well, I’m glad you asked. In this post we will dip our toes into the world of embeddings with a somewhat silly hands-on example and illustration using about 100,000 articles from one of our celebrity entertainment sites, hollywoodlife.com.

I’ll give you a warning up front – there will be some talk of numbers, dimensions, arrays, and vectors, so it might get a little dry. But to keep you interested, I promise that by the end we might have answers to some very important questions like:

How can we measure the Justin Bieberness of someone?

What is Kim Kardashian – Kanye West + Brad Pitt?

Can we isolate and identify a Kardashian gene in our data?

Brief Background

I am by no means an expert in any of this, so I’ll be brief here and try to link to some good resources as we go rather than attempt long-winded explanations. All the code behind this post can be found here on GitHub and the IPython notebook fully rendered here on nbviewer (the plotly charts render that way too).

There are many different ways to represent text as numbers for machine learning tasks that try to extract information (in various ways) from things like blobs of raw text.

My TL;DR version is this. Let’s say we have three tweets:

“I love Justin Bieber”.

“I adore Justin Bieber”.

“Omg, I love love love Justin Bieber!”.

If we wanted to do any sort of knowledge extraction from this bunch of tweets we’d need to settle on a method to represent them in some way to the computer. Ideally this representation might be something we can then do math on and use in various machine learning techniques.

Bag of words art installation at CMU.

Perhaps the most intuitive approach here is the bag of words (BOW) approach, which basically represents each sentence as a vector of counts of the number of times each word appears. So:

Original Text                         |  I | love | justin | bieber | adore | omg | BOW Vector Representation
I love Justin Bieber.                 |  1 |  1   |   1    |   1    |   0   |  0  | [1,1,1,1,0,0]
I adore Justin Bieber.                |  1 |  0   |   1    |   1    |   1   |  0  | [1,0,1,1,1,0]
Omg, I love love love Justin Bieber! |  1 |  3   |   1    |   1    |   0   |  1  | [1,3,1,1,0,1]

So this is one way we could represent the tweets as numbers, to try to figure out how much someone loves Justin Bieber.

In this case we might flag both “love” and “adore” as positive words, so maybe we could measure the amount of love for Bieber in each tweet as 1, 1, and 3.

Basically, each tweet shows love for Bieber, but maybe the third one is stronger in some sense given the repetition of “love” and the presence of an emotive “omg”. These are the sorts of ‘rules’ a computer could then learn if given enough example data.

This approach actually turns out to be surprisingly successful in practice, especially when you consider that it does not really try to understand each word in any way – it’s just counts.

So, for example, if we had not flagged “adore” as a positive word then we would have scored tweet 2 as having no love for Bieber. The bag of words approach makes no attempt to capture in the representation the fact that “love” and “adore” are fairly similar in terms of semantic meaning.
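To make this concrete, here is a minimal plain-Python sketch of turning those three tweets into bag of words count vectors (note the column order comes out alphabetical here, rather than the order used in the table above):

```python
import re
from collections import Counter

tweets = [
    "I love Justin Bieber.",
    "I adore Justin Bieber.",
    "Omg, I love love love Justin Bieber!",
]

def tokenize(text):
    # lowercase and keep only runs of letters
    return re.findall(r"[a-z]+", text.lower())

# the shared vocabulary across all tweets, in a fixed (alphabetical) order
vocab = sorted({word for tweet in tweets for word in tokenize(tweet)})

# each tweet becomes a vector of counts over that vocabulary
bow = [[Counter(tokenize(tweet))[word] for word in vocab] for tweet in tweets]

print(vocab)  # ['adore', 'bieber', 'i', 'justin', 'love', 'omg']
for vector in bow:
    print(vector)
```

In practice you would use something like scikit-learn’s CountVectorizer, but the idea is exactly this simple.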

Word Vectors

Taken from https://www.tensorflow.org/tutorials/word2vec

What if, instead, we represented each word as 100 random numbers (btw – the choice of 100 is an arbitrary parameter here), and then figured out a way to find the best value for each of those 100 numbers, such that similar words end up having a similar set of 100 numbers and so end up ‘near’ each other in this new 100-dimensional space we are now in?

If we could do this then we would have a representation that easily captures the fact that “love” and “adore” are similar, as their vector representations would have similar sets of numbers and so a high correlation in this 100-D space.

The challenge is how to tune and pick the specific set of numbers for each word. This is where the word2vec approach comes in. If we start with a random set of numbers as the representation for each word, we can go through our text line by line (adjusting our numbers as we go) and see how good we can get at predicting neighbouring words given the vector representation of a middle target word.

The intuition behind this is the famous(ish) saying (I like to put on a traditional British accent when reading this):

“You shall know a word by the company it keeps” – John Rupert Firth (famous English linguist)

The idea being that words that are in some meaningful way related or informative of each other will tend to co-occur nearby in natural raw text.

So to be concrete here. If we see a sentence like:

“Canadian heart-throb Justin Bieber is at it again, the popstar was recently filmed blah blah…”

We can see from this that by looking at the words that tend to occur around Justin Bieber we can actually learn a lot (in a particularly specific and narrow sense perhaps) about Justin Bieber.

The word2vec approach leverages this (and a lot of data) to essentially learn a really good combination of those 100 numbers, where ‘good’ is defined in terms of the ability to use that vector of numbers (as our representation of a specific word) to predict neighbouring words (or vice versa).

I have simplified and glossed over a lot here, so here are some links to resources that do a much better and more detailed job of explaining.

First we do some light string cleaning: setting everything to lowercase, removing various funny characters, artifacts, and things like that.

(Note: In the notebook some of this is very ugly and even specific to phrases and terms commonly used on hollywoodlife.com – but is a good real world example of raw text data you might tend to come across in the wild).

Next we do some phrase creation (here). For our use case we are mainly interested in names of people, so we need to find a way to treat “Justin Bieber” as one word. Otherwise the two separate words “Justin” and “Bieber” could end up looking quite different from each other, making it much harder for the model to capture the essence of “Justin Bieber”.

Running phrase detection using gensim does a pretty good job of creating new words such as “justin_bieber” and “selena_gomez”. The idea here (oversimplifying as usual) is to basically look at all the documents and if you see pairs of co-occurring words very often (above some threshold you can pick) then join them together into one token (fancy name for word) such as “justin_bieber”.

Finally we do some specific gensim related preprocessing to get it into the format required to build a model on (here).

(Note: I found that actually passing the whole document as a ‘sentence’ to gensim, along with a wider window, gave much better results than training on random sentences pulled from the corpus. However this can be task and data specific so is something worth playing around with yourself).

Let’s Play

Visualising Vectors

So once we have our model built we can begin to look at the word vectors it has created and see if they make sense and what sort of knowledge, if any, we can extract from all this (here).

To be really concrete, here is Justin Bieber’s word vector as trained on the hollywoodlife.com corpus:

Wow – amazing no?

Ok, yeah it’s just 100 random looking numbers.

But that’s exactly the point – this particular set of numbers is now a representation of the word “justin_bieber” that captures different aspects of its meaning based on the types of words it tends to occur with.

So we should be able to use it in various ways that might be more powerful than the bag of words approach we looked at earlier.

Anyway, to help show that there is really nothing particularly scary or fancy about the output word vectors we get, here is a Tableau Public workbook where you can play around with these vectors for yourself.

Here are some interesting things I came across when playing with this.

If we are careful about a specific set of vectors we look at then we might be able to infer some interesting relationships and actually ‘see’ all this a bit more visually.

So, as an example, if we filter to just the Kardashians, we can see some specific vector elements that are of a similar magnitude across all of the family – a sort of Kardashian DNA marker, if you will allow me the indulgence 🙂

Highlighted in yellow are some vector dimensions that seem to be similar magnitude across all Kardashians.

Of course, as usual with these types of algorithms, you rarely get a really clear single number or measure that captures what it is actually doing. More often it’s the combinations of the numbers and measures that the computer can pick up on much more easily than we can by looking at them.

To express this, the chart below shows more visually how the directions and magnitudes of the vector elements tend to move together for all the Kardashians.

The colors all tend to be either positive or negative together and sometimes even have a similar area to each other.

It’s this correlation among all 100 vector dimensions that would be a much stronger fingerprint of Kardashians to a computer, but which is harder for us to perceive.

A counter-example might also help here – if we take a group we expect not to have much similarity, say “hillary_clinton”, “justin_bieber”, and “eminem”, then we see:

Everything here looks a lot more random in terms of correlations across vector dimensions.

So it seems like maybe we actually can have some level of interpretability here, depending on the specific question you ask and how you frame it. That said, the main goal of word2vec is a flexible representation that can be useful in other tasks, as opposed to interpretability of what key attributes or characteristics it has selected for in its representation.

Vector Arithmetic

One of the most interesting and well known findings from the word2vec approach was that you could do arithmetic on the resulting vectors and the results of that arithmetic implied a pretty impressive level of semantic understanding.

A typical example here (from the vectors originally published by Google and so trained on a larger more generic dataset) is if you take the vector for “King” subtract the vector for “Man” and add the vector for “Woman” you end up nearest to the vector for “Queen”. So:

King – Man + Woman = Queen

This is to say that the representation somehow figured out that an important part of what it means to be a king is to be male, that male and female are in some way opposite, and so the opposite of a king is a queen. (It would also have similarly figured out that a large part of what it means to be a queen is to be female; again, I’m being a little simplistic perhaps.)

Looking at our trained model’s vectors we can see an equivalent equation:

kim_kardashian – kanye_west + brad_pitt = ??

kim_kardashian – kanye_west + brad_pitt = angelina_jolie

Yay it works!

It’s figured out that “kim_kardashian” is to “kanye_west” as “angelina_jolie” is to “brad_pitt” i.e. a marriage relationship.
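Mechanically, the arithmetic is nothing more than element-wise vector operations followed by a nearest-neighbour lookup by cosine similarity. Here is a toy numpy sketch using hand-made 4-d vectors (the real ones are the learned 100-d vectors, and gensim wraps all of this in model.wv.most_similar(positive=[...], negative=[...])):

```python
import numpy as np

# hand-made toy vectors purely for illustration
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.0]),
    "man":   np.array([0.1, 0.0, 0.1, 0.1]),
    "woman": np.array([0.1, 0.0, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(target, exclude):
    # nearest remaining word by cosine similarity, with the input words excluded
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cosine(vecs[w], target))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(most_similar(target, exclude={"king", "man", "woman"}))  # queen
```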

Ok, so cool, but to be honest the truth is always a little more complicated and messy, and it won’t work for every example.

But generally if you just throw these types of equations at it, within the top few results you do tend to see things that make sense even if they are not as clean and perfect as you might like.

So here is a nice example.

taylor_swift – harry_styles + zayn_malik :

“selena_gomez”, “calvin_harris” – wth?

But we do still see “perrie_edwards” in the top group, which is reasonable as they used to date, and we see “gigi_hadid” who (a Google search later) I believe he is still dating.

As an aside, a quick investigation revealed devastating rumors that Zayn Malik may have actually cheated on Perrie Edwards with Selena Gomez! Also it seems Zayn and Calvin Harris have had major beef in the past (yes, I typed that) that extended in various ways to each other’s significant others. Juicy! I’m… going to stop this now. The point is that there actually seems to be a sort of relationship theme running through these connections, so it’s not surprising that this seems to have been, to some extent, encoded into this specific group of vectors.

Bieber’s Network

Another way to explore the resulting vectors is to pick a seed word like “justin_bieber”, find its N nearest neighbours, and for each neighbour find their N nearest neighbours. Take S such steps and the end result will be some definition of a network centered on the seed word.
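That expansion step can be sketched as a simple breadth-first walk. The nearest function below is a stand-in for a call like model.wv.most_similar(word, topn=n) on the real model, and the toy neighbour lists reuse names from this post but are entirely invented for illustration:

```python
def build_network(seed, n, steps, nearest):
    """Expand a network from a seed word by repeatedly taking n nearest neighbours."""
    edges, frontier, seen = set(), {seed}, {seed}
    for _ in range(steps):
        next_frontier = set()
        for word in frontier:
            for neighbour in nearest(word)[:n]:
                edges.add(tuple(sorted((word, neighbour))))  # undirected edge
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return edges

# invented toy neighbour lists standing in for most_similar() results
toy_neighbours = {
    "justin_bieber": ["selena_gomez", "cody_simpson"],
    "selena_gomez": ["justin_bieber", "taylor_swift"],
    "cody_simpson": ["justin_bieber", "austin_mahone"],
    "taylor_swift": ["selena_gomez", "calvin_harris"],
    "austin_mahone": ["cody_simpson"],
}

edges = build_network("justin_bieber", n=2, steps=3,
                      nearest=lambda w: toy_neighbours.get(w, []))
print(sorted(edges))
```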

So if we use a seed of “justin_bieber” with 10 nearest neighbours (N=10) and take 3 steps (S=3) we get a resulting graph that looks like this (with a little bit of cleaning to remove most non-person-type words):

From this we can see some initial direct connections as well as easily pick out connections in the graph that themselves have many edges. (btw – who are Cody Simpson and Austin Mahone – should I know?)

Once we have our network of relationships there are many different ways to visually lay out the graph itself. I’ve used the igraph, plotly, and networkD3 libraries in R to plot all the networks, as I sometimes just find it easier to produce plots using R over Python. After playing with various layouts I found the below force-directed layout to be useful. Colors are based on community detection within the graph – so essentially we can think of these as one way to see sub-clusters within the network, as well as highlight nodes with higher betweenness, for example.

Obligatory Heatmap

An alternative way to explore Bieber’s network is to pull out the vectors corresponding to each member of the network and then do some clustering on the resulting matrix of numbers. The idea here being to find any potential relationships between the various members of the network.

Heatmaps are sexy and all, but it can be hard to really see anything visually beyond the most and least correlated cells.

little_mix and demi_lovato seem to stand out here for some reason.

A more useful approach here is to use hierarchical clustering to build a dendrogram, which visually places the closest vectors together and can then be grouped into clusters of varying size.

From the above dendrogram we can see that clustering on the vectors does actually give us some nice results. We see the yellow “young supermodel” cluster, we see the green “one direction” cluster, and we also see a sort of “popstar” cluster with Demi Lovato, Ed Sheeran, Nick & Joe Jonas and some others. And generally all words that are next to each other seem reasonable.

One of the most useful aspects of the hierarchical clustering approach (over k-means, say) is that it lets the data speak for itself a bit more and gives us a way to easily see if we agree with it or not. As such, it might be more useful to use this approach if we were taking a more general or higher-level view and wanted to cluster a sample of words to see if the results more generally make sense.
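A sketch of that clustering step using scipy – random vectors stand in for the real 100-d word vectors here (with a trained model you would pull them via model.wv[word]):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# random stand-ins for the 100-d vectors of some network members
words = ["demi_lovato", "ed_sheeran", "nick_jonas", "gigi_hadid", "kendall_jenner"]
vectors = rng.normal(size=(len(words), 100))

# ward linkage builds the merge tree; scipy.cluster.hierarchy.dendrogram(Z)
# would draw the dendrogram from this same Z matrix
Z = linkage(vectors, method="ward")

# cut the tree into (at most) two flat clusters
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(words, clusters)))
```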

So as an example here is a dendrogram from a random sample of words.

In general we can see that related words tend to be placed next to each other, as a result of word2vec capturing quite well the various ways in which the words are related.

t-SNE

Another useful tool to visualise our vectors and their relationships to each other is t-SNE. This gives us a way (with some PCA in between) to move from the 100-dimensional space of our vectors to a 2-D space, where points near each other in the higher-dimensional space tend to be near each other in the projected 2-D space, making it easier to visualise related points on a scatter plot.
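A sketch of that projection with scikit-learn – random vectors again stand in for the real 100-d word vectors, and the parameter values are illustrative rather than tuned:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
vectors = rng.normal(size=(200, 100))  # stand-in for 200 word vectors

# PCA first to tame the dimensionality, then t-SNE down to 2-D for plotting
reduced = PCA(n_components=30).fit_transform(vectors)
coords = TSNE(n_components=2, perplexity=20, random_state=7).fit_transform(reduced)

print(coords.shape)  # one x/y point per word, ready for a scatter plot
```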

So when you plot a sample of words you get what at first looks like a mess. But if you zoom in on certain sections you can see that each part of the space tends to be home to words that are in some way related.

In the example below there seemed to be a part of the space made up mostly of beauty-related words and another made up of sports and NFL-related words.

Really this is best explored in an interactive plot which is one of the best things about plotly. Here is an interactive version of the above plot hosted on plotly to explore.

So What?

So from playing around and exploring our vector space we see aspects of our word2vec vectors that in hindsight seem quite interesting and potentially useful in many downstream tasks.

I find it pretty impressive that word2vec was able to extract all this structure and insight from essentially just ‘reading’ 100k articles, focusing on how words tend to co-occur and capturing that in a flexible numeric representation.

Key to this is the semantic aspects we seem to have captured. This could be very useful in providing a more nuanced way for us to represent our content as numeric features to feed into downstream predictive models of things like CTR and pageviews. It would enable a model to avoid getting too hung up on whether an article is specifically about Liam Payne or Zayn Malik, if what really matters is that the content is essentially “One Direction” related. The more traditional bag of words approach would not give us this flexibility.

So, all in all, embeddings and tools like word2vec, doc2vec, lda2vec, etc. are increasingly becoming foundational approaches, very useful when looking to move from bags of unstructured data like text to more structured yet flexible representations that can be leveraged across many problem domains.

Here’s a list of reference links I gathered in preparing my talk for WordCamp for Publishers 2017. Some of this material didn’t make it into the final talk.


My Confession

I have a confession… my crontab is a mess and it’s keeping me up at night… don’t worry, it’s not really keeping me up… but you might know what I mean 🙂

We have been using Google BigQuery as our main data mart (lake, or whatever it’s now called) for almost two years. We have been loving it, as it’s super powerful with very little overhead in terms of management and infrastructure.

However, one thing I think is missing from Google BigQuery is a tool for managing and orchestrating your ETL jobs – something like MS Data Factory but for Google Cloud.

Over the last year or so we have built up a few shell scripts to wrap around the bq command line tool, which basically take a series of templated SQL files, chain them together into a job, and then schedule it all via cron. This has actually been working better than I would have expected, mainly because we try to build a lot of redundancy into these jobs, where failures won’t kill any live tables and will usually be resolved the next time the job runs in a few hours or so.

However it has always bugged me a little that this is not really a great way to go about things, and as the number of jobs and pipelines has grown it has become a little painful to manage.

Apache Airflow

Recently we have been playing around with Apache Airflow. This post is a concrete example of one way we got it working for a specific use case that I did not really find any obvious existing examples of (there is actually great documentation and lots of examples, but there is a layer of Airflow-specific concepts and terminology one needs to nerd up on first).

So I won’t talk much about Airflow in general, except to give some references I found very good for beginners:

The specific use case I was trying to figure out was creating a parameterised pipeline that takes the templated SQL files we already have and runs them for each of our lines of business or “lobs” (e.g. variety.com, hollywoodlife.com, wwd.com, etc.). This is a pretty common ETL design pattern, so hopefully others might find this useful.

A concrete example at PMC would be some post-processing we do of the raw Google Analytics data we get exported into BigQuery each day. The pipeline basically takes the raw [<view_id>.ga_sessions_yyyymmdd] tables for each day, applies some transformations and enrichments, and then creates a table in BigQuery for each lob for each day. So, for example, [wwd.ga_data_20170322] would be the end result: a raw hit-level table that we have enriched in various ways. This is then essentially one of our main analytical tables that powers lots of downstream analytics across all lobs.

Parameterized DAG Example

Below is an example of one way we got this working: a single collection of SQL template files (representing the actual steps of processing we want to do in BigQuery) and a .py script to build and define the expanded DAG, with parallelism across all lobs while maintaining dependencies between each lob-specific task.

Disclaimer: I’m pretty new to Airflow, and one thing I’ve found is that due to the programmatic way DAGs are built there are lots of ways to do things (this is actually one of its strongest selling points). I’m sure there are better ways to go about what we are showing here (feel free to add suggestions in the comments 🙂).

So here is an example DAG definition Python script, which lives in its own subfolder in our Airflow DAGs folder. (Prettier formatting on GitHub here.) I’ve tried to go overboard on the commenting for line-by-line clarity.

"""
### My first dag to play around with airflow and bigquery.
"""
# imports
from airflow import DAG
from datetime import datetime, timedelta
# we need to import the bigquery operator - there are lots of cool operators for different tasks and systems, you can also build your own
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
# create a dictionary of default typical args to pass to the dag
default_args = {
'owner': 'airflow',
'depends_on_past': False, # does this dag depend on the previous run of the dag? best practice is to try have dags not depend on state or results of a previous run
'start_date': datetime(2017, 3, 22), # from what date to you want to pretend this dag was born on? by default airflow will try backfill - be careful
'email_on_failure': True, # should we send emails on failure?
'email': ['andrew.maguire@pmc.com'], # who to email if fails i.e me :)
'retries': 1, # if fails how many times should we retry?
'retry_delay': timedelta(minutes=2), # if we need to retry how long should we wait before retrying?
}
# define the dag
dag = DAG('my_bigquery_dag', # give the dag a name
schedule_interval='@once', # define how often you want it to run - you can pass cron expressions here
default_args=default_args # pass the default args defined above or you can override them here if you want this dag to behave a little different
)
# define list of lobs we want to run for
lobs = ["lob001","lob002","lob003"]
# loop through the lob's we want to use to build up our dag
for lob in lobs:
# define the first task, in our case a big query operator
bq_task_1 = BigQueryOperator(
dag = dag, # need to tell airflow that this task belongs to the dag we defined above
task_id='my_bq_task_1_'+lob, # task id's must be uniqe within the dag
bql='my_qry_1.sql', # the actual sql command we want to run on bigquery is in this file in the same folder. it is also templated
params={"lob": lob}, # the sql file above have a template in it for a 'lob' paramater - this is how we pass it in
destination_dataset_table='airflow.'+lob+'_test_task1', # we also in this example want our target table to be lob and task specific
write_disposition='WRITE_TRUNCATE', # drop and recreate this table each time, you could use other options here
bigquery_conn_id='my_gcp_connection' # this is the airflow connection to gcp we defined in the front end. More info here: https://github.com/alexvanboxel/airflow-gcp-examples
)
# add documentation for what this task does - this will be displayed in the Airflow UI
bq_task_1.doc_md = """\
Append a "Hello World!" message string to the table [airflow.<lob>_test_task1]
"""
# define the second task, in our case another big query operator
bq_task_2 = BigQueryOperator(
dag = dag, # need to tell airflow that this task belongs to the dag we defined above
task_id='my_bq_task_2_'+lob, # task id's must be uniqe within the dag
bql='my_qry_2.sql', # the actual sql command we want to run on bigquery is in this file in the same folder. it is also templated
params={"lob": lob}, # the sql file above have a template in it for a 'lob' paramater - this is how we pass it in
destination_dataset_table='airflow.'+lob+'_test_task2', # we also in this example want our target table to be lob and task specific
write_disposition='WRITE_TRUNCATE', # drop and recreate this table each time, you could use other options here
bigquery_conn_id='my_gcp_connection' # this is the airflow connection to gcp we defined in the front end. More info here: https://github.com/alexvanboxel/airflow-gcp-examples
)
# add documentation for what this task does - this will be displayed in the Airflow UI
bq_task_2.doc_md = """\
Append a "Goodbye World!" message string to the table [airflow.<lob>_test_task2]
"""
# set dependencies so for example 'bq_task_2' wont start until 'bq_task_1' is completed with success
bq_task_2.set_upstream(bq_task_1)

So the above file references our SQL template files, which actually hold the business logic of what we really want to do in BigQuery; everything above is really just plumbing.

Below is what is in the ‘my_qry_1.sql’ file – it’s just dummy code for this example. (Again, GitHub link here, and apologies, I don’t know how to properly format these blocks as code.)

-- this is just some dummy bql to create a little example table
SELECT
  string('{{ params.lob }}') as lob, -- here we define the jinja template for lob
  string('{{ ds }}') as airflow_execution_date, -- here we leverage a predefined variable airflow provides, more info: http://airflow.readthedocs.io/en/latest/code.html?highlight=ds_nodash#default-variables
  string('{{ ds_nodash }}') as airflow_execution_date_yyyymmdd, -- another predefined airflow variable
  string('{{ ts }}') as airflow_execution_timestamp, -- another predefined airflow variable
  current_timestamp() as bq_timestamp, -- get the current time from bigquery so we can see any difference between the airflow execution date (in the case of backfills) and when we actually ran this code
  'Hello World!' as msg -- just a dummy field

With the above code in a folder within our specified airflow DAGs folder we can see how Airflow picks up this DAG.

Gratuitous Airflow UI screenshots coming right up…

We can see the parallel nature of the same tasks but just broken out for each lob in the graph view.

If we look at the tree view we can see this in another way along with execution status over time.

And finally if we look at the gantt view we can see that we do indeed have the parallelism we were after with task 1 being run concurrently for each lob and then similar concurrency for task 2.

So if we trigger this DAG from the airflow cli with:

$ airflow trigger_dag my_bigquery_dag

We can see the resulting data and tables in BigQuery.

And that’s pretty much it. It’s still early days, but I’m hoping the fact that we already have templated .sql files at the core of our existing approach should make porting over to Airflow easy enough (especially as most of our jobs run in BigQuery, so the LocalExecutor should be enough).

We might do a follow-up post in a month or two to share some deeper learnings once we’ve used it more. I’m particularly excited about using Airflow pretty much anywhere and everywhere I can. One neat example that jumps to mind would be machine learning pipelines, where we tend to use BigQuery for the data crunching and H2O for the model building and learning; Airflow seems like a great way to stitch it all together more cohesively.

Over the last couple of months we have been implementing AMP on many of our sites. I’m not going to discuss the pros and cons of AMP, nor the wider trend of disintermediation of content off-site across various platforms and ecosystems. Instead, this is a more practical little (but potentially a lot bigger) pitfall we have noticed in how Google Analytics handles this traffic.

The story begins when our main anomaly detection system (Anodot) recently started picking up spikes and increasing referral traffic from “cdn.ampproject.org”.

As we looked into this we could see the following example source, medium and referral path information in Google Analytics:

So it seems like the mysterious new referrer “cdn.ampproject.org” is actually just a result of users clicking through to our sites from content that happens to be hosted on AMP.

In the examples above we can see that in the first case it’s actually not really a referral as such, but just the result of someone clicking through from one of our own AMP pages.

The other two examples are indeed referrals but the traffic source of “cdn.ampproject.org” misses the fact that actually these are referrals from specific third parties (refinery29.com and gizmodo.com).

So the fact that this traffic really came from different places, and that “cdn.ampproject.org” is not really a proper referrer in the traditional sense, is now hidden in the referral path in Google Analytics. All the out-of-the-box reports and dashboards that typically revolve around the traffic source field in GA will therefore miss this new complication.

This is not really a bug in Google Analytics, it’s more of a potential unintended consequence of the way AMP works.

In a world where all publishers use AMP for everything, you can imagine how bad this could get, with “cdn.ampproject.org” becoming one of your main sources in GA, hiding the fact that beneath it is a much more complicated sea of actual third parties who linked to your content.

Potential Work Arounds

We will probably create a new field in our data mart that reads the true source from the referral path and so overwrites “cdn.ampproject.org” with the ‘proper’ source. This would not, however, fix things on the front end of GA for business users; it would only fix our internal downstream reporting that builds on the backend raw data we have as a 360 customer.
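As a rough sketch of what that new field might do (the path shape here is an assumption based on the examples we saw in our GA data, and clean_amp_source is a made-up helper name, not an existing function):

```python
def clean_amp_source(source, referral_path):
    """If the source is the AMP CDN, try to recover the real referring site
    from the referral path (e.g. '/c/s/www.refinery29.com/some-article')."""
    if source != "cdn.ampproject.org":
        return source
    for segment in referral_path.split("/"):
        if "." in segment:  # first path segment that looks like a domain
            return segment
    return source  # nothing recoverable, keep the AMP CDN as the source

print(clean_amp_source("cdn.ampproject.org", "/c/s/www.refinery29.com/story"))
print(clean_amp_source("google", "/"))
```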

Another option that would surface a fix to the front end could be the use of a custom dimension to house this cleaned version of the traffic source. The downside here however is that using this would require creating custom reports anywhere you wanted to use the cleaned traffic source.

There may be other workarounds I’ve missed here – if you think of any please add them in the comments.

We have reached out to Google to point this out to them. As it’s not a bug, I’m not sure if or how they might deal with it; it’s a tricky one for sure, as implementing some sort of override for AMP traffic might be a little too ad hoc. There may very well be other use cases where the current behavior is exactly what someone wants. As a publisher, though, I can’t really see any from our point of view.

Anyway, if you use GA to understand your web traffic, go check it out for yourself to see if you see the same thing. Feel free to share your story in the comments below as we are keen to hear from others affected by this.

Overview

As a data scientist you often come across beliefs, views and opinions in your organisation about how things are. You also usually want to figure out ways to find evidence in the data to either back them up or add a more nuanced interpretation where one might be useful.

I’m always very hesitant to ignore or disregard such views because the people behind them typically know a lot more than me about the business and how it works, but that knowledge is often locked away in a gut feeling or an implicit understanding built through deep domain-specific experience. So I love when I get the chance to find some data to help illustrate, and sometimes expand on, such views and beliefs. This is a little story about one such recent example…

Content lifecycle

In online media we are pretty aware of the ‘lifecycle’ of a piece of content. Most articles tend to get the majority of the pageviews they will ever receive in the couple of weeks after they are published. However we sometimes also see content that gets picked up again and again over a longer time frame and so has a much longer lifecycle.

We wanted to understand these dynamics a bit more and also see if the type of content itself had any bearing on its typical lifecycle.

Below is a stylized picture of the way we will frame this problem before looking at some data (there are of course any number of other ways to approach this, which is one of the great things about doing data science and probably one of the things that will be harder to automate once AI inevitably puts us out of a job too).

Data preparation

As always deciding what data to use, how to represent it, how to transform it etc. is key to giving the analysis the best chance to find anything interesting.

To be concrete, the data after pre-processing looks something like the below table. Each % represents the share of lifetime pageviews a piece of content has received up to that week.

So in the example above, post A has already received 70% of the pageviews it ever will within the first week of publishing. Post B, on the other hand, does not reach 70% of its lifetime pageviews until week 6, so it seems to have grown and picked up momentum while post A fizzled out by week 4. This is not to say post B performed better than post A or vice versa; we are more interested in understanding the different dynamics in the way content is consumed over time.
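The pre-processing step can be sketched roughly as follows – the weekly pageview numbers for posts A and B here are made up purely for illustration:

```python
def cumulative_shares(weekly_pageviews):
    """Turn raw weekly pageview counts into the cumulative share of
    lifetime pageviews reached by the end of each week."""
    total = sum(weekly_pageviews)
    shares, running = [], 0
    for pv in weekly_pageviews:
        running += pv
        shares.append(round(running / total, 2))
    return shares

# Hypothetical posts: A fizzles out early, B picks up momentum later.
post_a = [700, 150, 100, 50, 0, 0, 0]
post_b = [100, 100, 100, 100, 150, 150, 300]

print(cumulative_shares(post_a))  # [0.7, 0.85, 0.95, 1.0, 1.0, 1.0, 1.0]
print(cumulative_shares(post_b))  # [0.1, 0.2, 0.3, 0.4, 0.55, 0.7, 1.0]
```

Each row of the table is then one post's cumulative curve, with one column per week.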

A few things worth noting are:

We looked at posts published in the window of -180 days to -60 days. The idea here is to go far enough back to get more data while also making sure each post has been ‘alive’ for at least 60 days. So some posts will have been alive for longer periods than others, but this is ok for our purposes.

We only look at weeks 0-12 in this analysis. We had looked at longer time frames but the noise to signal ratio in the data increases the longer out you go and adding those additional dimensions did more harm than good to the quality of the clustering (it’s often best to cluster with as few dimensions as you can reasonably get away with otherwise your distance measure can become increasingly meaningless, interpretation can get very complicated, and the pretty cool sounding ‘Curse of dimensionality’ can kick in – ooohh scary…).

We decided to use a cumulative representation of the data as opposed to just using the actual share of pageviews that landed each week. Taking the non-cumulative approach also made the data a bit more noisy and resulted in much messier clustering. The intuition here is that many of the patterns are more like one or two week offsets of each other. So when you don’t take a cumulative representation of the data, these offset but similar ‘looking’ patterns can have very different distance measures. For example, if 50% of the pageviews for two posts came in week 4 for one and week 5 for the other, then in a non-cumulative representation the distance measure in the clustering would judge these two posts to be very different. Taking a cumulative % per week smooths this out and gives the clustering a bit more flexibility. Another way to think of this is that the cumulative approach builds in the desired correlation among the variables that we want to explore.

We used %’s instead of raw pageviews – this is because it was the trend or behavior we were more interested in. It could be worthwhile to just use the raw pageview counts themselves, but this would in effect be trying to extract both the trends and the different typical levels of traffic in the data. So using %’s is a natural way to normalize the data when primarily concerned with trends in the data.

We only looked at posts that had more than 100 pageviews to disregard any obvious rubbish or dirty data.
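The intuition behind the cumulative representation (third point above) can be checked with a quick toy calculation. Two hypothetical posts whose pageviews all land one week apart are maximally far apart under a non-cumulative Euclidean distance, but noticeably closer once the curves are accumulated:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cumulative(xs):
    out, run = [], 0.0
    for x in xs:
        run += x
        out.append(run)
    return out

# Hypothetical posts: 100% of pageviews land in week 4 for one, week 5 for the other.
weeks = 8
post_x = [1.0 if w == 4 else 0.0 for w in range(weeks)]  # non-cumulative weekly shares
post_y = [1.0 if w == 5 else 0.0 for w in range(weeks)]

d_raw = euclidean(post_x, post_y)                          # sqrt(2): as far apart as possible
d_cum = euclidean(cumulative(post_x), cumulative(post_y))  # 1.0: curves disagree in one week only
print(d_raw, d_cum)
```

The cumulative curves only disagree in the single week between the two spikes, so near-identical but offset lifecycles end up near each other in the clustering.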

So with all the data prep done, some decisions and assumptions made, and some other implicit assumptions made that we might not even have realized we were making, we can look at some of the raw data (note: for everything below we used hollywoodlife.com data).

Each line represents a different piece of content. It looks a bit jumpy and jagged, and we see lines with different slopes and angles, so there do at least appear to be some differences between these lines.

If we keep going and plot all the data:

Okay great – I can’t see anything and it’s a mess (maybe the only thing we can see is that there looks to be a lot of variation in lifecycle paths – this will be important later).

Anyway, this mess is what I was hoping to show, and it is exactly why it might be useful to use clustering to try to get some sense out of it.

Clustering

From here on in it’s a pretty straightforward application of any standard clustering approach to the data.

As the data was not that big (about 12,000 posts × 13 weeks) I used R on my laptop and the pam function from the cluster package (shout-outs also to bigrquery for getting the data, some tidyr to go from long to wide format, and of course ggplot2 for the, hopefully, pretty pictures you see).

Side note: I’ve recently been building bigger models with Google Compute Engine (8 cores, 52GB), RStudio Server, and h2o as the engine to build the models. I’m finding the h2o R package really cool and easy to use – it even cross-validates most hyperparameters for you!

I ended up picking k=3 after a bit of messing around and trial and error. There are much better and more rigorous ways to do this, but the good thing about the time series nature of this data is that it’s easy enough to visualize the results and understand if they make sense or not.
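As a rough illustration of what the clustering step is doing: the real analysis used R’s pam (k-medoids) on the cumulative curves, but the same idea can be sketched with a tiny hand-rolled k-means on made-up lifecycle curves:

```python
import random

def kmeans(curves, k, iters=50, seed=42):
    """Tiny k-means on equal-length curves. The post itself used R's pam
    (k-medoids); k-means is a simpler stand-in for illustration only."""
    rng = random.Random(seed)
    centers = rng.sample(curves, k)
    labels = [0] * len(curves)
    for _ in range(iters):
        # Assignment step: each curve goes to its nearest center
        # by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(curve, centers[c])))
                  for curve in curves]
        # Update step: each center becomes the mean curve of its cluster.
        for c in range(k):
            members = [curve for curve, l in zip(curves, labels) if l == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers

# Hypothetical cumulative lifecycle curves (weeks 0-4): fast burners vs slow burners.
fast = [[0.80, 0.90, 0.95, 1.0, 1.0], [0.75, 0.92, 0.97, 1.0, 1.0]]
slow = [[0.20, 0.35, 0.55, 0.8, 1.0], [0.15, 0.30, 0.50, 0.75, 1.0]]
labels, _ = kmeans(fast + slow, k=2)
print(labels)  # the two fast burners share one label, the two slow burners the other
```

With well-separated curves like these the split is stable; on real data, picking k and sanity-checking the result visually is the harder part, as noted above.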

So if we overlay the results of the clustering on our original crazy messy plot we see:

Still pretty messy, but we can see the three clusters trying to cover different parts of the data, although clusters 2 and 3 have a lot of overlap so could probably be merged further.

For the purposes of what we are doing, cluster 1, which covers 23% of the posts, is the most interesting. This seems to be the subset of content that tends to have much longer-lived lifecycles while also being subject to lots of variability in exactly how those lifecycles play out.

To better summarize the clusters we can take their means and medians (as well as various percentile ranges to keep an eye on the variability within each cluster – is it so variable as to be meaningless?) and plot them as below (in this case the plots for mean and median were very similar so we just show the median).

Here we can see pretty clearly that cluster 2 (green) and 3 (blue) are very similar. The shaded regions on the plot represent the 25th and 75th percentiles – so this region is generally where the middle 50% of the data sits.
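The per-cluster summaries behind this kind of plot are simple to compute; a sketch using Python’s statistics module on made-up curves (the real analysis was done in R with ggplot2):

```python
import statistics

def summarize_cluster(curves):
    """Per-week median and 25th/75th percentile band for one cluster's
    cumulative curves: the line and shaded region in the plot above."""
    summary = []
    for week_vals in zip(*curves):
        q1, med, q3 = statistics.quantiles(week_vals, n=4)  # quartiles
        summary.append({"p25": q1, "median": med, "p75": q3})
    return summary

# Hypothetical cluster of four cumulative curves over 3 weeks.
cluster = [[0.20, 0.50, 1.0], [0.30, 0.60, 1.0],
           [0.25, 0.55, 1.0], [0.40, 0.70, 1.0]]
for week, s in enumerate(summarize_cluster(cluster)):
    print(week, s)
```

A wide p25–p75 band, as with the ‘Long Lived’ cluster, flags high within-cluster variability.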

As with any clustering analysis, you have to come up with snazzy names for each cluster. The best I could come up with was ‘Long Lived’ for cluster 1. I kinda gave up there.

We can see the shaded area is much wider for the ‘Long Lived’ red cluster which hints at the larger variation within this cluster that we noted earlier.

So summing up so far, we have found evidence that there do indeed seem to be two distinct types of content lifecycle in this data – one that burns up quickly after publish and another that is more of a slow burn. The next question is whether the type of the content itself has anything to do with this.

The underlying prior belief was that galleries tend to be more long lived than articles. We were also wondering if articles with video content behaved more like galleries or not in terms of lifecycle dynamics.

To take a look at the potential impact of content type on cluster membership we took a look at the cluster distributions within each content type.

Above we can see that gallery content tends much more than the others to be in our ‘Long Lived’ cluster 1 (about 55% of galleries were in this cluster).

We can also see that articles with videos and articles without videos look pretty similar.
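Computing the cluster mix within each content type is just a normalized cross-tabulation; a sketch with hypothetical labels:

```python
from collections import Counter

def cluster_mix_by_type(rows):
    """rows: (content_type, cluster) pairs -> share of each cluster
    within each content type."""
    totals = Counter(t for t, _ in rows)          # posts per content type
    pairs = Counter(rows)                          # posts per (type, cluster)
    return {(t, c): round(n / totals[t], 2) for (t, c), n in pairs.items()}

# Hypothetical labelled posts (content_type, cluster)
rows = [("gallery", 1), ("gallery", 1), ("gallery", 2),
        ("article", 2), ("article", 3), ("article", 3), ("article", 2)]
print(cluster_mix_by_type(rows))
```

Normalizing within each content type (rather than overall) is what lets the comparison say “galleries are disproportionately in cluster 1” even though galleries are a minority of posts.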

Another look at the data…

To validate this initial finding – that it is indeed galleries that are much more likely to be long lived, and that articles with and without video content tend to behave similarly – we took another look at the raw data.

In particular, if we take a look at the distributions of % of total lifetime pageviews reached by week 1 we see again more evidence to back up our findings.

We can see a high peak to the far right for articles (the red and green lines) that suggests most of them have already received around 80% or more of their lifetime pageviews by week 1.

We now also begin to see a more nuanced view for galleries – a large share of them are also at 75% or more by week 1, but there is a much flatter, more uniform fat tail to the left, which indicates a much higher probability that a gallery’s lifetime pageviews are at less than 75% by week 1.

If we look at this picture again but jump forward to week 4 we see galleries starting to follow a similar shape to articles but still with a very long tail where some have still not even reached 25% of lifetime pageviews.

Some nuance

The above distribution plots, along with the clustering analysis, suggest that galleries do indeed have a higher likelihood of being ‘longer lived’, but that the majority of galleries have still received most of their lifetime pageviews by week 4. This sounds like a contradiction but it’s not.

So a more nuanced interpretation is that although some galleries end up with a longer lifecycle this does not mean all galleries are still active beyond 4 weeks, in fact it’s the opposite.

One way to try to bring this out more clearly is to do the same clustering exercise as above but first filter the data to just galleries.

Here we see clearly that 69% (53% + 16%) of galleries follow the more normal lifecycle of getting most of their pageviews in the first couple of weeks, but that about 32% of them (cluster 1, in red here) follow a much steadier lifecycle, receiving an almost constant flow of pageviews per week.

As this is hollywoodlife.com it could be that these are mega galleries that relate to key celebs who continually appear on the site’s content and so get continually linked to (e.g. “Check out more pics of Kimye !!!”).

This raises the question of understanding and interpreting why we see these patterns and what their potential drivers are.

Further work

A logical next step would be to use the cluster labels as inputs into a classification problem where we see what features are predictive of cluster membership (example features could relate to content topic, how the content is promoted, gallery-specific features, maybe even some semantic-level understanding of what’s in the pictures themselves using tools like the Google Vision API, and pretty much any other features we can dream up and try to quantify).

This could help generate insights which could be useful to feed back to editorial and the wider business. Examples here could relate to how we handle the content once it’s created, or even insights that indicate actions that can be taken at creation time to improve the chances of the content becoming long lived and ultimately attracting more eyeballs. Being able to predict the most likely lifecycle of content after a week or so could also be useful in helping to decide what content to promote over others and how best to place it.

So, as is typical with this sort of work, it both raises more questions and opens additional avenues to explore. Back to work I guess.

There is a constant need for a tool to replicate the production site with minimal data on a test setup with a completely functional theme. We have all come across a situation where we wanted to test a piece of code on a setup that is exactly like our production environment, and we start by setting up the theme and a test server and then importing data from production into it.

Production databases are huge, with 3–10GB or more of data for each site, and if we have multiple sites it easily becomes 60–100GB of data that we have to download. If we want to create a setup for each developer, we are creating that many environments with huge datasets occupying space and increasing redundancy. We always wanted a much smaller dataset on our test site that represents each type of content from the production server. Also, when we launch a new section or content type we create dummy content for testing, but more often than not it’s different from the way editors use that section in the wild, and we want a piece of that dataset on our local setup to understand how the section is used and test it properly.

Keeping in mind the above scenario, I was asked by Amit Sannad to develop a plugin using the WordPress REST API so that we can import data from the production site into any WordPress setup of our choice, giving us a theme and a mini site that is completely functional and mirrors the production environment. Having a plugin to populate the minimum required data is a good way to fill our test sites with all the features and new content types from custom post types, and it also allows us to easily refresh data on a dev site.

I started with the JSON REST API plugin. It is a very useful plugin and very well documented here. It lets you import most of the built-in types such as posts, pages, users, tags, categories etc. It also allows you to extend it and add custom endpoints to fetch any custom data type not available by default, such as custom post types and custom taxonomies. But soon we realised that it is not part of WordPress.com core yet and is in beta release, which would require us to install it as a plugin and then develop, customize, and deploy the entire code.

We host our sites on WordPress.com VIP and did not want to develop against something that was not yet part of WordPress.com core. So I came across WordPress.com’s REST API – it’s different from the JSON REST API, it’s public, and it allows authenticated retrieval of data. That became an important factor in choosing an API that is well supported by WordPress.com VIP and provides authentication when we want to pull sensitive data.

Using WordPress.com’s REST API, we can fetch posts, tags, categories, attachments, and comments. However, we require OAuth2 authentication for pulling users and menus. When I got familiar with OAuth2 I learned that it uses a two-step flow that provides an access token, which can be saved and reused. It is a secure way of connecting to a production site and getting data. If we are on a familiar dev/QA environment and know the users that would use the site, we can fetch the access token once, save it in the database, and send it in the Authorization header as a Bearer token.
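As a sketch of how the saved token gets used, here is how an authenticated call to the WordPress.com REST API posts endpoint might be built. The site name and token are placeholders, and only the request object is constructed here – nothing is actually sent:

```python
from urllib.request import Request

def build_posts_request(site, access_token, post_type="post", number=20):
    """Build an authenticated request against the WordPress.com REST API
    posts endpoint. The saved OAuth2 access token goes in the
    Authorization header as a Bearer token."""
    url = (f"https://public-api.wordpress.com/rest/v1.1/sites/{site}/posts/"
           f"?type={post_type}&number={number}")
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})

# Placeholder site and token for illustration; a whitelisted custom post
# type such as 'gallery' can be requested via the 'type' query string.
req = build_posts_request("example.wordpress.com", "SAVED_ACCESS_TOKEN",
                          post_type="gallery")
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending the request (e.g. with `urllib.request.urlopen`) would return the posts as JSON; public post types work without the header, private ones require it.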

Then comes the data import for non-built-in types. A key breakthrough here was learning from WordPress VIP that there is a filter to whitelist custom post types so they can be pulled through the posts endpoint using the ‘type’ query string. We can whitelist all the custom post types we want to access. Private custom post types will only be available if the user is authenticated; this way we can access sensitive information without exposing it to the general public, while public custom post types are exposed without authentication. Here is a snippet that gets the allowed post types for import.

But there was another challenge: the API does not allow us to extend its functionality. That was all it could provide for the moment, and I was blocked when I learned that custom taxonomies, sidebar widgets, and admin settings had no endpoints in this API.

We needed a way to get these things to complete the theme setup. Amit Sannad suggested we use WordPress.com’s XML-RPC API to import those values. WordPress.com’s XML-RPC API provides default methods to fetch taxonomies and terms, but for sidebar settings and widgets I had to write a custom XML-RPC method by extending the wp_xmlrpc_server class to get the required dataset based on authentication.

One more feature of this plugin is support for WP-CLI, so that we can run the imports from the command line and have the data imported by any startup scripts that set up or update the dev environment. This allows us to automate the import process. We can have credentials saved to a JSON file (the format of the JSON file is committed to the GitHub repository) and pass the path to the file as an argument for authentication; this is only needed the first time we run the CLI on a site, after which the access token is saved to the database for subsequent use. To import using WP-CLI we can use:

# Import all data from the production server for a given URL
wp --url=dev.local pmc-import-live import-all --file=/path/to/auth.json

# Import only specific required data from the production server for a given URL
wp --url=dev.local pmc-import-live import-routes --routes=users,menus \
  --post-type=post,gallery,page --xmlrpc=taxonomies,options

This completed the basic entities required for importing content in order to have a fully functional replica of a production site with a minimal amount of data. We can create as many setups as we want, pull just the required data at the click of a button, and refresh an existing test site with new data as and when required, without having to touch the production site. The PMC Theme Unit Test plugin performs all of that for us.

PMC Theme Unit Test has been a very challenging and satisfying plugin to develop. Thank you Gabriel Koen and Amit Sannad for conceptualising this wonderful plugin and letting me work on it. Thanks Corey Gilmore, Amit Gupta and Hau Vong for your valuable inputs throughout the development.