Month: August 2017

With all the fanfare and triumph both deep learning and artificial intelligence get these days, one aspect I find often gets overlooked in popular accounts is the central role embeddings play. This is probably because they can be a little abstract and hard to explain to someone encountering them for the first time. Regardless, embeddings can help turn a sea of unstructured text into the raw numbers that fuel deep learning and many other approaches to machine learning and AI.

What’s an embedding, you say? Well, I’m glad you asked. In this post we will dip our toes into the world of embeddings with a somewhat silly hands-on example and illustration using about 100,000 articles from one of our celebrity entertainment sites, hollywoodlife.com.

I’ll give you a warning up front – there will be some talk of numbers, dimensions, arrays, and vectors, and it might get a little dry. But to keep you interested, I promise by the end we might have answers to some very important questions like:

How can we measure the Justin Bieberness of someone?

What is Kim Kardashian – Kanye West + Brad Pitt?

Can we isolate and identify a Kardashian gene in our data?

Brief Background

I am by no means an expert in any of this, so I’ll be brief here and try to link to some good resources as we go rather than attempt long-winded explanations. All the code behind this post can be found here on GitHub and the IPython notebook fully rendered here on nbviewer (it renders the plotly charts that way too).

There are many different ways to represent text as numbers for machine learning tasks that try to extract information (in various ways) from things like blobs of raw text.

My TL;DR version is this. Let’s say we have three tweets:

“I love Justin Bieber”.

“I adore Justin Bieber”.

“Omg, I love love love Justin Bieber!”.

If we wanted to do any sort of knowledge extraction from this bunch of tweets we’d need to settle on a method to represent them in some way to the computer. Ideally this representation might be something we can then do math on and use in various machine learning techniques.

Bag of words art installation at CMU.

Perhaps the most intuitive approach here is the bag of words (BOW) approach: represent each sentence as a vector of counts of the number of times each word appears. So:

| Original Text | I | love | justin | bieber | adore | omg | BOW Vector Representation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I love Justin Bieber. | 1 | 1 | 1 | 1 | 0 | 0 | [1,1,1,1,0,0] |
| I adore Justin Bieber. | 1 | 0 | 1 | 1 | 1 | 0 | [1,0,1,1,1,0] |
| Omg, I love love love Justin Bieber! | 1 | 3 | 1 | 1 | 0 | 1 | [1,3,1,1,0,1] |

So this is one way we could represent the tweets as numbers to try to figure out how much someone loves Justin Bieber.
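As a quick sketch, the bag-of-words counting above takes only a few lines of Python (the vocabulary and tweets here are the toy examples from the table):

```python
from collections import Counter

tweets = [
    "i love justin bieber",
    "i adore justin bieber",
    "omg i love love love justin bieber",
]

# Fixed vocabulary: one vector dimension per word
vocab = ["i", "love", "justin", "bieber", "adore", "omg"]

def bow_vector(text, vocab):
    # Count each word in the (already lowercased, unpunctuated) text
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(t, vocab) for t in tweets]
# vectors[0] is [1, 1, 1, 1, 0, 0]
```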

In this case we might flag both “love” and “adore” as positive words, so maybe we could measure the amount of love for Bieber in each tweet as 1, 1, and 3.

Basically, each tweet shows love for Bieber but maybe the third one is stronger in some sense given the repetition of “love” and presence of an emotive “omg”. These are the sorts of ‘rules’ a computer could then learn if given enough example data.

This approach actually turns out to be surprisingly successful in practice, especially when you consider that it does not really try to understand each word in any way – it’s just counts.

So for example, if we had not flagged “adore” as a positive word then we would have scored tweet 2 as having no love for Bieber. The bag of words approach makes no attempt to capture in the representation the fact that “love” and “adore” are fairly similar in semantic meaning.

Word Vectors

Taken from https://www.tensorflow.org/tutorials/word2vec

What if, instead, we represented each word as 100 random numbers (btw – the choice of 100 is an arbitrary parameter here)? And then figured out a way to find the best value for each of those 100 numbers such that similar words end up having a similar set of 100 numbers, and so end up ‘near’ each other in this new 100-dimensional space.

If we could do this then we would have a representation that easily captures the fact that “love” and “adore” are similar: their vector representations would have similar sets of numbers, and so a high correlation in this 100-D space.
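To make “similar sets of numbers” concrete, similarity between two word vectors is usually measured with cosine similarity. Here is a minimal sketch using made-up stand-in vectors (in a trained model, the vectors for “love” and “adore” really would point in similar directions):

```python
import numpy as np

# Made-up stand-ins for two 100-D word vectors
rng = np.random.default_rng(0)
love = rng.normal(size=100)
adore = love + 0.1 * rng.normal(size=100)  # a slightly perturbed copy

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(love, adore))  # close to 1.0
```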

The challenge here is how to tune and pick the specific set of numbers for each word. This is where the word2vec approach comes in. We start with a random set of numbers as the representation for each word, then look through our text line by line (adjusting our numbers as we go), seeing how good we can get at predicting neighbouring words given the vector representation of a middle target word.

The intuition behind this is the famous(ish) saying (I like to put on a traditional British accent when reading this):

“You shall know a word by the company it keeps” – John Rupert Firth (famous English linguist)

The idea being that words that are in some meaningful way related or informative of each other will tend to co-occur nearby in natural raw text.

So, to be concrete, if we see a sentence like:

“Canadian heart-throb Justin Bieber is at it again, the popstar was recently filmed blah blah…”

We can see from this that by looking at the words that tend to occur around Justin Bieber, we can actually learn a lot (in a particularly specific and narrow sense, perhaps) about Justin Bieber.

The word2vec approach leverages this (and a lot of data) to essentially learn a really good combination of those 100 numbers. Where ‘good’ here is defined in terms of the ability to use that vector of numbers (as our representation of a specific word) to predict neighbouring words (or vice versa).

I have simplified and glossed over a lot here, so here are some links to resources that do a much better and more detailed job of explaining.

First we do some light string cleaning, like setting everything to lowercase and removing various funny characters, artifacts, and things like that.

(Note: In the notebook some of this is very ugly and even specific to phrases and terms commonly used on hollywoodlife.com – but is a good real world example of raw text data you might tend to come across in the wild).

Next we do some phrase creation (here). For our use case we are mainly interested in names of people, so we need to find a way to treat “Justin Bieber” as one word. Otherwise the two separate words “Justin” and “Bieber” could end up looking quite different from each other, making it much harder for the model to capture the essence of “Justin Bieber”.

Running phrase detection using gensim does a pretty good job of creating new words such as “justin_bieber” and “selena_gomez”. The idea here (oversimplifying as usual) is to look at all the documents and, if you see a pair of words co-occurring very often (above some threshold you can pick), join them together into one token (a fancy name for a word) such as “justin_bieber”.

Finally we do some specific gensim related preprocessing to get it into the format required to build a model on (here).

(Note: I found that actually passing the whole document as a ‘sentence’ to gensim, along with a wider window, gave much better results than training on random sentences pulled from the corpus. However this can be task and data specific so is something worth playing around with yourself).

Let’s Play

Visualising Vectors

So once we have our model built we can begin to look at the word vectors it has created and see if they make sense and what sort of knowledge, if any, we can extract from all this (here).

To be really concrete, here is Justin Bieber’s word vector as trained on the hollywoodlife.com corpus:

Wow – amazing no?

Ok, yeah it’s just 100 random looking numbers.

But that’s exactly the point – this particular set of numbers is now a representation of the word “justin_bieber” that captures different aspects of its meaning based on the type of words it tends to occur with.

So we should be able to use it in various ways that might be more powerful than the bag of words approach we looked at earlier.

Anyway, to help show that really there is nothing particularly scary or fancy about the output word vectors we get, here is a Tableau Public workbook where you can go and play around with these vectors for yourself.

Here are some interesting things I came across when playing with this.

If we are careful about which specific set of vectors we look at, we might be able to infer some interesting relationships and actually ‘see’ all this a bit more visually.

So, as an example, if we filter to just the Kardashians, we can see some specific vector elements that are of a similar magnitude across the whole family. A sort of Kardashian DNA marker, if you will allow me the indulgence 🙂

Highlighted in yellow are some vector dimensions that seem to be similar magnitude across all Kardashians.

Of course, as usual with these types of algorithms, you rarely get a single clear number or measure that captures what it is actually doing. More often it’s the combinations of the numbers and measures that the computer can pick up on much more easily than we might be able to by looking at them.

To express this, the chart below shows more visually how the direction and magnitude of each vector element tend to move together for all the Kardashians.

The colors all tend to be either positive or negative together and sometimes even have a similar area to each other.

It’s this correlation among all 100 vector dimensions that would be a much stronger fingerprint of the Kardashians to a computer, but which is harder for us to perceive.

A counter-example might also help here. If we take a group we expect not to have much similarity – “hillary_clinton”, “justin_bieber”, and “eminem” – then we see:

Everything here looks a lot more random in terms of correlations across vector dimensions.

So it seems like maybe we can have some level of interpretability here, depending on the specific question you ask and how you frame it. That said, the main goal of word2vec is a flexible representation that can be useful in other tasks, as opposed to interpretability of which key attributes or characteristics it has selected for in its representation.

Vector Arithmetic

One of the most interesting and well known findings from the word2vec approach was that you could do arithmetic on the resulting vectors and the results of that arithmetic implied a pretty impressive level of semantic understanding.

A typical example here (from the vectors originally published by Google, and so trained on a larger, more generic dataset) is: if you take the vector for “King”, subtract the vector for “Man”, and add the vector for “Woman”, you end up nearest to the vector for “Queen”. So:

King – Man + Woman = Queen

This is to say that the representation somehow figured out that an important part of what it means to be a king is to be male, that male and female are in some way opposite, and so the opposite of a King is a Queen. (It would also have similarly figured out that a large part of what it means to be a queen is to be female – again, I’m being a little simplistic perhaps.)

Looking at our trained model’s vectors, we can see an equivalent equation:

kim_kardashian – kanye_west + brad_pitt = ??

kim_kardashian – kanye_west + brad_pitt = angelina_jolie

Yay it works!

It’s figured out that “kim_kardashian” is to “kanye_west” as “angelina_jolie” is to “brad_pitt” i.e. a marriage relationship.

Ok, so cool. But to be honest, the truth is always a little more complicated and messy, and it won’t work for every example.

But generally if you just throw these types of equations at it, within the top few results you do tend to see things that make sense even if they are not as clean and perfect as you might like.

So here is a nice example.

taylor_swift – harry_styles + zayn_malik :

“selena_gomez”, “calvin_harris” – wth?

But we do still see “perrie_edwards” in the top group, which is reasonable as they used to date, and we see “gigi_hadid”, who (a Google search later) I believe he is still dating.

As an aside, a quick investigation revealed devastating rumors that Zayn Malik may have actually cheated on Perrie Edwards with Selena Gomez! Also it seems Zayn and Calvin Harris have had major beef in the past (yes, I typed that) that extended in various ways to each other’s significant others. Juicy! I’m… going to stop now. The point is that there actually seems to be a sort of relationship theme running through these connections, so it’s not surprising that this seems to have been, to some extent, encoded into this specific group of vectors.

Bieber’s Network

Another way to explore the resulting vectors is to pick a seed word like “justin_bieber”, find its N nearest neighbours, and for each neighbour find their N nearest neighbours. Take S such steps and the end result is some definition of a network centered on the seed word.

So if we use a seed of “justin_bieber” with 10 nearest neighbours (N=10) and take 3 steps (S=3), we get a resulting graph that looks like this (with a little bit of cleaning to remove most non-person words):

From this we can see some initial direct connections, as well as easily pick out nodes in the graph that themselves have many edges. (btw – who are Cody Simpson and Austin Mahone – should I know?)
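The expansion procedure described above can be sketched as a simple breadth-first walk over nearest neighbours. Here the nearest function is a stand-in for a call like model.wv.most_similar:

```python
def build_network(seed, nearest, n=10, steps=3):
    """Expand a network from `seed`: at each step, add edges from every
    frontier word to its n nearest neighbours."""
    edges, frontier = set(), {seed}
    for _ in range(steps):
        next_frontier = set()
        for word in frontier:
            for neighbour in nearest(word, n):
                edges.add((word, neighbour))
                next_frontier.add(neighbour)
        frontier = next_frontier
    return edges

# With a trained model the neighbour function would be something like:
# nearest = lambda w, n: [x for x, _ in model.wv.most_similar(w, topn=n)]
```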

Once we have our network of relationships, there are many different ways to visually lay out the graph itself. I’ve used the igraph, plotly and networkD3 libraries in R to plot all the networks, as I sometimes just find it easier to produce plots in R than in Python. After playing with various layouts I found the force-directed layout below to be useful. Colors are based on community detection within the graph – so essentially we can think of these as one way to see sub-clusters within the network, as well as to highlight nodes with higher betweenness, for example.

Obligatory Heatmap

An alternative way to explore Bieber’s network is to pull out the vectors corresponding to each member of the network and then do some clustering on the resulting matrix of numbers. The idea here being to find any potential relationships between the various members of the network.

Heatmaps are sexy and all, but it can be hard to visually see anything beyond the most and least correlated cells.

little_mix and demi_lovato seem to stand out here for some reason.

A more useful approach here is to use hierarchical clustering to build a dendrogram, which visually places the closest vectors together and can then be cut into clusters of varying size.

From the above dendrogram we can see that clustering on the vectors does actually give us some nice results. We see the yellow “young supermodel” cluster, we see the green “one direction” cluster, and we also see a sort of “popstar” cluster with Demi Lovato, Ed Sheeran, Nick & Joe Jonas and some others. And generally all words that are next to each other seem reasonable.
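A dendrogram like this can be built with scipy’s hierarchical clustering. The sketch below uses random stand-in vectors in place of the real word vectors from the model:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Random stand-in for the matrix of word vectors
# (rows = members of Bieber's network, columns = the 100 dimensions)
rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 100))
words = [f"word_{i}" for i in range(8)]

# Ward linkage on the raw vectors; Z encodes the merge tree
Z = linkage(vectors, method="ward")
clusters = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters

# dendrogram(Z, labels=words)  # draws the tree (needs matplotlib)
```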

One of the most useful aspects of the hierarchical clustering approach (over k-means, say) is that it lets the data speak for itself a bit more and gives us a way to easily see if we agree with it or not. As such, it might be more useful if we were taking a more general or higher-level view and wanted to cluster a sample of words to see if the results broadly make sense.

So as an example here is a dendrogram from a random sample of words.

In general we can see that related words tend to be placed next to each other, a result of word2vec capturing quite well the various ways in which words are related.

t-SNE

Another useful tool for visualising our vectors and their relationships is t-SNE. This gives us a way (with some PCA in between) to move from the 100-dimensional space of our vectors to a 2-D space, where points near each other in the higher-dimensional space tend to stay near each other in the projected 2-D space, making it easier to visualise related points on a scatter plot.
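A sketch of that pipeline with scikit-learn, using random stand-in vectors in place of the real 100-D word vectors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Random stand-in for 200 word vectors of 100 dimensions each
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 100))

# PCA first down to ~30 dims (faster, less noisy), then t-SNE to 2-D
reduced = PCA(n_components=30).fit_transform(vectors)
coords = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(reduced)
# coords is a (200, 2) array ready for a scatter plot
```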

So when you plot a sample of words you get what at first looks like a mess. But if you zoom in on certain sections, you can see that each part of the space tends to be home to words that are in some way related.

In the example below there seemed to be a part of the space made up mostly of beauty-related words, and another made up of sports and NFL-related words.

Really this is best explored in an interactive plot which is one of the best things about plotly. Here is an interactive version of the above plot hosted on plotly to explore.

So What?

So from playing around and exploring our vector space we see aspects of our word2vec vectors that in hindsight seem quite interesting and potentially useful in many downstream tasks.

I find it pretty impressive that word2vec was able to extract all this structure and insight from essentially just ‘reading’ 100k articles, focusing on how words tend to co-occur and capturing that in a flexible numeric representation.

Key to this is the semantic aspects we seem to have captured. This could be very useful in providing a more nuanced way to represent our content as numeric features to feed into downstream predictive models of things like CTR and pageviews. It would let a model avoid getting too hung up on whether an article is specifically about Liam Payne or Zayn Malik, if what really matters is that the content is essentially “One Direction” related. The more traditional bag of words approach would not give us this flexibility.

So, all in all, embeddings and tools like word2vec, doc2vec, lda2vec etc. are increasingly becoming foundational approaches when looking to move from bags of unstructured data like text to more structured yet flexible representations that can be leveraged across many problem domains.
