Tag: python

With all the fanfare and triumph both deep learning and artificial intelligence get these days, one aspect I find often gets overlooked in popular accounts is the central role embeddings play. This is probably because they can be a little abstract and hard to explain to someone for the first time. Regardless, embeddings can help turn a sea of unstructured text into the raw numbers that fuel deep learning and many other approaches to machine learning and AI.

What’s an embedding, you say? Well, I’m glad you asked… In this post we will dip our toes into the world of embeddings with a somewhat silly hands-on example and illustration using about 100,000 articles from one of our celebrity entertainment sites, hollywoodlife.com.

I’ll give you a warning up front – there will be some talk of numbers, dimensions, arrays, and vectors, and it might get a little dry. But to keep you interested, I promise that by the end we might have answers to some very important questions, like:

How can we measure the Justin Bieberness of someone?

What is Kim Kardashian – Kanye West + Brad Pitt?

Can we isolate and identify a Kardashian gene in our data?

Brief Background

I am by no means an expert in any of this, so I’ll be brief here and try to link to some good resources as we go rather than attempt long-winded explanations. All the code behind this post can be found here on GitHub, and the IPython notebook is fully rendered here on nbviewer (it renders the plotly charts that way too).

There are many different ways to represent text as numbers for machine learning tasks that try to extract information (in various ways) from things like blobs of raw text.

My TL;DR version is this. Let’s say we have three tweets:

“I love Justin Bieber”.

“I adore Justin Bieber”.

“Omg, I love love love Justin Bieber!”.

If we wanted to do any sort of knowledge extraction from this bunch of tweets we’d need to settle on a method to represent them in some way to the computer. Ideally this representation might be something we can then do math on and use in various machine learning techniques.

Bag of words art installation at CMU.

Perhaps the most intuitive approach here is the bag of words (BOW) approach. The idea is to represent each sentence as a vector of counts of the number of times each word appears. So:

| Original Text | I | love | justin | bieber | adore | omg | BOW Vector Representation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I love Justin Bieber. | 1 | 1 | 1 | 1 | 0 | 0 | [1,1,1,1,0,0] |
| I adore Justin Bieber. | 1 | 0 | 1 | 1 | 1 | 0 | [1,0,1,1,1,0] |
| Omg, I love love love Justin Bieber! | 1 | 3 | 1 | 1 | 0 | 1 | [1,3,1,1,0,1] |

So this is one way we could represent the tweets as numbers to try to figure out how much someone loves Justin Bieber.

In this case we might flag both “love” and “adore” as positive words, so we could measure the amount of love for Bieber in each tweet as 1, 1, and 3.

Basically, each tweet shows love for Bieber but maybe the third one is stronger in some sense given the repetition of “love” and presence of an emotive “omg”. These are the sorts of ‘rules’ a computer could then learn if given enough example data.

This approach actually turns out to be surprisingly successful in practice, especially when you consider that it makes no real attempt to understand each word in any way – it’s just counts.

So for example, if we had not flagged “adore” as a positive word then we would have scored tweet 2 as having no love for Bieber. The bag of words approach makes no attempt to capture in the representation the fact that “love” and “adore” are fairly similar in semantic meaning.
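To make the bag of words idea concrete, here is a minimal sketch (not the code behind this post, and with the vocabulary sorted alphabetically rather than ordered as in the table above) that builds count vectors for the three tweets:

```python
from collections import Counter
import re

def bow_vectors(texts):
    """Build one count vector per text over a shared, alphabetically sorted vocabulary."""
    tokenized = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted(set(w for toks in tokenized for w in toks))
    return vocab, [[Counter(toks)[w] for w in vocab] for toks in tokenized]

tweets = [
    "I love Justin Bieber.",
    "I adore Justin Bieber.",
    "Omg, I love love love Justin Bieber!",
]
vocab, vectors = bow_vectors(tweets)
print(vocab)    # ['adore', 'bieber', 'i', 'justin', 'love', 'omg']
print(vectors)  # [[0, 1, 1, 1, 1, 0], [1, 1, 1, 1, 0, 0], [0, 1, 1, 1, 3, 1]]
```

A crude “Bieber love” score then falls straight out of the counts: summing the “love” and “adore” columns gives 1, 1 and 3 for the three tweets.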

Word Vectors

Taken from https://www.tensorflow.org/tutorials/word2vec

What if, instead, we represented each word as 100 random numbers (btw – the choice of 100 is an arbitrary parameter here)? And then figured out a way to find the best value for each of those 100 numbers, such that similar words end up with a similar set of 100 numbers and so end up ‘near’ each other in this new 100-dimensional space.

If we could do this then we would have a representation that easily captures the fact that “love” and “adore” are similar: their vector representations would have similar sets of numbers, and so a high correlation in this 100-D space.
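The usual way to measure ‘nearness’ in this space is cosine similarity. Here is a small sketch with hand-picked 4-D toy vectors (real embeddings would be 100-D and learned, not chosen by hand):

```python
import math

def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way, near 0 or negative otherwise."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# hypothetical toy embeddings - the actual numbers are made up for illustration
love  = [0.9, 0.1, 0.8, -0.2]
adore = [0.8, 0.2, 0.7, -0.1]
table = [-0.5, 0.9, -0.3, 0.6]

print(round(cosine(love, adore), 3))  # high - "love" and "adore" sit near each other
print(round(cosine(love, table), 3))  # low - unrelated words point in different directions
```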

The challenge is how to tune and pick the specific set of numbers for each word. This is where the word2vec approach comes in. If we start with a random set of numbers as the representation for each word, we can read through our text line by line (adjusting our numbers as we go) and see how good we can get at predicting neighbouring words given a specific vector representation of a middle target word.

The intuition behind this is the famous(ish) saying (I like to put on a traditional British accent when reading this):

“You shall know a word by the company it keeps” – John Rupert Firth (famous English linguist)

The idea being that words that are in some meaningful way related or informative of each other will tend to co-occur nearby in natural raw text.

So to be concrete here. If we see a sentence like:

“Canadian heart-throb Justin Bieber is at it again, the popstar was recently filmed blah blah…”

This shows that by looking at the words that tend to occur around Justin Bieber we can actually learn a lot (in a particularly specific and narrow sense, perhaps) about Justin Bieber.

The word2vec approach leverages this (and a lot of data) to essentially learn a really good combination of those 100 numbers. ‘Good’ here is defined in terms of the ability to use that vector of numbers (as our representation of a specific word) to predict neighbouring words (or vice versa).

I have simplified and glossed over a lot here. So here are some links to resources that do a much better and detailed job at explaining.

First we do some light string cleaning, like setting everything to lowercase and removing various funny characters, artifacts and things like that.

(Note: In the notebook some of this is very ugly and even specific to phrases and terms commonly used on hollywoodlife.com – but is a good real world example of raw text data you might tend to come across in the wild).

Next we do some phrase creation (here). For our use case we are mainly interested in the names of people, so we need to try to find a way to treat “Justin Bieber” as one word. Otherwise the two separate words “Justin” and “Bieber” could end up looking quite different from each other, making it much harder for the model to capture the essence of “Justin Bieber”.

Running phrase detection using gensim does a pretty good job of creating new words such as “justin_bieber” and “selena_gomez”. The idea here (oversimplifying as usual) is to look at all the documents and, if you see a pair of words co-occurring very often (above some threshold you can pick), join them together into one token (fancy name for a word) such as “justin_bieber”.
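For a feel of what phrase detection is doing under the hood, here is a toy version of the bigram-scoring idea (gensim’s actual Phrases implementation and its default min_count/threshold values differ; the scoring formula below is just one common variant, and the tiny corpus and thresholds are made up for illustration):

```python
from collections import Counter

def detect_phrases(sentences, min_count=1, threshold=1.0):
    """Flag word pairs that co-occur much more often than their individual frequencies suggest."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    total = sum(uni.values())
    phrases = set()
    for (a, b), n in bi.items():
        score = (n - min_count) * total / (uni[a] * uni[b])
        if n >= min_count and score > threshold:
            phrases.add((a, b))
    return phrases

def join_phrases(sent, phrases):
    """Rewrite a token list, gluing detected pairs into single tokens like 'justin_bieber'."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) in phrases:
            out.append(sent[i] + "_" + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

docs = [["justin", "bieber", "was", "seen"],
        ["justin", "bieber", "and", "selena", "gomez"],
        ["justin", "bieber", "fans", "love", "selena", "gomez"]]
phrases = detect_phrases(docs)
print(join_phrases(docs[1], phrases))  # ['justin_bieber', 'and', 'selena_gomez']
```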

Finally we do some specific gensim related preprocessing to get it into the format required to build a model on (here).

(Note: I found that actually passing the whole document as a ‘sentence’ to gensim, along with a wider window, gave much better results than training on random sentences pulled from the corpus. However this can be task and data specific so is something worth playing around with yourself).

Let’s Play

Visualising Vectors

So once we have our model built we can begin to look at the word vectors it has created and see if they make sense and what sort of knowledge, if any, we can extract from all this (here).

To be really concrete, here is Justin Bieber’s word vector as trained on the hollywoodlife.com corpus:

Wow – amazing no?

Ok, yeah it’s just 100 random looking numbers.

But that’s exactly the point – this particular set of numbers is now a representation of the word “justin_bieber” that captures different aspects of its meaning based on the types of words it tends to occur with.

So we should be able to use it in various ways that might be more powerful than the bag of words approach we looked at earlier.

Anyway, to help show that really there is nothing particularly scary or fancy about the output word vectors we get, here is a Tableau Public workbook where you can go and play around with these vectors for yourself.

Here are some interesting things I came across when playing with this.

If we are careful about the specific set of vectors we look at, then we might be able to infer some interesting relationships and actually ‘see’ all this a bit more visually.

So, as an example, if we filter to just look at the Kardashians – we can see some specific vector elements that are of a similar magnitude across all of the family. A sort of Kardashian DNA marker if you will allow me the indulgence 🙂

Highlighted in yellow are some vector dimensions that seem to be similar magnitude across all Kardashians.

Of course, as usual with these types of algorithms, you rarely get a really clear single number or measure that captures what it is actually doing. More often it’s the combinations of numbers and measures that the computer can pick up on much more easily than we might by looking at it.

To try to express this fact, the chart below shows more visually how the direction and magnitude of each vector element tend to move together for all the Kardashians.

The colors all tend to be either positive or negative together and sometimes even have a similar area to each other.

It’s this correlation among all 100 vector dimensions that would be a much stronger fingerprint of the Kardashians to a computer, but which is harder for us to perceive.

A counterexample might also help here – if we take a group we expect not to have much similarity, “hillary_clinton”, “justin_bieber”, and “eminem”, then we see:

Everything here looks a lot more random in terms of correlations across vector dimensions.

So it seems like maybe we actually can have some level of interpretability here depending on the specific question you ask and how you frame it. That said, the main goal of word2vec is a flexible representation that can be useful in other tasks as opposed to interpretability as to what key attributes or characteristics it has selected for in its representation.

Vector Arithmetic

One of the most interesting and well known findings from the word2vec approach was that you could do arithmetic on the resulting vectors and the results of that arithmetic implied a pretty impressive level of semantic understanding.

A typical example here (from the vectors originally published by Google and so trained on a larger more generic dataset) is if you take the vector for “King” subtract the vector for “Man” and add the vector for “Woman” you end up nearest to the vector for “Queen”. So:

King – Man + Woman = Queen

This is to say that the representation somehow figured out that an important part of what it means to be a king is to be male, that male and female are in some way opposites, and so the opposite of a king is a queen. (It would also have similarly figured out that a large part of what it means to be a queen is to be female; again, I’m being a little simplistic perhaps.)
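The arithmetic itself is nothing more exotic than element-wise subtraction and addition, followed by a nearest-neighbour lookup. Here is a sketch with hand-built 3-D toy vectors, where dimension 0 crudely stands in for “royalty” and dimension 1 for “gender” (real learned dimensions are never this clean; these numbers are invented for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# hand-built toy vectors: dim 0 ~ royalty, dim 1 ~ gender, dim 2 ~ noise
vecs = {
    "king":    [0.9,  0.9, 0.1],
    "queen":   [0.9, -0.9, 0.1],
    "man":     [0.1,  0.9, 0.2],
    "woman":   [0.1, -0.9, 0.2],
    "popstar": [0.1,  0.9, 0.9],
}

def analogy(a, b, c, vecs):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

print(analogy("king", "man", "woman", vecs))  # queen
```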

Looking at our trained model’s vectors we can see an equivalent equation of:

kim_kardashian – kanye_west + brad_pitt = ??

kim_kardashian – kanye_west + brad_pitt = angelina_jolie

Yay it works!

It’s figured out that “kim_kardashian” is to “kanye_west” as “angelina_jolie” is to “brad_pitt”, i.e. a marriage relationship.

Ok, so cool, but to be honest the truth is always a little more complicated and messy, and it won’t work for every example.

But generally if you just throw these types of equations at it, within the top few results you do tend to see things that make sense even if they are not as clean and perfect as you might like.

So here is a nice example.

taylor_swift – harry_styles + zayn_malik :

“selena_gomez”, “calvin_harris” – wth?

But we do still see “perrie_edwards” in the top group, which is reasonable as they used to date, and we see “gigi_hadid” who (a Google search later) I believe he is still dating.

As an aside, a quick investigation revealed devastating rumors that Zayn Malik may have actually cheated on Perrie Edwards with Selena Gomez! Also it seems Zayn and Calvin Harris have had major beef in the past (yes, I typed that) that extended in various ways to each other’s significant others. Juicy! I’m… going to stop this now. The point is that there actually seems to be a sort of relationship theme running through these connections, so it’s not surprising that this seems to have been, to some extent, encoded into this specific group of vectors.

Bieber’s Network

Another way to explore the resulting vectors is to pick a seed word like “justin_bieber”, find its N nearest neighbours, and for each neighbour find their N nearest neighbours. Take S such steps and the end result will be some definition of a network centered on the seed word.
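The expansion just described can be sketched as a simple breadth-first walk. Here `most_similar` is a stand-in for something like gensim’s `model.wv.most_similar`, faked below with a tiny hand-made lookup table (the neighbour lists are invented for illustration):

```python
def expand_network(seed, most_similar, n=10, steps=3):
    """Grow a graph outwards from a seed word: its n nearest neighbours, then theirs, for `steps` rounds."""
    nodes, edges, frontier = {seed}, set(), [seed]
    for _ in range(steps):
        next_frontier = []
        for word in frontier:
            for neighbour in most_similar(word, n):
                edges.add((word, neighbour))
                if neighbour not in nodes:
                    nodes.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return nodes, edges

# toy stand-in for a trained model's nearest-neighbour lookup
toy = {
    "justin_bieber":  ["selena_gomez", "hailey_baldwin"],
    "selena_gomez":   ["justin_bieber", "taylor_swift"],
    "hailey_baldwin": ["justin_bieber", "kendall_jenner"],
}
nodes, edges = expand_network("justin_bieber", lambda w, k: toy.get(w, [])[:k], n=2, steps=2)
print(sorted(nodes))
```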

So if we use a seed of “justin_bieber” with 10 nearest neighbours (n=10) and take 3 steps (S=3) we get a resulting graph that looks like this (with a little bit of cleaning to remove most non-person-type words):

From this we can see some initial direct connections, as well as easily pick out nodes in the graph that themselves have many edges. (btw – who are Cody Simpson and Austin Mahone – should I know?)

Once we have our network of relationships there are many different ways to visually lay out the graph itself. I’ve used the igraph, plotly and networkD3 libraries in R to plot all the networks, as I sometimes just find it easier to produce plots in R than Python. After playing with various layouts I found the force-directed layout below to be useful. Colors are based on community detection within the graph – so essentially we can think of these as one way to see sub-clusters within the network, as well as highlight nodes with higher betweenness, for example.

Obligatory Heatmap

An alternative way to explore Bieber’s network is to pull out the vectors corresponding to each member of the network and then do some clustering on the resulting matrix of numbers. The idea here being to find any potential relationships between the various members of the network.

Heatmaps are sexy and all, but it can be hard to visually pick out anything beyond the most and least correlated cells.

little_mix and demi_lovato seem to stand out here for some reason.

A more useful approach here is to use hierarchical clustering to build a dendrogram, which visually places the closest vectors together and can then be cut into clusters of varying size.
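For intuition, here is a naive single-linkage agglomerative sketch over a few made-up 2-D vectors (the notebook uses proper library routines on the real 100-D vectors; the merge order below is exactly what a dendrogram draws, closest pairs first):

```python
import math

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

def agglomerate(vectors):
    """Repeatedly merge the two closest clusters (single linkage); return the merge order."""
    clusters = {name: [name] for name in vectors}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = min(cosine_dist(vectors[a], vectors[b])
                        for a in clusters[keys[i]] for b in clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        merges.append((a, b))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges

# made-up 2-D 'embeddings' where the boys and the models should pair off first
vecs = {"harry_styles": [0.9, 0.1], "zayn_malik": [0.8, 0.2],
        "kendall_jenner": [0.1, 0.9], "gigi_hadid": [0.2, 0.7]}
print(agglomerate(vecs))
```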

From the above dendrogram we can see that clustering on the vectors does actually give us some nice results. We see the yellow “young supermodel” cluster, we see the green “one direction” cluster, and we also see a sort of “popstar” cluster with Demi Lovato, Ed Sheeran, Nick & Joe Jonas and some others. And generally all words that are next to each other seem reasonable.

One of the most useful aspects of the hierarchical clustering approach (over k-means, say) is that it lets the data speak for itself a bit more and gives us a way to easily see if we agree with it or not. As such, it might be more useful if we were taking a more general or higher-level view and wanted to cluster a sample of words to see if the results make sense more generally.

So as an example here is a dendrogram from a random sample of words.

In general we can see that related words tend to be placed next to each other, which is a result of word2vec capturing quite well the various ways in which words are related.

t-SNE

Another useful tool to visualise our vectors and their relationships to each other is t-SNE. This gives us a way (with some PCA in between) to move from the 100-dimensional space of our vectors to a 2-D space, where points near each other in the higher-dimensional space tend to end up near each other in the projected 2-D space, making it easier to visualise related points on a scatter plot.

So when you plot a sample of words you get what at first looks like a mess. But if you zoom in on certain sections you can see that each part of the space tends to be home to words that are in some way related.

In the example below there seemed to be a part of the space made up mostly of beauty-related words, and another made up of sports and NFL-related words.

Really this is best explored in an interactive plot which is one of the best things about plotly. Here is an interactive version of the above plot hosted on plotly to explore.

So What?

So from playing around and exploring our vector space we see aspects of our word2vec vectors that in hindsight seem quite interesting and potentially useful in many downstream tasks.

I find it pretty impressive that word2vec was able to extract all this structure and insight from essentially just ‘reading’ 100k articles, focusing on how words tend to co-occur and capturing that in a flexible numeric representation.

Key to this is the semantic aspects we seem to have captured. This could be very useful in providing a more nuanced way to represent our content as numeric features to feed into downstream predictive models of things like CTR and pageviews. It would let a model avoid getting too hung up on whether an article is specifically about Liam Payne or Zayn Malik, if what really matters is that the content is essentially “One Direction” related. The more traditional bag of words approach would not give us this flexibility.

So, all in all, embeddings and tools like word2vec, doc2vec, lda2vec etc. are increasingly becoming foundational approaches, very useful when looking to move from bags of unstructured data like text to more structured yet flexible representations that can be leveraged across many problem domains.

My Confession

I have a confession… my crontab is a mess and it’s keeping me up at night… don’t worry, it’s not really keeping me up… but you might know what I mean 🙂

We have been using Google BigQuery as our main data mart (lake, or whatever it’s now called) for almost two years. We have been loving it, as it’s super powerful with very little overhead in terms of management and infrastructure.

However, one thing I think is missing from Google BigQuery is a tool for managing and orchestrating your ETL jobs. Something like MS Data Factory but for Google Cloud.

Over the last year or so we have built up a few shell scripts that wrap the bq command line tool and basically take a series of templated SQL files, chain them together into a job, and then schedule it all via cron. This has actually been working better than I would have expected, mainly because we try to build a lot of redundancy into these jobs, so failures won’t kill any live tables and will usually be resolved the next time the job runs a few hours later.

However, it has always bugged me a little that this is not really a great way to go about things, and as the number of jobs and pipelines has grown it has become a little painful to manage.

Apache Airflow

Recently we have been playing around with Apache Airflow. This post is more about a concrete example of one way we got it working for a specific use case that I did not really find any obvious existing examples of (there is actually great documentation and lots of examples, but there is a layer of Airflow-specific concepts and terminology one needs to nerd up on first).

So I won’t talk much about Airflow in general, except to give some references I found very good for beginners:

The specific use case I was trying to figure out was creating a parameterised pipeline that takes the templated SQL files we already have and runs them for each of our lines of business or “lobs” (e.g. variety.com, hollywoodlife.com, wwd.com etc). This is a pretty common ETL design pattern, so hopefully others might find this useful.

A concrete example at PMC is some post-processing we do of the raw Google Analytics data we get exported into BigQuery each day. The pipeline basically takes the raw [<view_id>.ga_sessions_yyyymmdd] tables for each day, applies some transformations and enrichments, and then creates a table in BigQuery for each lob for each day. So for example [wwd.ga_data_20170322] would be the end result: a raw hit-level table that we have enriched in various ways. This is essentially one of our main analytical tables and powers lots of downstream analytics across all lobs.
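To make the naming pattern concrete, here is a tiny sketch (a hypothetical helper, not our actual pipeline code) generating the per-lob, per-day target table names:

```python
from datetime import date

def target_tables(lobs, day):
    """One enriched output table per lob per day, e.g. wwd.ga_data_20170322."""
    suffix = day.strftime("%Y%m%d")
    return ["%s.ga_data_%s" % (lob, suffix) for lob in lobs]

print(target_tables(["wwd", "variety", "hollywoodlife"], date(2017, 3, 22)))
# ['wwd.ga_data_20170322', 'variety.ga_data_20170322', 'hollywoodlife.ga_data_20170322']
```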

Parameterized DAG Example

Below is an example of one way we got this working by having a single collection of SQL template files (to represent the actual steps of processing we want to do in BigQuery) and a .py script to build and define the expanded DAG with parallelism across all lobs while maintaining dependencies between each lob specific task.

Disclaimer: I’m pretty new to Airflow, and one thing I’ve found is that due to the programmatic nature of how we build DAGs there are lots of ways to do things (this is actually one of its strongest selling points). I’m sure there are better ways to go about what we are showing here (feel free to add suggestions in the comments 🙂).

So here is an example DAG definition Python script, which lives in its own subfolder within our Airflow DAGs folder. (Prettier formatting on Github here.) I’ve tried to go overboard on the commenting for line-by-line clarity.

"""
### My first dag to play around with airflow and bigquery.
"""
# imports
from airflow import DAG
from datetime import datetime, timedelta

# we need to import the bigquery operator - there are lots of cool operators for different tasks and systems, you can also build your own
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# create a dictionary of default typical args to pass to the dag
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,  # does this dag depend on the previous run of the dag? best practice is to try have dags not depend on state or results of a previous run
    'start_date': datetime(2017, 3, 22),  # from what date do you want to pretend this dag was born on? by default airflow will try to backfill - be careful
    'email_on_failure': True,  # should we send emails on failure?
    'email': ['andrew.maguire@pmc.com'],  # who to email if it fails i.e. me :)
    'retries': 1,  # if it fails how many times should we retry?
    'retry_delay': timedelta(minutes=2),  # if we need to retry how long should we wait before retrying?
}

# define the dag
dag = DAG(
    'my_bigquery_dag',  # give the dag a name
    schedule_interval='@once',  # define how often you want it to run - you can pass cron expressions here
    default_args=default_args,  # pass the default args defined above or override them here if you want this dag to behave a little differently
)

# define list of lobs we want to run for
lobs = ["lob001", "lob002", "lob003"]

# loop through the lobs we want to use to build up our dag
for lob in lobs:
    # define the first task, in our case a bigquery operator
    bq_task_1 = BigQueryOperator(
        dag=dag,  # need to tell airflow that this task belongs to the dag we defined above
        task_id='my_bq_task_1_' + lob,  # task ids must be unique within the dag
        bql='my_qry_1.sql',  # the actual sql command we want to run on bigquery is in this file in the same folder. it is also templated
        params={"lob": lob},  # the sql file above has a template in it for a 'lob' parameter - this is how we pass it in
        destination_dataset_table='airflow.' + lob + '_test_task1',  # we also in this example want our target table to be lob and task specific
        write_disposition='WRITE_TRUNCATE',  # drop and recreate this table each time, you could use other options here
        bigquery_conn_id='my_gcp_connection',  # this is the airflow connection to gcp we defined in the front end. More info here: https://github.com/alexvanboxel/airflow-gcp-examples
    )
    # add documentation for what this task does - this will be displayed in the Airflow UI
    bq_task_1.doc_md = """\
    Append a "Hello World!" message string to the table [airflow.<lob>_test_task1]
    """

    # define the second task, in our case another bigquery operator
    bq_task_2 = BigQueryOperator(
        dag=dag,
        task_id='my_bq_task_2_' + lob,  # task ids must be unique within the dag
        bql='my_qry_2.sql',  # the templated sql for the second step, in the same folder
        params={"lob": lob},
        destination_dataset_table='airflow.' + lob + '_test_task2',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id='my_gcp_connection',
    )
    # add documentation for what this task does - this will be displayed in the Airflow UI
    bq_task_2.doc_md = """\
    Append a "Goodbye World!" message string to the table [airflow.<lob>_test_task2]
    """

    # set dependencies so for example 'bq_task_2' won't start until 'bq_task_1' has completed successfully
    bq_task_2.set_upstream(bq_task_1)

So the above file makes reference to our SQL template files, which actually hold the business logic of what we really want to do in BigQuery; everything above is really just plumbing.

Below is what is in the ‘my_qry_1.sql’ file – it’s just dummy code for this example. (Again, Github link, and apologies – I don’t know how to properly format these blocks as code.)

-- this is just some dummy bql to create a little example table
SELECT
string('{{ params.lob }}') as lob, -- here we define the jinja template for lob
string('{{ ds }}') as airflow_execution_date, -- here we leverage predefined variables airflow has more info: http://airflow.readthedocs.io/en/latest/code.html?highlight=ds_nodash#default-variables
string('{{ ds_nodash }}') as airflow_execution_date_yyyymmdd, -- here we leverage predefined variables airflow has more info: http://airflow.readthedocs.io/en/latest/code.html?highlight=ds_nodash#default-variables
string('{{ ts }}') as airflow_execution_timestamp, -- here we leverage predefined variables airflow has more info: http://airflow.readthedocs.io/en/latest/code.html?highlight=ds_nodash#default-variables
current_timestamp() as bq_timestamp, -- get the current time from bigquery so we can see any differences between airflow execution date (in the case of backfill's) as opposed to when we actually ran this code
'Hello World!' as msg -- just a dummy field

With the above code in a folder within our specified airflow DAGs folder we can see how Airflow picks up this DAG.

Gratuitous Airflow UI screenshots coming right up…

We can see the parallel nature of the same tasks but just broken out for each lob in the graph view.

If we look at the tree view we can see this in another way along with execution status over time.

And finally if we look at the gantt view we can see that we do indeed have the parallelism we were after with task 1 being run concurrently for each lob and then similar concurrency for task 2.

So if we trigger this DAG from the airflow cli with:

$ airflow trigger_dag my_bigquery_dag

We can see the resulting data and tables in BigQuery.

And that’s pretty much it. It’s still early days, but I’m hoping the fact that we already have templated .sql files at the core of our existing approach will make porting over to Airflow easy enough (especially as most of our jobs run in BigQuery, so the LocalExecutor should be enough).

We might do a follow-up post in a month or two to share some deeper learnings once we’ve used it more. I’m particularly excited about using Airflow pretty much anywhere and everywhere I can. One neat example that jumps to mind is machine learning pipelines, where we tend to use BigQuery for the data crunching and H2O for the model building and learning; Airflow seems like a great way to stitch it all together more cohesively.