Data Bloke

Menu

Category Archives: Data

I needed a simple and efficient unsupervised graph clustering algorithm. MCL is a bit heavy for my needs and I was after something that was available in pure Python (because of environment access and compatibility issues) pretty much immediately. There isn’t exactly a lot of choice! I stumbled across Chinese Whispers an elegant and simple solution. I couldn’t find a simple implementation in Python so I created one myself using the formulas on the original paper. It uses NetworkX (for convenience – you could easily implement without this) and is incredibly fast.

import networkx
# build nodes and edge lists
nodes = [
(1,{'attr1':1}),
(2,{'attr1':1})
...
]
edges = [
(1,2,{'weight': 0.732})
....
]
# initialize the graph
G = nx.Graph()
# Add nodes
G.add_nodes_from(nodes)
# CW needs an arbitrary, unique class for each node before initialisation
# Here I use the ID of the node since I know it's unique
# You could use a random number or a counter or anything really
for n, v in enumerate(nodes):
G.node[n]['class'] = v
# add edges
G.add_edges_from(edges)
# run Chinese Whispers
# I default to 10 iterations. This number is usually low.
# After a certain number (individual to the data set) no further clustering occurs
iterations = 10
for z in range(0,iterations):
gn = G.nodes()
# I randomize the nodes to give me an arbitrary start point
shuffle(gn)
for node in gn:
neighs = G[node]
classes = {}
# do an inventory of the given nodes neighbours and edge weights
for ne in neighs:
if isinstance(ne, int) :
if G.node[ne]['class'] in classes:
classes[G.node[ne]['class']] += G[node][ne]['weight']
else:
classes[G.node[ne]['class']] = G[node][ne]['weight']
# find the class with the highest edge weight sum
max = 0
maxclass = 0
for c in classes:
if classes[c] > max:
max = classes[c]
maxclass = c
# set the class of target node to the winning local class
G.node[node]['class'] = maxclass

Given its simplicity it’s a remarkably effective algorithm. The image below shows a Gephi visualisation using the ForceAtlas2 algorithm. The node colours show the clusters identified by CW.

As you can see, the two algorithms broadly agree. CW took seconds to run whereas Gephi taxed my CPU to the max for many minutes (actually I think it was 10’s of minutes).

If you can fit you data into a graphical form, this is a very viable alternative to K-means style clustering, made even more attractive (for certain tasks) by the fact that it’s parameter free and thus you don’t need to pre-define the number of clusters (the bane of many a data scientist). It just finds the clusters that are there. This is, of course, also a draw-back in some circumstances. For example, if your data is heavily interlinked (a high degree to cardinality ratio) with no natural subgraphs, CW may just find a single cluster where you can demand K-means go find some. You can get around this to extent by relaxing your edge weight threshold (i.e. induce a subgraph with only edge weights greater than a threshold, then cluster that) an approach prone to graph fragmentation, which may or may not be desirable. It’s also prone to finding micro-clusters which in many cases could be construed as noise.

For my purposes it works incredibly well and I assume scales well. So let’s all turn to the left and tell the next person all about it.

I played with Google’s magical word2vec neural network some time ago. I found it interesting but I had no immediate use for it, so I filed it in the ‘must remember to check this out further’ section of my disorganised brain. More recently I found myself wrestling with topic modelling resulting in a near terminal headlock. I have a data set of very short (say, 2 – 50 word) documents that I want to group thematically. LDA and it’s various cousins were struggling with this task. There are few of reasons for this. Firstly, even though LDA, LSI etc. perform a sort of vector dimensionality reduction, this only works to a point. A three word sentence contains barely enough information to derive a single topic from, let alone a distribution, and consequently my topic distributions were too sparse to do much with. Secondly, and exacerbating the first point, the topic space across the corpus is pretty limited and somewhat homogeneous. Thirdly, and exacerbating the other two points, there is marked lack of variety in the language used across the corpus, and from one topic to the next. Humans were struggling to distinguish one theme from another, so what chance did a computer have? I was just about to give it up as a lost cause when I remembered that word2vec has some similarity measures, and a vague recollection of someone suggesting it could be used for topic modelling. My basic theory here is that if I can compare sentences for similarity I should be able to group them via that similarity (I’m using a graph clustering model to do this). So as a last ditch effort I ran word2vec over my corpus and started playing around to see if it could make sense of my data. The results were phenomenal! The similarity graph created from a simple, untuned word2vec model outperformed the other models at unsupervised classification 10 fold at least – where before I saw only loose semantic groupings with many mis-grouped items, I now saw empirically cohesive and accurate groups. As pleased as I was by this turn of events, I didn’t understand why Google’s simple neural network worked at all for my purposes, let alone outperformed everything else. So I bathed myself in warm, welcoming, buoyant sea of word2vec’s vector space. As I did so, I started to appreciate word2vec’s spooky action.

It’s not my intention to repeat what’s already been said about word2vec but merely to state my own findings. I’ll start with the crux of my main confusion about the model. I’m using to to interpret and group customer feedback. My understanding of word2vec is that it groups words that are semantically similar, or at least proximal. You can explore the most similar words to any given word or words easily in word2vec. My model was trained on a million or so items of customer feedback from a website survey. I should be swift here to mention that they have an excellent site that works well for the vast majority of customers, but like every website, doesn’t preform well for everyone all the time. So two words that occur in close proximity a fair amount are “site” and “slow” (it’s also worth pointing out that “site” actually co-exists with “fast” more often in the same corpus, however, we’re looking for problems to solve, not praise to lap up). However, when I looked at the top 50 words in closest proximity with “site”, “slow” was nowhere to be found. I get loads of similes of the the word “site” (e.g. website). And and all the words most closely related to “slow” are other words that loosely mean slow (e.g. sluggish). It was obvious to me at the point what word2vec was actually doing in this instance, and that my expectation was not aligned with how it works, but this lead me to a bit of an epiphany – holy shit, word2vec understands actual semantic relationships between words without any formal teaching; purely by inference! To put it another way, it groups together words it rarely if ever sees together, even in the same document. That’s pretty clever for a simple neural network with only a single hidden layer. I also picked out similarities in misspelled words. This is deceptively helpful since one of my core frustrations with the data sets (at least for the purposes of supervised learning and approach I’m also using) was that people don’t tend to put much effort into spell checking what they enter into an online survey. That’s some pretty spooky action!

So, on the surface at least, it seemed that the reason my similarity scores were so on the mark was that word2vec was cleverly able to pair of similar words in a sentence and thus was able to create robust similarity scores. However, this explanation is a bit Newton to Einstein – a good explanation, but not the whole story. Word2vec’s spooky action is a lot more abstract and, dare I say it, mysterious. This deserves a little more probing. The model that word2vec actually produces is actually the single hidden neural network layer previously referenced. It consists of a n x m vector space where n is the number of words in your corpus (optionally pruned to get rid of infrequent or too frequent cruft) and m is the an arbitrary number of floating point dimensions usually some where between 100 and 700. These dimensions are somewhat intractable from a human perspective. They are quanta of a continuous abstract vector space that maps a territory of words in a sort of semantic relief map. Words of similar meaning exist in the same general area of the map. Tribes of words exist in a single area just like tribes of people do in the real world. This extends past similes to words of the same type however. I ingested the prebuilt vector space on the word2vec home page (the 300 dimension Google news one) and did some exploring. I discovered various bits of spooky action:

Names of musicians (Alice Copper, Ozzy Osbourne, and David Bowie) coexist together with bands (Metallica, Motorhead) all of which bare no relation to, say “cheese”

Names of US presidents occupy the same general space, with small offsets that seem to suggest political affiliations (more research needed here), as do scientists with a little evidence that they cluster with their respective disciplines

Parts of the brain (and indeed neurotransmitters) all occupy the same space, and a cursory appraisal suggests that closer proximity exists form those parts closer to each other (hypothalamus, nucleus accumbens and midbrain all cluster very closely). One assumes that this is the same for all anatomical parts

It has no sense of opposites – fast and slow cluster very closely together, black and white even more so

Words cluster together when they are notionally similar rather than the same type of word, so “black” and “blackest” cluster close together. There seems no definable continuous space for, say, nouns, proper nouns, adjectives etc.

There’s a sense that it clusters words when they seem interchangeable to a greater lesser degree. The mathematical offset described with the famous “king – man + woman = queen” example seems to reinforce this. The spacial significance comes further to light when you consider the two main ways to interrogate the vector space. The standard way (al la Gensim and others) is to scan the entire vector space for vectors with the closest cosine similarity to the vector of the target word (which is usually something in the same proximity). When you’re dealing with multiple words (e.g. n-grams, sentences or even whole documents) the approach is simply to find the exact vector for each word, then take a column level average across those words to create a new vector which then goes through the same cosine similarity malarchy. When comparing one sentence/document to another, we take the same average and get the cosine between the two. The second approach looks to take an actual numerical offset from one words, or collection of words, to another as per Kusner et al. In combination these two approaches make the topological aspect of the model more salient still suggesting that the word embeddings exist in some intangible semantic space-time of numerous dimensions and geometry which, in a very real sense, is exactly what it is. A simile generator may be a convenient way to describe it, but the reality is much more complex and elegant. So back to my original idea that word2vec was pairing off words in a sentence. This is not the case at all. Using the cosine similarity approach as I was, what was actually happening is that I was generating vector point constructed from an average of a collection words, then go find other word or words that are proximally (spatially) close. Words are never compared, we actually just go find the tribe that has the most DNA in common with my sentence, as it were.

So how does it figure this stuff out? Well, some much better mathematicians than me already concluded that they “don’t really know”, so what hope have I got? I can just try and make some sense of what I observe. Much has been made of the difference between CBOW and Skip-gram as way to evaluate the text, however there seems little to suggests that either contributes to the overall spooky action, rather than build upon the the mysterious workings of a simple neural network. The information is in there, in all the written text, and nothing is inferred that isn’t visibly available – there’s no extrapolation going on here, no logic. Word2vec doesn’t read it or understand the text, it just picks up patterns. Interpretation as an adjunct to semantic awareness is a job for a future, much more sophisticated algorithm or model or AI. The best analogy I can think of for word2vec is the very mechanism that neural networks try and emulate – the human brain. In particular long term memory. It’s easy to imagine that, as information flows in through our senses, it is brokered into similar abstract representations in the cerebral cortex, then either reinforced (learned) or forgotten. We know that the best way to remember something is to relate it to something to something we already have a good sense of. Then when you recall that thing, you also recall a sense of the other stuff that you squirrelled it away with. Thus when you recall Alice Cooper from memory, Ozzy Osborne sometimes emerges with him, along with a bat or a chicken maybe, but never a block of stilton.

I don’t know exactly how long it takes to read the 50 or so words attributed to me in a recent Guardian article, but I doubt that it equates to 5 minutes, probably more like 17 seconds, meaning that I still have the larger portion of my 5 minutes of fame to come. What wonders await is anyone’s guess, but in the meantime I will juxtapose those 17 seconds of written text with a note of clarification.

On enthusiastically posting a snippet of said article on Twitter (and while I sat back and basked in the adulation) an old friend and colleague and data guru @hankyjohn responded to one of my points with a contradiction. Specifically, I said:

But Loveless said he associated the idea of having a single customer view with “big, monolithic, old school, relational databases, which are horribly hard to manage and incredibly expensive”. Just collecting data on customers for its own sake is useless unless you can do something useful with it, he said: “You don’t need to understand everything about the customer, you don’t need to collect and structure everything about the customer, you just need to have a sense about them.” He said the new data management platforms do not promise a single customer view, just a general view of what that person likes and does.

To which @hankyjohn responded (quite correctly):

@alexmloveless good work. Can I disagree though? False dichotomy for me. Traditional data warehousing can coexist nicely with other stores.

There followed a brief exchange in which I heroically clarified my point. Rather than subject you to those stilted 140 character info-barks, I’ll summarise the crux of my points here.

Although I completely stand by the point illustrated in that article, it sits removed from a broader context that would have been apparent were you in the room at the time. The wider point is this: Since the days of when advertising was first invented (by the people on Mad Men) marketers and the like have endeavoured to understand their customers. Such understanding, for the vast majority of the intervening period, was derived from stuff we can learn from any detail we can collect about them (name, address, demos etc.) and performance data (what works and what doesn’t). The former data probably existed on bits of card in filing cabinets for a long time before eventually being diligently transmogrified into their digital equivalent when computers became a thing. These digital equivalents eventually required a structured form so that they could be easily accessed and queried for the purposes of selling us stuff that we don’t need. The medium for this structure was the humble database, of which for a long time there was really only one form worth talking about, the RDBMS – relational databases. Relational databases are marvellous. They impose structure on unruly data and make it easy to access, analyse and aggregate. Thus, modern marketing became used to using these things to store their customer data, which needed to be kept clean and tidy. This was how you knew who your customers were – you kept records of them in a big old RDBMS called “Customer DB” or “CRM Store” or something equally as enticing. Problem is, since there were many different sources of data, companies frequently ended up with multiple stores, often storing overlapping data sets. Quite rightly at some point marketers and IT people alike started saying things like “wouldn’t it be great if all this data was deduplicated and stored in once place” and thus was born the dreaded Single Customer View.

Roll on a decade and SCV projects that were started on the back of wishes from marketers are still incomplete and running up legacy costs of tens of millions. Meanwhile, while failing to deliver on the meagre requirements of the time, we now have all these bloody channels and social networks and mobile devices and internets-of-things and Bigness of Data. Asking IT to justbloodywell get me a dataset I can trust is trouble enough let alone incorporating twitter handles and cross device awareness. Yet marketers are still asking such things of an SCV thinking that this once-so-called magic data bullet is actually the right place for such things.

The belief is still widely held that customer data really only can live in a big ole monolithic relational data store. This comes from a lack of distinction perpetuated on both the marketers and IT people’s part. The distinction is between Master Data Management (MDM) and, well, all the other types of data. It’s a distinction between hard, indelible customer data for the purposes of hard, lofty uses, vs the sort of fuzzy profiling that proliferates across the web and haunts you with depressing display adverts for TV’s you had briefly considered buying before that whopping council tax bill came in.

Modern marketing data is not about coherent customer information, it’s about cookies and inferred data. When marketing to (or at) someone it’s more useful to know their gender than their name. A mobile geolocation is better than a postcode. A constantly evolving stream of inferred preference data is better than a mosaic classification. This is all achieved by a web of data collection technologies and services that use the humble cookie as their primary currency and couldn’t give a hoot what your name is. You could try and mash all this lovely data into your SCV but you’d end up changing your schema every two weeks and probably hit performance/scale issues pretty quickly. Plus it’ll take 6 years and countless more million quid when you could have invested in one of those mystical unicorn DMP thingies. In such a circumstance your beloved SCV data would mostly be flowing in the other direction and consequently making an anonymised cookie store the most complete view of your customer data. God forbid!

Now don’t go rushing off to Adobe or Oracle while instructing your IT team to delete that pesky SCV. You probably need it. Email comms would not be possible without it. And if you have a more tangible relationship with your customer (like, you sell stuff to them) you need a master record with accurate, non-volatile information about them that’s nicely structured, secure and private. This is Master Data Management, and only relates secondarily to marketing. And as the learned @hankjohn correctly points out, it sits happily and harmoniously in a mature data ecosystem with anarchic jonny-come-latelys like DMPs (and a bunch of other sinister data entities).

This was the thrust of my grumpy diatribe at the Guardian offices, which perhaps doesn’t come through too well in the article. I wasn’t misquoted as such, just underquoted. The moral of this story? Write more about me.