Data Recipes


Thursday, November 14, 2013

It's a bit silly trying to motivate a discussion of linear regression. It's everywhere. Linear regression is typically the first pass step for understanding a dataset. Is there a linear relationship between my variables? Maybe. Let's try linear regression! In other, and less precise, words we just fit a damn line to it. Suffice it to say linear regression is one of those tools for data analysis that's not optional. You must know it.

In fact, I'm going to assume just that. That you've used linear regression in other contexts before and understand its utility. The issue at hand is how to parallelize linear regression. Why? Well, suppose you have billions of feature vectors in your data set, each with thousands of features (columns), and you want to use all of them because, why not? Suppose it doesn't fit on one machine. Now, there exists a project to address this specifically, vowpal wabbit, which you should most certainly check out, but that I'm not going to talk about. Instead, the idea is to use Apache Pig. The reason for implementing it with Pig, rather than using an existing tool, is mostly for illustration. Linear regression with pig brings up several design and implementation details that I believe you'll face when doing almost any reasonably useful machine learning at scale. In other words, how do I wire all this shit together?

linear regression

It's important that I do at least explain a bit of terminology so we're all together in this. So, rather than jump for the most general explanation immediately (why do people do that?) let's talk about something real. Suppose you've measured the current running through a circuit while slowly decreasing the resistance. You would expect the current to increase as you decrease the resistance, linearly in \(\frac{1}{R}\) (Ohm's law). In other words,

\begin{eqnarray*}I=\frac{V}{R}\end{eqnarray*}

To verify Ohm's Law (not a bad idea, I mean, maybe you're living in a dream world where physics is different and you want to know for certain...) you'd record the current \(I\) and the resistance \(R\) at every measurement while holding the voltage \(V\) constant. You'd then fit a line to the data or, more specifically, to \(I\) as a function of \(\frac{1}{R}\), and, if all went well, find that the slope of said line was equal to the voltage.

In the machine learning nomenclature the current would be called the response or target variable. All the responses together form a vector \(y\) called the response vector. The resistance would be called the feature or observation. And, if you recorded more than just the resistance, say, the temperature, then for every response you'd have a collection of features or a feature vector. All the feature vectors together form a matrix \(X\). The goal of linear regression is to find the best set of weights \(w\) that, when used to form a linear combination of the features, creates a vector that is as close as possible to the response vector. So the problem can be phrased as an optimization problem. Here the function we'll minimize is the mean squared error (the square of the distance between the weighted features and the response). Mathematically the squared error, for one feature vector \(x_{i}\) of length \(M\), can be written as:

\begin{eqnarray*}error^2=(y_{i}-\sum_{j=0}^{M-1}w_{j}x_{i,j})^2\end{eqnarray*}
where \(x_{i,0}=1\) by definition, so that \(w_{0}\) acts as the intercept.
So the mean squared error (mse), when we've got \(N\) measurements (feature vectors), is then:
\begin{eqnarray*}mse(w)=\frac{1}{N}\sum_{i=1}^N(y_{i}-\sum_{j=0}^{M-1}w_{j}x_{i,j})^2\end{eqnarray*}
So now the question is, exactly how are we going to minimize the mse by varying the weights? Well, it turns out there's a method called gradient descent. That is, the mse decreases fastest if we start with a given set of weights and travel in the direction of the negative gradient of the mse for those weights. In other words:
\begin{eqnarray*}w_{new}=w-\alpha\nabla{mse(w)}\end{eqnarray*}
Where \(\alpha\) is the magnitude of the step size. What this gives us is a way to update the weights until \(w_{new}\) doesn't really change much. Once the weights converge we're done.

algorithm

Alright, now that we've got a rule for updating weights, we can write down the algorithm.

1. Initialize the weights, one per feature, randomly

2. Update the weights by subtracting the gradient of the mse

3. Repeat 2 until converged
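The three steps above can be sketched on a single machine in a few lines of plain Python. This is only an illustration of the update rule, not the Pig implementation; the function names, the step size, the tolerance, and the toy data are all assumptions made up for the example.

```python
import random

def gradient(weights, X, y):
    """Gradient of the mse with respect to each weight."""
    n = len(X)
    grad = [0.0] * len(weights)
    for xi, yi in zip(X, y):
        # error for this feature vector: response minus weighted features
        err = yi - sum(w * x for w, x in zip(weights, xi))
        for j, x in enumerate(xi):
            grad[j] += -2.0 * err * x / n
    return grad

def gradient_descent(X, y, alpha=0.1, tol=1e-9, max_iter=100000, seed=0):
    rng = random.Random(seed)
    # 1. initialize the weights, one per feature, randomly
    weights = [rng.random() for _ in X[0]]
    for _ in range(max_iter):
        # 2. update the weights by stepping against the gradient
        grad = gradient(weights, X, y)
        new_weights = [w - alpha * g for w, g in zip(weights, grad)]
        change = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights
        # 3. repeat until the weights stop moving
        if change < tol:
            break
    return weights

# Toy, noise-free data: y = 2x with a constant first column of 1.0.
X = [(1.0, i / 10.0) for i in range(11)]
y = [2.0 * x1 for _, x1 in X]
fitted = gradient_descent(X, y)
```

A fixed step size is the simplest choice. Too large and the weights blow up, too small and you'll be waiting a while.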

implementation

Ok great. Let's get the pieces together.

pig

Pig is going to do most of the real work. There's two main steps involved. The first step, and one that varies strongly from domain to domain, problem to problem, is loading your data and transforming it into something that a generic algorithm can handle. The second, and more interesting, is the implementation of gradient descent of the mse itself.

Here's the implementation of the gradient descent portion. I'll go over each relation in detail.

Go ahead and save that in a directory called 'macros'. Since gradient descent of the mean squared error for the purposes of creating a linear model is a generic problem, it makes sense to implement it as a pig macro.

In lines 19-25 we're attaching the weights to every feature vector. The Zip UDF, which can be found on github, receives the weights as a tuple and the feature vector as a tuple. The output is a bag of new tuples which contains (weight,feature,dimension). Think of Zip like a zipper where it matches the weights to their corresponding features. Importantly, the dimension (index in the input tuples) is returned as well.

Something to notice about this first bit is the scalar cast. Zip receives the entire weights relation as a single object. Since the weights relation is only a single tuple anyway, this is a good thing. It prevents us from doing something silly like a replicated join on a constant (which works but clutters the logic) or, worse, a cross.

Next, in lines 32-39, we're computing a portion of the gradient of the mse. The reason for the Zip udf in the first step was so the nested projection and sum to compute the dot product of the weights with the features works out cleanly.

Then, on line 42, the full gradient of the mse is computed by multiplying each feature by the error. It might not seem obvious when written like that, but the partial derivatives of the mse with respect to each weight make it work out like this. How nice.
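For the record, here's the step that makes "multiply each feature by the error" fall out. Writing the prediction for feature vector \(x_{i}\) as the dot product \(w\cdot x_{i}\), the partial derivative of the mse with respect to a single weight \(w_{k}\) is:

\begin{eqnarray*}\frac{\partial}{\partial w_{k}}mse(w)=-\frac{2}{N}\sum_{i=1}^N(y_{i}-w\cdot x_{i})x_{i,k}\end{eqnarray*}

So each feature vector contributes its error times its \(k\)-th feature, and the update for one weight reads:

\begin{eqnarray*}w_{k,new}=w_{k}+\frac{2\alpha}{N}\sum_{i=1}^N(y_{i}-w\cdot x_{i})x_{i,k}\end{eqnarray*}

I'm assuming here that the factor of 2 is kept explicit. It could just as well be folded into \(\alpha\).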

Lines 47-54 are where the action happens. By action I mean we'll actually trigger a reduce job since everything up to this point has been map only. This is where the partial derivative bits come together. That is, we're grouping by dimension and weight (which, it turns out, is precisely the thing we're differentiating with respect to) and summing each feature vector's contribution to the partial derivative along that dimension. The factor bit, on line 48, is the multiplier for the gradient. It includes the normalization term (we normalize by the number of feature vectors since we're differentiating the mean squared error) and the step size. The result of this step is a new weight for each dimension.

An important thing to note here is that there are only a number of partitions equivalent to the number of features. What this means is that each reduce task would get a potentially very large amount of data. Fortunately for us COUNT and SUM are both algebraic and so Pig will use combiners, hurray!, drastically reducing the amount of data sent to each reduce task.
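To see why algebraic functions matter here, it helps to fake the map, combine, and reduce phases in plain Python. This is a sketch of the idea only, not how Pig implements combiners; the function names and the four toy rows are made up for illustration.

```python
from collections import defaultdict

def contributions(weights, features, response):
    """Map side: one (dimension, error * feature) pair per feature."""
    error = response - sum(w * x for w, x in zip(weights, features))
    return [(dim, error * x) for dim, x in enumerate(features)]

def partial_sums(pairs):
    """Combiner: collapse many pairs into one (sum, count) per dimension."""
    acc = defaultdict(lambda: [0.0, 0])
    for dim, value in pairs:
        acc[dim][0] += value
        acc[dim][1] += 1
    return {dim: tuple(v) for dim, v in acc.items()}

def merge(partials):
    """Reducer: add up the partial (sum, count) pairs per dimension."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for dim, (s, c) in part.items():
            total[dim][0] += s
            total[dim][1] += c
    return {dim: tuple(v) for dim, v in total.items()}

weights = (0.5, 1.0)
rows = [((1.0, 0.0), 0.0), ((1.0, 1.0), 2.0),
        ((1.0, 2.0), 4.0), ((1.0, 3.0), 6.0)]
pairs_by_row = [contributions(weights, x, y) for x, y in rows]

# Two pretend map tasks, each combining its own rows before the shuffle.
task_a = partial_sums(p for row in pairs_by_row[:2] for p in row)
task_b = partial_sums(p for row in pairs_by_row[2:] for p in row)
combined = merge([task_a, task_b])

# Same totals as shipping every single pair to the reducer uncombined.
direct = partial_sums(p for row in pairs_by_row for p in row)
```

Each map task ships one (sum, count) pair per dimension instead of one record per feature vector per dimension, which is where the savings come from.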

Finally, on lines 60-65, we reconstruct a new tuple with the new weights and return it. The schema of this tuple should be the same as the input weight tuple.

gory details

So now that we have a way to do it, we should do it. Right, I mean if we can we should. Isn't that how technology works...

I'm going to go ahead and do a completely contrived example. The reason is so that I can visualize the results.

I've created some data, called data.tsv, which satisfies the following:

\begin{eqnarray*}y=0.3r(x) + 2x\end{eqnarray*}

where \(r(x)\) is a noise term. And here's the plot:

So we have two feature columns, (1.0, \(x\)), that we're trying to find the weights for. Since we've cooked this example (we already know the relationship between \(x\) and \(y\)) we expect the weights for those columns to be (0.0,2.0) if all goes well.

Now that we've got some data to work with, we'll need to write a bit more pig to load the data up and run it through our gradient descent macro.

There's really not much exciting going on here. We're loading the features and weights and rearranging them to satisfy the schema that the gradient descent macro expects. The other details are related to the driver script.

The driver script is next. Since our algorithm is iterative and pig itself has no support for iteration, we're going to embed the pig script into a python (jython) program. Here's what that looks like:

There's a lot going on here (it's cluttered looking because Pig doesn't allow class or function definitions in driver scripts) that's not that interesting and pretty easy to understand so I'll just go over the high points:

We initialize the weights randomly and write a .pig_schema file as well. The reason for writing the schema file is so that it's unnecessary to write the redundant schema for weights in the pig script itself.

We want the driver to be agnostic to whether we're running in local mode or mapreduce mode. Thus we copy the weights to the filesystem using the copyFromLocal fs command. In local mode this just puts the initial weights on the local fs whereas in mapreduce mode this'll place them on the hdfs.

Then we iterate until the convergence criterion is met. Each iteration, details like copying the schema and pulling down the weights to compute distances are taken care of.

A moving average is maintained over the past 25 weights. Iteration stops when the new weight is less than EPS away from the average.
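The stopping rule can be sketched on its own, outside the driver. This is a minimal version under a few assumptions that the description above doesn't pin down: the distance is Euclidean, the check only starts once the window is full, and the average is taken before the new weights are added. The class name and the EPS value are hypothetical.

```python
from collections import deque
import math

class MovingAverageConvergence:
    def __init__(self, window=25, eps=1e-6):
        self.history = deque(maxlen=window)  # the last `window` weight tuples
        self.eps = eps

    def update(self, weights):
        """Record new weights; return True once they sit within eps
        of the average of the previous `window` weights."""
        converged = False
        if len(self.history) == self.history.maxlen:
            count = len(self.history)
            avg = [sum(w[d] for w in self.history) / count
                   for d in range(len(weights))]
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(weights, avg)))
            converged = dist < self.eps
        self.history.append(tuple(weights))
        return converged

# A fake sequence of weights that settles down geometrically.
checker = MovingAverageConvergence(window=25, eps=1e-6)
stopped_at = None
for n in range(200):
    if checker.update((2.0 + 0.5 ** n,)):
        stopped_at = n
        break
```

Comparing against an average rather than just the previous weights makes the test less jumpy when individual steps are tiny but the weights are still drifting.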

Aside from that there's an interesting -bug?- that comes up as a result of this. Notice on line 70 how the new variables are bound each iteration? Well the underlying PigContext object just keeps adding these new variables instead of overwriting them. What this means is that, after a couple thousand iterations, depending on your PIG_HEAPSIZE env variable, the driver script will crash from an out of memory error. Yikes.

run it

That is, we use the pig command to launch the driver program. The arguments to the driver script itself follow. We're running the fit_line.pig script where the data dir (where intermediate weights will go) exists under 'data' and the input data, data.tsv, should exist there as well. The '2' indicates we've got two weights (w0, and w1). The pig heapsize is to deal with the bug mentioned in the previous section.

On my laptop, in local mode, the convergence criterion was met after 1527 iterations.

results

After 1527 iterations the weights ended up as (0.0040666759502969215, 2.0068029003414014) which is exactly what we'd expect. In other words:

\begin{eqnarray*}y=0.0041 + 2.0068x\end{eqnarray*}

which is the 'best fit' line to our original:
\begin{eqnarray*}y=0.3r(x) + 2x\end{eqnarray*}

Monday, September 9, 2013

Recently I've been wanting to get ahold of the logical plan (a graph representation) for a pig script without running it. The largest reason is that the logical plan is a fairly language and platform agnostic representation of a dataflow. Once you have the logical plan I can think of several fun things you could do with it:

Serialize it as JSON and send it to any number of arbitrary tools

Visualize it in a web browser

Edit it with a web app

Compile it into an execution (physical) plan for arbitrary (non-hadoop map-reduce) backend frameworks that make sense (storm, s4, spark)

Ok, so maybe those are the fun things I actually plan on doing with it, but what's the difference?

Problem

Pig doesn't make it easy to get this. After spending several hours digging through the way pig parses and runs a pig script I've come away somewhat shaken up. The parsing logic is deeply coupled with the execution logic. Yes, yes, this is supposed to change as we go forward, e.g. PIG-3419, but what about in the meantime?

Hack/Solution

So, I've written this little jruby script to return the LogicalPlan for a pig script. Right now all it does is exactly the same as putting an 'EXPLAIN' operator in your script. However, since it exposes the LogicalPlan, you could easily extend this to do whatever you like with it.

Monday, August 12, 2013

Hadoop. Hadoop has managed to insinuate itself into practically every company with an engineering team and some data. If your company isn't using it, you know a company that is. Hell, it's why you're reading this to begin with. That being said, what you're probably doing with Hadoop is boring and uninspired. It's not your fault of course. Pretty much every example out there pigeonholes Hadoop into default business use cases like etl and data cleaning, basic statistics, machine learning, and GIS.

You know what though? Sometimes it's good to explore things that don't have an obvious business use case. Things that are weird. Things that are pretty. Things that are ridiculous. Things like dynamical systems and chaos. And, if you happen to find there are applicable tidbits along the way (*hint, skip to the problem outline section*), great, otherwise just enjoy the diversion.

motivation

So what is a dynamical system? Dryly, a dynamical system is a fixed rule to describe how a point moves through geometric space over time. Pretty much everything that is interesting can be modeled as a dynamical system. Population, traffic flows, fireflies, and neurons can all be described this way.

In most cases, you'll have a system of ordinary differential equations like this:

In this case \(v\) represents the potential difference between the inside of the neuron and the outside (membrane potential), and \(w\) corresponds to how the neuron recovers after it fires. There's also an external current \(I_{{\rm ext}}\) which can model other neurons zapping the one we're looking at but could just as easily be any other source of current like a car battery. The numerical constants in the system are experimentally derived from looking at how giant squid axons behave. Basically, these guys in the 60's were zapping giant squid brains for science. Understand a bit more why I think your business use case is boring?

One of the simple ways you can study a dynamical system is to see how it behaves for a wide variety of parameter values. In the Fitzhugh-Nagumo case the only real parameter is the external current \(I_{{\rm ext}}\). For example, for what values of \(I_{{\rm ext}}\) does the system behave normally? For what values does it fire like crazy? Can I zap it so much that it stops firing altogether?

In order to do that you'd just decide on some reasonable range of currents, say \((0,1)\), break that range into some number of points, and simulate the system while changing the value of \(I_{{\rm ext}}\) each time.

chaos

There's a lot of great ways to summarize the behavior of a dynamical system if you can simulate its trajectories. Simulated trajectories are, after all, just data sets. The way I'm going to focus on is calculation of the largest lyapunov exponent. Basically, all the lyapunov exponent says is, if I take two identical systems and start them going at slightly different places, how similarly do they behave?

For example, if I hook a car battery to two identical squid neurons at the same time, but one has a little bit of extra charge on it, does their firing stay in sync forever or do they start to diverge in time? The lyapunov exponent would measure the rate at which they diverge. If the two neurons fire close in time but don't totally sync up then the lyapunov exponent would be zero. If they eventually start firing at the same time then the lyapunov exponent is negative (they're not diverging, they're coming together). Finally, if they continually diverge from one another then the lyapunov exponent is positive.

As it turns out, a positive lyapunov exponent usually means the system is chaotic. No matter how close two points start out, they will diverge exponentially. What this means in practice is that, while I might have a predictive model (as a dynamical system) of something really cool like a hurricane, I simply can't measure it precisely enough to make a good prediction of where it's going to go. A really small measurement error, between where the hurricane actually is and where I measure it to be, will diverge exponentially. So my model will predict the hurricane heading into Texas when it actually heads into Louisiana. Yep. Chaos indeed.

problem outline

So I'm going to compute the lyapunov exponent of a dynamical system for some range of parameter values. The system I'm going to use is the Henon Map:

\begin{eqnarray*}x_{n+1} & = & y_{n}+1-ax_{n}^{2}\\y_{n+1} & = & bx_{n}\end{eqnarray*}

I chose the Henon map for a few reasons despite the fact that it isn't modeling a physical system. One, it's super simple and doesn't involve time at all. Two, it's two dimensional so it's easy to plot it and take a look at it. Finally, it's only got two parameters meaning the range of parameter values will make up a plane (and not some n-dimensional hyperspace) so I can make a pretty picture.

What does Hadoop have to do with all this anyway? Well, I've got to break the parameter plane (ab-plane) into a set of coordinates and run one simulation per coordinate. Say I let \(a \in [a_{min},a_{max}]\) and \(b \in [b_{min},b_{max}]\) and I want to look at \(N\) unique \(a\) values and \(M\) unique \(b\) values. That means I have to run \(N \times M\) individual simulations!

Clearly, the situation gets even worse if I have more parameters (a.k.a. a realistic system). However, since each simulation is independent of all the other simulations, I can benefit dramatically from simple parallelization. And that, my friends, is what Hadoop does best. It makes parallelization trivially simple. It handles all those nasty details (which distract from the actual problem at hand) like what machine gets what tasks, what to do about failed tasks, reporting, logging, and the whole bit.

So here's the rough idea:

Use Hadoop to split the n-dimensional (2D for this trivial example) space into several tiles that will be processed in parallel

Each split of the space is just a set of parameter values. Use these parameter values to run a simulation.

Calculate the lyapunov exponent resulting from each.

Slice the results, visualize, and analyze further (perhaps at higher resolution on a smaller region of parameter space), to understand under what conditions the system is chaotic. In the simple Henon map case I'll make a 2D image to look at.

The important silly detail is this. The input data here is minuscule in comparison to most data sets handled with Hadoop. This is NOT big data. Instead, the input data is a small file with n lines and can be thought of as a "spatial specification". It is the input format that explodes the spatial specification into the many individual tiles needed. In other words, Hadoop is not just for big data, it can be used for massively parallel scientific computing.

implementation

Hadoop has been around for a while now. So when I implement something with Hadoop you can be sure I'm not going to sit down and write a java map-reduce program. Instead, I'll use Pig and custom functions for pig to hijack the Hadoop input format functionality. Expanding the rough idea in the outline above:

Pig will load a spatial specification file that defines the extent of the space to explore and with what granularity to explore it.

A custom Pig LoadFunc will use the specification to create individual input splits for each tile of the space to explore. For less parallelism than one input split per tile it's possible to specify the number of total splits. In this case the tiles will be split mostly evenly among the input splits.

The LoadFunc overrides Hadoop classes. Specifically: InputFormat (which does the work of expanding the space), InputSplit (which represents the set of one or more spatial tiles), and RecordReader (for deserializing the splits into useful tiles).

A custom EvalFunc will take the tuple representing a tile from the LoadFunc and use its values as parameters in simulating the system and computing the lyapunov exponent. The lyapunov exponent is the result.
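Stripped of the Hadoop classes, the "explode the space" part of the steps above is small. Here's a sketch in Python under a few assumptions: each line of the specification is min,max,count for one dimension, the values are evenly spaced with both endpoints included, and tiles are dealt out to splits round-robin. The function names are hypothetical and this is not the LoadFunc itself.

```python
import itertools

def parse_spec(lines):
    """Parse one 'min,max,count' line per dimension."""
    axes = []
    for line in lines:
        lo, hi, count = line.strip().split(",")
        axes.append((float(lo), float(hi), int(count)))
    return axes

def axis_values(lo, hi, count):
    """count evenly spaced values from lo to hi, endpoints included."""
    if count == 1:
        return [lo]
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]

def tiles(axes):
    """Every combination of per-axis values: one tuple per simulation."""
    return itertools.product(*(axis_values(*axis) for axis in axes))

def make_splits(all_tiles, num_splits):
    """Deal tiles round-robin so split sizes differ by at most one."""
    splits = [[] for _ in range(num_splits)]
    for i, tile in enumerate(all_tiles):
        splits[i % num_splits].append(tile)
    return splits

# A tiny 3 x 2 space split four ways.
axes = parse_spec(["0.6,1.6,3", "-1.0,1.0,2"])
all_tiles = list(tiles(axes))
splits = make_splits(all_tiles, 4)
```

The point to notice is that nothing here reads any real data. The whole workload is generated from two short lines of text.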

running

I want to explore the Henon map over a range where it's likely to be bounded (unbounded solutions aren't that interesting) and chaotic. Here's my input file:

$: cat data/space_spec
0.6,1.6,800
-1.0,1.0,800

Remember the system?
\begin{eqnarray*}x_{n+1} & = & y_{n}+1-ax_{n}^{2}\\y_{n+1} & = & bx_{n}\end{eqnarray*}
Well, the spatial specification says (if I let the first line represent \(a\) and the second be \(b\)) that I'm looking at an \(800 \times 800\) grid (640000 independent simulations) in the ab-plane where \(a \in [0.6,1.6]\) and \(b \in [-1.0,1.0]\).

Now, these bounds aren't arbitrary. The Henon attractor that most are familiar with (if you're familiar with chaos and strange attractors in the least bit) occurs when \(a=1.4\) and \(b=0.3\). I want to ensure I'm at least going over that case.
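For a single \((a,b)\) tile, the work the EvalFunc has to do looks roughly like the following. This is a Python sketch of one standard way to estimate the largest lyapunov exponent (push a tangent vector through the Jacobian of the map, renormalize every step, and average the log of the growth), not the actual UDF. The starting point \((0,0)\), the iteration counts, and the escape bound are all assumptions.

```python
import math

def henon_lyapunov(a, b, n_iter=10000, burn_in=1000, bound=1e6):
    """Estimate the largest lyapunov exponent of the Henon map.

    Returns None if the orbit escapes (unbounded) or the tangent
    vector collapses to zero.
    """
    x, y = 0.0, 0.0     # state
    dx, dy = 1.0, 0.0   # tangent vector
    total = 0.0
    for n in range(burn_in + n_iter):
        # Jacobian at (x, y) is [[-2ax, 1], [b, 0]]; apply it first,
        # while x still holds the current state.
        dx, dy = -2.0 * a * x * dx + dy, b * dx
        # Then advance the map itself.
        x, y = y + 1.0 - a * x * x, b * x
        if abs(x) > bound or abs(y) > bound:
            return None
        norm = math.hypot(dx, dy)
        if norm == 0.0:
            return None
        dx, dy = dx / norm, dy / norm
        if n >= burn_in:
            total += math.log(norm)  # growth of the tangent vector this step
    return total / n_iter

chaotic = henon_lyapunov(1.4, 0.3)   # the classic attractor
periodic = henon_lyapunov(0.6, 0.3)  # settles onto a periodic orbit
```

The burn-in just throws away the transient before the orbit settles onto whatever it's going to settle onto, so the average isn't polluted by the starting point.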

The big swaths of flat white are regions where the system becomes unbounded. It's interesting that the bottom right portion has some structure to it that's possibly fractal. The top right portion, between \(b=0.0\) and \(b=0.5\) and \(a=1.0\) to \(a=1.6\) is really the only region on this image that's chaotic (where the exponent is greater than zero). There's a lot more structure here to look at but I'll leave that to you. As a followup it'd be cool to zoom in on the bottom right corner and run this again.

conclusion

So yes, it's possible to use Hadoop to do massively parallel scientific computing and avoid the question of big data entirely. Best of all it's easy.

The notion of exploding a space and doing something with each tile in parallel is actually pretty general and, as I've shown, super easy to do with Hadoop. I'll leave it to you to come up with your own way of applying it.

Monday, October 8, 2012

An idea that's been rolling around in my head for a while now has been how to create a k-d tree with pig. I've been hesitant to post about it because it seemed like a ridiculous idea. That is, a k-d tree is typically generated recursively. So how can you parallelize its creation without reverting to some complicated iterative solution? Turns out it's an important topic in graphics rendering, particularly with respect to ray tracing, and some smart folks in that field have a pretty good parallel solution. Since we're here to talk about pig, and not graphics rendering, I'll go ahead and skip to the juicy bit. The idea is to first break the space your points lie within into non-overlapping partitions. A non-parallel k-d tree algorithm is then applied to each of the partitions in parallel. The resulting trees, one per partition, are then merged to create the final tree. That's it. Seems obvious in hindsight.

motivation

First off, why bother pigify-ing the creation of a k-d tree? Pig is pretty much limited to problems where high latency isn't a concern and so unless the set of things we're planning on indexing is huge it doesn't make much sense to talk about it other than as a theoretical exercise. Well. Let's see...

Suppose you wanted to create your own index of star brightness observations (of which this meager database has over 20 million). Given a set of galactic coordinates and time, what observations lie nearby in space and time? Maybe a k-d tree would help!

Pssh, star brightness observations, that's basically only 4 dimensions, what about indexing document vectors for all the web pages in the Common Crawl data set? You could try a k-d tree.

What about satellite data? There's an interesting (and old) project called the International Satellite Cloud Climatology Project (ISCCP). Want to know cloud properties for a region over time? Well, one of their data sets has broken the world into a grid of 30km x 30km 'pixels' and reports all sorts of interesting cloud properties for each of those pixels for over 30 years! It's about 5GB of data per month... Yikes. I'm going to reach for a k-d tree to search through that beast.

problem outline

So now that we're sufficiently motivated (and come on, if stars, web pages, and clouds don't get you going then what the hell are you doing here?), let's work through a much smaller example that we can actually look at. I find with problems involving points it's most intuitive to use geo data as a starting point. Geo data is easy to think about and you can plot your results on a map for verification.

The example geo data we'll use is from the GeoNames database and is simply a list of all the cities in the world with population greater than 1000. We're going to use pig to create a k-d tree containing these cities. Some ways of storing and querying this k-d tree will be discussed.

get data

Go ahead and download and unzip the data so you too can play along. I like wget.

This will result in a tab-separated-values file called 'cities1000.txt'. In the following analysis we're only using the coordinates and the ids of a point. You could write your pig script to load all the fields and project out the ones you don't want or, if like me you don't like writing gory pig schemas, you can cut them out ahead of time on the command line.

$: cut -f1,5,6 cities1000.txt > cities1000_cut.tsv

Where fields 1, 5, and 6 (in the cut command above) correspond to the geonameid, latitude, and longitude respectively.

algorithm

Before we can write a pig script to actually deal with this we'd better have our algorithm solidly in hand. Here it is to the best of my ability to write it...

1. Break the space into non-overlapping partitions.

2. Run a non-parallel k-d tree generation algorithm on the points that fall within each partition.

3. Merge the k-d trees from (2) into the result k-d tree.

Non-overlapping partitions wtf?

Basically we need to break the space our points fall inside into smaller pieces that don't overlap. This is important for merging the trees in the final step since, with non-overlapping pieces, a point can only belong to one partition and, consequently, exactly one tree. If a point was inside more than one partition then we'd have a heck of a time merging the trees from those partitions since there'd be collisions. No fun.

Now, which algorithm do we actually use to partition? It can get complicated since it really depends on the performance you're looking for and the distribution of your data. You don't want to send too many points to a single partition. On the other hand, processing empty or mostly empty partitions really sucks too. In the Shevtsov et al. paper I referenced at the beginning of this post they run an initial clustering algorithm over the points and choose how to partition the space dynamically based on that clustering. However, since this is a blog post and I am but one man, we'll use an ok way of partitioning based on the quadkey tiling scheme described here. This basically just chops the space up into an even grid.
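If you haven't run into quadkeys before, here's roughly what computing one looks like. This is a Python sketch that assumes the scheme is the Bing Maps style tile system (Web Mercator projection, with the bits of the tile x and y interleaved into base-4 digits). It is not the GetQuadKey udf, and the function names are made up.

```python
import math

def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of tile x and y into a string of base-4 digits."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

def quadkey(lat, lng, zoom):
    """Quadkey of the tile containing (lat, lng) at the given zoom level."""
    # Web Mercator can't represent the poles, so clip the latitude.
    lat = max(-85.05112878, min(85.05112878, lat))
    lng = max(-180.0, min(180.0, lng))
    # Project to the unit square, origin at the top-left corner.
    x = (lng + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1.0 + sin_lat) / (1.0 - sin_lat)) / (4.0 * math.pi)
    n = 1 << zoom  # tiles per side at this zoom level
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_to_quadkey(tile_x, tile_y, zoom)

austin = quadkey(30.27, -97.74, 7)
```

A nice property of this encoding is that a tile's quadkey is a prefix of the quadkeys of all the tiles inside it, so moving to a coarser partitioning is just a matter of truncating the string.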

Ok, so now that I'm using tiles, how do I know what size tile to use? The quadkey system allows you to use a zoom level that ranges from 1 to 23. To pick the best one for this example I sampled the cities and generated a distribution of the number of points per tile at various resolutions:

Eyeballing it, it looks like 7 is a good zoom level at which to partition the space. Mostly it's because the distribution for 7 has an alright spread and I don't see hot spots where one partition gets waaay too many points like at levels 5 and 6. But, please, by all means, experiment with this.

Since we're dealing with geodata and I'm in a visualization mood, here's what Texas looks like partitioned at zoom level 7. The points are the actual cities pulled from our example data set:

As you can see, what with West Texas being a wasteland and all, the partitioning makes it so that some partitions get too much data and some get very little. Marfa's got dust and Dallas has all the suburbs. You just can't win with a grid. But, we're going to march forward anyhow.

non-parallel k-d tree

Since we're going to be using pig the k-d tree generation code will need to be written as a user-defined-function (UDF). Pig is going to group the points by their quadkey which will result in a bag, one per quadkey (partition). Hence, the UDF will need to take a bag of points as input.

Here's also where it's important to think about how we're going to represent the k-d tree during the pig processing. A nested tree representation doesn't work because we need to operate on individual nodes from the pig script itself (filtering, etc). So, the UDF will yield a bag of points, exactly the same size as the input, only with the left and right children (from the k-d tree) attached as fields. Let's call the UDF "KDTree". Rather than paste a couple hundred lines of java, an implementation can be found in the sounder repo on github here. Here's the important bits:

udf signature: KDTree(points) where points is a bag with the following schema:

A few things. In the output, above_child and below_child correspond to the right and left children of the point. They're labeled as such since the k-d tree udf can operate on k-dimensional points, not just 2-dimensional points. is_root is an integer, 0 or 1, indicating whether the node is the root of the k-d tree or not. This is because we're returning the nodes as a completely flat structure and we need to be able to reconstruct our tree. Also, in the next step of the algorithm, we're going to need to separate the roots of each partition from the branches. axis is the axis that this node splits (either 0 or 1). This is so that, once we reconstruct our tree, we know how to search it.
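To make the flat representation concrete, here's a small Python sketch of a median-split k-d tree builder that returns one record per input point, with the field names described above. It is not the Java udf from the sounder repo. The input shape (a list of (id, coordinates) pairs) and the "point" field name are assumptions for illustration.

```python
def build_kdtree(points, k=2):
    """Build a k-d tree and return it as a flat list of node records.

    points: list of (id, coords) pairs, where coords is a k-tuple.
    Children are referenced by id rather than nested, so every node
    can be treated as an independent row.
    """
    nodes = []

    def build(pts, depth):
        if not pts:
            return None
        axis = depth % k  # cycle through the axes as we go down
        pts = sorted(pts, key=lambda p: p[1][axis])
        mid = len(pts) // 2  # median point splits this level
        point_id, coords = pts[mid]
        nodes.append({
            "id": point_id,
            "point": coords,
            "axis": axis,
            "is_root": 1 if depth == 0 else 0,
            "below_child": build(pts[:mid], depth + 1),
            "above_child": build(pts[mid + 1:], depth + 1),
        })
        return point_id

    build(list(points), 0)
    return nodes

partition = [("a", (1.0, 5.0)), ("b", (2.0, 1.0)), ("c", (3.0, 3.0)),
             ("d", (4.0, 2.0)), ("e", (5.0, 4.0))]
flat = build_kdtree(partition)
by_id = {node["id"]: node for node in flat}
```

Points that tie with the median along the splitting axis can land on either side in this sketch, so a search over the result shouldn't assume a strict inequality.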

merging trees

This part seems complicated but it's actually pretty simple. Remember, it's only possible because we broke our original space into non-overlapping pieces. In the previous step we've generated a k-d tree for each of those pieces. As a result of this each partition has its own root node. To merge the trees we first need to nominate one of those roots (say, the median point along the longitude axis) as the top level root. Then, taking the top level root's tree to start we simply insert the other roots into this tree. Importantly, this step cannot be done (as far as I can tell) in parallel. Instead, it relies on how we ultimately choose to store (and query) the tree. I'll describe a simple way using a hashmap in a bit which can easily be extended to work with a distributed key-value store.

Implementation

Implementing this in pig is actually pretty straightforward. I'll go over each part in turn.

First, as a convenience and to make the pig code more readable, we'll use the define keyword to alias our udfs. GetQuadKey is the udf that generates the quadkeys and KDTree is the udf to generate a k-d tree from a bag of points.

Next, we simply load the data using pig's default PigStorage class with tabs as the field delimiter.

data = load '$POINTS' as (id:chararray, lat:double, lng:double);

Then we generate a quadkey for every point at resolution 7. We're also using pig's built in TOTUPLE operator since the k-d tree udf needs points to be specified as tuples of coordinates. This is so it can handle k-dimensional points and not just 2-dimensional points.

And here's the meat of it where we group by quadkey (effectively using hadoop to partition our space) which collects all the points for a given partition into a single bag. This is fed to the k-d tree udf which does the work of creating the k-d trees.

Finally, the trees are split into roots (remember, each partition will have its own root) and branches. These are flattened to pull the coordinates out of their tuple containers and stored separately. The roots and branches need to be dealt with differently when creating the final k-d tree so they're stored separately to reflect that.

This should generate only one map-reduce job and result in a directory on the hdfs which contains the roots and branches of the k-d trees.

As a sanity check, here's an image of what texas looks like, now with the roots:

Since the first splitting axis in the k-d tree algorithm is the x-axis (longitude) we expect the root of each k-d tree to be the median point along that axis in its respective partition. By eyeballing every partition (I sure do a lot of eyeballing...) this appears to be true. So yay, looks good.

Also, notice I'm getting ahead of myself a bit and showing the root I've nominated as the top level node.

Final Tree

This part takes some care. First, we need to decide how we're actually going to store and query the k-d tree. Directly from the hdfs in its current form is definitely not going to work. For illustrative purposes I'm going to use a ruby hash to store and query the tree. This, of course, only works because the data is small enough to fit into memory. However, most distributed key-value stores and other, more complex nosql stores like cassandra and hbase, would work as well. For the purposes of this example you can think of them as giant ruby hashmaps too. The same basic query interface would work, just with a different data store backend.

As steps:

Insert all the branches into the key-value store using the point id as the key and the rest of the point metadata as the value. Importantly the coordinates and the above_child and below_child are stored. How you do this is up to you.

Nominate one of the roots as the top level root. Ideally this will be the median point along either the x or y axis. The idea here is that the number of partitions, thus the number of roots, will be small enough that finding the median point and nominating it as the root should be a simple single process problem.

Insert the remaining roots into the key-value store using the standard k-d tree insertion procedure outlined in the wikipedia article.

Of course you'll have to implement some basic k-d tree searching and insertion code, but that's a well understood problem. There are plenty of examples, eg here and here. My implementation can be found in the sounder repo on github here.
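As a rough illustration of the steps above (and not the code from the sounder repo), here's a sketch using a plain ruby hash as the store. The helper names nominate_root and insert_root, the node layout, and the sample coordinates are all assumptions. Each node keeps its own splitting axis, and the walk compares along that stored axis. Note this only shows the bookkeeping of linking roots together; whether a whole subtree hung this way still respects the k-d ordering relative to its new ancestors depends on how the partitions are laid out.

```ruby
# Hypothetical in-memory store: id => node hash.
# Pick the median root along one axis (0 = longitude here).
def nominate_root(store, root_ids, axis = 0)
  sorted = root_ids.sort_by { |id| store[id][:point][axis] }
  sorted[sorted.size / 2]
end

# Walk down from top_id and attach the node `id` at the first free
# child slot. Returns the id of the node it was attached to.
def insert_root(store, top_id, id)
  current = top_id
  loop do
    node = store[current]
    axis = node[:axis]
    side = store[id][:point][axis] < node[:point][axis] ? :below_child : :above_child
    if node[side].nil?
      node[side] = id
      return current
    end
    current = node[side]
  end
end

# Four partition roots as [lng, lat]; their subtrees are left out for brevity.
store = {
  'r1' => { point: [-100.0, 31.0], axis: 0, below_child: nil, above_child: nil },
  'r2' => { point: [-97.0, 30.0],  axis: 0, below_child: nil, above_child: nil },
  'r3' => { point: [-95.0, 33.0],  axis: 0, below_child: nil, above_child: nil },
  'r4' => { point: [-99.0, 29.0],  axis: 0, below_child: nil, above_child: nil }
}

root_ids = store.keys
top = nominate_root(store, root_ids)
(root_ids - [top]).each { |id| insert_root(store, top, id) }
```

Swapping the hash for a distributed key-value store would mean replacing the `store[...]` reads and writes with gets and puts against that store; the walk itself stays the same.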

To use my code simply bring the cities1000/roots and cities1000/branches down to your local filesystem and run:

Wednesday, March 21, 2012

A common desire when working with natural language is topic discovery. That is, given a set of documents (eg. tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is 'about'. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you've done topic discovery on.

Latent Dirichlet Allocation

One popular method for topic discovery in a corpus is Latent Dirichlet Allocation (LDA). I won't pretend to be an expert on LDA but the main assumption is as follows. Each document is assumed to be a 'mixture' of topics. Going further, each topic is then assumed to be a distribution over terms. For example, say there is a topic in my corpus labeled 'apache hadoop'. It could be represented as a multinomial probability distribution with high probability of generating terms such as 'hadoop', 'data', 'apache', and 'map-reduce'. See the wikipedia article on LDA.
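Written out, using the usual textbook notation rather than anything specific to this post, the assumption looks like this. With \(K\) topics, a Dirichlet parameter \(\alpha\) over topics, and a Dirichlet parameter \(\beta\) over terms, each document \(d\) draws a topic mixture \(\theta_d\), each topic \(k\) draws a term distribution \(\phi_k\), and the \(n\)-th word \(w_{d,n}\) of document \(d\) is generated by first picking a topic \(z_{d,n}\) and then a term from that topic:

\begin{eqnarray*}
\theta_d &\sim& \mathrm{Dirichlet}(\alpha) \\
\phi_k &\sim& \mathrm{Dirichlet}(\beta), \quad k = 1, \ldots, K \\
z_{d,n} &\sim& \mathrm{Multinomial}(\theta_d) \\
w_{d,n} &\sim& \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{eqnarray*}

Topic discovery is then the problem of inferring the \(\phi_k\) (and the \(\theta_d\)) from the observed words alone.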

Problem

I'm going to use Apache Pig and Mallet, a java based machine learning and natural language processing library to discover topics in the 20 newsgroups data set. This corpus is nice since each document already belongs to a newsgroup (a topic) and so it gives us a way of checking how well our topic discovery is doing.

The Data

The 20 newsgroups data set can be found on Infochimps here. Once you've got the data go ahead and place it somewhere on your hdfs. I put mine in my home directory under '20newsgroups/data'.

So it's tab separated where the first field is the document id (a concatenation of the newsgroup the document is coming from and an integer id). The second field is the document text itself. Here's the pig schema for that:

(doc_id:chararray, text:chararray)

Algorithm

LDA operates on a set of documents. Trivially we could just skip the pig part and write a simple java program that operates on the entire document set and be done with it. But that's not the point. Typically, your input documents have metadata attached to them. For example, the region or user they're coming from, or even just the date they were generated. So we'll just use pig's GROUP BY statement to group the documents by this metadata and cluster the documents within each group independently. If the documents don't have this kind of metadata we're stuck doing a GROUP ALL and dealing with all the documents at once. There are clever ways of parallelizing LDA in this case that I'm not going to go into. See here and here.

Here's a sketch of the algorithm:

(1) Load documents

(2) Group the documents by appropriate metadata (or by all)

(3) Run LDA on each group of documents

(4) Profit!!!
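In plain ruby the shape of steps (2) and (3) is roughly the following. To keep the sketch self-contained I've swapped the topic model for a trivial stand-in (top_terms, which just counts words); it is not LDA and only marks the spot where the per-group model would run. The region field is a made-up piece of metadata.

```ruby
# Stand-in for the real topic model: returns the n most frequent terms.
# This is NOT LDA; it only marks where a per-group model would run.
def top_terms(texts, n)
  counts = Hash.new(0)
  texts.each { |t| t.downcase.scan(/[a-z]+/) { |w| counts[w] += 1 } }
  # Most frequent first, ties broken alphabetically
  counts.sort_by { |w, c| [-c, w] }.first(n).map(&:first)
end

# Group documents by a metadata field and model each group independently.
def topics_by_group(docs, n)
  docs.group_by { |d| d[:region] }
      .map { |key, group| [key, top_terms(group.map { |d| d[:text] }, n)] }
      .to_h
end

docs = [
  { doc_id: '1', region: 'us', text: 'hadoop pig hadoop' },
  { doc_id: '2', region: 'us', text: 'pig latin hadoop' },
  { doc_id: '3', region: 'eu', text: 'hockey puck hockey' }
]
result = topics_by_group(docs, 2)
```

Because each group is handled on its own, the groups can be processed in parallel, which is exactly what a GROUP BY followed by a udf gives you in pig.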

Implementation

So it's clear that we're going to need a java udf to do the actual topic clustering. Right? This udf will operate on a DataBag of documents and return a DataBag containing the discovered topics. Each topic will be represented by a Tuple with the following schema:

(topic_num:int, keywords:bag {t:tuple(keyword:chararray, weight:int)})

        // Add the input databag as source data and run it through the pipe built
        // by the constructor.
        instances.addThruPipe(new DataBagSourceIterator(documents));

        // Create a model with numTopics, alphaSum = 1.0, beta_w = 0.01
        // Note that the first parameter is passed as the sum over topics
        // (so alpha_t = 1.0 / numTopics), while the second is the parameter
        // for a single dimension of the Dirichlet prior.
        ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(1); // Important, since this is being run in the reduce, just use one thread
        model.setTopicDisplay(0,0);
        model.setNumIterations(2000);
        model.estimate();

        public Instance next() {
            // Get the next tuple and pull out its fields
            Instance i = new Instance(currentText, "X", currentId, null);
            return i;
        }

        public void remove() {
            tupleItr.remove();
        }
    }
}

There's a few key things going on here. First, the udf operates on a bag that contains tuples with exactly two fields, doc_id and text. Mallet has the notion of pipes where your input data flows through a set of 'pipes' as a way of preparing the data. The class DataBagSourceIterator is simply a convenient way of plugging a DataBag object into this flow.

One of the pipes our documents flow through actually tokenizes the text. TokenSequenceNGramsDelim does this work. Mallet has a built-in TokenSequenceNGrams that works nicely. Unfortunately, when tokenizing n-grams where n > 1, it uses an '_' by default to separate the terms in the ngram. TokenSequenceNGramsDelim allows us to use our own delimiter, namely a ' ', instead.

Ultimately, all this udf does is read the input documents, prepare them for clustering, run Mallet's built in LDA methods, and construct the output DataBag in the way we'd like it.

Pig

Now that we have our udf, let's write a pig script to use it. Since the documents I've chosen to use don't have any obvious (or at least easy to get at) additional metadata we're going to use a GROUP ALL. Here's the pig script:

--
-- Group the docs by all and find topics
--
-- WARNING: This is, in general, not appropriate in a production environment.
-- Instead it is best to group by some piece of metadata which partitions
-- the documents into smaller groups.
--
topics = foreach (group for_lda all) generate
           FLATTEN(LDATopics(20, for_lda)) as (
             topic_num:int,
             keywords:bag {t:tuple(keyword:chararray, weight:int)}
           );

store topics into '$OUT';

There's a few things worth pointing out here. First, we load our text as normal. The step that tokenizes the text might look spurious, but it uses the lucene tokenization udf from here as a way to remove stopwords. You could skip this step if that isn't important to you. Next, the tokenized text is grouped by document id and concatenated back together to form cleaned documents. I'm using the nice udf from pygmalion to do the concatenation. Finally, the documents are grouped together and topics are discovered.

Run it!

At this point we're ready to run our script. I named this script 'discover_topics_example.pig'. And here's how I ran it:

Labeling

Now, we'd like to see how well we did. Here's the 20 topics we _know_ should exist:

alt.atheism

comp.graphics

comp.os.ms-windows.misc

comp.sys.ibm.pc.hardware

comp.sys.mac.hardware

comp.windows.x

misc.forsale

rec.autos

rec.motorcycles

rec.sport.baseball

rec.sport.hockey

sci.crypt

sci.electronics

sci.med

sci.space

soc.religion.christian

talk.politics.guns

talk.politics.mideast

talk.politics.misc

talk.religion.misc

There are a number of methods for labeling topics discovered in this way (see here), but in the interest of time I'm going to manually match the topics above to the ones discovered. Obviously, 'eyeballing' it isn't appropriate for a production environment...

alt.atheism,16

comp.graphics,12

comp.os.ms-windows.misc,9

comp.sys.ibm.pc.hardware,3

comp.sys.mac.hardware,11

comp.windows.x,0

misc.forsale,10

rec.autos,14

rec.motorcycles,14

rec.sport.baseball,19

rec.sport.hockey,2

sci.crypt,15

sci.electronics

sci.med,1

sci.space,13

soc.religion.christian,6

talk.politics.guns,5

talk.politics.mideast,18,7

talk.politics.misc,4

talk.religion.misc,8

So, as far as I can tell, some newsgroups map to more than one of the discovered topics and some don't seem to map to any discovered topic at all. It's clear there's room for improvement (look at the parameters alpha and beta I'm hardcoding in the topic model for example). But all in all it's pretty good as a first pass. Now go away and find some topics.

Wednesday, October 26, 2011

Since it's been such a long time since I last posted I thought I'd make this one a bit longer. It really is a condensing of a lot of things I've been working with and thinking about over the past few months.

Nearest Neighbors

The nearest neighbors problem (also known as the post-office problem) is this: Given a point X in some metric space M, assign to it the nearest neighboring point S. In other words, given a residence, assign to it the nearest post office. The K-nearest neighbors problem, which this post addresses, is just a slight generalization of that problem. Instead of just one neighbor we are looking for K neighbors.
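For a handful of points the problem can be solved by brute force, which is a useful baseline to keep in mind. This sketch assumes points live in a hash of id => [x, y] and uses plain Euclidean distance; the function name k_nearest is made up, and for real longitude/latitude pairs you'd likely want a great-circle distance instead.

```ruby
# Brute-force K nearest neighbors: compare the query point against
# every other point, sort by distance, keep the first k.
def k_nearest(points, query_id, k)
  q = points.fetch(query_id)
  points.reject { |id, _| id == query_id }
        .map { |id, p| [id, Math.sqrt((p[0] - q[0])**2 + (p[1] - q[1])**2)] }
        .sort_by { |_, d| d }
        .first(k)
end

points = { 'a' => [0.0, 0.0], 'b' => [1.0, 0.0],
           'c' => [0.0, 2.0], 'd' => [5.0, 5.0] }
neighbors = k_nearest(points, 'a', 2)
```

Doing this for every point costs on the order of N squared distance computations and needs all the points in one place, which is what the rest of this post works around.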

Problem

So, we're going to use the geonames data. This is a set of nearly 8 million geo points with names, coordinates, and a bunch of other good stuff, from around the world. We would like to find, for a given point in the geonames set, the 5 points (also in geonames) that are nearest to it. Should be pretty simple, yeah?

Get data

The geonames data set 'allCountries.zip' can be downloaded like so:

$: wget http://download.geonames.org/export/dump/allCountries.zip

Prepare data

Since the geonames data set comes as a nice tab-separated-values (.tsv) file already it's just a matter of unzipping the package and placing it on your hdfs (you do have one of those don't you?). Do:

$: unzip allCountries.zip
$: hadoop fs -put allCountries.txt .

to unzip the package and place the tsv file into your home directory on the hadoop distributed file system.

Schema

Oh, and by the way, before we forget, the data from geonames has this pig schema:

The Algorithm

Now that we have the data we can start to play with it and think about how to solve the problem at hand. Looking at the data (use something like 'head', 'cat', 'cut', etc) we see that there are really only three fields of interest in the data: (geonameid, longitude, and latitude). All the other fields are just nice metadata which we can attach later.

Now, since we're going to be using Apache Pig to solve this problem we need to think a little bit about parallelism. One constraint is that at no time is any one point going to have access to the locations of all the other points. In other words, we will not be storing the full set of points in memory. Besides, it's 8 million points, that's kind of a lot for my poor little machine to handle.

So it's clear (right?) that we're going to have to partition the space in some way. Then, within a partition of the space, we'll need to apply a local version of the nearest neighbors algorithm. That's it really. Map and reduce. Wait, but there's one problem. What happens if we don't find all 5 neighbors for a point in a single partition? Hmmm. Well, the answer is iteration. We'll choose a small partition size to begin with and gradually increase the partition size until either the partition size is too large or all the neighbors have been found. Got it?

Recap:

(1) Partition the space

(2) Search for nearest neighbors in a single partition

(3) If all neighbors have been found, terminate; else increase partition size and repeat (1) and (2)
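Here's the recap as a small single-machine sketch. Instead of quadkeys it partitions with a square grid whose cell size doubles each round, which is an assumption made to keep the example short; the function names are made up too. Keep in mind that neighbors found inside a cell are the nearest within that cell, which can differ from the true nearest for a point sitting near a cell edge.

```ruby
def dist(a, b)
  Math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
end

# points: id => [x, y]
# Returns [done, not_done] where done maps id => [[neighbor_id, distance], ...]
def partitioned_knn(points, k, start_size, max_size)
  done = {}
  not_done = points.keys
  size = start_size
  while size <= max_size && !not_done.empty?
    # (1) Partition the space into grid cells of the current size
    cells = points.group_by { |_, p| [(p[0] / size).floor, (p[1] / size).floor] }
    # (2) Search within each cell that holds enough points
    cells.each_value do |members|
      next if members.size <= k
      members.each do |id, p|
        next if done.key?(id)
        done[id] = members.reject { |other, _| other == id }
                          .map { |other, q| [other, dist(p, q)] }
                          .sort_by { |_, d| d }
                          .first(k)
      end
    end
    # (3) Grow the partitions and go again for whoever is left
    not_done = points.keys - done.keys
    size *= 2
  end
  [done, not_done]
end

points = { 'a' => [0.1, 0.1], 'b' => [0.2, 0.1], 'c' => [0.9, 0.9],
           'd' => [1.5, 0.5], 'e' => [7.5, 7.5] }
done, not_done = partitioned_knn(points, 2, 1.0, 4.0)
```

In this run 'a', 'b', and 'c' finish in the first round, 'd' finishes once the cells double in size, and 'e' is still alone in its cell when the maximum size is reached.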

Implementation

For partitioning the space we're going to use quadkeys from the Bing Maps tile system (http://msdn.microsoft.com/en-us/library/bb259689.aspx) since they're super easy to implement and they partition the space nicely. This will be a java UDF for Pig that takes a (longitude, latitude, zoom level) tuple and returns a string quadkey (the partition id).
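The quadkey computation itself is short. Below is a simplified ruby sketch of the tile scheme described at the link above; it is not the java UDF from this post, the function names are mine, and it skips the pixel-level rounding in the reference description by going straight from the projected coordinate to a tile index.

```ruby
# Interleave the bits of the tile x and y indices into base-4 digits,
# most significant bit first.
def tile_to_quadkey(tile_x, tile_y, zl)
  zl.downto(1).map do |i|
    mask  = 1 << (i - 1)
    digit = 0
    digit += 1 if (tile_x & mask) != 0
    digit += 2 if (tile_y & mask) != 0
    digit.to_s
  end.join
end

# Convert (longitude, latitude, zoom level) to a quadkey string.
def quadkey(lng, lat, zl)
  # Clip to the range the Mercator projection covers
  lat = [[lat, -85.05112878].max, 85.05112878].min
  lng = [[lng, -180.0].max, 180.0].min
  n   = 2**zl                                   # tiles per side at this zoom

  sin_lat = Math.sin(lat * Math::PI / 180.0)
  x = (lng + 180.0) / 360.0
  y = 0.5 - Math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * Math::PI)

  tile_x = [[(x * n).floor, 0].max, n - 1].min
  tile_y = [[(y * n).floor, 0].max, n - 1].min
  tile_to_quadkey(tile_x, tile_y, zl)
end

key = quadkey(-97.7, 30.3, 7)
```

A quadkey has one digit per zoom level, and a tile's quadkey is a prefix of the quadkeys of every tile inside it. That's why dropping to a smaller zoom level gives a bigger partition.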

// Add this point as a neighbor if pointA has no neighbors
if (neighbors.size()==0) {
    Tuple newNeighbor = tupleFactory.newTuple(2);
    newNeighbor.set(0, pointB.get(0));
    newNeighbor.set(1, distance);
    neighbors.add(newNeighbor);
}

The details of the NearestNeighbors UDF aren't super important and it's mostly pretty clear what's going on. Just know that it operates on a bag of points as input and returns a bag of points as output that has the same schema. This is really important since we're going to be iterating.

Then we're on to the Pig part, hurray! Since Pig doesn't have any built in support for iteration, I chose to use Jruby (because it's awesome) and pig's "PigServer" java class to do all the work. Here's what the jruby runner looks like (it's kind of a lot so don't get scared):

#!/usr/bin/env jruby

require 'java'

#
# You might consider changing this to point to where you have
# pig installed...
#
jar  = "/usr/lib/pig/pig-0.8.1-cdh3u1-core.jar"
conf = "/etc/hadoop/conf"

  #
  # Actually runs all the pig queries for
  # the algorithm. Stops if all neighbors
  # have been found or if min_zl is reached
  #
  def run_algorithm
    start_nearest_neighbors(points, k, 22)
    if run_nearest_neighbors(k, 22)
      21.downto(min_zl) do |zl|
        iterate_nearest_neighbors(k, zl)
        break unless run_nearest_neighbors(k,zl)
      end
    end
  end

    #
    # Split the results into done and not_done relations
    # The algorithm is done when 'not_done' contains
    # no more tuples.
    #
    def self.split_results(k, zl)
      "SPLIT nearest_neighbors#{zl} INTO done#{zl} IF COUNT(neighbors) >= #{k}l, not_done#{zl} IF COUNT(neighbors) < #{k}l;"
    end
  end

end

NearestNeighbors.new(ARGV[0], ARGV[1]).run

Call this file "nearest_neighbors.rb". The idea here is that we register some basic pig queries to do the initialization of the algorithm and the iterations. These queries are run over and over until either the "not_done" relation contains no more elements or the minimum zoom level has been reached. Note that a small zoom level means a big partition of space.

Run it!

I think we're finally ready to run it. Let K=5 and the min zoom level (zl) be 10. Then just run:

$: ./nearest_neighbors.rb allCountries.txt 5 10

To kick it off. The output will live in 'done10' (in your home directory on the hdfs) and all the ones that couldn't find their neighbors (poor guys) are left in 'not_done10'. Let's take a look: