Unwinding Uber’s Most Efficient Service

A few weeks ago, Uber posted an article detailing how they built their “highest query per second service using Go”. The article is fairly short and is required reading to understand the motivation for this post. I have been doing some geospatial work in Golang lately and I was hoping that Uber would present some insightful approaches to working with geo data in Go. What I found fell short of my expectations to say the least…

The post centered around how Uber built a service in Go to handle the problem of geofencing. The core of the geofencing problem is searching a set of boundaries to find which subset contains a query point. There are a number of standard approaches to this problem, and this is the route Uber chose:

Instead of indexing the geofences using R-tree or the complicated S2, we chose a simpler route based on the observation that Uber’s business model is city-centric; the business rules and the geofences used to define them are typically associated with a city. This allows us to organize the geofences into a two-level hierarchy where the first level is the city geofences (geofences defining city boundaries), and the second level is the geofences within each city.

If you asked someone to solve the geofencing problem who had never been exposed to spatial algorithms, this is probably what they would come up with. Color me disappointed that engineers at a company valued at $50b, whose core business revolves around finding things on Earth that are nearby, chose to ignore standard solutions without a concrete reason outside of “it’s too complicated”. It’s particularly disappointing considering Uber bought a portion of Bing Maps engineering based out of Colorado last summer. I used to work on the Bing Maps Streetside team that Uber acquired and I know for a fact that there are quite a few people that know a thing or two about spatial indexing on those teams. One of my interview questions was even how to find which geo-tagged Instagram pictures were of the Eiffel Tower.

circa 2013

Biases aside, I respect that Uber set out a specific latency distribution requirement of serving 99% of all requests in under 100 milliseconds and used that as their success metric. If augmenting the brute force approach got them there, then kudos are in order. My stance on delivering kudos changed as I read further through the article.

“While the idiomatic Go way is to synchronize concurrent read/write with goroutines and channels, we were concerned about the negative performance implications.”

Triggered. They are discarding language primitives under the guise of performance, but choosing to ignore the glaring inefficiencies in their search algorithm. They’ve completely unwound an industry favorite Knuth-ism and chosen to optimize minutiae. Instead of working with solutions that interview candidates at the company they just bought are expected to have a handle on, they chose to inject their own intuition. It’s blind justification and I can count the number of times this has been a good idea on one hand. Maybe you think I’m being hyperbolic? Let me show you I’m not.

Geofencing Time Complexity Analysis

A time complexity analysis of the possible solutions is a good place to start. This doesn’t require any actual coding, just a bit of research, and it should demonstrate the inefficiencies of Uber’s approach. Even if you aren’t familiar with Big-O analysis, hopefully it will still get the point across.

via bigocheatsheet.com

The first thing we need to get out of the way is the cost of the point-in-polygon check (I like to call it the polygon inclusion query). The Wikipedia article does a good job of breaking down two common algorithms used for polygon inclusion: ray casting and the winding number. Both algorithms need to compare the query point to every vertex of the polygon, so their cost is a function of the number of vertices. If we break down the problem using Uber’s city-centric model, we end up with a few variables that we can use for our analysis.

q := Point Query
C := Number of cities
V := Number of vertices that define a city boundary
n := Number of fence polygons in a city
v := Number of vertices in a fence polygon
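To make the O(v) cost concrete, here is a minimal sketch of the ray casting variant of the polygon inclusion query. The Point type and the Contains function are names I made up for illustration, and the code assumes a simple ring on a flat plane with no special handling for points that sit exactly on an edge.

```go
package main

import "fmt"

// Point is a hypothetical 2D coordinate type used only for this sketch.
type Point struct{ X, Y float64 }

// Contains reports whether q lies inside the polygon described by ring,
// using the even-odd ray casting rule. Every edge is visited exactly once,
// so the cost is O(v) for a ring with v vertices.
func Contains(ring []Point, q Point) bool {
	inside := false
	for i, j := 0, len(ring)-1; i < len(ring); j, i = i, i+1 {
		a, b := ring[i], ring[j]
		// Does the edge a-b straddle the horizontal line through q?
		if (a.Y > q.Y) != (b.Y > q.Y) {
			// X coordinate where the edge crosses that line.
			x := (b.X-a.X)*(q.Y-a.Y)/(b.Y-a.Y) + a.X
			if q.X < x {
				inside = !inside
			}
		}
	}
	return inside
}

func main() {
	square := []Point{{0, 0}, {4, 0}, {4, 4}, {0, 4}}
	fmt.Println(Contains(square, Point{2, 2})) // true
	fmt.Println(Contains(square, Point{5, 2})) // false
}
```

The loop body is constant work per edge, which is why v shows up as a plain factor in every estimate that follows.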

From here I’ll outline a few algorithms quoted in the article and see how they stack up.

Brute Force

I’ll lean on Uber’s description of the brute force algorithm

…go through all the geofences and do a point-in-poly check using an algorithm, like the ray casting algorithm.

In our framework that translates into going through each city C, then each fence polygon n, and finally comparing the query point to each vertex in the polygon v.

Brute() -> O(Cnv)

Augmented Brute Force

Uber augmented the brute force approach by first checking which city the point is in, then going through the fences in that individual city. This effectively trims the search space by a factor of however many cities the data has. The cost of comparing the query point to every vertex of each city boundary still needs to be accounted for. In this two-stage approach, here’s what we get:

Uber() -> O(CV) + O(nv)
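The two brute force variants can be sketched side by side. This is my own reading of the description, not Uber's code: the City and Fence types, the function names, and the assumption that every fence lies inside its city boundary are all mine.

```go
package main

import "fmt"

type Point struct{ X, Y float64 }

type Fence struct {
	Name string
	Ring []Point
}

// City is a hypothetical container for the two-level hierarchy.
type City struct {
	Boundary []Point
	Fences   []Fence
}

// contains is an even-odd ray casting check, O(len(ring)).
func contains(ring []Point, q Point) bool {
	inside := false
	for i, j := 0, len(ring)-1; i < len(ring); j, i = i, i+1 {
		a, b := ring[i], ring[j]
		if (a.Y > q.Y) != (b.Y > q.Y) && q.X < (b.X-a.X)*(q.Y-a.Y)/(b.Y-a.Y)+a.X {
			inside = !inside
		}
	}
	return inside
}

// Brute checks every fence of every city: O(Cnv).
func Brute(cities []City, q Point) []string {
	var hits []string
	for _, c := range cities {
		for _, f := range c.Fences {
			if contains(f.Ring, q) {
				hits = append(hits, f.Name)
			}
		}
	}
	return hits
}

// TwoLevel tests the city boundaries first and only descends into cities
// that contain the point: O(CV) + O(nv).
func TwoLevel(cities []City, q Point) []string {
	var hits []string
	for _, c := range cities {
		if !contains(c.Boundary, q) {
			continue
		}
		for _, f := range c.Fences {
			if contains(f.Ring, q) {
				hits = append(hits, f.Name)
			}
		}
	}
	return hits
}

// rect builds an axis-aligned rectangular ring for the demo data.
func rect(x0, y0, x1, y1 float64) []Point {
	return []Point{{x0, y0}, {x1, y0}, {x1, y1}, {x0, y1}}
}

func main() {
	cities := []City{
		{rect(0, 0, 10, 10), []Fence{{"a1", rect(1, 1, 3, 3)}, {"a2", rect(2, 2, 6, 6)}}},
		{rect(20, 0, 30, 10), []Fence{{"b1", rect(21, 1, 25, 5)}}},
	}
	q := Point{2.5, 2.5}
	fmt.Println(Brute(cities, q), TwoLevel(cities, q))
}
```

Both functions return the same fences as long as the containment assumption holds; the only difference is how many rings get walked on the way there.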

These are both very rudimentary approaches. Let’s get to the fun stuff.

RTree

The r-tree data structure is based on a b-tree with a special sequence definition for multi-dimensional objects. The sequence is defined by comparing an object’s Minimum Bounding Rectangle (MBR) to another object’s MBR and checking whether it completely contains it. In the two-dimensional case of geofencing, the MBR is a simple bounding box (bbox) defined by a minimum and a maximum coordinate. Checking if an object’s bbox contains another’s is a constant time operation of asserting that its minimum coordinate is less and its maximum coordinate is greater than the other bbox’s on both axes. The article on Wikipedia has a thorough explanation, but here’s a visual aid for the tree and its objects’ bounding boxes.

via wikipedia.org
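The constant time containment test is short enough to write out. This is a minimal sketch with a BBox type of my own naming, not the rtreego API; Union is included only to show how a parent node's box could be grown to cover its children.

```go
package main

import (
	"fmt"
	"math"
)

// BBox is a hypothetical axis-aligned bounding box.
type BBox struct{ MinX, MinY, MaxX, MaxY float64 }

// Contains reports whether o lies completely inside b. It is four
// comparisons no matter how complex the underlying shapes are.
func (b BBox) Contains(o BBox) bool {
	return b.MinX <= o.MinX && b.MinY <= o.MinY &&
		b.MaxX >= o.MaxX && b.MaxY >= o.MaxY
}

// Union returns the smallest box covering both b and o.
func (b BBox) Union(o BBox) BBox {
	return BBox{
		math.Min(b.MinX, o.MinX), math.Min(b.MinY, o.MinY),
		math.Max(b.MaxX, o.MaxX), math.Max(b.MaxY, o.MaxY),
	}
}

func main() {
	parent := BBox{0, 0, 10, 10}
	child := BBox{2, 3, 4, 5}
	fmt.Println(parent.Contains(child)) // true
	fmt.Println(child.Contains(parent)) // false
	fmt.Println(child.Union(BBox{8, 8, 12, 9}))
}
```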

If implementing that on your own sounds untenable, doing a quick search comes up with a few pure Go implementations. The rtreego package gets most of the way, but should be specialized a bit for 2 dimensions and has a couple extra mallocs that can be eliminated.

Winding up our r-tree analysis leaves us with the work of searching the r-tree for the polygons whose bounding boxes contain the point and then doing the polygon inclusion query on each of them. The average r-tree search time complexity is O(log_M n), where M is the user-defined maximum number of children a node can have.

M := Maximum number of children
Rtree() -> O(log_M(Cn)) + O(v)

QuadTree

A quadtree is a specialization of a generic kd-tree for 2-dimensional indexing. Basically you take a flat projection of your search space and divide it into quarters that we’ll call cells. You then divide each of those cells into quarters recursively until you hit a defined maximum depth, which will be the leaves of the tree. If we take a Mercator projection of the Earth and label each of the cells by appending an identifier to the parent label, we can leverage a quadtree structure to create a tiling system for fast geospatial searches, like Bing Maps does/did. Bing Maps calls the cell labels “QuadKeys” and they are directly translatable to Map Tiles. Here’s a nice overview of how they are created.

Bing Maps QuadKeys
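As a rough illustration of how a label is built, the sketch below projects a latitude/longitude onto Web Mercator tile coordinates and interleaves the bits into a QuadKey string. It follows the formulas in the public Bing Maps tile system write-up as I remember them; the function names TileXY and QuadKey are my own, and the sample coordinate is just a point in Manhattan.

```go
package main

import (
	"fmt"
	"math"
)

// TileXY converts a WGS84 latitude/longitude to Web Mercator tile
// coordinates at the given level (2^level tiles along each axis).
func TileXY(lat, lon float64, level uint) (x, y int) {
	// Clamp to the latitude range the square Mercator map can represent.
	lat = math.Max(-85.05112878, math.Min(85.05112878, lat))
	n := float64(uint64(1) << level)
	sinLat := math.Sin(lat * math.Pi / 180)
	fx := (lon + 180) / 360 * n
	fy := (0.5 - math.Log((1+sinLat)/(1-sinLat))/(4*math.Pi)) * n
	clamp := func(v float64) int {
		return int(math.Max(0, math.Min(n-1, math.Floor(v))))
	}
	return clamp(fx), clamp(fy)
}

// QuadKey interleaves the bits of the tile coordinates, most significant
// first, producing one base-4 digit per level. A parent cell's key is
// always a prefix of its children's keys.
func QuadKey(x, y int, level uint) string {
	key := make([]byte, 0, level)
	for i := level; i > 0; i-- {
		digit := byte('0')
		mask := 1 << (i - 1)
		if x&mask != 0 {
			digit++
		}
		if y&mask != 0 {
			digit += 2
		}
		key = append(key, digit)
	}
	return string(key)
}

func main() {
	x, y := TileXY(40.7396, -74.0089, 12)
	fmt.Println(QuadKey(x, y, 12))
	fmt.Println(QuadKey(3, 5, 3)) // 213
}
```

The prefix property is what makes these labels useful as index keys: truncating a key moves you up the tree without any extra math.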

S2

Why am I even talking about a quadtree? Well, the “complicated” S2 algorithm mentioned in the article is just an implementation of a quadtree! The main difference between the Bing Maps tiling system and S2 is that the S2 projection is done via a cube mapping of the Earth sphere so that each cell has similar surface area. The cells are also arranged using a space-filling curve to conserve spatial locality in the cell label. Here’s a post with a deeper explanation. S2 is written in C++ and has bindings for various languages. One of the great things about Go is having a simple build process and producing a single statically linked binary. Introducing C++ bindings would remove those benefits. Lucky for us there’s a pure Go implementation at golang/geo. It is admittedly incomplete and is lacking an s2.Region implementation for s2.Polygon, so the s2.RegionCoverer can’t be used out of the box, but the core is there. Here are two examples of flat covers at different levels and a RegionCoverer that generates multi-level covers.

Upper East Side covers from s2map.com

Winding back to our analysis, if we get a set of cell labels for each feature and a cell for our query, we can narrow the search space down to a couple of polygons in constant time and then do a polygon inclusion check. We’ll add a new parameter T for the number of polygons with the same cell label. It is effectively constant, determined by the ratio of feature area to cell size at the chosen zoom level.

T := Features with cell label
QTree() -> O(T) + O(v)
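Here is a minimal sketch of that lookup. To stay self-contained it swaps S2 cells and QuadKeys for a plain fixed-size grid on a flat plane, and it covers each fence with every grid cell its bounding box touches. The cell type, cellSize, and the Index methods are all assumptions of mine for illustration.

```go
package main

import (
	"fmt"
	"math"
)

type Point struct{ X, Y float64 }

type Fence struct {
	Name string
	Ring []Point
}

// cell is a hypothetical stand-in for an S2 cell ID or a QuadKey.
type cell struct{ X, Y int }

const cellSize = 1.0 // assumed cell edge length, same units as Point

func cellOf(p Point) cell {
	return cell{int(math.Floor(p.X / cellSize)), int(math.Floor(p.Y / cellSize))}
}

// Index maps a cell label to the fences whose bounding box touches it.
type Index map[cell][]Fence

// Add registers f under every cell covered by its bounding box.
func (ix Index) Add(f Fence) {
	if len(f.Ring) == 0 {
		return
	}
	lo, hi := f.Ring[0], f.Ring[0]
	for _, p := range f.Ring[1:] {
		lo = Point{math.Min(lo.X, p.X), math.Min(lo.Y, p.Y)}
		hi = Point{math.Max(hi.X, p.X), math.Max(hi.Y, p.Y)}
	}
	a, b := cellOf(lo), cellOf(hi)
	for x := a.X; x <= b.X; x++ {
		for y := a.Y; y <= b.Y; y++ {
			ix[cell{x, y}] = append(ix[cell{x, y}], f)
		}
	}
}

// Get does one map lookup, then a point-in-polygon check on each of the
// T candidates sharing the query's cell.
func (ix Index) Get(q Point) []string {
	var hits []string
	for _, f := range ix[cellOf(q)] {
		if contains(f.Ring, q) {
			hits = append(hits, f.Name)
		}
	}
	return hits
}

// contains is an even-odd ray casting check, O(len(ring)).
func contains(ring []Point, q Point) bool {
	inside := false
	for i, j := 0, len(ring)-1; i < len(ring); j, i = i, i+1 {
		a, b := ring[i], ring[j]
		if (a.Y > q.Y) != (b.Y > q.Y) && q.X < (b.X-a.X)*(q.Y-a.Y)/(b.Y-a.Y)+a.X {
			inside = !inside
		}
	}
	return inside
}

func rect(x0, y0, x1, y1 float64) []Point {
	return []Point{{x0, y0}, {x1, y0}, {x1, y1}, {x0, y1}}
}

func main() {
	ix := Index{}
	ix.Add(Fence{"a", rect(0.2, 0.2, 2.5, 2.5)})
	ix.Add(Fence{"b", rect(5, 5, 6, 6)})
	fmt.Println(ix.Get(Point{1.5, 1.5}), ix.Get(Point{2.7, 2.7}))
}
```

The second query shares a cell with fence a but lies outside it, which is why the polygon inclusion check still has to run on the candidates.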

There are ways we could leverage S2 cells or QuadKeys to get an internal covering of our boundaries and get a constant time check if a point is included in a polygon. The tradeoff for skipping the point-in-polygon check is storing all the internal covering keys which becomes memory bound very quickly. We could alleviate some of the memory usage with things like prefix tries or bloom filters. Maybe I’ll dig into that later.

Comparison

After going through each of the algorithms in question, we end up with a few very different looking complexities for our algorithms.

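q := Point Query
C := Number of cities
V := Number of vertices that define a city boundary
n := Number of fence polygons in a city
v := Number of vertices in a fence polygon
M := Maximum number of children
T := Features with cell label

Brute() -> O(Cnv)
Uber() -> O(CV) + O(nv)
Rtree() -> O(log_M(Cn)) + O(v)
QTree() -> O(T) + O(v)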

Having so many parameters can make the analysis a bit opaque. If we use our estimating skills to guess some values for these parameters, we should be able to paint a picture of how efficient each one is. While we’re straying from traditional time complexity analysis, I think it will be telling to run through this exercise.

Estimating

The original post said that decomposing the feature set into cities trimmed the search space from tens of thousands to hundreds. We can then infer that they have hundreds of cities to search. The number of points that form the boundary of a city can vary significantly. I’ve seen definitions of Manhattan vary from one thousand to six thousand points. Let’s assume they choose simple definitions or do some of their own polygon simplification like Douglas-Peucker (despite never mentioning it) and say a city boundary averages 100 points.

NYC Neighborhoods

How many fence polygons will each city have? Drawing from the same logic as we did for the number of cities and knowing that New York City has 167 neighborhoods, 100 again seems like the right order of magnitude. Looking at the vertices for a fence, neighborhoods like Williamsburg in Brooklyn have a few hundred points, but user-defined polygons almost certainly have simple shapes of a few points. Let’s take a guess and again stick with 100 for our v parameter. Considering that v is the polygon inclusion query itself and each algorithm has to do it, I’m less concerned about getting it correct. I think we’re at the very least in the correct order of magnitude.
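Plugging those guesses into the four cost expressions is simple arithmetic, sketched below. C, V, n and v are the estimates from the text; M = 50 and T = 2 are placeholder values I picked only to have something to compute with, so treat the last two results as order-of-magnitude figures at best.

```go
package main

import (
	"fmt"
	"math"
)

// Estimate evaluates the four cost expressions for one query.
// M (r-tree fan-out) and T (fences per cell) are assumed inputs.
func Estimate(C, V, n, v, M, T float64) (brute, city, rtree, qtree float64) {
	brute = C * n * v                        // O(Cnv)
	city = C*V + n*v                         // O(CV) + O(nv)
	rtree = math.Log(C*n)/math.Log(M) + v    // O(log_M(Cn)) + O(v)
	qtree = T + v                            // O(T) + O(v)
	return
}

func main() {
	brute, city, rtree, qtree := Estimate(100, 100, 100, 100, 50, 2)
	fmt.Printf("Brute: %.0f\n", brute)
	fmt.Printf("City:  %.0f\n", city)
	fmt.Printf("Rtree: %.1f\n", rtree)
	fmt.Printf("QTree: %.0f\n", qtree)
}
```

With these inputs the brute force search costs 1,000,000 vertex comparisons and the city approach 20,000, while both spatial indexes land just above the 100 comparisons of the single polygon inclusion query they cannot avoid.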

If our logic is sound, we should get twice the gains in efficiency that the Uber algorithm saw using a standard spatial index.

Thinking inside the box

The astute reader may have picked up that there’s a simple change to Uber’s algorithm we could make to drastically increase its efficiency. We could use bounding boxes to avoid many of the expensive point-in-polygon queries the same way the Rtree algorithm does. They never make any references to using bounding boxes in the post, so I can only assume that they didn’t take advantage of them. A brute force search on all the fences with a bounding box check is more efficient than the Uber algorithm. Applying a bounding box check to their algorithm gets us on the same order of magnitude as the spatial indexes.
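A minimal sketch of that prefilter, with types and names of my own choosing: each fence carries a bounding box computed once at load time, and the expensive ring walk only runs when the cheap box test passes. Search also returns how many full checks it had to do so the saving is visible.

```go
package main

import (
	"fmt"
	"math"
)

type Point struct{ X, Y float64 }

type BBox struct{ Min, Max Point }

func (b BBox) Contains(q Point) bool {
	return q.X >= b.Min.X && q.X <= b.Max.X && q.Y >= b.Min.Y && q.Y <= b.Max.Y
}

type Fence struct {
	Name string
	Ring []Point
	Box  BBox // computed once when the fence is loaded
}

// NewFence assumes a non-empty ring.
func NewFence(name string, ring []Point) Fence {
	box := BBox{ring[0], ring[0]}
	for _, p := range ring[1:] {
		box.Min = Point{math.Min(box.Min.X, p.X), math.Min(box.Min.Y, p.Y)}
		box.Max = Point{math.Max(box.Max.X, p.X), math.Max(box.Max.Y, p.Y)}
	}
	return Fence{name, ring, box}
}

// contains is an even-odd ray casting check, O(len(ring)).
func contains(ring []Point, q Point) bool {
	inside := false
	for i, j := 0, len(ring)-1; i < len(ring); j, i = i, i+1 {
		a, b := ring[i], ring[j]
		if (a.Y > q.Y) != (b.Y > q.Y) && q.X < (b.X-a.X)*(q.Y-a.Y)/(b.Y-a.Y)+a.X {
			inside = !inside
		}
	}
	return inside
}

// Search is a brute force scan with a constant time box test per fence.
// checks counts how many O(v) polygon inclusion queries were needed.
func Search(fences []Fence, q Point) (hits []string, checks int) {
	for _, f := range fences {
		if !f.Box.Contains(q) {
			continue
		}
		checks++
		if contains(f.Ring, q) {
			hits = append(hits, f.Name)
		}
	}
	return hits, checks
}

func main() {
	fences := []Fence{
		NewFence("tri", []Point{{0, 0}, {4, 0}, {0, 4}}),
		NewFence("far", []Point{{50, 50}, {60, 50}, {60, 60}, {50, 60}}),
	}
	fmt.Println(Search(fences, Point{1, 1}))
	fmt.Println(Search(fences, Point{3, 3}))
}
```

The scan is still linear in the number of fences, but each miss costs four comparisons instead of a walk over every vertex.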

This is a tangible change that they could make in their systems today. If they started from scratch they should still stick with the spatial algorithms for their flexibility and avoiding managing the city boundary to polygon mappings. Even though we’re dealing with gophers instead of snakes, the zen of “Flat is better than nested” still rings true.

Benchmarking Geofencing Algorithms

En garde!

Still not convinced? I implemented each of the algorithms, modeled the geofencing problem and benchmarked the results. To dispel the complexity of the S2 library, I also implemented the missing s2.Region interface for the s2.Loop type. A loop is a simple closed polygon in S2 terminology. It took an hour to implement.

For ease of comparison, I created a GeoFence interface for each algorithm to implement.

I loaded up the NYC 2010 Census Tracts and queried which tract the location of the Whitney Museum was in at the time. I counted each NYC borough as a city for the Uber algorithm. I renamed it to the City algorithm so as not to shame anyone. The benchmarking code looked something like this:
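A stripped-down sketch of the shape of that benchmark follows. Treat every name in it as a placeholder: the method set on GeoFence, the bruteFence type, the single rectangular "tract" and the query coordinate are illustrative stand-ins rather than the exact gofence code or the census data. It uses testing.Benchmark so it can run as a plain program.

```go
package main

import (
	"fmt"
	"testing"
)

type Point struct{ Lat, Lon float64 }

type Feature struct {
	Name string
	Ring []Point
}

// GeoFence is a hypothetical minimal interface each algorithm implements.
type GeoFence interface {
	Add(f Feature)
	Get(q Point) []Feature
}

// bruteFence is the simplest possible implementation: scan everything.
type bruteFence struct{ features []Feature }

func (b *bruteFence) Add(f Feature) { b.features = append(b.features, f) }

func (b *bruteFence) Get(q Point) []Feature {
	var hits []Feature
	for _, f := range b.features {
		if contains(f.Ring, q) {
			hits = append(hits, f)
		}
	}
	return hits
}

// contains is an even-odd ray casting check with Lon as x and Lat as y.
func contains(ring []Point, q Point) bool {
	inside := false
	for i, j := 0, len(ring)-1; i < len(ring); j, i = i, i+1 {
		a, b := ring[i], ring[j]
		if (a.Lat > q.Lat) != (b.Lat > q.Lat) &&
			q.Lon < (b.Lon-a.Lon)*(q.Lat-a.Lat)/(b.Lat-a.Lat)+a.Lon {
			inside = !inside
		}
	}
	return inside
}

// benchmarkFence times repeated lookups of q against any GeoFence.
func benchmarkFence(fence GeoFence, q Point) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			fence.Get(q)
		}
	})
}

func main() {
	fence := &bruteFence{}
	// One made-up rectangular tract; the real run loaded the census tracts.
	fence.Add(Feature{"tract-1", []Point{
		{40.73, -74.01}, {40.73, -74.00}, {40.75, -74.00}, {40.75, -74.01},
	}})
	q := Point{40.7396, -74.0089} // illustrative point in lower Manhattan
	fmt.Println(len(fence.Get(q)))
	fmt.Println(benchmarkFence(fence, q))
}
```

Because the benchmark only depends on the interface, swapping in an r-tree or S2 backed fence means changing the constructor and nothing else.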

Running these micro benchmarks validated our time complexity analysis by demonstrating the relative efficiencies of each approach. I had to use a log scale to get the results to be visually comparable.

The evidence looks pretty damning at this point, but maybe you’re still skeptical. I modeled Uber’s dataset by using geo-tagged tweets as a proxy for ride requests and the live locations of MTA buses as a substitute for car locations. Searching the 2,123 Census Tracts in NYC across the five boroughs for these buses and tweets should bear some resemblance to Uber’s actual data feeds. Benchmarking via wrk on a single core Digital Ocean droplet from another droplet on the same private network demonstrated that using a spatial index increased throughput by more than 166% compared to the city algorithm. For a control sample, an endpoint serving a short string served an average of 10,930 requests / second.

These numbers may look paltry next to the quoted throughput of 170k QPS in the Uber article. However, this benchmark ran on a single core and their cluster is 40 nodes. If we assume that the cluster is composed of 16 core CPUs, that is 40 × 16 = 640 cores, so we’re trading at a 1:640 ratio. If we apply that ratio to the performance disparity, we’re talking about a difference of millions of requests in throughput.

To get the full story, let’s look into what’s going on under the hood by profiling where each fence is spending its time.

It’s evident that the majority of the work in the brute force family of fences is done in searching, while the spatial indexes are serializing more data and responding to more requests. We could cut into that serialization cost by accepting a wire protocol like protobuf or the schema-less msgpack. Uber has a blog post about what they chose for serialization. I’m sure there’s also plenty of tuning that could be done on the handlers and the server.

The geofence-profiling repo has the data used in this experiment, the specifics on how the benchmarking was conducted and more in depth results such as the pprof graphs and latency tables.

Winding it back up

It is clear to me that the team at Uber under-engineered this problem. Thoughtfully designing this service could trim down the number of nodes by an order of magnitude and save hundreds of thousands of dollars each year. That may sound like a pittance to a company valued at more than the GDP of Delaware, but in my eyes that’s the salaries of a few engineers and a few good engineers can go a long way. Maybe even further than the few extra Mercedes-Benz S-Classes they could add to their fleet from the money they could be saving...

I hope you enjoyed my little experiment as much as I did. The gofence library follows. It includes an install script that will pick the right binary for your architecture, so you can start fencing geojson point features into your very own polygons today.