Google's epic graph cruncher mimicked with open source

Unlike Facebook or Yahoo!, Google is loath to open source its back-end software. For many, this is a sore point, as the search giant has built its famously distributed infrastructure atop countless open source tools fashioned outside the walls of the Googleplex. But Mountain View does give back in less-direct ways.

In some cases, Google will publish research papers describing one of the proprietary platforms driving its back end, and to a certain degree this allows outside developers to mimic these platforms with open source projects. Google papers on its GFS distributed file system and MapReduce distributed number-crunching platform, for example, gave rise to the open source Hadoop, and a paper on its BigTable distributed database sparked the open source HBase project.

Now, much the same thing has happened with Google Pregel, Mountain View's platform for processing enormous online graphs, such as a map of the web itself, or of a social network, graphing relationships between people. This week, a Texas startup known as Ravel unveiled an open source project based on Google's 2010 paper describing Pregel. Open sourced under an Apache license at GitHub, the project is dubbed GoldenOrb.


A few years back, while working on a PhD in computational mathematics, Ravel president and GoldenOrb lead architect Zach Richardson helped found a small company that helped other businesses process large amounts of data, including "semantic web" data, which seeks to give machines a better "understanding" of text on the internet. He and his colleagues soon realized that solving such problems required better tools.

"We were doing consulting at the intersection of the semantic web and Big Data," Richardson tells The Register. "Semantic web data is inherently stored in a graph, and when you get to very large data sizes, a lot of the traditional methodologies for processing or trying to understand that data no longer work. Or they don't scale. Or they take a completely unrealistic amount of time."

As luck would have it, Google published its Pregel paper. Pregel is a computational model that dovetails with Google's existing data-storage technologies, including the Google File System (GFS) and BigTable. In essence, data from GFS or BigTable is shuttled to Pregel, where the data is crunched. Presumably, Google Chubby – the company's distributed lock service – is used to manage access to data.

In the open source world, GFS, BigTable, and Chubby are mirrored by the Hadoop Distributed File System (HDFS), HBase, and ZooKeeper. Naturally, Richardson and his fellow developers built GoldenOrb atop such open source platforms. ZooKeeper handles synchronization across distributed machines, and Hadoop's remote procedure call (RPC) mechanism passes messages from node to node. Google's Pregel paper provides a high-level description of the Pregel programming model, but the rest was guesswork.

"Google provides the higher level concepts of when something needs to synchronize, what communications need to happen between servers, and what your programming model needs to look like to use it," Richardson says. "But how close we are to their implementation? It's very hard to guess."
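To give a feel for the programming model Richardson is describing, here is a minimal single-process sketch in Python of the vertex-centric approach outlined in Google's Pregel paper: computation proceeds in rounds ("supersteps"), each vertex reads the messages sent to it in the previous round, updates its own value, and messages its neighbours, with a synchronization barrier between rounds. The function name `pregel_max` and the dictionary-based graph format are assumptions made for illustration. This is not GoldenOrb's API, and it says nothing about how Google or Ravel actually implement the model.

```python
from collections import defaultdict


def pregel_max(graph, values):
    """Spread the largest value to every vertex of a connected graph.

    graph:  dict mapping vertex id -> list of neighbour ids (hypothetical format)
    values: dict mapping vertex id -> initial numeric value
    Returns (final values, number of supersteps run).
    """
    values = dict(values)
    inbox = defaultdict(list)   # messages delivered at the last barrier
    active = set(graph)         # every vertex starts out active
    superstep = 0

    while active:
        outbox = defaultdict(list)
        for v in active:
            old = values[v]
            values[v] = max([old] + inbox[v])
            # Send only in the first round or when the value changed;
            # otherwise the vertex "votes to halt" by staying silent.
            if superstep == 0 or values[v] > old:
                for n in graph[v]:
                    outbox[n].append(values[v])

        # Barrier: messages sent in this superstep become visible only
        # in the next one, and receiving a message reactivates a vertex.
        inbox = outbox
        active = set(outbox)
        superstep += 1

    return values, superstep


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    start = {"a": 3, "b": 6, "c": 2, "d": 1}
    final, steps = pregel_max(chain, start)
    print(final, steps)
```

In a distributed setting the barrier is the point at which something like ZooKeeper would coordinate the workers, and the `outbox` would travel over the network rather than sit in a dictionary.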

Building an initial GoldenOrb platform took about seven months of "on and off" work. Richardson and his team have not had even a cursory discussion with Google about the platform.

In addition to developing the platform, Ravel will build applications that run atop it. "We're focused on building enterprise products that analyze data," Richardson says. "Graph problems are [almost infinite]. This includes social network analysis, a very popular topic, but the same algorithms might also be used in things like epidemiology research or pharmaceutical research."

According to Richardson, the platform is suited to situations in which you need random access to data while your algorithm is running. MapReduce is designed for batch processing. You take a large chunk of data, break it up into tiny pieces, and spread it across a cluster of machines for processing. GoldenOrb can run algorithms that grab particular pieces of information from distributed machines on the fly. "With MapReduce, if I'm doing a calculation on one machine and I happen to need information on another machine, there's no way to get it," he says. "GoldenOrb can share information across all machines as necessary to solve the problem."
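The kind of cross-machine sharing Richardson describes can be illustrated with a toy shortest-path calculation, sketched here in Python. Vertices are assigned to one of two imaginary workers, and whenever a vertex finds a shorter route it tells its neighbours, some of which live on the other worker. Everything in this sketch is an assumption made for illustration: the `owner` partitioning rule, the `NUM_WORKERS` constant, the edge format, and the `shortest_paths` function are hypothetical, and the "workers" are simulated inside a single process rather than taken from GoldenOrb.

```python
import math
from collections import defaultdict

NUM_WORKERS = 2  # hypothetical cluster size


def owner(vertex):
    """Toy partitioning rule: integer vertex ids are split by remainder."""
    return vertex % NUM_WORKERS


def shortest_paths(edges, source):
    """Single-source shortest paths by message passing.

    edges: dict mapping vertex id -> list of (neighbour id, positive weight)
    Returns (distances, count of messages that crossed a worker boundary).
    """
    dist = {v: math.inf for v in edges}
    inbox = {source: [0]}   # seed the source with a distance of zero
    cross_worker = 0

    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                # A shorter route was found, so tell every neighbour,
                # wherever it happens to be stored.
                for n, weight in edges[v]:
                    outbox[n].append(best + weight)
                    if owner(n) != owner(v):
                        cross_worker += 1
        inbox = outbox  # deliver all messages for the next round

    return dist, cross_worker


if __name__ == "__main__":
    roads = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2)], 3: []}
    distances, crossings = shortest_paths(roads, source=0)
    print(distances, crossings)
```

The point of the sketch is the `cross_worker` counter: the algorithm only works because a vertex on one worker can hand intermediate results to a vertex on another while the computation is still under way.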

Ravel employs about fifteen people, and according to Richardson, the company has already started building products on the open source platform. But he declined to discuss them. ®