Extending the LAMP suite with Memcached, the Spread Toolkit, and other independent spirits

Building infrastructure for a scale-to-infinity search engine like Technorati's (which scours and archives the "live" Web of RSS and similar feeds) is like building a series of barns, each after the last one burns down. When recounting the brief history of his company, David Sifry usually says, "Well, after the first infrastructure fell down... and after the second infrastructure fell down... and after the third infrastructure fell down..." No wonder he also says "Scaling is everything". Instructive failure is a less glamorous description of trailblazing.

Technorati's trail started in David Sifry's basement, on a Penguin Computing server loaned to help with research on a feature about blogging that David and I co-wrote for the March 2003 issue of Linux Journal. (See "Building with blogs" and "The Technorati Story".) The service was instantly popular and has continued to grow steadily. By the time you read this, Technorati will be watching nearly four million syndicated sources (mostly weblogs) and well over a half billion links. (Disclaimer: I'm on Technorati's advisory board.)

Think of Technorati as a search engine for stuff that's too new for Google -- plus a platform for free and paid services based on the company's growing archives and countless potential forms of derived data.

One can look at Technorati as another fast-scaling LAMP hack (of which Google is the largest example). But when I talk to Technorati's techies, it turns out some of the most interesting applications are from open source suspects outside those that comprise the familiar LAMP initials (Linux, Apache, MySQL, PHP, Perl, Python, PostgreSQL). I recently asked the company's new VP of Engineering, Adam Hertz (who came over from Ofoto in July) to tell me more about one or two of those very useful but low-box-office tools. He immediately pointed to memcached, and handed the explaining over to Ian Kallen, senior architect on his staff.

"A lot of what we're doing has never been done before," Ian told me. "But a lot of the scaling problems we hit along the way *have* been experienced before. And the result is open source freeware, often very robust, that eases exactly the pain we're experiencing." One such result is Memcached, a distributed caching system developed by Danga Interactive -- the folks behind Live Journal, which accounts for a very large percentage of the very blogs Technorati follows. Ian explains:

Lookup queries in a large data repository can be expensive. If lookups are repeated at a rate near or less than the rate of change, it makes sense to spare the application repeated round trips to fetch the data. An obvious solution is to have the application process keep lookup results in a cache. The downside is that caching is typically resource consumptive; in a multi-process application such as a web service, having each process (or thread) keep its own cache reduces the cache efficiency and bloats the application resource footprint. The solution to this is a shared cache.

High application availability requirements are typically met with server redundancy. Now if each server instance has a cache of its own, the cache efficiency reduction becomes as ridiculous as having individual processes share a cache on a single host. The solution there is a distributed cache and that's where memcached comes in.

Listening on a network socket and utilizing very simple parameters to instrument memory allocation, memcached maintains an in-memory dictionary of keys that is dynamically populated with values. Technorati uses search parameters as keys and search results as values. The payback comes when search results that are duplicative of previously run queries return; they come back at least an order of magnitude faster.

In addition to its speed and simplicity, memcached's other principal strength is its flexibility. It can store objects in language-native serialization formats such as Perl's Storable or Java's Serializable. Technorati uses PHP's native serialization to store cache results. However, memcached can store any bytes that can be used interoperably; one cache client implemented in Java can access and read a cached XML document stored by another client written in Perl, Python or PHP. This can be a huge win for heterogeneous development environments.

Application stabilization and performance optimization are critical concerns for burgeoning data repositories such as Technorati's. Caching isn't the only thing in the software toolbox. Application architectures can still benefit or suffer from the quality of other design decisions. However, the integration challenge posed by memcached is so low as to make it a primary consideration for data retrieval acceleration.
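The pattern Ian describes (search parameters as keys, search results as values, and a check of the cache before the expensive lookup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: FakeCache is a hypothetical in-process stand-in for a memcached client, and cache_key, cached_search and slow_query are invented names, not Technorati's code. Values are stored as JSON bytes to show the language-neutral option; Technorati itself uses PHP's native serialization.

```python
import hashlib
import json


class FakeCache:
    """Hypothetical in-process stand-in for a memcached client.

    A real deployment would talk to memcached servers over the network;
    a plain dict keeps this sketch runnable on its own.
    """

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def cache_key(params):
    """Derive a short, stable key from search parameters.

    Sorting the keys makes equivalent queries map to the same entry;
    hashing keeps the key short and free of whitespace.
    """
    canonical = json.dumps(params, sort_keys=True)
    return "search:" + hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def cached_search(cache, params, run_query):
    """Return cached results if present; otherwise query and fill the cache."""
    key = cache_key(params)
    raw = cache.get(key)
    if raw is not None:
        return json.loads(raw.decode("utf-8"))
    results = run_query(params)
    # Store plain bytes so a client in any language could read them back.
    cache.set(key, json.dumps(results).encode("utf-8"))
    return results


calls = []


def slow_query(params):
    """Hypothetical expensive lookup; records each call so hits are visible."""
    calls.append(params)
    return [{"url": "http://example.com/post", "q": params["q"]}]


cache = FakeCache()
first = cached_search(cache, {"q": "linux", "page": 1}, slow_query)
second = cached_search(cache, {"page": 1, "q": "linux"}, slow_query)
```

The second call never reaches slow_query, which is the payback Ian mentions. The sketch leaves out expiry and invalidation, which any real cache of changing data has to handle.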

When I asked Adam again about other less-known tools that might be in his box, he said the company was also about to use the Spread Toolkit -- a language-independent messaging system that allows updates, events, and information to flow through distributed systems. He explains:

When a blog pings us, think of that as an event. Our spider responds, and goes and gets the new content. It will then put the update on the message bus, so any application can see the message as it goes by. Each subscriber -- a Technorati application or service -- gets a chance to see every message that goes by. It can either pick it up or pass on it. An application can say, hey, am I interested in this update? This way, we reduce the internal query load on the database, and we also make the applications more real-time. This is called multicast. It's a service similar to what TIBCO has been for financial systems.

For example, a blog post about a political book might be interesting to Book Talk, News Talk and Politics Today (three applications). Rather than having all three applications query the database to find relevant updates in the last 5 minutes, that blog post would travel across the system and be picked up by all three applications.

This not only speeds up performance for users and applications, but opens lots of new service opportunities for users and outside applications and services using the Technorati API.

Spread (also) provides fault-tolerant messaging, meaning that if some machines that are part of a Spread group crash or are partitioned from the others, Spread guarantees that all the machines that are still connected will see the same set of messages. This 'strong semantic' group messaging can be used to build replicated databases (one open-source project building a replicated database using Spread is Postgres-R), and guarantee consistency in clustered caching systems (like Technorati's).
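The message-bus idea is easy to picture in miniature. The following Python sketch is a toy, in-process model of the fan-out Adam describes, where every subscriber sees every update and decides for itself whether to pick it up. It is not Spread's API and has none of Spread's network or fault-tolerance machinery; MessageBus, TopicApp and the tag-based interest test are all assumptions made for illustration.

```python
class MessageBus:
    """Toy in-process bus: every subscriber sees every message.

    Hypothetical stand-in for a group messaging system; it only
    models the fan-out, not delivery across machines.
    """

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)


class TopicApp:
    """Hypothetical application that keeps only the updates it cares about."""

    def __init__(self, name, keywords):
        self.name = name
        self.keywords = set(keywords)
        self.picked_up = []

    def on_message(self, message):
        # "Hey, am I interested in this update?"
        if self.keywords & set(message["tags"]):
            self.picked_up.append(message["url"])


bus = MessageBus()
book_talk = TopicApp("Book Talk", {"books"})
news_talk = TopicApp("News Talk", {"news"})
politics_today = TopicApp("Politics Today", {"politics"})
for app in (book_talk, news_talk, politics_today):
    bus.subscribe(app.on_message)

# The spider publishes each update once; no application polls the database.
bus.publish({"url": "http://example.com/political-book",
             "tags": ["books", "news", "politics"]})
bus.publish({"url": "http://example.com/recipe", "tags": ["cooking"]})
```

One publish reaches all three applications, and the unrelated post is seen and passed over by each of them without a single database query.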

Five interesting things here.

First, note where Technorati doesn't look for answers. The old industrial model assumes that you obtain the expertise you need internally (from a responsible "position" in the company org chart), or you buy products and expertise from outside vendors and consultants. Here we have a CIO and a lead engineer looking outside the company for infrastructural building materials, plus experts and expertise that are in the marketplace yet not for sale. In other words, tools like Memcached and Spread are part of a larger conversation, and a larger set of relationships, than a corporate customer like Technorati would get from a commercial vendor. This isn't a knock on vendors; but it reminds us how markets defined in terms of vendor relationships, and of sales volumes in product categories, fail to include a large and growing part of the market ecosystem.

Second, the advantages of using open source tools, and of participating in development projects like Memcached and Spread, are likely to be lost on traditional IT shops, which discourage trailblazing DIY work. Adam Hertz explains:

Two themes: BigCoIT is all about standardization and isolationism.

Standardization, so the story goes, reduces risks and costs. It certainly reduces complexity, but it can take a huge toll on flexibility and responsiveness. Standardization often involves using one multi-purpose tool or platform to accomplish lots of different purposes. This often involves customization, which is done by in-house experts or professional services firms. Great examples: Siebel, Lotus Notes, etc.

DIY shops tend to be cynical, or even downright frightened, of systems like that, because they're so inflexible and unhackable.

Another form of standardization is what people are allowed to have on their desktop PCs. In a lot of big shops, everyone has the same disk image, with all applications preinstalled. There's a huge suspicion of anything that comes from the outside world, especially open source. It's regarded as flaky, virus-laden, unscalable, etc. This produces isolationism, which means that there are major barriers to just try something.

In more open environments, there's a permeable membrane between the corporate IT environment and the Net. People tend to get new tools from the net, usually open source, and just give 'em a spin. Culturally, this keeps the organization open to innovation and new approaches. It builds bonds between the employees and the development community at large.

Standardized, isolationist shops miss out on all of this. They maintain control, but they inevitably fall behind.

Third, Technorati isn't just "solving problems", although problem-solving does soak up lots of IT cycles. It's also looking to tools like Memcached and Spread to open new business and other opportunities. When I asked Adam about using Spread to expand Technorati's Web service offerings, he said the opportunities, already wide open, only get wider:

It would also be very useful to implement customized watchlists, for example. Suppose you had some weird-assed query...or more accurately, filter. Like, very specific sorts of posts you want to find out about -- posts that mention you, Linux Journal, and open source, for example. We could set that up as a subscriber to updates. It could just watch all the updates come by, discard the 99.99% of updates that aren't relevant, and when it gets a relevant one, it could send you mail or something. Contrast that with our current watchlist implementation, which queries the database. Then imagine if -- or when! -- we have 100,000 watchlists. (Watchlists are customized reports, provided to subscribers, of fresh inbound links to a given URL or keyword.)
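A watchlist of the kind Adam imagines is just a filter sitting on the update stream. Here is a minimal Python sketch, assuming updates arrive as plain text and that "relevant" means a post mentions every watched term. make_watchlist, the sample terms and the notify callback are hypothetical; a real subscriber would receive updates from the message bus and send mail instead of appending to a list.

```python
def make_watchlist(terms, notify):
    """Build a subscriber that notifies only when a post mentions every term.

    Hypothetical helper: matching is a case-insensitive substring test,
    which is cruder than anything a production filter would use.
    """
    wanted = [term.lower() for term in terms]

    def on_update(post_text):
        text = post_text.lower()
        if all(term in text for term in wanted):
            notify(post_text)
            return True
        # The overwhelming majority of updates end up here and are dropped.
        return False

    return on_update


sent = []  # stands in for "send you mail or something"
watch = make_watchlist(["alice", "Linux Journal", "open source"], sent.append)

updates = [
    "Alice writes in Linux Journal about open source infrastructure",
    "Linux Journal reviews a new laptop",
    "A recipe for sourdough bread",
]
matched = [watch(update) for update in updates]
```

Each watchlist costs one cheap test per update rather than one database query per polling interval, which is why the approach still looks workable at 100,000 watchlists.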

Fourth, Adam is quick to defer to his engineers' expertise, among which is finding useful and free open source tools in the marketplace. The sounds Adam makes when he talks about his work remind me of those Linus makes when he talks about kernel maintainers:

It is very hard to find people who don't flame and are calm and rational--and have good taste. I mean it's like...give me one honest man. It doesn't happen...too much. And at the same time, when it happens, it matters a lot. Just a few of these people make a huge difference.

Fifth, it becomes clear that trailblazing work by tasteful flameproof engineers is what gives the world the infrastructure it needs to support real business, in addition to maintaining the mundane internal operations of enterprises.

Just in the last few days, while I was writing this column, two other open source tools also came to my attention. They're interesting because they represent two extremes in the marketplace: one external, one internal.

The external one is the Gallery Project, which appears to be rapidly establishing itself as the Apache of photo gallery software. Within a week, Gallery went from something I barely knew to something it seemed everybody was telling me about. It's an extraordinarily rich and deep system that will surely be far more rich and deep by the time you read this. It shows that Apple, Adobe and Microsoft aren't the only ones who care to provide users and photographers with useful tools -- and that the tools that matter most aren't those that live on clients or vendor hosting services, but rather on photographers' own servers, or those of their friends, relatives and other parties (including businesses). Gallery is pure DIY infrastructure, and another dramatic example of how the Net and open source allow the demand side to supply itself.

I also think it's significant that Gallery scratches an itch that isn't just technical. Although it's a bit of a hack to install, it's what marketers call a "consumer-facing" service. It will be interesting to watch the market effects over the coming year or two.

The internal tool is yum (yellowdog updater modified), an automatic updater and package installer/remover for rpm systems, developed by the mathematics department at Duke University -- specifically by Yunliang Yu, Senior Systems Programmer there. It "automatically computes dependencies and figures out what things should occur to install packages (and) makes it easier to maintain groups of machines without having to manually update each one using rpm", the yum site explains.

Yum is public code, with public source repositories and a GPL license. Yet it was developed by Duke for its own internal requirements. Richard Hain, chair of the Math department, recently told me there are more than 100 Linux machines running in that department alone, with a serious commitment to running Linux on the desktop, as well as on larger machines for serious number crunching. (Most office staff are running Linux desktops. Windows users are running on Win4Lin.) Yum is a necessity in their environment, Hain explains. Frequent hardware updates alone require a "departmental distro that runs yum every day and maintains packages".

So, it seems to me, the unusual suspects that matter most in open source markets aren't just the tools, but the independent spirits that create and use them.