Friday, September 28, 2007

Starting Findory: Infrastructure and scaling

[After a long break, I am returning to my "Starting Findory" series, a group of posts about my experience starting and building Findory.]

From the beginning of Findory, I was obsessed with performance and scaling.

The problem with personalization is that it breaks the most common strategy for scaling, caching. When every visitor is seeing a different page, there are far fewer good opportunities to cache.

No longer can you just grab the page you served the last guy and serve it up again. With personalization, every time someone asks for a page, you have to serve it up fresh.

But you can't just serve up any old content fresh. With personalization, when a visitor asks for a web page, first you need to ask, who is this person and what do they like? Then, you need to ask, what do I have that they might like?

So, when someone comes to your personalized site, you need to load everything you need to know about them, find all the content that person might like, rank and lay out that content, and serve up a pipin' hot page. All while the customer is waiting.

Findory works hard to do all that quickly, almost always in well under 100ms. Time is money, after all, both in terms of customer satisfaction and the number of servers Findory has to pay for.

Findory does this by pre-computing as much of the expensive personalization as it can. Much of the task of matching interests to content is moved to an offline batch process. The online task of personalization, the part that runs while the user is waiting, is reduced to a few thousand data lookups.

Even a few thousand database accesses could be prohibitive given the time constraints. However, much of the content and pre-computed data is effectively read-only. Findory replicates the read-only data out to its webservers, making these thousands of lookups lightning-fast local accesses.
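To make the split between offline and online work concrete, here is a minimal sketch in Python. It is not Findory's actual algorithm, which this post does not describe. It assumes a simple co-occurrence count as a stand-in for the expensive matching, and the names `build_related` and `personalize` are hypothetical. The point is only the shape: the batch step builds a read-only table, and the online step does nothing but in-memory lookups against a local copy of it.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_related(histories):
    """Offline batch step (assumed stand-in): count how often two
    articles appear in the same reading history."""
    counts = defaultdict(Counter)
    for history in histories:
        # Every ordered pair of distinct articles read by the same person.
        for a, b in permutations(set(history), 2):
            counts[a][b] += 1
    # Plain dicts: the result is treated as read-only and could be
    # copied out to every webserver.
    return {article: dict(c) for article, c in counts.items()}


def personalize(history, related, limit=10):
    """Online step: only lookups in the local, pre-computed table."""
    seen = set(history)
    scores = defaultdict(float)
    for article in history:
        for candidate, weight in related.get(article, {}).items():
            if candidate not in seen:
                scores[candidate] += weight
    # Highest score first; ties broken by article id for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:limit]]


if __name__ == "__main__":
    # Made-up reading histories, purely for illustration.
    table = build_related([
        ["a1", "a2"],
        ["a2", "a3"],
        ["a1", "a3"],
        ["a2", "a3"],
    ])
    print(personalize(["a2"], table))
```

The online function never touches a database or recomputes similarities, so its cost depends on the length of one reader's history, not on how many readers or articles the batch step had to process.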

Read-write data, such as each reader's history on Findory, is in MySQL. MyISAM works well for this task since the data is not critical and speed is more important than transaction support.

The read-write user data in MySQL can be partitioned by user id, making the database trivially scalable. The online personalization task scales independently of the number of Findory users. Only the offline batch process faces any issue of scaling as Findory grows, but that batch process can be done in parallel.
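Partitioning by user id can be as simple as the sketch below. The host names and the modulo scheme are assumptions for illustration only. The post does not say how Findory maps users to databases.

```python
# Hypothetical database hosts, one per partition.
SHARDS = ["db0.example", "db1.example", "db2.example"]


def shard_for(user_id: int, num_shards: int) -> int:
    """Map a user id to a partition number with a simple modulo."""
    return user_id % num_shards


def host_for(user_id: int) -> str:
    """Return the (hypothetical) host that holds this user's history."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


if __name__ == "__main__":
    for uid in (7, 8, 9):
        print(uid, host_for(uid))
```

Because each request only ever needs one user's rows, no query has to cross partitions. One tradeoff of plain modulo is that changing the number of partitions moves most users to a different host; a lookup table from user id to host avoids that at the cost of one more lookup.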

In the end, it is blazingly fast. Readers receive fully personalized pages in under 100ms. As they read new articles, the page changes immediately, no delay. It all just works.

Even so, I wonder if I have been too focused on scaling and performance. For example, there have been some features in the crawl, search engine, history, API, and Findory Favorites that were not implemented because of the concern about how they might scale. That may have been foolish.

The architecture, the software, the hardware cluster, these are just tools. They serve a purpose, to help users, and have little value on their own. A company should focus on users first and infrastructure second. Despite the success in the design of the core personalization engine, perhaps I was too focused on keeping performance high and avoiding scaling traps when I should have been giving readers new features they wanted.

Are you using any form of locality-sensitive hashing (LSH) to do fast clustering and retrieval of the personalized content? It seems that LSH could be used for a first-level recommendation, which could then be refined with more sophisticated techniques such as SVMs or k-means.