One thing I wondered after reviewing the slides -- and this is nothing but a crazy thought exercise on my part -- is whether they might be able to simplify the LiveJournal architecture (the arrows pointing everywhere shown on slide 4).

In particular, there seems to be a heavier emphasis on caching layers and read-only databases than I would expect. I tend to prefer aggressive partitioning, enough partitioning to get each database working set small enough to easily fit in memory.

I know little about LiveJournal's particular data characteristics, but I wonder if aggressive partitioning in the database layer might yield performance as high as a caching layer without the complexity of managing the cache consistency. Databases with data sets that fit in memory can be as fast as in-memory caching layers.

Likewise, I wonder if there would be benefit from dropping the read-only databases in favor of partitioned databases. With partitioned databases, the databases may be able to fit the data they do have entirely in memory; read-only replicas may still be hitting disk if the data is large.

Hitting disk is the ultimate performance killer. Developers often try to avoid database accesses because their database accesses hit disk and are dog slow. But, if you can make your database accesses not hit disk, then they can be blazingly fast, so fast that separate layers even might become unnecessary.

Again, wild speculation on my part. Everything I have seen indicates that Brad knows exactly what he is doing. Still, I cannot help myself from wondering, is there any way to get rid of all those layers?

4 comments:

What about the issue of availability - if a partition is unavailable, but the clients that update and the clients that read would be out of luck. If there are very many more readers than writers (like product data at amzn) you may want to decouple the systems purely for availability concerns.

For handling sudden traffic spikes, hotspot partitions, and failures,it's much easier to add a bunch ofread-only replicas than to repartition the data.

LiveJournal does use partitions.They get two scaling knobs:* # partitions scales with the number of active journal writers* # replicas scales with the amount of incoming traffic.* # replicas for a given partition can be scaled up for hotspots.

In this case, the cache consistencyis no big deal, this is a bloggingsystem used by teenagers, notPayPal.

Hi, Mike. Sorry, I was not clear. I did mean that each partition would be replicated. It would look like the UserDB clusters on slide 4.

The big difference between that and having one large database with a master and many read-only replicas is that the big database probably will hit disk but the parititioned, replicated databases should serve out of memory.

That's a good point that read-only data that is infrequently updated, like Amazon's product data or Google's search index, may be handled differently. However, heavily partitioning read-only data to keep it in memory still makes sense. Google, for example, does that with its index shards on GFS.

I've been interested in how LJ scales for a while, and I think half of their success is managing to avoid the database as much as possible for simple reads of information that can be gotten with less overhead. The use of Memcached is a great part of that - it literally stores the finished HTML, ready to be dropped into the outbound stream. When it is read first from the DB it's got to be processed to get it screen ready - including parsing custom html-style tags and other dynamic insertions.

If that had to be read from the database - even if it was in memory most of the time - it would be a significant overhead, instead they can usually just pick a block of HTML text out of memcached, indexed by the original user and post number - and of course the most read posts are almost invariably in the last 12-24 hours..

Though I'm not yet storing finished user profiles for the website I run (a dating site), I am caching the the number of users, and some other simple information, for 30 seconds between counts - and though the entire dataset is in memory, it's still saving a several million database requests a day - and previous experiments where I more aggressively cached user's raw profile data and saw a huge decrease in number of database hits, and the amount of network traffic dropped as much as 80% (at that time I was just caching it into shared memory on a single webserver, from a local network database server).

Because there was no database overhead, the pages were also faster to fetch. When I can cache - in memcached - the finished HTML, I expect further big reductions in time to produce a page.

As for database partitioning, they already are, with the user shards, p4, right hand side. The global database is a still potential problem, and I understand they've been trying to de-centralise that for a while, but with 12.8million journals and around 200,000 posts per day (stats here: http://www.livejournal.com/stats.bml) it takes a lot of thought.

One final thought about partioning data, bear in mind that any user could be viewing any other user's recent posts - that's a lot of unfinished data to throw around, and potentially hundreds of times per minute for popular journals, like JWZ's or RSS feeds for WilWheaton or xkcd.com.

Via dzone: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more -- include a slightly older, but more complete version (80 page, my personal favourite: page 8) of how Livejournal has been scaled up.