Need a monitoring system with the ability to graph everything centrally, so you don’t have to go to individual machines.

Being on the front page of Digg crashed the server (Digg sends a lot of traffic).

Why not memcache the front page? It gets loaded all day long, and it is always the same.

Rewrote the system to only read from the database when the page isn’t in memcache. Refresh memcache once per day.

Before the change, database traffic was 60% writes, 40% reads. After the change it was 99% writes and only 1% reads; all the reads were now coming from memcache.
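
A minimal sketch of that pattern, assuming a memcached client such as pymemcache; the key name, the one-day expiry, and render_front_page_from_db() are invented for illustration, not the actual code:

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
ONE_DAY = 24 * 60 * 60  # refresh the cached page roughly once per day


def render_front_page_from_db():
    # Stand-in for the expensive database queries that build the page.
    return b"<html>front page</html>"


def front_page_html():
    # Serve the cached copy if it exists; only hit the database when the
    # key is missing or has expired.
    html = cache.get("front_page")
    if html is None:
        html = render_front_page_from_db()
        cache.set("front_page", html, expire=ONE_DAY)
    return html
```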

At Twitter, they expect that people will come along and read what you’ve written, so they do write-through caching. The tweet is put into the cache first, then into the database. This way they never need to read the database to get the recent tweets.
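
A rough sketch of write-through caching in that spirit (again using pymemcache; the key layout, the 20-tweet limit, and save_tweet_to_db() are invented, not Twitter’s code):

```python
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
RECENT_LIMIT = 20  # how many recent tweets to keep hot per user


def save_tweet_to_db(user_id, text):
    pass  # stand-in for the durable INSERT into the database


def post_tweet(user_id, text):
    # Write-through: update the cache first, then persist to the database,
    # so a read of recent tweets never has to touch the database.
    key = f"recent_tweets:{user_id}"
    recent = json.loads(cache.get(key) or b"[]")
    recent = [text] + recent[: RECENT_LIMIT - 1]
    cache.set(key, json.dumps(recent))
    save_tweet_to_db(user_id, text)


def recent_tweets(user_id):
    # Served entirely from the cache.
    return json.loads(cache.get(f"recent_tweets:{user_id}") or b"[]")
```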

When you get beyond a certain point, you can’t analyze the data on a single machine. You have terabytes and terabytes of data.

Hadoop lets you run distributed jobs that automatically retry when systems go down or fail.

Without this, as the data grows, you end up asking simpler and simpler questions.

With this, you can ask more sophisticated questions.
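
For a concrete picture, here is roughly what a Hadoop Streaming job looks like in Python: a mapper and a reducer that count tweets per user, where Hadoop splits the input across the cluster and re-runs any task whose machine fails. The input layout (tab-separated, user id in the first column) and the paths are assumptions, not from the talk:

```python
#!/usr/bin/env python
# mapper.py and reducer.py for a Hadoop Streaming job that counts tweets
# per user. Each would normally live in its own small script and be run as:
#   hadoop jar hadoop-streaming.jar \
#       -input /logs/tweets -output /reports/tweets_per_user \
#       -mapper mapper.py -reducer reducer.py
import sys


def mapper():
    # Emit "user_id<TAB>1" for every input line (user id assumed in column 0).
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")


def reducer():
    # Input arrives sorted by key, so counts can be summed with a running total.
    current_user, count = None, 0
    for line in sys.stdin:
        user_id, _, value = line.rstrip("\n").partition("\t")
        if user_id != current_user:
            if current_user is not None:
                print(f"{current_user}\t{count}")
            current_user, count = user_id, 0
        count += int(value)
    if current_user is not None:
        print(f"{current_user}\t{count}")
```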

Scaling search…

They can process 1 to 2 searches per second.

Search is hard.

What was the first thing to blow up for you?

1st was MySQL, 2nd was Apache.

Made the switch over to nginx for serving up images. Much, much faster. Using Apache was like using a sledgehammer to serve up images.

Connection issues with Postgres.

Migrating a data schema at scale is really hard… turn off indexes before copying data.
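
One hedged sketch of what that can look like, with psycopg2-style placeholders against a hypothetical old_users/new_users pair (all table, column, and index names invented): drop the secondary indexes on the target, bulk-copy in batches, then rebuild the indexes once at the end so the copy isn’t paying for index maintenance on every row.

```python
def migrate(conn, batch_size=10_000):
    cur = conn.cursor()

    # 1. Drop secondary indexes on the target so each batch insert is cheap.
    cur.execute("DROP INDEX IF EXISTS idx_new_users_email")
    conn.commit()

    # 2. Copy rows in batches to keep transactions small.
    last_id = 0
    while True:
        cur.execute(
            "INSERT INTO new_users (id, email, created_at) "
            "SELECT id, email, created_at FROM old_users "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        if cur.rowcount == 0:
            break
        conn.commit()
        cur.execute("SELECT MAX(id) FROM new_users")
        last_id = cur.fetchone()[0]

    # 3. Rebuild the index once, after all the data is in place.
    cur.execute("CREATE INDEX idx_new_users_email ON new_users (email)")
    conn.commit()
```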

Twitter: one thing that kept our ops team awake at night…

We are a Rails app.

How do we maintain relationships?

We had it normalized with a follower table: user_id, follower_id in a single table.

Lookups against this table were slow.

They built an intermediate solution: a denormalized data structure.

While they worked on a longer-term solution, they built a custom social graph data tool.

It needed to work across 7 orders of magnitude: from someone with 1 follower to someone with 1M followers.
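
A toy contrast between the two shapes, with sqlite standing in for MySQL and a plain dict standing in for the denormalized store (all names and data invented): the normalized table answers “who follows X” by scanning edge rows, while the denormalized version keeps the whole follower list under one key so a lookup is a single fetch.

```python
import sqlite3

# Normalized shape from the talk: one row per (user_id, follower_id) edge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE followers (user_id INTEGER, follower_id INTEGER)")
conn.executemany("INSERT INTO followers VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])


def followers_normalized(user_id):
    rows = conn.execute(
        "SELECT follower_id FROM followers WHERE user_id = ?", (user_id,)
    )
    return [r[0] for r in rows]


# Denormalized shape: the entire follower list stored under one key
# (in practice this would live in a cache or a purpose-built store).
follower_lists = {1: [2, 3], 2: [3]}


def followers_denormalized(user_id):
    return follower_lists.get(user_id, [])


print(followers_normalized(1))    # [2, 3]
print(followers_denormalized(1))  # [2, 3]
```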

Questions

Deployment

Twitter uses Murder: BitTorrent for deployment. Seed some servers, then those servers help feed the others. It brought main app deployment time from 12 minutes to 37 seconds. Check out Twitter open source; they open-source most of their tools.

Hardware for databases

Twitter is using some, Facebook is experimenting with them. They are PCI Express cards almost as fast as main memory.

Databases versus key-value stores

It’s natural to go denormalized: you just want the data you want, in the form you want it.

Over time, more of the logic goes into the application code, so database indexes become less useful.

How do you manage which data is on which servers?

At Twitter, they are using Cassandra; it is eventually consistent.

There are a bunch of new systems that have different tradeoffs. Some have eventual consistency, some don’t. If your application can’t handle eventual consistency, then Cassandra isn’t for you.
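
As a sketch of what that tradeoff looks like in practice, the DataStax Python driver for Cassandra lets you pick a consistency level per query; the keyspace, table, and column names below are invented for illustration, not from the talk.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("tweets_ks")  # invented keyspace

# Fast path: a single replica answers, so a read may briefly be stale.
fast_read = SimpleStatement(
    "SELECT body FROM tweets WHERE user_id = %s LIMIT 20",
    consistency_level=ConsistencyLevel.ONE,
)

# Stricter path: a quorum of replicas must agree, trading latency for freshness.
quorum_read = SimpleStatement(
    "SELECT body FROM tweets WHERE user_id = %s LIMIT 20",
    consistency_level=ConsistencyLevel.QUORUM,
)

rows = session.execute(fast_read, (1234,))
```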

Did anyone consider any of the top-ten databases, like, say, Oracle?

Twitter: we strongly prefer open source systems. As we scale, we like to be able to peek under the hood and see what is going on.

Facebook: we like open source, we like the way open source projects work together, and we like to be nimble; these proprietary systems are not so nimble.