I've done some XMPP development, so when I read Facebook was making a Jabber chat client I was really curious how they would make it work. While core XMPP is straightforward, a number of protocol extensions like discovery, forms, chat states, pubsub, multi-user chat, and privacy lists really up the implementation complexity. Some real engineering challenges were involved to make this puppy scale and perform. It's not clear what extensions they've implemented, but a blog entry by Facebook's Eugene Letuchy hits some of the architectural challenges they faced and how they overcame them.

A web-based Jabber client poses a few problems because XMPP, like most IM protocols, is an asynchronous, event-driven system that pretty much assumes you have a full-time open connection. After login the server sends the client roster and presence information, and your client has to be connected to receive it. If your client wants to discover the capabilities of another client, a request is sent over the wire and some time later the response comes back. An ID is used to map the reply back to the request, because all responses are intermingled: IM messages can come in at any time, and subscription requests can come in at any time.
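The request/reply correlation described above can be sketched as a simple pending-request table keyed by stanza ID. This is a hypothetical illustration of the idea, not Facebook's or any XMPP library's actual code:

```python
import itertools

class IqTracker:
    """Sketch: correlate asynchronous XMPP-style replies to their
    requests via the stanza 'id' attribute, whatever order they arrive in."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # id -> the outstanding request

    def send(self, request):
        # Tag the outgoing request with a fresh id and remember it.
        req_id = str(next(self._ids))
        self._pending[req_id] = request
        return req_id

    def receive(self, reply_id):
        # Intermingled replies are matched back by id.
        return self._pending.pop(reply_id, None)
```

So even if the capability-discovery reply arrives after three IMs and a presence update, the ID alone is enough to route it back to the code that asked.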

Facebook has the client open a persistent connection to the IM server and uses long polling to send requests and continually get data from the server. Long polling is a mixture of client pull and server push. It works by having the client make a request to the server. The client connection blocks until the server has data to return. When it does, the data is returned, the client processes it, and then is in position to make another request of the server and pick up any data that has queued up in the meantime. Obviously there are all sorts of latency, overhead, and resource issues with this approach. The previous link discusses them in more detail, and for performance information take a look at Performance Testing of Data Delivery Techniques for AJAX Applications by Engin Bozdag, Ali Mesbah and Arie van Deursen.
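The long-polling loop reduces to a few lines. In this minimal sketch (all names hypothetical) a blocking queue stands in for the HTTP request that the server holds open until it has data:

```python
import queue

def long_poll_loop(server_queue, handle, polls=3, timeout=5.0):
    """Minimal long-polling sketch: each request blocks until the
    server has data, processes the batch, then immediately re-polls."""
    for _ in range(polls):
        try:
            # Stands in for an HTTP request held open by the server.
            batch = server_queue.get(timeout=timeout)
        except queue.Empty:
            continue  # poll timed out with no data; just poll again
        for event in batch:
            handle(event)
```

The key property is that the next poll is issued immediately after processing, so the server almost always has an open request waiting when new data arrives.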

From a client perspective I think this approach is workable, but obviously not ideal. Your client's IMs, presence changes, subscription requests, chat states, and so on are all blocked on the polling loop, which doesn't have predictable latency. Predictable latency can be as important as raw performance.

The real scaling challenge is on the server side. With 70 million people, how do you keep all those persistent connections open? Well, when you read another $100 million was invested in Facebook for hardware you know why. That's one hella lot of connections. And consider all the data those IM servers must store up in between polling intervals. Looking at the memory consumption for their servers would be like watching someone breathe. Breathe in: streams of data come in and must be stored waiting for the polling loop. Breathe out: the polling loops hit and all the data is written to the client and released from the server. A ceaseless cycle. In a stream-based system data comes in and is pushed immediately out the connection. Only the socket queue is used, and that's usually quite sufficient. Now add network bandwidth for all the XMPP and TCP protocol overhead and CPU to process it all and you are talking some serious scalability issues.
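The breathe-in/breathe-out cycle amounts to a per-user buffer that fills between polls and empties when the poll lands. A hypothetical sketch (not Facebook's actual code):

```python
from collections import defaultdict

class PollBuffer:
    """Sketch of the 'breathe in / breathe out' cycle: events queue up
    per user between polls and are flushed when the long poll arrives."""

    def __init__(self):
        self._pending = defaultdict(list)

    def push(self, user, event):
        # Breathe in: data arrives and must be held until the next poll.
        self._pending[user].append(event)

    def drain(self, user):
        # Breathe out: the long poll hits; everything queued is written
        # to the client and released from server memory.
        return self._pending.pop(user, [])
```

Multiply that buffer by millions of users and the memory pressure the article describes becomes obvious; a stream-based system would never hold this state at all.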

So, how do you handle all those concurrent connections? They chose Erlang. When you first hear Erlang and Jabber you think ejabberd, an open source Erlang-based XMPP server. But since the blog doesn't mention ejabberd, it seems they haven't used it.

Why Erlang? First, the famous Yaws vs Apache shootout where "Apache dies at about 4,000 parallel sessions. Yaws is still functioning at over 80,000 parallel connections." Erlang is naturally good at solving high concurrency problems. Yet following the rule that no benchmark can go unchallenged, Erik Onnen calls this the Worst Measurement Ever and has some good reasoning behind it.

In any case, Erlang does nicely match the problem space. Erlang's approach to a concurrency problem is to throw a very lightweight Erlang process at each state machine you want to be concurrent. Code-wise that's more natural than thread pools, async IO, or thread-per-connection systems. Until Linux 2.6 it wasn't even possible to schedule large numbers of threads on a single machine. And you are still devoting a lot of unnecessary stack space to each thread. Erlang will make excellent use of machine resources to handle all those connections, something anyone with a VPS knows is hard to do with Apache. Apache sucks up memory with joyous VPS-killing abandon.
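The process-per-state-machine shape can be loosely imitated in other runtimes; here is a rough Python asyncio analogue (Erlang processes are lighter still and preemptively scheduled, so this only illustrates the programming model, not the performance):

```python
import asyncio

async def connection(user, inbox):
    # Each connection gets its own cheap task holding its own state,
    # loosely analogous to one Erlang process per connection.
    msg = await inbox.get()
    return f"{user} got {msg}"

async def main(n):
    # Spawn one task per simulated connection.
    inboxes = [asyncio.Queue() for _ in range(n)]
    tasks = [asyncio.create_task(connection(f"user{i}", q))
             for i, q in enumerate(inboxes)]
    for i, q in enumerate(inboxes):
        q.put_nowait(f"msg{i}")
    return await asyncio.gather(*tasks)
```

Spawning thousands of these costs kilobytes each, versus the megabytes of stack a thread-per-connection design would burn.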

The blog says C++ is used to log IM messages. Erlang is famously excellent for its concurrency prowess and equally famous for being poor at IO, so I imagine C++ was needed for efficiency.

One of the downsides of multi-language development is reusing code across languages. Facebook created Thrift to tie together the Babeling Tower of all their different implementation languages. Thrift is a software framework for scalable cross-language services development. It combines a powerful software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, and Ruby. Another approach might be to cross language barriers using REST based services.

A problem Facebook probably doesn't have to worry about scaling is the XMPP roster (contact list). Handling that many user accounts would challenge most XMPP server vendors, but Facebook has that part already solved. They could concentrate on scaling the protocol across a bunch of shiny new servers without getting bogged down in database issues. Wouldn't that be nice :-) They can just load balance users across servers and scalability is solved horizontally, simply by adding more servers. Nice work.
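Load balancing users across servers, as described above, can be as simple as a stable hash of the user ID; this is a generic sketch of the idea, not a claim about how Facebook actually partitions users:

```python
import hashlib

def server_for(user_id, servers):
    """Pick a chat server for a user by hashing the user id.
    Deterministic, so the same user always lands on the same server."""
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Adding capacity then means adding servers to the list, though a simple modulo scheme reshuffles users when the list changes; consistent hashing is the usual refinement.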

Reader Comments (13)

No one said they use XMPP internally... they just announced they would be exposing an XMPP interface to the chat server... for all I've read about Facebook chat, it could be a completely home-brewed solution

70M users or not, what counts is the number of simultaneous users connected. Say it's 8M at peak time. Connections don't cost a thing. The only real limit is the number of IP ports you'll have for a given IP (16 bits). If you want to be conservative, you'll put 20K connections per machine (which you can push to 40K; 2/3 of your max). That's 400 machines and 50% extra capacity. That also means you have ~2TB RAM and a shitload of CPU. Even if each connection sends a message every 10 secs, that's only 2K qps / machine, and you can store a lot of state with 2TBs. You could also load these machines with more NICs, and handle more connections / box if space is an issue.
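The commenter's back-of-envelope numbers can be restated in code (all figures are the commenter's assumptions, not measurements; the per-machine RAM is inferred from the ~2TB total):

```python
# Back-of-envelope capacity check, per the comment above.
concurrent_users = 8_000_000        # assumed peak, not 70M total
conns_per_machine = 20_000          # conservative; pushable to 40K
machines = concurrent_users // conns_per_machine   # 400 machines
ram_gb_per_machine = 5              # inferred to reach ~2 TB total
total_ram_tb = machines * ram_gb_per_machine / 1024
qps_per_machine = conns_per_machine / 10  # one message per conn per 10s
```

With those assumptions the arithmetic holds: 400 machines, roughly 2 TB of aggregate RAM, and only ~2K qps per machine.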

I'd go low key and write a custom C++ server for that purpose. There's no point in reusing a generic HTTP server, as you're not serving generic traffic and don't need generic capabilities. You need something that's optimized for just holding those connections. Most of the time, they're idle from user traffic. Keep-alive & presence are going to create most of your BW & CPU usage. That server could either handle the chat logic directly, or simply multiplex user traffic to another pool of machines. I'd probably choose the multiplex approach, as it allows you to decouple connection handling from logic, and the two layers can be rolled out / updated independently. It'll cost more hardware, but you have $100m to spend.

I find it interesting that the Facebook devs chose to implement an epoll-based solution over a "pure" Erlang solution. As I stated in the post quoted above, I sincerely do believe that the Erlang scheduler is quite fantastic at spawning off multiple functions to do work in parallel. From what I've seen in my own testing, I'm not surprised they dropped down to a lower level for epoll scheduling though. Would be nice if they posted their port back to the community; I'd love to see how far they take the message handling at the C level before handing off to the Erlang layer.

The statement "Until Linux 2.6 it wasn't even possible to schedule large numbers of threads on a single machine" is false/misleading. Other operating systems like Solaris had no problem doing just that...

>> Erlang is famously excellent for its concurrency prowess and equally famous for being poor at IO

The ChatLogger is a separate service running on a dedicated cluster. I don't think the design was driven by Tim Bray's famous benchmark, where he triggered the rumour that Erlang sucks at IO by using a function designed for distribution-transparent user interaction to slurp log data from disk. In many ways, Erlang is great at IO - it's just that it's optimized for handling IO on thousands of open ports while maintaining decent real-time characteristics for all processes inside the VM.

The disk_log component in Erlang/OTP is actually quite good. The Mnesia DBMS uses it for streaming out table data to disk, and loading table data at restarts. I timed loading of a 12 GB table at 10 minutes once, which is about 20 MB/sec - not world-class streaming speeds, but on a normal disk, it's not terribly bad, especially since the work was done in the background and the system felt responsive the whole time.

Bottom line: Erlang's IO performance is impressive in many ways, but for a dedicated logging cluster, using C++ is surely a better idea.