Nuts & Bolts: Campfire loves Erlang.

A couple of years ago a lot of buzz started in the Ruby community about Erlang, a functional programming language developed by Ericsson originally for use in telecommunications systems. I was intrigued by the talk of fault tolerance and concurrency, two of the cornerstones that Erlang was built on, so I ordered the Programming Erlang book written by Joe Armstrong and published by the Pragmatic Programmers and spent a couple of weeks working through it.

A year later, Kevin Smith began producing his excellent Erlang in Practice screencast series in partnership with the Pragmatic Programmers. It’s amazing how much difference it made for me to be able to watch someone develop Erlang applications while talking through his thought process along the way.

As I was learning Erlang, I kept threatening to rewrite the poller service that handles updating Campfire chat rooms when someone speaks in a room. At some point my threats motivated Jamis, who was also playing with Erlang in his free time, to port our C-based polling service to Erlang. Jamis invited me to look at the code and I couldn’t help refactoring it within an inch of its life.

The code that Jamis wrote worked fine, but it was not very idiomatic Erlang. While I didn’t have much more experience developing Erlang code than Jamis, I had definitely seen more real Erlang code. I tried to pattern our work after what I had been exposed to, making improvements along the way. We ended up with 283 lines of pretty decent Erlang code.

For the curious, here’s a very simple example function from the real Campfire poller service. This function takes two arguments, the name of a parameter to search for, and the list of parameters. If it finds a matching parameter it returns the associated value, otherwise it returns the atom undefined. Atoms are like symbols if you’re a Ruby programmer.
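The original listing didn’t survive, but based on that description it would have looked something like the sketch below (module and function names are assumed):

```erlang
-module(params).
-export([parameter/2]).

%% Look up Name in a list of {Name, Value} tuples.
%% Returns the associated value, or the atom 'undefined' when the
%% parameter is absent, which pattern matches cleanly at call sites.
parameter(Name, Parameters) ->
    case lists:keysearch(Name, 1, Parameters) of
        {value, {Name, Value}} -> Value;
        false -> undefined
    end.
```

Calling `params:parameter("room_id", [{"room_id", "42"}])` would return `"42"`, while a missing key returns `undefined`.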

Last Friday we rolled out the Erlang-based poller service into production. There are three virtual instances running a total of three Erlang processes. Since Friday, those three processes have returned more than 240 million HTTP responses to Campfire users, averaging 1200-1500 requests per second at peak times. The average response time is hovering around 2.8ms from the time the request reaches the Erlang process to the time we’ve performed the necessary MySQL queries and returned a response to our proxy servers. We don’t have any numbers to compare this with the C program it replaced, but it’s safe to say the Erlang poller is pretty fast. It’s also much easier to manage three Erlang processes than the 240 processes our C poller required.

Erlang definitely isn’t a replacement for Rails, but it is a fantastic addition to our collective toolbox for problems that Rails wasn’t designed to address. It’s always easier to work with the grain than against it, and adding more tools makes that more likely.

Great writeup. Just posted a link from Erlang Inside. It’s exactly this type of application that made me, as a Rails developer, take a look at Erlang, and exactly why I always argue it’s a great complement to Rails applications.

The idea that you could write such a powerful and robust service in 283 lines of code is really jaw-dropping. Erlang is to DRY backend services as Rails is to Web Apps.

MI

on 14 May 09

Matt, the point was that Erlang as a whole, regardless of the framework, is a much better choice than Ruby for the type of high-concurrency, low-latency applications like the Campfire poller. At the same time, Ruby and Rails are much more expressive for implementing business logic than any of the Erlang frameworks I’ve looked at.

What is the purpose of that parameter function? The only things it adds over lists:keysearch are a hard-coded 1 (which is unlikely to change) and returning “undefined” instead of “false”. It seems like it would be even better just to replace all calls of that function with a direct call to lists:keysearch and use “false” instead of “undefined”.

Also, to build on Chad’s point: indeed, if you are playing directly to Erlang’s strengths of routing, there is almost certainly no other language that will let you write incredibly robust, scalable, clusterable, simple routing code in a short period of time the way Erlang will. Another language might give you two or maybe three from that list, but not all five. Those are problems where it’s easier to start from scratch and learn Erlang than try to learn some other crap library in some other language. Just think of Erlang as a really big DSL…

MI

on 14 May 09

Chad, I absolutely agree. The mind boggling part for me is that the code could easily be 25%+ shorter than it is already—that’s how much of the code is devoted to logging, most of which is disabled in normal operation.

Joe, I actually wrote it as a Mongrel handler at one point before Rack existed and it worked reasonably well, but it wasn’t compelling enough to switch away from the C version that we had in production at the time.

Do you have one Erlang process per server? Or all 3 on the same server?

I assume there’s a webserver in front of the Erlang processes?

MI

on 14 May 09

Jeremy, the main purpose is just to be more intention-revealing than a bare call to lists:keysearch would have been if it had been duplicated. I had a hard time finding a function that I could paste in isolation to give an idea of what Erlang code looks like. Most of it relies on the context of the application in order to make sense, and that was the best I could come up with.

Is the Erlang process also responsible for generating the JavaScript snippet that the polling server returns?

MI

on 14 May 09

Joe, I honestly don’t recall what the performance of the Mongrel handler looked like. It’s been quite some time since I played with it.

As far as the deployment environment is concerned, all requests come into Apache and then we either return a static file or route the request through HAProxy to Rails or the poll service, based on the URI.

The three Erlang processes are running on three separate virtual machines. I tested it with a single virtual machine and it was able to handle the peak production load, but not with as much cushion as I’d like, so I decided to go with three instances so we’d have more excess capacity to absorb a failure. That’s the system administrator in me at work.

We generate the JavaScript content when the message is posted to the room so that we only have to do it once, rather than N times, where N is the number of people chatting in the room.

We use Erlang for some of our backend services at iWantMyName, in combination with a Catalyst-based frontend (no flame wars). It started out as a proof of concept with a small interface for our iPhone application but grew into more and more services. It is simply amazing how hot code swapping just works, and how you can actually look into your running nodes without turning on debug mode or drowning in logfiles. It was a real eye opener for me and we are looking into more Erlang-based projects at the moment.

I’m doing 100% Erlang consulting at present and there seems to be a real uptick in interest. As Mark’s post illustrates, Erlang is an excellent fit - better than almost any other language - for back end systems.

37Signals uses Erlang! With Armstrong’s book, and Kevin Smith’s screencasts, I have really fallen in love with the language for back end systems.

Well, all but one aspect: deployment. Deploying my Erlang applications has been its own level of hell, and I got to the point where I built my own deployment system using bits and pieces from OTP (sorry, but OTP’s required directory structure for “standard” OTP systems is insane).

How are you guys deploying this stuff? And @Kevin, you’ve talked about getting a book together about Erlang deployment, has that made progress?

This is a great writeup. Why did this turn out to be more efficient than the C version? I think there are a lot of us out here that could learn from having this specific comparison addressed. There’s the general idea that C should be faster than “anything” given you had already put in the development cost.

matt

on 14 May 09

@Jason Roelofs:
For deployment, I’ve found that using a systems management tool (like Chef, etc.) tends to work fine. For more rapid application changes I’ve also used Capistrano in the past (obviously after having used Chef to lay down Erlang, certain libs, etc.).

We have a similar requirement at Zenbe.com for our collaboration product ShareFlow.

We currently use a simple Sinatra application to handle our polling requests. We are considering switching to an XMPP-based push service in the future (possibly based on ejabberd/strophejs); however, for now the Ruby Sinatra service is working well for us.

It’s interesting to read about how your architecture evolved. Thanks!

MI

on 14 May 09

Jason Roelofs, matt: Our needs are pretty simple for deployment. I built a very simple Chef recipe to configure all of the dependencies and create a system init script to start, stop, or attach to the service. I use Capistrano to deploy the actual code to the servers.

Thanks for the post, Mark. When I first heard about Erlang I wondered why I would want to use it and its strange syntax when I have Ruby. I think you do an excellent job of showing how using the best tool for the job can make a big difference.

Praveen

on 14 May 09

I’ve been looking at Erlang lately, and while concurrency and fault tolerance seem very appealing, I’ve been wondering about database connectivity. I understand there are no ActiveRecord or DataMapper-like goodies, but is ODBC the only interface (to MySQL) available? Any advice on how you guys did it? Thanks.

“At the same time, Ruby and Rails are much more expressive for implementing business logic than any of the Erlang frameworks I’ve looked at.”

Mark, I have done some prototyping with Rails and I also created ErlyWeb, and I’m not sure on what you base this assertion. My impression is that ErlyWeb and Rails require roughly equivalent effort to build web apps. ErlyWeb makes it much easier to build scalable real-time/comet apps. With LFE (Lisp Flavoured Erlang), ErlyWeb is arguably much more expressive than Rails because it gives you the power of code generation using lisp macros.

I assume that’s 240 Ruby processes, and you needed 240 processes because Ruby can’t do anything in parallel. Why wouldn’t you just try JRuby? It could have been one process. Why do people automatically jump off Ruby when the C implementations can’t solve their problem? It baffles me.

MI

on 14 May 09

Charlie: No, that was actually 240 C processes. We had a FastCGI based C poller that we used prior to the Erlang work.

We actually do use JRuby for some internal systems tasks where we need to parallelize and it works great, but it just doesn’t feel like as natural a fit for a job like this as a language like Erlang with baked in concurrency is. JRuby is definitely another tool in our toolboxes though.

Andrew Banks

on 14 May 09

Also written in Erlang is CouchDB, which looks very interesting for web applications.

MI: Ahh thank you, that makes more sense then. Making a move to JRuby in that case would probably not have been any better than Erlang. Though it probably would have been interesting to implement :)

Shane Adams

on 15 May 09

Wow, Erlang does sound really impressive. One thing I don’t understand is why you would need 240 C processes to handle 2000 req/sec. Not sure I followed what the Erlang implementation was replacing. If it was 240 C processes, it sounds like there was a problem in the original implementation.

I’m not saying it should be in C or that C is better, but C certainly should not require 240 processes, if I am reading you right.

MI

on 15 May 09

Shane: It required that many processes because it was a very small FastCGI program. We certainly could have written a much more involved C implementation that made use of threads to provide more concurrency in fewer operating system processes, but it would have been a lot more complicated code. In Erlang, the concurrent parts almost just fall together.
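To make "fall together" concrete, here is a toy sketch (illustrative only, not the Campfire code) of the process-per-connection pattern: each accepted socket gets its own lightweight Erlang process with a single spawn call.

```erlang
-module(toy_poller).
-export([start/1]).

%% Listen on Port and hand each incoming connection to its own
%% lightweight Erlang process. The concurrency is the one spawn line.
start(Port) ->
    {ok, Listen} = gen_tcp:listen(Port, [binary, {active, false}, {reuseaddr, true}]),
    accept_loop(Listen).

accept_loop(Listen) ->
    {ok, Socket} = gen_tcp:accept(Listen),
    Pid = spawn(fun() -> handle(Socket) end),
    ok = gen_tcp:controlling_process(Socket, Pid),
    accept_loop(Listen).

%% Read the request, send a minimal HTTP response, and exit;
%% the process dies naturally when the connection is done.
handle(Socket) ->
    {ok, _Request} = gen_tcp:recv(Socket, 0),
    gen_tcp:send(Socket, <<"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok">>),
    gen_tcp:close(Socket).
```

A threaded C equivalent would need explicit thread management, locking, and error handling; here, an unhandled error simply kills that one connection’s process.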

Shane: I am wondering along with you. Why were so many C processes needed, especially since state is seemingly maintained at the database level?

Another question: did all 240 processes live in the same virtual instance? And how many processor cores did they have access to? Because that number seems too big and would most likely introduce sizable context-switching overhead, keeping in mind that the I/O is very quick anyway.

Final question: how much of that 2.8ms is spent waiting on the DB?