The Reengineering of Facebook Messages

How do you completely redesign the software used by 750 million people—without hitting the pause button?

Upgrading any kind of software usually requires that its users stop using it, at least briefly, to enable the new software to replace the old and to transfer any stored information before users start working with the new version. We’re all familiar with messages from systems administrators reminding us that servers we’re using will be off-line for a couple of hours in the middle of the night for maintenance.

But when you’ve got three-quarters of a billion users around the world, there’s no "middle of the night." And in an era when people have come to expect e-mails and texts and tweets and posts to arrive within seconds of sending them, there’s little patience for pauses of any kind.

So when Facebook decided in 2009 to completely redesign its messaging system, the engineers on the project knew that the toughest part of the change was not going to be creating the new software but rather getting it out to users without interrupting their individual message flows in any way.

A little background: Since launching in 2004, Facebook has offered users ways to communicate publicly, albeit to a select audience (the Wall), and privately (messages). Back in 2009, the messaging function looked a lot like other Web-based e-mail systems: You entered a subject line and a message. Replies were threaded (that is, stacked under each individual subject), and your personal mailbox sorted everything by subject and date.

"The genesis," recalls Andrew Bosworth, director of engineering, "was a realization from Zuck [Facebook CEO Mark Zuckerberg] that smaller, real-time, more-contextual messages were just taking over communication. E-mail messages were increasingly seen as too slow—not technological slowness, but philosophical slowness—with a bit of formality to them."

From that realization, Bosworth says, came the idea of a product that combined different technologies, like e-mail and chat, with different devices, like phones and computers. The first thing to go was the subject line.

"Subject lines are a barrier," says Kenny Lau, a Facebook software engineer. Looking at existing Facebook messages, engineers noticed that 30 or 40 percent had no subject; another significant percentage just used "hey" as a subject. And personally, Lau had found it very stressful to fill in the subject line when he was dating online. "Do I put something witty there? Is it mean if I leave it blank? Sometimes I didn’t message people because I couldn’t think of something appropriate to put in the subject line."

So subject lines were out. Instead, messages are threaded by person. In the new Facebook messaging setup, if you start a new message to one of your contacts, all the messages you’ve ever sent that person pop up—even if the last one was a year ago or more. The Facebook engineers call this the "canonical thread." "I can look at all the communication I’ve ever had with my girlfriend in one thread," says Lau, "and see everything we’ve ever talked about." Of course, sometimes that might bring up discussions you’d just as soon forget, but "that’s the reality, and yes, it’s awesome, but it’s kind of scary in that your illusions of who you are may get confused with what’s actually there," he says.

Next up was tearing down the walls between messages (which had been, like e-mail, not conducted in real time), chat (which is live communication), and texts (which even Facebook users turn to when they are away from their computers or don’t want to burden the data plans on their smartphones). The new messaging system stores live chats in the same thread as messages that are sent when one of the users isn’t on Facebook, and any message turns into a chat if both users are online and have indicated that they’re available. Users can opt to have messages sent as texts to their cellphones when they’re not on Facebook and can reply via text as well. (A few months ago, Facebook added a messaging app for smartphones that works better than text for mobile communications—particularly group communications—but the SMS option is still there.)

Essentially, all e-mail became chat in its informality, and all chat and texts became e-mail in that they were no longer ephemeral.

That’s how the user experience changed. But for the engineers working on the project, the big change they would have to consider was how all these messages would be stored. Facebook was going to have to hang on to a lot more data—previously, chats weren’t saved—and be ready to retrieve it in an instant. Whenever you messaged anybody, you would instantly see all your past shared communications.

"We spent the second half of 2009 figuring out the storage system," recalls Karthik Ranganathan, an engineer on the project. They knew that "the storage system needed to take a lot of writes"; that is, users would be creating vast amounts of new data. In Facebook’s other popular communication tool, the Wall, users read more postings than they write; a personal messaging system is more balanced. And, says Ranganathan, the team knew "that many of the messages being read would be the most recent, but some would be completely random."

They decided that the messaging system would monitor when users are active on Facebook (not just logged in, because some people stay logged in all the time), predict what messages a user is likely to view (with most recent messages weighted more heavily than older messages), and pull those off disk storage and into a cache so they could be delivered quickly if needed. To make sure no one ever loses a message, each message is stored in each user’s account, and that’s replicated three times, so a one-on-one conversation has six copies. After much discussion, the engineers settled on a system called HBase, an open-source database written in Java that stores data on multiple machines.
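The copy count and the recency-weighted prefetch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Facebook's implementation: the function names, the exponential decay, and the seven-day half-life are all hypothetical choices made for the example.

```python
import heapq

REPLICATION_FACTOR = 3  # from the article: each stored copy is replicated three times


def total_copies(participants, replication=REPLICATION_FACTOR):
    """Each participant's account holds the message, and each account copy is replicated."""
    return participants * replication


def prefetch_candidates(threads, now, cache_slots, half_life_days=7.0):
    """Pick the threads to pull from disk into the cache, favoring recent ones.

    `threads` maps a thread id to the time (in days) of its last message.
    The exponential decay and the half-life are assumptions for this sketch.
    """
    def score(last_message_time):
        age = max(0.0, now - last_message_time)
        return 0.5 ** (age / half_life_days)

    # Highest-scoring (most recent) threads first, limited to the cache size.
    return heapq.nlargest(cache_slots, threads, key=lambda t: score(threads[t]))
```

With two participants, `total_copies(2)` gives the six copies mentioned above. A real predictor would presumably use more signals than recency alone, since some reads are, as Ranganathan puts it, "completely random."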

"We spent three months investigating storage systems," Bosworth says. "And maybe because we picked well—or maybe because it didn’t matter that much—we haven’t had a problem with it. In developing software, the thing you worry most about tends to go well because you’re focused on it. It’s the things you aren’t worried about enough" that cause problems, he says.

In fact, he says, the engineers actually overbuilt the storage system, developing a technology they called Atlas, which figures out where to send a user’s data among clusters of machines running HBase. (Storing data on multiple clusters enables systems managers to fix problems or perform maintenance without turning off access for all users.) But the engineers overloaded Atlas with other features, so they ended up turning it off because it made the system more complex. Says Bosworth: "A source of complexity is a source of bugs." The effort wasn’t a total waste, though. They’ll likely have to bring Atlas back into the system in a few years, Bosworth says, when the number of data centers increases.

After figuring out how to store the messages, the engineers turned to the problem of spam, a bane of e-mail services. While traditional spam filters look mostly at message content, the spam filters built into Facebook messages also pay a lot of attention to who the message senders are. Messages from your friends and friends of friends bypass the spam filters and go directly into your in-box, unless you’ve changed the default or previously moved messages from that person out of your in-box; messages from people you aren’t connected to through a friend, along with announcements from organizations or businesses, go into a folder called "other." Messages with spamlike content and no friend-of-friend connection go into a separate spam file, the link to which is tucked away at the bottom of the "other" mailbox and requires scrolling past every message in that mailbox to be seen.
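The routing rules in the previous paragraph amount to a small decision function. The sketch below is a hedged reading of those rules, not Facebook's actual filter: the function and parameter names are hypothetical, the user's option to change the default is left out, and sending a connected sender's mail to "other" after the user has moved that sender's messages out of the in-box is an assumption, since the article does not say where such messages land.

```python
def route_message(degrees_of_separation, spamlike,
                  from_organization=False, moved_out_before=False):
    """Return the folder for an incoming message: "inbox", "other", or "spam".

    degrees_of_separation: 1 for a friend, 2 for a friend of a friend,
    None when there is no such connection (hypothetical encoding).
    """
    connected = degrees_of_separation is not None and degrees_of_separation <= 2

    # Friends and friends of friends bypass the content-based spam filter.
    if connected and not from_organization and not moved_out_before:
        return "inbox"

    # Spamlike content with no friend-of-friend connection goes to spam.
    if not connected and spamlike:
        return "spam"

    # Everything else: strangers, organizations, businesses.
    return "other"
```

Note that the sender's identity is checked before the content: a message from a friend reaches the in-box even if its text looks spamlike, which is the main difference from a purely content-based filter.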

The biggest problem facing the engineers was the old Facebook messaging system, with its 750 million users sending 7 billion messages a day and about seven years of messages stored in a variety of formats. "We had to morph all that data into data that would fit in with this new system," says Ranganathan, "and then actually move the data, because we were going to be storing it on a different set of servers. And we had to do all that while people were sending messages, making sure we didn’t drop any messages."

"That was one of the biggest engineering challenges I’ve ever faced," says Lau. "If you lose the message that was sent 3 minutes before the upgrade, that’s the worst, because it’s probably the one you care most about."

The solution, Ranganathan says, was for the migration software to briefly send messages to two places—the old data store and the new one. While a user was being moved, the new data store held messages without sorting them, while the user continued to use the old data store. When all of a user’s data completed the transfer from the old servers to the new, the software took the brakes off and started sorting the messages into threads and displaying them to the user. The user never spotted a pause.
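The dual-write approach Ranganathan describes can be sketched as a small in-memory model. Everything here is an assumption made for illustration: the class and method names are hypothetical, message ids are assumed to increase over time, and plain Python lists and dictionaries stand in for the old and new data stores.

```python
from collections import defaultdict


class MailboxMigration:
    """Toy model of moving one user's mailbox without dropping messages."""

    def __init__(self, old_store):
        self.old_store = old_store        # list of (msg_id, peer, text); served during the move
        self.new_store = {}               # msg_id -> (peer, text); held unsorted during the move
        self.threads = defaultdict(list)  # peer -> [msg_id]; built once the copy completes
        self.done = False

    def deliver(self, msg_id, peer, text):
        if self.done:
            # After the move, only the new system receives and threads messages.
            self.new_store[msg_id] = (peer, text)
            self.threads[peer].append(msg_id)
            return
        # During the move, write to both places so nothing is lost.
        self.old_store.append((msg_id, peer, text))
        self.new_store[msg_id] = (peer, text)

    def copy_batch(self, start, size):
        """Copy a slice of history; ids already dual-written are left alone."""
        for msg_id, peer, text in self.old_store[start:start + size]:
            self.new_store.setdefault(msg_id, (peer, text))
        return min(start + size, len(self.old_store))

    def finish(self):
        """Take the brakes off: sort held messages into per-person threads."""
        for msg_id in sorted(self.new_store):
            peer, _ = self.new_store[msg_id]
            self.threads[peer].append(msg_id)
        self.done = True

    def visible_messages(self):
        if self.done:
            return sorted(self.new_store)
        return [msg_id for msg_id, _, _ in self.old_store]
```

A message delivered halfway through `copy_batch` calls lands in both stores, so the user sees it immediately on the old system and it is still present, exactly once, when `finish` builds the threads.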

Most users were on the new system three months after the official November 2010 launch; 95 percent had moved after six months; today, roughly a year later, all users are on the new system.

While the vast majority of users didn’t notice their moving days, behind the scenes the engineers weren’t quite so calm. They moved over the first million or so users and let them invite their friends to join them on the new system. Then they looked at the server usage to see if their estimates of how many servers this new messaging system would require were correct. And then they got a little nervous.

Says Ranganathan, "If we extrapolated this, it looked like we’d probably need a hundred times more machines than we had actually ordered."

What was going on? Had they made a mistake?

The engineers debated intensely for a couple of days, Ranganathan recalls, before zeroing in on what was causing the discrepancy. The first million users migrated were particularly active Facebook users, chosen because the developers figured these users would appreciate the new software the most. The friends they invited were also likely to be particularly active users.

The engineers changed their rollout strategy, adding more randomness in the user selection, and breathed a sigh of relief when it looked like their projected storage requirements were going to hold.
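A toy calculation shows why extrapolating from the most active users inflates a capacity estimate. All numbers below are made up for illustration (the population mix, the per-user message counts, and the per-machine capacity), and the function name is hypothetical; only the shape of the error reflects the story above.

```python
import math
import random


def estimate_machines(sample, total_users, capacity_per_machine):
    """Extrapolate machine count from the average load seen in a sample of users."""
    average_load = sum(sample) / len(sample)
    return math.ceil(average_load * total_users / capacity_per_machine)


# Hypothetical population: 1 percent heavy users, 99 percent light users.
population = [1000] * 10 + [10] * 990  # messages per user per day (made up)
CAPACITY = 5000                        # messages per machine per day (made up)

# Migrating the most active users first, as in the initial rollout.
early_adopters = sorted(population, reverse=True)[:10]

# Adding randomness to the selection, as in the revised rollout.
random_rollout = random.Random(0).sample(population, 100)

biased = estimate_machines(early_adopters, len(population), CAPACITY)
fair = estimate_machines(random_rollout, len(population), CAPACITY)
actual = estimate_machines(population, len(population), CAPACITY)
```

In this made-up population the estimate from the early adopters is far larger than the machines actually needed, while the random sample lands much closer to the true figure.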

The migration went along smoothly for a while after that. Then, after about half of Facebook’s then 600 million-plus users had been moved over, the engineers realized they had another problem—new Facebook users were initially going on the old messaging system, joining the migration queue. "We hadn’t thought about the new users," Ranganathan says. "We knew we had 500 or 600 million people to move, not [thinking] about who joins every day. But then we realized that we had a boat with a hole in it, and we’re trying to bail out the water, but the water is going to keep coming until you plug the hole."

Once the hole was plugged and new users started on the new system, the migration went fairly smoothly. When a user’s messages didn’t migrate successfully, the user didn’t notice; the messages just stayed on the old system until the engineers figured out what the problem was and fixed it.

This summer, Facebook added a messaging app for mobile phones, further blurring the line between text and chat and e-mail. The biggest trick to that, Bosworth says, was keeping a connection open between the phone and Facebook so messages could be sent seamlessly, without gobbling up the users’ data allotments or draining phone batteries. The solution ended up being a technology that minimizes the amount of data transmitted, called MQTT (MQ Telemetry Transport). MQTT is an open protocol developed by IBM that was originally designed for telemetry over low-bandwidth links such as satellite connections.
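To give a sense of how little data MQTT needs to keep a connection alive, the sketch below compares the protocol's two-byte keepalive packet (PINGREQ) with a minimal HTTP polling request. This does not describe how Facebook's app uses MQTT. The host name, path, and 60-second interval are hypothetical, and the byte counts cover only the application payload, not TCP/IP overhead.

```python
# MQTT keepalive packets consist of a two-byte fixed header and nothing else.
MQTT_PINGREQ = bytes([0xC0, 0x00])   # packet type 12, remaining length 0
MQTT_PINGRESP = bytes([0xD0, 0x00])  # packet type 13, remaining length 0


def http_poll_request(host, path):
    """Build a minimal HTTP/1.1 GET request, for size comparison only."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")


def bytes_per_day(packet_size, interval_seconds):
    """Application-level bytes sent per day at one packet per interval."""
    return packet_size * (86400 // interval_seconds)


mqtt_daily = bytes_per_day(len(MQTT_PINGREQ), 60)
http_daily = bytes_per_day(len(http_poll_request("example.com", "/poll")), 60)
```

Even this stripped-down HTTP request is many times larger than the MQTT ping, and real polling requests usually carry cookies and other headers on top of that.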

Today, more than 8 billion private messages fly through Facebook Messages daily. Will this kind of messaging platform kill e-mail? Bosworth says no. "The post office is still in service. E-mail won’t go away." But, he says, the future of everyday communications will look a lot more like Facebook Messages.