I’m co-founder of a startup called MinoMonsters, we make an iOS game by the same name. We went through YCombinator in the Winter 2011 batch. On December 18th, the eve of our big launch, I nearly fucked everything up. This is the story of what happened.

MinoMonsters is a monster battling game. Think of it as Pokémon for the iPhone. As is the case with many apps, our game is internet connected. During the course of play, the app talks to a backed server. The server diligently performs tasks like finding your friends, letting you purchase in-game items, and letting you play with others. When planning our server, we needed a solution that was quick to build, had fast performance, and was light over the wire. We chose to build the server in Ruby with EventMachine and Synchrony speaking in protocol buffers over TCP.

Since our soft launch on December 6th, the server has been chugging away without fault. But then the problems began to emerge. It all started with intermittent crashes. I would wake up to find that the running EventMachine process had simply stop running. The approach to understanding this problem is two-fold: recreate the crash to understand the “how” and capture a log to understand the “why”. Since the crashes were intermittent, I had no way of predictably crashing the server. So instead I started with logging.

Then I methodically went through each error and devised a solution. It wouldn’t solve my problem, but it would narrow down the options. Over two weeks, I casually “fixed” more crashes as they cropped up and eventually they stopped.

Then the day came. We had planned to set the price of our game to free. As soon as we did, users started flooding in. And then the problems started anew.

As before, the server was crashing. I couldn’t recreate the crash, but I didn’t need to. At first, it was crashing every hour. Then it was every 30 minutes. Soon enough, the server crashed within seconds of starting up. Logs became increasingly useless, as the places in code that threw exceptions were random at best. Something was falling apart and I couldn’t figure out what it was. Without a backend, we couldn’t verify virtual goods purchases. And without virtual good purchases, we would make no money off the crazy initial traction we received. This was bad news.

I spent four hours trying everything I knew to solve the problem. I scoured the web for obscure bugs. I completely ripped out Synchrony. I triple checked OS resources like file descriptors, disk space, and others. I was out of ideas. That’s when I sent out a distress beacon on Hacker News.

Within minutes I had dozens of people helping me get connected with the right Rubyists. In the end, two people helped me out: Ryan Stout (@ryanstout) and Aman Gupta (@tmm1). With Ryan, we immediately started with generic diagnosis steps. What does the server do? What systems does it touch? We systematically went through each touchpoint and observed it’s potential contribution to the problem. This helped us eliminate any external factors (OS, Ruby, Library) as the culprit. Then we got Aman involved. As one of the maintainers of EventMachine, he was able to quickly start looking at my configuration. Within minutes, he noticed something odd in my setup code:

Aman saw the issue right away. EventMachine.epoll is a configuration method that must be called before the run loop. From the documentation:

Call the method EventMachine#epoll anytime before you call EventMachine#run, and your program will automatically use epoll, if available

It turns out that this basic setup code that I copied from a 3-year-old blog post was incorrectly implemented. If I had been more careful about reading the documentation, it’s possible I might’ve caught this issue myself. The fix is easy:

I made that change, deployed, and crossed my fingers. Within minutes we knew that it worked. Users were connecting en masse. Thousands of monster battles started to ensue. Virtual goods transactions began flowing. All was right with the universe once more.

Let’s look at what happened:

EventMachine uses Ruby’s select by default for network I/O. This is a sane default

Ruby’s select performance degrades as the number of persistant connections goes up. It has an upper limit of 1024 connections.

I was aware of both EventMachine’s default behavior and the limitations of select.

Before the game launched I looked at the code to make sure I was making the configuration call to switch to epoll. It was there.

I did not confirm that the configuration call was working. I never inspected the running state of the app and double-checked that it was using epoll instead of select.

Ultimately, I failed to ensure that a critical scaling component was activated before going live. Since being fixed, the rest of the system has run smoothly, resulting in thousands of concurrents and millions of monster battles.

Huge thanks to Aman Gupta, Ryan Stout, and the supportive Hacker News community for helping me come to a solution quickly. With you all, I was able to take a multi-day problem and get it fixed in a matter of hours. Thanks.