Communiqués from the Green Felt developers

June 2017

As you probably noticed, for the past 2 months things have not been running smoothly. The server that Green Felt runs on is *really* slow and I think the disks are going bad. Interestingly, this didn't cause things to die outright; instead it made everything run slower than normal. At some point it crossed a threshold and couldn't handle the amount of traffic the site gets.

We’ve been screwing around with carefully constructing the replacement server (mentioned in the last post), and last weekend Jim and I made a final push to get the main database running there. We knew it would take a number of contiguous hours of work to get it going, and that’s why it took 2 months—it’s hard to get both our schedules aligned when we’re both busy with work and non-work life. But in the end, we got it going. I was hoping that normal spinning disks would be enough, but our test run of the DB at about 3:00pm Pacific time had the exact same symptoms—constant “502” errors (which get reported as the infamous “backend server not available” error). The new server is currently living in my house, so I went off to Fry’s to buy a giant SSD to hold the database (well, technically just part of the database). After that got installed we powered it up and things started looking good again. It’s now been 2 days and things are still looking pretty good.

I like graphs so here are some pretty graphs that show the difference between good and bad. This first one is number of games played per hour:

Notice how spiky and horrible it looks on the left compared to the right. The big gap is when we were actively working on the DB last Saturday. Usually when we do that there is a message about "maintenance mode", but for some reason we haven't investigated yet, that failed, and so none of the games got saved during that time :-(.

Here’s another graph. This one shows the amount of CPU the server uses (basically how hard it is working):

Notice that the CPU is pegged at 400% (it's greater than 100% because there are 4 CPUs). When we stopped the database on that server the CPU pretty much instantly dropped and the server became more responsive. Even more important than the CPU is the white line, which represents "load". It basically says how many individual programs were trying to run at the same time. If it's more than the number of CPUs then things have to wait (there's technically more to it than that, but that's close enough). Notice that during the bad period on the left-hand side it's up at 75 to 100! That meant everything was super slow and barely anything was getting a chance to finish.
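If you want to apply that same rule of thumb on your own machine, here's a minimal sketch (not anything we actually run on the server) that compares the 1-minute load average to the number of CPUs, the way the graph's white line is read above. It uses the standard library only and assumes a Unix-like system, since `os.getloadavg` isn't available on Windows:

```python
import os

# Unix-only: (1-min, 5-min, 15-min) load averages, as in `uptime`.
load_1min, load_5min, load_15min = os.getloadavg()
cpus = os.cpu_count()

# Rule of thumb: when load exceeds the CPU count, runnable programs
# are queuing up and everything has to wait its turn.
if load_1min > cpus:
    print(f"Overloaded: load {load_1min:.1f} > {cpus} CPUs")
else:
    print(f"OK: load {load_1min:.1f} with {cpus} CPUs")
```

On the bad days in the graph that check would have reported a load of 75 to 100 against only 4 CPUs, which is exactly why nothing could finish.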

The last one is arguably the most important one:

This one shows web server errors. Remember, the 5xx errors (and 502s in particular) are the ones that cause the "backend server not available" message. At the peak of the bad days we were returning 100 errors (the red line) per second! Ouch. Notice they dropped to zero after the Saturday work. This lets us know that things are generally working. I'm not sure what's causing those spikes you see in the last 2 days. They look kinda bad at this scale, but when you zoom in they are just a handful of really, really short spikes (less than a minute each). It's certainly not the constant horribleness represented on the left-hand side of the graph.

Forum Stuff

One thing you may not know is that every forum post gets emailed to both Jim and me. We use the posts/emails to gauge the health of the site and appreciate people telling us when things don't work (less so the demands to fix things, although I do understand the frustration of trying to use something that isn't working and is only fixable by someone else—really, I do). Even when we don't respond we're at least keeping up.

That said, this 2 month period was overwhelming in terms of bug reports. On Saturday I had 3000 unread emails in my Green Felt folder. 1000 of those were automated emails for various errors happening. Jim had similar inbox issues. We knew, of course, that things were not working well (the graphs show in plain detail all the errors that were happening), but we got overwhelmed and got really backed up on emails. So if you wrote something on the forum in the last couple months that wasn’t related to the server being on the fritz, you might want to send it again.

Also a special shout out to Sage, who was patiently responding to people even when we weren’t.

Donations

There have been a lot of people offering to donate money for the hardware and upkeep. That kind of generosity from our users warms my heart, but I'm not sure taking donations is the right thing to do. I haven't spoken with Jim about this yet, so this is all just my own opinion. Donations are nice, but I'd rather offer something more tangible in exchange for money, even if it's just some sort of "premium membership" thing (I have no idea what that would entail). I know some people use Patreon for this kind of thing, but most of the people I've heard of that get good money from it are YouTube people with very large followings. I'm not sure we're big enough for that.

I know I whined about the cost of the SSD in my forum post, but that was mostly just because, at $800, I think it's the most expensive single disk I've ever bought. But the fact is, neither Jim nor I are strapped for cash right now, and the amount we've sunk into Green Felt is surprisingly little when you consider it's been running for over 10 years.

What’s Left?

We still need to convert the rest of Green Felt over to the new server. Then we need to move it from my house into our hosting facility. This is requiring us to learn new stuff—we're basing our new server on NixOS, which is new to both of us. That makes things go slower, since we're adding a fairly big learning curve on top of everything. Still, I think it's the right thing to do and it should give us a more reliable, more portable environment (it would be nice to be able to run it on Amazon's or Google's cloud without a whole bunch of work).
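For those curious what makes NixOS appealing for this, the whole machine is described in one declarative config file, which is where the portability comes from. Here's a tiny illustrative sketch (the service choices and hostname are made up for the example, not our actual configuration):

```nix
# /etc/nixos/configuration.nix -- illustrative sketch only
{ config, pkgs, ... }:
{
  # Everything the machine runs is declared here; `nixos-rebuild`
  # makes the system match this file, so recreating the server
  # elsewhere (a hosting facility, a cloud VM) is mostly just
  # applying the same config on new hardware.
  services.nginx.enable = true;       # hypothetical web server
  services.postgresql.enable = true;  # hypothetical database
  networking.hostName = "greenfelt";  # hypothetical name
}
```

The appeal is that the config is the documentation: instead of a hand-built server whose setup lives in someone's memory, the file itself says exactly what's installed and enabled.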