Communiqués from the Green Felt developers


We’ve had some bad downtime this week. On Wednesday (2018-05-30) something got wedged and games stopped being recorded. Jim and I were both busy and not paying close attention and so I didn’t notice till Thursday, when I happened to pull up the page during some downtime while my hair was being cut. I ended up remotely debugging and fixing the problem on my phone which was a pain, but worked and made me feel like I was some sort of elite hacker.

Today (2018-06-02) our SSL cert expired for some reason, so things weren’t working until Jim fixed it.

Also today, we hit our 2,147,483,647th game played! If you don’t recognize that number, it’s the largest 32-bit signed integer (0x7fffffff if you’re into hexadecimal notation). That means that no more games can be added because the number that identifies the games can’t get any bigger. This was kind of a stupid oversight on our part and is the reason you are seeing the euphemistic “The server is undergoing maintenance” message when you finish a game. When we started this site in 2005 we didn’t think it would ever be popular enough for a number that big to come into play. Last year I even read an article about this exact thing happening to someone else and felt pretty smug that we weren’t that dumb. Oops.
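For the curious, the limit we hit is easy to reproduce. Here’s a minimal Python sketch (the real id lives in a PostgreSQL column, not in Python, so this is just an illustration):

```python
# The game id was a 32-bit signed integer: 4 bytes, with the top bit as the sign.
INT32_MAX = 2**31 - 1

print(INT32_MAX)       # 2147483647
print(hex(INT32_MAX))  # 0x7fffffff

# Simulate handing out the next game id the way a database sequence does.
def next_game_id(current_id: int) -> int:
    candidate = current_id + 1
    if candidate > INT32_MAX:
        # This is the point where "The server is undergoing maintenance" begins.
        raise OverflowError("game id no longer fits in a 32-bit signed integer")
    return candidate
```

Game number 2,147,483,647 is the last one that fits; asking for the next id fails.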

The annoying thing is that all those games take up a lot of space. We have our database on 2 disks (volumes, technically)—one is a 2TB SSD (the main high score tables are there) and is 77% full. The other is a 4TB hard disk and is 95% full. The only way to permanently fix the issue is to re-write the data out which means we need double the amount of space we currently have to fix it. That means buying more disks, which will take a couple days (I don’t think there’s any place locally I can buy them so we’ll have to mail order from Amazon).

The long and short of it is that currently high scores are down and it’ll take us a few days to get back up and running again. The message is true though, the scores are being written out to a different disk and when the db is alive again we’ll import all those scores. If you read that article I mentioned, you might have noticed they had a quick fix to delay the inevitable. We might do that tonight and get things kind of working. But we’re going to have to do a permanent fix soon-ish and so you’ll probably be seeing more of the “undergoing maintenance” message in the next week.

Update (2018-06-03):

Our crusty old server decided to die in the middle of the night. Because it’s housed in a satellite office of our hosting provider, there was no one on staff to reboot it. I had anticipated this and made a backup of the machine about a week ago. Jim and I spent Sunday morning getting the backup restored onto the shiny new server (currently hosted at my house). That is currently what the main site is running on. The blog and the forum are still running on the old server. We’ll be moving those to the new server when the new disks arrive (Amazon says Wednesday).

Update (2018-06-05):

We’ve copied the database to another computer with enough space and updated the DB to the latest version (PostgreSQL 10 if you are curious). We’re currently converting the id column that ran out of room into a representation that can hold bigger numbers (ALTER TABLE for you SQL nerds). This is unfortunately a slow process due to the size of the data. We started it going last night and it looks like it’s maybe 30% done. During this time the DB is completely offline and that is causing the server code to…not be happy. It can’t authenticate users (because users are stored in the DB, too) and so it’s not even saving games. Sorry about that. When the conversion is complete we’ll bring the DB back online and scores should immediately be working.
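To give a feel for why this is slow: a 32-bit integer takes 4 bytes on disk and a 64-bit one takes 8, so widening the column means every row gets rewritten. (In SQL terms the fix is something like `ALTER TABLE games ALTER COLUMN id TYPE bigint;`, though “games” is a made-up table name here.) A quick Python illustration of the size difference using `struct` — the actual on-disk format is PostgreSQL’s, not this:

```python
import struct

game_id = 2_147_483_647  # the id we maxed out at

# A 32-bit signed int ("<i") is 4 bytes on disk -- and our id barely fits.
assert len(struct.pack("<i", game_id)) == 4

# A 64-bit signed int ("<q") is 8 bytes, with room for ~9 quintillion games.
assert len(struct.pack("<q", game_id + 1)) == 8

# Packing the *next* id as 32-bit fails, which is roughly our situation:
try:
    struct.pack("<i", game_id + 1)
except struct.error:
    print("doesn't fit in 32 bits")
```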

Update (2018-06-08):

The disks all arrived and are all installed. We’re still converting the id columns. It’s very slow because the database has to more or less rebuild itself completely. Also we had a bug in the conversion script and once it got halfway done (after about 20 hours) it died and reset all the progress. :-/ That’s fixed though so this time it should work (fingers crossed).

Update (2018-06-10):

The database id column conversion finally finished. It took 79 hours to run. I’ve pointed the site to the server where the database is temporarily housed and will begin copying the data back to where it is supposed to be. This temporary database is not stored on SSDs so things might be slow. We’re not sure the disks can handle the full Green Felt load. It still might be a few days before everything is smooth.

Update (2018-06-28):

The database is back where it belongs, on the new SSD. Unfortunately there was a bug in my script that copied the DB over and I didn’t notice until I had switched over to it. I probably should have switched back to the other DB and recopied, but I didn’t realize the extent of the problem and how long it would take to fix it (I thought it’d be quicker this way—I was wrong). I’ve been rebuilding indexes on the DB for the past week and a half. Luckily the indexes can rebuild in the background without causing outages, so high scores could continue during the process. Unfortunately the one I saved for last does require that the database go down while it’s being rebuilt. I started it on 2018-06-17 at around midnight and expected it to be done when I woke up in the morning—it was not. It’s still going, in fact, which is why you haven’t been able to save games, fetch high scores, or login. I’m not sure how long this one is going to take—I would have expected it to be done already. Until it finishes, things are going to be down.

In other news, the old server (that the blog and the forum are hosted on) went down yet again for a few hours, prompting me to move those programs and their data over to the new server. This move appears to have worked. If you are reading this, it’s on the new server :-).

Update (2018-07-03):

A few days ago we gave up fixing the indexes (we kept finding more and more things wrong with the DB copy) and decided to re-copy the DB. This finished today, and we (again) switched over to the DB on the SSD. The copy was good this time, and things seem to be running well. Check out this graph:

The red line is the number of 502 errors (the “server is undergoing maintenance” errors that we all love). The vertical green dotted line is when the SSD DB came online. After that there have been no 502s (it’s been about 3 hours so far).

When we decided to re-copy the DB, we saved all the new games that were only in the flawed DB and have been re-importing them into the good DB. This is about a quarter of the way done but has sped up dramatically since we got the DB back on the SSDs. Once the import finishes, we will be done with the DB maintenance! Well, until the next disaster strikes! 🙂

Update (2018-07-05):

Everything is going smoothly. We’ve imported all the games that were saved out-of-band (if you ever saw the “Your game was saved but our database is currently undergoing routine maintenance” message). As far as we know, everything is back online (and hopefully a little better than before this whole mess). Have fun!

I was answering a forum post when I noticed a bug in the way Hopeless calculated its scores. It counts up the number of blocks that get removed but it accidentally wasn’t counting the one you clicked on, so it was always 1 lower than it should have been.

The bug is fixed now, but since it changes the way scoring works, scores going forward will be a little higher than they used to be. Unfortunately it’s not possible to fix up the old scores. I don’t think this is as big a deal for Hopeless as it is for the solitaire games.

I also added a change so that going forward we have the option of rescoring games. That’s good because I’m still not exactly content with Hopeless scoring—I think it’s biased toward games of 3 colors.

As you probably noticed, for the past 2 months things have not been running smoothly. The server that Green Felt runs on is *really* slow and I think the disks are going bad. This, interestingly, didn’t cause things to just die, but rather it caused everything to just run slower than normal. At some point it crossed a threshold and couldn’t handle the amount of traffic the site gets.

We’ve been screwing around with carefully constructing the replacement server (mentioned in the last post), and last weekend Jim and I made a final push to get the main database running there. We knew it would take a number of contiguous hours of work to get it going, and that’s why it took 2 months—it’s hard to get both our schedules aligned when we’re both busy with work and non-work life. But in the end, we got it going. I was hoping that normal spinning disks would be enough, but our test run of the DB at about 3:00pm Pacific time had the exact same symptoms—constant “502” errors (which get reported as the infamous “backend server not available” error). The new server is currently living in my house, so I went off to Fry’s to buy a giant SSD to hold the database (well, technically just part of the database). After that got installed we powered it up and things started looking good again. It’s now been 2 days and things are still looking pretty good.

I like graphs so here are some pretty graphs that show the difference between good and bad. This first one is number of games played per hour:

Notice how spiky and horrible it looks on the left, compared to the right. The big gap is when we were actively working on the DB last Saturday. Usually when we do that there is a message about “maintenance mode” but for some reason we haven’t investigated yet, that failed and so none of the games got saved during that time :-(.

Here’s another graph. This one shows the amount of CPU the server uses (basically how hard it is working):

Notice that the CPU is pegged to 400% (it’s greater than 100% because there are 4 CPUs). When we stopped the database on that server the CPU pretty much instantly drops and the server becomes more responsive. Even more important than the CPU is the white line, which represents “load”. It basically says how many individual programs were trying to run at the same time. If it’s more than the number of CPUs then things have to wait (there’s technically more to it than that, but that’s close enough). Notice that during the bad stretch on the left-hand side it’s up at 75 to 100! This meant that everything was super slow and barely anything was getting a chance to finish.
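That rule of thumb—load divided by CPU count—can be sketched in a few lines of Python (the numbers below are from the graph; the function itself is just an illustration, and real load accounting has more subtleties):

```python
def cpu_pressure(load_avg: float, n_cpus: int) -> float:
    """How oversubscribed the machine is: 1.0 means every CPU has exactly
    one runnable program; above 1.0, programs are waiting for a turn."""
    return load_avg / n_cpus

# The bad days: a load of ~100 on a 4-CPU box means roughly
# 25 programs fighting over each CPU.
assert cpu_pressure(100, 4) == 25.0

# A healthy box stays at or below 1.0 per CPU.
assert cpu_pressure(3.2, 4) < 1.0
```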

The last one is arguably the most important one:

This one shows web server errors. Remember, the 5xx errors (502s in particular) are the ones that cause the “backend server not available” message. At the peak of the bad days we were returning 100 errors (the red line) per second! Ouch. Notice they dropped to zero after the Saturday work. This lets us know that things are generally working. I’m not sure what’s causing those spikes you see in the last 2 days. They look kinda bad at this scale, but when you zoom in they are just a handful of really really short spikes (less than a minute each). It’s certainly not the constant horribleness represented on the left-hand side of the graph.
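The error graph comes from counting status codes in the web server logs. Here’s a toy version of that counting step (the entries are made up, and the real logs are the web server’s text format rather than pre-parsed tuples):

```python
from collections import Counter

# Each entry: (unix_second, http_status) -- a stand-in for parsed log lines.
entries = [
    (1000, 200), (1000, 502), (1000, 502),
    (1001, 200), (1001, 200),
    (1002, 502),
]

# Count 5xx responses per second -- this is what the red line plots.
errors_per_second = Counter(
    ts for ts, status in entries if 500 <= status < 600
)

assert errors_per_second[1000] == 2  # two 502s in that second
assert errors_per_second[1001] == 0  # a healthy second
```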

Forum Stuff

One thing you may not know is that every forum post gets emailed to both Jim and me. We use the posts/emails to gauge the health of the site and appreciate people telling us when things don’t work (less so the demands to fix things, although I do understand the frustration of trying to use something that isn’t working and is only fixable by someone else—really, I do). Even when we don’t respond we’re at least keeping up.

That said, this 2 month period was overwhelming in terms of bug reports. On Saturday I had 3000 unread emails in my Green Felt folder. 1000 of those were automated emails for various errors happening. Jim had similar inbox issues. We knew, of course, that things were not working well (the graphs show in plain detail all the errors that were happening), but we got overwhelmed and got really backed up on emails. So if you wrote something on the forum in the last couple months that wasn’t related to the server being on the fritz, you might want to send it again.

Also a special shout out to Sage, who was patiently responding to people even when we weren’t.

Donations

There have been a lot of people offering to donate money for the hardware and upkeep. That sort of generous spirit from our users warms my heart, but I’m not sure taking donations is the right thing to do. I haven’t spoken with Jim about this yet, so this is all just my own opinion. Donations are nice, but I’d rather offer something more tangible in exchange for money, even if it’s just some sort of “premium membership” thing (I have no idea what that would entail). I know some people use Patreon for this kind of thing, but most of the people I’ve heard of that get good money from it are YouTube people with very large followings. I’m not sure we’re big enough for that.

I know I whined about the cost of the SSD in my forum post, but that was mostly just because at $800, I think it’s the most expensive single disk I’ve ever bought. But the fact is, neither Jim nor I are strapped for cash right now, and the amount we’ve sunk into Green Felt is surprisingly little when you consider it’s been running for over 10 years.

What’s Left?

We still need to convert the rest of Green Felt over to the new server. Then we need to move it from my house into our hosting facility. This is requiring us to learn new stuff—we’re basing our new server on NixOS, which is new to both of us. That makes things go slower since we’re adding a fairly big learning curve on top of everything. Still I think it’s the right thing to do and it should give us a more reliable, more portable environment (it would be nice to run it on Amazon’s or Google’s cloud without a whole bunch of work).

Our aging server stopped working last night at about 7pm. We couldn’t fix it remotely so we had to ask someone in the datacenter where the server lives to hard reboot the machine for us. Because of some sort of logistical issues (no one at the datacenter at that time), it took about 4 hours to get the machine rebooted. We don’t have any redundancy and so during that time we were just off the air.

Oddly, since the server has been rebooted the disks are acting differently (in a good way) and I haven’t seen any of the “502 gateway timeout” errors that have been plaguing us for the last few weeks. We’ve been working on that (rather unsuccessfully so far), so it’s a little disappointing to see it magically fix itself. To investigate the 502s we gather a bunch of metrics and use them to build neat graphs so that we can see what’s going on:

This graph shows both the outage and today’s lack of “502 Gateway timeout” errors—the outage is the huge 4 hour gap between the 2 vertical dashed green lines and the red line shows the 502 errors. Notice today it’s nice and flat (ahhhhh), while yesterday there were a bunch of ugly spikes during peak hours (US/Pacific time).

New Server

On the plus side, we have obtained a fancy new server: it’s got roughly 6 times the number of processor cores (with each core being twice the speed) and more than 30 times the memory. We’re (slowly) getting it ready. The old server was built in such a way that it was hard to move the programs around and keep everything working. We’re taking the more modern approach with the new one, but it means a lot of thought and planning up front so that everything is smooth (and possibly redundant) in the future. Jim and I are both pretty busy with our day jobs right now, so all this is happening in our spare time.

We had 2 disk failures early Tuesday morning (2016-03-29 05:14:49 PST). Jim and I spent today at the data center where the server lives, fighting with the BIOS and getting 3 replacement drives installed (1 disk had died a few months back).

The drives are back and the servers are serving, except for the main database which wasn’t redundant on the server. We have backups of it off site, but it will take some hours to get it copied back.

The database is where we store the scores for the high score tables and leader board, as well as user login information. This means those features are not going to work until the database is restored. In the meantime, you should at least be able to play games anonymously.

Jim and I were impressed by how many people tried to reach us when they couldn’t get the site to work, and at the lengths some of you went to: one of you figured out my cell phone number and called me (I was asleep and didn’t hear my phone ring), another one of you figured out who our hosting provider was and emailed their support team (leading to a funny conversation when we went into the data center today). Don’t take that wrong, we aren’t annoyed—we’re heartened that so many of you like the site so much, and we’re sorry for the downtime.

Update 2016-03-31 09:25 PM PST: The site doesn’t work well without the database so we’re temporarily pointing to our offsite backup database. This is just hosted on my cable modem at my house so I might not have enough bandwidth to support the full load of Green Felt. This is just temporary until the backup is copied back to the main server. Hopefully it’ll hold up well enough until then ;-).

Update 2016-04-05 01:25 AM PST: The backup has been restored and we switched over tonight at midnight. Everything seems to be OK.

Last night our hosting provider was having work done in the facility where our server is hosted and the power got turned off. We run a very minimal setup here at Green Felt, which means we don’t have backup servers for situations like these. Hence the downtime last night and this morning.

While things were down we decided to take the opportunity to upgrade one of our offsite database backups. Now that the machine is back up we’ll be copying that upgraded data back to the server. During this time the site will be up and games will be playable, but high scores and the leader board will not work. Also logging in and creating users will not work, though if you are currently logged in you should stay logged in. Games will still be saved to our offline queue, and when the DB is back up and running we will write all the games in the offline queue back to the database.
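The offline queue is conceptually simple: while the database is down, finished games get appended to a file on another disk, and once the DB is back each record is replayed in order. Here’s a hypothetical sketch of the replay step (all the names and the JSON format are made up for illustration; this isn’t our actual code):

```python
import json

def flush_offline_queue(queue_lines, save_game):
    """Replay queued games into the database once it's back up.

    queue_lines: an iterable of JSON lines written while the DB was down.
    save_game:   a callable that writes one game record to the live DB.
    """
    replayed = 0
    for line in queue_lines:
        record = json.loads(line)  # one finished game per line
        save_game(record)
        replayed += 1
    return replayed

# Usage, with a plain list standing in for the database:
db = []
queue = ['{"user": "anon", "game": "klondike", "score": 1200}',
         '{"user": "jim", "game": "pyramid", "score": 980}']
assert flush_offline_queue(queue, db.append) == 2
assert db[1]["score"] == 980
```

Replaying in file order preserves the order the games were finished in, which keeps timestamps sensible when the scores finally land in the tables.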

We’re not sure how long the DB copy will take, but let’s assume it will be a couple hours.

Edit [1:52 PM]: We’re having trouble with the upgrade so we’re going to bring the DB back up and try again sometime in the future. Everything should be running now.

We just pushed a fix to Pyramid. The problem was caused by a recent change we made, but the bug only showed up if you had your card size set to something other than “Automatic”. Jim and I always keep it on “Automatic” and we didn’t think to try changing the card size, so we couldn’t find the problem.

Luckily user “jlavik” let us share her computer so we could poke around and diagnose the issue. Once we figured out it was due to the card size the fix was pretty easy.

So, thanks to Judith, and to everyone else who reported the problem. We rely on bug reports to know when we broke things—we try to test everything we do, but things inevitably fall through the cracks. Luckily we have an awesome community that helps us out when things are broken.

We’ve added a new high score option today. It’s the checkbox at the bottom of the high score table that reads “Show a player’s first score, not their best.” The high score table shows only one score for other players (though it always shows all of your own scores) and this option allows you to set which score of theirs you want to see—the best or the first.

Before we added this option, the high score tables showed the first game for other players. Along with adding the option, we’ve changed the default so the table now shows each player’s best score instead of their first.

Why did we do this?

Because of the way the high score tables used to work, certain people were “gaming” the system—they’d play a game over and over anonymously (or with different alternate user accounts) and then log into their main user account and play it one last time to get a ridiculously fast run time. Jim and I talked about different ways to combat this but in the end we decided to cater to that style of play instead of trying to ban it outright. With the new default high score settings, there’s no need to log out, play, log in, play again. Just play a game repeatedly while logged in and your score will move up the ranks.

We’ve left the option to show the first score there for people who just want to see how well they can do against a level playing field.
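In terms of the data, the checkbox just changes which one of a player’s scores gets picked for the table. A sketch of that selection (field names are hypothetical, and this assumes a timed game where a lower elapsed time is better):

```python
def score_to_show(scores, show_first):
    """Pick one score for a player: their first attempt or their best.

    scores: list of (timestamp, elapsed_seconds) tuples for one player.
    """
    if show_first:
        return min(scores)[1]          # attempt with the earliest timestamp
    return min(s for _, s in scores)   # fastest time across all attempts

# Three attempts by one player: they got faster, then slipped a bit.
attempts = [(100, 95.0), (200, 61.5), (300, 70.2)]
assert score_to_show(attempts, show_first=True) == 95.0   # level playing field
assert score_to_show(attempts, show_first=False) == 61.5  # grind-it-out mode
```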

We’re upgrading our database (PostgreSQL 9.2 to 9.3, if you’re interested). Scores will be kept but the high score tables will not be able to be viewed and no one will be able to log in until it is finished.

I’m trying to stay up (by watching a movie) until it is finished upgrading so I can start everything back up as quickly as possible, but I’m fading fast and may fall asleep before it’s done. If that happens it’ll be down until I wake up and finish the (manual) process.

Update: Jim and I finished the upgrade this morning at about 10. Everything should be working at this point. The games that happened during the DB upgrade are still being loaded into the new DB, so things might be a little slow for a couple hours.