It appears that the problems on the site were due to someone (read: me) messing up when they restarted the database-replication!

Thanks to a bunch of helpful problem reports from a number of users, I had some good data to look at to figure out what was wrong. It was actually pretty easy to figure out once I had all of those problem pages to look at (I’m talking about you, Brian May!).

Bonus pretzel

While I was waiting for the computers to move some massive files around, I had a couple of minutes here and there to make other tweaks to the site. Two somewhat interesting things came out of this time: 1) the “job queue” is now run automatically every hour (which keeps things up to date and avoids assigning extra jobs to random users, who would otherwise get something like a 2-minute page load randomly every 10,000 pages) and 2) the road-block page that shows up when the site is shut down for maintenance now has an iframe in it which shows a Google search of LyricWiki.org for the same page and suggests that users click the “Cached” link. This will allow people to see a somewhat-recent copy of most of the pages even while the site is down. I dig it.
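For the curious, here is a minimal sketch of how a road-block page like that could build its iframe. It is an illustration only, not the site's actual code: the function name `cached_search_iframe`, the example page title, the iframe dimensions, and the exact shape of the search URL are all assumptions.

```python
from html import escape
from urllib.parse import urlencode


def cached_search_iframe(page_title: str, site: str = "lyricwiki.org") -> str:
    """Build an iframe showing a site-restricted search for page_title.

    Hypothetical helper: the search URL format and iframe size are assumptions.
    """
    # The "site:" operator limits results to one domain.
    query = urlencode({"q": f"site:{site} {page_title}"})
    url = f"https://www.google.com/search?{query}"
    # Escape the URL before dropping it into an HTML attribute.
    return f'<iframe src="{escape(url, quote=True)}" width="100%" height="600"></iframe>'


if __name__ == "__main__":
    # "Artist:Song Title" is a made-up page title for demonstration.
    print(cached_search_iframe("Artist:Song Title"))
```

Whether a search engine allows its results to be framed like this is up to the search engine, so treat the sketch as a picture of the idea rather than something guaranteed to render.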

The site has been behaving strangely lately and I’m not sure how I managed to break it or what is wrong, but I’m looking into it actively. I’m going to be working on any intensive changes starting tomorrow (Saturday) morning since weekends tend to have less traffic than weekdays.

People have been forwarding me a good bit of info about the problems, and it has been very helpful. If you have seen anything strange (especially if you have noticed a pattern in it), please pass the information along to me!

Hopefully the site will be all patched up by the end of tomorrow. Check the blog for updates as things are happening (brief outages are very likely tomorrow).

I really apologize for the oddities on the site. They are almost certainly my fault (and even if they weren’t… they would be since it’s my responsibility to keep things humming along).

So, in the wake of last week’s problems on LyricWiki, the system was finally purring again at full steam when, this morning at about 1am, our data center had its first-ever full power outage. This is the kind of thing that isn’t supposed to happen since good datacenters (like LyricWiki’s) have power from multiple sources and generators that kick on if all of those fail. But once in a while a human error will mess the whole thing up and briefly cut the power. The power was only out for about 1 minute, which basically just causes the computers to reboot.

This normally would have been fine… the servers are configured in a way that they are all supposed to jump right in where they left off. Once all of them are up, the site should automatically work again. “Should”. For some reason, the master-slave replication is broken again. The slave is trying to read a position in the log-file that doesn’t exist. So yet again, I think we’re going to have to start the replication all over. This means about an hour or more of downtime.
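To make "a position in the log-file that doesn’t exist" a bit more concrete, here is a rough sketch of the check involved. It assumes the replication position is a byte offset into the master's log file (as in MySQL-style binary log replication, which is itself an assumption about the setup here). The function name `position_exists` and the temporary file are hypothetical, and the code never talks to a real database.

```python
import os
import tempfile


def position_exists(log_path: str, position: int) -> bool:
    """Return True if byte offset `position` falls inside the log file.

    A position equal to the file size counts as valid: it means the
    reader is fully caught up and waiting for new entries.
    """
    if position < 0:
        return False
    return position <= os.path.getsize(log_path)


if __name__ == "__main__":
    # Stand-in for a master log that ended up 100 bytes long on disk.
    with tempfile.NamedTemporaryFile(delete=False) as fake_log:
        fake_log.write(b"x" * 100)
    try:
        print(position_exists(fake_log.name, 80))   # inside the file
        print(position_exists(fake_log.name, 250))  # past the end of the file
    finally:
        os.remove(fake_log.name)
```

If the slave remembers an offset that is past the end of what the master actually has on disk, the second case applies, and there is nothing for it to resume from.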

For now, I just disabled the slave (to prevent weird behavior) so the site is going to be working with a lot fewer resources than it normally has. I’m planning to do the outage at night when the traffic is lower. The site will potentially be slower than normal today since it doesn’t have the slave server helping out.

All in all it could have been a <strong>lot</strong> worse since the downtime so far has only been a minute or so, but please be aware that there will potentially be an hour or two of downtime some time tonight or tomorrow morning.