And so ends the second age of WoL, and gone are the days of sudden lagginess or even temporary unavailability. The downtimes are short, only seconds up to a minute at a time, but still, they didn't escape the all-seeing eye of Nagios.

It has been a long, hard road, but after adding a whole lot more monitoring options and trying hard to DoS the backend to trigger the bug, I've found the cause of all evil. That area was quickly isolated and the workaround is in place, but the evil is still in there, watching, always trying to get out. For now, this will do.

My part of this tale is over now; it's up to the brave engineers of Sun to fix the problem in the JVM for good.

~~

Finally, we're back in business for more features and shinies. Let the coding begin!

Woohoo, I've gone through the bug reports forum and read them all now. Most have been added to the issue tracker, and the trivial / critical ones are fixed.

The evil reared its head briefly today, when I tried to reindex all reports with a few threads; I adjusted some parameters, and I guess I should be more patient and wait for the weekend. Oh well.

Grr. Remove one bottleneck and another spawns almost immediately. We hit a peak of 1700 data-heavy / report requests per minute, but there was a ~30-second queue at the Apache level. Previously, things stopped working at ~1400.

This kinda screws up my plan to finish redoing the rankings stuff. I'll be setting up another machine for production duty tomorrow and load balancing between them. And then I can watch the backend die horribly. Lovely.

It's time to see if the budget allows for more rackable hardware; this is like fighting a hydra with a wooden sword.

Yeah, that was in the plan. But fixing the data problem first has priority.

Anyway, we were going up and down in the past 15 minutes, tweaking configs and shifting load between machines. The good news is that frontend capacity has roughly doubled. The bad news is that the backend will probably be in trouble tonight. There's only so much you can do with tuning, and the other solution, getting one or two more R610 / DL360 servers, isn't within our budget for now.
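For the curious: splitting traffic across two frontends can be done at the Apache level with mod_proxy_balancer. This is only a hedged sketch of that setup; the hostnames and ports below are made up and not the actual WoL topology.

```apache
# Hypothetical two-frontend balancer; member names are illustrative only.
<Proxy "balancer://frontends">
    BalancerMember "http://frontend1:8080"
    BalancerMember "http://frontend2:8080"
</Proxy>
ProxyPass        "/" "balancer://frontends/"
ProxyPassReverse "/" "balancer://frontends/"
```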

I'll go tweak the queue, limiting loading to ~300 reports a minute. If you get a 503 "try again later" screen, try again later.
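A cap like "~300 reports a minute, 503 otherwise" is the classic token-bucket pattern. Here's a minimal sketch of that idea; the class name, burst size, and 503 behaviour are my assumptions for illustration, not WoL's actual code.

```python
import threading
import time


class ReportRateLimiter:
    """Token bucket: refills at rate_per_minute tokens/minute, up to `burst`.

    Hypothetical sketch of a queue throttle like the one described above.
    """

    def __init__(self, rate_per_minute=300, burst=10):
        self.rate = rate_per_minute / 60.0   # tokens per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self):
        """True if a report may be loaded now; False -> answer with a 503."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


limiter = ReportRateLimiter(rate_per_minute=300, burst=10)
# 50 requests arriving at once: only the burst gets through immediately.
allowed = sum(1 for _ in range(50) if limiter.try_acquire())
print(allowed)  # → 10
```

Rejected requests aren't queued here; the caller is expected to send the 503 screen and let the client retry, which keeps memory use flat under load spikes.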

The number of concurrent sessions is over the hard limit again, which is set by the available memory. We're adding 16G extra RAM when it arrives; I'll update later with the maintenance window / expected downtime. This should reduce the number of "Server too busy, try again later" messages you get.
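Back-of-the-envelope, the session ceiling scales linearly with RAM, so doubling memory roughly doubles capacity. The per-session footprint and OS reserve below are illustrative guesses, not measured figures from the WoL servers.

```python
def max_sessions(ram_gb, per_session_kb, reserved_gb=2):
    """Rough ceiling on concurrent sessions for a given amount of RAM.

    per_session_kb and reserved_gb are hypothetical numbers for
    illustration; real values depend on the app and the JVM heap setup.
    """
    usable_kb = (ram_gb - reserved_gb) * 1024 * 1024
    return usable_kb // per_session_kb


# Assume 16 GB installed today and ~512 KB per session.
before = max_sessions(16, 512)
after = max_sessions(16 + 16, 512)   # with the planned 16G upgrade
print(before, after)  # → 28672 61440
```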

The next step up will be way more expensive though... Even with 16 DIMM slots, the time has come that we're out of room for expansion in that server. Swapping out lower-capacity DIMMs for 8G, dual-rank ones is insanely expensive at ~500e each.

If you like our work and want to support us in keeping the quality of service high, consider getting a subscription. Those give us a bit more room in the hardware budget department.