Tuesday, June 20, 2006

After the new database went live on the 14th, it quickly settled down happily. Batch updates on the night of the 14th went out OK.

The daily update for the 14th (starting at 1:20 am on the 15th) started fine.

This run failed and, despite several fixes being deployed, was ultimately not sent. Apologies to all.

Friday morning’s run (for posts written on the 15th) went out successfully, although very late.

Saturday’s run was delayed as the scheduler didn’t start it while the prior run was under way. It did complete successfully, however, although again late.

Sunday morning’s run (for posts written on the 17th) went out correctly and on time.

Fetch me Murphy’s Law! Monday’s run hit a completely unrelated performance flaw that caused it to run extremely slowly. FeedBlitz’s safety protocols terminated the run automatically after 23 hours.

This morning’s run (for posts written on the 19th) started on time and successfully completed.

Decision day was last Friday, because I wasn’t prepared to allow two consecutive failures. If we hadn’t got the Friday morning run out successfully that day, we were going to roll it all back and start over. The fixes worked, however, and the fact that we have had no database issues since, and that performance is better, vindicates that decision.

So the bottom line is, simply, unacceptable performance. Ugh. For what it’s worth, the good news is that no database-related issues have surfaced since Friday. Not that that matters to you, of course. Crappy performance is still crappy performance, and not what we’re all used to from FeedBlitz.

So, what’s the plan? What did we learn?

Well, although the upgrade performed great in testing, and even better than expected in early production on Wednesday, our testing clearly did not uncover the issues we’d face in production. Based on the testing we had done and the system’s performance on Wednesday, up to and including the start of the overnight run, we had no basis to expect anything as bad as what actually happened.

Lesson learned, changes under way.

For future changes of this magnitude we must, and will, do better. Although this has been an ugly few days, the net result is a more efficient, more scalable, more flexible FeedBlitz that is now positioned for the next stage of its growth. On which more later, but for now we need to get your confidence back. I’m confident in the changes we’ve made to improve reliability, but won’t declare victory until at least three consecutive overnight runs complete successfully. Any failure will reset the count, regardless of cause.
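For the technically curious, the "three in a row, any failure resets the count" rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not FeedBlitz's actual monitoring code; the function name `stable_after` and the list-of-booleans input are made up for the example.

```python
def stable_after(run_results, required=3):
    """Return the 0-based index of the run at which `required` consecutive
    successes have been reached, or None if that never happens.

    `run_results` is a hypothetical list of booleans, one per overnight run
    (True = completed successfully, False = failed for any reason).
    """
    streak = 0
    for index, succeeded in enumerate(run_results):
        # Any failure resets the count, regardless of cause.
        streak = streak + 1 if succeeded else 0
        if streak >= required:
            return index
    return None
```

With results such as `[True, True, False, True, True, True]`, the failure in the third run wipes out the first two successes, so the threshold is only reached on the sixth run.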

Meanwhile, a couple of upgrade-related fixes to tell you about:

A defect in the dashboard that caused the grand totals to be incorrect has been fixed.

A defect that caused some feeds to be added without their titles and descriptions has been corrected.