Internet failures that have forced the Companies Office and 14 other Government websites offline for four days have been described as an international embarrassment.

Other websites caught up in the outage include the Personal Property Securities Register, Intellectual Property Office and Ministry of Consumer Affairs sites. …

The MED blamed the outage on upgrade work to the servers hosting the sites.

Unscheduled outages should be measured in minutes, not days.

The Companies Office site, in particular, is a high-demand one.

I have some experience with such a site. From 2002 to 2010 I was a director of NZ Domain Name Registry Ltd, which trades as .nz Registry Services (NZRS) and operates the registry for .nz domain names.

The service level agreement with the Domain Name Commission Ltd (which I now serve on) specifies that the registry must be available 99.9% of the time, measured monthly and excluding scheduled outages notified in advance. This means that unscheduled outages must total no more than about 43 minutes in a month, or the company would be in breach of the SLA.
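That downtime budget follows directly from the SLA percentage. A quick sketch of the arithmetic, assuming a 30-day month (real SLAs define the measurement period precisely):

```python
# Downtime allowed per month by an availability SLA.
# Assumes a 30-day month = 43,200 minutes.

def allowed_downtime_minutes(availability_pct, period_minutes=30 * 24 * 60):
    """Return the unscheduled-downtime budget, in minutes, for one period."""
    return period_minutes * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% availability -> {allowed_downtime_minutes(sla):.1f} min/month")
```

At 99.9% that works out to 43.2 minutes a month; a four-day outage blows through more than a hundred months' worth of budget at once.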

To help achieve that, a lot of redundancy is built into the system. In fact there are parallel systems in Wellington and Auckland, so if one city is unavailable, the other system can kick in.
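The active/standby idea is simple at its core: prefer the primary site, and fall back to the secondary the moment the primary fails its health check. A minimal sketch of that selection logic (the site names and health-check mechanism here are hypothetical, not NZRS's actual design):

```python
# Illustrative active/standby failover selection. In a real deployment the
# health check would probe the service itself; here it is just a status map.

def choose_active_site(health, preference=("wellington", "auckland")):
    """Return the first healthy site in preference order, or None if all are down."""
    for site in preference:
        if health.get(site):
            return site
    return None

print(choose_active_site({"wellington": False, "auckland": True}))  # auckland
```

The point of having two full systems is that this decision can be made automatically and in seconds, rather than waiting days for one site to be repaired.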

If NZRS had an unscheduled outage of four days, I imagine there would have been resignations from both the board and senior management, unless it was for the most exceptional and unavoidable reason. I certainly would have offered my resignation as a Director to the shareholder (InternetNZ).

The Companies Office website is excellent. I use it often, and it is one of the reasons we score highly on ease-of-doing-business surveys. You can establish a company in under an hour, all online. But the more vital a service becomes, the more important it is to ensure it stays up.

MED should at a minimum commission an independent report into what went wrong, and what they need to do to prevent such an outage in future.


This entry was posted on Wednesday, September 28th, 2011 at 11:03 am and is filed under Internet.

21 Responses to “Down for 4 days is a disgrace”

Many of the Ministry of Health’s systems were out of service (or effectively out of service) for several weeks in early 2009 after they were comprehensively trashed by a worm infection. They hadn’t patched anything in years. A four day outage looks good by comparison.

Not unusual davidp, I know of a couple of banks here in NZ who have a Conficker problem they still haven’t managed to resolve – ring-fenced is about as good as they can get it. Hell, a DHB I consulted at had mission-critical apps running on servers seven years old that had never once been taken down – and no one on staff who knew the DB they were perched on.

I’m sorry, but this is hilarious. We have a Government which has been cutting the public service and “reprioritising” spending from back office to front line delivery. Then we have a complete meltdown of back office functions. Quelle surprise. Who could possibly have seen that one coming?

I wonder which poor bloody public servant will be left to carry the can for this? I can guarantee you won’t find a minister who will accept any responsibility.

[DPF: The companies office website operates on user-pays. It is a commercial service. But nice try at blaming National for this. You are getting desperate]

Yeah right, the problem is budget cuts, Nick, not just rank incompetence? I’ll be waiting for the report to see what the real answer is. I very much doubt that the government reduced back-office budgets and that the reasonable response was “hey, let’s get rid of all the redundancy on our servers.” And if it was, then someone should be fired for that.

This is purely the responsibility of the staff and management of the servers – both ministry and outsourced.

I cannot believe that:
* There was no backout plan
* They didn’t upgrade one of the systems (Akl or Wgtn) first, then the other
* There was no test environment in which they did the upgrade first to flush out these problems
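Those three points amount to the standard playbook for a safe upgrade: change one site at a time, verify it, and back out immediately if verification fails. A sketch of that process, with placeholder callables standing in for the real deployment, health-check, and rollback steps:

```python
# Staged upgrade with a backout plan: upgrade sites one by one, verify each,
# and roll back on a failed check, leaving the rest on the known-good version.
# The upgrade/verify/rollback callables are placeholders, not a real tool.

def staged_upgrade(sites, upgrade, verify, rollback):
    """Return the list of sites successfully upgraded before any failure."""
    done = []
    for site in sites:
        upgrade(site)
        if not verify(site):
            rollback(site)      # back out the bad change on this site only
            return done
        done.append(site)
    return done

# Example run: the second site fails verification and is rolled back.
log = []
result = staged_upgrade(
    ["wgtn", "akl"],
    upgrade=lambda s: log.append(("upgrade", s)),
    verify=lambda s: s != "akl",            # simulate a failure on "akl"
    rollback=lambda s: log.append(("rollback", s)),
)
print(result)    # ['wgtn']
print(log[-1])   # ('rollback', 'akl')
```

Done this way, the worst case is one degraded site for the length of one health check, not a four-day national outage.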

This is interesting, in light of a big Notice of Intent released about a week ago…

The Department of Internal Affairs (DIA) and a number of other agencies are collaborating on common approaches to web services, or Common Web Services (CWS).

DIA is the lead agency.

The goal of CWS is to reduce duplication of effort and streamline procurement by allowing agencies to cluster around a small number of web publishing platforms or content management systems, possibly accompanied by panel procurement arrangements.

This Notice to Prospective Suppliers (Notice) seeks information from prospective suppliers to help the CWS Project to:

• create options for government CWS solutions;
• understand market capabilities;
• provide inputs that will help identify whether or not there is a viable business case for government CWS solutions; and
• inform the construction of a business case, if one is viable.

The documentation alludes to centralisation, where appropriate, across all Govt departments.

Here’s another guarantee. Irrespective of what the report says, people will continue to believe whatever they want to believe. So you’ll continue to think the evil govt caused it by asking for efficiency. I’ll continue to believe that incompetence caused it.

I’ve worked in government, and I have a fair idea what goes on in their back office. It is full of bloat, but in general they’re useless at dealing with it; the reality is that you could fire a good 30% of the people in the back office and never notice they were gone. But the actual process govt uses when it seeks cuts doesn’t involve getting rid of the dead wood. It generally involves offering voluntary redundancies (otherwise known as paying your best staff to go away), or putting on a hiring freeze and waiting for natural attrition (otherwise known as preventing any new blood coming into your organisation). It may be true that the pressure on budgets therefore played a part in this, but it didn’t have to be that way.

The day-to-day management of the vast majority of these servers is outsourced nowadays, especially by government entities, which as a generalisation struggle to retain skilled IT staff, while the commercial entities to whom they outsource specialise in this.

I would be looking to see if that is the case here. If not, that may well be the problem. If so, the outsourced company would be finding its penalty clauses kicking in.

I work on large corporate IT systems. Let’s just say, if I were responsible for their systems going down for four days, I’d be sacked.

I bet the problem relates to short cuts on testing. Ultimately it is a budgetary thing. I used to have a project manager who’d ask team leaders for an estimate for a change, then pretty much cut it in half. The final product would be full of bugs precisely because there was insufficient time to properly develop the software. And guess who got the blame? The poor programmers working to unrealistic timeframes.

Because the system (think lights) is out. This is caused by some sort of failure; in the case of a light this might be a blown bulb.

Anyho…

I recall talking to an IT guy about an outage in 1999 with the Otago uni student login system which lasted days. It was still running, but anything that would usually take seconds took about ten minutes. The problem with a situation like that is that the issue may be quite simple to resolve, yet the small tasks needed to fix it can add up to hours. You can’t necessarily restore from backup either, because the backup copy may contain 99% of the conditions that caused the issue.

Well, in the engineering trade at the dirty end (hammer and tongs) it is called a fuckup.
There is no other name; it is just a fuckup. The person who fucked up is the person responsible, nobody else.
So who fucked up? It is taxpayers’ money being wasted on this fuckup, so own up.