Running queries on the HMRC database fiasco

Comment When it comes to talking about last week's data loss by the HMRC, I was told not to use precious words outlining my feelings of rage and bafflement that a government body can be so cavalier with so much data because, presumably, we all feel the same.

So I will simply note, for the record, that my gob has been totally smacked by this debacle. What I will do is take a look at the technical elements of this case from the database/data perspective.

First, what was the data format?

Data transfer between systems is typically effected using a simple data format such as CSV or XML, especially if the target and source databases are hosted on different engines (XML seems less likely in this case since that would imply a department that had made it into the 21st century). It is also possible that a format such as an Access .MDB or an .XLS file was used and the data batched over several files. The bottom line is that it is unlikely to be the raw tables from an IMS database.

Of course, "they" won't tell us and, in fairness, they shouldn't. The disks are still missing and it would simply compound the disaster to supply information that would help any black hats that stumble across the data.

So, until we are given evidence to the contrary, we can assume a fairly simple format.

Next, the level of encryption.

It is not clear how well the data was protected. Rather worryingly, the term being bandied around is "password protected" rather than "encrypted". Of course, the very fact that we are in the middle of this shambles tells us we are dealing with technically incompetent people, so they may simply not be able to distinguish between the terms. We are almost certainly not talking about RSA encryption here. However, as we are all aware, it is often the human element that torpedoes a technically secure system and the anecdotal information coming out suggests this is an area of considerable concern.

Shawn Williams, of Rose, Williams and Partners, a legal firm in Wolverhampton that deals with tax fraud cases, said his firm frequently received discs that contained personal data from the HMRC with the password included. 'Sometimes there is no security at all, sometimes there are instructions telling you how to access the data, sometimes the password is just written on a compliments slip and included with the disc'.

Why wasn't the sensitive, non-required, data removed?

According to the Telegraph, the National Audit Office (NAO) asked for the names, National Insurance numbers, and child benefit numbers of every child so that it could select 100 cases at random for its annual audit of Revenue and Customs. The NAO asked for bank and other details to be removed, but an HMRC official replied that, to keep costs down, the HMRC could only provide all of the details on the database.

Now, let's go back to Database 101 for a moment (not a theoretical Database 101; I actually designed and teach on the database course at Dundee University). The first example in the first lecture on querying shows the students how to subset the data by column and then by row. Database engines are built, from the ground up, to perform sub-setting. It doesn't get any easier than this. So, where's the problem?
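For anyone who skipped that first lecture, here is a minimal sketch of sub-setting by column and then by row, using Python's built-in sqlite3 module. The table name, column names, and sample rows are all invented for illustration; we have not been told what the HMRC schema looks like.

```python
import sqlite3

# Hypothetical table and columns; the real HMRC schema is not public.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claimants ("
    "name TEXT, ni_number TEXT, child_benefit_number TEXT, "
    "bank_account TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO claimants VALUES (?, ?, ?, ?, ?)",
    [
        ("A. Example", "QQ123456A", "CHB0001", "00000001", "North"),
        ("B. Example", "QQ123456B", "CHB0002", "00000002", "South"),
        ("C. Example", "QQ123456C", "CHB0003", "00000003", "North"),
    ],
)

# Subset by column: name only the columns that are wanted.
by_column = conn.execute(
    "SELECT name, ni_number, child_benefit_number "
    "FROM claimants ORDER BY name"
).fetchall()

# Subset by row as well: add a WHERE clause.
by_row = conn.execute(
    "SELECT name, ni_number, child_benefit_number "
    "FROM claimants WHERE region = ? ORDER BY name",
    ("North",),
).fetchall()
```

The bank details never appear in either result, simply because they were never named in the SELECT list.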

But, let's assume the worst: that the HMRC uses a very complex, unwieldy engine that cannot subset data easily. Well, no matter how complex the original database was, it is reasonable to assume that it was reduced to a simple format for the CDs. In which case, importing it into an engine that can subset easily is trivial.
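To give a feel for how trivial, the sketch below loads a CSV export into SQLite and then subsets it. It assumes the export is a CSV file with a header row, which is my guess rather than anything we have been told; the `load_csv` helper, the `child_benefit` table name, and the column names are likewise made up for the example.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_file, table="child_benefit"):
    """Load a CSV file with a header row into a new all-text table."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    # The reader yields one list per record, which executemany consumes lazily.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return header


# Invented sample data standing in for the real export.
sample = io.StringIO(
    "name,ni_number,child_benefit_number,bank_account\n"
    "A. Example,QQ123456A,CHB0001,00000001\n"
    "B. Example,QQ123456B,CHB0002,00000002\n"
)

conn = sqlite3.connect(":memory:")
load_csv(conn, sample)

# Once the data is in an engine, dropping the bank details is one query.
wanted = conn.execute(
    "SELECT name, ni_number, child_benefit_number "
    "FROM child_benefit ORDER BY name"
).fetchall()
```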

So, unless there are some very odd circumstances to which we are not privy, I find it impossible to believe that removing the bank and other details would have involved significant cost.

Go on, Mark, stick your neck out. How much?

Well, the Telegraph has an estimate:

However The Telegraph has established that a typical clean-up operation would cost around £5,000 and take a software engineer less than a week. A spokesman for HMRC said that the £5,000 cost of removing the information "was not a figure we recognise" and declined to discuss the cost because the matter is the subject of a review.

I don't recognise the figure either, but that's because I think the Telegraph is being far, far too generous to the HMRC. Assuming 25 million CSV records, I would estimate half a day's work to subset by column. If I was familiar with the data structure and had done the job before, maybe an hour. Any competent DBA could do it in the same time. Now DBAs are expensive, but not £10,000 per day. I'd do it for £500.
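For the sceptical, this is roughly what the job might look like if the data really were CSV with a header row (again, my assumption). It is a sketch, not the HMRC's process: the `subset_columns` function and the column names in `KEEP` are hypothetical. It streams one record at a time, so 25 million records need no more memory than three.

```python
import csv
import io

# Hypothetical names for the columns the NAO asked for.
KEEP = ["name", "ni_number", "child_benefit_number"]


def subset_columns(src, dst, keep=KEEP):
    """Copy only the `keep` columns from src to dst; return the record count."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    count = 0
    for row in reader:
        # extrasaction="ignore" silently drops every column not in `keep`.
        writer.writerow(row)
        count += 1
    return count


# Invented sample data standing in for the real export.
src = io.StringIO(
    "name,ni_number,child_benefit_number,bank_account,sort_code\n"
    "A. Example,QQ123456A,CHB0001,00000001,00-00-01\n"
    "B. Example,QQ123456B,CHB0002,00000002,00-00-02\n"
)
dst = io.StringIO()
records = subset_columns(src, dst)
```

With real files, the two StringIO objects would be replaced by files opened with `newline=""`, as the csv module's documentation recommends.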

A spokesperson for the HMRC said: "We don't have infinite resources, we have to use our resources rationally."

One has to wonder about the definition of the word "rationally" here.

Finally, the ID card thing

Chancellor Alistair Darling has been quoted as saying that the disaster actually strengthened arguments in favour of ID cards. OK, I’m game, let's think this through. One of the advantages of ID cards is that they make it easier to tie data from several disparate systems together. For example, it appears still to be possible for a person in this country to be drawing benefit while working. An ID card, complete with unique identifier, should make these anomalies much easier to spot.

So, on the surface, this argument sounds reasonable. The data was being shipped so that it could be cross-correlated between two systems. ID cards make cross-correlation easier. Therefore, this is an argument for ID cards. However, there is a glorious technical flaw in this argument.

ID cards may help to identify people more accurately but they don't, in any shape or form, help with the movement of the data. ID cards would have made no difference whatsoever to the fact that the data has to be moved. Indeed, had the ID data been included it would, presumably, have been one more piece of data to delight the bad guys.

I'm used to the fact that politicians don't understand technology, but it frightens me that, in the midst of one crisis, they can still find the time to misuse it to promote another unrelated political agenda.

Finally, of course, the fact that our government has demonstrated a complete lack of ability to protect our data is, for me, a strong argument against ID cards. But then, I'm not a politician.

Summary

Once again, it is fair to say that we aren't being told the technical details and neither should we be. However, these security concerns also provide a convenient smoke screen from which can emerge bland assurances like: "It would have been very expensive to do this", "this strengthens the arguments for ID cards". While we cannot directly gainsay these, if we make reasonable assumptions, it is clear that many of them are nonsense. ®