To Excel or not to excel? Tips for avoiding data quality errors

One of the most common sources of data, in almost any environment, is Microsoft Excel.

Yet, through the use of Excel, many organisations inadvertently increase the risk of poor quality data.

Calculation Errors

Microsoft Excel is the business tool of choice. In many cases, complex models such as commission structures or risk models have been developed by business users in Excel. These users commonly have no programming training, and the resulting spreadsheets are typically not rigorously tested. In fact, very few are even checked by a second person.

The European drive for austerity in southern Europe was driven by a conclusion, attributed in part to an incorrect Excel formula, that there is a link between high government debt and low economic growth. Peer review showed that this conclusion was flawed, but austerity, and its consequences, is here to stay. Errors in spreadsheets can have devastating consequences.

Even if initial calculations are correct, Excel applications often rely on manual "copy and paste" operations (or allow other changes to data). The "London Whale" trading debacle cost JPMorgan in excess of $6.2 billion. Internal analysis linked the problem to spreadsheet errors.

In these examples, the line between programming and data quality errors is blurred. Poor quality data is not just caused by errors in capture – it may be system generated.

Auto Format Errors

Excel is a default target for data extractions from a number of commercial applications.

If, for example, your ERP provides a default Excel report this may be used to provide an extract for data assessment, rather than having to wait for a more robust IT solution.

By default, Excel changes the format of long numeric values, displaying them in scientific notation and silently rounding anything beyond 15 significant digits.

For example, Excel converts both “4901081045611230” and “4901081045611440” to “4.90108E+15”.

In this example, we have gone from two unique credit card numbers to a single duplicated number.
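A minimal Python sketch (not Excel itself) illustrates the collapse: rendering both card numbers in scientific notation with six significant digits, as Excel's default display does, produces one identical string for two distinct values.

```python
# Two distinct (hypothetical) 16-digit card numbers.
card_a = 4901081045611230
card_b = 4901081045611440

# Render each the way Excel's default cell format displays long numbers:
# scientific notation with six significant digits.
shown_a = f"{card_a:.5E}"
shown_b = f"{card_b:.5E}"

print(shown_a)             # 4.90108E+15
print(shown_a == shown_b)  # True: two unique numbers, one displayed value
```

Once the displayed value is all that gets saved or re-exported, the original digits are unrecoverable.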

This problem can also creep into delimited text files (such as CSV files) if Excel is used to open them (say, to view them) and they are then saved.

In most cases, this problem will only affect downstream applications, but if the Excel data is used to update the source, the changes become permanent.

The problem can be avoided by importing numeric data into Excel as text. However, this is not the default behaviour, and most users will struggle to get it right. It is reasonable to assume that most Excel extracts will contain this kind of problem.
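The same "treat identifiers as text" principle applies when processing extracts programmatically. A sketch using Python's standard csv module (with made-up sample data): identifier columns are left as strings, and only genuinely numeric fields are converted.

```python
import csv
import io

# Hypothetical extract as it might leave an ERP system.
raw = "card_number,amount\n4901081045611230,10.50\n4901081045611440,99.99\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Convert only genuinely numeric fields; identifier columns stay as text.
for row in rows:
    row["amount"] = float(row["amount"])  # a real quantity: safe to convert
    # row["card_number"] is deliberately left as a string:
    # identifiers are labels, not numbers.

unique_cards = {row["card_number"] for row in rows}
print(sorted(unique_cards))  # both 16-digit values survive intact
```

The design choice is the point: a card number, account code, or phone number is never something you will do arithmetic on, so there is no reason to let any tool interpret it numerically.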

So should we stop using Excel?

Excel will remain a common data source in spite of these issues.

What is important is that we use our awareness of the data quality risks inherent in Excel to avoid exacerbating the problem.

Have more than one person check Excel calculations.

Do month-on-month calculations give similar results? Big changes may indicate an error.
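That sanity check is easy to automate. A sketch with hypothetical monthly totals and an arbitrary 20% threshold, flagging any month-on-month swing above it for manual review:

```python
# Hypothetical monthly commission totals; the 20% threshold is an
# illustrative choice, not a standard.
monthly_totals = {"Jan": 102_500, "Feb": 98_700, "Mar": 61_200, "Apr": 99_100}
THRESHOLD = 0.20

flagged = []
months = list(monthly_totals)
for prev, curr in zip(months, months[1:]):
    change = (monthly_totals[curr] - monthly_totals[prev]) / monthly_totals[prev]
    if abs(change) > THRESHOLD:
        flagged.append((prev, curr, change))
        print(f"{prev} -> {curr}: {change:+.0%} -- worth a second look")
```

A flagged month is not necessarily wrong, but it is exactly the kind of jump a broken formula or a bad paste tends to produce.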

Where possible, avoid Excel use for data extracts.

Where exponential numbers show up in data, assume this is an Excel change and compare to the original source.
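Scanning for such values can be scripted. A sketch using a simple (assumed) pattern for Excel-style scientific notation; real extracts may need a broader pattern:

```python
import re

# Matches Excel-style scientific notation such as "4.90108E+15".
EXPONENT_RE = re.compile(r"^\d\.\d+E[+-]\d+$")

def looks_excel_mangled(value: str) -> bool:
    """Flag values that look like Excel's scientific-notation rendering."""
    return bool(EXPONENT_RE.match(value.strip()))

samples = ["4.90108E+15", "4901081045611230", "12.50"]
print([v for v in samples if looks_excel_mangled(v)])  # ['4.90108E+15']
```

Any flagged value should be traced back to the original source system before it is trusted.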

Avoid overwriting the original source data with data that has been through Excel.