Data Governance Begins at the Spreadsheet

Data management professionals have long, and sometimes rather quixotically, pushed organizations to “get past the spreadsheet culture.” Maybe that’s misguided. The recent furor over a widely read social science paper may show how we can look to scientific peer review for a way to govern data, spreadsheets and all.

A key study underpinning debt reduction as a driver of economic growth was recently found to have based its conclusions on a flawed spreadsheet. As this Ars Technica article describes, Carmen Reinhart and Kenneth Rogoff’s Growth in a Time of Debt seemingly proved a connection between “high levels of debt and negative average economic growth”. But, per a recent study by Thomas Herndon, Michael Ash, and Robert Pollin, it turns out that the study’s conclusions drew from a Microsoft Excel formula mistake, questionable data exclusions, and non-standard weightings of base data. According to the Ars Technica piece, those conclusions fade to a more ambiguous outcome once the errors and apparent biases are corrected.
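To see how much a short formula range or a weighting choice can move a headline number, here is a minimal sketch in Python. The countries and growth rates are invented for illustration and are not figures from either study; the function names are likewise hypothetical.

```python
from statistics import mean

# Hypothetical growth rates (percent), one list of yearly observations per
# country. Illustrative values only, not data from the studies discussed.
growth_by_country = {
    "A": [2.0, 3.0, 2.5, 3.5],
    "B": [1.0],
    "C": [-4.0],
    "D": [2.5, 3.0],
    "E": [3.0, 2.0, 4.0],
}

def mean_of_country_means(data):
    """Weight every country equally, however many years it contributes."""
    return mean(mean(years) for years in data.values())

def pooled_mean(data):
    """Weight every country-year observation equally."""
    return mean(x for years in data.values() for x in years)

# Simulate a formula range that stops short, like an AVERAGE() that
# references only the first three rows of a five-row table.
truncated = dict(list(growth_by_country.items())[:3])

full_equal = mean_of_country_means(growth_by_country)
short_equal = mean_of_country_means(truncated)
full_pooled = pooled_mean(growth_by_country)

print(f"all rows, equal country weights: {full_equal:.2f}")
print(f"truncated range, equal weights:  {short_equal:.2f}")
print(f"all rows, pooled observations:   {full_pooled:.2f}")
```

With these made-up numbers, the truncated range turns a positive average negative, and the two weighting schemes disagree even when every row is included. Neither weighting is wrong in itself; the point is that the choice is an assumption a reviewer should be able to see and question.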

I’m not trying to make a political point here, but clearly this mistake must be distressing to politicians who cited the study as a basis for economic policy proposals. It is equally a cautionary tale to those in business whose complex spreadsheets drive their analyses, plans, and decisions.

As Jim King astutely points out, the spreadsheet is today’s de facto business analysis tool of choice due to its “low technical requirement, intuitive and flexible calculation capability, and business-expert-oriented easy solution to 80% of BI problems”. In my experience, business people view advanced BI and data visualization projects as data delivery platforms for Excel. “Can I download that report into Excel?” might be the most-asked question in BI presentations to end-users. How can organizations address the risk that they might base big decisions on invalid spreadsheets?

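One answer starts with the notice that the National Bureau of Economic Research (NBER) attaches to its working papers, the form in which Growth in a Time of Debt circulated:

“NBER working papers are circulated for discussion and comment purposes. They have not been peer reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.”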

Scientific papers published in reputable journals endure rigorous peer review, in which editors distribute the submitted draft to peer scientists for evaluation and comment. Reviews are detailed and comments can be harsh, sometimes calling for the authors to put in months of rework before resubmission. Growth in a Time of Debt hadn’t been through that process, and shouldn’t have been relied upon yet as a guide for policy makers.

The moral for business is this: make peer review a part of key spreadsheet analyses. Every important spreadsheet should undergo review by three or four peer analysts, and be corrected according to the results of their review before use as a basis for decision-making. Here is a list adapted from one description of peer review (see page 8, here) that spreadsheet reviewers might use:

Are the question or questions answered by the spreadsheet clear?

Was the approach appropriate?

Does the spreadsheet integrate data from appropriate sources?

Are the spreadsheet design, methods and analyses appropriate to the question being studied?

Does the spreadsheet add to existing knowledge, or does it repeat other previous documents that might have answered the same questions?

Are the methods described clearly enough for reviewers to understand and replicate?

Are calculations, statistical analyses, and levels of significance appropriate and correct?

Could presentation of the results be improved and do they answer the question?

If non-public information was involved, was ethics approval gained and was the analysis ethical?
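The question about correct calculations lends itself to a simple mechanical check: a reviewer recomputes a reported figure from the raw rows instead of trusting the cell. The sketch below assumes the reviewer has exported the raw values and noted how many rows the formula referenced; the function name and its parameters are hypothetical.

```python
import math

def review_reported_average(raw_values, reported, rows_used, tolerance=1e-6):
    """Independently recompute an average and return a list of findings.

    raw_values: every data row the analysis is supposed to cover
    reported:   the figure shown in the spreadsheet
    rows_used:  how many rows the spreadsheet formula actually referenced
    """
    if not raw_values:
        raise ValueError("no raw data supplied for review")

    findings = []
    # Check that the formula range covers the whole data set.
    if rows_used != len(raw_values):
        findings.append(f"formula covers {rows_used} of {len(raw_values)} rows")

    # Recompute from scratch and compare against the reported figure.
    recomputed = sum(raw_values) / len(raw_values)
    if not math.isclose(recomputed, reported, abs_tol=tolerance):
        findings.append(f"reported {reported} but recomputed {recomputed:.4f}")
    return findings

# Hypothetical example: the sheet averaged only the first two of four rows.
raw = [2.0, 3.0, -1.0, 4.0]
for finding in review_reported_average(raw, reported=2.5, rows_used=2):
    print("REVIEW:", finding)
```

An empty list means the figure replicates. A check like this covers only the arithmetic; whether the sources and exclusions are appropriate still takes a human reviewer.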

The interesting thing about these review questions is that they raise larger ones, which is where data governance comes in. Instead of bemoaning persistent business dependence on “spreadmarts”, data governance advocates should forget about the tool and focus on data practices by helping the business define standards for data usage. What are appropriate data sources for a spreadsheet? What is ethical use of non-public information? What are appropriate calculations, statistical analyses, and levels of significance? And so on.

Maybe data governance will take hold in more organizations when it starts promoting age-old practices applied in science to the spreadsheets upon which business depends.