Tag Archives: Data Quality

There’s consensus among data quality experts that, generally speaking data quality is pretty much bad (here, here, and here). Data quality approaches generally focus on profiling, managing, and correcting data after it is already in the system. This makes sense in a data science or warehousing context, which is often where quality problems surface. To quote William McKnight at the first of those sources:

“Data quality is no longer the domain of just the data warehouse. It is accepted as an enterprise responsibility. If we have the tools, experiences, and best practices, why, then, do we continue to struggle with the problem of data quality?”

So if the data quality problem is Garbage In Garbage Out (GIGO), then I would think that it would be easy to find data quality guidelines for app dev, and that those guidelines would be lightweight and helpful to those projects. Based on my research there are few to none such sources (please add them to the comments if you find otherwise).

At the very first TDWI Conference, Duane Hufford described a phenomenon he called “embedded data”, now more commonly called “overloaded data”, where two or more concepts are stuffed into a single data field (“Metadata Repositories,” TDWI Conference 1995). He described and portrayed in graphics three types of overloaded data. Almost 20 years later, overloaded data remains rampant but Mr Hufford’s ideas, presented below with updated examples, are unfortunately not widely discussed.

Overloaded data breeds in areas not exposed to sound data management techniques for one reason or the other. Big data acquisition typically loads data uncleansed, shifting the burden of unpacking overloaded fields to the receiver (pity the poor data scientist spending 70% of her time acquiring and cleaning data!)

One might refer to non-overloaded data as “atomic”. Beyond making data harder to use, overloaded data requires more code to manage than atomic data (see why in the sections below) so by extension it increases IT costs.

Here’s a field guide to three different types of overloaded data, associated risks, and how to avoid them: Continue reading →

Application developers and business people accessing relational databases need data dictionaries in order to properly load or query a database. The data dictionary provides a source of information about the model for those without model access, including entity/table and attribute/column definitions, datatypes, primary keys, relationships among tables, and so on. The data dictionary also provides data modelers with a useful cross reference that improves modeling productivity.

It is particularly useful for the dictionary to be a filterable/sortable Excel document, but out of the box ERwin, one of the leading data modeling tools, includes a notably inflexible reporting capability. Luckily, it is possible to directly query the ERwin “metamodel”. However, I found the ERwin documentation a bit hard to decipher and not quite accurate. Hopefully this post will save modelers some steps in figuring out how to query the metamodel.

The data integration process is traditionally thought of in three steps: extract, transform, and load (ETL). Putting aside the often-discussed order of their execution, “extract” is pulling data out of a source system, “transform” means validating the source data and converting it to the desired standard (e.g. yards to meters), and load means storing the data at the destination.

The well-publicized problems with healthcare.gov are disturbing, especially when we remember they might result in many continuing without health insurance. But it seemed a step in the right direction when recent a news report differentiated between “front end” and “back end” problems. The back end problems were data issues, like a married applicant with two kids being sent to an insurer’s systems as a man with three wives.

Coincidently, I recently responded to a questionnaire about health care data. I’ve paraphrased the questions and my responses below. Perhaps the views of someone who’s spent a lot of time in the health care engine room might provide some useful perspective. Continue reading →

Data management professionals have long and sometimes rather Quixotically driven organizations to “get past the spreadsheet culture.” Maybe that’s misguided. The recent furor over a widely read social science paper may show how we can look to scientific peer review for a way to govern data, spreadsheets and all.

Recently, it was found that a key study underpinning debt-reduction as a driver of economic growth based its conclusions on a flawed spreadsheet. As this ArsTechnica article describes, Carmen Reinhart and Kenneth Rogoff’s Growth in a Time of Debt seemingly proved a connection between “high levels of debt and negative average economic growth”. But, per a recent study by Thomas Herndon, Michael Ash, and Robert Pollin, it turns out that the study’s conclusions drew from a Microsoft Excel formula mistake, questionable data exclusions, and non-standard weightings of base data. The ArsTechnica piece finds those conclusions fade to a more ambiguous outcome with errors and apparent biases corrected. Continue reading →

As important as it is, data modeling has always had a geeky, faintly impractical tinge to some. I’ve seen application development projects proceed with a suboptimal, “good enough”, model. The resulting systems might otherwise be well-architected, but sometimes strange vulnerabilities emerge that track directly to data design flaws.

Recently I saw an example where a “good enough” data design, similar to the one pictured, enabled a significant application bug.

In some presentations, I assert that top-down data modeling should result in not only a business-consistent model but also a pretty well normalized model.

One of the basic concepts behind normalization is functional dependency. In layperson’s terms, functional dependency means separating entities from each other and putting attributes into the obviously correct entity. For example, a business person knows that item color doesn’t belong in the order table because it describes the item, not the order. Everyone knows that the order isn’t green! Continue reading →

Recently I was in a conversation about data modeling standards. I confess that I’m not really the standards type. I understand the value of standards and especially how important it is to follow them so others can interpret and use work products. It is just that I prefer to focus on understanding of the principles behind the standards. In general, it seems to me that following standards is trivial for someone who understand the principles, but impossible for someone who doesn’t. But there doesn’t seem to be general understanding of data modeling principles. Continue reading →

Data quality in most large organizations is commonly known to be rather lacking. Most would argue that things haven’t gotten much better since this 2007 Accenture study found that “Managers Say the Majority of Information Obtained for Their Work Is Useless”. To some, quotes like that are shocking, but if you think about how information is processed in most Fortune 1000 sized organizations it is surprising that data available to managers is as good as it is. These slides have been useful in my efforts to explain the persistence of data quality problems in large organizations. Continue reading →