The value of data scrubbing - and where should it really start?

Anyone who has tried to run a report from a large data repository probably understands the value of data cleansing, or what is called "scrubbing". This step is sometimes skipped as part of a Data Warehouse implementation, but it really is one of the most critical steps to having a good end product. The old expression that 'the system is only as good as the data and the people who use it' is all too true. Unfortunately, the lack of data scrubbing is often noticed only after the project is complete. At that point, there may be millions of records loaded that have data issues. This can cause all types of problems, such as inaccurate counts of specific events or misstatement of dollars. (All of which can have a very negative impact on the organization.)

But what can you do as a technical implementer to reduce data issues? From my experience it has been advantageous to address these issues as part of the analysis phase. When looking at the system inputs, review the actual source data. Look at the relationships between the tables to see where inaccuracies would cause duplication of entries. For example, let's say you have a table called members and a related table for addresses. If a member's last name is spelled incorrectly, it may cause duplication of that record in the members table. So now your member, once loaded into a DW, would effectively link to two or more addresses. To really complicate matters, suppose the member's last name is spelled incorrectly three times and the address is typed in incorrectly three times as well, something as simple as entering ST instead of STREET. Now you have three members who link to three different addresses.

But, you may ask yourself, "How can I control this? The data is entered incorrectly by the users." And that is a valid question. There are different algorithms used to identify patterns in records and create a crosswalk showing which records 'look' like they should be the same (a rough sketch of that kind of matching appears below). But the value of these tools depends on the quality of the data. I have yet to figure out how a tool like this would identify that 'Albuquerque, NM' and 'ALBQUOAY' are the same place, especially if the other information was incorrect as well. These are the types of issues you face while trying to "clean up" bad data.

My personal preference is to clean the data up at the source. Way back in the old days there was something called a data validation step. When applications were developed, part of the design process was to look for potential manual typing mistakes and write routines to reduce the amount of user error. Lookup lists of cities, states and zip codes that were cross-matched against each other for inaccuracies were standard. First and last names were compared to the current information for possible matches; the user was presented with a list and allowed to pick one, and adding a brand-new name took an extra step. But somewhere along the way, some applications (especially healthcare and health plan related) started skipping that step. Many health-related data stores suffer from these data quality issues. It starts at the service areas: the lack of standardization in the software used to enter patient information results in very high error rates, and most of these errors are very difficult to identify using standard data matching software.

So, where should this data scrubbing start? In my opinion, at the source application. Bad data fed into a large store of bad data just becomes a really large store of bad data.
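To make the crosswalk idea a little more concrete, here is a minimal sketch in Python using only the standard library's difflib. The member records, the field names, and the 0.8 similarity cutoff are all invented for illustration; real record-linkage tools use far more sophisticated blocking and scoring than a simple string ratio.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy member records; the names and addresses are invented for illustration.
members = [
    {"id": 1, "last_name": "Johnson", "city": "Albuquerque, NM", "street": "12 Main Street"},
    {"id": 2, "last_name": "Jonson",  "city": "Albuquerque, NM", "street": "12 Main St"},
    {"id": 3, "last_name": "Johnsen", "city": "ALBQUOAY",        "street": "12 Main Street"},
]

def similarity(a: str, b: str) -> float:
    """Rough string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # arbitrary cutoff; real tools tune this per field

# Build a simple crosswalk of record pairs whose last names 'look' the same.
crosswalk = []
for left, right in combinations(members, 2):
    name_score = similarity(left["last_name"], right["last_name"])
    city_score = similarity(left["city"], right["city"])
    if name_score >= THRESHOLD:
        crosswalk.append((left["id"], right["id"], round(name_score, 2), round(city_score, 2)))

for pair in crosswalk:
    print(pair)

# 'Johnson' and 'Jonson' score well above the cutoff, so that pair lands in
# the crosswalk. 'Albuquerque, NM' vs 'ALBQUOAY' scores very low, which is
# exactly the kind of garbled value this sort of matching cannot recover.
```

Even in this toy example, the name typo is easy to catch, but the garbled city never scores high enough to confirm that two records are really the same person.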
Identifying and trending this data for errors would be a nightmare. There is some fairly inexpensive software out there that promises to identify and "deduplicate" your data. But think about how long it will take to review the results of cross-matching millions of records. And then, if the software is not applied as new data is loaded, the problem will continue to grow. Wouldn't it just be easier to reduce the errors at the source? A small sketch of what that might look like follows below. As usual, a little more time spent in analysis and planning could potentially save a lot of time spent later. But that would be too easy, right?
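To give a flavor of that old-fashioned validation step, here is a minimal sketch, again in plain Python with the standard library. The lookup table, the existing-name list, and both helper functions are hypothetical stand-ins; a real intake application would back them with authoritative city/state/ZIP reference data and the actual member index.

```python
from difflib import get_close_matches

# Tiny reference tables; hypothetical stand-ins for authoritative
# city/state/ZIP data and the application's existing member index.
VALID_CITY_STATE_ZIP = {
    ("ALBUQUERQUE", "NM"): {"87101", "87102", "87104"},
    ("SANTA FE", "NM"): {"87501", "87505"},
}
EXISTING_LAST_NAMES = ["JOHNSON", "JONES", "JOHNSTON"]

def validate_address(city: str, state: str, zip_code: str) -> list[str]:
    """Cross-match city, state, and ZIP against the lookup table."""
    errors = []
    key = (city.strip().upper(), state.strip().upper())
    if key not in VALID_CITY_STATE_ZIP:
        msg = f"Unknown city/state combination {key}"
        hint = get_close_matches(key[0], [c for c, _ in VALID_CITY_STATE_ZIP], n=1)
        if hint:
            msg += f"; did you mean {hint[0]}?"
        errors.append(msg)
    elif zip_code not in VALID_CITY_STATE_ZIP[key]:
        errors.append(f"ZIP {zip_code} does not match {key}.")
    return errors

def suggest_existing_members(last_name: str) -> list[str]:
    """Offer likely existing matches before a brand-new record is created."""
    return get_close_matches(last_name.strip().upper(), EXISTING_LAST_NAMES, n=3, cutoff=0.8)

# The entry screen would flag 'ALBQUOAY' as an unknown city instead of
# storing it, and would surface 'JOHNSON' as a likely existing member
# before letting 'JONSON' be saved as a new record.
print(validate_address("Albquoay", "NM", "87101"))
print(suggest_existing_members("Jonson"))
```

A few dozen lines of checks like these at data entry are far cheaper than cross-matching millions of already-loaded records after the fact.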