Source Data: Deleted Vs Purged.

Source Data: Deleted Vs Purged.

Hi Folks,I had an interesting thought cross my mind today.Currently we are in the process of building up and designing our first warehouse. This is also the first warehouse project that I have been involved in.In a few of our source systems, data is both deleted, in such that we pretend it never existed in the first place and data is purged, which I am going to define as still relevant data, just removed from the system to create space (our purging routines includes creating text reports of the data that is purged, which is then stored in our reports archive).In an ideal world, data that is "deleted" should be tombstoned - I.E., marked as no longer relevant and if needed, removed further down the track, but as I am finding with a lot of our legacy systems, this is rarely the case.

Based on what I have written above, is it common to classify data that is "deleted" differently to data that is "purged" in a warehouse and/or ODS?

Yes - one of the reasons you might want a data warehouse is to to keep historical records when your source transaction systems don't, i.e. after they have been "purged". To achieve this objective, purged records must remain in the warehouse, and they may look no different from "normal" records that still exist at source.

Conversely, records that are true deletions at source must be removed from the warehouse. They are qualitatively different from purged records in that they represent facts that are no longer true, rather than facts that are still true but no longer represented at source.

To make this work you need to ensure that purged records don't appear to ETL processes as deletions. This requires source system extracts to indicate when a record has been deleted at source, and *not* to indicate a record deleted when it has merely been purged (or something similar).