Subject profiling in a data warehouse project

Subject profiling examines subjects in different tables or on different systems and helps to find where the information about each subject is stored.

Subject profiling by data source

Subject profiling by data source examines which subjects appear in each database or source system. The objective is to understand which sources can be used effectively for data mapping, data quality rules, and data cleansing.

The basic procedure in subject profiling by data source is to populate the subject master list with subject metadata showing which subjects can be found in each data source. These atomic data can then be aggregated in different reports to show where the most subject data can be found for different subject populations, and which sources have the highest data overlap.
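As a minimal sketch of this procedure, the Python below builds a subject master list as a presence matrix and derives two aggregate reports from it. The source names (crm, billing, claims), the subject IDs, and the function names are hypothetical and chosen only for illustration. In a real project the ID sets would come from queries against each source.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: subject IDs found in each source system.
source_subjects = {
    "crm": {"S001", "S002", "S003"},
    "billing": {"S002", "S003", "S004"},
    "claims": {"S003", "S005"},
}


def build_master_list(source_subjects):
    """Return {subject_id: {source: True/False}} for every subject seen in any source."""
    all_subjects = set().union(*source_subjects.values())
    return {
        subject: {source: subject in ids for source, ids in source_subjects.items()}
        for subject in sorted(all_subjects)
    }


def subjects_per_source(master):
    """Aggregate report: number of subjects present in each source."""
    counts = Counter()
    for flags in master.values():
        for source, present in flags.items():
            if present:
                counts[source] += 1
    return dict(counts)


def source_overlap(master):
    """Aggregate report: number of subjects shared by each pair of sources."""
    overlap = Counter()
    for flags in master.values():
        present = sorted(source for source, found in flags.items() if found)
        for pair in combinations(present, 2):
            overlap[pair] += 1
    return dict(overlap)


master = build_master_list(source_subjects)
print(subjects_per_source(master))
print(source_overlap(master))
```

Keeping the master list at the atomic level (one flag per subject and source) means that both reports, and any later ones, can be derived without going back to the source systems.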

Subject profiling by entity

Additional subject profiling provides information about the presence of subject data by entity. This information is critical for data mapping and for designing data quality rules. Without it, it is impossible to select the correct data source for a mapping without running the risk that the source lacks data for a large group of subjects. The basic procedure in subject profiling by entity is to collect counts of subjects and records for each entity.
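The counting step can be sketched in a few lines of Python. The entity names, the (entity, subject ID) row layout, and the coverage ratio are assumptions made for this illustration. The coverage figure is one simple way to flag an entity that lacks data for a large group of subjects.

```python
from collections import Counter, defaultdict

# Hypothetical rows extracted from one data source: (entity, subject_id).
records = [
    ("address", "S001"),
    ("address", "S001"),
    ("address", "S002"),
    ("phone", "S002"),
    ("order", "S001"),
    ("order", "S003"),
    ("order", "S003"),
    ("order", "S003"),
]


def profile_by_entity(records, total_subjects):
    """Count records and distinct subjects for each entity.

    total_subjects is the size of the subject master list (assumed known).
    """
    if total_subjects <= 0:
        raise ValueError("total_subjects must be positive")
    record_counts = Counter()
    subjects = defaultdict(set)
    for entity, subject_id in records:
        record_counts[entity] += 1
        subjects[entity].add(subject_id)
    return {
        entity: {
            "records": record_counts[entity],
            "subjects": len(subjects[entity]),
            # Share of all known subjects that have data in this entity.
            "coverage": len(subjects[entity]) / total_subjects,
        }
        for entity in sorted(record_counts)
    }


profile = profile_by_entity(records, total_subjects=4)
for entity, stats in profile.items():
    print(entity, stats)
```

An entity with many records but low coverage, such as the order entity in this sample, holds data for only a few subjects, which is the situation the profiling is meant to reveal before a mapping is chosen.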

Advanced subject profiling

In-depth subject profiling provides information about the location of data for individual subjects. The objective is to understand which entities can be used effectively for data mapping, data quality rules, and data cleansing. For instance, without this information it is difficult to decide which entities hold data that can be cross-checked for data quality.

The basic procedure in advanced subject profiling is to populate the subject master list (or a child table) with subject metadata showing the count of records for each subject in each entity. These atomic data can then be aggregated in different reports to show where the most subject data can be found for different subject populations, and which entities have the highest data overlap. These data are also very useful for creating test cases.