5 Restartability Design Pattern for Different Type ETL Loads

Restartable ETL jobs are very crucial to job failure recovery, supportability and data quality of any ETL System. So you need to build your ETL system around the ability to recover from abnormal ending of a job and restart. So a well designed ETL system should have a good restartable mechanism. In this article lets discuss ETL restartability approaches to support different type of ETL Jobs such as Dimension loads, Fact Loads etc...

What is ETL Restartability

Restartability is the ability to restart an ETL job if a processing step fails to execute properly. This will avoid the need of any manual cleaning up before a failed job can restart. You want the ability to restart processing at the step where it failed as well as the ability to restart the entire ETL session.

Slowly Changing Dimension

Fact Table

Snapshot Table

Current State Table

Very Large Table

1. Slowly Changing Dimension Load

Key Design Factor : The key aspect of this design is the CHECKSUM Number comparison. As per this design, any incoming record with the same CHECKSUM number of the active record in the target will not be processed into the Dimension table, Hence we can restart the process with out impacting the partially processed data.

Lets see this in bit more detail.

Step 1 : In this step, we will read all the data from the staging table. This will include joining data from different tables and applying any incremental data capturing logic.

Step 2: Data will be compared between source and target to identify if any change in any of the attributes. CHECKSUM Number can be used to make this process simple.

Step 3 : If the check CHECKSUM Number is different, Data is processed further, else ignored.

Step 4 : Do any transformation required, including the error handling.

4. Current State Table Load

Just like SCD Type 1, there are scenarios, where you are interested to keep only the latest state of the data. Here we are discussing a very common and simple approach to achieve restartability for such scenarios.

5. Very Large Table Load

The approach we are discussing here is appropriate for loading very large snapshot table , which is required to be available 24/7. You can read the complete design from this article.

Key Design Factor : Switching the tables using RENAME DDL Command.

Below is the high level design.

Lets see this in bit more detail.

Step 1 : In this step, we will read all the data from the source table. This will include joining data from different tables and applying any incremental data capturing logic.Step 2 : Perform any transformations that is required, including the error handling.Step 3 : Load the data into the TEMP Table.

Step 4 : Rename the TEMP table to the Target table. This will move the data from the TEMP table to the actual target table.