I’ve finally had the chance to sit down and properly closeout this series on GoldenGate and Oracle Data Integrator. Since the last post, I’ve actually presented this topic on three different occasions. Once at each of the Rittman Mead BI Forums, in both Brighton and Atlanta, as a part of the ODI Masterclass, and once in New Orleans at KScope13. Hopefully you were able to attend one of these sessions and learn a few new things about data integration.

In the previous post, “GoldenGate and Oracle Data Integrator – A Perfect Match… Part 2: Replicate to Staging and Foundation”, I described the setup and configuration of GoldenGate via ODI 11g at a high level. Now we have the GoldenGate parameter files configured, replication is active from the source to both Staging and Foundation layer schemas, and the ODI Change Data Capture (CDC) framework is in place. In this, the final post in the series, I will switch the focus to ODI Change Data Capture (CDC), walking through several options to using change data in downstream ETL processes.

Real-Time ETL Using Oracle Data Integrator CDC

ETL mappings, built as Interfaces in Oracle Data Integrator 11g, are developed to populate the Access and Performance layer. Remember from previous posts that this layer is where the traditional star schemas are built and accessed by reporting and analytical tools, such as OBIEE. The Interfaces will use the ODI Change Data Capture (CDC) framework to flow data captured in the change tables through to the target. To recap, ODI CDC is setup to identify, capture, and deliver changes made to data in the source database. When Journalizing is started on a Model, not only are the GoldenGate parameter files created, but also the ODI CDC Framework is generated. The CDC framework includes the following:

Journals – tables (prefixed with J$) that hold references to the change records and the change type (insert/update/delete)

Journalizing views – (prefixed with JV$, JV$D) provide access to the change data by joining the journal table to the fully replicated table, and are used by IKM’s and LKM’s to access the change rows

Subscribers – entities that consume the changed data as a consistent set

When the change data is consumed for a specific set of tables, the “window” is extended for that change set. This will essentially freeze the set of change data by only allowing transactions between the minimum and maximum WINDOW_ID (equivalent to the SCN) to be consumed, ensuring that no transactions are missed and allowing for consistency across the dataset. Once the data has been processed, the set of change data for that given window is purged from the change tables. More detailed information on ODI Change Data Capture can be found here.

Using Journalized Data

Once journalizing is setup and changes are being captured, building an interface for real-time data warehousing is quite simple. Well, sort of simple. Add the journalized Datastore to the Interface as a source and check the “Journalized data only” checkbox. This will change the source of the Interface in the generated code to be the JV$ change view rather than the actual table. The JV$ view will contain only the change rows available within the window after the “extend window” process is called. The Interface is executed and the journal is purged, setting up the process for the next run.

Now, I did say this process is sort of simple. There is a restriction that only one journalized Datastore can be used per Interface. So in the case that both the EMPLOYEE and DEPARTMENT tables are journalized, only one is allowed to have the “Journalized data only” checkbox selected when used as a source in the Interface. Fortunately, there are several ways to work around this limitation.

Parent-Child Relationship

Often a join between two source tables is due to a parent-child relationship: Department->Employee, Order->Order Line, etc. In this situation, and with any join really, we want to capture the change data from each individual table and only lookup additional data from the other table via the join when necessary to complete the full record. Unfortunately, this setup is not possible in a single Interface as was just described. So how do we work around it?

We must create two Interfaces; both with the exact same logic, source Datastores, and target Datastore. The only difference will be which source Datastore is using the “Journalized data only” option. This approach will ensure that no transaction is left behind and all changes will flow through to the target with the appropriate lookup data filled in for each row.

But this might not be the best approach if there are 3 or more Datastores to join. Imagine if there were 12 source Datastores and then some logic in the where clause changes, forcing an update to the ETL code. That makes 12 Interfaces to modify just for a single change!

ODI Cookbook Example

There is another workaround, outlined in the new book “Oracle Data Integrator 11g Cookbook”, that takes a similar approach but with potentially less maintenance headaches. This method separates the final interface, and all of its logic, from the activity of ensuring that each table has its changes processed along with the additional lookup data from other supporting tables. I won’t go into details here (the book is well worth the purchase), but I can say this approach does look promising.

Subscription Views

A final option for moving data from the change tables through to the target facts and dimensions as quickly as possible can be achieved using what we call subscription views. The goal here is to always return a consistent set of data and to let the ETL developer make the choice as to how this is accomplished. In this solution, the “Journalized data only” checkbox is never checked for any of the source Datastores in the Interface. Instead, we create a view for each of the Staging schema tables and include a join to the ODI J$ change table.

The ETL developer can then reverse engineer the views into an ODI Model and use them as the source Datastores in the transformation interfaces. There are two flags that are added to the view to allow the developer a choice of data source.

STAGE_IND – Indicates the rows that are ready to be consumed for the first time. In other words, these rows have not yet been “seen” by the process.

CURRENT_IND – The most recent version of that row by natural key, which, timing aside, matches the record in the source.

Using a combination of these indicators, the developer can choose to return only change rows (STAGE_IND = ‘Y’) or return all current rows (CURRENT_IND = ‘Y’). This works well for incremental processing of data, but let’s go one step further. What if there is a need to reload the fact and dimension tables using the historical data in the Foundation layer schema? A view can be created which combines all three sets of data: Change rows (J$ table), Current replicated rows (Staging table), and Full transactional history (Foundation table).

In this case, we need to perform a bit of SQL analytics using the SCN and natural key to determine the CURRENT_IND from the set of historical data in the Foundation layer, but beyond that the view is effectively the same. Now we can use a combination of the indicators to return various datasets, such as all historical rows, current rows, or just the latest change rows.

Shouldn’t the join between the two in-line views be connected using a FULL OUTER JOIN? If I understand correctly, the table in EDW_STG is an exact replicate of the source table. So, in your example, if you delete a player from a team, that row is physically deleted from EDW_STG.OFFENSE_PLAYMAKERS, and is added to EDW_STG.J$OFFENSE_PLAYMAKERS with a JNL_FLAG = ‘D’. If EDW_STG.OFFENSE_PLAYMAKERS drives the join, then the row in EDW_STG.J$OFFENSE_PLAYMAKERS is excluded from the join, right? You would need a FULL OUTER JOIN to capture the staged “delete” transactions.

@Brian – you bring up a good point about the full outer join when using the fully replicated table in the subscription views. When using the foundation table, these rows exist in both places as all transactional history is loaded into foundation, so the full outer join would not be necessary.

@Tanveer – you can use ODI for the entire process, yes. If you don’t have an external replication or CDC tool, such as Oracle GoldenGate or Oracle Streams, ODI can implement change data capture by adding triggers to the source tables. These triggers would capture any change to the source table and load this change data into the J$ table on the source. ODI Interfaces would then need to be built to gather the source change data and load into Staging / Foundation in the data warehouse.

Using GoldenGate, you already have the data replicated from source to target, so you can eliminate the need for those additional Interfaces – as well as limit the impact on the source tables.

Regarding examples – I would start with the ODI documentation “Working with Change Data Capture”. Glad you enjoyed the post!

Hi Michael,
really appreciate your time for responding to my question. You mentioned adding triggers to the source tables but DBA may not allow that. Is there any other option to do CDC using ODI e.g. using database logs? Is ODI widely being used for large volume CDC?

@Tanveer – Take a look at the ODI documentation I mentioned in the previous comment. The other option for reading change data from DB logs, besides GoldenGate, is Oracle Streams – but beware, this technology has been deprecated.