There are many deployment models available for InfoSphere Data Replication's CDC technology, of which DataStage integration is a popular one. The deployment option selected will significantly affect the complexity, performance, and reliability of the implementation. Where possible, the best solution is to use CDC direct replication (i.e. do not add DataStage to the mix).

CDC integration with DataStage is the right solution for replication when:

You need to target a database that CDC doesn't directly support and that is not appropriate for CDC FlexRep

Complex transformations are required that could not be handled natively with CDC, such as complex table look-ups

You are integrating with MDM

Cons of replicating from CDC to DataStage to an eventual target database:

Performance going through DataStage (no matter which integration option is chosen) will be significantly slower than applying via a CDC target directly to the database

The exception to this rule is Teradata: if you use DataStage flatfile integration, the throughput will be higher than with CDC direct to Teradata

The maximum number of tables per CDC subscription is lower if targeting DataStage

The CDC External Refresh does not work when targeting DataStage. A separate process would have to be put in place to de-duplicate the records produced during the "in-doubt" period of a refresh (the captured changes that occurred while the source data was being refreshed).

With a mere 4 weeks until IBM's 2013 Information on Demand, the data replication team thought it might be helpful to have a complete listing of all data replication sessions at IOD. From client presentations and our product roadmap to sneak peeks at new IBM Data Replication functionality, our sessions run the gamut!

Simply take a gander at the sessions below then go to the IOD agenda builder, click on Create Sign In, and then enter your confirmation number and the email address that you used to register for the conference. Create your agenda today!

First, what CDC function are you entitled to use in the DB2 Advanced Editions? The license is always the final word, but, in simple terms, you can only use the bundled CDC to build disaster recovery solutions where a primary DB2 instance* has up to two backup instances. For example, the following replication topology is allowed by the DB2 Advanced Edition licenses:

Furthermore, the disaster recovery use case limits your entitled use of CDC function in the following ways:

You can only use unidirectional (one-way) replication.

You can set up replication from the primary DB2 to the backup(s) but you cannot set up replication from the backup(s) to the primary. This fits with the definition of a pure disaster recovery solution since it provides for fail-over but not switchback. If you need CDC for both fail-over and switchback, you need to license the full IIDR product.

You cannot transform the data as it's replicated. Again, this fits with the definition of disaster recovery and you can license the full IIDR to be entitled to transformations as you replicate.

The question is - when do you need to buy CDC now? If you want to do anything more than what's described in this post, you'll need to buy IIDR for your DB2 Advanced Editions. The two most common replication configurations that require this are ones where you do either of the following:

Replicate between DB2 LUW and either DB2 z/OS or Oracle.

Set up an HA or Active-Active solution with IIDR's CDC technology.

If you need to understand more about these examples, we'll have pictures and add a few more examples in a future post that talks about when you need to buy CDC.

Shared Scrape (sometimes referred to as Single Scrape)

When multiple subscriptions are running in a single instance, it is usually advantageous to use a shared scrape mechanism. Without shared scrape, if you have 'n' subscriptions, CDC reads the log 'n' times. With shared scrape, CDC reads the log only once, which uses fewer system resources.

On by default for InfoSphere CDC LUW

You must configure the log cache for InfoSphere CDC z

Not available on InfoSphere CDC i or CDC Informix

You need to size the shared scrape cache appropriately for optimal performance:

If the cache is too small the following will occur:

LUW – A private scraper will be launched which will consume additional resources

Set the staging_store_disk_quota_gb system parameter appropriately to avoid this

Z - With the log cache, each subscription attempts to read its data from the cache – it will read directly from the IFI if the data is no longer available from the cache

Use the following keywords to configure the log cache: CACHELEVEL1SIZE, CACHEBLOCKSIZE, CACHELEVEL1RESERVED
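To illustrate why cache sizing matters, here is a toy model of a shared cache with a fallback to a direct log read. It is only a sketch of the idea described above, not how InfoSphere CDC is implemented: the class `SharedScrapeCache`, its methods, and the `capacity` value are all hypothetical names chosen for this example.

```python
from collections import deque


class SharedScrapeCache:
    """Toy model: one reader fills a bounded cache that many subscriptions share."""

    def __init__(self, capacity):
        # Hypothetical bounded cache of (position, payload) pairs; the oldest
        # entries are evicted automatically once capacity is reached.
        self.records = deque(maxlen=capacity)
        self.log_reads = 0  # number of records read from the log itself

    def scrape(self, position, payload):
        # The single shared scraper reads each log record once.
        self.log_reads += 1
        self.records.append((position, payload))

    def fetch(self, position, read_log_directly):
        # Serve the request from the cache while the position is still retained.
        for cached_position, payload in self.records:
            if cached_position == position:
                return payload, "cache"
        # Otherwise fall back to a direct read of the log, which costs extra.
        self.log_reads += 1
        return read_log_directly(position), "log"


log = {pos: f"change-{pos}" for pos in range(10)}
cache = SharedScrapeCache(capacity=4)
for pos in range(10):
    cache.scrape(pos, log[pos])

fast = cache.fetch(9, log.__getitem__)  # position 9 is still cached
slow = cache.fetch(2, log.__getitem__)  # position 2 was evicted, so the log is read again
```

In this model a subscription that keeps up is served from the cache, while one that has fallen behind the retained window triggers an additional log read. That is the extra resource cost an undersized cache incurs.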

Number of Tables in a Subscription

Rule of Thumb

This is certainly not a hard limit, but in general it is best to keep the number of tables in a subscription under 1000

Considerations for the number of tables include:

With too many tables (over 1000) in a subscription, loading and managing the tables in the Management Console GUI will be slow

This may not be a consideration if you are controlling your replication via scripting/automation

If the number of tables exceeds 1000, promotion in the Management Console will take a significant amount of time, and additional memory will need to be allocated

From an engine perspective:

With CDC LUW if you want to go beyond 1000 tables you need to increase the memory allocated to the InfoSphere CDC Instance

If the target is flatfile or HDFS, then an upper limit on the number of tables in the subscription is 800. Additionally, you would need to allocate some additional memory if you have more than a couple hundred tables.

CDC i can accommodate well over 2000 tables in a subscription

CDC z can accommodate well over 1000 tables in a subscription

Note that the number can be significantly higher, but there are implications for the number of subscriptions you can have, due to limits on below-the-bar memory
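The guidelines above can be turned into a simple planning check. The following is a minimal sketch: the function name `review_table_count` and the `target_type` labels are made up for this example, and the numbers are the rules of thumb quoted above, not limits enforced by the product.

```python
GENERAL_GUIDELINE = 1000    # rule of thumb for tables per subscription
FLATFILE_HDFS_LIMIT = 800   # upper limit quoted above for flatfile or HDFS targets


def review_table_count(table_count, target_type="database"):
    """Return planning notes for a subscription of the given size.

    target_type is a hypothetical label: "database", "flatfile" or "hdfs".
    """
    notes = []
    if target_type in ("flatfile", "hdfs") and table_count > FLATFILE_HDFS_LIMIT:
        notes.append(
            f"{table_count} tables exceeds the {FLATFILE_HDFS_LIMIT}-table "
            f"limit for {target_type} targets; split the subscription"
        )
    if table_count > GENERAL_GUIDELINE:
        notes.append(
            f"{table_count} tables is over the {GENERAL_GUIDELINE}-table rule of thumb; "
            "expect slower Management Console operations and plan for extra memory"
        )
    return notes


for note in review_table_count(1200, target_type="flatfile"):
    print(note)
```

A check like this is most useful in scripted deployments, where no one is looking at the Management Console to notice that a subscription has grown too large.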

Number of CDC Subscriptions Required

A Subscription is a logical container that describes the replication configuration for tables from a source to a target datastore. Once the subscription is created, you create table mappings within the subscription for the group of tables you wish to replicate

An important part of planning an InfoSphere CDC implementation is to choose the appropriate number of subscriptions to meet your requirements

Rule of Thumb:

The optimal approach is to start with the minimum number of subscriptions and add more only for valid reasons

This will ensure efficient use of resources as well as require a lower level of maintenance

It may require an iterative process before you have a good balance

The number of subscriptions will impact the resource utilization of the server (more CPU and RAM are needed) and performance of InfoSphere CDC

Note that tables with referential integrity or ones where the data must be synchronized at all times must reside in the same subscription since different subscriptions may be at different points in the log
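One way to work out which tables must stay together is to treat foreign-key relationships as links and group every set of linked tables. The sketch below does this with a small union-find. The function name, the table names, and the `(child, referenced)` pair format are assumptions for illustration. In practice the relationships would come from your database catalog, and tables that must stay synchronized for business reasons would need to be added as extra links.

```python
def group_by_referential_integrity(tables, foreign_keys):
    """Group tables so that tables linked by foreign keys land in the same group.

    foreign_keys is a list of (child_table, referenced_table) pairs; every
    table named in a pair must also appear in tables.
    """
    parent = {table: table for table in tables}

    def find(table):
        # Follow parent links to the group representative, compressing the path.
        while parent[table] != table:
            parent[table] = parent[parent[table]]
            table = parent[table]
        return table

    for child, referenced in foreign_keys:
        # Merge the two groups that this relationship connects.
        parent[find(child)] = find(referenced)

    groups = {}
    for table in tables:
        groups.setdefault(find(table), []).append(table)
    return sorted(sorted(group) for group in groups.values())


tables = ["ORDERS", "ORDER_LINES", "CUSTOMERS", "AUDIT_LOG"]
foreign_keys = [("ORDER_LINES", "ORDERS"), ("ORDERS", "CUSTOMERS")]
groups = group_by_referential_integrity(tables, foreign_keys)
```

Each resulting group is the smallest unit that can be assigned to a subscription. Here the three order-related tables must share one subscription, while `AUDIT_LOG` is free to go elsewhere.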

The following are valid reasons to increase the number of subscriptions:

Requirement to replicate one source table to multiple targets

You need to increase the number of applies once it has been determined that it is the apply that is affecting the performance and you want further parallelism

You need to manage replication separately for groups of tables, for example where some tables only require mirroring with a scheduled end time while others require continuous mirroring, or where groups are active at different times of the day

You have too many tables in a single subscription which is affecting start-up performance

You have multiple independent business applications that you need to mirror, but want to be able to deal with maintenance independently

Number of Subscriptions per CDC Instance

For the best resource utilization and the easiest management, keep the number of CDC instances and subscriptions to a minimum.

Rule of Thumb:

InfoSphere CDC LUW can generally accommodate up to 50 subscriptions per instance (either source or target)

InfoSphere CDC z can generally accommodate up to 20 combined source and target subscriptions per instance and a hard maximum of 50 subscriptions per instance

Note: For CDC z if you have three or more source subscriptions in an instance, for optimal resource utilization, you need to ensure that the log cache is configured

InfoSphere CDC i can generally accommodate up to 25 source subscriptions per instance, and 25 subscriptions in a target instance

Note that InfoSphere CDC i does not have the single scrape feature, so each additional subscription will require proportionally extra CPU resource if reading from a single journal. Thus, if you have multiple subscriptions you will achieve better efficiency if separate journals can be used for each subscription
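These per-engine guidelines can also be checked in a planning script. The following is a minimal sketch: the engine labels and the function name are hypothetical, and it assumes the count passed in is already the right one for the engine (combined source and target for CDC z, per role for CDC i).

```python
# Rule-of-thumb subscription counts per instance, taken from the guidelines above.
GUIDELINES = {"luw": 50, "z": 20, "i": 25}
Z_HARD_MAX = 50  # hard maximum quoted above for CDC z


def review_subscription_count(engine, count):
    """Classify a planned subscription count for one CDC instance."""
    if engine == "z" and count > Z_HARD_MAX:
        return "over hard maximum"
    if count > GUIDELINES[engine]:
        return "over guideline"
    return "ok"


for engine, count in [("luw", 40), ("z", 30), ("z", 51), ("i", 26)]:
    print(engine, count, review_subscription_count(engine, count))
```

An "over guideline" result is a prompt to review resource usage or consolidate subscriptions, not an error in itself. Only the CDC z hard maximum is described above as a fixed limit.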

Setting Up Notifications (Sometimes referred to as Alerts and Alarms)

There are various means of checking and understanding replication status, performance, etc. One important aspect is being notified in the event of a replication issue, whether that is an error or excessive latency. Notifications can be sent for any event message that InfoSphere CDC produces.

Appropriate notifications settings will alert InfoSphere CDC administrators of issues with the environment in a timely manner so they can be addressed

Notifications can be set up for various categories on the source and target, and at the datastore or individual subscription level

Messages can also be filtered based on severity: status, informational, operational, error and fatal

Latency notifications can be set up to monitor performance issues at a subscription level. A message can be sent to the event log when a warning threshold is passed, and another message if an error threshold is passed

InfoSphere CDC for z/OS also allows users to select specific messages to be directed to the console – see CONSOLEMSGS keyword

Notifications can be directed to platform-specific destinations or a custom user exit program

LUW

E-mail

SMTP

Specify e-mail address and password

Unix System log

Custom Java User Exit Program

z/OS

CHCPRINT spool file

SYSLOG

User Exit

IBM i

Message Queue

User Exit

Rule of Thumb:

The general practice is to have notifications set up for all Fatal and Error messages (events), as well as to have a notification for a latency threshold
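The rule of thumb amounts to two small decisions: notify on Error and Fatal events, and grade latency against a warning threshold and an error threshold. The sketch below models only that logic. The function names and the threshold values are hypothetical, the latency units are whatever your thresholds use, and the real settings are made in the product's notification configuration, not in code like this.

```python
NOTIFY_SEVERITIES = {"error", "fatal"}  # severities the rule of thumb says to notify on


def should_notify(severity):
    """Return True when an event of this severity should raise a notification."""
    return severity.lower() in NOTIFY_SEVERITIES


def latency_level(latency, warning_threshold, error_threshold):
    """Grade a latency reading; all three values must use the same units."""
    if latency >= error_threshold:
        return "error"
    if latency >= warning_threshold:
        return "warning"
    return None


events = ["Informational", "Error", "Status", "Fatal"]
to_send = [severity for severity in events if should_notify(severity)]
level = latency_level(12, warning_threshold=5, error_threshold=10)
```

Checking the error threshold first matters: a reading that passes both thresholds should be reported at the higher level, not as a warning.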