If you're looking for an excellent way to replicate changed data from a wide range of databases into a Netezza appliance, you can do so through InfoSphere Data Replication. The latest release provides an Apply program that is both native to Netezza and optimized for Netezza targets. This Apply is built from Data Replication's CDC technology and is also compatible with the CDC technology found in InfoSphere Change Data Capture and InfoSphere Classic Change Data Capture for z/OS. This means you can replicate data to Netezza from source databases ranging from Oracle, DB2, and others on UNIX or Windows to DB2* and IMS on the mainframe. Ordering information can be found in the Data Replication announcement letter on ibm.com.

* Data Replication's CDC Apply program cannot be used to feed changed data to the IBM DB2 Analytics Accelerator (IDAA).

You may have seen a recent announcement on ibm.com that says IBM would no longer be marketing its older data replication products in 2013. That includes InfoSphere CDC. Why? And what happens to the CDC technology?

Over the years, IBM provided its data replication technologies through a lot of different products. For example, IBM used to offer two major data replication products at the same time - InfoSphere CDC and InfoSphere Replication Server. That was a little confusing, even to some IBM people. To simplify the situation, IBM consolidated all its replication technologies into a single product called IBM InfoSphere Data Replication (IIDR). Once IIDR was available, the older products no longer needed to be sold to new customers. That's why the end of marketing was announced. However, the replication technologies - CDC, Q Replication, and SQL Replication - are still alive and well. You can continue to use them as you always have. Of course, you may have two related questions:

One question that comes up is whether the two IMS replication products are compatible with either the new Data Replication product or the existing InfoSphere CDC products. The answer is yes - the IMS products are compatible with both new and existing products that contain the CDC technology. More specifically, they can provide IMS changed data to any data replication solution that you can build with IBM's CDC technology. For example, you can create unidirectional (one-way) subscriptions that feed IMS changed data to any database that can be targeted by CDC:

Two notes about this picture:

IBM recommends you use the CDC technology in IIDR if you do not own InfoSphere CDC.

The target DB2 can be DB2 for z/OS, DB2 LUW, or DB2 for System i.

You could also feed IMS changed data into other business software such as ETL, IBM's DataStage, and ESBs:

In other words, the new IMS data replication products extend the reach of IBM's CDC technology by adding IMS as a source for log-based capture of changed data. If you have technical questions, see the Classic CDC section of the Information Center.

I have had many requests to share best practices when using IBM InfoSphere Change Data Capture (from this point forward in the blog referred to as CDC). I will try to add new tips and techniques on a regular basis.

Along with many of the best practices posts, I will include items denoted by "Rule of Thumb". These are general guidelines that will help in your planning. I will endeavor to provide reasons or context for the guidance. The Rules of Thumb should not be treated as hard limits, but rather as useful guidance. If your needs fall significantly outside the guidance, it certainly does not mean that it cannot be done. Rather, it would be best to engage with an InfoSphere CDC subject matter expert, and you may want to consider IBM Services for assistance.

For a comprehensive list of best practices, please see the parent community main page:

First, what CDC function are you entitled to use in the DB2 Advanced Editions? The license is always the final word, but, in simple terms, you can only use the bundled CDC to build disaster recovery solutions where a primary DB2 instance* has up to two backup instances. For example, the following replication topology is allowed by the DB2 Advanced Edition licenses:

Furthermore, the disaster recovery use case limits your entitled use of CDC function in the following ways:

You can only use unidirectional (one-way) replication.

You can set up replication from the primary DB2 to the backup(s) but you cannot set up replication from the backup(s) to the primary. This fits with the definition of a pure disaster recovery solution since it provides for fail-over but not switchback. If you need CDC for both fail-over and switchback, you need to license the full IIDR product.

You cannot transform the data as it's replicated. Again, this fits with the definition of disaster recovery and you can license the full IIDR to be entitled to transformations as you replicate.

The question is - when do you need to buy CDC now? If you want to do anything more than what's described in this post, you'll need to buy IIDR for your DB2 Advanced Editions. The two most common replication configurations that require this are ones where you do either of the following:

Replicate between DB2 LUW and either DB2 z/OS or Oracle.

Set up an HA or Active-Active solution with IIDR's CDC technology.

If you need to understand more about these examples, we'll have pictures and add a few more examples in a future post that talks about when you need to buy CDC.

The IBM Redbook titled "Smarter Business: Dynamic Information with IBM InfoSphere Data Replication CDC" is now available and can be found at: http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247941.html?Open

This Redbook covers a wide range of topics, including InfoSphere CDC use cases, solution topologies, features and functionality, performance, environmental considerations, and automation. This is a great source of information if you are wondering how best to set up InfoSphere CDC, how to fit it into a resilient environment, etc.

Setting Up Notifications (Sometimes referred to as Alerts and Alarms)

There are various means of checking and understanding replication status, performance, etc. One important aspect is being notified when a replication issue occurs, whether it is an error or excessive latency. Notifications can be sent for any event message that InfoSphere CDC produces.

Appropriate notifications settings will alert InfoSphere CDC administrators of issues with the environment in a timely manner so they can be addressed

Notifications can be set up for various categories on the source and target, at either the datastore or the individual subscription level

Messages can also be filtered based on severity: Status, informational, operational, error and fatal

Latency notifications can be set up to monitor performance issues at a subscription level. A message can be sent to the event log when a warning threshold is passed, and another message if an error threshold is passed

InfoSphere CDC for z/OS also allows users to select specific messages to be directed to the console – see CONSOLEMSGS keyword

Notification can be directed to platform specific destinations or a custom user exit program

LUW

E-mail

SMTP

Specify e-mail address and password

Unix System log

Custom Java User Exit Program

z/OS

CHCPRINT spool file

SYSLOG

User Exit

IBM i

Message Queue

User Exit

Rule of Thumb:

The general practice is to have notifications set up for all Fatal and Error messages (events), as well as to have a notification for a latency threshold
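To make the rule of thumb concrete, here is a minimal Python sketch of the filtering logic only. It is not CDC's notification API: the `Event` class, the function names, and the use of minutes for latency are all assumptions made for illustration. In a real deployment you would configure this in Management Console or in a platform-specific user exit.

```python
from dataclasses import dataclass
from typing import Optional

# Severities that should always raise a notification, per the rule of thumb.
# The names mirror the severity list in this post; everything else here is
# hypothetical and not part of any CDC interface.
NOTIFY_SEVERITIES = {"error", "fatal"}


@dataclass
class Event:
    severity: str       # status, informational, operational, error, or fatal
    subscription: str   # subscription the event belongs to
    text: str           # event message text


def should_notify(event: Event) -> bool:
    """Return True for Error and Fatal events, False for everything else."""
    return event.severity.lower() in NOTIFY_SEVERITIES


def latency_level(latency_minutes: float,
                  warning_minutes: float,
                  error_minutes: float) -> Optional[str]:
    """Classify latency against a warning and an error threshold.

    A threshold counts as passed only when latency is strictly greater.
    """
    if latency_minutes > error_minutes:
        return "error"
    if latency_minutes > warning_minutes:
        return "warning"
    return None


if __name__ == "__main__":
    print(should_notify(Event("Fatal", "SUB1", "Apply ended abnormally")))
    print(latency_level(7, warning_minutes=5, error_minutes=10))
```

Keeping the severity set and the two thresholds as plain data makes it easy to review them with the operations team before they are configured for real.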

For InfoSphere CDC z, there is no command available. This is generally not a requirement, as most z shops keep logs around for 10 days. If required, you can use the earliest open position indicated in the event log when InfoSphere CDC z starts replication

You need to consider and accommodate for cases when replication will be down for a period of time

Rule of Thumb:

Successful implementations typically have 5+ days of logs retained

If you do not have sufficient log retention, you need to be prepared to do table refreshes if something unexpected happens in your environment
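The reasoning behind this guidance can be written down as a small check. The sketch below is only an illustration: the function names are made up, and it assumes you already know when replication stopped and how far back your retained logs reach.

```python
from datetime import datetime, timedelta

# Rule of thumb from this post: successful implementations retain 5+ days.
RECOMMENDED_RETENTION = timedelta(days=5)


def retention_meets_guideline(retention: timedelta) -> bool:
    """True if the configured log retention is at least the 5-day guideline."""
    return retention >= RECOMMENDED_RETENTION


def needs_refresh(stopped_at: datetime, oldest_log_available: datetime) -> bool:
    """True if the retained logs no longer reach back to where replication stopped.

    In that case the missing changes cannot be read from the log, so the
    affected tables would have to be refreshed.
    """
    return oldest_log_available > stopped_at


if __name__ == "__main__":
    stopped = datetime(2013, 10, 1, 8, 0)
    print(needs_refresh(stopped, datetime(2013, 9, 28)))   # logs still cover the gap
    print(needs_refresh(stopped, datetime(2013, 10, 3)))   # logs were already removed
```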

Number of Subscriptions per CDC Instance

For best resource utilization and easiest management, you want to keep the number of CDC Instances and Subscriptions to a minimum.

Rule of Thumb:

InfoSphere CDC LUW can generally accommodate up to 50 subscriptions per instance (either source or target)

InfoSphere CDC z can generally accommodate up to 20 combined source and target subscriptions per instance and a hard maximum of 50 subscriptions per instance

Note: For CDC z if you have three or more source subscriptions in an instance, for optimal resource utilization, you need to ensure that the log cache is configured

InfoSphere CDC i can generally accommodate up to 25 source subscriptions per instance, and 25 subscriptions in a target instance

Note that InfoSphere CDC i does not have the single scrape feature, so each additional subscription will require proportionally extra CPU resource if reading from a single journal. Thus, if you have multiple subscriptions you will achieve better efficiency if separate journals can be used for each subscription
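If you track your topology in a script or spreadsheet, the guidance above can be turned into a quick sanity check. The following is a simplified sketch: the dictionary and function names are hypothetical, and it treats each platform as having a single per-instance number, which glosses over the source/target distinctions described above.

```python
# Rule-of-thumb subscription counts per instance, as quoted in this post.
# These are guidance, not hard limits, with one exception: CDC z has a
# hard maximum of 50 subscriptions per instance.
GUIDELINE = {"luw": 50, "z": 20, "i": 25}
HARD_MAX = {"z": 50}


def check_subscription_count(platform: str, count: int) -> str:
    """Classify a planned subscription count for one CDC instance."""
    key = platform.lower()
    if key not in GUIDELINE:
        raise ValueError(f"unknown platform: {platform!r}")
    hard_max = HARD_MAX.get(key)
    if hard_max is not None and count > hard_max:
        return "exceeds hard maximum"
    if count > GUIDELINE[key]:
        return "above guideline"
    return "within guideline"


if __name__ == "__main__":
    for platform, count in [("luw", 40), ("z", 30), ("z", 51), ("i", 25)]:
        print(platform, count, check_subscription_count(platform, count))
```

An "above guideline" result is a prompt to talk to a CDC subject matter expert, not a verdict that the design will fail.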

The following items need to be considered and taken into account when you are planning a replication architecture.

§Target table triggers

–Often, if the target is a mirror image of the source, you may have triggers on target tables that, if fired, will have an effect on other tables that InfoSphere CDC is replicating into (CDC would have mirrored the source trigger effect and will get duplicate actions). To alleviate this, you should disable the trigger on the target table.

–Similar to triggers, having cascade deletes set on the target will cause replication to try to delete a record (based on the delete that CDC would have replicated from the source log) that the database may have already deleted (or vice versa). The following strategy can be deployed to deal with cascaded deletes:

•Disable the RI constraints on target prior to starting replication

Please note that re-enabling these constraints may take some time during cut-over if you need to fail over to the target

Strategy: test how long re-enabling the RI constraints takes. If re-enabling all RI constraints takes too long and would impact your RTO (Recovery Time Objective), investigate whether it is possible to leave the RI constraints enabled and just change the CASCADE DELETE flag at cut-over time.

Shared Scrape (sometimes referred to as Single Scrape)

When multiple subscriptions are running in a single instance, it is usually advantageous to utilize a shared scrape mechanism. If you don't use a shared scrape, and you have 'n' subscriptions, CDC would read the log 'n' times. If you utilize shared scrape, CDC will only read the log once which will utilize fewer system resources.

On by default for InfoSphere CDC LUW

You must configure the log cache for InfoSphere CDC z

Not available on InfoSphere CDC i or CDC Informix

You need to size the shared scrape cache appropriately for optimal performance:

If the cache is too small the following will occur:

LUW – A private scraper will be launched which will consume additional resources

Set the staging_store_disk_quota_gb system parameter appropriately to avoid this

Z - With the log cache, each subscription attempts to read its data from the cache – it will read directly from the IFI if the data is no longer available from the cache

Use the following keywords to configure the log cache: CACHELEVEL1SIZE, CACHEBLOCKSIZE, CACHELEVEL1RESERVED

Now that IBM has packaged its major data replication technologies into a single product, InfoSphere Data Replication, a lot of people are asking what they can take advantage of that they couldn't with the older products (InfoSphere CDC and InfoSphere Replication Server). Other than the obvious point of having access to multiple technologies, you can now use IBM's table compare utility, asntdiff, with CDC. asntdiff is a general-purpose utility that compares the data from two queries. IBM provides it through several products - Replication Server, the IBM Data Server Client, and all editions of DB2 and InfoSphere Warehouse.*

Long-time CDC users may ask what's happening to CDC's differential refresh and why they would want to use asntdiff instead of differential refresh. First understand that differential refresh is alive and well and it's not going anywhere :) asntdiff is just an option available to you.

To understand when you might want to use asntdiff, understand the basics of how it works.

asntdiff accepts two queries as input and compares the result sets.

You can use almost any query you can write against source and target tables.
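To illustrate the idea of comparing two query result sets, here is a small Python sketch. It is not asntdiff and does not reflect how asntdiff is implemented or invoked; it uses an in-memory SQLite database, made-up tables `src` and `tgt`, and a hypothetical function name purely to show what "compare the results of two queries by key" means.

```python
import sqlite3


def diff_result_sets(conn, source_sql, target_sql, key_index=0):
    """Compare two query result sets, matching rows on one key column position."""
    source = {row[key_index]: row for row in conn.execute(source_sql)}
    target = {row[key_index]: row for row in conn.execute(target_sql)}
    return {
        # Keys returned by the source query but not by the target query
        "missing_in_target": sorted(k for k in source if k not in target),
        # Keys returned by the target query but not by the source query
        "extra_in_target": sorted(k for k in target if k not in source),
        # Keys present on both sides whose rows differ
        "different": sorted(
            k for k in source if k in target and source[k] != target[k]
        ),
    }


# Illustrative data only: two small tables standing in for source and target.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tgt (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tgt VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

result = diff_result_sets(
    conn,
    "SELECT id, name FROM src",
    "SELECT id, name FROM tgt",
)
print(result)
```

Because the inputs are queries rather than table names, each side can select only the columns, rows, or expressions that make sense to compare, which is the flexibility the points above describe.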

So, the first reason to consider asntdiff is times when differential refresh's restrictions could be overcome by writing queries to get the result sets you need. For example, asntdiff may be an alternative if one of the following differential refresh restrictions applies to your replication configuration:

Differential refresh is only available for tables that use Standard replication.

Derived columns in the source table are not supported.

Target columns are ignored if they are mapped to derived expressions, constants, or journal control fields.

Key columns of the target table must be mapped directly to columns in the source table.

Next, asntdiff is independent of data replication and can be started from a command line. Among other things, this means:

It can be made part of a z/OS batch job and scheduled.

It can be used while a CDC subscription is running

One major point to be aware of with asntdiff is how it works with heterogeneous data, for example, when you want to compare data being replicated from Oracle to DB2. asntdiff was originally written for DB2 databases. As a result, it requires IBM data federation technology to query databases such as Oracle. The good news is that InfoSphere Data Replication provides data federation for use with data replication configurations.

If you're not familiar with asntdiff and want to give it a try, see the ChannelDB2.com blog post titled Compare the Rows of Two Tables. If you have questions, feel free to post them in the CDC message board here on developerWorks.

--* Yes, technically, you could already use asntdiff with CDC on UNIX or Windows since it comes in so many IBM products on UNIX and Windows. However, if you wanted to use it on z/OS, you could only get it through Replication Server. It's now in InfoSphere Data Replication as well.

With a mere 4 weeks until IBM's 2013 Information on Demand, the data replication team thought it might be helpful to have a complete listing of all data replication sessions at IOD. From client presentations and our product roadmap to sneak peeks at new IBM Data Replication functionality, our sessions run the gamut!

Simply take a gander at the sessions below then go to the IOD agenda builder, click on Create Sign In, and then enter your confirmation number and the email address that you used to register for the conference. Create your agenda today!

Number of CDC Subscriptions Required

A Subscription is a logical container that describes the replication configuration for tables from a source to a target datastore. Once the subscription is created, you create table mappings within the subscription for the group of tables you wish to replicate

An important part of planning an InfoSphere CDC implementation is to choose the appropriate number of subscriptions to meet your requirements

Rule of Thumb:

The optimal approach is to start with the minimum number of subscriptions and add more only for valid reasons

This will ensure efficient use of resources as well as require a lower level of maintenance

It may require an iterative process before you have a good balance

The number of subscriptions will impact the resource utilization of the server (more CPU and RAM are needed) and performance of InfoSphere CDC

Note that tables with referential integrity or ones where the data must be synchronized at all times must reside in the same subscription since different subscriptions may be at different points in the log
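One way to apply the referential integrity note during planning is to group tables that are linked by foreign keys, since every table in a linked group has to go into the same subscription. The sketch below uses a small union-find over a list of (child, referenced) table pairs. The function name and the sample table names are invented for illustration; in practice you would pull the relationships from your database catalog.

```python
from collections import defaultdict


def group_tables_by_ri(tables, foreign_keys):
    """Group tables connected by referential integrity.

    tables: iterable of table names to replicate.
    foreign_keys: iterable of (child_table, referenced_table) pairs.
    Returns a sorted list of sorted groups; each group must share a subscription.
    """
    parent = {}

    def find(table):
        # Register unseen tables, then walk up to the group's root,
        # shortening the path along the way.
        parent.setdefault(table, table)
        while parent[table] != table:
            parent[table] = parent[parent[table]]
            table = parent[table]
        return table

    for table in tables:
        find(table)
    for child, referenced in foreign_keys:
        parent[find(child)] = find(referenced)

    groups = defaultdict(list)
    for table in list(parent):
        groups[find(table)].append(table)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    tables = ["ORDERS", "ORDER_ITEMS", "CUSTOMERS", "AUDIT_LOG"]
    fks = [("ORDER_ITEMS", "ORDERS"), ("ORDERS", "CUSTOMERS")]
    for group in group_tables_by_ri(tables, fks):
        print(group)
```

Each resulting group is a lower bound on how finely the tables can be split: groups can be combined into one subscription, but a single group should not be spread across several.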

The following are valid reasons to increase the number of subscriptions:

Requirement to replicate one source table to multiple targets

You need to increase the number of applies once it has been determined that it is the apply that is affecting the performance and you want further parallelism

Management of replication for groups of tables, in cases where some tables only require mirroring with a scheduled end time while others require continuous mirroring, or where groups are active at different times of the day

You have too many tables in a single subscription which is affecting start-up performance

You have multiple independent business applications that you need to mirror, but want to be able to deal with maintenance independently

–All log-based replication products require additional logging on the database which will result in additional storage needs. The following are some of the base logging requirements for InfoSphere CDC:

For DB2 on z/OS and LUW, the DB2 table is altered for Data Capture Changes

For DB2 on IBM i, journaling is enabled requiring before and after image

Number of Tables in a Subscription

Rule of Thumb

This is certainly not a hard limit, but in general it is best to keep the number of tables in a subscription under 1000

Considerations for the number of tables include:

With too many tables (over 1000) in a subscription, loading and managing the tables in the Management Console GUI will be slow

This may not be a consideration if you are controlling your replication via scripting/automation

If the number of tables exceeds 1000, then promotion in the Management Console will take a significant amount of time, and additional memory would need to be allocated

From an engine perspective:

With CDC LUW if you want to go beyond 1000 tables you need to increase the memory allocated to the InfoSphere CDC Instance

If the target is flatfile or HDFS, then an upper limit on the number of tables in the subscription is 800. Additionally, you would need to allocate some additional memory if you have more than a couple hundred tables.

CDC i can accommodate well over 2000 tables in a subscription

CDC z can accommodate well over 1000 tables in a subscription

Note that the number can be significantly higher, but there are implications for the number of subscriptions you can have, due to limits on below-the-bar memory

There are many deployment models available for InfoSphere Data Replication's CDC technology of which DataStage integration is a popular one. The deployment option selected will significantly affect the complexity, performance, and reliability of the implementation. If possible, the best solution is always to use CDC direct replication (i.e. do not add DataStage to the mix).

CDC integration with DataStage is the right solution for replication when:

You need to target a database that CDC doesn't directly support and is not appropriate for CDC FlexRep

Complex transformations are required that could not be handled natively with CDC, such as complex table look-ups

When integrating with MDM

Cons of replicating from CDC to DataStage to an eventual target database:

Performance going through DataStage (no matter which integration option is chosen) will be significantly slower than applying via a CDC target directly to the database

The exception to this rule is Teradata: if you use DataStage flatfile integration, the throughput will be higher than CDC direct to Teradata

The maximum number of tables per CDC subscription is lower if targeting DataStage

The CDC External Refresh does not work when targeting DataStage. A separate process would have to be put in place to de-dup duplicate records produced during the "in-doubt" period of a refresh (the captured changes that occurred while the source data was being refreshed).

§Using ‘Standard’ replication achieves much higher throughput performance than using ‘Consolidation’ or ‘Summarization’

–Standard replication can do optimizations such as arraying, commit grouping, etc. that cannot be performed when using the other replication methods

–Note some optimizations will also be disabled if using Adaptive apply or Conflict Detection & Resolution

§Be aware when you are parking tables/subscriptions

–An inactive (not currently replicating) subscription that contains tables with a replication method of Mirror will continue to accumulate change data in the staging store from the current point back to the point where mirroring was stopped. For this reason, you should delete subscriptions or remove tables that are no longer required, or change the replication method of all tables in the subscription to Refresh to prevent the accumulation of change data in the staging store on your source system.

–The same is true with a parked (idle) table. You need to ensure that the replication method is set to Refresh