For those of us who have followed the various technologies in this market for years (15 actually... we just had our birthday in January), it's a fun question to answer. I especially like getting that question these days, since over the last two years IBM has done some very unique things that bring a wealth of value to our customers and truly differentiate our suite. While there are several directions we could take that discussion, the question prompted me to get out the blog tonight so I can share some thoughts on one very specific area of distinctiveness .... data integration styles.

So, how many ways are there to move data?

You'd be surprised.... there really are several. As you can imagine, these align to the ultimate business goal an organization is trying to achieve. For instance, if you are interested in a topic like "real-time data warehousing" then you will need to look at a very different set of features than someone who may be looking at a project like "an application migration". Organizations looking for a data integration platform that can serve as the middleware layer supporting a diverse set of projects may need to consider a set of capabilities that reach beyond the current project at hand.

There are several ways that we can logically divide data integration styles, but my top-level swags are always "batch" vs "real-time". Those categories are almost universally understood, so they give us a good construct with which to dive into more detail. In the following sections we'll look at some details on the batch processing styles, and then turn to the real-time styles.

Batch processing styles

Type 1: Normal Batch This is where it all began. When data warehousing was growing up, batch-style ETL processing was the only game in town. It is still the most popular style of processing, supporting thousands of organizations' data integration projects. Batch processing has the most demanding performance and scalability requirements. It requires that ETL platforms not only run native bulk operations against source and target DBMS systems, but also leverage a variety of techniques for parallelism. Many ETL tools in the market will support data pipelining (think of an assembly line with each step being a different process). The market leaders will also support another style of parallelism ... data partitioning (think of multiple assembly lines). Data partitioning allows you to divide the data that must be processed into a set of smaller sets that can be worked on independently. DataStage provides some very unique features in this area, including a shared-nothing architecture for true scalability, as well as a concept known as dynamic repartitioning. Dynamic repartitioning (at least the way IBM means it.... don't let your other ETL vendor fool you) is the ability to determine, through runtime options, the degree of data partitioning that will be run across any set of hardware ... SMP, MPP or grid. Without the ability to do this, an organization that finds it has a large file to process will likely need to go back to its development environment and redesign the job to run N-ways instead of M-ways. This can be a costly proposition and one that Information Server customers don't have to worry about.
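To make the partitioning idea concrete, here is a minimal, hypothetical Python sketch (not DataStage code and not how the engine is implemented) of an N-way, hash-partitioned transform where the degree of parallelism is chosen at runtime rather than baked into the design - the spirit of dynamic repartitioning:

```python
# Illustrative only. Hash-partition the rows, process each partition
# independently, and choose the degree of parallelism at run time.
from concurrent.futures import ProcessPoolExecutor

def transform(rows):
    # stand-in for the per-partition transformation logic
    return [{**r, "amount": r["amount"] * 1.1} for r in rows]

def run(rows, degree_of_parallelism):
    # assign each row to a partition by hashing its key
    partitions = [[] for _ in range(degree_of_parallelism)]
    for r in rows:
        partitions[hash(r["key"]) % degree_of_parallelism].append(r)
    with ProcessPoolExecutor(max_workers=degree_of_parallelism) as pool:
        results = pool.map(transform, partitions)
    return [row for part in results for row in part]

if __name__ == "__main__":
    data = [{"key": i % 7, "amount": float(i)} for i in range(1000)]
    # the same logic can run 4-way today and 16-way tomorrow without a redesign
    print(len(run(data, degree_of_parallelism=4)))
```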

Type 2: Micro-Batch "Micro-batch" is running typical batch-style designs on a frequent, periodic basis. In some cases, organizations will be interested in micro-batch as a first step toward moving their business project closer to real-time... perhaps running these jobs every 10 minutes. In other cases, the organization will have a very specific use case in mind. The classic example is when the organization receives files on a regular basis throughout the day and needs to process whatever has arrived as soon as possible. These projects will also generally require more parameterization of the data sources, targets and even data schemas (which is a great use case for DataStage's runtime column propagation). Also, we shouldn't assume that the data volume will be small just because we are running every few minutes. In certain industries in particular, we find that the data sources may still be very large, so there is still a great demand for a performant and scalable data integration engine. One unique feature of DataStage in this regard is the ability to wildcard-scan for the data sources that are now available and then scale that work out over N degrees of parallelism. Finally, there are unique scheduling and workflow requirements around triggering activity based on file arrival or sleeping for a specified period of time. The DataStage sequencer provides these features to support the micro-batch scenario.
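As a rough illustration of the pattern (the DataStage sequencer handles this declaratively, so treat the sketch below purely as an analogy), a micro-batch loop wakes up on an interval, wildcard-scans for whatever files have arrived, and processes each one; the paths and interval here are hypothetical:

```python
# Illustrative micro-batch loop: poll for files matching a wildcard, process
# whatever has arrived, then sleep. The pattern and interval are made up.
import glob
import os
import time

def process(path):
    print(f"processing {path}")        # stand-in for the real integration job
    os.rename(path, path + ".done")    # mark the file so it is not picked up again

def micro_batch(pattern="/data/inbound/orders_*.csv", interval_seconds=600):
    while True:
        for path in sorted(glob.glob(pattern)):
            process(path)
        time.sleep(interval_seconds)   # e.g. wake up every 10 minutes
```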

Type 3: ELT (or mixed TETLT) The industry has settled the long war of ELT vs ETL and recognized that each technique has a set of scenarios where it is optimal. I have two scenarios that I use to describe this. Scenario 1 is the conversion of an XML file to a CSV. There is no good reason to move this information into a database only to extract it again - this is clearly predisposed to ETL. Scenario 2 is the creation of an aggregate table from a large detail table. There's no good reason to extract this data from the database since the system is optimized to do this type of transformation, and an ETL topology would introduce a good deal of network latency. So, you can imagine that "data proximity" is a good yardstick for the processing type that may be your best fit. The right data integration technology will be able to provide both ETL and ELT and allow developers to switch easily between the two. The Information Server implementation (called "Balanced Optimization") is different from other vendors' in that it leverages a single design canvas and a single set of widgets/stages to represent both paradigms. This means that to go from ETL to ELT there is no redesign work required - the user simply hits the "optimize" button and they will have the ELT version of their logic. Information Server is also unique in its blended approach that allows the user to push processing to the source DB, target DB, or both, as well as keep a subset of processing on the ETL server where required (like our integrated quality components).
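To show what the distinction boils down to, here is a small, hedged sketch of the same aggregation expressed both ways; the table and column names are invented for illustration, and this is not what Balanced Optimization actually generates:

```python
# The same "aggregate a detail table by region" logic expressed two ways.
# Names are hypothetical.

# ETL style: extract the detail rows, aggregate in the integration engine.
def aggregate_in_engine(detail_rows):
    totals = {}
    for row in detail_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

# ELT style: push the identical logic down to the database that already holds
# the data, avoiding the round trip over the network.
ELT_SQL = """
INSERT INTO sales_by_region (region, total_amount)
SELECT region, SUM(amount)
FROM   sales_detail
GROUP BY region
"""
```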

Real-time processing style

Type 4: Change Data Capture integration Change Data Capture (CDC) technology monitors DBMS inserts, updates, and deletes so they can be replicated to another data store. The InfoSphere CDC technology minimizes impact to the DBMS by triggering off of what is written to the database log files, without any requirement to further stage the information. Organizations have historically adopted CDC technologies primarily to support real-time data warehousing, often in support of web-based storefronts and call center portals. There is now also a growing interest in adopting CDC technology to replace portions of batch processing that were focused on finding deltas between full-volume extracts. CDC offers a great value proposition for minimizing resource investments in these cases. These delta records can be piped directly into a data integration/quality job so they can be transformed and cleansed to the extent required. Information Server can then write these to a heterogeneous set of data targets while providing two-phase commit support to guarantee delivery of the data. The integration of data replication, data transformation and data quality into a single runtime architecture (for complete scalability), metadata store (for complete lineage), and design environment (to maximize time to value) is unparalleled.

Type 5: Message-queue based integration For many years IBM has been combining data integration and application integration software to provide unique solutions. Organizations use WebSphere MQ (the market-leading technology) for delivering messages between their applications and other information initiatives. For many years we have been working with organizations who recognized our development tooling as providing great time-to-value for satisfying complex transformation requirements in this domain. Tens of millions of dollars are transferred using Information Server every day by the largest financial institutions in this fashion. Information Server connects directly to one or more queues as sources and then leverages the full complement of integration and quality components to transform and cleanse this information. Similar to the CDC scenario, this data can then be committed with guaranteed delivery to a heterogeneous set of targets. We refer to this as our distributed transaction stage, and it can target all of the leading DBMS vendors along with a range of other data stores. I feel confident with the bold statement that no other ETL vendor is as well integrated with message-based technologies.

Type 6: Information Services Information Services is a form of data virtualization, providing a common access layer to any type of data source and information processing. A key advantage of adopting such a virtualization strategy is that it insulates the end user from variations in the source technologies and thus accelerates their ability to deliver business value. Services should be accessible through a variety of invocation bindings - EJB, HTTP, JMS, REST, RSS - and in a variety of formats for the consuming applications - such as SOAP, Text, and XML. Over the last 8 years that Information Server has provided these features, some companies have adopted information services for very specific requirements, such as providing a person-matching function within an ERP or MDM solution, while others are building out an information fabric for distributed and mainframe data access. Since our information services are fully integrated with our design and runtime environment, information services can include multiple dozens of data source types (or "information service providers") for DBMS, sequential files, XML, VSAM, message queues, Hadoop, ERP, etc... and any of our transformation, profiling and cleansing components. This category can be broken down further into three topologies, each of which has its own targeted use cases:

input/output service: services that are always running/active and can receive an input row at any moment in order to deliver an output result. This topology is typically used to process high volumes of smaller transactions where response time is important. It is tailored to process many small requests rather than a few large requests.

output service: the service call invokes a job to run where the return values can consist of an atomic value (one column), a structure (multiple columns), or an array of structures (multiple rows). These jobs typically initiate a batch process from a real-time process that requires feedback or data from the results. It is designed to process large data sets and is capable of accepting job parameters as input arguments.

service triggered batch jobs: the service call invokes a batch job to run on demand. Organizations who use this may be integrating into a business process where an authorized user triggers an event that then launches the delivery of data to a downstream process.

So, how many data integration styles do I need?

Within any single project you are probably looking at just one or two of these. As the next several projects start rolling, you will probably have started to consider where growing into the other data integration styles can help accelerate business value for your organization. A key factor in capitalizing on this value is having the ability to leverage the existing tools, infrastructure and skill set you have already established. Information Server provides you an on-ramp not only for this set of capabilities, but also for a wealth of others around scalability, connectivity, profiling, cleansing, quality monitoring, metadata, governance, etc... If you are interested in learning more about these features and the flexibility of Information Server to handle this breadth of data integration scenarios, you can learn more in our Information Center or send me an email. Always happy to talk.

Information Server 8.7 Extreme Productivity ... among other things

This past week I joined nearly 11,000 people who attended the Information on Demand (IOD) conference. What a fantastical week !!! Each day of the conference was filled with business and technical sessions, meeting with experts, customer interactions and more. It's always a very busy week trying to squeeze every minute out of these days. Needless to say, I slept quite a lot this past weekend :)

Now, if you didn't get to IOD last week or didn't see the announcement the week before, you may have missed the latest information about our newly released Information Server 8.7. I'm very excited about this release as it includes some particular features that will make a tremendous impact for our customers. As a matter of fact, many of the customers I met with at IOD who attended one of the "What's New" sessions or had the chance to work with the technology in the Demo Room commented that they will be starting their 8.7 upgrades asap in order to take advantage of these new items.

So... what's new?

You can read the full details of "What's New" in our whitepaper of course, but let me highlight a few of my favorites in the following list. As you read through, I hope you get a sense of the priority we placed on productivity... solving tough challenges in intuitive ways that are then easy to use.

Operations Console - (my favorite) dashboard technology that puts all the answers about your runtime environment within a couple of clicks through our new web interface (pictured right). Start with summary graphs that provide rolled-up information across all your projects and then drill into the specifics based on your access permissions.

Interactive Parallel Debugger - accelerates the development lifecycle with an integrated troubleshooting framework. Run in a debug mode, pause on data conditions or record counts, investigate data values by node, etc.... What is truly market-unique about this technology is the ability to debug at any degree of parallelization across any SMP, MPP or Grid configuration.

New Netezza Connectivity - Information Server and Netezza now work together better than ever to deliver break-through analytics. You can read my blog for some other details, including some awesome benchmarks for reading from and writing to Netezza.

Information Analyzer Data Rules Stage - introduces a new stage to the DataStage/QualityStage job canvas to enable a shared rule framework between data profiling, data cleansing and data integration. Take the same insight the subject matter expert or business analyst built to monitor quality and use those rules to choose what data to remediate in your job flow. Or... flip that around and have your developers create the rules and then offer them to the data steward to apply for data quality monitoring.

QualityStage Match Specification Wizard - through a few simple selections, allows a user to very rapidly build solutions for the most common data quality use cases. I spoke to one of our 8.7 beta customers at IOD and he shared that a new hire of theirs fresh out of college was using this technology successfully within a day !!!!

Business Glossary Collaboration - really a phenomenal amount of advancement in one release .... integrated UI for all roles, new label construct for classification, logical model integration, and workflow/approval process for term authoring.

So... what now?

If these features sound interesting to you, you might be considering what the next steps are to find out more about Information Server or even to start your 8.7 journey. Here are some things you might consider:

If you want even more details on the specifics, take a look at our Information Center for 8.7 which has all of our product details (one of those things I like about IBM - it's always easy to find information about our products).

Reach out to your IBM Account team - always a good place to start for deep dives and next steps.

If you are ready to start your 8.7 install or upgrade, shoot a note to our Concierge team concis85@us.ibm.com who will provide best practice guidance as you go.

Or, shoot me an email.... tcurcio@us.ibm.com ... and I'd be happy to field any 8.7 questions or plug you into others who can get you started.

Announcing the Next Generation of Netezza Connectivity for Information Server

Better, Stronger and a whole lot Faster !!!!

I'm pleased to announce that we have just released our next-generation connectivity for IBM Netezza. This is a fantastic evolution of what was already a compelling integration provided by the original Netezza Enterprise Stage. InfoSphere Information Server and Netezza Appliances now work together better than ever to deliver break-through analytics .... yes, a little Six-Million Dollar Man like.... better, stronger and a whole lot faster !!!!

Better - The New Netezza Connector

The Netezza Connector provides shared metadata and native connectivity across all data integration (both server and parallel), data quality and data profiling use cases. It provides a rich set of out-of-the-box features for creating tables with different distribution key options, support for executing UDX functions, and Netezza statistics collection.

What I find coolest are the automation features that have been extended to the Netezza Connector. These capabilities allow developers to solve complex problems with simple selections. They include:

Multi-action modes that automatically attempt a secondary operation, including “Update then insert” and “Delete then insert”, if the first operation results in an error (see the sketch after this list)

Action columns that allow a specific data value in the data integration stream to drive the operation (insert, update, delete) on the database

Generated Netezza-specific SQL for both the table actions and the creation of database objects, based on the data fields being passed through the data integration job.

Dynamic SQL Execution using job parameters or through a SQL file, executing the SQL on a single node or multiple nodes.

Partitioned reads through scaling out multiple read requests and automatically merging those results.

Staging Table Management with the choice of loading directly into the target table or a staging table, and automatic promotion from the staging table to the target table
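To give a feel for what a multi-action mode like "Update then insert" means, here is a rough, vendor-neutral sketch using a generic DB-API connection; the table, columns, and the rowcount test are assumptions for illustration, not the connector's actual implementation (the real connector generates Netezza-specific SQL for you):

```python
# Hypothetical "update then insert" semantics sketched with a generic DB-API
# connection. Table and column names are made up.
def upsert(conn, row):
    cur = conn.cursor()
    cur.execute(
        "UPDATE customers SET name = ?, balance = ? WHERE customer_id = ?",
        (row["name"], row["balance"], row["customer_id"]),
    )
    if cur.rowcount == 0:  # nothing updated, so fall back to the secondary insert
        cur.execute(
            "INSERT INTO customers (customer_id, name, balance) VALUES (?, ?, ?)",
            (row["customer_id"], row["name"], row["balance"]),
        )
    conn.commit()
```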

Stronger - Balanced Optimization for Netezza

Balanced Optimization allows job logic to be executed in full or in part on the source or target database. The benefits of this flexibility are obvious, most specifically in cases where data integration tasks require homogeneous processing (i.e. the source and target for the logic are the same database). In these scenarios, processing a set of logic directly on the database can have tremendous payback.

The new Netezza Connector makes it possible to leverage the features of our DataStage Balanced Optimization product. After following the same job design process they would for any other activity, users choose which pieces of job logic they would like performed on the database and can dynamically choose which job to run based on the appliance and ETL server's load utilization. The product makes it simple to make this choice by offering intuitive selections like "push data reduction processing to the database". The unique Information Server approach allows these capabilities to be used in-flight with the Information Server engine - for the unique transformations and quality elements it provides - all without ever landing data to disk and while maintaining parallel processing throughout. This combination makes it possible to leverage massive scalability at whatever step necessary to achieve business objectives asap.

Faster .... oh yeah ....

In bringing this next generation to life, the development team spent considerable time optimizing the code to maximize throughput rates and then proving them out through some grueling benchmark tests. At peak configuration, the connector for Netezza load and unload achieved performance throughput in excess of 2 TB/hr!!! If you are interested in the specifics of that benchmark, please drop me a note at tcurcio@us.ibm.com. I'd be happy to share a view on the commodity hardware we used to achieve that.

If you are an existing Information Server and Netezza customer, you might be interested in knowing how that stacks up against the original Netezza Enterprise stage, which also used native connectivity features. Thanks to the optimization efforts in the new connector, our lab tests have shown a double-digit performance gain in reading and loading information. So, if you are already using Information Server with Netezza, the new stage should also prove of great value to you.

It's All About Growth

Saving "Z time" with InfoSphere Information Server and Netezza

One of the main reasons I enjoy my job is that it allows the opportunity to meet, work with, and build friendships with people across a wide range of organizations. Some of these great folks are in the biggest and most well known companies in the world, while others are at growing companies albeit with less recognizable names. In both these cases, there are very serious IT agendas and, not surprisingly, these agendas share very common requirements.

While preparing for the Information On Demand conference this past week, I had a conversation with a customer that summarized one such requirement that I've been hearing lately from a variety of customers. What he said was very simple and directly to the point..."it's all about growth".

In this particular conversation we were discussing the advantages they've realized from their Information Server grid architecture: better utilization of available hardware, the ability to deploy new business projects quickly, and shared resource pools that bring more compute power to any given task so that business results are achieved faster. While this was a great validation of this customer's success with their Information Server grid, I came away from that conversation thinking more about the succinct comment about growth.

Why is it "all about growth"? I think it can be explained in this way ...

"organizations that cannot immediately support growth for increasing data volumes and unexpected/compelling business projects will lose value at a rate corresponding to the time required for them to scale to those requirements"

You could argue that is a bit obvious. It's the same economics you'd think about when considering bringing a project in on time. If the project is late 2 months, that's 2 months of revenue that get pulled out of the original revenue case that justified that project.

I thought this customer's focus on the scalability of the infrastructure as a primary enabler hit the point dead on. Basically, growth cannot be a tomorrow agenda. Perhaps an organization doesn't need the extra capacity on hand today... but they do need to have that plan for growth in place, and the planned execution time needs to be quantifiable. Can I get from X to Y in Z time? What will it cost me to get there? Considering the leaps in business value I could capitalize on, what revenue might I lose in "Z time"?

We've been doing a lot of work with the Netezza team lately and they are another set of folks that understand the value of "Z time". With appliances that can be deployed inside a day, and capacity that can be added quickly and consistently, they present a very compelling value proposition to the organizations concerned about being nimble on their growth agenda. Furthermore, benchmark testing in our lab has proven that the massively scalable architectures of both Information Server and Netezza allow them to exchange information at very impressive sizes and speeds.

If you are considering how to plan for growth - or if you are having issues turning around today's data volumes - there are proven data integration and data analytic solutions that can help you realize the business opportunities that are your goal. If you'd like to discuss where you are on this growth journey and how you are looking to minimize "Z time", drop me a note at tcurcio@us.ibm.com.

Information Integration for your IBM Smart Analytics System 7700

I'm often asked by customers whether IBM offers an "appliance" for Information Server as we do with several of the other products within Software Group. This question comes from organizations who have recognized increased value from purchasing packaged hardware and software solutions. The advantages of this approach are clear:

Time to Value - appliances are delivered "ready-to-go" for immediate use by the project teams

Optimized - the hardware components (CPU, memory, disk) and their setup maximize the power of that combination

Standardized - configured by product experts based on best practices to gain long term benefits from that commonality across the industry

I'm pleased to share that today IBM has announced a new addition to its successful business-ready analytics solution family with the availability of a pre-installed and pre-configured Information Server. This offering is known as IBM® Smart Analytics System 7700 Information Integration Software. In combination with the other product modules already included in the IBM Smart Analytics System 7700, this offering delivers the following advantages:

Reliability and availability: Fully integrated and optimized components reduce client time and cost of integrating and optimizing analytics solutions for business use.

Comprehensiveness: Leverages the industry's most comprehensive portfolio of hardware, software, storage, and service offerings to accelerate value from the growing IBM portfolio of powerful analytic and data integration modules.

If you are looking for more information about this offering, please visit the following links for the announcement and the product page.

Rollup

As noted above, three new and powerful stages were added to 8.5 just after it shipped. These stages have been included in the Fix Pack to streamline applying these features into your environment. These stages are:

XML Stage
I wrote about this new stage in a prior blog entitled Scalable and Intuitive XML. As a summary, if you are looking for a way to process and scale out multi-gigabyte, complex XML, this is a technology you should look at.

CDC Transaction Stage
Provides a very active design, metadata and runtime integration of InfoSphere Change Data Capture with the ETL capabilities of DataStage. This functionality allows the replicated database log information to be piped directly into a running DataStage job in order to support complex transformations and heterogeneous targets. This also includes novel 'bookmarking' technology to provide guaranteed delivery of the information throughout the solution. You can find more information in the InfoCenter.

DRS Connector
Advances the mature 'Dynamic RDBMS' concept to the newer Connector framework. Allows full parameterization of the source/target database type so that jobs can be written once and run against multiple DBMS types (DB2, Oracle, ODBC).

New Features

I seem to be building this blog in sets of three, so in staying with that form, here are a few of the new features packaged into DataStage as part of FP1:

NEW - Job Log Detail Window in Designer
The Director's Log Detail window has been integrated into Designer (see picture to right). With this enhancement, users can access the full set of job runtime information directly alongside their job design, thus simplifying the development lifecycle. You will also note that additional job control buttons (job stop/job reset) are provided in the log toolbar.

NEW - Bloom Filter Stage
The Bloom Filter stage is based on the algorithm developed by Burton Howard Bloom and provides highly performant and resource-efficient duplicate key filtering. This is particularly useful in telco data integration patterns involving enormous volumes of call detail records.
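For readers new to the idea, here is a small, self-contained Python sketch of a Bloom filter (illustrative only, not the stage's implementation): a compact bit array plus several hash functions gives fast "probably seen before" checks with a tunable false-positive rate and no false negatives, which is what makes it attractive for duplicate filtering over enormous key streams.

```python
# Toy Bloom filter to illustrate the concept behind the stage.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1_000_000, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def probably_contains(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Duplicate-key filtering over a stream of (hypothetical) call detail record keys:
seen = BloomFilter()
for cdr_key in ["5550001|2011-10-31T09:00", "5550002|2011-10-31T09:01",
                "5550001|2011-10-31T09:00"]:
    if seen.probably_contains(cdr_key):
        continue   # treat as a duplicate (with a small chance of a false positive)
    seen.add(cdr_key)
```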

NEW - Environment Variable Import/Export Features
The DataStage Administrator includes two new buttons that control the export and import of environment variables. These files can be shared between projects in the same Information Server instance or across multiple instances, as you may require when promoting from dev to test to prod. This feature ensures that customizations made to one project's runtime environment can be applied consistently across your enterprise.

Follow-up

If you'd like more information on Information Server 8.5 or specifically about Fix Pack 1 feel free to drop me a note at tcurcio@us.ibm.com.

These offerings are suited to help enterprises as they undertake various types of projects. In this blog, I'll spend a few minutes sharing some thoughts about Workgroup Edition specifically... and a bit about a great weekend here in Charlotte.

Right Sized Pony for your birthday party

Three weeks ago I spoke with some friends who have a few horses and asked if they could help me surprise my daughter for her eighth birthday party. They were kind enough to help out and the party earlier today was great. I have to say Kimberly was very surprised when the pony came walking into the backyard !!! Obviously, the party was on the top of my mind much of the week, and as I sat down to write tonight, I was reminded of one of the questions from our friends. They wanted to know "how old will all the kids be". The reason for this question was pretty obvious once I thought about it.... they wanted to make sure they brought the right sized horse for the kids. If the horse was too big, the kids would have had a hard time getting on, and some may have been scared to participate. If the horse was too small, it would have tired out and not had enough energy to make it through the party.

Right Sized Data Integration Platform

This is a good analogy for the questions moderate-sized organizations face when they consider moving forward with data integration software. These companies may not have enormous data volumes, but the business requirements they must satisfy are just as complex as those of larger enterprises. So, when they consider a data integration platform, they need the same robust capabilities to understand, cleanse, transform and deliver their data. However, their investment level needs to align to their business size. For these organizations, IBM is pleased to now offer Information Server Workgroup Edition.... not too big, easy to get on, and enough power to easily make it through the party. Workgroup Edition includes the breadth of data integration, profiling, advanced data analysis, quality and governance capabilities you need to make your project successful at a price and performance level aligned to your business.

If you work in a mid-sized business, or a department in a larger enterprise with a limited scope, you don't have to sacrifice industry leading data integration capabilities. If you're concerned about how your project may grow and want a product that can scale with your business, Workgroup Edition will allow you to move directly into the enterprise version of Information Server. This paves a very smooth road into the future for your business.

If you'd like to know more about Information Server Workgroup Edition, contact your local IBM team .... or feel free to send me a note (tcurcio@us.ibm.com). I can also tell you where to find a few good horses.

Many organizations I speak with are increasingly concerned about the privacy of their information. The reasons are obvious... information leaks, regulations, lawsuits, and the risk of penalties top the list. However, even beyond those, companies now recognize that protecting the information of their customers can be a key differentiator against competitors and help prevent situations that would lead to customer loss. As these organizations adopt policies to control their risk, they find that data masking requirements appear in various quadrants of their IT organization, including the data integration space.

To help our customers respond to these challenges, IBM has released the new InfoSphere DataStage Pack for Data Masking. You can find that announcement at the following link:

In this blog, I'll preview some of the primary features of this new stage and then provide some sample illustrations of how it may be used for in-line masking to protect data in an ETL process.

Capabilities

The Data Masking Pack is fully integrated with DataStage and operates just like any other stage on the DataStage job canvas. This makes it very simple to insert into any process where data elements such as customer names, addresses, national identifier numbers, credit card account numbers, email addresses, and the like must be protected.

The pack has been built upon the same obfuscation technologies used in the market leading InfoSphere Optim products. Combined with the features of DataStage, it provides some key capabilities to help prevent the exposure of sensitive business data, including:

capabilities to simulate realistic data in situations where data type and format must be preserved

the performance, scalability, and reusability of InfoSphere DataStage

support for masking complex file types, including mainframe and EBCDIC

Sample Masked Data

For folks who are new to this topic, you may want to see an example of masked information to have better context for the rest of the conversation. In the image below, we are masking with the email address policy. The original values are on the left of the screen and the masked values just to the right. The three rows illustrate the options that are available for this particular policy... preserve domain name, preserve user name, and generate both. Depending on where this data will be reused, you may prefer one of these policies over the others.
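To make those three options concrete, here is a minimal Python sketch of email masking; the substitution logic is invented for illustration and is not the algorithm the pack actually uses:

```python
# Minimal sketch of the three email masking options described above.
import random
import string

def _random_token(length=8):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def mask_email(address, policy="preserve_domain"):
    user, domain = address.split("@", 1)
    if policy == "preserve_domain":      # mask the user name, keep the domain
        return f"{_random_token()}@{domain}"
    if policy == "preserve_user":        # keep the user name, mask the domain
        return f"{user}@{_random_token()}.example.com"
    return f"{_random_token()}@{_random_token()}.example.com"  # generate both

print(mask_email("jane.doe@example.com", "preserve_domain"))
```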

Job Illustration

Once installed, the Data Masking (DM) stage appears like any other on the Designer palette. The screen shot below illustrates how that may look in a sample job. In this case, we are simply moving data from one file to another and applying the masking rules as the only transformation, but this job could include any series of transformations, aggregations, pivots and the like. The DM stage is capable of setting the masking policy for any number of fields, so in most use cases you would only need one such stage in a job. Of course, not every field requires a mask, and those fields can simply pass through this stage unaffected. The DM stage also includes validation checking. For instance, if you are masking a Social Security Number, you may want to reject any data that doesn't conform to a standard pattern for SSN. In those cases, the user can set a property to send that data down a reject link (not drawn in this particular job). Alternatively, the job can be set to abort on those conditions or simply pass the data through unaffected. This provides very robust handling for exceptions.

To configure the stage, the user enters the Masking Policy Editor (shown below) where they are presented with three separate sections:

output column: lists all columns in the record stream and allows the developer to choose which ones require a masking policy

masking policy: any of a series of policies for the obfuscation of data, including National ID for a variety of countries, Credit Card Number, Random, Repeatable, etc...

mask policy options: depending on the policy selected, the relevant options for configuration of that policy

The developer simply works through the drop-down lists to select the columns and policies that are required. In the screen shot to the right, the masking policy is set to "Hash Lookup", which gets one or more columns from predefined lookup tables. This feature is important where the customer requires that a particular data value always map to the same substitute value. The pack includes substitute data for several reference sets, including first name, last name, company name, and address.
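The repeatable-substitution idea behind a hash lookup can be sketched in a few lines of Python; the reference list and hashing choice below are illustrative assumptions, not the Optim implementation:

```python
# Sketch of repeatable substitution: the same input always maps to the same
# substitute value drawn from a predefined reference set.
import hashlib

FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Casey", "Riley"]  # sample reference set

def hash_lookup(value, reference_set=FIRST_NAMES):
    digest = hashlib.sha256(value.encode()).hexdigest()
    return reference_set[int(digest, 16) % len(reference_set)]

# The same name masks to the same substitute on every run and in every job.
print(hash_lookup("Antonio"), hash_lookup("Antonio"))
```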

You can see from these few illustrations that the masking policy is very straightforward to configure. It's also important to note that the runtime components scale across the DataStage engine like any other parallel stage. So, regardless of whether you're running on an SMP, MPP or Grid, you can leverage the entirety of that compute power to mask huge volumes of data in-line.

Summary

If your organization is challenged with data privacy issues related to moving data throughout your organization, I'd enjoy discussing with you the unique benefits DataStage can introduce into those scenarios. As always, feel free to drop me a line anytime.

Unleashing the Power in DataStage 8.5 Issue #4: Scalable and Intuitive XML

You know the story of the ugly duckling? A mother duck's eggs all hatch and one of the baby "ducks" is rather odd looking. The bird is teased for being different. However, after winter ends and spring arrives, the bird has matured into a beautiful swan.

If you have used the previous DataStage XML pack, you would probably agree it was a little awkward; XML in ETL tools has never been simple. It could handle moderate to complex requirements, but it demanded a lot from the user during the job design process. Additionally, since it relied on a standard industry parser, it wasn't easily scalable.

The perspective on 8.5 is very different... winter has ended and spring has arrived. However, I don't want to call the new XML pack a "swan". Sure, it includes a very attractive UI component, but its runtime engine component is also incredibly powerful. So, if you don't mind me mixing bird metaphors, let me call it a "phoenix" and I'll use this blog to focus on both aspects... pretty and packing a punch.

New UI Components - Schema Library Manager and Assembly Editor

The 8.5 XML pack has been built specifically with hierarchical data in mind. This means that the various components of the data integration process - metadata representation, data operations, root selection, field mappings, etc... - all present the hierarchical model in a very intuitive way in order to simplify the job design process and thus accelerate the user's time to value.

Pictured to the right is the "Schema Library Manager" (please click on the pic to see it - the blog emulation doesn't scale well). In 8.5 the XSD files are imported into the repository in their native format. This ensures that the metadata is preserved entirely. The UI has been enhanced to show the tree structure and the extended attributes and facets for each node of the hierarchy. Users may build multiple "libraries" in order to group related XSDs into the most logical buckets. The Schema Library Manager is accessible from both the main Designer menu as well as from within the Assembly Editor (discussed below) in order to present an integrated path from metadata import into job design.

Once the metadata has been imported, the developer works in the XML Stage's "Assembly Editor" to define the logic for the job. An "assembly" defines a series of steps/tools that parse, compose, and transform hierarchical data. That last point is one I should stress: these tools are designed to work within the hierarchical data. That means that a join can operate to combine two XML data streams at a specific level of the hierarchy. This enables very complex integration tasks to be handled very simply.

Pictured to the left are the Assembly Outline and Assembly Palette. The palette contains each of the tools that can be selected by the user. The user drags any of these tools directly into the outline. Jobs that read an XML document will use the XML_Parser step to extract the fields that are required from the doc. Jobs that write data will use the XML_Composer step to form the new XML. Additionally, a user may use both of these tools within a single assembly in order to transform an XML doc from one form to another - without ever having to convert this data into a relational format.

As you would expect, each step contains a set of properties that are necessary to define the logic for that step. For example, the Sort step requires that the user select the level of the hierarchy that is to be sorted (i.e. what level of the doc is being sorted) and the key field within that scope. Pictured to the right is the Mappings section for the Output step (in this scenario we are converting the XML data into a relational stream). At the output link level, the user specifies the primary scope element from the hierarchy and can then either map each field individually, or use the Auto Map feature, which selects the best candidate column through a context-sensitive scoring algorithm. You can find a much more comprehensive view of these features in our publicly available Information Center.

Engine Runtime Component

Like the user interface, the engine component has been built specifically for processing hierarchical data. Rather than relying on open source parsing technology, which has historically had scalability issues, IBM has engineered a highly performant engine component for this solution. Some of the distinguishing features of this runtime are:

Streaming – Support for any document size without specific memory requirements

Parallelism – Multi-threading employs multiple threads in a single stage; partitioning uses DataStage partition parallelism to run multiple instances of the stage on multiple CPUs/hosts; document partitioning applies partition parallelism to a single large document

Large Object support – Can stream large objects from source Connectors through the XML stage and on to target Connectors

Customers who have already adopted this XML pack have had truly fantastic results. In comparison to using a standard industry parser, one customer found that without even implementing any of the parallelism techniques, they were already able to process the same use case in 1/9th of the time and with a fraction of the resource requirements. Another customer (known for processing very large data) did something of a bake-off by running gigabytes of XML data through both the new XML stage and custom code they had previously written for their use case. While both solutions performed similarly, the DataStage features were compelling since the performance came without having to invest significant time in tuning custom code that only applied to a single solution.

Summary

If you have the need to process large and complex XML, I encourage you to take a look at DataStage 8.5. If you're looking for more information, explore the Information Center link I supplied above or Ernie Ostic's "Real-Time Data Integration" blog. Ernie is an expert resource on several topics and has a wealth of material available on his site related to XML.

No waterfowl or mythological winged creatures were harmed in the writing of this blog

I've been looking forward to writing this blog on the Vertical Pivot feature that we've introduced in DataStage 8.5. While it's not quite as multipurpose as the Looping Transformer and Caching mechanisms that we've previously reviewed, it sure is elegant. Yes, I said 'elegant'.... "cleverly simple; ingenious". I should point out that I'm not tooting my own horn of course, but rather bragging a bit on the development team that worked on making this happen. They built a very simple user design paradigm with the flexibility that is required to make a vertical pivot a rather simple operation.

While I expect most folks reading this will be familiar with the concept of a pivot, let me provide a brief description for those not yet familiar:

restructuring data to move from or to repeating fields on a single record

Repeating Fields and Pivots

A "repeating field" is when a particular data item occurs more than once on any given record. There are a variety of places across the enterprise where repeating fields normally show up. I've seen them on a number of projects dealing with sending files to or receiving files from external parties as well as in dealing with data sources from legacy applications. An example of this might be a set of fund identifiers and fund amounts as in the following table:

ID  Name   fund_id1  fund_amt1  fund_id2  fund_amt2  fund_id3  fund_amt3
1   Smith  ABC       100.00     QQQ       200.22     ZYX       133.33
2   Jones  QQQ       444.44     ZYX       87.93

Each row can store three sets of values for fund_id and fund_amt, but in row 2 we have populated only two sets. The lack of populated data in one or more field sets is a common occurrence with repeating fields.

If we were to perform a horizontal pivot (traditionally the easier of the two pivot types) on this data, we would expect the output to look like the table below. You can see that we move the data from the wider, repeating representation to a taller, narrower one. { At this point, we could spend some time talking about "normalization" and specifically "third normal form" (3NF), but we'll leave those topics for research outside this blog. }

ID  Name   fund_id  fund_amt
1   Smith  ABC      100.00
1   Smith  QQQ      200.22
1   Smith  ZYX      133.33
2   Jones  QQQ      444.44
2   Jones  ZYX      87.93

In the vertical pivot case, we do the opposite by moving this taller, narrower data into a wider form. In order to perform the pivot, there are three basic requirements:

ability to group data by a key or set of keys

restructure the column metadata and data into a repeating group

perform aggregation functions on fields from the repeating group and non-key fields

In this example, we would say that "ID" is the key (#1), and "fund_id" and "fund_amt" are part of the repeating group (#2). Aggregation functions (#3) could apply to fund_amt if, for example, we needed to calculate an average across all funds for an ID. What might be less obvious is that we may also need to perform an aggregation on the "Name" field. If the last row in our narrow table above had a value other than "Jones" (or perhaps a NULL value), we might want to apply the first or max aggregate function to the Name field in order to get a consistent result across all data.
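Outside of DataStage, the core of the operation can be sketched in plain Python; this is illustrative only and not the stage's implementation. It groups on ID, carries the first Name seen, and spreads the fund_id/fund_amt pairs into a fixed-size array of repeating fields:

```python
# Plain-Python sketch of the vertical pivot described above: group by ID, take
# the first Name seen, and spread fund_id/fund_amt pairs into an array of
# repeating fields (array size 3, padding with None when a group is short).
from collections import OrderedDict

def vertical_pivot(rows, array_size=3):
    groups = OrderedDict()
    for r in rows:
        g = groups.setdefault(r["ID"], {"ID": r["ID"], "Name": r["Name"], "funds": []})
        g["funds"].append((r["fund_id"], r["fund_amt"]))
    out = []
    for g in groups.values():
        rec = {"ID": g["ID"], "Name": g["Name"]}
        for i in range(array_size):
            fund = g["funds"][i] if i < len(g["funds"]) else (None, None)
            rec[f"fund_id{i+1}"], rec[f"fund_amt{i+1}"] = fund
        out.append(rec)
    return out

tall = [
    {"ID": 1, "Name": "Smith", "fund_id": "ABC", "fund_amt": 100.00},
    {"ID": 1, "Name": "Smith", "fund_id": "QQQ", "fund_amt": 200.22},
    {"ID": 1, "Name": "Smith", "fund_id": "ZYX", "fund_amt": 133.33},
    {"ID": 2, "Name": "Jones", "fund_id": "QQQ", "fund_amt": 444.44},
    {"ID": 2, "Name": "Jones", "fund_id": "ZYX", "fund_amt": 87.93},
]
for rec in vertical_pivot(tall):
    print(rec)
```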

Building the Vertical Pivot

Pictured to the right is a screen shot of the stage properties that illustrates how to build the scenario I just described. I'll describe this design in 5 steps:

Step 1: on the Properties tab, set the "Pivot Type" to vertical.

Step 2: click the "Load" button in order to choose the set of available fields from the input link that you want considered for this pivot function (ID, Name, fund_id, fund_amt).

Step 3: select/check the fields that should be grouped (ID) or pivoted (fund_id and fund_amt).

Step 4: select the "Aggregation functions required" field to choose one or more of the available functions - first, last, min, max, avg, sum.

Step 5: Enter the number of repeating groups for the output record in the "Array Size" field.

That's it. Basically just a few clicks to solve the vertical pivot challenge. Good job to the dev team for supplying us such an elegant solution. If you have the opportunity to use the new feature, drop me a note - would love to get your opinion.

Also, if you're interested in learning more about DataStage 8.5, I provided an overview of the release to the DSXchange community last week. Dennis James, founder and editor of the 35,000+ developer-strong community, was kind enough to record this session and make it available at the following link. If you're not a community member yet (for this or DSXchange), I'd encourage you to enroll.

Unleashing the Power in DataStage 8.5

Issue 2 - Caching in the Transformer

This is the second in a series of blogs on the new features of InfoSphere DataStage 8.5. In this series I'll provide some detail on the new functionality and explain areas where I think customers will receive significant benefit. The first issue, "Looping Transformer" is available at the following link. This blog will build on some of those concepts, so you may want to review that post before reading this one.

One of the guiding principles for the DataStage roadmap is how we can simplify even the most complicated data integration challenges. When we can achieve that, and make data processing more efficient, it's the best of both worlds. The new cache mechanism serves both of those goals. We'll illustrate this with an example that calculates percentages within a group.

What's a Transformer Cache?

The Transformer Cache is an in-memory storage mechanism that is available from within the Transformer stage and is used to help solve complex data integration scenarios. The cache is a first-in/first-out (i.e. FIFO) construct and is accessible to the developer via two new functions:

SaveInputRecord: stores an input row at the back of the cache

GetSavedInputRecord: retrieves a saved row from the front of the cache

These functions should be called from the stage variable or loop variable sections of the transformer in most cases.

Developers will find the cache most useful when a set of records needs to be analyzed as a single unit and a result of that analysis then appended to each record in the group. Let me describe some scenarios where using a cache will prove VERY helpful:

The input data set is sorted by fund id and valuation date in ascending order. We have an unknown number of records for each fund. The requirement is to output the five most recent valuations for any fund and if there are not at least five, do not output any.

There is a varying number of clients (N) related to each salesperson. The requirement is to label each such client detail record with a label that reads "1 of N".

An input file contains multiple bank accounts for each customer. The requirement is to show the percentage of the total balance for each individual account record.

Perhaps one or more of these sounds familiar to you. I've needed to implement the third scenario a few times myself and have seen many customers do so with a similar design pattern. While that standard pattern was effective, the cache solution will prove more resource efficient. I'll describe this solution below. You may also refer to the Information Server InfoCenter for more detail on this solution.

Percentage of Total Scenario

I've included a picture of the transformer with a few illustrations that will help aid understanding. The logic of this transformer is as follows:

input data is read and stored in the cache via the "SaveInputRecord" function

key breaks are tested via the "LastRowInGroup" and "LastRow" functions

a summing balance is calculated while the group is the same and the total balance field is set at key break

if there is a key break, the "RowsAtBreak" variable is set to the number of records on the cache and we'll skip to step 6

if there is no key break, the Loop Condition will test false (since RowsAtBreak will be 0). This will cause the next input record to be read and processed (note the loop pictured on the top left)

finally, the transformer will loop while "@ITERATION <= RowsAtBreak" and read the cache records (this is the loop pictured on the bottom right)

for each output record processed, the percentage is calculated as Balance/TotalBalance.

Most of this logic will seem fairly intuitive but two items need additional clarification.

In step 5, we noted that RowsAtBreak must be "0" until we are ready to process output data. This will prevent the transformer from entering the Loop construct since @ITERATION is equal to "1" before we enter the loop. Since 1 <= 0 will test false, the transformer will not enter the loop construct and instead will read the second input record. Assuming we then hit a key break, RowsAtBreak will be set to 2 and we will then iterate through the loop twice.

The second item to clarify is a simpler one. In the loop variable we call the GetSavedInputRecord function. When this is called, the output derivations for "DSLink2" redirect to the row pulled from the cache. This allows us to code a single output derivation (for instance "DSLink2.Balance/TotalBalance") which works for either a cached row or one coming directly from the input link.
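If it helps to see the shape of the logic outside the transformer, here is a plain-Python sketch of the same pattern; it mirrors SaveInputRecord/GetSavedInputRecord conceptually but is not DataStage transformer syntax, and the column names are invented:

```python
# Buffer rows until a key break, compute the group total, then replay the
# buffered rows with the percentage appended.
from collections import deque

def percentage_of_total(rows):                 # rows must be sorted by customer_id
    cache, total, out = deque(), 0.0, []
    for i, row in enumerate(rows):
        cache.append(row)                      # analogous to SaveInputRecord
        total += row["balance"]
        last_in_group = (i == len(rows) - 1) or (rows[i + 1]["customer_id"] != row["customer_id"])
        if last_in_group:                      # key break: drain the cache
            while cache:
                saved = cache.popleft()        # analogous to GetSavedInputRecord
                out.append({**saved, "pct_of_total": saved["balance"] / total})
            total = 0.0
    return out

accounts = [
    {"customer_id": "C1", "account": "chk", "balance": 750.0},
    {"customer_id": "C1", "account": "sav", "balance": 250.0},
    {"customer_id": "C2", "account": "chk", "balance": 100.0},
]
for r in percentage_of_total(accounts):
    print(r)
```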

At this point, we've solved scenario 3. You probably can surmise the variations that would be required to solve scenario 1 and 2... they only require a few small tweaks to this model.

Earlier I mentioned that there are other ways to solve these data integration challenges in DataStage. One of the more common design patterns would have required splitting the data down two links, aggregating by ID on one of those links and then joining the data back together to put the ID total balance on each detail record prior to calculating the percentage. In comparison, the cache solution described in this post requires less system resource. Developers familiar with 3GL coding will also likely appreciate the loop condition construct as a more natural way to solve the challenge.

OK, so now here's a challenge for you. I have a hardcopy of the redbook "Deploying a Grid Solution with IBM InfoSphere Information Server" that I'll send to the first person who posts a comment to this thread where they describe a variation on this cache solution that requires one less stage variable (I have one in mind ... perhaps you'll think of another).

Hi. Welcome to my blog. If we haven't had the chance to meet yet, let me introduce myself. I'm Tony Curcio, the Product Manager for DataStage, Information Services Director and the Balanced Optimizer product within Information Server. In this role, I have the chance to meet many of our customers, understand the types of projects they are applying the technology to, receive feedback from their experiences and then use that information to prioritize the product roadmap. It is a very rewarding job - and one that I am very passionate about. In my posts, I hope to provide an inside look at data integration as a general practice and what we are specifically doing at IBM to help our customers.

This will be the first in a series of posts related to some of the new features we've included in InfoSphere DataStage 8.5. As I've had the opportunity to speak with many of our customers who participated in the beta program and are now upgrading, I've been able to hear first hand some of the "oohs and aahs" when folks begin working with these new features.

In this series I'll provide some detail on the new functionality and explain areas where I think customers will receive significant benefit. I'll be very interested in hearing back on this blog (or privately - see my contact details) on other scenarios where you are benefitting from these enhancements.

Looping Transformer - The One Inch Punch

I'll quote a bit of wikipedia here for those not familiar with the One Inch Punch....

The one inch punch is a skill which uses fa jin (translated as explosive power) to generate tremendous amounts of impact force at extremely close distances.

I like that definition for the Looping Transformer for a couple of reasons.

Extremely close distances - is very applicable since this is actually built into the Transformer Stage that developers have been using for years, so you can begin using it in any existing job and there are only a few new things to know

Explosive power - this feature allows very complex data integration challenges to be solved elegantly from a design experience, and efficiently from a resource perspective

One inch - well, that's about the size of the stage on the canvas (sorry, I'm likely to insert corny jokes from time to time)

Last week at the DataStage Deep Dive session at the Information OnDemand conference I presented a few scenarios where the looping transformer moves certain patterns from complex to simple. I'll review my favorite one in this post (one of the others will be included in my next post on "Caching in the Transformer").

Variable Length Records with Embedded Payloads

The following example is derived from a customer I was speaking with last month. They had a very interesting challenge related to a variable-length string that included multiple record types and payloads embedded in the data. Here's a sample record that I drew up:

ID  String
1   A005aaaaaA005bbbbbB010cccccccccc
2   ....

You can see there is a series of record types (the first being "A"), payload lengths ("005") and payloads ("aaaaa"). In the output, the customer wanted to convert that data to the following:

ID  Rcd Type  Data Record
1   A         aaaaa
1   A         bbbbb
1   B         cccccccccc
2   …         …

What makes this a challenging problem is the fact that the length is defined in the data and the number of segments can vary tremendously (record 2 may have 100 payloads in that string). In DataStage 8.5, the looping transformer handles this very easily by simply introducing a loop condition. Here's the transformer logic for solving this:

If you are a DataStage user, you'll probably recognize most of what's on the screen and then double-take on the fact that there's this new "loop condition" section on the canvas. You'll notice there is a loop test (which is a "while" condition) that controls how many times the output logic should be iterated over. The circled logic includes a new variable named "@ITERATION", which is a counter indicating which pass through the loop this is. One other item that will appear new is the Loop variables section - basically the same as Stage variables, but these get evaluated each time through the loop. The rest of this logic is old hat ... basic substringing and calculations to delineate the record payloads. The test for RemainingRecord <> "" allows us to exit the loop when all bytes in the string have been consumed.
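For readers who think more naturally in code than in stage variables, here is a plain-Python version of the same loop; it is only an illustration of the parsing pattern, not transformer syntax:

```python
# Walk the string: read a one-character record type and a three-digit payload
# length, slice off the payload, and repeat until the remainder is empty.
def split_payloads(record_id, s):
    out, remaining = [], s
    while remaining != "":
        rcd_type = remaining[0]
        length = int(remaining[1:4])
        payload = remaining[4:4 + length]
        out.append((record_id, rcd_type, payload))
        remaining = remaining[4 + length:]
    return out

for row in split_payloads(1, "A005aaaaaA005bbbbbB010cccccccccc"):
    print(row)
```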

You can see this is a very intuitive way to solve this particular issue, and it incrementally adds only a single loop condition. It avoids the several other transformers and funnels the customer is using in their current implementation. The savings therefore apply not only to the initial design experience and runtime performance, but also to the ongoing maintenance of this job as related requirements in the organization change.

As I mentioned earlier, if after you review this info you recognize an opportunity to optimize a similar data challenge in your enterprise, I'd enjoy discussing it.

If you are interested in learning more about Information Server or DataStage 8.5, I'd recommend a few items to read: