Inside Oracle's Data Integration community

Thursday Aug 29, 2013

Carrying on from the blog post on performing file transformations within ODI using nXSD, in this post I will show how a large file can be debatched into smaller chunks/shards. The example uses a COBOL data file as input and debatches this file into parts using the debatching capability of the adapter.

COBOL is a great example of a complex data source for Hadoop and Big Data initiatives, since there are plenty of systems that companies wish to unlock potential gems from. It's big data, and there are tonnes of it! The ODI file transformer tool on java.net has been extended with an optional parameter to define the number of rows in each chunk/shard. This parameter lets you take control of the relative size of the files being processed - perhaps your platform has characteristics that work better with some sizes than others. Hadoop has challenges with millions of small files and with very large ones, so being able to prepare the data is useful.

The COBOL copybook was used as input to the Native Format Builder, which generated an nXSD for it. I used the following copybook:

02  DTAR020.
    03  DTAR020-KCODE-STORE-KEY.
        05  DTAR020-KEYCODE-NO    PIC X(08).
        05  DTAR020-STORE-NO      PIC S9(03)     COMP-3.
    03  DTAR020-DATE              PIC S9(07)     COMP-3.
    03  DTAR020-DEPT-NO           PIC S9(03)     COMP-3.
    03  DTAR020-QTY-SOLD          PIC S9(9)      COMP-3.
    03  DTAR020-SALE-PRICE        PIC S9(9)V99   COMP-3.

The data file came from an example on the web. An example of the command executed with the ODIFileTransformer tool is shown below; the SHARDROWS parameter defines the number of rows to be written to each data file. The tool simply appends an integer index to the end of the output file name - a little basic, I know; the source is on java.net if you feel like being creative.

Executing this generates many output files with, for example, 2 rows in each. This is a contrived example just to illustrate the 2 rows in a generated file:

<ROOT xmlns="http://TargetNamespace.com/CBL">
  <DTAR020>
    <DTAR020-KCODE-STORE-KEY>
      <DTAR020-KEYCODE-NO>69684558</DTAR020-KEYCODE-NO>
      <DTAR020-STORE-NO>+020</DTAR020-STORE-NO>
    </DTAR020-KCODE-STORE-KEY>
    <DTAR020-DATE>+0040118</DTAR020-DATE>
    <DTAR020-DEPT-NO>+280</DTAR020-DEPT-NO>
    <DTAR020-QTY-SOLD>+000000001</DTAR020-QTY-SOLD>
    <DTAR020-SALE-PRICE>+00000019.00</DTAR020-SALE-PRICE>
  </DTAR020>
  <DTAR020>
    <DTAR020-KCODE-STORE-KEY>
      <DTAR020-KEYCODE-NO>69684558</DTAR020-KEYCODE-NO>
      <DTAR020-STORE-NO>+020</DTAR020-STORE-NO>
    </DTAR020-KCODE-STORE-KEY>
    <DTAR020-DATE>+0040118</DTAR020-DATE>
    <DTAR020-DEPT-NO>+280</DTAR020-DEPT-NO>
    <DTAR020-QTY-SOLD>-000000001</DTAR020-QTY-SOLD>
    <DTAR020-SALE-PRICE>-00000019.00</DTAR020-SALE-PRICE>
  </DTAR020>
</ROOT>
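The debatching idea itself is straightforward. The sketch below is illustrative Java, not the tool's actual source; the method and parameter names (shard, shardRows, outPrefix) are my own, but it mirrors the described behavior: write a fixed number of rows per output file and suffix each file name with an integer index.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ShardSketch {
    // Write the input rows into shard files of at most shardRows rows each,
    // appending an integer index to the output name (out.xml0, out.xml1, ...).
    static int shard(List<String> rows, String outPrefix, int shardRows) throws IOException {
        int index = 0;
        for (int i = 0; i < rows.size(); i += shardRows) {
            List<String> chunk = rows.subList(i, Math.min(i + shardRows, rows.size()));
            Files.write(Paths.get(outPrefix + index), chunk);
            index++;
        }
        return index; // number of shard files written
    }
}
```

Five input rows with shardRows=2 would yield three files, the last holding a single row.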

This is a small extension to the earlier version of the tool and provides a useful debatching capability. The ZIP containing the tool is on the java.net site; see the earlier post for background details. ...and then what? Well, that would be telling.

Monday Aug 26, 2013

ODI 11g introduced the complex file technology, which supports defining a parser for data files and plugging it into ODI. I have created an ODI Open Tool (ODIFileTransformation) that utilizes part of this capability just to do native-file-to-XML-file transformation - so just file to file, based on the nXSD parser definition. The native input file (such as COBOL or whatever) is parsed based on the nXSD definition, and the resultant XML is streamed directly (to disk, a pipe, or wherever). By the way, it's very easy to extend ODI in a number of ways; adding a tool is an easy way to integrate existing capabilities.

Why? It gives us different opportunities for consuming the parsed data, which is now in a convenient form (XML). As an illustration, I processed a 200MB data file and the resultant XML file was generated in 120 seconds, whereas using the complex file technology directly in ODI and staging the same native data file in a database took 9000 seconds. There is no staging with the tool; it simply reads the data, transforms it using the nXSD, and produces the generated XML. So you could write the resultant file to a disk such as HDFS and then process it using Hadoop.

The tool has a simple signature and is useful for performing file-to-file transformation from the command line or from within a package. It was based on a test tool that was blogged here, which I changed to remove the XML DOM nature so that it streams to the result file.
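To illustrate the streaming point: rather than building a DOM for the whole output document, records can be written straight to an XML stream as they are read. The sketch below uses StAX (javax.xml.stream) and is purely illustrative - it is not the tool's source, and only a couple of the sample's element names are shown.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StreamTransformSketch {
    // Stream parsed records straight to XML; nothing is held in memory
    // beyond the current record, unlike a DOM-based approach.
    static String toXml(List<String[]> records) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        w.writeStartElement("ROOT");
        for (String[] rec : records) {
            w.writeStartElement("DTAR020");
            w.writeStartElement("DTAR020-KEYCODE-NO");
            w.writeCharacters(rec[0]);
            w.writeEndElement();
            w.writeStartElement("DTAR020-STORE-NO");
            w.writeCharacters(rec[1]);
            w.writeEndElement();
            w.writeEndElement();
        }
        w.writeEndElement();
        w.close();
        return out.toString();
    }
}
```

In the real tool the writer would wrap a file stream rather than a StringWriter, which is what keeps memory flat for large inputs.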

The tool is on the java.net site. To install it, you will need the Java Mail JAR (download here) and the tool itself (download the ZIP containing the JAR and source here). Copy the JARs to the ODI userlib directory, then restart the client and register the open tool. More options could be added to the tool, including debatching into chunked files. Hopefully there are some useful bits and pieces here.

Friday Aug 23, 2013

Today’s 24/7/365, fast-paced world brought the need for high data availability, and with that came redundant, distributed copies of data. In our complex IT environments, maintaining data consistency across these distributed systems is unfortunately not a trivial task. Data discrepancies occur for many reasons, including database configuration differences between source and target, configuration errors during data migration, user errors, gaps in data replication, and replication lag. Even when using products that replicate data reliably, such as Oracle GoldenGate, there remain potential causes of data discrepancy.

When data discrepancies occur, businesses can see the impact in the form of incorrect decision-making; failed service-level agreements; and ultimately, operational, financial, and legal risk.

Deployed together with the Oracle GoldenGate real-time data replication product or separately, Oracle GoldenGate Veridata ensures data consistency is maintained across databases. The software enables periodic checks, as frequently as desired, between heterogeneous source and target databases without taking either system offline.

Some of the benefits of the product include:

Reduced risk exposure. Knowing about data inconsistencies can help address issues before they impact business operations. Oracle GoldenGate Veridata identifies data discrepancies so that corrective measures can be taken quickly.

Improved IT productivity. Oracle GoldenGate Veridata's automated data comparison capabilities reduce the time and resources required for data validation. Graphical illustrations and detailed reports clearly convey the key results and specific data disparities so that IT can take immediate corrective action. If there are no inconsistencies, users can document the accuracy of their data for regulatory compliance reporting needs.

Continuous operation during validation. With its ability to compare in-flight transactions, Oracle GoldenGate Veridata supports databases that are in use, without interrupting business operations. Combining a light footprint and groundbreaking technology, the application imposes minimal impact on technology infrastructure.

Thursday Aug 15, 2013

Data quality is a hot topic that may not get covered much on our data integration blog [we’re now changing that!]. Oracle Enterprise Data Quality is an essential element in the data integration portfolio. You can think of Oracle Enterprise Data Quality as your favorite one-size-fits-all Swiss Army multitool, which can be applied to any number of data management and governance situations. These range from master data management, application integrations, and system migrations to BI and data warehousing and, finally, governance, risk, and compliance solutions. Let’s look at an example of each case:

Master your Data. No matter the type of information you are mastering and syncing—product, customer, supplier, employee, site, citizen, or any other type—you need a data quality process for initial cleanup and load as well as for ongoing duplicate prevention and governance. A sophisticated MDM program will even verify data as it is entered into spoke systems, preventing data quality problems at the source. For some organizations, the idea of implementing a data hub is just too daunting. Implementing an enterprise data quality (EDQ) solution to clean up legacy systems and drive consistent standards won’t deliver as much benefit as a true MDM solution, but it’s a cheaper, faster alternative that can help in the short term and pave the way for a more comprehensive MDM strategy down the road.

Drive Application Innovations. Business applications are only as useful as the data they present. The more complex your application landscape, the more likely you are to need a data quality program to drive consistent data across all your systems. EDQ solutions can clean up legacy systems and put preventative verification and governance processes in place to ensure a steady stream of high-quality data for all kinds of systems, such as customer relationship management, human resources, product lifecycle management, and search (especially for solutions based on Oracle Endeca, which thrives on well-structured, well-standardized data).

Simplify Migrations. Merging disparate data from many sources requires a disciplined approach to profiling, discovery, standardization, match, merge, case management, and governance. EDQ solutions pave the way to a smooth implementation by providing the tools you need to make sure you maintain data quality as you migrate systems.

Improve Business Insights. As with other business applications, BI is a case where putting garbage data in means you’ll get garbage data out. EDQ solutions can be used to standardize and deduplicate the data being loaded into a data warehouse.

Deliver on Compliance. Regulatory compliance of all kinds—including policies related to taxes, privacy, anti-terror, and anti-money-laundering—requires matching up data pulled from a variety of sources. With EDQ solutions, organizations can meet regulatory mandates with capabilities that support everything from simple deduplication of customer lists to matching data against government lists of suspected terrorists.

You can learn more about our Enterprise Data Quality multitool in our upcoming webcast! Unlike Swiss Army knives, it is guaranteed never to rust or stop you at an airport metal detector.

Tuesday Aug 06, 2013

There are various approaches to organizing your Groovy scripts and packaging up utilities. This one exploits a little-known fact about the Groovy support in ODI Studio. The code executed has some contextual variables available that you may be aware of, such as the odiInstance variable, which represents a handle to the repository in the SDK. There is another contextual variable, 'this', which happens to be the Groovy shell reference - very useful for executing other scripts from. For example, the command below executes a Groovy script from the filesystem:

this.evaluate(new File("d:\\accelerators\\odi_create_model.groovy"))

This is a useful approach if you want to execute your scripts from a launchpad or quick-start/accelerator control. The image below shows a launchpad with buttons for various activities, one of which is the create model accelerator Groovy script I created (this was blogged here).

When the '1. Create Model' button is pressed, the Groovy script that creates the model is executed. This can do whatever you want; below is what my Groovy script does to create a model:

Each button in the accelerator launchpad is defined as below; the script representing the action is evaluated when the button is pressed. It's quite straightforward:

button(action: action(name: '1. Create Model', closure: {
    // gshell is the Groovy shell reference captured from the script's 'this'
    gshell.evaluate(new File("d:\\accelerators\\odi_create_model.groovy"))
}))

You can quite easily invoke your company's specific accelerator routines with this approach.

Monday Aug 05, 2013

Here we will see how a parser is defined to project a schema based on a data file with many records on a single line. The complex file adapter provides a plethora of capabilities and is probably one of those under-utilized and little-known components - and I think it's one where the more examples provided, the better! In the example I will show, a department is defined on every line, and each department can have many employees (as many as you want) defined on the same line. Replace department and employee with your own concepts: buyers, sellers, or whatever. We define the parser in an NXSD file, which is based on standard XSD with additional information defining how the data is parsed and projected as XML. If you know XSD, you are in business; if not, with a little knowledge you can become very dangerous - and you can always resort to the examples in the Oracle documentation.

I want to define a parser that creates a department for each line (ODI is a department and OWF is a department; it's a 3-character string) and also has employees (David, Bob, etc., a 7-character string) with their job roles (ARCHITECT, DEVELOPER, etc., a 9-character string) and the department they are in.

My employee records happen to be fixed length, based on the length of the name and the role. There can be many of them, so they are in an unbounded array, and the array is terminated by an EOL character. When I reverse-engineer this into ODI, I get datastores for DEPARTMENTS, DEPARTMENT and EMPLOYEE; EMPLOYEE has columns Name and Role, and DEPARTMENT has column Name.
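As a sketch of what such an nXSD could look like: the fixed-length fields use nxsd:style="fixedLength" with nxsd:length, and the unbounded employee array is terminated by the end-of-line marker. The exact namespaces, element names, and lengths below are illustrative assumptions, not taken verbatim from a generated file; check a Native Format Builder-generated nXSD for the precise form.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:nxsd="http://xmlns.oracle.com/pcbpel/nxsd"
            targetNamespace="http://TargetNamespace.com/DEPTS"
            elementFormDefault="qualified"
            nxsd:version="NXSD" nxsd:stream="chars">
  <xsd:element name="DEPARTMENTS">
    <xsd:complexType>
      <xsd:sequence>
        <!-- one DEPARTMENT per line of the file -->
        <xsd:element name="DEPARTMENT" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <!-- 3-character department name, fixed length -->
              <xsd:element name="Name" type="xsd:string"
                           nxsd:style="fixedLength" nxsd:length="3"/>
              <!-- unbounded employee array, terminated by end of line -->
              <xsd:element name="EMPLOYEE" maxOccurs="unbounded"
                           nxsd:style="array" nxsd:arrayTerminatedBy="${eol}">
                <xsd:complexType>
                  <xsd:sequence>
                    <!-- 7-character employee name -->
                    <xsd:element name="Name" type="xsd:string"
                                 nxsd:style="fixedLength" nxsd:length="7"/>
                    <!-- 9-character job role -->
                    <xsd:element name="Role" type="xsd:string"
                                 nxsd:style="fixedLength" nxsd:length="9"/>
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```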

I can define this NXSD in my favorite editor and develop and test the parsing using ODI. This avoids installing and using JDeveloper and gives you a shortcut to quickly developing and building parsers with this capability. Previous blogs on the topic are here and here.