Sign up to receive free email alerts when patent applications with chosen keywords are publishedSIGN UP

Abstract:

Methods, systems, and computer program products for generating code from
a data flow associated with an extract, transform, and load (ETL)
process. In one implementation, the method includes identifying a data
exchange requirement between a first operator and a second operator in
the data flow. The first operator is a graphical object that represents a
first data transformation step in the data flow and is associated with a
first type of runtime engine, and the second operator is a graphical
object that represents a second data transformation step in the data flow
and is associated with a second type of runtime engine. The method
further includes generating code to manage data staging between the first
operator and the second operator in the data flow. The code exchanges
data from a format associated with the first type of runtime engine to a
format associated with the second type of runtime engine.

Claims:

1. A computer-implemented method for generating code from a data flow
associated with an extract, transform, and load (ETL) process, the method
comprising: identifying a data exchange requirement between a first
operator and a second operator in the data flow, the first operator being
a graphical object that represents a first data transformation step in
the data flow and being associated with a first type of runtime engine,
the second operator being a graphical object that represents a second
data transformation step in the data flow and being associated with a
second type of runtime engine; and generating code to manage data staging
between the first operator and the second operator in the data flow
associated with the ETL process, the code to exchange data from a format
associated with the first type of runtime engine to a format associated
with the second type of runtime engine.

2. The method of claim 1, wherein identifying a data exchange requirement
between a first operator and a second operator in the data flow includes:
receiving user input inserting a data station operator into the data flow
to link the first operator and the second operator, the data station
operator representing a staging point in the data flow operable to
exchange data from the format associated with the first type of runtime
engine to the format associated with the second type of runtime engine.

3. The method of claim 2, wherein data station operator is operable to
exchange data from a format associated with the first type of runtime
engine to a format associated with the second type of runtime engine in a
same single job.

4. The method of claim 3, wherein: the first type of runtime engine is a
runtime engine operable to process structure query language (SQL) based
operators; and the second type of runtime engine is a runtime engine
operable to process non-SQL based operators.

5. The method of claim 4, wherein: the first type of runtime engine is a
database server; and the second type of runtime engine is a DataStage ETL
engine.

6. The method of claim 2, wherein the data station operator includes one
or more of the following attributes: a data station type that specifies a
format of data staged within the staging point, a pass through flag
indicating if the data station operator can be ignored, a name indicating
a staging table name or a staging file name, a data station lifetime
specifying a time when the data station object is removable from the data
flow, or a performance hint including pre-determined information to
improve performance of execution of the data flow.

7. The method of claim 6, wherein the format of data staged within the
staging point is one of a temporary database table, a permanent database
table, a database view, a JDBC (Java Database Connectivity) result set,
or a flat file.

8. The method of claim 1, wherein identifying a data exchange requirement
between a first operator and a second operator in the data flow includes:
automatically inserting a data station operator between the first
operator and the second operator without user input, the data station
operator representing a staging point in the data flow operable to
exchange data from the format associated with the first type of runtime
engine to the format associated with the second type of runtime engine.

Description:

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a divisional of co-pending U.S. patent
application Ser. No. 11/621,521, filed Jan. 9, 2007. The aforementioned
related patent application is herein incorporated by reference in its
entirety.

FIELD OF THE INVENTION

[0002] The present invention relates generally to data processing, and
more particularly to modeling data exchange in a data flow associated
with an extract, transform, and load (ETL) process.

BACKGROUND OF THE INVENTION

[0003] Extract, transform, and load (ETL) is a process in data warehousing
that involves extracting data from outside sources, transforming the data
in accordance with particular business needs, and loading the data into a
data warehouse. An ETL process typically begins with a user defining a
data flow that defines data transformation activities that extract data
from, e.g., flat files or relational tables, transform the data, and load
the data into a data warehouse, data mart, or staging table. A data flow,
therefore, typically includes a sequence of operations modeled as data
flowing from various types of sources, through various transformations,
and finally ending in one or more targets, as described in U.S. patent
application entitled "Classification and Sequencing of Mixed Data Flows"
incorporated by reference above. In the course of execution of a data
flow, data sometimes needs to be exchanged or staged at intermediate
points within the data flow. The staging of data typically includes
saving the data temporarily either in a structured physical storage
medium (such as in a simple file) or in database temporary tables or
persistent tables. In some cases, it may be optimal to save rows of data
in the processing program's memory itself, especially when large and fast
caches are present in the system (such "staging" is often referred to as
"caching").

[0004] ETL vendors conventionally support data exchange and staging
internally inside of an ETL engine in a proprietary fashion, especially
if the ETL engine is running outside of a relational database. For
example, the DataStage ETL engine permits users to build "stages" of
operations--i.e., discrete steps in the transformation sequence--and
physically move rows between different stage components in memory. (Note:
The term "stage" as used in the context of the DataStage engine--does not
refer to the concept of saving rows to a physical media, but rather to
unique operational steps). This method, typically allows for some types
of performance optimizations; however, the rows of data being moved
between the different stages are usually in an internal format (stored in
internal memory formats in buffer pools) and the only way a user can view
the rows of data is to explicitly define a File Target (or a Table
Target) in the data flow and force the rows of data to be saved into a
file (or a table)--i.e., only the target of such a data flow can
physically export the rows into a user recognizable format.

[0005] Accordingly, a common problem of conventional data exchange and
staging techniques is that users are not able to specify staging points
explicitly and directly in the middle of a data flow, but only as the end
of a transformation sequence using target operators. Target operators
typically do not serve as an exchange operator--since target operators
are destinations. For example, if a user needs to extract rows from a SQL
(structured query language) table and pass the rows as input to another
type of system which requires a file as input, then the user would have
to represent such a process with a first job--as a Table Source operation
followed by a File Target or Export operation having a specific file
name. The user would then have to schedule a second (separate) job to
invoke an operation that uses the file as input.

BRIEF SUMMARY OF THE INVENTION

[0006] In general, this specification describes methods, systems, and
computer program products for generating code from a data flow associated
with an extract, transform, and load (ETL) process. In one
implementation, the method includes identifying a data exchange
requirement between a first operator and a second operator in the data
flow. The first operator is a graphical object that represents a first
data transformation step in the data flow and is associated with a first
type of runtime engine, and the second operator is a graphical object
that represents a second data transformation step in the data flow and is
associated with a second type of runtime engine. The method further
includes generating code to manage data staging between the first
operator and the second operator in the data flow associated with the ETL
process. The code exchanges data from a format associated with the first
type of runtime engine to a format associated with the second type of
runtime engine.

[0007] Particular implementations can include one or more of the following
advantages. In one aspect, a data station operator is provided that can
be inserted into a data flow of an ETL process, in which the data station
operator represents a staging point in a data flow. The staging is done
to store intermediate processed data for the purpose of tracking,
debugging, ease of data recovery, and optimization purposes. In one
implementation, the data station operator also permits data exchange
between two linked operators that are incompatible in a same single job.
Relative to conventional techniques that requires two separate jobs to
perform a data exchange between two operators that are incompatible, it
is more optimal to use one single job that encompasses both systems,
especially if the job is run in parallel and in batches--e.g., if
upstream producers and downstream consumers work in sync in a parallel
and batch driven mode, the end performance is better.

[0008] The details of one or more implementations are set forth in the
accompanying drawings and the description below. Other features and
advantages will be apparent from the description and drawings, and from
the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 illustrates an extract, transform, and load (ETL) system in
accordance with one implementation of the invention.

[0010]FIG. 2 is a block diagram of a data processing system including a
code generation system in accordance with one implementation of the
invention.

[0011]FIG. 3 is a block diagram of a data flow in accordance with one
implementation of the invention.

[0012]FIG. 4 is a flow diagram of a method for inserting a data station
operator into a data flow in accordance with one implementation of the
invention.

[0013]FIG. 5 is a flow diagram of method for processing a data flow in
accordance with one implementation of the invention.

[0014]FIG. 6 illustrates an example data flow including operators
associated with two different runtime engines in accordance with one
implementation of the invention.

[0015]FIG. 5 illustrates an example logical operator graph mapped to a
query graph model (QGM) in accordance with one implementation of the
invention.

[0016]FIG. 6 illustrates an example logical operator graph including
operators associated with two different types of runtime engines in
accordance with one implementation of the invention.

[0017]FIG. 7 illustrates an example query graph model (QGM) graph chunked
into several corresponding sub-graphs in accordance with one
implementation of the invention.

[0018]FIG. 8 is a block diagram of a data processing system suitable for
storing and/or executing program code in accordance with one
implementation of the invention.

[0019] Like reference symbols in the various drawings indicate like
elements.

DETAILED DESCRIPTION OF THE INVENTION

[0020] The present invention relates generally to data processing, and
more particularly to modeling data-exchange in a data flow associated
with an extract, transform, and load (ETL) process. The following
description is presented to enable one of ordinary skill in the art to
make and use the invention and is provided in the context of a patent
application and its requirements. The present invention is not intended
to be limited to the implementations shown but is to be accorded the
widest scope consistent with the principles and features described
herein.

[0021] FIG. 1 illustrates an extract, transform, and load (ETL) system 100
according to one implementation. The ETL system 100 includes a database
server 102 that acts as an ETL engine to integrate data (e.g., from data
sources A, B, . . . N) through an extract phase, a transform phase, and a
load phase. The extract phase includes extracting data from source
systems (e.g., from data sources A, B, . . . N). Most data warehousing
projects consolidate data from different source systems. Each separate
source system may also use a different data organization/format. Common
data source formats include, for example, relational databases and flat
files, and include non-relational database structures such as IMS. The
extract phase includes converting the data into a format for
transformation processing. The transform phase applies a series of rules
or functions to the extracted data to derive the data to be loaded. The
load phase loads the data into a data warehouse (e.g., data warehouse
104). Data integration typically begins with a user describing a data
flow of an ETL process using a UI (user interface) tool. A data flow
represents a logical transformation and flow of data. A code generation
system generates code from the data flow, which generated code is then
sent to the database server 102 for execution.

[0022]FIG. 2 illustrates a data processing system 200 in accordance with
one implementation of the invention. The data processing system 200 can
comprise the IBM DB2 Data Warehouse Edition (DWE) product available from
International Business Machines Corporation of Armonk, N.Y. Data
processing system 200 includes input and output devices 202, a programmed
computer 204, and a storage device 206. Input and output devices 202 can
include devices such as a printer, a keyboard, a mouse, a digitizing pen,
a display, a printer, and the like. Programmed computer 204 can be any
type of computer system, including for example, a workstation, a desktop
computer, a laptop computer, a personal digital assistant (PDA), a cell
phone, a network, and so on.

[0023] Running on the programmed computer 204 is an integrated development
environment 208. The integrated development environment 208 is a software
component that assists users (e.g., computer programmers) in developing,
creating, editing, and managing code for target platforms.

[0024] In one implementation, the integrated development environment 208
includes code generation system 210 that (in one implementation) is
operable to generate code to manage data exchange and data staging within
a sequence of operations defined in a data flow of an ETL process, as
discussed in greater detail below. In one implementation, the code
generator 210 generates code using techniques as described in U.S. patent
application entitled "Classification and Sequencing of Mixed Data Flows,"
Ser. No. 11/372,540, filed on Mar. 10, 2006 (the '540 application), which
is incorporated by reference above.

[0025] In operation, a data flow 212 (e.g., an ETL data flow) is received
by the code generation system 210, and the data flow 212 is converted by
the code generation system into a logical operator graph (LOG) 214. The
logical operator graph 214 is a normalized, minimalist representation of
the data flow 212 that includes logical abstract collection of operators
(including, e.g., one or more of a splitter operator, join operator,
filter operator, table extract operator, bulk load operator, aggregate
operator, and so on). In some implementations, all of the contents of the
data flow 212 may be used "as-is" by the code generation system 210 and,
therefore, the logical operator graph 214 will be the same as the data
flow 212. The code generation system 210 converts the logical operator
graph 214 into a query graph model (QGM graph) 216. The QGM graph 216 is
an internal data model used by the code generation system 210 for
analysis and optimization processes, such as chunking (in which a subset
of a data flow is broken into several pieces to improve performance) and
execution parallelism (in which disparate sets of operations within a
data flow are grouped and executed in parallel to yield better
performance). After analysis, the QGM 216 is converted into an extended
plan graph 218. The extended plan graph 218 represents code generated by
the code generation system 210 and is sent to a runtime engine (e.g., an
ETL engine) for execution.

[0026] In one implementation, the integrated development environment 208
includes a data flow graphical editor (not shown) that enables users to
build data flows (e.g., data flow 212). In one implementation, the data
flow graphical editor provides a new operator--i.e., a data station
operator--that a user can directly drag and drop into a data flow to link
a preceding ("upstream") operator and one or more subsequent
("downstream") operators, which data station operator specifies a data
staging point in the data flow. In general, operators are represented in
a data flow as graphical objects. In one implementation, the data station
operator can be used as a link between a first operator (or operation)
associated with first runtime engine (e.g., a relational database
management system) and a second operator associated with a second runtime
engine (e.g., a DataStage ETL engine). FIG. 3 illustrates the data flow
212 including a data station operator 300 linking an upstream operator
302 to a downstream operator 304. In one implementation, staging refers
to writing data to a disk, and permits a user to store intermediate data
at various points of a data flow in a relational table, flat file or a
view. A staging point indicates a location on the data flow where the
data is staged. In one implementation, a staging point occurs on an
operator's output port.

[0027] In one implementation, the code generation system 210 is operable
to automatically place individual data station operators (into a sequence
of operations defined by a data flow) whenever a data exchange
requirement is identified during the code generation process. In one
implementation, the identification and insertion of the data
exchange/staging points are seamless to the end user. Accordingly, in
such an implementation, the code generation system 210 is operable to
automatically generate code that manages data staging and data exchange
on an ETL system that is capable of integrating various data processing
runtime engines. For example, if a particular runtime engine can work
with flat files as well as database tables, depending on certain
optimization considerations, there may not be an exchange necessary, or
if flat files are determined to be processed faster, then a file staging
from an upstream operation (e.g., associated with a relational database
engine) may be decided by the code generation system 210 to be more
appropriate--or a decision could be made based on current system loads. A
dynamic decision (based on various cost-benefit analyses) on whether a
data station operator is required, may be best decided by the code
generation system 210. In such cases, any suitable cost-benefit criteria
can be implemented. In some cases, however, (expert) users or database
administrators may have better knowledge than the code generation system
210 because of an understanding of expected data and expected system
stress, e.g., when data is range partitioned and the administrator would
be aware of which particular database partition nodes would be stressed.
In such cases, it may be more appropriate for a user to explicitly
override any staging options automatically selected by the code
generation system 210 (or for a user to explicitly define a different
staging format when the code generation system 210 does not add one by
default).

[0028] Accordingly, unlike a conventional system in which a user must
represent data staging using two or more jobs in order to exchange data
from one runtime format (e.g., database table) to another runtime format
(e.g., a flat file) in a data flow, the data processing system 200
permits a user to exchange data from one system-format to another in the
same single job through the data station operator. Users can, therefore,
use such data stations to explicitly identify points of staging or
exchange interest, e.g., for diagnostics, for performance improvements,
or for overriding any default choices made by the code generation system
210.

[0029]FIG. 4 is a flow diagram illustrating a computer-implemented method
400 for inserting a data station operator into a data flow (e.g., data
flow 212) in accordance with one implementation of the invention. The
method 400 begins with a data processing system (e.g., data processing
system 200) receiving user input inserting a first operator (or
operation) associated with a first type of runtime engine into a data
flow (e.g., a data flow 212) (step 402). There are various types of
runtime engines that may be used to process ETL operations--e.g., a
relational database engine or a DataStage ETL engine among others. In
addition, there are many different types of operators representing
corresponding ETL operations (such as structured query language
operations and ETL DataStage operations) that can be inserted by a user
into a data flow. Some operators associated with a relational database
engine include, for example, a table extract operator, a join operator, a
de-duplicate operator, a bulk load table operator, a file target
operator, and so on. Some operators associated with a DataStage ETL
engine include, for example, a file extract operator, a filter operator,
and so on.

[0030] User input is received inserting a second operator associated with
a second type of runtime engine into the data flow (step 404). In one
example, the first operator can be associated with a relational database
engine and the second operator can be associated with a DataStage ETL
engine. The first operator and the second operator can be transform
operators that represent data transformation steps in the data flow. User
input is received inserting a data station operator (e.g., data station
operator 300) into the data flow between the first operator and the
second operator to link the first operator and the second operator (step
406). Thus, the data processing system permits the user to explicitly add
a data staging operator into a data flow, in which the data staging
operator exchanges data from a format associated with the first runtime
engine into a format associated with the second runtime engine in a same
single job.

[0031]FIG. 5 illustrates 5 a method 500 for processing a data flow in
accordance with one implementation of the invention. The method 500
begins with a code generation system (e.g., code generation system 210)
receiving a data flow defining a sequence of operations for an ETL
process (step 502). A data flow represents a logical transformation and
flow of data, and is typically built based on user input. For example,
versions of the IBM DB2 Data Warehouse Edition (DWE) product have a data
flow graphical editor that enables users to build data flows. The data
flow can include one or more data stations operators to link
corresponding upstream operators and downstream operators within the data
flow, as discussed above. The data flow is converted to a logical
operator graph (LOG) (e.g., logical operator graph 214) by the code
generation system (step 304). As discussed above, (in one implementation)
the logical operator graph (LOG) is a minimalist representation of the
data flow and includes an abstract collection of operators. The logical
operator graph is converted (e.g., by code generation system 210) into a
query graph model (QGM) (e.g., QGM graph 216) (step 306). As discussed
above, the query graph model is an internal data structure used by the
code generation system for analysis and optimization purposes. In one
implementation, the code generation system automatically inserts one or
more additional data station operators into the sequence of operations of
the ETL process based on a cost-benefit analysis. In another
implementation, the code generation system automatically inserts one or
more data station operators into the sequence of operations of the ETL
process based on pre-determined criteria.

[0032] Pre-determined criteria upon which the code generation system (or a
user) may decide to insert a data station operator into a data flow
include, for example, criteria associated with optimization, error
recovery and restart, diagnostics and debugging, and cross-system data
exchanges. With regard to optimization, intermediate (calculated) data
may be staged to avoid having to perform the same calculation multiple
times, especially in cases where the output of a single upstream
operation is required by multiple downstream operations. Even when there
is only one downstream consumer of the output data of a given operation,
it may be prudent to stage rows of the output data, especially to a
physical storage, in order to either free up memory or avoid stressing an
execution system (for example, to avoid running out of database log
space). With respect to error recovery and restart, in complex systems,
errors during the execution of data flows may occur either due to bad
(dirty) data which may cause database inconsistencies, or fatal errors
due to software failures, power loss etc. In many cases, manual
intervention is required to bring databases and other systems back to a
consistent state. Thus in one implementation, the code generation system
(or user) inserts data station operators at specific consistency check
points in the data flow, so that staging can be performed on intermediate
results in a physical media (for example, in database persistent tables
or files). Accordingly, restarts (either manual or automatic) can be
performed starting at these check points, thus, saving quite a bit of
time.

[0033] In terms of diagnostics and debugging, staging may allow
administrators to identify the core cause of problems, for example, an
administrator can inspect staged rows to find bad data, which may even
require the administrator to re-organize ETL processes to first clean
such data. Users may also explicitly add data stations in a data flow to
aid in debugging of the data flow, e.g., during development and testing
cycles. An inspection of such staged rows will provide a validation of
whether the corresponding upstream operations did indeed perform as
expected. With regard to cross-system exchanges, a data processing system
that is capable of integrating various data processing engines such as
the one described in the '540 application, the data being processed in
such a data processing system is a mix of various data types and formats
that are specific to a given underlying (runtime) data processing engine.
Some runtime processing systems may be equipped to process data inside
database tables, others may only work with flat files, while others may
perform better using Message Queues. In some scenarios, external systems
in a different (remote) site may be required to complete part of an
operation--e.g., a "Name Address Lookup" facility which may be provided
by an online vendor for cleansing customer addresses. Such an external
vendor may even require a SOAP-based web service means of data movement.

[0034] For example, FIG. 6 illustrates an example logical operator graph
600 including a first section 602 and a second section 604. The first
section 602 includes operators 606-610 operable to process data having a
format compatible with a first type of runtime engine (e.g., a relational
database engine), and the second section 604 includes operators 612-620
that are operable to process data having a format compatible with a
second type of runtime engine (e.g., a data stage ETL engine). For data
that is produced by an operator in the first section 602 to be able to be
consumed by an operator in the second section 604, staging objects 622
are generated by a code generation system to exchange data between the
first section 602 and the second section 604 of the logical operator
graph 600. The code generation system may convert the operators 606-610
into corresponding SQL/PL statements, and convert the operators 606-610
into an extensible mark-up language presentation.

[0035] Referring back to FIG. 5, an execution plan graph (e.g., execution
plan graph 218) is generated (e.g., by code generation system 210) based
on the query graph model (step 308). In one implementation, the execution
plan graph includes code (e.g., for generating staging objects) to manage
data exchange and data staging corresponding to locations in the sequence
of operations of the ETL process in which the user or the code generation
system have inserted a data station operator.

[0036] Provided below is further discussion regarding implementations of a
data station operator and uses thereof.

Model of a Data Station Operator

[0037] In one implementation, a data station operator models a data
exchange/staging object, and a code generator system generates code that
supports staging and data exchange functionalities based on the data
station operator. In one implementation, a data station operator is
modeled using a data flow operator modeling framework as described in the
'540 application. More generally, the concept of an operator is generic
to many different ETL or transformation frameworks and, therefore, the
concept of a data station operator can be extended for systems other many
types of data processing systems. In one implementation, a data station
operator has one input port, and one output port, and includes one or
more of the following attributes as shown in Table 1 below.

TABLE-US-00001
TABLE 1
Attrib-
ute Definition
data The type specifies the format of the staged data - e.g.,
station temporary database table, permanent database table, database
type view, JDBC result set and flat file.
pass The pass though flag is a Boolean flag indicating if the
through associated data station operator can be ignored. This flag can
flag be used to turn off a data station operator without a user having
to physically remove the data station operator from a data flow.
name of The name corresponds to, e.g., a staging table name or staging
staging file name.
object
data The lifetime permits a user to specify when a staging object
station should be cleaned and removed after a flow execution - i.e.,
lifetime removing a staging object at the end of an ETL flow execution,
or keeping the staging object permanently.
perfor- Performance hints include information such as DB partition
mance information according to the source or target tables, index
hints specification, whether to preserve incoming data orders etc.
Allowing a user to specify performance hints gives the user
flexibility to control data flow execution.

Advantages of a Data Station Operator

[0038] Advantages of a data station operator include the following. With
respect to performance, depending on the underlying runtime engine in
which an ETL process is executed, staging intermediate data can yield
better performance by controlling where and how data is flown through.
For example, when the underlying ETL engine is a database server (e.g.,
DB2), the execution code of one data flow can be represented to one or
several SQL statements. One single SQL statement can contain several
levels of nested sub-queries to represent many transform operations.
However, one single SQL statement could lead to runtime performance
problems on certain DB servers. For example, two common problems could
occur which are caused by one long SQL statement: 1) the log size that is
required to run the SQL can be large if the number of nested queries
reaches a certain level; 2) a single (nested) query is limited to DB
vendor's query processing capability. In some cases, a single SQL
statement will not work. In such case it is desirable to break the single
SQL statement into smaller pieces for better performance.

[0039] With regard to data (format) exchange, when a data flow includes a
mix of SQL operators and non-SQL operators, it is generally not possible
to represent the data flow using one common language. Data flows through
must be "staged" in order to transit from one type of operator to
another. For example, consider a data flow that extracts data from a JDBC
(Java Database Connectivity) source, goes through a couple of
transformations, and then ends with the data being loaded into a target
table. The code representing JDBC extraction is a java program, whereas
the transformations and loads can be presented by SQL statements. In such
cases, the output row sets from the JDBC extractor are staged into a DB2
table prior to sending the row sets to the following transform node.

[0040] A data station operator also permits tracing of data within a data
flow. Providing a tracing functionality in a data flow permits users to
monitor and track data flow runtime execution, and helps users diagnose
problems when errors occur. Providing a data station operator permits a
user to explicitly specify a staging point for an operator in a data flow
at which a stage table/file will be created to capture all intermediate
data that have been processed up to the staging point. Additional
diagnostic information for the staging point can also be captured, such
as number of rows processed, code being executed, temporary tables/files
created, and so on. A data station operator also provides error recovery
capability for a data flow. For example, when the execution of a data
flow fails, the code generation system, or user, can select to begin a
recovery process from a staged point where intermediate processed data is
still valid. This permits for faster recovery from a failure relative to
having to restart from the beginning of a data flow.

Pre-Determined Criteria for Inserting a Data Station Operator into a Data
Flow

[0041] A data exchange/staging point identifies a position where data
exchange/staging is required in a data flow--e.g., either on a link or an
output port. In one implementation, a staging point in a data flow is
identified when one of the following conditions arises:

[0042] An explicit exchange/staging point specified by user. A staging
point can be explicitly specified by the user using a data station
operator. For example, a user can specify a staging point where the user
wants to examine intermediate data sets processed during runtime, which
helps for debugging and diagnosis purposes when error occurs. Optionally,
the user can specify the data station repository type as well.

[0043] An implicit exchange/staging point identified by a code generation
system. There are situations where implicit staging points are required.
For example, in one implementation, staging points are required for an
operator in a dataflow that requires chunking--e.g., a splitter operator
requires chunking if there are multiple output streams going into
different targets. Custom operator can also specify if input streams
and/or output streams need to be chunked. In general, operators that
typically require staging include splitter operators, operators that
support the discarding of rows, and custom operators that require
staging. Implicit staging point may also be required for those operations
of a given operator that need to be broken into multiple parts to improve
performance. The following operators are example candidate operators for
which a staging point may be required Inner join operator--an inner join
operator can have multiple inputs, and perform a SQL join on multiple
tables. Performance of a SQL join operation depends on the underlying
database query processing. It is, therefore, desirable to split one large
join into multiple ones with smaller join cardinalities. In such a case,
staging points are required at intermediate join stages. The type of data
station can be a global temporary table for optimal performance. Key
lookup operator--a key lookup operator is implemented using a SQL inner
join operation and, therefore, key lookup operators can be processed
similar to inner join operators.

[0044] Two "incompatible" operators are linked together. One operator can
be incompatible with another operator when the runtime implementation
code for each of the operators is different. For example, if a JDBC
extract operator is implemented as a java program, it is viewed as not
compatible with a join operator that is implemented using SQL. In such
case, a data station operator is placed between the two operators so the
data can be passed from one type to the other.

Example Staging Types of a Data Staging Operator

[0045] In one implementation, when a code generation system chunks a data
flow into several small pieces, staging tables and staging files are
created and maintained to hold intermediate row sets during an ETL
process--e.g., data between extract and transform, between transform and
load, or a chunking point inside a data flow. In one implementation,
staging tables are database relational tables, and depending on how
staging tables are used, a given stage table can be either a permanent
table on ETL transform database, or a temporary table created in the data
transformation session. In one implementation, staging files are flat
files that hold intermediate transformed data in the text format. Staging
tables and staging files can be created on a transform engine. A user can
also input other specifications of a staging object, such as (table)
spaces, indexes used for staging tables, location for staging files.

Staging Tables

[0046] In one implementation, staging tables are used to hold intermediate
row sets during an ETL process. A code generation system can maintain a
staging table, including DDL (Database Definition Language) associated
table spaces and indexes. The "lifetime" of a staging table (e.g., the
duration of a stage table and when should the staging table be deleted)
can be externally specified by a user or internally determined by a code
generation system depending on the usage of the staging object. For
example, if a staging table is generated internally by a code generation
system, and is used only for a specific dataflow stream, the staging
table can be created at the beginning of the data flow execution as a
database temporary table, which temporary table will be deleted when the
session ends. If however, an internal staging table is used to chunk a
data flow into multiple parallel execution pieces, the staging object can
be defined as database permanent table to hold intermediate row sets
until the end of an ETL application execution.

Staging Files

[0047] In one implementation, staging files are flat, text files. A flat
file is a text-based ASCII file that is commonly used as a bridge between
non-relational data sets and relational database tables. Staging flat
files can be generated by a database export utility (such as DB2 SQL
export) to export data from relational DB tables, or can be generated
using a custom operator interface provided by a code generation system.
Flat files can be loaded into target tables through a database load
utility such as DB2 load.

JDBC Result Sets

[0048] JDBC result sets are the exchange point between two or more
operators. The results of a previous (upstream) operator are represented
as JDBC result sets and consumed by following (downstream) operators.
JDBC result sets are memory objects and, in one implementation, the
handles/names of the memory objects are determined by the code generation
system.

Automatically Placed Data Station Operators

[0049] For internally generated staging points (e.g., those staging points
not explicitly defined by a user), a code generation system can analyze
the internal presentation of a data flow (e.g., through a QGM), identify
staging points and insert data station operators that chunk the data flow
into multiple smaller pieces (or sub-flows). Between these sub-flows,
staging tables can be used to temporarily store intermediate transformed
result sets. For example, when a chunking point is identified, a QGM can
include staging tables/files (e.g., represented as table/file boxes) that
link to other QGM nodes. QGM In one implementation, the name of each
staging table within a QGM is unique. In one implementation, DDL
statements for all staging tables generated within a data flow will be
returned.

[0050] In one implementation, a code generation system (e.g., code
generation system 210 of FIG. 2) chunks a QGM graph into several smaller
graphs according to a chunking level (passed as an input parameter). In
one implementation, a chunking level indicates the maximum number of
nested queries and, accordingly, the number of QGM query boxes (e.g.,
select, aggregate, union/except/intersect) traversed can be counted. For
example, FIG. 7 illustrates a QGM graph 700 chunked into two smaller
sub-graphs 702, 704, in which the chunking level is 3. When a chunking
point is identified, a table box (e.g., table 706) can be added to
represent the staging table, which staging table is also the top box of a
chunked QGM sub-graph (e.g., sub-graph 704). The same staging box
(staging table) is referred to as the starting point of a new sub-graph.
The sequence of each chunked sub-graph is also tracked.

[0051] One or more of method steps described above can be performed by one
or more programmable processors executing a computer program to perform
functions by operating on input data and generating output. Generally,
the invention can take the form of an entirely hardware embodiment, an
entirely software embodiment or an embodiment containing both hardware
and software elements. In one implementation, the invention is
implemented in software, which includes but is not limited to firmware,
resident software, microcode, etc.

[0052] Furthermore, the invention can take the form of a computer program
product accessible from a computer-usable or computer-readable medium
providing program code for use by or in connection with a computer or any
instruction execution system. For the purposes of this description, a
computer-usable or computer readable medium can be any apparatus that can
contain, store, communicate, propagate, or transport the program for use
by or in connection with the instruction execution system, apparatus, or
device.

[0053] The medium can be an electronic, magnetic, optical,
electromagnetic, infrared, or semiconductor system (or apparatus or
device) or a propagation medium. Examples of a computer-readable medium
include a semiconductor or solid state memory, magnetic tape, a removable
computer diskette, a random access memory (RAM), a read-only memory
(ROM), a rigid magnetic disk and an optical disk. Current examples of
optical disks include compact disk--read only memory (CD-ROM), compact
disk--read/write (CD-R/W) and DVD.

[0054]FIG. 8 illustrates a data processing system 800 suitable for
storing and/or executing program code. Data processing system 800
includes a processor 802 coupled to memory elements 804A-B through a
system bus 806. In other implementations, data processing system 800 may
include more than one processor and each processor may be coupled
directly or indirectly to one or more memory elements through a system
bus.

[0055] Memory elements 804A-B can include local memory employed during
actual execution of the program code, bulk storage, and cache memories
that provide temporary storage of at least some program code in order to
reduce the number of times the code must be retrieved from bulk storage
during execution. As shown, input/output or I/O devices 808A-B
(including, but not limited to, keyboards, displays, pointing devices,
etc.) are coupled to data processing system 800. I/O devices 808A-B may
be coupled to data processing system 800 directly or indirectly through
intervening I/O controllers (not shown).

[0056] In one implementation, a network adapter 810 is coupled to data
processing system 800 to enable data processing system 800 to become
coupled to other data processing systems or remote printers or storage
devices through communication link 812. Communication link 812 can be a
private or public network. Modems, cable modems, and Ethernet cards are
just a few of the currently available types of network adapters.

[0057] Various implementations for modeling data exchange in a data flow
associated with an extract, transform, and load (ETL) process have been
described. Nevertheless, various modifications may be made to the
implementations, and those variations would be within the scope of the
present invention. For example, with respect to various implementations
discussed above, different programming languages (e.g., C) can be used to
stage intermediate processing data into a proprietary data format.
Accordingly, many modifications may be made without departing from the
scope of the following claims.