Tuesday May 21, 2013

The purpose of this document is to provide enough
information one might need while attempting an implementation of one or more offerings
and subject areas using the Oracle BI Applications Universal Adaptors. There
are several myths around what needs to be done while implementing Universal
Adaptors, where can things go wrong if not configured correctly, what columns
are to be populated as a ‘must’, how to provide ‘delta’ data set while shooting
for incremental ETL runs and so on. All of these topics are discussed in this
document.

Understanding the ‘entry points’ required to implement a
subject area is the key to a successful implementation of Universal Adaptors.

Oracle BI Applications provide packaged ETL mappings against
source OLTP systems like Oracle Fusion Applications, Oracle E-Business Suites,
PeopleSoft, JD Edwards and Siebel across various business areas such as Human
Resources, Supply Chain & Procurements, Order Management, Financials,
Service and so on.However, Oracle BI
Applications does acknowledge that there can be quite a few other source
systems, including home-grown ones, typically used by SMB customers. And to
that extent, some of the enterprise customers may also be using SAP as their
source. Until it gets to a point where Oracle BI Applications can deliver
pre-built ETL adaptors against each of these source systems, the Universal
Adaptor becomes a viable choice.

A mixed OLTP system where one of them has pre-built adaptor
for and the other doesn’t – is also a scenario that calls for the usage of
Universal Adaptors. For instance, the core portion of Human Resources may be in
PeopleSoft systems, but the Talent Management portion may be maintained in a non-supported
(yet) application such as “Taleo”.

In order for customers to enable pulling in data from
non-supported source systems into the Data Warehouse, Oracle BI Applications
have created a so called “Universal Adaptor”. The reason this was doable in the
first place was the fact that the ETL architecture of Oracle BI Applications
had the evident support for this. Oracle BI Applications’ Data Warehouse
consists of a huge set of facts, dimensions and aggregate tables. The portion
of the ETL that loads to these ‘end’ tables are typically Source Independent
(loaded using the folder SILOS, acronym of “Source Independent Load Objects”).
These ETL maps/interfaces start from a staging table and load data
incrementally into the corresponding end table. Aggregates are created
upstream, and have no dependencies to the source system where the data came
from. The ETL logic in SDE folder (acronym of “Source Dependent Extract”) that
extracts into these staging tables (also called Universal Stage Tables) are the
ones that go against a given source system, like EBS or PSFT FUSION and so on.
For Universal, they go against a similarly structured CSV file. Take any
Adaptor – the universal stage tables are exactly the same, structurally. The
grain expectation is also exactly the same for all adaptors. And no wonder,
while all these conditions are met, the SILOS logic will load the data
(extracted from Universal) from the universal stage tables seamlessly.

Why did Oracle BI Applications decide to source from CSV
files? In short, the answer to this is “to complete the end-to-end
extract-transform-load story”. We will cover this in a bit more details and
what the options are, in the next section.

One myth that implementers have while implementing Universal
Adaptors is “Data for the universal staging tables should always be presented
to Oracle BI Applications in the required CSV file format”.

If your source data is already present in a relational
database, why dump it to CSV files and give it to Oracle BI Applications? You
will anyway have to write brand new ETL mappings that read from those
relational tables to get to the right grain and right columns. Then why target
those to CSV files and then use the Oracle BI Applications Universal Adaptor to
read from them and write to the universal staging tables? Why not directly
target those custom ETL maps to the universal staging tables? In fact, when
your source data is in relational tables, this is the preferred approach.

However, if your source data comes from 3rd party
sources which you have outsourced, and probably have agreements with them to
send you data files/reports/feeds once in a while, and if that 3rd
party source doesn’t allow you to access their relational schema, then probably
CSV files is the only alternative. A typical example would be Payroll data. A
lot of organizations typically outsource their Payroll to 3rd party
companies like ADP systems and so on. In those cases, ask for the data in the
same manner that you expect in the Oracle BI Applications CSV files. Another valid example might be SaaS
applications which usually do not allow direct database accesses. File is the
best option for them.

Also, if your source data lies in IBM mainframe systems,
where it is typically easier to write COBOL programs or whatever to extract the
data in flat files, presenting CSV files to Oracle BI Applications Universal
Adaptor is probably easier. Irrespective of how to populate the universal
staging tables (relational sources or CSV sources) five very important points
should always be kept in mind:

Grain of the universal staging tables are met
properly.

The uniqueness of records do exists in the
(typically) INTEGRATION_ID columns.

The mandatory columns are populated the way they
should be.

The relational constraints are met well while
populating data for facts. In other words, the natural keys that you provide in
the fact universal staging table, must exist in the dimensions. This is with respect
to the FK resolution (dimension keys into the end fact table) topic.

Incremental extraction policy is well set up.
Some overlap of data is OK, but populating the entire dataset to the universal
staging tables will prove to be non-performing.

Note: For the rest of
the document, we will assume that you are going the CSV file approach, although
re-iterating, it is recommended that if your source data is stored in a
relational database you should write your own extract mappings.

There are several entry points while implementing a subject
area using Universal Adaptors. The base dimension tables and base fact tables
have their corresponding CSV files where you should configure the data at the
right grain and expectations. Other kinds of tables include ‘Exchange Rate’ and
‘Codes’ (Domains). Exchange Rate (W_EXCH_RATE_G) has its own single CSV file,
whereas the Codes table (W_CODE_D) has a CSV file, one per each code category.
To get to see all code-names well enough in the dashboards/reports, you should
configure all the ‘required’ code CSV files for the subject area in question.

Note: The Codes table
has been replaced with Domains architecture in newer releases of BI
Applications.

Key points:

Start with populating the offering specific
common dimension CSV files, like Job, Pay Grade, HR Position, GL Account, etc.,
depending on your offering (such as HR, or FINANCE etc). It is not necessary
that all offerings would have “their set of common dimensions used in all facts
within the offering”.

Then configure subject area specific dimensions,
like Job Requisitions, Recruitment Source etc (when implementing Recruitment)
or Learning Grade, Course (when implementing Learning) or Pay Type etc (when
implementing Payroll) or Absence Event, Absence Type Reason (when implementing
Absence) and so on. These are examples from HR offering. Similarly, it applies
for Financials subject areas such as AR or AP etc., where you should consider
area specific dimension needs at this time.

Consider other shared dimensions and helper
dimensions like Status, Transaction Types, and Exchange Rate etc. Special
handling of time dimension should be addressed here (such as Fiscal or
enterprise calendars etc).

Then consider the code dimensions. By this time
you are already aware of what all dimensions you are considering to implement,
and hence, can save time by configuring the CSVs for only the corresponding
code categories. For newer releases, you
would configure the domain set up at this point.

For facts, start with configuring the primary fact
for your offering. Since the dimensions are already configured, the natural key
of the dimensions are already known to you and hence should be easy to
configure them in the fact. For some offerings, there isn’t a concept of a
“primary fact”. If so, go to the next step.

Now that all the CSV files for facts,
dimensions, and helper tables are populated, you should move your focus towards
Domains. For E-Business Suite & PeopleSoft Adaptors, we do mid-stream
lookups against preconfigured lookup CSV files. The map between source
values/codes to their corresponding domain values/codes come pre-configured in
these lookup files. However, for Universal Adaptor, no such lookup files exist.
This is because of the fact that we expect that the accurate domain
values/codes will be configured “along-with” configuring the base dimension
tables where they apply. Since everything is from a CSV file, there is no need
to have the overhead of an additional lookup file acting in the middle. Domain
value columns begin with “W_” [excluding the system columns like W_INSERT_DT
and W_UPDATE_DT] and normally they are mandatory, cannot be nulls, and the
value-set cannot be changed or extended. We do relax the extension part on a
case by case basis, but in no way, the values can be changed. The
recommendation at this stage is that you go to the DMR guide (Data Model
Reference Guide), get the list of table-wise domain values, understand the
relationships clearly in cases there exists any hierarchical or orthogonal
relations, identify the tables where they apply and then their corresponding
CSV files, look at the source data and configure the domain values in the same
CSV files. Note that if your source data is in a relational database and you
have chosen to go the ‘alternate’ route of creating all extract mappings by
yourself, the recommendation is to follow what we have done for E-Business
Suite Adaptors and PeopleSoft Adaptors and create separate domain value lookup
CSV files, and do a mid-stream lookup.

Note that
the above discussion on Domains has been revamped in newer releases on BI
Applications. Now, we only have the CODE columns in our data model and the NAME
and DESC columns have been taken away. We still do a mid-stream lookup in
EBS/PSFT adaptors to resolve the target domain code against the source domain
code, but no longer use lookups to resolve the target domain names and
descriptions based on source/target domain codes. Rather, these
name/description lookups happen on the fly at the RPD level, where the name or
description is brought back to the report depending on the code and language
preferred in reports. The domain mapping happens in Configuration Manager now,
and no longer in CSV files. You will be taken to the step of configuring or
mapping them in Configuration Manager.

Last,
but not the least, configure the ETL parameters. Read up the documentation for
these parameters, understand their expectations, study your own business
requirements and then set the values accordingly.

Domain values constitute a very important foundation for
Oracle Business Intelligence Applications. We use this concept heavily all
across the board to ‘equalize’ similar aspects from a variety of source
systems. The Oracle Business Intelligence Applications provide packaged data
warehouse solutions for various source systems such as E-Business Suite,
PeopleSoft, Siebel, JD Edwards and so on. We attempt to provide a “source
dependent extract” type of a mapping that leads to a “source independent load”
type of a mapping, followed by a “post load” (also source independent) type of
mapping. With data possibly coming in from a variety of source systems, this
equalization is necessary. Moreover, the reporting metadata (OBIEE RPD) is also
source independent. The metric calculations are obviously source independent.

The following diagram shows how a worker status code/value is
mapped onto a warehouse domain to conform to a single target set of values. The
domain is then re-used by any measures that are based on worker status.

Domain values help us to equalize similar aspects or
attributes as they come from different source systems. We use these values in
our ETL logic, sometimes even as hard-coded filters. We use these values in
defining our reporting layer metrics. And hence, not configuring, incorrectly
configuring, or changing the values of these domain value columns from what we
expect, will lead to unpredictable results. You may have a single source system
to implement, but still you have to go through all the steps and configure the
domain values based on your source data. Unfortunately, this is small price you
pay for going the “buy” approach VS the traditional “build” approach for your
data warehouse.

One of the very frequently asked question is “what is the
difference between domain value code/name pairs VS the regular code/name pairs
that are stored in W_CODE_D (or
W_DOMAIN_MEMBER_G and W_DOMAIN_MEMBER_MAP_G along with their corresponding
Translation tables in newer versions of BI Applications)”.

If you look at the structure of W_CODE_D table, it appears
to be also capable of standardizing code/name pairs to something common. This
is correct. However, we wanted to give an extensive freedom to users to be able
to do that standardization (not necessarily equalization) of their code/names
and possibly use that for cleansing as well. For example, if the source
supplied code/name are possibly CA/CALIF or CA/California, you can choose the
W_CODE_D approach (using Master Code and Master Map tables – see configuration
guide for details) to standardize on CA/CALIFORNIA. The equivalent of the above is the W_DOMAIN_MEMBER_MAP_G table in newer
versions of BI Applications.

Now, to explain the difference of domain value code/name
pairs Vs the regular source code/name pairs, it is enough if you understand the
significance of the domain value concept. To keep it simple, wherever we
(Oracle Business Intelligence Applications) felt that we should equalize two
similar topics that give us analytic values, metric calculation possibilities
etc, we have “promoted” a regular code/name pair to a domain value code/name
pair.

If we have a requirement to provide a metric called “Male
Headcount”, we can’t do that accurately unless we know which of the headcount
is “Male” and which is “Female”. This metric therefore has easy calculation
logic: Sum of headcount where sex = Male. Since PeopleSoft can call it “M” and
EBS can have “male”, we decided to call it a domain value code/name pair,
W_SEX_MF_CODE (available in the employee dimension table). Needless to say, if
you didn’t configure your domain value for this column accurately, you won’t
get this metric right.

The Oracle BI Applications mostly use Name and Description
columns in the out-of-the-box dashboards and reports. We use Codes only during
calculations, wherever required. Therefore, it is obvious that if the names and
descriptions didn’t resolve against their codes during the ETL, you will see
blank values of attributes (or in some cases, depending on the parameter
setting, you might see strings like <Source_Code_Not_Supplied> or
<Source_Name_Not_Supplied> and so on). In newer
versions of BI Applications where names and descriptions are resolved using on-the-fly
RPD lookups and not in the ETL, the reports might show values such as
‘__UNSASSIGNED__’, if not mapped well enough in Configuration Manager.

Another point to keep in mind is that all codes should have
distinct name values. If two or more codes have the same name value, at the
report level you will see them merged. The metric values may sometimes appear
in different lines of the report, because OBIEE Server typically throws in a
GROUP BY clause on the lowest attribute (code).

Once implemented, you are free to promote the source code
columns from the logical layer to the presentation layer. You might do this
when you know your business users are more acquainted to the source code values
rather than the name values. But that is a separate business decision. The
general behavior is not like that.

Although you can choose to supply the entire dataset during
incremental runs, for all practical reasons, this is not recommended. Firstly
because then the ETL has to process all the records and determine what needs to
be applied and what can be rejected. Secondly, the decision ETL takes may not
be accurate. ETL decisions are based on the values of the system date columns
like CHANGED_ON_DT, AUX1_CHANGED_ON_DT, AUX2_CHANGED_ON_DT, AUX3_CHANGED_ON_DT
and AUX4_CHANGED_ON_DT columns only. We do not explicitly compare
column-by-column and determine whether an update is required. We believe that
if something has changed, probably one of the four date columns must have
changed. And in that case, we simply update. If all 5 date columns are same, we
pretty much tend to reject. The base of this decision is the correctness of the
date columns. If your source system does not track the last updated date column
on a record well enough, it becomes your responsibility to force an update, no
matter what. An easy way to do this is to set SESSSTARTTIME in one of these
columns during extract. This will force to detect a change, and we will end up
updating.

No wonder, this is not be the best idea. By all means, you
should provide the true “delta” data set during every incremental run. A small
amount of overlap is acceptable, especially when you deal with flat files. Our
generally accepted rules for facts or large dimensions are either:

·Customer does their own version of “persisted
staging” so they can determine changes at the earliest opportunity and only
load changes into universal staging tables

·If absolutely impossible to determine the
“delta” or to go the “persistent staging” route, Customer only does full load.
Otherwise doing a full extract every time and processing incrementally will
take longer.

Follow the principles below to decide on your incremental
strategy:

(Applies to relational table sources) Does your
source system capture last update date/time accurately in the source record
that change? If so, extracting based on this column would be the best idea.
Now, your extract mapping may have used 2 or 3 different source tables. Decide
which one is primary and which ones are secondary. The last update date on the
primary table goes to the CHANGED_ON_DT column in the stage table. The same
from the other two tables go to one of the auxiliary changed on date column in
the stage table. If you design your extract mapping this way, you are almost
done. Just make sure you add the filter criteria
“primary_table.last_update_date >= LAST_EXTRACT_DATE parameter”. The value
of this parameter is usually maintained within the ETL orchestration layer.

(Applies to CSV file sources) Assuming there is
a mechanism which you can trust that gives you the delta file during each
incremental load; does the ‘delta’ file come with a changed value of system
dates? If yes, you’re OK. But if not, then you should add an extra piece of
logic in the out of the box SDE_Universal** mappings that sets SESSSTARTTIME to
one of the system date columns. This will force an update (when possible) no
matter what.

(Applies to CSV file sources) If there are no
mechanisms to easily give your delta file during incremental, and it seems
easier to get a complete ‘dump’ every time, you have actually couple of choices:

a.Pass the whole file in every load, but run true
incremental loads. Note that this is not an option for large dimensions or
facts.

b.Pass the whole file each time and run full load
always.

c.Do something at the back-office to process the
files and produce the ‘delta’ file yourselves.

The choices (a) and (b) may sound a bad idea, but we’ve seen it to be a
worthwhile solution compared to (c), if the source data volume is not very
high. For an HR Analytics implementation (as an example), this could be OK as
long as your employee strength is no more than 5000 and you have no more than 5
years of data. The benchmark might be different for a Financial Analytics
implementation.

The choice (c) is more involved but produces best results. The idea is
simple. You store the last image of the full file that your source gave you
[call ‘A’]. You get your new full file today [call ‘B’]. Compare A & B.
There are quite a few data-diff software available in the market, or better if
you could write a Perl or python script on your own. The result of this script
should be a new delta file [call ‘C’] that has the lines copied from B that has
changed as compared to A. Use this new file C as your delta data for your
incremental runs. Also, discard A and rename B as A, thereby getting ready for
the next batch.

Having said that, it is worthwhile to re-iterate that the “Persisted Staging”
is a better approach as it is simpler and uses the ETL logic to do the
comparison. Oracle BI Applications have used this technique in HR adaptors for
E-Business Suite and PeopleSoft, in case you wanted to refer to them.

If there are other options not considered here, by all means, try them
out. This list is not comprehensive, it is rather indicative.