ETL Power - Transforming the raw ingredients of business intelligence

Saturday, May 18, 2013

Data profiling
best practices are the first thing you need if you are planning to create a data
warehouse for your company or organization. Data profiling, sometimes called
“DP” in ETL professional circles, is a way of analyzing and processing raw data
sets to collect statistics and decipher trends and patterns in them. These
patterns and statistics can be used to:

Assess whether the information can be used to meet the
needs of the organization.

Assess the data quality.

Check the potential for using the data in a new or
different application.

Assess the risks that any kind of data processing,
storage, or management might have on the data.

Have an overall view of managing and governing data
for the warehouse.

If you stay in
the ETL business, you will find that data profiling often becomes the backbone
of your work in planning, designing, implementing, and managing a data
warehouse. In order to guide you in your projects, you can follow these top
four data profiling best practices, all very helpful tips in this line of work.
They can be applied to your circumstances, too.

Data Profiling Best Practice 1: Do it before data
warehousing

Do data
profiling before data warehousing. This is probably the first and foremost data
profiling best practice that you should know. When you find that your company
or organization needs to set up a data warehouse, then your first order of
business is to profile the information that you now have, before embarking on
any of the extract-transform-load processes. This first step allows you to take
a long, objective look at the information you have and to see what its quality
is. It also provides metadata that you will usually need when designing the
rest of your data warehouse, and it helps ensure that your ETL practices are as
efficient as they can be. In most cases, data profiling done before the actual
data warehousing project will even tell you whether the project is worth
pursuing, saving you thousands of dollars in labor costs.
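
To make this concrete, here is a minimal profiling sketch in Python, assuming pandas is available and that a source extract has been pulled into a CSV file (the file name and column handling are illustrative only):

```python
# Minimal column-profiling sketch (assumes pandas and a hypothetical
# source extract "customers_raw.csv"; adapt names to your own sources).
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Collect per-column statistics commonly reviewed before warehouse design."""
    stats = []
    for col in df.columns:
        s = df[col]
        stats.append({
            "column": col,
            "dtype": str(s.dtype),
            "rows": len(s),
            "nulls": int(s.isna().sum()),
            "null_pct": round(100 * s.isna().mean(), 2),
            "distinct": int(s.nunique(dropna=True)),
            "sample_min": s.min() if s.notna().any() else None,
            "sample_max": s.max() if s.notna().any() else None,
        })
    return pd.DataFrame(stats)

if __name__ == "__main__":
    raw = pd.read_csv("customers_raw.csv")   # hypothetical extract
    print(profile(raw).to_string(index=False))
```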

Data Profiling Best Practice 2: Know the full array of
tools

It is
necessary to know the full array of tools that you can use for data profiling.
Best practice is to get an idea of what programs or platforms are on the market,
then compare their functionality, price points, and usage so that you will have
an idea of which tool is best suited for your data profiling purposes. Some
tools available on the market are DataFlux, Trillium, and Informatica, as well
as the basic data profiling tools that come with SQL Server.
Currently, some providers have also released automated profiling tools. Most
ETL experts agree that the last option is already suitable for basic, light
data profiling work. For the more advanced kinds of projects, however, a tool
dedicated or mainly designed for data profiling work is best. Aside from the
kind of work, settling on a data profiling tool should, of course, be based on
the level of standardization that you want. What you should do as a data
profiling best practice is to get on to the Internet and faithfully do your
research about each of these tools and read user/programmer feedback. Remember
that choosing a data profiling tool really is a commitment to a kind of
standard and style for data profiling.

Data Profiling Best Practice 3: Validate with outside sources

If you really
want to get the most out of your data profiles, you need to remember that the
information you’ll be working with does not come only from inside; it must
also be validated by outside sources. That means that you have to cross-check
the facts that you have from data profiling with the feedback and other
information that you are getting from other sources, which may not be so easy
to tabulate and input in a data warehouse. These outside ways of checking data
profiles may come from customer feedback, simple questionnaires for your
stakeholders, trends relating or corresponding to the items in your data
profile, or maybe a general survey of what your industry benchmarks are. This
data profiling best practice can help you check the results of your inside
work, as well as possibly point out errors or open up new ways of looking at
your available data—all for the better understanding of information about your
company.

Data Profiling Best Practice 4: Review segmentation

As a last data
profiling best practice, you must continue running this process even after you
have designed your data warehouse. Profiling data regularly can help you look
for new ways of interpreting the data that you have, which usually brings a
fresh insight into how the company or organization is doing, revealing changing
trends and sounding off on any need to innovate or adapt so that the company
will perform better. More than just regular, periodic data profiling, however,
you also have to regularly review the way you segment the information for your
data profiling work. This means that you have to cut through your data in new
ways so that you get fresh insights into the same information. This will help
you broaden or focus your data segmentation so that the metadata will give you
either a widened or a specified look into your market and your business
transactions. Sometimes, adjusting your data profile segmentation can mean just
doing the data profile on a relevant part of information within your data
warehouse. This is done so that you can be more efficient at studying the
different related information or information chains. Data profiling best
practices will make the data identification process faster and will yield more
information about the more important aspects of your matrix of data.
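
As a rough illustration of re-segmenting the same data, here is a small Python sketch; the region, channel, and order_amount columns are hypothetical stand-ins for your own warehouse extract:

```python
# Re-segmentation sketch: profile the same data along different cuts.
# Assumes a pandas DataFrame with hypothetical "region", "channel" and
# "order_amount" columns; substitute your own warehouse extract.
import pandas as pd

orders = pd.DataFrame({
    "region": ["NA", "NA", "EU", "EU", "APAC"],
    "channel": ["web", "store", "web", "web", "store"],
    "order_amount": [120.0, 80.0, 200.0, None, 60.0],
})

# Segment 1: by region -- a broad view of completeness and volume.
by_region = orders.groupby("region")["order_amount"].agg(["count", "mean"])

# Segment 2: by channel -- the same facts cut a different way.
by_channel = orders.groupby("channel")["order_amount"].agg(["count", "mean"])

print(by_region, by_channel, sep="\n\n")
```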

Friday, April 26, 2013

Is it imperative that you learn about the basics of fact
table granularity? To answer that, you must ask yourself: Are you embarking on
a data warehousing project for your company or organization? If you answered
yes, then it is imperative that you read about the basics of fact table
granularity so that you can make important decisions for the design of your
data warehouse. Understanding the basics of fact table granularity will affect
the effectiveness of your data warehouse in the long run.

Fact Table
Granularity Explained

One of the first basics of fact table granularity that
you must understand is its meaning. Fact table granularity refers to the level
of detail that a single fact table contains. Higher granularity means more
detail in that one table, while lower granularity means less detail. For
someone who is just learning the basics of fact table granularity, the easiest
way to gauge how high or low a fact table’s granularity is in your company’s
database is to look at the number of rows in it. A higher number of rows in a
table usually means higher granularity. Granularity can also refer to the
lowest level of depth in a set of data for a fact table. For example, the grain
for time can be year, quarter, month, day, or hour, with year being the highest
level, which corresponds to low fact table granularity.
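
To illustrate the idea, the short Python sketch below rolls the same facts up from a daily grain to a monthly grain; the daily table carries more rows, i.e., higher granularity (the figures are made up):

```python
# Grain illustration: the same sales facts rolled up from a daily grain to a
# monthly grain; the daily table holds more rows (higher granularity) than the
# monthly one. Data is illustrative only.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-02-01"]),
    "sales": [100.0, 150.0, 90.0],
})

monthly = (
    daily.set_index("date")["sales"]
         .resample("MS")            # roll up to month-start periods
         .sum()
         .reset_index()
)
print(len(daily), "daily rows ->", len(monthly), "monthly rows")
```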

Granularity in Fact
Table Design

Understanding granularity is a must for you if you are
involved in designing an efficient, functional fact table for a business or
organization. Whether you are the designer or the project supervisor, your
decisions based on the basics of fact table granularity will have an impact on
how your fact table will work. This is because choosing the granularity is one
of the most essential steps in designing a fact table, alongside identifying
the business’s analysis points, measures, dimensions, and locations. Knowing
how to properly choose granularity for a table will help you capture enough
data for your needs without bogging down your system with unnecessary data.

Choosing the Level of
Granularity

Knowing how deep you will go in terms of grain level for
your fact table is among the basics of fact table granularity. Choosing to stay
at the high, least granular levels will mean that you have very limited data
available to you. For example, staying at the year level, the highest grain for
time, means that only one value is added to your fact table each year. This
brings almost no knowledge and is inefficient for analyzing or drilling down
into the data. Meanwhile, going too deep, to the level of hours for example,
might mean 24 inputs a day, 365 days a year, multiplied by the various other
items for which the fact table is designed. That can mean a lot of unwieldy
reports that end up bogging down your system as well as your analysis. There
will be too much data to sift through, and you will end up not optimizing its
use and value. In many instances, a medium granularity is best for a company or
organization’s purposes. A daily input is manageable in terms of size and
analysis capability, yet it gives enough detail from which to extrapolate
patterns, trends, and conclusions. Deciding on the level of granularity is a
skill that must be learned over time and by gaining an understanding of your
business processes.
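
A quick back-of-envelope calculation makes the trade-off concrete; the store and product counts below are hypothetical, so plug in your own figures:

```python
# Back-of-envelope sketch of how grain choice drives fact-table size.
# The store and product counts are hypothetical; plug in your own figures.
stores, products = 50, 2_000

rows_per_period = stores * products
grains = {"yearly": 1, "daily": 365, "hourly": 365 * 24}

for grain, periods_per_year in grains.items():
    print(f"{grain:>7}: ~{rows_per_period * periods_per_year:,} rows per year")
```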

Sunday, April 7, 2013

Technology
has made it easier for businesses to organize and manipulate data to get a
clearer picture of what’s going on with their business. Notably, ETL tools have
made managing huge amounts of data significantly easier and faster, boosting many
organizations’ business intelligence operations.

There
are many third-party vendors offering ETL solutions, but two of the most
popular are PowerCenter Informatica and Microsoft SSIS (SQL Server Integration
Services). Each technology has its advantages, but they are also similar in how
they carry out the extract-transform-load process, often differing only in
terminology.

If
you’re in the process of choosing ETL tools and PowerCenter Informatica and
Microsoft SSIS made it to your shortlist, here is a short comparative discussion
detailing the differences between the two, as well as their benefits.

Package Configuration

Most
enterprise data integration projects would require the capacity to develop a
solution in one platform and test and deploy it in a separate environment
without having to manually change the established workflow. In order to achieve
this seamless movement between two environments, your ETL technology should allow
the dynamic update of the project’s properties using the contents of a
parameter file or configuration.

Both
Informatica and SSIS support this functionality using different methodologies.
In Informatica, every session can have more than one source and one or more
destination connections. There are different kinds of connections, the primary
kind being relational connections. Every session can be set up dynamically by
changing parameters contained in a parameter file.

The
same thing can be achieved in SSIS via Configurations. Using the SSIS
Configuration Wizard, configuration data is saved in XML files. Unlike
Informatica wherein there can be multiple connections, SSIS only allows a
single defined connection that can be applied across all tasks in a package.

In
short, Informatica parameters are defined at the session level while SSIS
configurations are set at the package level.
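
As a language-neutral illustration of the underlying idea (not either product’s actual mechanism), the Python sketch below reads environment-specific connection settings from a hypothetical etl_config.json at run time:

```python
# Generic sketch of parameter-driven configuration (the idea behind
# Informatica parameter files and SSIS configurations, not either
# product's actual format). Assumes a hypothetical "etl_config.json".
import json

def load_connections(path: str, environment: str) -> dict:
    """Return the connection settings for the requested environment."""
    with open(path) as fh:
        config = json.load(fh)
    return config[environment]

# Example file contents:
# {
#   "dev":  {"server": "dev-sql01",  "database": "StagingDB"},
#   "prod": {"server": "prod-sql01", "database": "StagingDB"}
# }
if __name__ == "__main__":
    conn = load_connections("etl_config.json", environment="dev")
    print(conn["server"], conn["database"])
```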

Data Staging

When
you use SSIS, you will use the Connection Manager to generate a connection
defining the physical location of the file. Multiple files that need to be
loaded would require multiple connections. All
information set in the connection manager can be incorporated in the
Configuration File and can be dynamically updated during run-time.

On
the other hand, if you’re using Informatica, you will use the Workflow Manager
tool to assign a location to each file. Every task that needs to access that
file can be configured with the location and name of that specific file.

Value Extraction

One
of the main functions of ETL tools is the ability to extract meaning from the
information currently being run through the pipeline, or to supplement that
information with extra values derived from the existing data in the data
processing pipeline.

Both
SSIS and Informatica have this functionality through the use of derived columns
or the capacity to draw new information from existing data. Informatica does
this via its expression transformation component while Microsoft SSIS does this
via the Derived Column transformation.

The
logic used to complete both operations is the same and the syntax involved is
also identical. The difference between the two technologies lies in the
expression language utilized to obtain the new data and the notation style
involved. Informatica offers function categories such as Character, Conversion,
Date, Numerical, Scientific, Special, and Test. SSIS, on the other hand, uses
Mathematical, String, Date/Time, NULL, Type Casts, and Operators.
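
The Python sketch below shows the generic idea of a derived column, computing new values from existing pipeline columns; the column names and data are illustrative only:

```python
# Derived-column sketch: compute new values from existing pipeline columns,
# analogous in spirit to Informatica's Expression and SSIS's Derived Column
# steps. Column names and data are hypothetical.
import pandas as pd

line_items = pd.DataFrame({
    "unit_price": [9.99, 4.50, 20.00],
    "quantity": [3, 10, 1],
    "customer_name": ["  Ada Lovelace", "Grace Hopper ", "Alan Turing"],
})

# Derive an extended price and a standardized customer name.
line_items["extended_price"] = line_items["unit_price"] * line_items["quantity"]
line_items["customer_name"] = line_items["customer_name"].str.strip().str.upper()
print(line_items)
```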

Sorting

Simply
defined, sorting is the ability to arrange information into an ordered data
set. While the order of the information may appear to be immaterial for
loading into a relational data warehouse or database, it may matter for other
tasks later on in the transformation process.

The
difference in how SSIS and Informatica carry out this functionality could
hardly be subtler. Informatica’s Sorter and SSIS’s Sort can both order data and
eliminate duplicate rows. In SSIS, de-duplication can be done by setting the
eliminate duplicates option to TRUE; in Informatica, by selecting the distinct
option.
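
Here is the same sort-and-deduplicate idea expressed as a small Python sketch with made-up data:

```python
# Sort-and-deduplicate sketch, mirroring the idea behind Informatica's Sorter
# with the distinct option and SSIS's Sort with duplicate elimination enabled.
import pandas as pd

events = pd.DataFrame({
    "event_time": ["2013-04-02", "2013-04-01", "2013-04-01"],
    "event_type": ["click", "view", "view"],
})

deduped = (
    events.drop_duplicates()                 # remove exact duplicate rows
          .sort_values("event_time")         # order the remaining rows
          .reset_index(drop=True)
)
print(deduped)
```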

Detection of Data
Quality Issues

Like
all data integration solutions, ETL technologies can be susceptible to data
quality problems. Fortunately, Microsoft SSIS allows for the creation of
checkpoints within the data transformation process that can reveal and repair
these problems. SSIS has a feature called the Fuzzy Lookup
transform that pairs incoming “dirty” information – unexpected abbreviations,
null fields, inserted or missing tokens, truncations, misspellings and other
data abnormalities – with clean records contained in a reference table. There
is also the Fuzzy Grouping tool that finds similarities among input rows and unites
duplicate data.

Unfortunately,
Informatica does not have the same functionality out of the box. If you want to
recreate this capability in Informatica, it would require human intervention to
build a mapping table that contains every incoming value encountered in the
input stream and its equivalent clean value.

However,
take note that even if you’re using SSIS, you may still need to manually
intervene to detect and repair data quality issues. Even the most advanced
algorithm may miss something so you still need to manually check for the
accuracy and integrity of your data.
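
For a rough sense of what fuzzy matching does, the Python sketch below uses only the standard library’s difflib to pair dirty values with a reference list; it illustrates the concept rather than the algorithm SSIS actually uses:

```python
# Fuzzy-matching sketch using only the standard library (difflib); it
# illustrates the idea behind fuzzy lookup rather than SSIS's algorithm.
import difflib

reference = ["Informatica PowerCenter", "Microsoft SQL Server", "Oracle Database"]
lookup = {name.lower(): name for name in reference}   # case-insensitive index
dirty = ["informatca powercentre", "MS SQL Servr", "Orcle Databse"]

for value in dirty:
    hits = difflib.get_close_matches(value.lower(), list(lookup), n=1, cutoff=0.4)
    clean = lookup[hits[0]] if hits else "no match"
    print(f"{value!r:26} -> {clean}")
```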

Modularity

Modularity
is concerned with the manner in which the work units that make up an end-to-end
ETL solution are created and reused.

There’s
a slight difference between PowerCenter Informatica and Microsoft SSIS on how
they build modular data integration and ETL solutions.

Informatica
takes a bottom-up approach to ETL implementation by providing a library of
components – mappings, mapplets, transformations, targets, and sources – that
can be employed across numerous worklets in the solution. A worklet is composed
of a sequence of mapping instances.

In
contrast, SSIS uses a top-down approach wherein a general sequence of tasks is
defined before specifying how those tasks are going to be carried out.
Reusability of ETL components is achieved by creating libraries of packages,
which can then be executed together from a master package. A package is the
counterpart of Informatica’s worklet.
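
The contrast can be sketched in plain Python: small reusable task functions (the bottom-up idea) composed by a master pipeline (the top-down idea). The function names are illustrative only:

```python
# Modularity sketch: small reusable task functions (the bottom-up idea) that a
# master pipeline composes into an end-to-end run (the top-down idea).
# Function names and data are illustrative only.
def extract_customers():
    return [{"customer_id": "c1"}, {"customer_id": "c2"}]

def standardize(rows):
    return [dict(row, customer_id=row["customer_id"].upper()) for row in rows]

def load_to_warehouse(rows):
    print(f"loaded {len(rows)} rows")

def master_pipeline():
    """Master sequence, analogous to a parent package or workflow."""
    load_to_warehouse(standardize(extract_customers()))

master_pipeline()
```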

Tracking Changes in
Slowly Changing Dimensions

Slowly
changing dimensions address the issue of capturing and documenting a history of
modifications or changes to entities within a database that are not reflected
in a System of Record for that particular data. A common example of slowly
changing dimensions is an item moving to another product category in a
department store. This will modify the said product’s category attribute but
its SKU will remain unchanged.

Informatica
and SSIS both have the functionality to track these changes with very similar
features. Informatica is equipped with a “slowly changing dimension wizard”
that will allow you to create the sources, transformations, ports and
destinations that are pertinent to accommodate these slowly changing
requirements. SSIS also comes with a slowly changing dimension wizard that
works similarly. Aside from the ability to keep track of slowly changing
attributes, it can also recognize changes to attributes that are not supposed
to change. These are known as fixed attributes. You have the option to enable
the wizard to raise an error warning accordingly when this happens.
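
A much simplified sketch of historizing a changed attribute (a Type 2 style history) is shown below; the product dimension, its columns, and the dates are hypothetical:

```python
# Slowly-changing-dimension sketch (a simplified Type 2 history), assuming a
# hypothetical product dimension keyed by SKU. Category changes are historized
# by closing the old row and opening a new current row.
from datetime import date

dimension = [
    {"sku": "A100", "category": "Electronics", "valid_from": date(2012, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_change(rows, sku, new_category, change_date):
    current = next(r for r in rows if r["sku"] == sku and r["is_current"])
    if new_category != current["category"]:
        current["valid_to"] = change_date        # close the old version
        current["is_current"] = False
        rows.append({"sku": sku, "category": new_category,
                     "valid_from": change_date, "valid_to": None,
                     "is_current": True})        # open the new version
    return rows

apply_change(dimension, "A100", "Home Appliances", date(2013, 4, 1))
for row in dimension:
    print(row)
```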

Dimension Loading

In
terms of dimension loading, a surrogate key is necessary. A surrogate key
substitutes for a natural key and is the basis of every join between fact
tables and dimension tables.

Informatica
and SSIS generate surrogate keys in different ways.

Out
of the box, Informatica PowerCenter comes with a Sequence Generator
transformation that can create a surrogate key. It produces an incremental
value for every row in the pipeline, which can then be incorporated into a
destination table via a surrogate key column.

SSIS
does not have a Sequence transformation component. Instead, it uses its Script
transform component to generate surrogate keys.
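
The underlying idea in both cases is simply an incrementing key assigned to each incoming row, as in this Python sketch with hypothetical data:

```python
# Surrogate-key sketch: assign an incrementing integer key to each incoming
# dimension row, the general idea behind sequence-based key generation.
# Names and data are hypothetical.
import itertools

key_sequence = itertools.count(start=1)          # monotonically increasing keys

incoming_products = [{"sku": "A100"}, {"sku": "B200"}, {"sku": "C300"}]
dimension_rows = [dict(row, product_key=next(key_sequence)) for row in incoming_products]
print(dimension_rows)
```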

Fact Table Loading

Populating
fact tables usually involves two processes: (1) aggregating the data to the
needed granularity and (2) retrieving the dimensional surrogate keys.

Informatica
PowerCenter carries out these operations via a transformation called the
Aggregator, which computes aggregates across groupings of values from chosen
input columns. SSIS has the same capability through a component called
Aggregate. The slight difference between the two is that SSIS offers only the
most commonly used functions, such as Minimum, Maximum, Average, Count
Distinct, Count, and Sum. Informatica has all of these plus extra capabilities
such as Variance, Standard Deviation, Percentile, Median, First, and Last.
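
Both steps, aggregating to the target grain and swapping natural keys for surrogate keys, can be sketched in a few lines of Python; the tables and columns below are made up:

```python
# Fact-loading sketch: aggregate staged sales to a daily grain, then swap the
# natural product key (sku) for the dimension's surrogate key.
# Table and column names are hypothetical.
import pandas as pd

staged = pd.DataFrame({
    "sale_date": ["2013-04-01", "2013-04-01", "2013-04-02"],
    "sku": ["A100", "A100", "B200"],
    "amount": [10.0, 15.0, 7.5],
})
product_dim = pd.DataFrame({"product_key": [1, 2], "sku": ["A100", "B200"]})

daily = staged.groupby(["sale_date", "sku"], as_index=False)["amount"].sum()
fact_sales = daily.merge(product_dim, on="sku").drop(columns="sku")
print(fact_sales)
```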

Which ETL Technology
is Right for Your Business?

There
are other aspects that illustrate the difference between Informatica
PowerCenter and Microsoft SSIS such as Design Time Debugging, Collaborative
Development, Security, Integration with Packaged Applications, and Web Services
& Service Oriented Architecture. However, the things discussed above cover
the basic concepts in ETL technology.

As
businesses face bigger challenges in integrating data from a constantly
increasing number of systems, choosing an ETL solution that fits your needs is
more crucial than ever.

As
for the choice between Informatica PowerCenter and Microsoft SSIS, many
analysts consider Informatica the leader in ETL technology, while the reputable
research firm Forrester once called SSIS’s price-to-performance ratio
“downright seductive.”

However,
proclaiming a winner in this battle between two ETL technology giants would
greatly depend on your business requirements. Of course, there are pricing
differences between the two technologies, and notable differences in their
features and capabilities as well as their level of usability. It’s for you to
analyze which technology is the perfect fit.

Thursday, April 4, 2013

Good
business managers don’t make internal and external business decisions out of
thin air. More often than not, these business strategies and tactics are based
on empirical and numerical data. This is where business intelligence or BI comes
in.

Simply
defined, business intelligence is a group of technologies, architectures,
processes, methodologies, and theories aimed at culling, interpreting, and
converting business data into relevant and useful information. This information is used
for an array of business purposes such as identifying new opportunities,
crafting of new business policies and creating strategies for business
sustainability.

The Basics of ETL

One
of the quintessential tools vital to the success of your business intelligence
projects is ETL, short for extract-transform-load. Basically, ETL is software
that allows businesses to consolidate disparate data and transfer it from one
source to another application, or from different sources to a single data
warehouse, for interpretation and analysis.

As
its name suggests, ETL software involves a three-pronged process, sketched in
code after the list below:

1. Extract. The software pinpoints the relevant data sets
based on a query and extracts the information from internal or external, and
structured or unstructured, sources.

2. Transform. This step involves preparing and modifying
the extracted data for loading into the designated data warehouse or
warehouses. The level of transformation the extracted data needs varies: some
data needs little to no transformation, while other data has to be cleaned,
restructured, and organized. The most common types of data transformation
include:

· Standardization

· Recording

· Pivoting

· Aggregation

· Cleaning

· Filtering

· Sorting

· Re-mapping

3. Load. A number of experts agree that this is the
least complex step in the ETL process, provided that the extraction and
transformation of the data were completed without a glitch. However, depending
on the size of your datasets, a partitioning strategy may need to be
implemented. Depending on the loading options available, data loading may
overwrite current data, append new data to the existing data, or merge the
information.
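
Here is the promised sketch: a minimal extract-transform-load flow in Python, reading a hypothetical CSV extract, applying a simple cleaning transform, and loading the result into a SQLite table standing in for the warehouse. File names, columns, and the filtering rule are illustrative only:

```python
# Minimal extract-transform-load sketch. The CSV path, column names, and the
# cleaning rule are hypothetical; SQLite stands in for the warehouse.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))

def transform(rows):
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):          # filter rows missing the key
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "country": row.get("country", "").strip().upper(),  # standardize
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS dim_customer "
                     "(customer_id TEXT, country TEXT)")
        conn.executemany("INSERT INTO dim_customer VALUES (:customer_id, :country)", rows)

if __name__ == "__main__":
    load(transform(extract("customers_raw.csv")))
```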

Key Benefits of ETL
in Business Intelligence Projects

The
success of your business intelligence project is highly dependent on the
quality and integrity of data you have and the efficiency of how you collect, manipulate
and consolidate your business information. This is where the benefits of ETL
tools primarily lie.

ETL
tools are especially advantageous in situations wherein you need to integrate
data from different source systems that are usually in different formats. Since
you can program an automated data processing schedule, ETL is beneficial when
you need to process data repeatedly, such as on a real-time, hourly, or daily
basis.

Below
are the more specific benefits of using ETL tools in your business intelligence
projects:

1. Collect and integrate data from different locations. If you’re a company
with operations spanning different geographical locations, ETL can be used to consolidate
data into a central data warehouse for easy aggregation. One of the best
examples to demonstrate this benefit is Motorola. The company used ETL to mine
data from 30 geographically disparate systems and load the information into a
single supply chain management data warehouse. This enabled the technology
company to see its total procurement expenses without difficulty. Imagine if this
data, in a multitude of different formats, had to be aggregated manually. It
could easily have taken hundreds of man-hours to complete.

2. Sharing of business information across different business
functions. Another
basic role that ETL tools play in your business intelligence projects is
sharing information across your different operational units. For example,
your CRM system will define your customers using specific parameters. What if
your accounting department needs this data to improve how it handles billing? Chances
are, different units in your business will require a different data format. With
ETL, extracting this information, converting it to a new format and loading it
to the new data warehouse can be done seamlessly and with minimal room for
error.

3. Ease of migrating, cleansing, and warehousing bulk data
for more comprehensive business intelligence projects. While it is common
practice in business intelligence to choose data samples within bigger data
populations, logic still dictates that the more information you have, the more
informed and educated your business decisions are. ETL’s capability to extract,
cleanse, and write data with ease allows you to use as much data as you can.

Information
is a prerequisite to intelligence. By extension, the amount, quality, and
integrity of the business information you get is directly proportional to the
success of your business intelligence projects. There are only a few tools that
can rival the efficacy of ETL software in terms of data processing for business
intelligence. Review your business intelligence needs and identify gaps that
ETL tools can bridge.

About Me

Will Byron has had a passion for technology, nutrition and competition for over 25 years. He enjoys writing articles on business intelligence, self-development, nutrition and participating in triathlons.