Note

These characters are all replaced with "_" in the file system and you may end up
creating duplicate items.

Set column analysis metadata (purpose, description and author name) in the
corresponding fields and click Next to
proceed to the next step.

Selecting the database columns and setting sample data

Expand DB connections and in the desired
database, browse to the columns you want to analyze.

Note

When profiling a DB2 database, if double quotes exist in the column
names of a table, the double quotation marks cannot be retrieved when
retrieving the column. Therefore, it is recommended not to use double
quotes in column names in a DB2 database table.

Select the columns and then click Finish
to close the wizard.

A file for the newly created column analysis is listed under the Analysis node in the DQ
Repository tree view, and the analysis editor opens with the
analysis metadata.

This example analyzes full names, email addresses and sales
figures.

In the Data preview view, click Refresh Data.

The data in the selected columns is displayed in the table.

In the Data preview view, select:

Option

To...

New Connection

open a wizard and create a connection to the data
source from within the editor.

list in the table N first data records from the
selected columns or list N random records from the
selected columns.

Refresh Data

display the data in the selected columns according to
the criteria you set.

In the Limit field, set the number for
the data records you want to display in the table and use as sample
data.

In the Analyzed Columns view, use the
arrows in the top right corner to open different pages in the view if you
analyze large number of columns.

You can drag the columns to be analyzed directly from the DQ Repository tree view to the Analyzed Columns list in this view.

If one of the columns you want to analyze is a primary or a foreign key,
its data mining type becomes automatically Nominal when you list it in the Analyzed Columns view. For more information on data mining
types, see Data mining types.

If required, right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view to locate it in the
database connection in the DQ Repository
tree view.

When you select to analyze Date columns and run the analysis
with the Java engine, the date information is stored in the studio and in the datamart as regular
date/time of format YYYY-MM-DD HH:mm:ss.SSS for date/timestamp and of
format HH:mm:ss.SSS for time. The date and time formats are slightly
different when you run the analysis with the SQL engine.

How to set indicators on columns

The second step after defining the columns to be analyzed is to set either system
or user-defined indicators for each of the defined columns.

How to set system or user-defined indicators

To set system indicators for the columns to be analyzed, do the
following:

From the Data preview view in the
analysis editor, click Select
indicators to open the [Indicator
Selection] dialog box.

From the [Indicator Selection] dialog box:

In the Data preview
section, place the cursor on a row to display the complete
data. This section lists the sample data you define in the
analysis editor.

Click in the cells next to indicators names to set
indicator parameters for the analyzed columns as needed. You
can assign system or user-defined indicators to the
columns.

Select the Hide non applicable
indicators check box to hide the system and
user-defined indicators that are not compatible with the
engine you select to execute the analysis.

If required, change the order of columns by dropping them
with the cursor.

The order of the columns will be changed accordingly in
the analysis editor.

Note

- It is not very useful to use Pattern Frequency Statistics on a
column of a Date type in databases when executing the analysis with
the SQL engine. No data quality issues are returned by this
indicator as all dates will be displayed using one single format.
For further information, check the article Why do we get 99-AAA-99 when profiling Date columns in
Oracle.

Indicators are accordingly attached to the analyzed columns in the
Analyzed Columns view.

The analysis in this example provides/computes the following:

simple statistics on all columns. For further information
about these indicators, see Simple statistics,

the characteristics of textual fields and the number of most
frequent values for each distinct record in the
fullname column . For further
information, see Text statistics and Advanced statistics respectively,

patterns in the email column to show
frequent and rare patterns so that you can identify quality
issues more easily. For further information about these
indicators, see Pattern frequency statistics,

the range, the inter quartile range and the mean and median
values of the numeric data in the
total_sales column. For further
information about these indicators, see Summary statistics.

the frequency of the digits 1 through 9 in the sales figures
to detect fraud. For further information, see Fraud Detection.

How to set options for system or user-defined indicators

You can define expected thresholds on the indicator's value. The threshold you
define is used for measuring the quality of data. If the indicator's value is
outside the defined threshold, then the data is of bad quality. You can define
only one threshold or no threshold at all. You may set these thresholds either
by value or by percentage with respect to the row count.