
During the Development Phase, the system developer takes the detailed logical information documented in the previous phase, transforms it into machine-executable form, and ensures that all of the individual components of the automated system/application function correctly and interface properly with other components within the system/application. As necessary and appropriate, system hardware, networking and telecommunications equipment, and COTS/GOTS software are acquired and configured. New custom software programs are developed, database(s) are built, and software components (COTS, GOTS, and custom-developed software and databases) are integrated. Test data and test case specifications are finalized. Unit and integration testing is performed by the developer, with test results appropriately documented. Data conversion and training plans are finalized and user procedures are baselined, while operations, office, and maintenance procedures are also initially developed. The Development Phase ends with a Stage Gate Review to determine readiness to proceed to the Test Phase.

Adding a foreign key to tables in Netezza / PureData is a best practice, especially when working with dimensionally modeled data warehouse structures and with modern governance, integration (including virtualization), and presentation semantics (including reporting, business intelligence, and analytics).

Foreign Key (FK) Guidelines

A primary key must be defined on the table and field (or fields) to which you intend to link the foreign key

Avoid using distribution keys as foreign keys

The foreign key field(s) should not be nullable

Your foreign key link field(s) must be of the same data type(s) as the primary key field(s) they reference (e.g. integer to integer)

Definition of a Foreign Key

A foreign key (FK) is a constraint that references the unique primary key (PK) of another table.

Facts About Foreign Keys

Foreign keys act as a cross-reference between tables, linking the foreign key (child record) to the primary key (parent record) of another table, thereby establishing a link/relationship between the table keys

Foreign keys are not enforced by all RDBMS

The concept of referential integrity is derived from foreign key theory

Because foreign keys involve a relationship between more than one table, their implementation can be more complex than that of primary keys

In some RDBMS, a foreign-key constraint implicitly defines an index on the foreign-key column(s) in the child table; where it does not, manually defining a matching index may improve join performance

SQL normally provides the following referential integrity actions for deletions when enforcing foreign keys:

Cascade

The deletion of a parent (primary key) record causes the deletion of the corresponding foreign-key (child) records.

No Action

Forbids the deletion of a parent (primary key) record if there are dependent foreign-key records. "No Action" does not mean the foreign-key constraint is suppressed.

Set null

The deletion of a parent (primary key) record causes the corresponding foreign-key fields to be set to null.

Set default

The deletion of a parent (primary key) record causes the corresponding foreign-key fields to be set to a default value instead of null.
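To tie these points together, here is a generic SQL sketch of a parent table, a child table with a non-nullable foreign key of a matching data type, and a delete action. The table and field names are illustrative, and note that Netezza/PureData accepts constraint declarations like these (delete actions included) but does not enforce them:

CREATE TABLE customer
(
CUSTOMER_ID INTEGER NOT NULL, -- primary key field
CUSTOMER_NAME VARCHAR(100),
CONSTRAINT PK_CUSTOMER PRIMARY KEY (CUSTOMER_ID)
);

CREATE TABLE customer_order
(
ORDER_ID INTEGER NOT NULL,
CUSTOMER_ID INTEGER NOT NULL, -- foreign key field: not nullable, same type as the parent key
CONSTRAINT PK_CUSTOMER_ORDER PRIMARY KEY (ORDER_ID),
CONSTRAINT FK_CUSTOMER_ORDER FOREIGN KEY (CUSTOMER_ID)
REFERENCES customer (CUSTOMER_ID) ON DELETE CASCADE
);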

I had reason today to get the number of the day of the week in PureData / Netezza, which I don't seem to have discussed in previous posts. So, here is a simple script to get the day-of-week number, in a couple of flavors, which may prove useful.

Basic Format

select extract(dow from <<FieldName>>) from <<SchemaName>>.<<tableName>>

Example SQL

SELECT
CURRENT_DATE
, TO_CHAR(CURRENT_DATE,'DAY') AS DAY_OF_WEEK
--WEEK STARTS ON MONDAY
, EXTRACT(DOW FROM CURRENT_DATE)-1 AS DAY_OF_WEEK_NUMBER_STARTS_ON_MONDAY
--WEEK STARTS ON SUNDAY
, EXTRACT(DOW FROM CURRENT_DATE) AS DAY_OF_WEEK_NUMBER_STARTS_ON_SUNDAY
--WEEK STARTS ON SATURDAY
, EXTRACT(DOW FROM CURRENT_DATE)+1 AS DAY_OF_WEEK_NUMBER_STARTS_ON_SATURDAY

What is a Composite Primary Key?

A composite primary key is a primary key that is defined by having multiple fields (columns) in it. As with a primary key, exactly what a composite primary key is depends on the database. Essentially, a composite primary key:

Is a combination of Fields (columns) which uniquely identifies every row.
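For example, a hypothetical order-line table, where no single field is unique but the combination of fields is (generic SQL; table and field names are illustrative):

CREATE TABLE order_line
(
ORDER_ID INTEGER NOT NULL,
LINE_NUMBER INTEGER NOT NULL, -- unique only in combination with ORDER_ID
ITEM_ID INTEGER NOT NULL,
QUANTITY INTEGER,
CONSTRAINT PK_ORDER_LINE PRIMARY KEY (ORDER_ID, LINE_NUMBER)
);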

The APT_TSortOperator warning happens when there is a conflict in the partitioning behavior between stages. Usually, this is because the successor (downstream) stage has its 'Partitioning / Collecting' and 'Sorting' properties set in a way that conflicts with the predecessor (upstream) stage's properties, which are set to preserve partitioning. This can occur when the predecessor stage has the 'Preserve Partitioning' property set to:

‘Default (Propagate)’

‘Propagate’, or

‘Set’

Preserve Partitioning Property – list

Message ID: IIS-DSEE-TFOR-00074

Message Text: <<Link Name Where Warning Occurred>>: When checking operator: Operator of type "APT_TSortOperator": will partition despite the preserve-partitioning flag on the data set on input port 0.

Warning Fixes

First, verify that the partitioning behaviors of both stages are correct

If so, set the predecessor ‘Preserve Partitioning’ property to “Clear”

If not, then correct the partitioning behavior of the stage which is in error

Microsoft doesn't provide much macro-level guidance on naming conventions; however, sometimes it is useful to have a place to start. Also, there are times when flexibility with naming conventions is necessary. So, here is a quick set of SQL Server naming conventions, which may be helpful if you find yourself working with a customer who doesn't have an established set of naming convention standards and you need to assemble a set fast.

Default Repository Structure

When a project is created, there is a default repository structure created for use in the DataStage designer client.

Default DataStage Repository Structure

However, some additional organization will be required for most DataStage projects. Usually, this organization occurs in these areas:

Addition of structure within the “Jobs” folder

Addition of a “Parameter Sets” folder

Addition of structure within the “Table Definitions” folder

Addition of a “Developer Work Area” folder

Repository Structure within the “Jobs” folder

Below is a sample of a folder structure for multiple applications that share a common repository. The pattern includes, but does not illustrate, all of the other delivered folders. In addition to the core folder structure, developers can create individual working, test, and in-progress folders, which do not migrate, but keep work segregated.

Jobs Folder Pattern Datastage Repository Structure
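As an illustration, a hypothetical Jobs folder pattern for two applications sharing one repository might look like the following (the application and subfolder names are assumptions for the sketch, not delivered names):

Jobs
    ApplicationA
        Extract
        Transform
        Load
        Sequences
    ApplicationB
        Extract
        Transform
        Load
        Sequences
    Developer Work Area    (individual working, test, and in-progress folders; not migrated)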

Parameter Sets Folders

The parameter set folders are for two sets of information.

First are the database parameters, which include data connections and the attached parameter sets.

The second is for job parameters, which may include parameter sets for things like e-mail parameters, surrogate key file paths, etc.; using parameter sets for these is a best practice, rather than creating them as project-level parameters.

Parameter Sets Folder Pattern Datastage Repository Structure

Table Definitions

The Table Definitions folder has folders added to segregate the imported metadata for source and target systems and, in some cases, may need folders to logically organize imported metadata which may reside within the same database and/or schema, but belong to different logical layers.

Peer code review happens during the development phase and focuses on the overall quality of, and compliance to standards of, code and configuration artifacts. However, the hard part of a peer code review isn't performing the review, but rather achieving consistency and thoroughness in the review. This is where a checklist can contribute significantly, providing a list of things to check and a relative weight for the findings. I hope this template assists with your DataStage job review process.

A peer review is an examination of a Software Development Life Cycle (SDLC) work product by team members other than the work product's author, to identify defects, omissions, and compliance to standards. This process provides an opportunity for quality assurance, knowledge sharing, and product improvement early in the SDLC.


Create a Parameter Set from the navigation of DataStage designer

This is, perhaps, the more traditional way of creating a parameter set.

To create a new Parameter Set

Select: File > New > Other and select “Parameter Set”

This will launch a dialog.

Fill out the appropriate information on the General tab and then proceed to the Parameters tab:

In the Parameters tab, enter the parameters you wish to include in this parameter set.

Note: Existing environment variables can also be added.

On the Values tab, specify a Value File name (please follow naming convention standards to prevent rework and other problems). This is the name of the file that will automatically be created on the Engine tier. This tab also allows you to view/edit values located in the value file.

Click OK to save the Parameter set.

Create a Parameter Set from a job

This approach is, perhaps, less traditional, but it is equally effective if you find yourself creating additional jobs and now need to share the same parameters. It is a quick and easy way to generate a parameter set from an existing job.

To create a new Parameter Set from a job

Open the job that you want to create a parameter set for.

Click “Edit > Job Properties” to open the “Job Properties” window.

Click the “Parameters” tab.

Press and hold the Ctrl key, then select the parameters that you want to include in the parameter set.

During a recent project, I had reason to investigate the maximum character field lengths in Netezza, to be able to fit the data ETL'ed from the source. The length limits on Netezza character fields are not as generous as in some other databases, so I thought I would document them in a quick post for future reference. Knowing these limits will help with your ETL design and construction, since you may need to know where to truncate your field input data.

Data type: CHAR
Alias names: CHARACTER, CHAR(n), CHARACTER(n)
InfoSphere DataStage data types (SQL type): Char
Maximum length: 64,000
Notes: Fixed-length character string, blank padded to length n. If you do not specify n, the default is an unsized CHAR value.

Data type: VARCHAR
Alias names: CHARACTER VARYING, VARCHAR(n), CHARACTER VARYING(n), CHAR VARYING(n)
InfoSphere DataStage data types (SQL type): VarChar, LongVarChar
Maximum length: 64,000
Notes: Variable-length character string to a maximum length of n. If you do not specify n, the default is an unsized VARCHAR value. There is no blank padding, and the value is stored as entered.

Data type: NCHAR
Alias names: NATIONAL CHARACTER, NATIONAL CHAR(n), NCHAR(size)
InfoSphere DataStage data types (SQL type): NChar, Char
Maximum length: 16,000
Notes: Fixed-length character string, blank padded to length n. If you do not specify n, the default is an unsized NCHAR value.

Data type: NVARCHAR
Alias names: NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING, NATIONAL CHAR VARYING(n), NATIONAL CHARACTER VARYING(n), NVARCHAR(n)
InfoSphere DataStage data types (SQL type): VarChar, NVarChar, LongVarChar, LongNVarChar
Maximum length: 16,000
Notes: Variable-length character string to a maximum length of n. If you do not specify n, the default is an unsized NVARCHAR value.
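As a quick illustration, here is a hypothetical Netezza DDL sketch using each of these types (schema, table, column names, and sizes are illustrative; note that declaring several maximum-size columns in a single table can exceed the platform's row-size limit, so realistic lengths are shown with the maxima in comments):

CREATE TABLE blog.char_type_example
(
CODE_FLD CHAR(10),          -- fixed length, blank padded; maximum 64,000
NAME_FLD VARCHAR(255),      -- variable length, stored as entered; maximum 64,000
LOCALE_FLD NCHAR(8),        -- national fixed length; maximum 16,000
DESC_FLD NVARCHAR(1000)     -- national variable length; maximum 16,000
);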

The 'Comment on Column' statement provides the same self-documentation capability as 'Comment on Table', but drives the capability down to the column (field) level. This provides an opportunity to describe the purpose, business meaning, and/or source of a field to other developers and users. The comment code is part of the DDL and can be migrated with the table structure DDL. The statement can be run independently, or, when working in Aginity for PureData System for Analytics, the statements can be run as a group, with the table DDL, using the 'Execute as a Single Batch' (Ctrl+F5) command.
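Here is a minimal sketch of the column-level syntax (the schema, table, and field names are illustrative):

COMMENT ON COLUMN blog.d_customer.CUST_SINCE_DT IS 'Date of the customer''s first order; sourced from the CRM system.';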

Using the 'Comment On' capabilities of Netezza is an easy way to provide self-documentation within the DDL structure of a table. This can be very helpful to developers and users who can see the DDL of the table. Personally, I think of it as a way to document the purpose and/or business meaning of a table. This can be especially helpful when differing teams may be working with the tables. It also reduces the reliance on developer and/or user memory, or the need to search for technical documentation, which may or may not be readily available.
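The table-level variant follows the same pattern (schema and table names are, again, illustrative):

COMMENT ON TABLE blog.d_customer IS 'Customer dimension; one row per customer.';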

APT_FILE_EXPORT_BUFFER_SIZE and APT_FILE_IMPORT_BUFFER_SIZE in DataStage Administrator

While extensive use of sequential files is not a best practice, sometimes there is no way around it, due to legacy systems and/or existing processes. Recently, I have encountered a number of customers who are seeing significant performance issues with sequential-file-intensive processes. Sometimes it's the job design, but often, when you look at the project configuration, they still have the default values. This is a quick and easy thing to check and adjust to get a quick performance win, if the values have not already been adjusted.

These are delivered variables, but they should seriously be considered for adjustment in nearly all DataStage ETL projects. The adjustment must be based on the amount of available memory, the volume of workload that is sequential-file intensive, and the environment you're working in. Some experiential adjustment may be required, but I have provided a few recommendations below.
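Since these are environment variables, one hypothetical way to override them is dsenv-style exports, as sketched below; the values are interpreted in kilobytes, with a delivered default of 128, and the 1024 figures are illustrative starting points rather than a universal recommendation:

# increase the sequential file I/O buffers from the 128 KB default
export APT_FILE_EXPORT_BUFFER_SIZE=1024   # buffer used when writing sequential files
export APT_FILE_IMPORT_BUFFER_SIZE=1024   # buffer used when reading sequential files

These can also be set as project-level environment variables in DataStage Administrator, which is usually the cleaner approach.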

Here are a few thoughts on effectively working with IBM Infosphere, Information Server, DataStage surrogate key files, which may prove useful for developers.

Placement

The main thing about placement is that it be in a consistent location. Developers and production support teams should not need to guess or look up where it is for every DataStage project. So, it is best to put the surrogate key files in the same base path, with each project having its own subfolder, to facilitate migrations and to reduce the possibility of human error. Here is the path structure which I commonly use:

Path

/data/SRKY/<<Project Name>>

Parameter Sets

As a best practice, the surrogate key file path should be in a parameter set and the parameter used in the jobs, as needed. This simplifies maintenance, if and when changes to the path are required, and during migrations.

Surrogate Key Parameter Set Screenshot – Example Parameter Tab

Surrogate Key Parameter Set Screenshot – Example Values Tab

Surrogate Key Job Parameter Example Path using Parameter

Permissions

DataStage must have permissions to:

The entire parent path

The project folder, and

The surrogate key files themselves.

To ensure that DataStage has access to the path and surrogate key files:

The 'dsadm' (owner) and 'dstage' (group) have access to folders in the path, with at least a "-rw-r--r--" (644) permissions level. Keeping the permissions to a minimum can, for obvious reasons, prevent inadvertent overwrites of the surrogate key files; thus avoiding some potentially serious cleanup.

The ‘dsadm’ (owner) and ‘dstage’ (group) have access to the surrogate key files
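As a sketch, assuming the path structure suggested earlier, the ownership and permissions could be applied from the command line like this (note that directories also need the execute bit so they can be traversed):

# apply owner/group, then directory and file permissions
chown -R dsadm:dstage /data/SRKY/<<Project Name>>
chmod 755 /data/SRKY/<<Project Name>>     # directory: traversal requires the execute bit
chmod 644 /data/SRKY/<<Project Name>>/*   # key files: -rw-r--r--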

This productivity tip shows how to quickly create a new surrogate key file in Linux. This example leverages native capabilities of Red Hat Enterprise Linux (RHEL) to skip a few commands, by using an existing surrogate key file to create a new surrogate key file with a minimum of keystrokes and command-line entries.

Creating a New Surrogate Key File From an Existing File

The basic process consists of just a few steps:

Navigate to the location of your existing surrogate key files

Copy an existing surrogate file

Empty the new surrogate key file

Navigate to the location of your existing surrogate key files

This is a preparatory step; you will need to look at the path variable for the project you are working with to know where to go. The actual path to the surrogate key files for your project can vary by project.
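For example, using the common path structure suggested earlier (substitute your project's actual path):

cd /data/SRKY/<<Project Name>>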

Copy an existing surrogate file

Assuming you have existing surrogate key files configured as needed, the use of the copy (cp) command with the interactive and preserve options can eliminate the need to create the file and then set groups and permissions. The interactive (-i) option prevents you from overwriting an existing file, in case you made a filename typo, and the preserve (-p) option preserves the specified attributes (e.g. ownership and permissions).

Basic Command

Here is the command format with interactive and preserve (either format works):

cp -ip <<FileName to Be Copied>> <<New Filename>>

Here is the command format with only preserve:

cp -p <<FileName to Be Copied>> <<New Filename>>

Example Command

cp -ip srky blogexample.srky

Copy Surrogate Key With Permissions

Empty the new surrogate key file

Setting the newly created surrogate key file to null will empty the file, so DataStage can begin from the point configured in your DataStage job.
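A minimal sketch, using shell redirection to truncate the file copied in the earlier example without disturbing its ownership or permissions:

cat /dev/null > blogexample.srky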