Thursday, December 09, 2010

I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:

Let's say that I want to delete all records that match an id from a set of tables. The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:

DELETE FROM {table_name} WHERE id = {identifier}

as the SQL to execute. In the parameters grid at the bottom right, I have two fields: table_name and identifier. What is the syntax for substituting the table_name and identifier parameters in the SQL script?

(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)

The parameters grid can be used for prepared statement value placeholders. In the SQL statement these placeholders are denoted as question marks (?). These are positional parameters: they get their values from the fields of the incoming stream that are entered in the parameters grid, in order. Here's an example of the correct usage of these placeholders:

DELETE FROM myTable WHERE id = ?

Here, the ? in the WHERE clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.
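The same positional binding can be tried outside Kettle. The following is only a minimal sketch using Python's built-in sqlite3 module, which also uses ? as its placeholder; the status column and the sample rows are made up for illustration, and the tuple stands in for the fields listed in the parameters grid.

```python
import sqlite3

# In-memory database with a hypothetical table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO myTable (id, status) VALUES (?, ?)",
    [(1, "old"), (1, "new"), (2, "old")],
)

# Two placeholders, so two values are bound, strictly in order:
# the first value goes to the first ?, the second value to the second ?.
row = (1, "old")  # plays the role of the fields of one incoming row
conn.execute("DELETE FROM myTable WHERE id = ? AND status = ?", row)

remaining = conn.execute(
    "SELECT id, status FROM myTable ORDER BY id, status"
).fetchall()
print(remaining)
```

Swapping the two values in the tuple would bind "old" to id and 1 to status, which is why the order of the fields in the grid matters.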

An important thing to realize is that these parameters can only be used to parameterize value expressions. This kind of parameter does not work for identifiers, nor does it work for structural elements of the SQL statement, such as keywords. So it cannot be used to parameterize the table name, which seems to be the intention in the original example posed in the question.
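To see this limitation in action, here is a small sketch with Python's sqlite3 module rather than Kettle itself. The exact error differs per database and driver, but a placeholder in the position of a table name is generally refused when the statement is prepared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER)")

try:
    # The first ? sits where an identifier is expected, not a value.
    conn.execute("DELETE FROM ? WHERE id = ?", ("myTable", 1))
    outcome = "accepted"
except sqlite3.OperationalError as error:
    # SQLite reports a syntax error: the statement cannot even be prepared.
    outcome = "rejected"
    print("rejected:", error)
```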

There is a way to parameterize the structural elements of the SQL statement as well as the values: you can apply variable substitution to the SQL statement.

Kettle variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. They get their value from "the environment": for example, parameters get their initial value when the transformation is started, and regular variables are typically set somewhere in the job that calls your transformation.

In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: ${VARIABLE_NAME}. So to parameterize the table name we could use something like this:

DELETE FROM ${TABLE_NAME}

In order to force Kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, all variables are simply substituted with their (string) value during transformation initialization. This is a lot like the way macros are substituted by the preprocessor in C/C++ code.
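The effect of this substitution can be imitated in a few lines. This is just an analogy, not Kettle's own code: Python's string.Template happens to use the same ${NAME} notation, and the variables dictionary is a hypothetical stand-in for the Kettle environment.

```python
from string import Template

# Hypothetical stand-in for the variables known to the transformation.
variables = {"TABLE_NAME": "myTable"}

# The SQL text as it would be typed into the step.
sql_text = "DELETE FROM ${TABLE_NAME} WHERE id = ?"

# Plain text replacement: the variable's string value is pasted into the SQL.
# The ? placeholder is untouched and is still bound later, per row.
resolved = Template(sql_text).substitute(variables)
print(resolved)
```

Only after this step does the database get to see the statement, which is why a variable can supply a table name where a ? placeholder cannot.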

When comparing variables with parameters, two important things should be mentioned here:

Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value simply becomes part of the text that makes up the SQL statement, so it is your responsibility to make sure the result is a syntactically valid and correct SQL statement.

Variables are evaluated once, during transformation initialization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. For the same reason, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.
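Both points can be put together in a rough simulation. This is a sketch under stated assumptions, not Kettle code: run_once is a hypothetical helper standing in for a single transformation run, string.Template imitates variable substitution, and the orders and invoices tables are invented for the example.

```python
import sqlite3
from string import Template

SQL_TEXT = "DELETE FROM ${TABLE_NAME} WHERE id = ?"

def run_once(conn, variables, ids):
    """Stand-in for one transformation run (hypothetical helper)."""
    # "Initialization": the variable is resolved exactly once, up front.
    statement = Template(SQL_TEXT).substitute(variables)
    # "Runtime": each incoming row only supplies a value for the ? placeholder.
    for identifier in ids:
        conn.execute(statement, (identifier,))
    return statement

conn = sqlite3.connect(":memory:")
for table in ("orders", "invoices"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER)")
    conn.executemany(f"INSERT INTO {table} (id) VALUES (?)", [(1,), (2,), (3,)])

# A different table name requires a new run, not merely a new row.
executed = [
    run_once(conn, {"TABLE_NAME": table}, [1, 2])
    for table in ("orders", "invoices")
]

left = {
    table: [r[0] for r in conn.execute(f"SELECT id FROM {table}")]
    for table in ("orders", "invoices")
}
print(executed)
print(left)
```

In Kettle terms, the loop over table names corresponds to a job (or parent transformation) that sets the variable and then calls the transformation once per table, while the ids arrive as rows within each call.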

Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in Kettle:

- Within the transformation, create a new Database connection. Use variables for the fields you want to parameterize. For example, use ${HOST} for the hostname, ${PORT} for the port number, and so on.

- Open the transformation settings and, in the "Parameters" tab, add a parameter for each variable you used. In the Parameters tab, use only the parameter name and leave out the ${...} markers (those are only appropriate when *referencing* a variable).

Yes. In every step that accepts a SQL statement in its configuration, that statement can be supplied in part or in its entirety by a Kettle variable or parameter. So, you can use this to dynamically set columns as well.

However, you will have to take care that the remainder of your transformation is resilient to a dynamic column makeup of the data stream as well.

Another thing you might want to look into to make your transformations more dynamic is metadata injection, which essentially lets you configure your steps dynamically (data driven): http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection

About Me

I'm Roland Bouman (@rolandbouman on twitter). I'm a software (web) application developer and I work on both the front end as well as the back end. I do data modeling, database design, ETL, Analytics, and Business Intelligence.