Set up and use federation in InfoSphere BigInsights Big SQL V3.0

This article introduces Big SQL V3.0 federation capabilities by using the
data sources Teradata, Oracle, Netezza, and IBM® DB2 for Linux, UNIX,
and Windows. Some knowledge of database systems is assumed. The focus is
on the basic configuration that is required to use Big SQL V3.0
federation.

Big SQL V3.0 federation, a feature of Big SQL V3.0, is highly configurable.
The command syntax that is used here is a simplified version of the
essential settings for a working system.

What you need for this
article

Knowledge of Big SQL in InfoSphere BigInsights, particularly about the
characteristics of the BIGSQL database.

Familiarity with the DBMS being used as data source

Knowledge of how to set up the remote server's connection and its
respective client.

Big SQL V3.0 installed on a system that runs Linux AMD64 or Linux PPC
(not all data sources are supported on PPC).

Clients for each data source, installed and configured on the Big SQL
V3.0 management node.

Use of Netezza® requires a DataDirect ODBC driver that is compatible
with the Netezza client.

Netezza-branded DataDirect Driver and the Netezza client, which can be
downloaded from IBM Fix Central.

Supported data sources


Because data analytics is crucial to the success of a business, data
warehousing that supports consolidation and federation of data is a
requirement.

Big SQL, part of IBM InfoSphere® BigInsights™, is the SQL
interface to Hadoop file systems. It enables the storage and handling of
massive amounts of data. Big SQL federation makes it possible to create a
mixed environment that consolidates big data and federates live data from
different data sources.

Overview of Big SQL V3.0
federation

The architecture of a federated server is fairly simple. It includes the
Big SQL V3.0 engine, the wrapper, and the remote database management
systems (DBMS) client.

One of the core elements of the engine is the optimizer (shown in Figure
1), which chooses the execution plan for a federated query based on a
cost analysis. The optimizer can rewrite the query, if necessary, to make
it more efficient. The optimizer distributes query work between the
federated server and the data sources to make the process as cost
efficient as possible. The cost of a query is automatically calculated
per nickname by the runstats utility, which is turned on by default.

Figure 1. Architecture of Big SQL V3.0

To enable Big SQL V3.0 federation, you need a wrapper, a server, a
nickname, and a user mapping.

Wrapper: A library that handles the communication
between the Big SQL V3.0 federation server and the DBMS client. Each
type of data source must have its own wrapper. The wrapper translates
the queries that come to and from the Big SQL V3.0 federation server
to calls on the exposed API of the client. The client communicates
with the data source.

Server: A remote database. Although a server object
is a remote database, you need to know the system that this database
resides in because that information determines what client is used to
connect to it.

Nickname: A single remote object in the data source
(for example, a table, a view, or a procedure) in a server. As with
any federated server, the use of data sources is managed
transparently. After the nicknames for the remote objects are defined
from a user perspective, they behave as if they are local objects. If
the data comes from the data source unchanged, the result of a query
is determined by the local rules. This transparency is important in
this case because the Big SQL database uses only binary collation, by
default. This type of collation affects the options that need to be
set when you create the server object.

User mapping: An association between a local
authorization ID and a remote authorization ID. Any operation that is
run on the federated server by the local ID is run on the data source
by using the ID that it was mapped to. For a user to query a nickname,
the user must be authorized to perform a SELECT operation
on the original table. For security reasons, on most systems the same
authorization ID with the same password does not exist on all
machines; therefore, user mapping is required.

After the setup is complete, Big SQL V3.0 federation is used through JSqsh,
the same way that Big SQL is normally accessed. To create federated
objects, an authorization ID with DBADM authority is required.

Installation

The installer for InfoSphere BigInsights automatically installs the
wrappers and applies the federation license. The wrappers that are shown
in Table 1 are included by default in the directory
$HOME/sqllib/lib/.

Table 1. Wrappers

Data source type    Wrapper library      Auxiliary files

DB2                 libdb2drda.so        libdb2drdaF.so, libdb2drdaU.so

Teradata            libdb2teradata.so    libdb2STteradataF.a, libdb2teradataU.so

Oracle              libdb2net8.so        libdb2net8F.so, libdb2net8U.so, libdb2STnet8F.a

Netezza             libdb2rcodbc.so      libdb2odbct.so, libdb2rcodbcF.so, libdb2rcodbcU.so

Informix            libdb2informix.so    libdb2informixF.so, libdb2informixU.so

Environment setup

Log in as user bigsql.
By default, Big SQL V3.0 federation is not enabled for immediate use.
Enable it by using the following command.

db2 UPDATE DBM CFG USING FEDERATED YES

You must restart the instance for the change to take effect.

For many data sources, you must set certain variables in the db2dj.ini
file. Variables that are set to file names or directories must use the
fully qualified name, and values cannot use metacharacters or environment
variables such as ~ or $HOME. The db2dj.ini file is in $HOME/sqllib/cfg
if it exists; otherwise, you can use any editor to create it. To alter
the default db2dj.ini path from the command line, run:

export DB2DJ=<CUSTOM_PATH>/db2dj.ini

<CUSTOM_PATH> is any path in the system (for example, $HOME/sqllib/cfg).

Some data sources need their client libraries to be part of the library
path. To add them, alter the profile file $HOME/sqllib/userprofile. After
you make any changes, reload the profile file and restart the instance.

To reload the profile, issue the following command:

. $HOME/sqllib/userprofile

To restart the instance of the Big SQL V3.0 federation server, run the
following code:

$BIGSQL_HOME/bin/bigsql stop
$BIGSQL_HOME/bin/bigsql start

Set up the connection to DB2 for
Linux, UNIX, and Windows

To identify which database the federated server connects to, the remote DB2
database must be cataloged in the federated server system database
directory.

Listing 1. Catalog the remote
server

DB2 CATALOG TCPIP NODE RMNODE REMOTE MYHOST SERVER 12345

RMNODE is the node name.

MYHOST is the host name of the remote machine.

12345 is the port of the service that the remote DB2 server is running
on, defined by setting the svcename configuration parameter at the
remote data source (DB2 often runs on port 50000).

Set up the connection to
Teradata

Add the following command to the startup file of the shell that is
used (for example, .kshrc if you use Korn shell), to export the
environmental variable TERADATA_LIB_DIR.

export TERADATA_LIB_DIR=/opt/teradata/client/lib64

/opt/teradata/client/lib64
is the absolute path where the Teradata client libraries are
located.

As shown in the following command, run the
djxlinkTeradata command in $HOME/sqllib/bin/ as
root. If the operation is successful, the file
libdb2STteradataF.a that is required for the Teradata wrapper is
created in the $HOME/sqllib/lib64 directory.

su root
<HOME>/sqllib/bin/djxlinkTeradata

<HOME>
is the home directory for Big SQL.

To avoid errors when you use nicknames, set the variable
TERADATA_CHARSET in the db2dj.ini file to the character
set used at the data source, as shown:

TERADATA_CHARSET=ASCII

ASCII
is the character set used by the Teradata database.

Set up the connection to
Oracle

Set the ORACLE_HOME and TNS_ADMIN variables
in the db2dj.ini file. Set ORACLE_HOME to the absolute
path where the Oracle client is installed, and set
TNS_ADMIN as the absolute path to your tnsnames.ora file.

ORACLE_HOME=/opt/oracleclient
TNS_ADMIN=/home/bigsql/resources

/opt/oracleclient
is the absolute path where the Oracle client is installed.
/home/bigsql/resources is the absolute path to the
location of the tnsnames.ora file.

Add client libraries to the path. Do this by adding the following code
extract to user profile above the line "export LD_LIBRARY_PATH
DB2LIBPATH":
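A minimal sketch of such an extract, assuming the Oracle client libraries
are installed in /opt/oracleclient/lib (an illustrative path; adjust it to
your installation):

```shell
# Illustrative extract for $HOME/sqllib/userprofile; place it above the
# line "export LD_LIBRARY_PATH DB2LIBPATH" so the export picks it up.
ORACLE_LIB_DIR=/opt/oracleclient/lib   # assumed Oracle client lib directory
LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$ORACLE_LIB_DIR
```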

Set up the connection to
Netezza

Download and install the Netezza client and the compatible ODBC
driver. Consider these two compatible drivers:

DataDirect Technologies Connect for ODBC driver

IBM DataDirect ODBC driver

You can download a complete package that includes the DataDirect
ODBC driver and the Netezza client. The package includes a sample
odbc.ini file, which describes the Netezza servers that are used
in the create server command.

Create a symbolic link named .odbc.ini in the home directory that points
to the location of your odbc.ini file:

ln -sf $HOME/resources/odbc.ini $HOME/.odbc.ini

$HOME/resources/odbc.ini
is the odbc.ini file being used (a sample is in the netezza
package).

Add the required variables to the db2dj.ini file, as shown:

ODBCINI=/home/bigsql/resources/odbc.ini

where
/home/bigsql/resources is the absolute path to the
location of the odbc.ini file.

Add ODBC driver libraries to the path. Do this by adding the following
code extract to user profile above the line "export LD_LIBRARY_PATH
DB2LIBPATH":
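A minimal sketch of such an extract, assuming the DataDirect ODBC driver
libraries are installed in /opt/odbc64v51/lib (an illustrative path):

```shell
# Illustrative extract for $HOME/sqllib/userprofile; place it above the
# line "export LD_LIBRARY_PATH DB2LIBPATH" so the export picks it up.
ODBC_LIB_DIR=/opt/odbc64v51/lib   # assumed ODBC driver lib directory
LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$ODBC_LIB_DIR
```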

Set up the connection to
Informix

Add the following command to the startup file of the shell that is
used (for example, .kshrc if you use Korn shell), to export the
environmental variable INFORMIXDIR:

export INFORMIXDIR=/opt/IBM/informix

Run the djxlinkInformix command in $HOME/sqllib/bin/ as
root. If the operation is successful, the file libdb2STinformixF.a
that is required for the Informix wrapper is created in the
$HOME/sqllib/lib64 directory.
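Mirroring the Teradata step, the commands can be sketched as:

```
su root
<HOME>/sqllib/bin/djxlinkInformix
```

<HOME> is the home directory for Big SQL.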

/opt/IBM/informix
is the absolute path to the Informix client.
/homedir/bigsql/bin/sqlhosts is the path to the file
containing the server definitions. en_US.819
indicates which locale is used for the client.
infdefault is the default Informix server to
use.
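The values described above correspond to entries in the db2dj.ini file. A
sketch of those entries, assuming the standard Informix client variable
names:

```
INFORMIXDIR=/opt/IBM/informix
INFORMIXSQLHOSTS=/homedir/bigsql/bin/sqlhosts
CLIENT_LOCALE=en_US.819
INFORMIXSERVER=infdefault
```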

Complete system setup

The following code listings show a sample of the configuration that a
user would have if all of the data sources were used at the same time.

Listing 7. Content of
odbc.ini
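The exact content depends on your driver and server; a minimal sketch,
with an illustrative data source name NZSQL, host, port, and database:

```
[ODBC Data Sources]
NZSQL=NetezzaSQL

[NZSQL]
Driver=/opt/odbc64v51/lib/libodbc.so
Servername=nzhost
Port=5480
Database=nzdb
```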

Listing 8. Content of
sqlhosts

inffake olsoctcp 127.0.0.1 1652

Wrapper creation and
use

A wrapper is required for each different data source type because each
wrapper is associated with a single library file. Wrappers can be written
as C++ or Java™ applications. They can be seen as a composition of
two subcomponents: the query compiler and the execution engine.

For wrappers written in C++, it is possible to increase performance by
running them as trusted applications in the database engine, a practice
that enables the compiler and the execution engine to run
concurrently.

Java wrappers always run in fenced mode. In this mode, the
compiler must finish before the query is handed over to an external
process to be run. However, fenced mode has the advantage of allowing
read-only queries that span different data sources to run in
parallel. By default, a wrapper runs in trusted mode. To set it to
fenced mode, set the DB2_FENCED option when the wrapper is
created.

Listing 11. Oracle data
source
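A minimal wrapper definition, using the Oracle wrapper library from
Table 1 (the wrapper name ORACLE is illustrative):

```sql
CREATE WRAPPER ORACLE LIBRARY 'libdb2net8.so'
```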

Listing 12. Netezza data
source
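A minimal wrapper definition, using the Netezza wrapper library from
Table 1; the MODULE option points at the ODBC driver, and the wrapper
name and driver path are illustrative:

```sql
CREATE WRAPPER NETEZZA LIBRARY 'libdb2rcodbc.so'
    OPTIONS (MODULE '/opt/odbc64v51/lib/libodbc.so')
```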

Listing 13. Informix data
source

CREATE WRAPPER INFORMIX LIBRARY 'libdb2informix.so'

Note: For the Netezza wrapper,
/opt/odbc64v51/lib/libodbc.so is the absolute path to the
ODBC driver.

Server definition and
use

A server refers to a remote database. The server definition must include
the type of server and the database version or release
level that it is in, the wrapper that should be used to
communicate with it, and a user and password to be used for
authentication.

The following section describes the minimum options that are required for
each data source and the value that the collating_sequence
option must be set to. The Big SQL V3.0 federation server uses only binary
collation. Binary collation is blank sensitive and differentiates empty
from null strings. Determine whether the data source's collating sequence
is compatible with binary collation by checking whether the data source's
collation behavior is the same for blank sensitivity and treatment of null
strings.
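The placeholders described below belong to the general create server
statement, which can be sketched as:

```sql
CREATE SERVER <SERVER_NAME> TYPE <SERVER TYPE> VERSION <SERVER VERSION>
    WRAPPER <WRAPPER_NAME>
    AUTHORIZATION "<REMOTE_USER>" PASSWORD "<USER_PASSWORD>"
    OPTIONS (<SERVER OPTIONS>)
```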

<WRAPPER_NAME> is a unique wrapper name defined by the
create wrapper command.

<REMOTE_USER> is a valid remote user.

<USER_PASSWORD> is the password
for <REMOTE_USER>.

<SERVER OPTIONS> is the definition of options for the server
being created.

The values for <SERVER TYPE> and
<SERVER VERSION> can be found on the respective data
source type in the following section.

Options for DB2 data source

The <TYPE> of a DB2 server is specified as
db2/<SUBTYPE> where SUBTYPE is your
platform. DB2 for Linux, UNIX, and Windows is the only supported platform;
therefore, the type is db2/udb.

The fully supported <VERSION> is 10.5.

The DB2 identity collation has the same behavior as binary collation in
terms of empty and null strings. It is blank insensitive; therefore, it is
incompatible with binary collation.

DB2 in Oracle compatibility mode is compatible in terms of blank padded
comparisons. But in this mode, empty and null strings are treated the
same; therefore, it is incompatible with binary collation.

To enable pushdowns in either of these cases, the
COLLATING_SEQUENCE must be set to N.

Table 2. DB2 for Linux, UNIX, and Windows options
reference

Option

Description

DBNAME

(Always required.) Specifies the
specific database to use for the initial remote DB2 database
connection. This specific database is the database alias for
the remote DB2 database that is cataloged on the federated
server that is mentioned in the environmental setup
instructions.

PUSHDOWN

Specifies whether the federated
server allows the data source to evaluate operations. Valid
values are Y and N. The default is
Y; the data source evaluates operations.
N specifies that the federated server send SQL
statements that include only SELECT with column
names.

COLLATING_SEQUENCE

(Required if
PUSHDOWN is set to Y) Specifies
whether the data source uses the same default collating
sequence as the federated database. Must be set to
N because there are currently no collations in
DB2 that are perfectly compatible with the Big SQL V3.0 binary
collation.
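A hedged example that combines these options (the server, wrapper, user,
and database names are illustrative; RMDB would be the cataloged database
alias):

```sql
CREATE SERVER DB2SERVER TYPE db2/udb VERSION 10.5 WRAPPER DRDA
    AUTHORIZATION "rmuser" PASSWORD "rmpassword"
    OPTIONS (DBNAME 'RMDB', COLLATING_SEQUENCE 'N')
```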

Teradata

The <TYPE> of a Teradata server is
teradata.

The fully supported <VERSION> is 12.

Teradata collations are ANSI-compliant. This means empty strings are
considered different from null strings; therefore, it is compatible with
binary collation, but it is blank insensitive.

To enable pushdowns, COLLATING_SEQUENCE must be set to
N.

Table 3. Teradata options reference

Option

Description

NODE

(Always required.) Specifies the
Teradata server. Can be set to the server alias, the IP
address, or the fully qualified domain name.

PUSHDOWN

Specifies whether the federated
server allows the data source to evaluate operations. Valid
values are Y and N. The default is
Y; the data source evaluates operations.
N specifies that the federated server send SQL
statements that include only SELECT with column
names.

COLLATING_SEQUENCE

(Required if
PUSHDOWN is set to Y.) Specifies
whether the data source uses the same default collating
sequence as the federated database. Must be set to
N.
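A hedged example that combines these options (the server, wrapper, user,
and host names are illustrative):

```sql
CREATE SERVER TDSERVER TYPE teradata VERSION 12 WRAPPER TERADATA
    AUTHORIZATION "tduser" PASSWORD "tdpassword"
    OPTIONS (NODE 'tdhost', COLLATING_SEQUENCE 'N')
```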

Oracle

The <TYPE> in Oracle is oracle.

The fully supported <VERSION> is 11.

Oracle comparisons are blank sensitive, and Oracle even has a binary
collation of its own. However, in Oracle, regardless of collation, empty
strings are treated the same as null strings; therefore, its collations
cannot be considered compatible with binary collation.

To enable pushdowns, COLLATING_SEQUENCE must be set to
N.

Table 4. Oracle options reference

Option

Description

NODE

(Always required.) Specifies an
entry in the tnsnames.ora file

PUSHDOWN

Specifies whether the federated
server allows the data source to evaluate operations. Valid
values are Y and N. The default is
Y; the data source evaluates operations.
N specifies that the federated server send SQL
statements that include only SELECT with column
names.

COLLATING_SEQUENCE

(Required if
PUSHDOWN is set to Y.) Specifies
whether the data source uses the same default collating
sequence as the federated database. Must be set to
N because there are currently no collations in
Oracle that are perfectly compatible with Big SQL V3.0 binary
collation.
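A hedged example (names are illustrative; ORA_TNS would be an entry in
the tnsnames.ora file):

```sql
CREATE SERVER ORASERVER TYPE oracle VERSION 11 WRAPPER ORACLE
    AUTHORIZATION "orauser" PASSWORD "orapassword"
    OPTIONS (NODE 'ORA_TNS', COLLATING_SEQUENCE 'N')
```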

Netezza

The <TYPE> is odbc. Although the connection
is to a Netezza server, it is made through the ODBC driver.

The fully supported <VERSION> is 6 for the examples given. The ODBC
driver version must be 3.0 or later.

Netezza has blank-sensitive comparisons and because null strings are not
supported, it is compatible with binary collation.

Table 5. Netezza options reference

Option

Description

NODE

(Always required.) Specifies the
name of the node or the system DSN name that is assigned to
the ODBC data source defined in the odbc.ini file. The value
is case-sensitive.

PUSHDOWN

Specifies whether the federated
server allows the data source to evaluate operations. Valid
values are Y and N. The default is
Y; the data source evaluates operations.
N specifies that the federated server send SQL
statements that include only SELECT with column
names.
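A hedged example (names are illustrative; the NODE value matches a DSN
defined in the odbc.ini file and is case-sensitive):

```sql
CREATE SERVER NZSERVER TYPE odbc VERSION 6 WRAPPER NETEZZA
    AUTHORIZATION "nzuser" PASSWORD "nzuser"
    OPTIONS (NODE 'NZSQL')
```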

Informix

The <TYPE> is informix.

The fully supported <VERSION> is 11.

Informix collations are not compatible with binary collation. Empty
strings are considered different from null strings, but comparisons are
blank insensitive.

To enable pushdowns, COLLATING_SEQUENCE must be set to
N.

Table 6. Informix options reference

Option

Description

NODE

(Always required.) Specifies the
Informix server. Must be a server alias defined in the
sqlhosts file.

PUSHDOWN

Specifies whether the federated
server allows the data source to evaluate operations. Valid
values are Y and N. The default is
Y; the data source evaluates operations.
N specifies that the federated server send SQL
statements that include only SELECT with column
names.

COLLATING_SEQUENCE

(Required if
PUSHDOWN is set to Y.) Specifies
whether the data source uses the same default collating
sequence as the federated database. Must be set to
N.

DBNAME

(Always required.) The name of
the Informix database that you want to access
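A hedged example (names are illustrative; the NODE value inffake matches
the server alias in the sample sqlhosts file shown earlier):

```sql
CREATE SERVER INFSERVER TYPE informix VERSION 11 WRAPPER INFORMIX
    AUTHORIZATION "infuser" PASSWORD "infpassword"
    OPTIONS (NODE 'inffake', DBNAME 'infdb', COLLATING_SEQUENCE 'N')
```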

Usage examples

The following code listings show how to create a server for various data
sources.

Listing 19. Define an association between a user
ID and password at the federated server and data
source
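The statement that the placeholders below describe can be sketched as:

```sql
CREATE USER MAPPING FOR <LOCAL_ID> SERVER <SERVER>
    OPTIONS (REMOTE_AUTHID '<REMOTE_USER>', REMOTE_PASSWORD '<USER_PASSWORD>')
```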

<LOCAL_ID> is a local authorization ID. It can also be
one of the special registers USER or PUBLIC,
where USER maps the ID currently connected to the database
and PUBLIC maps all valid authorization IDs.

<SERVER> is a unique server name defined by the create
server statement.

Listing 23. Map all the valid authorization IDs to the
Netezza user nzuser with password
nzuser
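Assuming the Netezza server object is named NZSERVER (an illustrative
name):

```sql
CREATE USER MAPPING FOR PUBLIC SERVER NZSERVER
    OPTIONS (REMOTE_AUTHID 'nzuser', REMOTE_PASSWORD 'nzuser')
```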

Function mapping

Function mappings create an association between functions in the federated
server and the data source. As with data types, functions have default
mappings, but a user-defined function, for example a function that does
not exist at the data source, is not mapped by default.

Listing 24. Syntax for mapping a function on a
particular
server
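A sketch of the statement; note that <LOCAL_FUNCTION> typically includes
the parameter data types:

```sql
CREATE FUNCTION MAPPING <MAPPING_NAME> FOR <LOCAL_FUNCTION>
    SERVER <SERVER> OPTIONS (REMOTE_NAME '<REMOTE_FUNCTION>')
```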

<MAPPING_NAME> is the unique name given to this
mapping. <LOCAL_FUNCTION> is the function that is
mapped on the federation server. <SERVER> is the
unique server name as defined in the create server statement.
<REMOTE_FUNCTION> is the function that is mapped on
the data source.

Listing 25. Syntax for mapping a function on any server
of a given
type
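A sketch of the per-type variant, where the mapping applies to every
server of the given type:

```sql
CREATE FUNCTION MAPPING <MAPPING_NAME> FOR <LOCAL_FUNCTION>
    SERVER TYPE <SERVER_TYPE> OPTIONS (REMOTE_NAME '<REMOTE_FUNCTION>')
```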

<MAPPING_NAME> is the unique name given to this
mapping. <LOCAL_FUNCTION> is the function that is
mapped on the federation server. <SERVER_TYPE> is
the type of the server that the mapping applies to, the same as
<TYPE> in the create server statements.
<REMOTE_FUNCTION> is the function that is mapped on
the data source.

Data type mapping

Data type mappings associate the federated database data types with the
data types at the data source. These mappings can be forward or reverse
depending on whether a data type is transformed when it's being sent or
fetched from the data source.

This association might be necessary if the data type does not exist locally
or if the data type is incompatible.

For example, when you use an Oracle data source:

Data type NUMBER(8,0) is not a valid data type in Big SQL
V3.0. You can map it to its equivalent INT data
type.

The date data type that is represented as a time stamp in
Oracle is mapped to the local date data type.
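For the NUMBER(8,0) example, a forward mapping that converts the Oracle
type to the local INTEGER type can be sketched as (the mapping name is
illustrative, and the clause order follows DB2 federation syntax):

```sql
CREATE TYPE MAPPING ORA_INT_MAP
    FROM SERVER TYPE oracle TYPE NUMBER(8,0) TO SYSIBM.INTEGER
```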

Data type mapping can be created on the server or nickname level.

Server level mapping

Server level data type mappings are applied to all transactions with the
data source that use a certain data type. When a server of a determined
<TYPE> is created, some data type mappings are created
by default. See the full list of default mappings.

Nicknames

In Big SQL V3.0, a nickname is a local designation for a remote object,
such as a table or a view. To create a nickname, the user that issues the
command must be mapped to a valid user on the data source. Mapping is
explained in the preceding section.

Create nickname syntax

The following examples show how to create nicknames for various
entities.

Listing 35. Explicitly create a nickname for an existing
table

CREATE NICKNAME <LOCAL_NAME> FOR <SERVER>.<SCHEMA>.<REMOTE_NAME>

<LOCAL_NAME> is the name that the object has
locally.

<SERVER> is a unique server name.

<SCHEMA> is the schema that the object was created under
in the remote database.

<REMOTE_NAME> is the name that the object has on the
remote server.

Listing 36. Implicitly create a nickname by creating a
table at the data
source
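With transparent DDL, the table is created at the data source and the
nickname is defined in one statement; a sketch with illustrative column
definitions and the same placeholders as in Listing 35:

```sql
CREATE TABLE <LOCAL_NAME> (C1 INT, C2 VARCHAR(20))
    OPTIONS (REMOTE_SERVER '<SERVER>', REMOTE_SCHEMA '<SCHEMA>',
             REMOTE_TABNAME '<REMOTE_NAME>')
```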