
This content is part of the series: Using IBM Big SQL over HBase, Part 1

Stay tuned for additional content in this series.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary,
downloadable version of InfoSphere BigInsights, IBM's Hadoop-based
offering. Using Quick Start Edition, you can try out the features IBM
has built to extend the value of open source Hadoop, like Big SQL,
text analytics, and BigSheets. Guided learning is available to make
your experience as smooth as possible, including step-by-step,
self-paced tutorials and videos to help you start putting Hadoop to
work for you. With no time or data limit, you can experiment on your
own time with large amounts of data. Watch the videos and download InfoSphere BigInsights Quick Start Edition now.

This series walks you through using IBM's Big SQL technology with
InfoSphere BigInsights to query HBase using standard SQL. Here, you'll see
how to migrate a table from a relational database to InfoSphere
BigInsights using Big SQL over HBase. You'll also explore how HBase
handles row keys and learn about some pitfalls you might encounter. We'll
try some useful options, such as pre-creating regions, to see how they can
help with data loading and queries, and we'll cover various ways to load
data.

This series covers extensive ground, so we've omitted some fundamental
information. At least a rudimentary understanding of InfoSphere
BigInsights, HBase, and Jaql is assumed (see Related topics for more information about these technologies).
You can also download the sample data used in
this series.

Background

This exercise uses one table from the Great Outdoors Sales Data Warehouse
model (GOSALESDW): SLS_SALES_FACT. Figure 1 shows the details of the table
and its primary key information.

Figure 1. SLS_SALES_FACT table

Assume there is an available instance of DB2® that contains the
following table with data pre-loaded for our migration.

One-to-one mapping

In this section, we use Big SQL to do a one-to-one mapping of the columns
in the relational DB2 table to an HBase table row key and columns. This is
not a recommended approach. However, the goal of the exercise
is to demonstrate the inefficiency and pitfalls that can occur with such a
mapping.

Big SQL supports both one-to-one and many-to-one mappings.

In a one-to-one mapping, the HBase row key and each HBase column
are mapped to a single SQL column. In Figure 2, the HBase row key is
mapped to the SQL column id. Similarly, the
cq_name column within the cf_data column family
is mapped to the SQL column name, etc.

Figure 2. One-to-one mapping

To begin, you can optionally create a schema to keep tables organized.
Within the Big SQL (JSQSH) shell, use the create schema
command to create a schema named gosalesdw, as shown below.

CREATE SCHEMA gosalesdw;

Issue the command shown below in the same Big SQL shell. This DDL statement
will create the SQL table with the one-to-one mapping of what is in our
relational DB2 source. Notice all the column names are the same with the
same data types. The column mapping section requires a mapping for the row
key. HBase columns are identified using family:qualifier.
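
The original listing is not reproduced here; the sketch below shows the
general shape of such a statement, with the column list abbreviated (the
full GOSALESDW table has additional key and measure columns).

CREATE HBASE TABLE gosalesdw.sls_sales_fact
(
  order_day_key int,
  organization_key int,
  employee_key int,
  retailer_key int,
  product_key int,
  quantity int,
  unit_sale_price double
)
COLUMN MAPPING
(
  key MAPPED BY (order_day_key),
  cf_data:cq_organization_key MAPPED BY (organization_key),
  cf_data:cq_employee_key MAPPED BY (employee_key),
  cf_data:cq_retailer_key MAPPED BY (retailer_key),
  cf_data:cq_product_key MAPPED BY (product_key),
  cf_data:cq_quantity MAPPED BY (quantity),
  cf_data:cq_unit_sale_price MAPPED BY (unit_sale_price)
);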

Big SQL supports a load from source command that can load data
from warehouse sources, which we'll use first. Big SQL also supports
loading data from delimited files using a load hbase command,
which we'll use later.

Adding new JDBC drivers

The load from source command uses Sqoop internally to do the
load. Therefore, before using the load command from a Big SQL shell, you
need to add the JDBC driver for the source database to the Sqoop library
directory and then to the JSQSH terminal shared directory.

From a Linux® terminal, issue the following command (as the InfoSphere
BigInsights administrator) to copy the JDBC driver JAR file used to access
the database into the $SQOOP_HOME/lib directory.

cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib

From the Big SQL shell, you can examine the drivers loaded for the JSQSH
terminal, as shown below.

\drivers

Copy the same DB2 driver to the JSQSH share directory with the following
command.
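
The location of the JSQSH shared directory varies by installation; assuming
it is at $JSQSH_HOME/share (a placeholder for your actual path), the
command would look like this:

cp /opt/ibm/db2/V10.5/java/db2jcc.jar $JSQSH_HOME/share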

When new drivers are added, the Big SQL server must be restarted. You can
do this from the web console or by using the following command from the
Linux terminal.

stop.sh bigsql && start.sh bigsql

You can verify that the driver was loaded into JSQSH by using the
\drivers command, as shown above.

Now that the drivers have been set, the load can finally take place. The
load from source statement extracts data from a source
outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads
that data into an InfoSphere BigInsights HBase (or Hive) table.

Issue the following command to load the SLS_SALES_FACT_10P table from DB2
into the SLS_SALES_FACT table we have defined in Big SQL.
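
The exact statement is not reproduced here; a sketch follows, in which the
connection URL, credentials, and split column are placeholders to adapt to
your environment.

LOAD USING JDBC CONNECTION URL 'jdbc:db2://db2host:50000/GOSALESDW'
  WITH PARAMETERS (user = 'db2inst1', password = 'password')
  FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
  INTO TABLE gosalesdw.sls_sales_fact APPEND;

The load should report 44,603 rows affected. You can then check the row
count from Big SQL with select count(*) from gosalesdw.sls_sales_fact,
which, as discussed below, returns only 33.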

You should also verify from an HBase shell. Issue the count
command, as shown below, to verify the number of rows.

count 'gosalesdw.sls_sales_fact'

The results from the Big SQL statement and the HBase command should match.

33 row(s) in 0.7000 seconds

However, this doesn't yet explain why there is a mismatch between the
number of loaded rows and the number of retrieved rows when you query the
table.

The load (and insert, to be examined later)
command behaves like an upsert. If a row with the same row key already
exists, HBase writes the new value as a new version for that column or
cell. When querying the table, only the latest value is returned by Big
SQL.

This behavior can be confusing. In our case, we loaded data with repeating
row key values from a DB2 table containing 44,603 rows, and the load
reported 44,603 rows affected. However, select count(*)
returned only 33 rows. No errors are thrown in such scenarios, so it is
always recommended to cross-check the number of rows by querying the
table, as in our example.

Now that you understand that all the rows are actually versioned in HBase,
we can examine a possible way to retrieve all versions of a particular
row.

First, from the Big SQL shell, issue the following select
query with a predicate on the order day key. In the original table, there
are most likely many tuples with the same order day key.
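
A sketch of such a query, using the order day key value that appears
throughout this series:

SELECT * FROM gosalesdw.sls_sales_fact WHERE order_day_key = 20070720;

Only one row is returned, because the repeated rows were stored as
versions. From the HBase shell, you can then retrieve older versions of
the row's cells with a get that specifies a VERSIONS count; the column
qualifier below is illustrative.

get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN => 'cf_data:cq_quantity', VERSIONS => 4}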

Optionally, try the same command again, specifying a larger version number
(VERSIONS => 100, for example).

Either way, this is most likely not the behavior users expect when
performing such a migration. Users probably want to get all the data into
the HBase table without versioned cells. There are a couple of solutions.
One is to define the table with a composite row key to enforce uniqueness,
which will be covered later. Another option, outlined in the next section,
is to force each row key to be unique by appending a universally unique
identifier (UUID).

One-to-one mapping with a unique clause

Another approach to the migration is to use the
force key unique option when creating the table using Big SQL
syntax. This option will force the load to add a UUID to the row key. It
helps prevent versioning of cells. However, this method is quite
inefficient, as it stores more data and also makes queries slower.

Issue the following command in the Big SQL shell. This statement will
create the SQL table with the one-to-one mapping of what we have in our
relational DB2 source. This DDL statement is almost identical to what you
saw in the previous section (One-to-one
mapping), with one exception: the force key unique clause
is specified for the column mapping of the row key.
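
A sketch, abbreviated as before; the only change is the
force key unique clause on the row key mapping.

CREATE HBASE TABLE gosalesdw.sls_sales_fact_unique
(
  order_day_key int,
  organization_key int,
  quantity int,
  unit_sale_price double
)
COLUMN MAPPING
(
  key MAPPED BY (order_day_key) FORCE KEY UNIQUE,
  cf_data:cq_organization_key MAPPED BY (organization_key),
  cf_data:cq_quantity MAPPED BY (quantity),
  cf_data:cq_unit_sale_price MAPPED BY (unit_sale_price)
);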

In One-to-one mapping you used the
load from source command to get the data from the DB2 source
table into HBase. This may not always be feasible, so we'll explore the
load hbase statement, which loads data into HBase from flat
files, such as an export of the data from the relational source.

Issue the following statement to load data from a file into an InfoSphere
BigInsights HBase table.
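
A sketch of the statement, assuming a tab-delimited export of the DB2
table has already been copied to the cluster (the path and delimiter are
placeholders):

LOAD HBASE DATA INPATH '/tmp/sls_sales_fact.txt'
  DELIMITED FIELDS TERMINATED BY '\t'
  INTO TABLE gosalesdw.sls_sales_fact_unique;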

Note that the load hbase command can take an optional list
of columns. If no column list is specified, it uses the column
ordering in the table definition. The input file can be on the DFS or on
the local file system where the Big SQL server is running.

Once again, you should expect to load 44,603 rows (the same number of rows
that the select count statement on the original DB2 table
verified).

44603 rows affected (total: 26.95s)

Verify the number of rows loaded with a select count
statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;

This time, there is no discrepancy between the results from the load
operation and the select count statement.

In One-to-one mapping, only one row was returned
for the specified date. This time, expect to see 1,405 rows since the rows
are now forced to be unique due to our clause in the create statement and,
therefore, no versioning should be applied.
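
Using the same predicate as before, a sketch:

SELECT * FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = 20070720;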

1405 rows in results(first row: 0.47s; total: 0.58s)

Once again, you can check from the HBase shell if there are multiple
versions of the cells. Issue the following get statement to
try to retrieve the top four versions of the row with row key
20070720.
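
For example (again with an illustrative column qualifier):

get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN => 'cf_data:cq_quantity', VERSIONS => 4}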

Zero rows are returned because the row key 20070720 doesn't exist; the
load appended a UUID to each row key (20070720 + UUID).

COLUMN CELL
0 row(s) in 0.0850 seconds

Therefore, you should instead issue the following HBase command to do a
scan rather than a get. It will scan the table using the
first part of the row key, and we specify start and
stop row values so that only the results we're interested
in are returned.
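
A scan along these lines does the job; 20070721 acts as the exclusive
stop row:

scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720', STOPROW => '20070721'}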

Notice there are no discrepancies between the results from Big SQL
select and HBase scan.

1405 row(s) in 12.1350 seconds

Many-to-one mapping (composite keys and dense columns)

This section discusses the other option of trying to enforce uniqueness of
the cells, which involves defining a table with a composite row key
— also known as many-to-one mapping.

In a many-to-one mapping, multiple SQL columns are mapped to a single HBase
entity (row key or a column). There are two terms that may be used
frequently: composite key and dense column. A composite
key is an HBase row key mapped to multiple SQL columns. A dense column is
an HBase column mapped to multiple SQL columns.

In Figure 3, the row key contains two parts: userid and account number.
Each part corresponds to an SQL column. Similarly, the HBase columns are
mapped to multiple SQL columns. Note that these can be mixed: for example,
a table can have a composite key, a dense column, and a non-dense column.

Figure 3. Many-to-one mapping

Issue the following DDL statement from the Big SQL shell. It represents all
entities from our relational table using a many-to-one mapping. Notice the
column mapping section, where multiple SQL columns are mapped to a single
family:qualifier.
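
A sketch, abbreviated; the grouping of the SQL columns into the composite
key and the two dense columns is an assumption that matches the qualifiers
(cq_QUANTITY and cq_DOLLAR_VALUE) shown later in this article.

CREATE HBASE TABLE gosalesdw.sls_sales_fact_dense
(
  order_day_key int,
  organization_key int,
  employee_key int,
  retailer_key int,
  retailer_site_key int,
  product_key int,
  promotion_key int,
  order_method_key int,
  sales_order_key int,
  quantity int,
  unit_cost double,
  unit_price double,
  unit_sale_price double
)
COLUMN MAPPING
(
  key MAPPED BY (order_day_key, organization_key, employee_key, retailer_key,
    retailer_site_key, product_key, promotion_key, order_method_key),
  cf_data:cq_QUANTITY MAPPED BY (sales_order_key, quantity),
  cf_data:cq_DOLLAR_VALUE MAPPED BY (unit_cost, unit_price, unit_sale_price)
);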

Why do we need many-to-one mapping?

HBase stores a lot of information for each value. For each value stored, a
key consisting of the row key, column family name, column qualifier, and
timestamp is also stored, so a lot of duplicate information is kept.

HBase is verbose and primarily intended for sparse data. In most cases,
data in the relational world is not sparse. If you were to store each SQL
column individually in HBase, as previously done in this article, the
required storage space would grow dramatically. When querying that data
back, the query also returns the entire key (row key, column family, and
column qualifier) for each value. For illustration, after loading data
into this table, we'll examine the storage space for each of the three
tables created thus far.

Issue the following statement, which will load data from a file into the
InfoSphere BigInsights HBase table.
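
As before, a sketch with placeholder path and delimiter:

LOAD HBASE DATA INPATH '/tmp/sls_sales_fact.txt'
  DELIMITED FIELDS TERMINATED BY '\t'
  INTO TABLE gosalesdw.sls_sales_fact_dense;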

The number of rows loaded into a table with many-to-one mapping remains the
same even though we're storing less data. The statement also executes much
faster than the previous load for this exact reason.

44603 rows affected (total: 3.42s)

Issue the same statements and commands from the Big SQL and HBase shells,
as in the previous two sections, to verify that the number of rows is the
same as in the original dataset. All the results should be the same as
before.

As mentioned, one-to-one mapping uses far more storage space than the same
data mapped using composite keys or dense columns, where the HBase row key
or HBase column(s) are made up of multiple relational table columns. HBase
repeats the row key, column family name, column name, and timestamp for
each column value. For relational data, which is usually dense, this
causes an explosion in the required storage space.

Issue the following command as the InfoSphere BigInsights administrator
from a Linux terminal to check the directory sizes for the three tables
you created.
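
HBase keeps each table in its own directory under its root directory on
the DFS (/hbase by default), so a command along the following lines,
assuming default paths, shows the relative sizes:

hadoop fs -du /hbase | grep sls_sales_fact

Expect the directory for gosalesdw.sls_sales_fact_dense to be noticeably
smaller than those for the two one-to-one tables.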

Data collation

All data represented thus far has been stored as strings, which is the
default encoding on HBase tables created by Big SQL. Therefore, numeric
data is not collated correctly. HBase uses lexicographic ordering, so you
might have cases where a query returns wrong results.

The following scenario walks through a situation where data is not collated
correctly.

Using the Big SQL insert into hbase statement, add the
following row to the sls_sales_fact_dense table (previously defined, with
data loaded). The value specified for the ORDER_DAY_KEY column, which has
data type int, is a larger numerical value that does not
conform to any date standard because it contains an extra digit.
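
A sketch of such an insert; the column list matches the abbreviated dense
table sketched earlier, and all values other than the 200707201 key part
are illustrative.

INSERT INTO gosalesdw.sls_sales_fact_dense
  (order_day_key, organization_key, employee_key, retailer_key,
   retailer_site_key, product_key, promotion_key, order_method_key,
   sales_order_key, quantity, unit_cost, unit_price, unit_sale_price)
  VALUES (200707201, 11171, 4428, 7109, 5588, 30263, 5501, 605,
          100001, 10, 1.00, 2.00, 2.00);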

Insert another row into the table with the following command. This time,
we conform to the YYYYMMDD date format and increment the day by one from
the last value returned in the table, giving 20070721.
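
Again as a sketch, changing only the order day key:

INSERT INTO gosalesdw.sls_sales_fact_dense
  (order_day_key, organization_key, employee_key, retailer_key,
   retailer_site_key, product_key, promotion_key, order_method_key,
   sales_order_key, quantity, unit_cost, unit_price, unit_sale_price)
  VALUES (20070721, 11171, 4428, 7109, 5588, 30263, 5501, 605,
          100002, 10, 1.00, 2.00, 2.00);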

The newly added row is included as part of the result set, and the row with
ORDER_DAY_KEY of 200707201 appears after the row with
ORDER_DAY_KEY of 20070721. This is an example of numeric data
that is not collated properly: the rows are stored not in numerical order,
as you might expect, but in byte-lexicographic order.

Many-to-one mapping with binary encoding

Big SQL supports two types of data encodings: string and binary. Each HBase
entity can also have its own encoding. For example, a row key can be
encoded as a string, and one HBase column can be encoded as binary and
another as string.

String is the default encoding used in Big SQL HBase tables. The value is
converted to string and stored as UTF-8 bytes. When multiple parts are
packed into one HBase entity, separators are used to delimit data. The
default separator is the null byte. As it is the lowest byte, it maintains
data collation and allows range queries and partial row scans to work
correctly.

Binary encoding in Big SQL is sortable, so numeric data, including negative
numbers, collates properly. It handles separators internally and, by
escaping them, avoids problems with separators appearing within the data.
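
The binary-encoded table used in the remainder of this section,
sls_sales_fact_dense_binary, is declared like the dense table but with the
encoding specified on each mapped entity. A sketch, assuming the same
abbreviated column set and that the encoding is chosen per entity with an
encoding clause:

CREATE HBASE TABLE gosalesdw.sls_sales_fact_dense_binary
(
  order_day_key int,
  organization_key int,
  employee_key int,
  retailer_key int,
  retailer_site_key int,
  product_key int,
  promotion_key int,
  order_method_key int,
  sales_order_key int,
  quantity int,
  unit_cost double,
  unit_price double,
  unit_sale_price double
)
COLUMN MAPPING
(
  key MAPPED BY (order_day_key, organization_key, employee_key, retailer_key,
    retailer_site_key, product_key, promotion_key, order_method_key)
    ENCODING BINARY,
  cf_data:cq_QUANTITY MAPPED BY (sales_order_key, quantity) ENCODING BINARY,
  cf_data:cq_DOLLAR_VALUE MAPPED BY (unit_cost, unit_price, unit_sale_price)
    ENCODING BINARY
);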

Once again, use the load hbase data command to load the data
into the table. This time, we're adding the DISABLE WAL
clause. The option to disable the write-ahead log (WAL) can speed up
writes into HBase. However, this is not a safe option. Turning off WAL can
result in data loss if a region server crashes. Another option to speed up
load is to increase the write buffer size.
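
A sketch, with the DISABLE WAL clause appended to the same
load used earlier:

LOAD HBASE DATA INPATH '/tmp/sls_sales_fact.txt'
  DELIMITED FIELDS TERMINATED BY '\t'
  INTO TABLE gosalesdw.sls_sales_fact_dense_binary
  DISABLE WAL;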

Issue a select statement on the newly created and loaded table
with binary encoding, sls_sales_fact_dense_binary.

SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go -m discard;

Note that the go -m discard option is used so the results of
the command will not be displayed in the terminal.

44603 rows in results(first row: 0.35s; total: 2.89s)

Issue another select statement on the previous table that has
string encoding, sls_sales_fact_dense.

SELECT * FROM gosalesdw.sls_sales_fact_dense
go -m discard;

44605 rows in results(first row: 0.31s; total: 3.1s)

A key point here is that the query against the binary-encoded table
returns faster. (Numeric types are also collated properly.)

You will probably not see much, if any, performance difference when
working with small datasets.

There is no custom serialization/deserialization logic required for string
encoding, making it portable if you want to use another application to
read data in HBase tables. A primary use case for string encoding is when
someone wants to map existing data. Delimited data is a common form of
storing data, and it can be easily mapped using Big SQL string encoding.
However, parsing strings is expensive and queries with data encoded as
strings are slow. And numeric data is not collated correctly, as shown in
the example.

Queries on data encoded as binary have faster response times. Numeric data,
including negative numbers, are also collated correctly with binary
encoding. The downside is that you get data encoded by Big SQL logic and
it might not be portable as-is.

HBase automatically splits regions when they reach a set size limit.
In some scenarios, like bulk loading, it is more efficient to pre-create
regions so the load operation can take place in parallel. In our example,
the sales data spans four months, April through July 2007. You
can pre-create regions by specifying splits in the
create table command.

In this section, we create a table within the HBase shell with pre-defined
splits, without using any Big SQL features at first. Then we'll show how
users can map existing data in HBase to Big SQL, a common practice.
Creating external tables makes this possible.

Start by issuing the following statement in the HBase shell. The
sls_sales_fact_dense_split table will be created with
pre-defined region splits for April through July in 2007.
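
A sketch; the column family name and the exact split points are
assumptions, with one boundary at the start of each month so that each
month's data lands in its own region.

create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data'},
  {SPLITS => ['20070501', '20070601', '20070701']}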

Figure 4. Splits

Figure 5. Pre-created regions

Execute the create external hbase command shown after this
list to map the existing table you just created in HBase to Big SQL. With
the create external hbase command:

The create table statement lets you specify a different
name for SQL tables through the hbase table name clause.
Using external tables, you can also create multiple views of the same
HBase table. For example, one table can map to a few columns and
another table to another set of columns, etc.

The column mapping section of the create
table statement
allows you to specify a different separator for each column and row
key.

The data in external tables is not validated at creation time. For
example, if the separators for a column in the external table are
incorrectly defined, the query results will be unpredictable.
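
A sketch of the external mapping, abbreviated; the separators match the
ones referenced below (a hyphen in the row key, | for cq_DOLLAR_VALUE, and
/ for cq_QUANTITY), and the column groupings remain assumptions.

CREATE EXTERNAL HBASE TABLE gosalesdw.sls_sales_fact_dense_split
(
  order_day_key int,
  organization_key int,
  employee_key int,
  retailer_key int,
  retailer_site_key int,
  product_key int,
  promotion_key int,
  order_method_key int,
  sales_order_key int,
  quantity int,
  unit_cost double,
  unit_price double,
  unit_sale_price double
)
COLUMN MAPPING
(
  key MAPPED BY (order_day_key, organization_key, employee_key, retailer_key,
    retailer_site_key, product_key, promotion_key, order_method_key)
    SEPARATOR '-',
  cf_data:cq_QUANTITY MAPPED BY (sales_order_key, quantity) SEPARATOR '/',
  cf_data:cq_DOLLAR_VALUE MAPPED BY (unit_cost, unit_price, unit_sale_price)
    SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

After loading the data, a select count(*) on this table from
the Big SQL shell should report 44,603 rows.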

Verify the same from the HBase shell directly on the underlying HBase
table.

count 'gosalesdw.sls_sales_fact_dense_split'

...
44603 row(s) in 9.1620 seconds

Issue a get command from the HBase shell specifying the row
key as follows. Notice the separator between each part of the row key is a
hyphen (-), as we defined when originally creating the
external table.

get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'

In the following output, you can also see the other separators we defined
for the external table: | for the
cq_DOLLAR_VALUE and / for
cq_QUANTITY.

Handling errors with load data

How do you handle errors during the load operation? The
load hbase command has an option to continue past errors. You
can use the LOG ERROR ROWS IN FILE clause to specify a file
name to log any rows that could not be loaded because of errors. A few
common errors are invalid numeric types and a separator existing within
the data for string encoding.
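
For example, a sketch (the error-file path is a placeholder):

LOAD HBASE DATA INPATH '/tmp/sls_sales_fact.txt'
  DELIMITED FIELDS TERMINATED BY '\t'
  INTO TABLE gosalesdw.sls_sales_fact_dense
  LOG ERROR ROWS IN FILE '/tmp/sls_sales_fact_load_errors.txt';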


Downloadable resources

Presentation on Big SQL over HBase (IBD-1687A.pdf | 3MB): This article is derived from a presentation at Information On Demand
Session 1687 — Adding Value to HBase with IBM InfoSphere
BigInsights and Big SQL.