Integrate PureData System for Analytics and InfoSphere BigInsights for
email analysis

PureData™ System for Analytics is IBM's core warehousing
platform. This article explains how to integrate it with InfoSphere®
BigInsights™, IBM's enterprise-ready Hadoop distribution. We'll discuss
the integration concepts, as well as a primary use case for this integration:
combining the advanced text analytics capabilities of BigInsights with your
warehouse. We'll illustrate how to create a small scenario that enriches
employee data with information extracted from email. Then we will demonstrate
how to extract this information in BigInsights and how to upload the extracted
results to PureData System for Analytics. We will also show how to access data stored
in BigInsights from the warehouse to facilitate drill-through
capabilities.

Benjamin Leonhardi is the team lead for the Big Data/Warehousing partner
enablement team. Before that, he was a software developer for InfoSphere
Warehouse at the IBM Research and Development Lab in Boeblingen, Germany,
where he worked on the data mining, text mining, and mining reporting
solutions.

Piotr Pruski is a partner enablement engineer within the Information Management Business Partner Ecosystem team at IBM. His main focus is to help accelerate sales and partner success by reaching out to and engaging business partners, enabling them to work with products within the IM portfolio -- namely, IBM InfoSphere BigInsights. In previous roles, he has worked with DB2, IBM Smart Analytics System 5710, and briefly with the IBM PureData System for Transactions.

Andre Albuquerque is the Big Data technical lead for the IBM Information Management PureData Ecosystems team. He joined IBM Canada Labs in 2009 and works with various IBM products, such as DB2 and the IBM big data platform. As a technical lead, he is the content owner of the Big Data Fundamentals Bootcamp.

Introduction

Enterprise integration is a unique value from IBM. Some of the use cases
for big data technologies that customers have can be made possible only
through integration. This requires combining existing information in the
enterprise with new streams of information from outside the enterprise.
The desired types of analytics cannot be realized unless structured,
semi-structured, and unstructured data are analyzed together. We will see
here that traditional and non-traditional data sources and technologies
increase in value when they are brought together.

Before continuing, please note that we cover quite an extensive amount of
ground throughout this article and have omitted some information that we
believe is basic or fundamental. Therefore, we assume at least a
rudimentary understanding of BigInsights, AQL, JAQL, PureData System for
Analytics, and Cognos®. More information on all of these technologies can be found in the Resources section.

PureData System for Analytics (formerly known
as Netezza)

The PureData System for Analytics, powered by Netezza technology, is a
simple data appliance for serious analytics. It simplifies and optimizes
performance of data services for analytic applications, enabling complex
algorithms to run in minutes instead of hours. You may know the PureData
System for Analytics as Netezza Appliances. The product has been renamed
and we may use the two names interchangeably.

InfoSphere BigInsights

InfoSphere BigInsights, IBM's enterprise-ready Hadoop distribution,
complements the warehouse in two ways. It:

Enables companies to rapidly explore large volumes of raw data to
gain insight into operational challenges that were previously too
large or complex to tackle.

Processes a variety of data, including unstructured and
semi-structured formats often ignored by traditional technologies.
Such data may be "messy" and contain varying structures with
potentially complex interrelationships.

An effective big data platform and approach requires integration with the
rest of your IT infrastructure. The last thing you need in your
architecture is another technology or data silo. Big data technologies
should work in tandem with and extend the value of your existing data
warehouse and analytics technologies.

The following points form the basis for the remainder of this article and
explain why we want to achieve this integration.

PureData System for Analytics can process terabytes of structured data, but
has limited unstructured capabilities. InfoSphere BigInsights has
limitations when processing structured information, but it shines at
unstructured analytics. By combining the strengths of the two systems,
we can:

Analyze large amounts of unstructured data in BigInsights and write
the structured results into the PureData System for Analytics or read
them on demand with the Netezza UDFs.

Use standard reporting solutions to access the PureData System and
enrich structured (OLAP, for example) reports with unstructured
information.

Use case description

The following figure showcases the use case we will base our integration
scenario on.

Figure 1. BigInsights PureData System for Analytics
Integration use case

The demo shows the integration of the PureData System for Analytics and
BigInsights and uses Cognos as a front end. Employee records are stored in
PureData System for Analytics and email is stored and analyzed in
BigInsights. An AQL annotator scores each email as work or private. AQL
(Annotation Query Language) is a programming language similar to SQL,
used to extract structured information from text. The scoring
results are then uploaded from BigInsights into PureData System for
Analytics through a JAQL Netezza module. The results are aggregated and
shown in a Cognos report (employees and their number of private and work
email). The report also allows a drill-through to the underlying email
texts, which are then retrieved from BigInsights. This uses the Netezza
BigInsights connector functionality to read from HDFS.

Document analysis

Here we describe how to create a simple AQL annotator and use it to
extract structured information from a set of documents. We will first
describe how to create a simple AQL annotator that can distinguish work
email from private email. The next step is to import the email into
BigInsights. Then we use a JAQL script to annotate the documents and
upload the results into PureData System for Analytics.

That encapsulates the BigInsights part of our integration, but we'll still
need to create a drill-through capability from the Cognos report to the
underlying email data. To do this, we need to be able to access the HDFS
file system from the PureData System for Analytics data warehouse. This is
achieved through the Netezza UDF extension shipped with InfoSphere
BigInsights. We will describe the installation process and show one
possibility to implement the drill-through function.

We cover a lot of ground, so the implementation will be greatly simplified.
We plan to provide more details and alternatives to some of the described
steps in future articles.

AQL annotator development

We will briefly describe how to develop a dictionary-based AQL annotator
that can distinguish between work and private email. This annotator can be
used to, for example, identify non-work related emails that do not need to
be archived, thereby reducing backup costs. Since we are limited by the
scope of the article, we will only use a small portion of the capabilities
of AQL; the intention is to give you an idea of what is possible. Your
development environment will look like the following.

Figure 2.
InfoSphere BigInsights Development Studio

Our annotator will have two dictionaries containing words indicative of
private and work email. In a second step, we will use AQL to count all
occurrences of both categories of words, and finally compute a single
number, or a total score, for each email. An email with more private
words than work words will yield a negative score, while an email with
more work words will yield a positive score. We will also demonstrate the
creation of an AQL Java™ UDF, since AQL does not have its own subtraction
function. Nonetheless, it is trivial to implement this kind of
functionality in Java technology.

Of course each email set might have different requirements, so we used a
training set of email messages that have been divided manually into work
and private email; in this limited example, there are about 10 for each.
From these, we identified the word lists and created the dictionaries. We
assume that you have a fundamental knowledge of AQL and the development
environment. You can find more details and examples in the Resources section.

AQL is a modular language, so we created a project with two modules: Common
and Email. Common is used to define the UDFs we might want to
reuse in other projects. We then use it in the main Email module
to extract the email score.

Developing the subtraction UDF

To compute the total score, we need to implement a Java UDF that can
subtract two integer numbers. This can be implemented in the same project
where you implement your AQL since a BigInsights Eclipse project is
derived from a Java Eclipse Project. Following is the code of our UDF:
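A minimal sketch of such a subtraction UDF; the class and method names (ScoreUDFs, diff) are illustrative and must match the create function declaration in your AQL module:

```java
// Sketch of a Java UDF for AQL that subtracts two integers.
// The class and method names are illustrative; they must match
// the 'create function' declaration in the AQL module.
public class ScoreUDFs {
    // Returns workCount minus privateCount, or null if either input is null.
    public Integer diff(Integer workCount, Integer privateCount) {
        if (workCount == null || privateCount == null) {
            return null;
        }
        return workCount - privateCount;
    }
}
```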

Next, we created an AQL module called Common that implements this function
and exports it for use in other modules. Below is the source code of this
module. We have also compiled the Java code above into a JAR and put it
into the same module folder. This folder is automatically in the classpath
of the AQL code.
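A sketch of what this module could look like; the JAR, class, and method names are illustrative and must match your compiled Java UDF:

```
module Common;

-- Declare the Java UDF. The JAR, class, and method names are
-- illustrative and must match the compiled code in the module folder.
create function diff(work Integer, private Integer)
    return Integer
    external_name 'common.jar:ScoreUDFs!diff'
    language java
    deterministic
    return null on null input;

-- Make the function available to other modules.
export function diff;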

Since we exported the function, we can now import and use this function in
other AQL modules. It is best practice to put all your UDFs into a
separate module.

In the second step, we will utilize two dictionaries containing words that
indicate private or work email. Note that we are not aiming for
100-percent precision. There will be some situations where work words
are used in private mail and vice versa, but we compute a total number in
the end for each document. AQL also provides many options to filter
results and enhance the precision of the annotator. However, for this
small example, we will utilize a very basic approach. Here are sample
entries in the dictionaries.
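The dictionary files are plain text with one entry per line. For example, private.dict might contain entries like these (all entries shown here are illustrative):

```
party
birthday
vacation
soccer
```

And work.dict might contain:

```
meeting
project
deadline
customer
```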

Now we need to create the main module called Emails. In the dictionary
folder of this module we have added the two dictionary files shown above.
Now we can code the annotator in the main AQL file. First we need to add a
module identifier and import the Common module we have created before.

module Emails;
import module Common;

Now we need to import the private dictionary and create an AQL view of the
documents, extracting all occurrences of the dictionary in the document.
For more information about AQL, see Resources.
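A sketch of the dictionary import and the extraction view; the dictionary folder name is an assumption:

```
-- Load the dictionary file shipped inside the module
-- (the folder name 'dicts' is illustrative).
create dictionary PrivateDict
    from file 'dicts/private.dict'
    with language as 'en';

-- Extract every occurrence of a private word from the document text.
create view PrivateWords as
    extract dictionary 'PrivateDict'
        on D.text as match
    from Document D;
```

The WorkWords view is created analogously from the work dictionary.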

We are not interested in the actual instances but want to have a single
number, so we can count the number of findings for each word found. AQL is
very similar to SQL, so we can simply compute a count on the two views:
WorkWords and
PrivateWords.
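The two counts could be computed along these lines (view and column names are illustrative where the article does not name them):

```
-- Count the matches in each category; each view yields one row
-- with a single number.
create view WorkCount as
    select Count(W.match) as num
    from WorkWords W;

create view PrivateCount as
    select Count(P.match) as num
    from PrivateWords P;
```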

Now we have the number of findings of private and work words for each
email. In the next step, we want to create a single score for the email,
so we will subtract the number of private findings from the number of work
findings. If there are predominantly work findings, the score will be
positive. If there are predominantly private findings, the total score
will be negative. To compute the difference, we will use our AQL UDF
defined before. Finally, we output the view TotalScore, so we can use it
externally in Java and JAQL programs. This is different from the export
statement, which is used for module dependencies.
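A sketch of the final view, using the UDF from the Common module (the input view and column names are illustrative):

```
-- Compute the final score as work count minus private count,
-- using the subtraction UDF imported from the Common module.
create view TotalScore as
    select Common.diff(W.num, P.num) as score
    from WorkCount W, PrivateCount P;

-- Make the view available externally to Java and JAQL callers.
output view TotalScore;
```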

We have now created our simple annotator and we can give it a test in the
development environment using the extraction plan view. As we can see in
the results, we have a working annotator that identifies all 10 sample
work email messages as work and all private email as private.

Figure 3. Result of
AQL execution

Finally, we need to export the generated AQL annotator as TAM files. Using
Export > BigInsights >
Export Text Analytics Extractor creates two TAM
files, one for each module. To utilize them with JAQL, we need to copy these
files to BigInsights. For the purpose of this article, we will assume they
are in the folder /home/biadmin/nz-demo/Annotator. In the following
section, we will use JAQL to execute this annotator on a set of email
messages stored in a BigInsights cluster.

Importing data into BigInsights

Now that we have created the AQL annotator, we need to import the email
data into BigInsights. There are many ways to achieve this. Essentially,
we need to make sure the data is stored in large files and is splittable,
so it can take advantage of the parallel Hadoop architecture. We also need
to keep the ability to associate an email ID with each email text.
Splittable means that the file loaders can split the
file into multiple blocks. Files that contain newlines are automatically
splittable. Multiple files in a compressed folder are also splittable as
long as a splittable storage format, such as LZO, is used.

In our example, we will go with comma-separated delimited files that have
two columns: ID and TEXT. We have replaced all newline characters in the
email messages with spaces since this does not influence our annotators.
You may need to escape these characters differently if newlines are
important for your analysis.

In many situations, you will have a set of files, each containing a text
document like an email. In our example, we used a simple Perl script to
transform these into a delimited file. Alternatively, JAQL can handle sets
of files as well, but this would be beyond the scope of this article. Our
data file looks like the following: a delimited file with one line per
email and two columns: LABEL and TEXT. LABEL is a unique email ID
used to associate the email with an employee and a text field containing
the email content.

We have decided to keep the file in HDFS as an uncompressed delimited file.
For better performance and a smaller storage footprint, we can compress it
during load. For best performance, we could also store the file as a
sequence file. Sequence files store data in JAQL's internal binary JSON
representation and therefore increase loading speed. However, sequence
files are not human-readable, so we did not use them for this demo.
Notice the syntax for specifying
the file schema and the file format. For a complete documentation of these
features refer to the information center (see Resources).
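As a sketch, the load step could look like the following; the paths, delimiter, and schema are illustrative:

```
// Copy the local delimited email file into HDFS with an explicit schema.
// Paths and schema are illustrative.
localRead(del("file:///home/biadmin/nz-demo/emails.csv",
    { schema: schema { label: string, text: string },
      delimiter: "," }))
  -> write(del("/user/biadmin/nz-demo/emails.del",
    { schema: schema { label: string, text: string },
      delimiter: "," }));
```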

Analyzing documents with JAQL

Now we need to annotate the results. To do this, we need to import the
systemT module. This JAQL module contains the functions for annotating
text documents: import systemT;.

Then we need to create a variable that specifies how to read the file in
HDFS. Since we do not materialize the command with
:=, it will only be executed in the actual
annotation step:
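A sketch of this read definition, with an illustrative path and schema:

```
// '=' assigns lazily (unlike ':=', which materializes); the file is
// only read when a later step consumes tblData.
tblData = read(del("/user/biadmin/nz-demo/emails.del",
    { schema: schema { label: string, text: string },
      delimiter: "," }));
```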

The transform operator in JAQL allows us to compute new fields in the JSON
stream. We specify two fields: One is the label (our email ID), and the
other field is computed by our AQL annotator. To do this, we use the
systemT::annotateDocument function. This
computes the email score we described in the AQL section and adds it to
the label. We call the function with the following parameters:

$: The first parameter is the document we
want to annotate. In JAQL, $ denotes the
current document. annotateDocument by default expects
two fields: the label and the text.

["Common", "Emails"]: The second parameter
specifies the TAM files we want to utilize. You specify them as an
array of strings, where the strings are the names of the TAM
files.

["file:///home/biadmin/nz-demo/Annotator"]:
The directory or directories containing the TAM files. This can also
be a compressed folder.

tokenizer="multilingual": This parameter
specifies that we want to use the LanguageWare tokenizer. This is a
powerful tokenizer with syntax analysis and multi-language support.
For simple tasks, the whitespace tokenizer may be sufficient, but when
in doubt use LanguageWare.

outputViews=["Emails.TotalScore"]: The last
parameter specifies which output views in the annotators we want to
add to the JSON records.
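Putting these parameters together, the annotation step could look like the following sketch (the variable and field names are illustrative):

```
// Annotate each record: keep the label and attach the AQL results.
annotated = tblData
  -> transform { label: $.label,
       annots: systemT::annotateDocument($, ["Common", "Emails"],
           ["file:///home/biadmin/nz-demo/Annotator"],
           tokenizer = "multilingual",
           outputViews = ["Emails.TotalScore"]) };
```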

Now we have extracted the email score and added it to the JSON records. The
results will look as follows:

As you can see, the results are added as sub-arrays to the stream. This is
necessary because an AQL output view normally has more than one result.
There could be multiple persons, addresses, etc. in the text file. In our
example, we aggregate a total number so we will always only have a single
end result. We also want to be able to load the results in a Netezza
database table. To achieve this, we need to flatten the array (i.e., we
want to have a list of email IDs and their respective scores in a flat
table). This can be achieved in JAQL with the
expand unroll command. It will move the
sub-array up one level and create one parent result row for every row of
the sub-array.

We have done two things with this piece of code. First, we unrolled the
sub-arrays and merged them in the parent array. We then used transform to
change the naming of the fields, so they correspond to the naming in the
Netezza table. Finally, we need to write the results into HDFS, so we can
later take them and move them to the Netezza database. To do this, we use
the following command.
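These steps could be sketched as follows; the field paths depend on the exact shape of the annotated records and are illustrative:

```
// Flatten the per-document result array and rename the fields to
// match the Netezza table columns.
scores = annotated
  -> expand unroll $.annots
  -> transform { EMAIL_ID: $.label, WORK_PRIVATE: $.annots.score };

// Write the flat records to HDFS as a '|'-delimited file.
scores -> write(del("/user/biadmin/nz-demo/scores.del",
    { schema: schema { EMAIL_ID: string, WORK_PRIVATE: long },
      delimiter: "|" }));
```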

We have now written a flat delimited file with two columns, email ID and
computed score, to HDFS. The file looks like the
following:

1|-3
9|-4
10|10
11|19
12|16
13|9
14|16
...

We can now run the complete JAQL script with
jaqlshell -b analyzeemails.jaql. Alternatively,
we can package it as an application and schedule it or invoke it over a
REST API.

Upload of results to PureData System for
Analytics

Here, we describe how to upload the results from the previous section to a
Netezza table. We will utilize the JAQL Netezza connector for this.
Alternatively, you can pull results through the Netezza UDFs, which
we will describe in the next section; in our demonstration, however, the
analysis is a push. We have separated the analysis and upload steps for
operational reasons, but you could of course do them in a single step as
well.

We need to import the Netezza Connector into our JAQL script:
import dbms::netezza;.

You can of course use the standard DBMS module, but this will not result in
the best performance. The Netezza module utilizes external tables for
loads and unloads and utilizes parallel loads for optimal performance. You
also need to copy the Netezza JDBC driver JAR to the JAQL folder
$JAQL_HOME/modules/dbms/netezza or add the
location to the classpath.

Next, we need to set up a database connection. This requires the
netezza::connect function with JDBC connection
string and user account properties. We utilize the database
BIGIDEMO for our sample database.
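A sketch of the connection setup; the host, port, credentials, and the exact option names passed to connect are placeholders for your environment:

```
// Placeholder connection values; adjust for your environment.
db = netezza::connect({
    url: "jdbc:netezza://nzhost:5480/BIGIDEMO",
    properties: { user: "admin", password: "password" }
});
```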

Now we need to prepare the insert statement.
This works similarly to JDBC. We specify the database connection and the
table name for the insert:
dataIn = netezza::prepareInsert( db,
'EMAIL_SCORE' );.

Due to the unique architecture of PureData System for Analytics, there is
no need to specify commit counts or any other parameters. JAQL will
automatically load data in four parallel loads, splitting the data on the
data slice ID. You can specify splits, which define the number of parts
the data is divided into. This is useful if the loaded data is used for
further computations, which would be single-threaded if no splits were
defined. For our upload, we do not need them.

Finally, we can execute the upload by streaming the email file into the
insert statement:
tblData -> write( dataIn
);.

We have finished the upload of our data into PureData System for Analytics.
In the next sections, we need to implement a drill-through function. To do
this, we need to truncate the text field so PureData System for Analytics
can load it. PureData System for Analytics has an upper limit of 64,000
bytes for normal character fields and 16,000 bytes for Unicode NCHAR
fields. To truncate the text field, we use the substring
function.
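For illustration, the truncation step could look like this; the argument order of substring and the field names are assumptions:

```
// Truncate each email text so it fits into a Netezza character column.
emailText = tblData
  -> transform { EMAIL_ID: $.label,
                 EMAIL_TEXT: substring($.text, 0, 16000) };
```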

We have uploaded the email score into a Netezza table and prepared the
email file for drill-through. We can now run the complete JAQL script with
jaqlshell -b uploadscoring.jaql.

Netezza UDFs installation

Here, we will describe how to install the Netezza UDFs. These user-defined
functions allow you to access a BigInsights system from PureData System
for Analytics. Three functions are provided:

HDFS_READ: This function allows you to read
files from the remote HDFS file system.

HDFS_WRITE: This function allows you to
write files to HDFS.

JAQL_SUBMIT: This function executes a JAQL
string on BigInsights.

You can find the installer for the Netezza UDFs on the welcome page of the
BigInsights web console. Download them and copy them over to the PureData
System for Analytics host. Then install the Netezza extensions with the
install.sh script in the scripts folder of the
installation package.

This unpacks the UDFs and puts them in the Netezza extension folder. We
still need to deploy them to a database. In our example, we have created a
database called BIGIDEMO, which contains our sample tables. To deploy the
UDFs into that database and specify connection parameters for the
connection, you can utilize the following command. It will deploy the UDFs
and store the connection information for the BigInsights server. You could
connect to two different BigInsights servers by installing them in
separate databases.

The Netezza BigInsights UDFs are now deployed and ready to use. Note the
JAQL_SUBMIT function uses the JAQL ad-hoc
application in BigInsights, so you need to have this application deployed
to use this UDF.

Development of drill-down function

Here, we describe how to develop the drill-down function that will allow
you to select a specific employee in the Cognos report and load the
corresponding email messages from BigInsights. With that, we will have
shown the integration of BigInsights and PureData System for Analytics in
both directions. We will have uploaded information from BigInsights to
PureData System for Analytics in the previous sections and will now
connect from PureData System for Analytics to BigInsights. The function we
use is included below. It utilizes the
HDFS_READ function to read the email text file
from HDFS specifying load parameters similar to a normal external table
load definition. Note that we need to specify the underlying part files,
since JAQL splits written files into multiple part files, one per reduce
task. This is transparent to JAQL, but not to
HDFS_READ. HDFS_READ is a Java table function,
which means it needs to be executed in the FROM
clause of the query, prefixed with the
TABLE WITH FINAL keywords.

This will read all the email text into PureData System for Analytics and
the statement can be utilized like any other table. We also join in the
employee and email IDs so we can easily filter on the employee ID. For
performance reasons, you would normally prefix this with a
JAQL_SUBMIT call that pushes the filtering into
BigInsights. We have omitted this for this article because it would force
us to create a stored procedure that first does the
JAQL_SUBMIT and then reads the filtered
document. We would also need to either add the employee ID to the email
documents or push the requested email IDs through into the JAQL string.
This would go beyond the scope of the article, so take the following code
as an inspiration:
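The following is an inspirational sketch only: the HDFS_READ argument string, the file path, the column references, and the sample employee ID are illustrative, not the documented signature:

```sql
-- Sketch: read the email part files from HDFS and join them to the
-- employee mapping table. All arguments are illustrative.
SELECT e.EMPLOYEE_ID, t.EMAIL_ID, t.EMAIL_TEXT
FROM TABLE WITH FINAL (
         HDFS_READ('/user/biadmin/nz-demo/emailtext.del/part-*',
                   'DELIM=|')
     ) AS t,
     EMPLOYEE_EMAIL e
WHERE e.EMAIL_ID = t.EMAIL_ID
  AND e.EMPLOYEE_ID = 42;
```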

Report creation

We will very quickly describe the demonstration report we created. It will
also contain the database tables of our sample application. We will first
show the data model layer and then the created reports. It is a
straightforward Cognos report, so we go into minimal detail. You can find
more information on how to create Cognos reports in the Resource section.

Data model

Our data model consists of six query subjects: four physical tables, one
derivative for easier report creation, and one dynamic subject based on a
query:

NATION: Every employee record has a nation
ID; this lookup table contains the nation information.

EMPLOYEE: A table of employee records
containing employee ID, nation ID, first name, and last name.

EMPLOYEE_EMAIL: A table mapping employee
IDs to email IDs, so we can match each email to a specific
employee.

EMAIL_SCORE: This is the table used to
store the email score in our JAQL script. It has two columns: email ID
and work_private score.

EMAIL_SCORE_FILTERED: This query subject is
created on the EMAIL_SCORE to make the report creation easier.

EMAIL_TEXT: This query subject allows us to
drill down to the email text for a given employee using the function
described above.

We imported the first four query subjects in Cognos based on the underlying
Netezza tables of the same name. We also need to create relationships
between all query subjects, so Cognos can join them correctly on nation
ID, email ID, or employee ID when columns from multiple query subjects are
used in the same report query.

The query subject EMAIL_SCORE_FILTERED
simplifies the work score for each email ID into two query items
(WORK_MAILS and
PRIVATE_MAILS), each of which can be 0 or 1. It
utilizes CASE statements: if the
WORK_PRIVATE score is positive, the
work item is 1 and the private item is 0; if it is negative, the
work item is 0 and the private item is 1. This allows us to
easily count the number of private and work email messages each employee
has sent.
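Based on that description, the filtered query subject could be defined with a query along these lines (a sketch, not the exact Cognos definition):

```sql
-- Split each score into a 0/1 work flag and a 0/1 private flag.
SELECT EMAIL_ID,
       CASE WHEN WORK_PRIVATE > 0 THEN 1 ELSE 0 END AS WORK_MAILS,
       CASE WHEN WORK_PRIVATE < 0 THEN 1 ELSE 0 END AS PRIVATE_MAILS
FROM EMAIL_SCORE;
```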

The query subject EMAIL_TEXT utilizes the
drill-through function we developed in the previous section. It is created
on a native SQL query. Some minor syntax changes have been made to
accommodate Cognos. When this query subject is referenced in a Cognos
report the underlying query will be executed, fetching the email text into
PureData System for Analytics and then to Cognos. The full data model
looks like the following figure.

Figure 4. Data model

Putting it all together: The Cognos
report

Our Cognos demo consists of two report pages connected via drill-through
functionality. The main report shows columns from the static tables
NATION, EMPLOYEE, and EMAIL_SCORE. It is a table of employees with
personal information (employee ID, first name, last name, nation) and
information from the text annotation (i.e., how many private and work
email messages sent). We also implemented a drill-through functionality on
the employee ID field. If a specific employee ID is selected, the second
report is called with the employee ID as an input parameter to fetch the
email text for this employee.

Figure 5. Main report
page

The second report page contains a simple table of all the email messages of
the specific employee. It displays the email ID and the truncated email
text fetched from BigInsights. We also display the work_private score.

Figure 6. Drill-through report page

Conclusion

We have demonstrated one way to integrate the PureData System for Analytics
and the BigInsights Hadoop environment. We tried to demonstrate how each
can be used for its strengths. PureData System for Analytics is used as a
warehouse and base platform for running Cognos BI reports, and BigInsights
is used to store and analyze text files. In BigInsights, we used the
powerful text-extraction language AQL to extract structured information
from the unstructured text files. We used JAQL to run the text annotation
and upload the results into PureData System for Analytics. Finally, we
used the Netezza UDFs to implement a drill-down capability from PureData
System for Analytics to BigInsights showing both integration directions.
We brought all of this together using the example of email analysis,
storing and analyzing email data in BigInsights and the structured
employee data in PureData System for Analytics.

We covered a lot of ground, so we have not been able to investigate each
part in full detail. In a real production solution, the AQL would be much
more complex. We would also store the data using JAQL in a compressed and
optimized fashion, and the drill-through would be implemented in a way
that is parallelizable. Nevertheless, we hope this article has given you a
good starting point.
