Using R with databases

A speedy combination: DB2 10.5 with dashDB and R

By Grant Hutchison Published February 6, 2014

Introduction

R is an open source programming language that is excellent for data analysis and graphics. The language was initially created by Ross Ihaka and Robert Gentleman at the University of Auckland in 1993. They wanted to design a language that would help them teach introductory statistics to their students. The design of R was influenced by an existing language called S, from Bell Labs, which was created in the 1970s. R is considered a domain specific language because it was designed primarily for data analysis—and it does this very well. You can use R as an interactive environment, or embed scripts and models into packages and integrate them with other software modules.

You can use R to analyze data from many different data sources including external files or databases. This article explains how to connect, query, and update data with relational database servers. You will learn why DB2 v10.5 with dashDB technology is an excellent choice for analytics. The article also explores how dashDB for Cloud simplifies loading and analyzing datasets using R.

Frequently used acronyms

CRAN: Comprehensive R Archive Network

CSV: Comma-separated values

DSN: Data source name

ODBC: Open Database Connectivity

RODBC: Open Database Connectivity for R

SIMD: Single instruction multiple data

SQL: Structured Query Language

External data

R objects are created and managed within a single memory area. In most cases, your data analysis tasks require the data to be available as a data frame. A data frame can be considered a two-dimensional array of heterogeneous data or an in-memory table. If the data already exists in a delimited text file, R users can bring the data into memory for analysis using one of the many read.xxx() functions such as read.csv() for CSV files. Similarly, if an R data frame is to be externalized to a file, you can use the write.xxx() functions.
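As a minimal sketch of this file-based workflow (the file name and contents are illustrative, not from the article):

```r
# Write a small data frame out as a CSV file, then read it back in
df <- data.frame(make = c("Toyota", "Honda"), mpg = c(50, 36))
write.csv(df, "fuel_economy.csv", row.names = FALSE)

cars <- read.csv("fuel_economy.csv", header = TRUE)
str(cars)   # inspect the structure of the resulting data frame
```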

R has the ability to persist data frames to disk as objects outside of its memory using the R functions saveRDS(), save(), or save.image(). These objects can then be read back into memory for later analysis. But, the persistence mechanism has many drawbacks, including limited sizes for data frames. There are also problems associated with using text files for data analysis; it takes significant effort to cleanse the data and share the dataset with others.
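A short sketch of this persistence mechanism (object and file names are illustrative):

```r
# Persist a single R object to disk, then restore it in a later session
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
saveRDS(df, file = "df.rds")
df2 <- readRDS("df.rds")
identical(df, df2)   # TRUE
```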

If your data for analysis is generated through existing operational systems backed by relational databases, why not simply extract the data from the operational system and populate an analytic database for analysis with R?

Why use a relational database with R?

You can use R to explore data and build predictive models. Relational database servers are designed to handle large amounts of data and they will maintain data consistency for concurrent users. Because data is usually stored in a normalized fashion in relational databases, you will likely need to recall some of your SQL skills to join the relevant attributes across multiple tables to perform your exploratory data analysis (EDA) tasks. If you’re working alongside a Database Administrator (DBA) or data analyst with strong relational database skills, you could create some read-only views that would speed up the initial data analysis tasks. If you’re working with data in DB2, you can use the IBM Data Studio tool or the web console within dashDB for Cloud to examine the database schema or define new views to simplify data access from your R scripts.

By default, R will read all of the data into memory before performing any analysis. When queries are processed by relational database servers such as DB2, the system will not simply load all of the data from disk into memory. Database servers will consider the entire query and determine the most efficient method of obtaining the results as requested.

DB2 with dashDB

DB2 version 10.5 with dashDB is optimized for analytics and therefore an excellent choice for working with R data analysis tasks. There are many reasons to use DB2 with dashDB, including:

Simplicity
Load the data and perform data analysis. BLU is optimized for analytics so you don’t need to create indexes or perform any tuning tasks.

Performance
The columnar storage model used by BLU reduces the number of disk reads and minimizes the amount of memory used to support very fast query processing.
Advanced data compression techniques are used by default to reduce memory and disk usage.

Unlike R, DB2 with dashDB will examine all of the hardware capabilities of its server and optimize the use of resources. For example, DB2 will use SIMD CPU instructions across multiple cores whenever possible. It will also use various memory caching mechanisms to reduce the number of times data is read from disk. The optimization occurs automatically. This platform is now also available in the cloud with BLU Acceleration for Cloud.

dashDB for Cloud

dashDB for Cloud, a web-based database server offering, is optimized for simplicity. In a few minutes you can create tables, load data, and start your analysis. Tools are provided (IBM Data Studio and IBM InfoSphere Data Architect) to simplify creating and maintaining database models and objects (such as tables). After the schema has been created, you can use the dashDB for Cloud web console to load your data. There are many options for loading the data, such as local files, cloud storage services (for example, Amazon S3), or IBM InfoSphere DataStage. The web console can be used to perform your analysis work using Excel, SQL, Cognos Business Insight (BI), or R scripts and R models.

Figure 1 shows two different scenarios of how an R user can work with data stored in dashDB for Cloud. The data analyst can use their favourite R environment, such as RStudio, to query and analyze the data from their own computer. Or, the analyst can use a browser to create and execute R scripts on the cloud-based server.

Figure 1. Explore data using R (remote)

The chart in Figure 2 is an example of the R environment integrated into dashDB for Cloud. dashDB for Cloud provides an R runtime alongside the data warehouse within a cloud environment. BLU Acceleration for Cloud provides a web console where you can load data and perform analytics within minutes. The data analysis could include SQL, BI tools, or R scripts and models. An easy-to-use R script editor is provided, but you can also launch RStudio from within your browser. Figure 2 shows an example of the integration, including a plot of an analysis of U.S. Census data. You can create new scripts by clicking the plus (+) sign highlighted in the top left of Figure 2. The R scripts are managed in the cloud within your own workspace. You can even import scripts to work with them in the cloud.

Figure 2. Explore Data using R in the Cloud

Data types and design considerations

Data within R and data stored in relational databases both have a defined type or structure. It is important to understand the various data types and how to map values properly.

R has a very limited set of base data types including: characters, numeric (integers and double-precision floating point values), logical, and raw data (binary). Character vectors in R are simply mapped to SQL CHARACTER or VARCHAR data types. Numeric data types from the database server are usually mapped to INTEGER and DOUBLE data types in R. Because the logical data type in R does not have an obvious SQL data type equivalent, it is usually mapped to a text field in database tables. If you’re working with time-series data you will need to carefully consider how the data will be mapped between R and the database server.

Continuous and categorical data are quite different entities when statistical methods are used. R users are aware of these differences and they will encode the categorical data using the factor datatype. After data has been retrieved from a database, you should consider if the data should be treated as continuous or categorical. If the data is truly categorical, the variable should be recoded or coerced using the factor() function.
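A small example of this coercion (the values are illustrative):

```r
# A numeric code retrieved from a database is continuous by default;
# coerce it to a factor if it is really categorical
cyl <- c(4, 6, 8, 4, 6)
is.factor(cyl)    # FALSE

cyl <- factor(cyl)
levels(cyl)       # "4" "6" "8"
```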

Table 1. Terminology comparison (R and RDBMS)

R                                           SQL/RDBMS
data frame                                  table (relation)
observation                                 row (tuple)
variable                                    column (attribute)
various ( [], subset(), order(), sort() )   SELECT statements

Connectivity options

The R language does not include built-in relational database access capability, but there are many data access interfaces available from vendors or through the Comprehensive R Archive Network (CRAN).

You will learn how to use either RJDBC or RODBC to access data stored in DB2 database servers from R. If you plan to access other database servers, you could use these same interfaces or you may wish to explore database-specific alternative data access packages.

Connecting with RJDBC

The RJDBC package is based on the database interface (DBI) established in the R community. The DBI package contains virtual classes; it is the responsibility of the underlying driver to implement the classes. RJDBC uses a combination of a JDBC-compliant database driver and a Java Runtime Environment (JRE) to exchange data between R and the database server. You will use the IBM JDBC driver (type 4) to provide the underlying connectivity. The IBM JDBC driver can be used to access databases across the DB2 family, including DB2 for Linux, UNIX, and Windows; DB2 for z/OS; and DB2 for i. If the RJDBC package has not been installed in your R environment, use the command install.packages("RJDBC") to install it from CRAN.

In Listing 1, the RJDBC package loaded an R object representing the JDBC driver. The db2jcct4.jar file must be on your CLASSPATH, and a valid JRE must be available on your system. The dbConnect() function is used to allocate a channel, or connection, object to the database server. The connection is to a database called SAMPLEDB, located on a server with the hostname blueforcloud.imdemocloud.com. Note that a value for the password is required, but it is not included in the script.
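A connection script along the lines described might look like the following sketch; the jar location, port number, user ID, and password are placeholders:

```r
library(RJDBC)

# Load the IBM type 4 JDBC driver (the jar path is a placeholder)
drv <- JDBC(driverClass = "com.ibm.db2.jcc.DB2Driver",
            classPath = "/opt/ibm/db2/java/db2jcct4.jar")

# Connect to the SAMPLEDB database on the cloud server
conn <- dbConnect(drv,
                  "jdbc:db2://blueforcloud.imdemocloud.com:50000/SAMPLEDB",
                  "myuser", "mypassword")
```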

Spend a few minutes with the dashDB for Cloud Trial plan and try the script shown in Listing 1. In the remainder of this article you will learn more about how to access data from R.

Listing 2 shows how the connection object is used to send an SQL query to the server with the dbSendQuery() function. The data is not actually returned to the client until the fetch() function is executed. In this scenario, all of the rows from the result set are returned because the second parameter was set to -1.

Listing 2. RJDBC – querying
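A query script matching this description might look like the following sketch; the table and column names are illustrative:

```r
# conn is the RJDBC connection created earlier
rs <- dbSendQuery(conn, "SELECT make, model, comb_fuel_econ FROM cars")
df <- fetch(rs, -1)   # -1 returns all rows of the result set
head(df)
```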

In Listing 3, the data frame df contains the results of the query. The purpose of this script is to determine which vehicles and models have the best fuel economy. The output shows that the Toyota Prius vehicles have the best combined fuel economy of the 1165 vehicles contained in our table.

When NULL values are returned from a database table, they are mapped to NAs in an R data frame. In the first line of Listing 3, the na.omit() function was used to remove any observations with missing values. R data frame variables are returned from DB2 using upper-case characters. To release database resources on the server, remember to use the dbDisconnect() function at the end of your R scripts.
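A sketch of the steps just described (the column names are illustrative, and df is the data frame returned by the earlier query):

```r
df <- na.omit(df)   # drop observations with missing values (NULLs become NAs)

# Variable names are returned from DB2 in upper case
best <- df[order(-df$COMB_FUEL_ECON), c("MAKE", "MODEL", "COMB_FUEL_ECON")]
head(best, 5)       # vehicles with the best combined fuel economy

dbDisconnect(conn)  # release database resources on the server
```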

You can also use the dbSendQuery() function to bind R objects to SQL statements, as shown in Listing 4.

In Listing 4, the value of the manufacturer is not part of the query text; it is sent as part of the query when the dbSendQuery() function is executed.
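A parameter-binding sketch along these lines (table and column names are illustrative):

```r
manufacturer <- "Toyota"

# The value is bound to the ? marker when the query is executed,
# rather than being concatenated into the query text
rs <- dbSendQuery(conn,
                  "SELECT model, comb_fuel_econ FROM cars WHERE make = ?",
                  manufacturer)
df <- fetch(rs, -1)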

Figure 4 shows the boxplot that was easily generated based on the data frame that was returned.

Figure 4. Boxplot example
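A boxplot like the one in Figure 4 can be produced with a single call; the column names here are illustrative:

```r
# Plot combined fuel economy grouped by model
boxplot(COMB_FUEL_ECON ~ MODEL, data = df,
        ylab = "Combined fuel economy (mpg)")
```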

Connecting with RODBC

RODBC is a faster and more efficient data access interface to DB2 than RJDBC. RODBC is available on CRAN and is used by many people in the R community. If the RODBC package is not installed in your R environment, use the install.packages("RODBC") command to install it.

ODBC was originally developed by Microsoft in the early 1990s. It has since become an official standard for data access known as SQL/CLI (Call Level Interface). An ODBC environment consists of an ODBC Driver Manager and an ODBC-compliant driver for the database server you would like to use. On Windows, the ODBC Driver Manager is built into the platform; on Linux and other platforms, an ODBC Driver Manager must be installed separately.

Connecting to a database from the RODBC driver involves identifying the location of the server, the name of the database, and supporting credentials (for example, user name and password). The name of the database is usually defined as an ODBC DSN. A DSN is a detailed reference to a database that is either local or remote from the client computer. You can consider a DSN as an alias to the database—it does not need to match the actual name of the database defined on the server.

Creating DSNs on Windows involves using the ODBC Data Source Administrator tool, which is accessible from the Control Panel > Administrative Tools menu. In Figure 5, there are many DSNs defined on this client computer.

Figure 5. Windows ODBC Data Source Administrator tool

If you decide to use R and DB2 Express-C for Windows with local databases, there should be DSNs already defined for your databases. If you wish to access a remote database, then you should: obtain the connection details from the database server provider, catalog the connection, and define a DSN within the ODBC Data Source Administrator tool.

Table 3. RODBC – database and table metadata functions

sqlTypeInfo (channel, ...)
Purpose: Provides information about the supported data types of the ODBC database.
Input: channel
Output: Data frame of the supported data types and their characteristics.

sqlTables (channel, ...)
Purpose: Provides a description of the table-like objects defined within a database.
Input: channel; recommended optional parameters: "schema=", "tableType="
Output: Data frame containing details about the tables, views, or other table-like objects in the database.

sqlColumns (channel, sqtable, ...)
Purpose: Provides a description of the columns defined within a table.
Input: channel, table name
Output: Data frame containing details about the column names and other attributes for a table.

After you determine which table you want to work with, use the sqlFetch() function to retrieve the data into R. The RODBC package will map the data to the appropriate R data type based on the defined mappings displayed in the output of the getSqlTypeInfo("DB2/NT") function.

Let’s examine a simple R script that will connect to a database using a DSN and determine some basic information. The program will determine the number of table objects and the columns within a specified table. The sqlFetch() function is used to retrieve all of the data into R and display the first row that is returned in the data frame.

In Listing 5, the odbcConnect() function and metadata functions determined there are 27 tables within the “DB2INST1” schema and 18 columns defined in the fuel economy table. The sqlFetch() function is used to query the contents of the table and create a data frame called cars. The first four variables of the first observation in the data frame are displayed to the user and the connection to the database server is released using the odbcCloseAll() function.
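A sketch of such a script; the DSN, credentials, and table name are placeholders:

```r
library(RODBC)

ch <- odbcConnect("SAMPLEDB", uid = "myuser", pwd = "mypassword")

tables <- sqlTables(ch, schema = "DB2INST1")
nrow(tables)                            # number of table objects in the schema

cols <- sqlColumns(ch, "FUEL_ECONOMY")  # table name is illustrative
nrow(cols)                              # number of columns in the table

cars <- sqlFetch(ch, "FUEL_ECONOMY")    # retrieve all of the data into R
cars[1, 1:4]                            # first four variables of the first observation

odbcCloseAll()                          # release all open connections
```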

An alternative method of connecting to DB2 databases using RODBC involves using the odbcDriverConnect() function. Note that the db.name defined in Listing 6 is the actual database name on the database server and it is not necessarily the DSN created on the client computer.
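A connection-string sketch along these lines; the server name, port, and credentials are placeholders:

```r
# No client DSN is required with this style of connection
ch <- odbcDriverConnect(paste0(
  "DRIVER={IBM DB2 ODBC DRIVER};",
  "DATABASE=SAMPLEDB;",
  "HOSTNAME=blueforcloud.imdemocloud.com;",
  "PORT=50000;PROTOCOL=TCPIP;",
  "UID=myuser;PWD=mypassword;"))
```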

Querying data

A key reason to use R to analyze data stored in database servers, instead of files, is to help manage issues associated with large datasets. As mentioned, R loads all of the data into memory by default; with conditional SQL queries, you can retrieve and examine only the portions of a large dataset that you need.

As shown in Table 4, any valid SQL query can be sent to the database server using the sqlQuery() function. If you have the authority to DROP tables or remove data, then the sqlDrop() or sqlClear() functions are also available.

Table 4. RODBC – querying and deleting data

sqlQuery ( channel, query, ... )
Purpose: Executes the SQL query on the database server and provides the results.
Input: channel, query
Output: Data frame of the result set. The data will be mapped to compatible R data types.

sqlDrop ( channel, sqtable, ... )
Purpose: Removes the table contents and definition from the database. Note that this function will attempt to execute a DROP TABLE statement.
Input: channel, table

sqlClear ( channel, sqtable, ... )
Purpose: Removes all of the rows in a table from the database. Note that this function will attempt to execute a TRUNCATE TABLE statement.
Input: channel, table

You can use the very versatile sqlQuery() function to issue SQL Data Manipulation Language (DML) statements such as SELECT, INSERT, UPDATE, or DELETE, and SQL Data Definition Language (DDL) statements such as CREATE TABLE. The SQL statement in Listing 7 contains an error: the data type is misspelled, and the phrase INTEGR should be INTEGER. Thankfully, the odbcGetErrMsg() function is available to obtain and display a detailed error message. You will find this function very useful as you edit and test your R scripts.
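A sketch of this error-handling pattern (the table definition is illustrative):

```r
# The CREATE TABLE statement contains a deliberate typo: INTEGR
rc <- sqlQuery(ch, "CREATE TABLE CLASSMARKS (ID INTEGR NOT NULL, MARK INTEGER)")

# Retrieve a detailed error message for the failed request
odbcGetErrMsg(ch)
```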

Storing data

Data is usually stored in relational databases using the SQL INSERT statement. If you have an R data frame and want to persist the data into a relational database, you can use the sqlSave() function. The default behaviour of this function is to create a table with the same data as the data frame variable. The column names will also be the same as the column names in the data frame. You may not want to store the R row names from the data frame in your persisted table, as there is usually a natural key within your data set.

The sqlSave() function will use the SQL INSERT statement to populate the table with the data from the data frame. If you have a large dataset that you want to persist, consider creating an external delimited file of the data and then using a database specific high speed loader to publish and share the data with other data analysts. After the data is safely stored in the relational database, you no longer need to worry about maintaining its integrity and availability—the database server will handle that task for you.

In Listing 8, the sqlSave() function is used to create and populate a new table. Since the data frame was called CLASSMARKS, the table will have the same name. In this scenario, the safer parameter allows the script to replace any existing table with the same name. The default behaviour is to return an error if there is an attempt to save data into an existing table.
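A sketch of such a script (the data frame contents are illustrative):

```r
CLASSMARKS <- data.frame(ID = 1:3, MARK = c(81, 72, 90))

# Create and populate a table named CLASSMARKS;
# safer = FALSE allows an existing table with the same name to be replaced
sqlSave(ch, CLASSMARKS, rownames = FALSE, safer = FALSE)
```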

You can also use the RODBC sqlUpdate() function to perform modifications to data in existing tables. I recommend using SQL UPDATE statements as a more effective alternative, or simply storing the new data in a temporary table using sqlSave() and then performing an SQL UPSERT between the tables. If you use the sqlUpdate() function, be aware that an index parameter is required. The index parameter is used to uniquely identify each row in the existing table.

Using stored procedures

Stored procedures are programming modules that are managed and executed by the database server. Since the data is co-located with the program, stored procedures can be very efficient and perform some tasks much faster than client-side applications. DB2 stored procedures can be written using various programming languages such as Java, C, PL/SQL, and SQL PL. If you have existing stored procedures, or your analysis work can be simplified or improved by using stored procedures, you can easily call them from an R script using RJDBC or RODBC.

In Listing 9, the GETMEDIAN stored procedure is called using the sqlQuery() function. The data in this scenario would be returned as a single valued data frame. The subject of interest is passed to a stored procedure as a defined input parameter. At this time, there is no mechanism available to retrieve multiple result sets or output parameters from a DB2 stored procedure.

Listing 9. RODBC – stored procedures
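A sketch of such a call; the input parameter value is illustrative:

```r
# Call the GETMEDIAN stored procedure with one input parameter;
# the single-value result set is returned as a data frame
median_mark <- sqlQuery(ch, "CALL GETMEDIAN('MATH')")
```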

Summary

R is a powerful open source data analysis tool that can help you explore data and build predictive models. When your data grows beyond the memory limits of R, consider loading the data into a database server such as IBM DB2 with dashDB or dashDB for Cloud. In this article, we explored the benefits of using R with databases instead of using delimited text files for data analysis. You also learned how to use the RJDBC and RODBC packages to perform data analysis of big data stored in DB2.