Database Technology for the Web: The MapReduce Debate

Data Analysis Processing Modes

When accessing and analyzing data there are three types of processing that need to be considered: batch processing of static data, interactive processing of static data, and dynamic processing of in-flight data. A business intelligence environment, for example, involves the SQL processing of static data in a data warehouse. This can be done in batch mode (production reporting) or interactively (on-demand analytical processing). SQL may also be used to analyze and transform data as it captured from operational systems and loaded into a data warehouse.

MapReduce is used to process large amounts of data in batch mode. It is particular useful for processing unstructured data or sparse data involving many dimensions. It is not suited to interactive processing. It would be very useful, for example, for transforming large amounts of unstructured data for loading into a data warehouse, or for data mining.

Neither MapReduce nor SQL are particularly suitable to the dynamic processing of in-flight data such as event data. This is why we are seeing extensions to SQL (such as StreamSQL) and new technologies such as stream and complex event processing to handle this need. MapReduce is, however, useful for the filtering and transforming of large event files such as web logs. The next article in this series will look at stream processing in more detail.

MapReduce and Relational Coexistence and Integration

Vertica’s strategy is one of coexistence. With Vertica, MR programs continue to run in their normal operating environment, but instead of routing the output to the MR system, the Reduce program loads output data into the Vertica relational DBMS. The Vertica support works in conjunction with Amazon Elastic MapReduce (EMR). EMR is a web service that provides a hosted Hadoop framework running on the infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The reference below shows how to use EMR to process and load a data set from S3 into the Vertica RDBMS running on Amazon EC2.

The Vertica solution could be used, for example to do batch ETL (extract, transform, load) processing where the input is a very large data set (a set of web logs, for example) and the output is loaded into a data warehouse managed by Vertica.

Aster and GreenPlum have a strategy of integrating the MR processing framework into the RDBMS to take advantage of the benefits of RDBMS technology such as parallel computing, scalability, backup and recovery and so forth.

Greenplum allows developers to write MR programs in the Python and Perl scripting languages. This support enables MR scripts to use open source features such as text analysis and statistical toolkits. These scripts can access flat files and web pages, and can use SQL to access Greenplum relational tables. Source tables can read by Map scripts and target tables can be created by Reduce scripts. This architecture allows developers to mix and match data sources and programming styles. It also allows the building of a data warehouse using both ETL (data is transformed before it is loaded in the data warehousing environment) and ELT (data is transformed after it is loaded into the data warehousing environment) approaches.

Greenplum MR scripts can also be used as virtual tables by SQL statements – the MR job is run on the fly as part of the SQL query processing. Greenplum’s RDBMS engine executes all the code – SQL, Map scripts, Reduce scripts – on the same cluster of machines where the Greenplum database is stored.

Whereas, Greenplum tends to emphasize the use of SQL in MR programs, Aster takes the opposite approach of focusing on the use of MR processing capabilities in SQL-based programs. Aster allows MR user-defined functions to be invoked using SQL. These functions can be written in languages such as Python, Perl, Java, C++ and Microsoft .NET (C#, F#, Visual Basic), and can use SQL data manipulation and data definition statements. The Linux .NET support is provided by the Mono open source product. These functions can also read and write data from flat files. Like Greenplum, Aster MR capabilities can be used for loading a data warehouse using both ETL and ELT approaches. Aster, however, tends to emphasize the power of the ELT approach.

Both Greenplum and Aster allow the combining of relational data with MapReduce style data. This is particularly useful for batch data transformation and integration applications, and intensive data mining operations. The approach used will depend on the application and the type of developer. In general, programmers may prefer the Greenplum approach, whereas SQL experts may prefer the Aster approach.

What About Performance?

MapReduce supporters often state that MapReduce provides superior performance to relational. This obviously depends on the workload. Andrew Pavlo of Brown University together with Michael Stonebraker, David DeWitt and several others recently published a paper comparing the performance of two relational DBMSs (Vertica and an undisclosed row-oriented DBMS) with Hadoop MapReduce. The paper concluded that, “In general, the SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and load the data.” It also acknowledged that, “In our opinion there is a lot to learn from both kinds of systems” and “… the APIs of the two classes of systems are clearly moving toward each other.”

Conclusion

MapReduce has achieved significant visibility because of its use by Google and its ability to process large amounts of unstructured web data, and also because of the heated debate between the advocates of MapReduce and relational database technology experts.

Two things are clear. Programmers like the simplicity of MapReduce and there is a clear industry direction toward supporting MR capabilities in traditional DBMS systems.

MapReduce is particularly attractive for the batch processing of large files of unstructured data for use in a business intelligence system. My personal opinion is that if MR programs are being used to filter and transform unstructured data (documents, web pages, web logs, event files) for loading into a data warehouse then I prefer an ETL approach to an ELT approach. This is because the ELT approach usually involves storing unstructured data in relational tables and manipulating it using SQL. I have seen many examples of these types of database application and this approach is guaranteed to give database designers heartburn.

At the same time, I accept that some organizations would prefer a single data management framework based on an RDBMS. This is one of the reasons why DBMS vendors added support for XML data and XQuery to their RDBMS products. My concern is that relational products and SQL are becoming overly complex, especially for application developers.