Griddable ETL solutions for Hadoop Hive and HBase environments

Loading data into analytic databases poses major problems. Even with only a few databases, the problem is significant because analytics systems have different database types and schemas.

Extract, transform, and load (ETL) solutions exist to solve this problem. ETL solutions usually run as nightly jobs. They extract data from source systems, transform it to Teradata or other analytics formats, and load it into target systems. The transformation process is complex because source databases are typically in second or third normal form, while target data warehouses use star schemas with both fact and dimension tables.
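To make the transformation step concrete, here is a minimal sketch in plain Python of reshaping normalized rows into a star schema. The table names, columns, and the `build_star` helper are hypothetical and chosen only for illustration; a real warehouse load would also handle slowly changing dimensions, data types, and far larger volumes.

```python
# Hypothetical normalized source tables (third normal form).
customers = {
    1: {"name": "Acme", "region_id": 10},
    2: {"name": "Globex", "region_id": 20},
}
regions = {10: "West", 20: "East"}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 250.0},
    {"order_id": 101, "customer_id": 2, "amount": 75.5},
    {"order_id": 102, "customer_id": 1, "amount": 20.0},
]


def build_star(customers, regions, orders):
    """Return (dim_customer, fact_orders) built from the normalized tables."""
    dim_customer = {}
    key_of = {}  # source customer_id -> surrogate key in the dimension
    for surrogate, (cust_id, cust) in enumerate(sorted(customers.items()), start=1):
        key_of[cust_id] = surrogate
        # Fold the region lookup into the dimension row (denormalization).
        dim_customer[surrogate] = {
            "customer_id": cust_id,
            "name": cust["name"],
            "region": regions[cust["region_id"]],
        }
    # Fact rows keep the measure and point at the dimension by surrogate key.
    fact_orders = [
        {
            "order_id": o["order_id"],
            "customer_key": key_of[o["customer_id"]],
            "amount": o["amount"],
        }
        for o in orders
    ]
    return dim_customer, fact_orders


dim_customer, fact_orders = build_star(customers, regions, orders)
```

Even in this toy case the transform has to resolve two lookups and assign surrogate keys before a single fact row can be written, which hints at why the step grows complex on real schemas.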

ETL to ELT

After Google invented MapReduce to handle analytics for data obtained from web crawling, Yahoo and many other large enterprises quickly followed. In response, several vendors released Hadoop distributions in 2010 and 2011. With MapReduce, the transformation step of an ETL solution runs as a distributed computing operation over a shared filesystem (HDFS). These new solutions swapped the order of transform and load, resulting in ELT. ELT extracts from the source system, loads into Hadoop, and transforms the data as a Hadoop MapReduce job.

While ELT scales vastly better than traditional ETL solutions, it still uses batch-oriented extracts. And like ETL solutions, batch-oriented ELT jobs run daily or weekly. Furthermore, HDFS operates as an immutable (append-only) store. Unlike shared file system stores such as NFS, Hadoop never generates a cache invalidation operation. This means the ELT load process cannot efficiently handle deletes and updates.
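A short sketch may help show why updates and deletes are awkward on an append-only store. It assumes one common workaround, which this article does not itself describe: every change is appended as a new record, and readers replay the log to find the live rows. The record layout and the `current_state` helper are hypothetical.

```python
def current_state(log):
    """Replay an append-only change log and return the live rows by key."""
    state = {}
    for op, key, value in log:  # records are only ever appended, never edited
        if op == "delete":
            state.pop(key, None)
        else:
            # "insert" or "update": the latest record for a key wins.
            state[key] = value
    return state


# Hypothetical change log: four appended records for two keys.
log = [
    ("insert", "k1", {"qty": 1}),
    ("insert", "k2", {"qty": 5}),
    ("update", "k1", {"qty": 3}),
    ("delete", "k2", None),
]

live_rows = current_state(log)
```

The store holds four records but only one live row, and finding that row requires scanning the whole log. On large tables that full replay, or a periodic rewrite that compacts the files, is the cost the paragraph above refers to.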

Real-time analytics

While Google and Microsoft update search engine indexes once a day, many analytic applications need more than a daily ETL solution. A new class of real-time analytic (RTA) solutions using Apache Spark attempts to address this requirement. Spark provides window-based and stateful stream processing, and it works exceptionally well for log data.
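As an illustration of what window-based processing means for log data, the following plain-Python sketch groups events into fixed, non-overlapping (tumbling) windows and counts them. It deliberately does not use Spark's API; the event tuples and the `tumbling_window_counts` helper are hypothetical stand-ins for the idea.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size window, keyed by the window start time."""
    counts = Counter()
    for timestamp, _line in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Hypothetical log events as (seconds since start, log line).
events = [
    (0, "GET /"),
    (4, "GET /a"),
    (9, "POST /b"),
    (10, "GET /"),
    (25, "GET /c"),
]

per_window = tumbling_window_counts(events, window_seconds=10)
```

Each result describes only the events inside one window, which suits append-only log streams but does not by itself reflect later updates or deletes to earlier records.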

Many enterprises that use artificial intelligence and machine learning (AI/ML) run data lakes based on Hadoop and Spark. In these data lakes, enterprises must pick their poison – operate on windows of data or use regular batch-oriented ingests. This choice has a significant impact on the characteristics and features of the associated ETL solutions.

Hadoop continuous ingest

With the Griddable platform, enterprises enjoy the benefits of Hadoop environments while utilizing incremental and continuous ingests. Griddable ETL solutions perform incremental and continuous ingests, which reduce load on source databases by up to a factor of two. In addition, Griddable transforms data as it moves to Hadoop and writes to HDFS while preserving transaction semantics. Further, the Griddable Hive consumer performs incremental snapshot materialization, which allows Hive queries to run on very recent ingests.
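The general idea of incremental snapshot materialization with transaction semantics can be sketched as follows. This is not Griddable's implementation, which the article does not describe; the transaction tuples, the `apply_increment` helper, and the stop-at-first-open-transaction rule are all assumptions made for illustration.

```python
def apply_increment(snapshot, last_txn, transactions):
    """Apply committed transactions newer than last_txn to a copy of snapshot.

    Each transaction is (txn_id, committed, changes), where changes is a list
    of (op, key, value) tuples. Returns (new_snapshot, new_last_txn).
    """
    new_snapshot = dict(snapshot)  # leave the previous snapshot readable
    for txn_id, committed, changes in sorted(transactions, key=lambda t: t[0]):
        if txn_id <= last_txn:
            continue  # already materialized in an earlier increment
        if not committed:
            break  # stop at the first open transaction to keep commit order
        # Apply the whole transaction, so readers never see part of it.
        for op, key, value in changes:
            if op == "delete":
                new_snapshot.pop(key, None)
            else:
                new_snapshot[key] = value
        last_txn = txn_id
    return new_snapshot, last_txn


# Hypothetical input: txn 1 is already in the snapshot, txn 3 is still open.
snapshot = {"a": 1}
transactions = [
    (1, True, [("upsert", "a", 1)]),
    (2, True, [("upsert", "b", 2), ("delete", "a", None)]),
    (3, False, [("upsert", "c", 3)]),
]

new_snapshot, new_last_txn = apply_increment(snapshot, 1, transactions)
```

Only the changes since the last materialized transaction are touched, so each increment costs work proportional to the new changes rather than to the size of the table.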

Other vendors claim to deliver ETL solutions with incremental ingests to Hive. However, those solutions usually rely on Hive ACID, a new variant of Hive. Hive ACID supports updates and deletes but imposes several problematic limitations on users. For example, all Hive ACID transactions are auto-commit; BEGIN, COMMIT, and ROLLBACK are not available. Hive ACID also supports fewer file formats, disallows reading or writing an ACID table from a non-ACID session, and supports only snapshot-level isolation. The list continues.

Next steps

To discuss how Griddable can help with your ETL solution requirements, click the “Demo Now” button for a 10-minute, no-obligation tour.