Buyer Beware: ACID Compliance of Analytical Data Platforms May Not Be What You Expect

A Comparison of Splice Machine, Hive LLAP, and Snowflake Performance Based on Transactional Throughput

Recently there has been a lot of interest in the transactional processing functionality of data platforms. A case in point is Delta Lake, which Databricks announced at the Spark + AI Summit. Delta Lake is an open source storage layer that runs on top of a data lake and is compatible with Apache Spark APIs. Delta Lake claims to support ACID transactions for its data lakes.

Even prior to this announcement, customers and prospects would frequently ask us about the transactional performance of data platforms that are primarily known for their analytical capabilities. These include both open source and commercial platforms, such as Apache Hive and Snowflake. These questions were largely triggered by the providers' claims of ACID compliance. For example, Hive claims that with version 0.13 it is possible to provide full ACID functionality at the row level. Similarly, Snowflake’s documentation states “Snowflake transactions guarantee ACID properties.”

Splice Machine’s position on this issue has always been that a data platform capable of processing mixed transactional and analytical workloads at scale has to be built from the ground up with full ACID compliance. It would be very difficult to bolt on ACID capabilities and overcome the latency inherent in a data platform that was originally designed as a data warehouse or a data lake to process analytical queries.

Since the fundamental criterion for a data platform to process OLTP transactions is ACID compliance, it is worthwhile to revisit the concept of ACID. ACID stands for four database properties, namely atomicity, consistency, isolation, and durability:

Atomicity means a transaction either completes or fails in its entirety.

Consistency requires every operation, even one that fails, to leave the database in a fully consistent state.

Isolation ensures that the intermediate states of a transaction are invisible to other transactions.

Durability means that once a transaction is committed, it is preserved in the system even when a system failure occurs.

Users routinely expect the above-mentioned attributes to be part of their operational database functionality.
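To make atomicity concrete, here is a minimal sketch in Python that uses the standard-library sqlite3 module purely as a stand-in engine. The accounts table, the transfer helper, and the balances are hypothetical and exist only for illustration; they are not part of the benchmark or of any platform discussed in this post.

```python
import sqlite3

# In-memory SQLite database, used only as a convenient stand-in engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        # The connection context manager commits on success and rolls back
        # every statement in the block if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit applied
        # by the first UPDATE has been rolled back as well.
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The second transfer credits account 2 before the debit on account 1 violates the constraint. Because both statements belong to the same transaction, the credit is undone along with the failed debit, and the balances are exactly what the first transfer left behind.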

We embarked on this round of benchmarking to validate our assumption that a data platform can only be considered truly ACID compliant if it supports concurrent transactions in the database. These transactions require tracking data changes at the more granular level of individual database reads, writes, and lookups.

To test ACID compliance at the transaction level, we compared the performance of the Splice Machine platform against Hive LLAP and Snowflake using OLTP queries. For this benchmark, we simulated the operation of 100 warehouses (HTAP-100) and measured throughput in transactions per minute (tpmC) on simple, low-end, comparable systems. The simulated transactions in the TPC-C benchmark fall into five categories: new order, stock level, delivery, order status, and payment. tpmC is a measure of the maximum new order throughput that the platform can provide on a sustained basis. In other words, tpmC measures how many new order transactions a platform processes per minute while simultaneously executing the four other transaction types: order status, stock level, payment, and delivery.
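As a rough illustration of how the metric is derived, the sketch below counts only the new order transactions in a log of completed work and divides by the elapsed minutes. The tpmc function, the transaction labels, and the sample log are hypothetical; the numbers are made up and are not the benchmark results, and the sketch ignores the other requirements of the full TPC-C specification.

```python
from collections import Counter


def tpmc(completed, elapsed_seconds):
    """New order transactions per minute over the measured interval.

    `completed` lists the type of every transaction that finished during
    the run. The other four types are part of the workload, but only
    new orders count toward the score.
    """
    counts = Counter(completed)
    return counts["new_order"] / (elapsed_seconds / 60.0)


# Hypothetical log of transactions completed during a two-minute run.
log = (
    ["new_order"] * 90
    + ["payment"] * 86
    + ["order_status"] * 8
    + ["delivery"] * 8
    + ["stock_level"] * 8
)

score = tpmc(log, elapsed_seconds=120)  # 90 new orders in 2 minutes
```

The other transaction types never appear in the numerator, yet the platform still has to execute them concurrently, which is why the score reflects mixed-workload throughput and not an isolated insert rate.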

Based on this methodology, Splice Machine achieved a score of 613.8 tpmC in the new order transaction category. This means that Splice Machine processed about 614 new order transactions per minute while continuing to execute the four other transaction types. For the same transaction category, both Hive LLAP and Snowflake scored less than one.

                Splice Machine    Hive LLAP    Snowflake
TPC-C (tpmC)    613.8             < 1          < 1

While these results are dramatic, they are not completely unexpected. Both Hive LLAP and Snowflake are analytical data platforms that are designed to run OLAP queries efficiently. Even though these platforms now claim to run OLTP queries by providing ACID functionality, they can only do so at the file level and not at the individual transaction level. This is evidenced by the performance gap that we observed during our benchmark. Further, Splice Machine has been shown to run tpmC at greater than 10,000 (refer to the previous benchmark), but we wanted to run a simple, demonstrable test across similar platforms.

As you might know, Splice Machine was built from the ground up as an ACID-compliant relational database management system (RDBMS) that enables users not only to perform analytics using SQL and machine learning but also to power real-time applications. We have always adhered to the philosophy that a data platform that needs to power real-time applications must do so by providing ACID capability at the individual transaction level. Splice Machine provides this functionality through a true Multi-Version Concurrency Control (MVCC) system using Snapshot Isolation semantics. This enables a full ANSI SQL data platform that indexes mutable data at scale to enable consistent updates and fast lookups.

On the other hand, analytical platforms use a file-based approach to provide ACID capability. These systems provide ACID transactions between multiple batch writes. This means that the system keeps track of changes in data over time by analyzing the differential in the Parquet files. Being ACID at the file level is not sufficient to power real-time applications, because such applications require tracking data changes at the more granular level of individual reads, writes, and lookups in the database. We believe that the proof is in the pudding. We also plan a follow-up blog on Databricks Delta Lake in the near future.
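For readers who want a feel for what MVCC with Snapshot Isolation means at the level of individual reads and writes, here is a toy, in-memory sketch in Python. It is not Splice Machine's implementation. The MVCCStore and Txn classes, the logical clock, and the first-committer-wins conflict rule are simplifying assumptions chosen for illustration.

```python
class MVCCStore:
    """Toy multi-version store: each commit appends new versions per key."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical timestamp of the latest commit

    def begin(self):
        # A transaction sees the database as of the moment it starts.
        return Txn(self, self.clock)


class Txn:
    def __init__(self, store, snapshot_ts):
        self.store = store
        self.snapshot_ts = snapshot_ts
        self.writes = {}  # private buffer, invisible to other transactions

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Newest committed version that is not newer than our snapshot.
        for ts, value in reversed(self.store.versions.get(key, [])):
            if ts <= self.snapshot_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any key we wrote was committed
        # by another transaction after our snapshot was taken.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.snapshot_ts:
                raise RuntimeError(f"write-write conflict on {key!r}")
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append(
                (self.store.clock, value)
            )


store = MVCCStore()
setup = store.begin()
setup.write("stock:1", 10)
setup.commit()

reader = store.begin()  # snapshot taken before the update below
writer = store.begin()
writer.write("stock:1", 9)
writer.commit()

seen_by_reader = reader.read("stock:1")            # still the old version
seen_by_new_txn = store.begin().read("stock:1")    # sees the committed update
```

The reader keeps seeing the value from its own snapshot even after the writer commits, while a transaction that starts later sees the update. Versions are tracked per key and per commit, which is the row-level granularity this post contrasts with file-level approaches.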