About

The blog you’re reading right now will be our home. First, we’ll bring you into our world through in depth technical articles, ground breaking statistical analysis and tales of epic and annual nerf wars. Second, we’ll also be giving back to the community in the form of frequent open source contributions, so expect this space to be updated regularly with the latest from Enova Tech & Analytics. Lastly, because Enova is focused on helping others, we’ll keep this site updated with all of the latest meetups, workshops and conferences the Enova team is helping with.

We can’t wait to show you what we’re working on.

January 5, 2017

Citus – Exploration and Evaluation

One of the cool things we do here at Enova is the “Fellowship.” It’s a chance for people to pitch ideas on exploring new technologies that could benefit Enova and our customers. If the pitch is successful, the goal of the fellowship is to deliver a proof-of-concept at the end of four weeks. During my Fellowship I investigated whether Citus can provide performance and scalability benefits for Real-time Reporting Analytics here.

Citus has a master and worker nodes configuration, wherein the former stores metadata and has its custom distributed query planner, and the latter store shards. The application is unaware about this setup and all queries are directed inherently at the master.

Citus parallelizes incoming queries by breaking it into multiple fragment queries, which run in parallel on worker shards. This allows Citus to utilize the processing power of all the nodes in the cluster and also of individual cores on each node for each query. Due to this parallelization, performance is cumulative of computing power of all the cores in the cluster leading to a dramatic increase in query times compared to traditional PostgreSQL on a single server.

There are substantial issues, even with Citus (v5.0) that cannot be overlooked if we were to deploy Citus in production. Work-arounds do exist for some of these, but they’re not ideal for our use cases. Note: Citus does have a new v6.0 but it requires an upgrade to PostgreSQL 9.6.

Here are some of the drawbacks of V5.0:

No support for these standard PostgreSQL features:

Serial Primary Keys

Triggers (work-around: create update-able view on the table)

Sub-queries in the WHERE clause (IN/NOT IN, EXISTS/NOT EXISTS)

SELECT DISTINCT

Foreign Keys (fix released in Citus 6.0 but compatible only with Postgres9.6)

No support for partitioning on multiple columns (work-around: create a composite type and partition on it)

Join Issues:

Joins not possible between a regular and distributed table (work-around: replicate regular table to a single shard on every worker and push the join query down to the workers. This table must be defined as a ‘reference table’ and set citus.shard_replication_factor to the current number of worker nodes)

Outer Joins must Join on the same partition key and underlying tables must have same number of shards

After reviewing some existing functionality, the first experiment involved a long-running query (runtime:~9.75min) on our existing PostgreSQL database. Using Citus and appropriately sharding certain large tables of 19M rows, 2.6B rows, and 3.5M rows, some performance gain was measured.

Result: Unfortunately, even after re-writing this query to satisfy join issues of Citus, the query timed-out after 24 hours on Citus without finishing. However, it is possible that improved hardware would address this.

Case 2:

The second experiment involved another long-running query (runtime:~30min) with table sizes of 11.3M rows and 121.6M rows. Performance gain was again measured to assess for any improvements.

Result: Unfortunately, this query does not run on Citus as its distributed query planner fails to execute it.

Summary

Citus, although theoretically solves the issues seen with large datasets by automated distribution of databases physically, in both our real-world cases, there was no performance gain. Additionally, while implementing the test cases, several underlying limitations of the technology were revealed. But, that’s why we do these proof-of-concepts.

Conclusion

Citus is a definitely a promising concept and could be worth investigating in the near future once it matures, but owing to its current limitations and performance issues on existing Enova-specific queries, it does not seem to be an ideal fit for us. There are some instances where it would clearly provide performance gain – for example, for Large Dataset Archival where attachments can be moved to AWS and the application can directly access these when required, Citus would be substantially faster than traditional Postgres. Similar examples are also available in Citus documentation wherein for certain test cases, performance gain is stark. Based on my research, where this technology stands today, not only its limited technical adaptation makes moving entire Enova datasets to Citus not possible, but also the optimization it provides does not translate to measurable performance gains.