Re: Rollback

When I've had to support data rollback in Hive (and every other data warehouse project concerned with reliability), I've leveraged partitions to support logical rollback, vs physically moving data around or restoring the metastore. The process flow looked something like:

1. Create a table, partitioned by load date or id.

2. Insert data into that partition.

3. Create a view in an interface layer (another DB) that queries from the base table, and filters to that one partition. Query tools / clients only query from this interface layer.

4. Next load, add a new partition.

5. Insert data into the new partition.

6. Update the interface view to point to the new partition.

When rollback is needed, it's easy to update all the views to point to a previous partition that represents an earlier point-in-time.

With the views filtering by a partition key, Hive and Impala (but not Spark SQL) will use partition pruning to only select from the correct partitions. Another benefit of the above approach is you have point-in-time snapshots of data over time, which makes it very easy to build a data warehouse with snapshot facts. It can also be made to work with tables that are appended to (vs replaced or merged into), assuming you aren't updating data in place, e.g., you're using new partitions with each load.

The downside of the above approach is you can end up with lots of partitions, which can be a problem with Impala (it scales less well than Hive on partition count), but that's mitigated with a good retention policy or table archive implementation.

This is indeed a very interesting point. Do you think could make sense to have daily-based partitions, since my main ingestion workflow run once a day? And how can I force Hive or Impala users to use the last point-in-time data?