Work in Progress

Proposed Work

The following sections describe tasks proposed for future work on Howl. They are ordered by what we currently believe to be their priority, with the most important tasks listed first.

Support for more file formats
At least one row format and one text format need to be supported.

Notification
Add the ability for systems such as workflow to be notified when new data arrives in Howl. This will be designed around a few systems receiving notifications, not large numbers of users receiving notifications (i.e., we will not be building a general purpose publish/subscribe system). One solution might be an RSS feed or a similar simple service.

Allow specification of general storage type
Currently Hive allows the user to specify specific storage formats for a table. For example, the user can say STORED AS RCFILE. We would like to enable users to select general storage types (columnar, row, or text) without needing to know the underlying format being used. Thus it would be legal to say STORED AS ROW and let the administrators decide whether sequence file or tfile is used to store data in row format.
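One way to picture this is a small lookup that administrators control. The sketch below is illustrative only: the names GENERAL_TO_CONCRETE and resolve_storage_format are hypothetical, and the concrete formats chosen for each general type are assumptions, not a Howl or Hive setting.

```python
# Hypothetical administrator-controlled mapping from a general storage type
# to the concrete format used on this grid. This is an illustration, not an
# actual Howl or Hive configuration option.
GENERAL_TO_CONCRETE = {
    "COLUMNAR": "RCFILE",
    "ROW": "SEQUENCEFILE",
    "TEXT": "TEXTFILE",
}


def resolve_storage_format(stored_as):
    """Return the concrete format for a STORED AS value.

    General types are translated through the administrator's mapping.
    Any other value is assumed to already name a specific format and is
    returned unchanged (upper-cased).
    """
    name = stored_as.upper()
    return GENERAL_TO_CONCRETE.get(name, name)
```

With this mapping, a table declared STORED AS ROW would be stored as a sequence file, while STORED AS RCFILE would continue to behave as it does today.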

Mark a set of partitions done
Often users create a collection of data sets all together, though different sets may be completed at different times. For example, users might partition their web server logs by date and region. Some users may wish to read only a particular region and are not interested in waiting until all of the regions are completed. Others will want to wait until all regions are completed before beginning processing. Since all partitions are committed individually, Howl has no way for users to know when all partitions for the day are present. A way is needed for the writer to signal that all partitions with a given key value (such as date = today) are complete and that users waiting for the entire collection can now begin. This will need to be propagated through to the notification system.
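The distinction between committing individual partitions and marking a key value done can be sketched as follows. This is a minimal in-memory model under stated assumptions: the class DoneMarkers and its methods are hypothetical names, and nothing here reflects an actual Howl API.

```python
class DoneMarkers:
    """Tracks committed partitions and which key values are declared complete.

    Hypothetical illustration only; not part of Howl.
    """

    def __init__(self):
        self._committed = {}  # (key, value) -> set of committed partition specs
        self._done = set()    # (key, value) pairs the writer has marked complete

    def commit_partition(self, spec):
        # spec is a dict such as {"date": "2010-06-01", "region": "us"}.
        frozen = tuple(sorted(spec.items()))
        for key, value in spec.items():
            self._committed.setdefault((key, value), set()).add(frozen)

    def mark_done(self, key, value):
        # The writer signals that no more partitions with this key value will arrive.
        self._done.add((key, value))

    def is_done(self, key, value):
        return (key, value) in self._done

    def partitions(self, key, value):
        # Partitions committed so far for this key value, done or not.
        return [dict(p) for p in sorted(self._committed.get((key, value), ()))]
```

A reader interested in a single region can use partitions() as soon as that region is committed, while a reader that needs the whole day waits until is_done("date", ...) returns True.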

Data Compaction
Very frequently users wish to store data in a very fine grained manner because their queries tend to access only specific partitions of the data. Consider, for example, a user who downloads logs from the website for all twenty countries it operates in, every hour, and keeps those logs for a year, where each hour has one hundred part files. That's 17,520,000 files for just this one input. This places a significant burden on the namenode. A way is needed to compact these into larger files while preserving the ability to address individual partitions. This compaction may be done when the file is being written, soon after the data is written, or at some later point. As an example of the last case, consider hourly data. For the first few days hourly data may have significant value. After a week, it is less likely that users will be interested in any given hour of data. So the hourly data may be compacted into daily data after a week. A small performance degradation will be acceptable to achieve this compaction. Hadoop archives (har) will be evaluated for implementing this feature. Whether this compaction is automatically initiated by Howl or requires user or administrator initiation is TBD.
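The file count in the example follows from multiplying the figures given. The short calculation below reproduces it and also estimates the effect of the weekly compaction described above, under the assumption (made here only for illustration) that each compacted country-day becomes a single file.

```python
countries = 20
hours_per_day = 24
days_kept = 365
part_files_per_hour = 100

# Without compaction: one set of part files per country per hour for a year.
hourly_partitions = countries * hours_per_day * days_kept
total_files = hourly_partitions * part_files_per_hour

# With compaction after a week. Assumption for illustration: each
# country-day older than a week is compacted into exactly one file.
recent_days = 7
recent_files = countries * hours_per_day * recent_days * part_files_per_hour
compacted_files = countries * (days_kept - recent_days)
files_after_compaction = recent_files + compacted_files

print(f"files without compaction: {total_files:,}")
print(f"files with weekly compaction: {files_after_compaction:,}")
```

Under these assumptions the namenode would track 343,160 files instead of 17,520,000.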

Dynamic Partitioning
Currently Howl can only store data into one partition at a time. It needs to support spraying to multiple partitions in one write.
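Conceptually, spraying means routing each record to the partition named by its own key values during a single write. A minimal sketch, assuming records are plain dictionaries; the function name spray is hypothetical and does not correspond to any Howl interface.

```python
from collections import defaultdict


def spray(records, partition_keys):
    """Group records by their partition key values in a single pass.

    Returns a dict mapping a partition spec, expressed as a tuple of
    (key, value) pairs, to the list of records belonging to it.
    """
    partitions = defaultdict(list)
    for record in records:
        spec = tuple((key, record[key]) for key in partition_keys)
        partitions[spec].append(record)
    return dict(partitions)
```

For example, spraying web log records on ("date", "region") would produce one output group per date and region combination present in the input, rather than requiring one write per partition.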

Utility APIs
Grid managers will want to build tools that use Howl to help manage their grids. For example, one might build a tool to do replication between two grids. Such tools will want to use Howl's metadata. Howl needs to provide an appropriate API for these types of tools.

Pushing filters into storage formats
In columnar compression, performance can be improved when a row selection predicate can be evaluated against the relevant columns before the remaining columns are decompressed and deserialized and the row is constructed. When the filter itself can be applied to a compressed and serialized version of the column, the performance boost is significant. When the underlying storage format supports these features, Howl needs to push the filters from Pig and Hive down to it. Columnar storage formats that Howl commonly uses will also need to be modified to support these features.
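The first optimization, evaluating the predicate on the relevant column before touching the others, can be sketched with in-memory lists standing in for stored columns. This is a toy model: the function scan_with_pushdown is a hypothetical name, and real decompression and deserialization costs are not modeled.

```python
def scan_with_pushdown(columns, filter_column, predicate, wanted):
    """Yield rows (as dicts) whose filter_column value satisfies predicate.

    columns maps a column name to a list of values, one list per column,
    standing in for a columnar file. The filter column is read for every
    row, but the wanted columns are accessed only for rows that match.
    """
    for row_id, value in enumerate(columns[filter_column]):
        if predicate(value):
            yield {name: columns[name][row_id] for name in wanted}
```

In a real storage format the saving comes from skipping decompression and deserialization of the non-filter columns for rows that fail the predicate.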

Separate compression for separate columns
One of the values of columnar compression is the ability to select compression formats that are optimal for different columns in the data. Howl needs to support a variety of data specific compression formats and allow users to select different formats for different columns in a table.

Indices for sorted tables
Providing the first record in each block for a sorted table enables a number of performance optimizations in the query engine accessing the data (such as Pig's merge join). In Howl's standard formats we may need to provide this functionality. It is also possible that the index functionality already being added to Hive could be used for this.
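To illustrate why the first record of each block is useful, the sketch below builds such an index and uses binary search to find the block where a key would live, so a reader can skip all other blocks. The function names are hypothetical, blocks are modeled as non-empty sorted lists of keys, and keys are assumed to be unique across the table.

```python
from bisect import bisect_right


def build_block_index(blocks):
    """Return the first key of each block of a sorted table.

    Assumes every block is non-empty and that blocks are in sorted order.
    """
    return [block[0] for block in blocks]


def find_block(index, key):
    """Return the position of the last block whose first key is <= key.

    Returns None if key sorts before the first block. With unique keys,
    this is the only block that can contain key.
    """
    pos = bisect_right(index, key) - 1
    return pos if pos >= 0 else None
```

An operation such as a merge join can use this kind of index to seek directly to the block that may hold a join key instead of scanning the table from the start.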

Statistics Storage
Data statistics should be accessible through Howl. Compact statistics (e.g., number of rows) can be stored in the database. Large statistics (e.g., histograms) would have to be stored in HDFS. Hive has also done some work in this area. We would need to integrate any work we did with theirs.

Schema Evolution
Currently schema evolution in Hive is limited to adding columns at the end of the non-partition-key columns. It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or allowing new partitions of a table to no longer contain a given column.

Support for streaming
Currently Howl does not support Hadoop streaming users. It should.

Integration with HBase
Currently Howl does not support HBase tables. It needs to have storage drivers so that HowlInputFormat and HowlLoader can do bulk reads and HowlOutputFormat and HowlStorage can do bulk writes. We also need to understand what, if any, interface it makes sense for Howl to expose for point reads and writes for Howl tables that use HBase as a storage mechanism.