Tip 2: NULLs

HiveQL DDL has no option in its CREATE TABLE syntax to specify that a column be NULLable. That's because in the Hadoop/HDFS/Big Data philosophy, data is in general allowed to be missing, and that translates nicely to the SQL concept of NULL in Hive/HiveQL.

However, that doesn't extend to STRUCTs in HiveQL. Apparently neither the columns within a STRUCT nor the STRUCT itself can be NULL.

Thus there is this odd restriction: while every regular Hive column is NULLable, every STRUCT Hive column is non-NULLable. That's quite a polarization of behavior for a DDL that doesn't even have a keyword "NULL".
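A minimal sketch of the contrast (table and column names hypothetical):

```sql
-- Top-level columns are always nullable; there is no NULL/NOT NULL keyword.
-- STRUCT columns and their inner fields are effectively non-nullable.
CREATE TABLE events (
  userid  BIGINT,                       -- may be NULL
  details STRUCT<eventname: STRING>     -- neither details nor details.eventname may be NULL
);
```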

Thursday, February 7, 2013

As an update to the previous post, it turns out Avro can append to an HDFS file after all. The capability is so obscure that even the experts on the Apache Avro mailing list weren't sure how to accomplish it succinctly. Sample code from GitHub.

But the demand for this capability will only grow, now that Hive supports Avro directly as of the Hive 0.10 release in January.

Declaring a 1-to-1 sub-record requires the verbose style above, where the "type" is declared again as a "type", this time of "record".
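A sketch of that verbose style (record and field names are hypothetical, reconstructing the pattern described rather than the original example):

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {
      "name": "details",
      "type": {
        "type": "record",
        "name": "Details",
        "fields": [
          { "name": "eventname", "type": ["null", "string"], "default": null }
        ]
      }
    }
  ]
}
```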

In order for a field to be "nullable", it must be declared as a UNION with type "null" as shown above for "type" using the JSON array syntax.

In order to avoid having to explicitly set a field when writing out an Avro record, the field must be declared with "default": null

Combining the above two points: to declare a field both nullable and "default": null, the "null" must appear first in the UNION, because it is the first type in a UNION list that determines the data type of the default, and "null" is an invalid default for every data type (e.g. "string", "int", etc.) except the "null" data type itself.
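Putting the two rules together, a nullable field with a null default looks like this (field name hypothetical). Reversing the UNION to ["string", "null"] would make "string" govern the default, and Avro would then reject "default": null:

```json
{ "name": "referrer", "type": ["null", "string"], "default": null }
```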

The above applies to sub-records as well. In the example above, "details" is a required sub-record (with an optional field "eventname"), whereas "metadata" is an optional sub-record (with, if the sub-record is present, a required field "referrer").

When writing out sub-records using the "generic" schema rather than the "specific" schema, the Java code to instantiate an empty sub-record is
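A sketch using Avro's generic API (variable names hypothetical; "details" is the required sub-record field from the example):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// "schema" is the parsed top-level record schema.
Schema detailsSchema = schema.getField("details").schema();
GenericRecord details = new GenericData.Record(detailsSchema);
details.put("eventname", "pageview");  // set a field on the sub-record
parentRecord.put("details", details);  // attach the sub-record to the parent
```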

Appending to existing Avro files in HDFS is not supported, even though appending to existing Avro files on the local filesystem and appending to existing non-Avro HDFS files are both supported. The open, year-old, unassigned JIRA ticket for such a feature is AVRO-1035.

[UPDATE 2013-07-18: The above Hive table creation TBLPROPERTIES no longer works with Hive 0.11. Instead one must use SERDEPROPERTIES.]
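A sketch of the Hive 0.11 form (the schema path is hypothetical):

```sql
CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/mytable.avsc')
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
```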

Although the Apache documentation gives examples of 'avro.schema.url'='file:///home/...', that actually doesn't work for any non-trivial Hive query, because the map/reduce tasks don't have access to the local filesystem.

It's not required to specify LOCATION for where the external Hive data will be stored. If it is not specified, then Hive creates the directory /user/hive/warehouse/<tablename> in HDFS.

It is possible to arbitrarily dump a bunch of Avro files into a Hive table directory, and Hive will automatically read them all when executing a query.

Not specific to Avro, but it is possible to specify one or more levels of PARTITION for a Hive table. Each level of PARTITION corresponds to another level of directory underneath /user/hive/warehouse/ (e.g. for a two-level PARTITION one might have a directory /user/hive/warehouse/mytable/datestring=2013-02-01/useridfirstchar=2). Using the point above, one can dump multiple Avro files into one such leaf directory of a PARTITIONed table.
However, one cannot merely create partition directories under /user/hive/warehouse/<tablename>, because the partition names also need to be added to the Hive metastore, which can be done with HiveQL via either the Hive CLI or JDBC:

ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (ds='2013-02-01')

Also, note that contrary to examples in the Apache documentation, "date" is not a valid partition column name, as it is a reserved word. "ds" (for "date string") is a popular alternative.
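For the two-level PARTITION example above, the corresponding statement would be along these lines (assuming the partition columns are named after the directory components):

```sql
ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (datestring='2013-02-01', useridfirstchar='2');
```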

Hive supports SQL:1999-style nested structures and nested arrays out of the box, and so is a good match for highly structured Avro schemas. E.g., below is a valid Hive query against the example schema and CREATE TABLE from above:
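Such a query might look like the following, assuming the "details" and "metadata" sub-records described earlier and a hypothetical table name "mytable":

```sql
SELECT details.eventname, metadata.referrer
FROM mytable
WHERE metadata IS NOT NULL;
```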