Overview – Working with Avro from Hive

Infers the schema of the Hive table from the Avro schema. Starting in Hive 0.14, the Avro schema can be inferred from the Hive table schema.

Reads all Avro files within a table against a specified schema, taking advantage of Avro's backward-compatibility abilities.

Supports arbitrarily nested schemas.

Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don't exist in Hive and are automatically converted by the AvroSerde.

Understands compressed Avro files.

Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and returns null when appropriate.

Writes any Hive table to Avro files.

Has worked reliably against our most convoluted Avro schemas in our ETL process.

Starting in Hive 0.14, columns can be added to an Avro-backed Hive table using the ALTER TABLE statement.

For general information about SerDes, see Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing.

Requirements

The AvroSerde has been built and tested against Hive 0.9.1 and later, and uses Avro 1.7.5 as of Hive 0.13 and 0.14.

| Hive Versions | Avro Version |
|---|---|
| Hive 0.9.1 | Avro 1.5.3 |
| Hive 0.10, 0.11, and 0.12 | Avro 1.7.1 |
| Hive 0.13 and 0.14 | Avro 1.7.5 |

Avro to Hive type conversion

While most Avro types convert directly to equivalent Hive types, there are some which do not exist in Hive and are converted to reasonable equivalents. Also, the AvroSerde special-cases unions of null and another type, as described below:

| Avro type | Becomes Hive type | Note |
|---|---|---|
| null | void | |
| boolean | boolean | |
| int | int | |
| long | bigint | |
| float | float | |
| double | double | |
| bytes | binary | Bytes are converted to Array[smallint] prior to Hive 0.12.0. |
| string | string | |
| record | struct | |
| map | map | |
| list | array | |
| union | union | Unions of [T, null] transparently convert to nullable T; other types translate directly to Hive's unions of those types. However, unions were introduced in Hive 0.7 and cannot currently be used in WHERE or GROUP BY statements; they are essentially look-at-only. Because the AvroSerde transparently converts [T, null] to nullable T, this limitation only applies to unions of multiple types or unions not of a single type and null. |
| enum | string | Hive has no concept of enums. |
| fixed | binary | Fixeds are converted to Array[smallint] prior to Hive 0.12.0. |

Creating Avro-backed Hive tables

Avro-backed tables can be created in Hive using AvroSerDe.

All Hive versions

To create an Avro-backed table, specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe, specify the inputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the outputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location from which the AvroSerde will pull the most current schema for the table. For example:
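A minimal sketch of such a statement follows; the table name, partition column, and schema URL are placeholders, not values from this document:

```sql
CREATE TABLE kst
  PARTITIONED BY (ds STRING)
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    -- AvroSerde reads the table schema from this URL
    'avro.schema.url'='http://schema_provider/kst.avsc');
```

Note that no columns are listed in the DDL itself; the AvroSerde derives them from the schema at the given URL.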

At this point, the Avro-backed table can be worked with in Hive like any other table.

Hive 0.14 and later versions

Starting in Hive 0.14, Avro-backed tables can simply be created by using "STORED AS AVRO" in a DDL statement. AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema, a big win in terms of Avro usability in Hive.
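The shorthand DDL might look like the following sketch (the table and column names are illustrative, not from this document):

```sql
CREATE TABLE episodes (
  title    STRING,
  air_date STRING,
  doctor   INT)
STORED AS AVRO;  -- AvroSerDe derives the Avro schema from these columns
```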

Writing tables to Avro files

The AvroSerde can serialize any Hive table to Avro files. This makes it effectively an any-Hive-type to Avro converter. In order to write a table to an Avro file, you must first create an appropriate Avro schema (except in Hive 0.14.0 and later, as described below). CREATE TABLE AS SELECT statements are not currently supported.

Types translate as detailed in the table above. For types that do not translate directly, there are a few items to keep in mind:

Types that may be null must be defined as a union of that type and null within Avro. A null in a field that is not so defined will result in an exception during the save. No changes need be made to the Hive schema to support this, as all fields in Hive can be null.

Avro Bytes type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to Bytes during the saving process.

Avro Fixed type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to Fixed during the saving process.

Avro Enum type should be defined in Hive as strings, since Hive doesn't have a concept of enums. Ensure that only valid enum values are present in the table – trying to save a non-defined enum will result in an exception.
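For the nullable case above, an Avro field might be declared like the following sketch (the field name is a placeholder):

```json
{"name": "nullable_field", "type": ["null", "string"], "default": null}
```

Only fields declared this way may legally hold a null at save time.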

Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. Therefore, it is incumbent on the query writer to make sure the target column types are correct. If they are not, Avro may accept the type or it may throw an exception; this is dependent on the particular combination of types.

Example

Consider the following Hive table, which covers all Hive data types, making it a good example:
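A sketch of such a table follows; the table and column names are illustrative, and the delimiters are only needed because this staging table is stored as text:

```sql
CREATE TABLE test_serializer (
  string1     STRING,
  int1        INT,
  tinyint1    TINYINT,
  smallint1   SMALLINT,
  bigint1     BIGINT,
  boolean1    BOOLEAN,
  float1      FLOAT,
  double1     DOUBLE,
  list1       ARRAY<STRING>,
  map1        MAP<STRING,INT>,
  struct1     STRUCT<sint:INT, sboolean:BOOLEAN, sstring:STRING>,
  enum1       STRING,              -- Avro enum: stored as string in Hive
  nullableint INT,                 -- Avro union of [null, int]
  bytes1      ARRAY<TINYINT>,      -- Avro bytes: list of tiny ints
  fixed1      ARRAY<TINYINT>)      -- Avro fixed: list of tiny ints
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY ':'
  MAP KEYS TERMINATED BY '#'
STORED AS TEXTFILE;
```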

Hive 0.14 and later

In Hive versions 0.14 and later, you do not need to create the Avro schema manually. The procedure shown above to save a table as an Avro file reduces to just a DDL statement followed by an insert into the table.
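That two-step procedure might look like the following sketch (as_avro and old_table are placeholder names; the column list should mirror the source table):

```sql
CREATE TABLE as_avro (col1 STRING, col2 INT)
  STORED AS AVRO;                          -- schema derived automatically
INSERT OVERWRITE TABLE as_avro
  SELECT * FROM old_table;                 -- writes Avro files
```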

Avro file extension

The files that are written by the Hive job are valid Avro files; however, MapReduce doesn't add the standard .avro extension. If you copy these files out, you'll likely want to rename them with .avro.

Specifying the Avro schema for a table

There are three ways to provide the reader schema for an Avro table, all of which involve parameters to the serde. As the schema evolves, you can update these values by updating the parameters in the table.

Use avro.schema.url

Specifies a URL to access the schema from. For http schemas, this works for testing and small-scale clusters, but as the schema will be accessed at least once from each task in the job, this can quickly turn the job into a DDoS attack against the URL provider (a web server, for instance). Use caution when using this parameter for anything other than testing.

The schema can also point to a location on HDFS, for instance: hdfs://your-nn:9000/path/to/avsc/file. The AvroSerde will then read the file from HDFS, which should provide resiliency against many reads at once. Note that the serde will read this file from every mapper, so it's a good idea to set the replication of the schema file to a high value to provide good locality for the readers. The schema file itself should be relatively small, so this does not add a significant amount of overhead to the process.
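Such a table definition might look like this sketch (the table name is a placeholder; the HDFS path is the example one from the paragraph above):

```sql
CREATE TABLE from_hdfs_schema
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  -- every mapper will read this file; raise its replication factor
  TBLPROPERTIES ('avro.schema.url'='hdfs://your-nn:9000/path/to/avsc/file');
```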

Use avro.schema.literal and embed the schema in the create statement

You can embed the schema directly into the create statement. This works if the schema doesn't have any single quotes (or they are appropriately escaped), as Hive uses this to define the parameter value. For instance:
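A sketch of an embedded-schema table (the table name, namespace, and record name are placeholders):

```sql
CREATE TABLE embedded
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.literal'='{
    "namespace": "com.example",
    "name": "some_schema",
    "type": "record",
    "fields": [ {"name": "string1", "type": "string"} ]
  }');
```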

Note that the value is enclosed in single quotes and just pasted into the create statement.

Use avro.schema.literal and pass the schema into the script

Hive can do simple variable substitution and you can pass the schema embedded in a variable to the script. Note that to do this, the schema must be completely escaped (carriage returns converted to \n, tabs to \t, quotes escaped, etc.). An example:
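A sketch of such a script (the table name is a placeholder; Hive replaces ${hiveconf:schema} with the value passed on the command line):

```sql
CREATE TABLE embedded
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  -- substituted from the -hiveconf schema=... command-line value
  TBLPROPERTIES ('avro.schema.literal'='${hiveconf:schema}');
```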

To execute this script file, assuming $SCHEMA has been defined to be the escaped schema value:

hive --hiveconf schema="${SCHEMA}" -f your_script_file.sql

Note that $SCHEMA is interpolated into the quotes to correctly handle spaces within the schema.
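The escaping step itself can be scripted. The helper below is a hypothetical sketch (not part of Hive) that flattens a schema to a single escaped line using Python's json module:

```python
import json

def escape_schema(schema: dict) -> str:
    """Serialize a schema to one line: json.dumps emits no raw newlines
    or tabs, and single quotes are escaped so the result can sit inside
    Hive's single-quoted TBLPROPERTIES value."""
    return json.dumps(schema).replace("'", "\\'")

schema = {"type": "record", "name": "rec",
          "fields": [{"name": "f1", "type": "string"}]}
print(escape_schema(schema))
```

The printed value could then be exported as $SCHEMA before invoking the hive command above.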

Use none to ignore either avro.schema.literal or avro.schema.url

Hive does not provide an easy way to unset or remove a property. If you wish to switch from using URL or schema to the other, set the to-be-ignored value to none and the AvroSerde will treat it as if it were not set.
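For instance, switching a table from a URL-based schema to an embedded one might look like this sketch (the table name and schema value are placeholders):

```sql
ALTER TABLE my_avro_table SET TBLPROPERTIES (
  'avro.schema.url'='none',  -- AvroSerde now ignores the old URL
  'avro.schema.literal'='{"type": "record", "name": "rec",
                          "fields": [{"name": "f1", "type": "string"}]}');
```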

HBase Integration

Hive 0.14.0 onward supports storing and querying Avro objects in HBase columns by making them visible as structs to Hive. This allows Hive to perform ad hoc analysis of HBase data which can be deeply structured. Prior to 0.14.0, the HBase Hive integration only supported querying primitive data types in columns. See Avro Data Stored in HBase Columns for details.

If something goes wrong

Hive tends to swallow exceptions from the AvroSerde that occur before job submission. To force Hive to be more verbose, it can be started with hive --hiveconf hive.root.logger=INFO,console, which will spit orders of magnitude more information to the console and will likely include any information the AvroSerde is trying to get to you about what went wrong. If the AvroSerde encounters an error during MapReduce, the stack trace will be provided in the failed task log, which can be examined from the JobTracker's web interface. The AvroSerde only emits the AvroSerdeException; look for these, and please include them in any bug reports. The most common is expected to be exceptions raised while attempting to serialize a type that is incompatible with what Avro expects.

FAQ

Why do I get error-error-error-error-error-error-error and a message to check avro.schema.literal and avro.schema.url when describing a table or running a query against a table?

The AvroSerde returns this message when it has trouble finding or parsing the schema provided by either the avro.schema.literal or avro.schema.url value. It is unable to be more specific because Hive expects all calls to the serde config methods to be successful, meaning we are unable to return an actual exception. By signaling an error via this message, the table is left in a good state and the incorrect value can be corrected with a call to ALTER TABLE T SET TBLPROPERTIES.