Defining Table Record Formats in Hive

The Java technology that Hive uses to process records and map them to column data types in Hive tables is called SerDe, which is short for Serializer/Deserializer. The figure illustrates how SerDes are used and will help you understand how Hive keeps file formats separate from record formats.

The first thing to notice is the INPUTFORMAT object. INPUTFORMAT allows you to specify your own Java class if you want Hive to read from a different file format. STORED AS TEXTFILE is simply shorthand: instead of writing out INPUTFORMAT org.apache.hadoop.mapred.TextInputFormat, the whole Java package tree and class name for the default text file input format object, you write STORED AS TEXTFILE.

The same is true of the OUTPUTFORMAT object. Instead of writing out the whole Java package tree and class name, the STORED AS TEXTFILE statement takes care of all of that for you.

Hive allows you to separate your record format from your file format, so how exactly do you accomplish this? Simple: you either replace STORED AS TEXTFILE with something like STORED AS RCFILE, or you create your own Java classes and specify them using INPUTFORMAT packagepath.classname and OUTPUTFORMAT packagepath.classname.
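As a sketch of the shorthand described above, the two table definitions below are equivalent (the table names and column are hypothetical; the class names are the standard defaults that STORED AS TEXTFILE expands to):

```sql
-- Shorthand form:
CREATE TABLE logs_a (msg STRING)
STORED AS TEXTFILE;

-- The same table, spelling out the default input and output format classes:
CREATE TABLE logs_b (msg STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

Swapping in your own classes (or a built-in alternative such as STORED AS RCFILE) changes only the file format; the record format is still governed by the SerDe.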

Finally, notice that when Hive is reading data from HDFS (or the local file system), a Java Deserializer formats the data into a record that maps to table column data types. This characterizes the data flow for a HiveQL SELECT statement. When Hive is writing data, a Java Serializer accepts the record Hive uses and translates it so that the OUTPUTFORMAT class can write it to HDFS (or the local file system). This characterizes the data flow for a HiveQL CREATE-TABLE-AS-SELECT statement.

So the INPUTFORMAT, OUTPUTFORMAT, and SerDe objects allow Hive to separate the table record format from the table file format.

Hive bundles a number of SerDes for you to choose from, and you’ll find a larger number available from third parties if you search online. You can also develop your own SerDes if you have a more unusual data type that you want to manage with a Hive table. (Possible examples here are video data and e-mail data.)

The following list describes some of the SerDes provided with Hive, as well as one third-party option that you may find useful.

LazySimpleSerDe: The default SerDe that’s used with the TEXTFILE format; it would be used with our_first_table from the following listing.
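A minimal sketch of a table that relies on LazySimpleSerDe follows; the table name and columns here are placeholders, not the actual our_first_table listing. A ROW FORMAT DELIMITED clause (or no ROW FORMAT clause at all) is handled by LazySimpleSerDe:

```sql
-- Hypothetical table: plain delimited text parsed by LazySimpleSerDe.
CREATE TABLE sample_table (
  name STRING,
  id   INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```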

RegexSerDe: The regular expression SerDe, which ships with Hive to enable the parsing of text files. RegexSerDe can form a powerful approach for building structured data in Hive tables from unstructured blogs, semi-structured log files, e-mails, tweets, and other data from social media. Regular expressions allow you to use HiveQL to extract meaningful information (an e-mail address, for example) from an unstructured or semi-structured text document that is incompatible with traditional SQL and RDBMSs.
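As a hedged sketch of the e-mail-address idea above, each capture group in the regular expression maps to one table column (the table name, columns, and regex here are illustrative assumptions; note that this SerDe expects STRING columns):

```sql
-- Hypothetical table: pull an e-mail address and the rest of each
-- line out of unstructured text with the RegexSerDe shipped in Hive.
CREATE TABLE email_mentions (
  email   STRING,   -- first capture group
  message STRING    -- second capture group
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*@[^ ]*) (.*)"
)
STORED AS TEXTFILE;
```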

HBaseSerDe: Included with Hive to enable it to integrate with HBase. You can store Hive tables in HBase by leveraging this SerDe.
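A minimal sketch of the HBase integration follows; in practice you specify the HBase storage handler, which brings in the HBaseSerDe behind the scenes (the table names and column family below are hypothetical):

```sql
-- Hypothetical Hive table stored in HBase via the HBase storage handler.
CREATE TABLE hbase_users (
  rowkey STRING,
  name   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- Map the first Hive column to the HBase row key,
  -- and the second to the info:name column.
  "hbase.columns.mapping" = ":key,info:name"
)
TBLPROPERTIES ("hbase.table.name" = "users");
```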

JSONSerDe: A third-party SerDe for reading and writing JSON data records with Hive. You can quickly find two JSON SerDes (via Google and GitHub) by searching online for the phrase json serde for hive.
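One possible sketch, assuming one such third-party project (the openx json-serde) is in use; the class name below belongs to that project, the JAR path is a placeholder, and the table and columns are hypothetical:

```sql
-- The third-party JAR must be on Hive's classpath first.
ADD JAR json-serde.jar;  -- placeholder path

-- Hypothetical table: each line of the file is one JSON record.
CREATE TABLE tweets (
  user_name STRING,
  text      STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
```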

AvroSerDe: Included with Hive so that you can read and write Avro data in Hive tables.
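As a brief sketch, on recent Hive releases the STORED AS AVRO shorthand wires up the AvroSerDe together with the matching Avro container input and output formats (the table name and columns are hypothetical):

```sql
-- Hypothetical Avro-backed table; the Avro schema is derived
-- from the Hive column definitions.
CREATE TABLE events_avro (
  event_id INT,
  payload  STRING
)
STORED AS AVRO;
```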