The first case where this breaks down is when you want to return multiple values from your UDF. For me, this often arises when we have serialized data stored in a single Hive field and want to extract multiple pieces of information from it.

For example, suppose we have a simple Person object (leaving out all of the error checking code):
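A minimal sketch of such a class might look like the following. The tab-separated `firstName<TAB>lastName` wire format and the `serialize`/`deserialize` helpers are assumptions made for illustration, not a prescribed format:

```java
// Hypothetical model object; error checking is left out on purpose.
final class Person {
    private final String firstName;
    private final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    String getFirstName() {
        return firstName;
    }

    String getLastName() {
        return lastName;
    }

    // Assumed wire format: "firstName<TAB>lastName".
    String serialize() {
        return firstName + "\t" + lastName;
    }

    // Splits on the first tab only, so a last name may itself contain tabs.
    static Person deserialize(String serialized) {
        String[] parts = serialized.split("\t", 2);
        return new Person(parts[0], parts[1]);
    }
}
```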

With plain UDFs, pulling out both the first and last name means writing one UDF per field and invoking each of them in the query. Unfortunately, the two invocations have to deserialize their input separately, which could be expensive in less trivial examples. It also requires writing two separate implementation classes whose only difference is which field they pull out of the model object.

An alternative is to use a GenericUDF and return a struct instead of a simple string. This requires using object inspectors to specify the input and output types, just like in a UDTF:
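A minimal sketch of such a GenericUDF, assuming Hive's `hive-exec` library is on the classpath. The class name `PersonNamesUDF`, the display name `person_names`, and the tab-separated input format are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Hypothetical UDF returning struct<firstName:string,lastName:string>.
public class PersonNamesUDF extends GenericUDF {
    private StringObjectInspector input;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {
        // Error handling omitted: assumes exactly one string argument.
        input = (StringObjectInspector) arguments[0];

        List<String> fieldNames = new ArrayList<String>();
        fieldNames.add("firstName");
        fieldNames.add("lastName");

        List<ObjectInspector> fieldInspectors = new ArrayList<ObjectInspector>();
        fieldInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // The returned inspector tells Hive the output is a two-field struct.
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                fieldNames, fieldInspectors);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object raw = arguments[0].get();
        if (raw == null) {
            return null;
        }
        // Deserialize once; the split stands in for real deserialization
        // (assumed format: "firstName<TAB>lastName").
        String[] parts = input.getPrimitiveJavaObject(raw).split("\t", 2);

        // Values must appear in the same order as the declared field names.
        List<Object> result = new ArrayList<Object>(2);
        result.add(parts[0]);
        result.add(parts[1]);
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "person_names(" + children[0] + ")";
    }
}
```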

Here, we're specifying that we expect a single primitive object inspector as an input (error handling code omitted) and returning a struct containing two fields, both of which are strings. We can now use the following query:

4 comments:

The problem with this, though, is that Hive users need to know the object's structure beforehand to take proper advantage of the system. It's not so easy for them to browse each table and find what they want. Do you have any tools that let people do that?

I'd recommend having the query that loads data expand the object into its individual fields, as in the "output" table in the post. That way, the only time users need to be aware of the object structure is if they're using the UDF directly.

For that case, I'd recommend documenting the object format in a @Description annotation (http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/ql/exec/Description.html) on your UDF. The "value" and "extended" fields will then be available inside the Hive console via "describe function foo" and "describe extended function foo".

The main reason would be that structs are a better description of what you're actually returning. In the example, we can access the data inside of the result using "firstName" and "lastName". It might seem pretty intuitive to the developer to simply make these two fields of a string array, but what happens if the return type has a large number of fields or if there is no natural order to the fields?

The other reason is that arrays are homogeneous in Hive, so you can't return multiple types of data in a single array. With a struct, for example, you could return a firstName, lastName, age (integer-valued), aliases (array of strings), and known addresses (array of structs).