Hive should store the full table schema in partition storage descriptors

Details

Type: Bug

Status: Resolved

Priority: Major

Resolution: Not a Problem

Affects Version/s: None

Fix Version/s: None

Component/s: None

Labels: None

Description


When adding a partition, copy the full table schema into the partition storage descriptor. The table storage descriptor may not contain the full schema when using a serde that reports its schema. This change makes partitions with a serde-reported schema behave the same as partitions without one.

Hive tables have a schema, which is copied into the partition storage descriptor when adding a partition. Currently only the columns stored in the table storage descriptor are copied; columns reported by the serde are not. Instead of copying the table storage descriptor columns into the partition columns, the full table schema should be copied.

DETAILS

This is a little long but is necessary to show three things: the current behavior when explicitly listing columns, the behavior with HIVE-2941 patched in and serde-reported columns, and finally the behavior with this patch (the full table schema copied into the partition storage descriptor).

Here's an example of what currently happens. Note the following:

the two manually defined fields for the table are listed in the table storage descriptor.
both fields are present in the partition storage descriptor.

This works great because users who query for a partition can look at its storage descriptor and get the schema.

Now let's examine what happens when creating a table whose serde reports the schema. Notice the following:

The table storage descriptor contains an empty list of columns. However, the table schema is available from the serde reflecting on the serialization class.
The partition storage descriptor does contain a single "part_dt" column that was copied from the table partition keys; the actual data columns are not present (see the sketch below).
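
For reference, here is a minimal sketch of how the two storage descriptors can be inspected through the thrift metastore client. The database/table name, partition value, and connection setup are made-up placeholders, not part of the original report:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class InspectSchemas {
      public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

        // Table-level columns. For a serde-reported schema this list can be
        // empty, because the schema lives in the serialization class rather
        // than in the metastore.
        Table table = client.getTable("default", "events");
        print("table cols", table.getSd().getCols());

        // Partition-level columns. Today these are only whatever was in the
        // table storage descriptor when the partition was added.
        Partition part = client.getPartition("default", "events",
            Arrays.asList("20120630"));
        print("partition cols", part.getSd().getCols());
      }

      private static void print(String label, List<FieldSchema> cols) {
        System.out.print(label + ":");
        for (FieldSchema fs : cols) {
          System.out.print(" " + fs.getName() + "=" + fs.getType());
        }
        System.out.println();
      }
    }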

Travis Crawford
added a comment - 30/Jun/12 02:23

Status update:
Looking into this a bit more, I think we can avoid storing the cols in the metastore if we simply allow partitions to report cols from the serde. Something like this:
ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java

   public List<FieldSchema> getCols() {
-    return tPartition.getSd().getCols();
+    if (SerDeUtils.shouldGetColsFromSerDe(table.getSerializationLib())) {
+      return table.getCols();
+    } else {
+      return tPartition.getSd().getCols();
+    }
   }
For thrift/protobuf this would work perfectly, since you want all records to have the newest schema, and let thrift/protobuf deal with figuring out missing values, unknown fields, etc.
Thoughts?
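
(For context, a predicate like the one referenced above would presumably just check the serde class name. A minimal sketch follows; the exact serde list is an assumption, not the real SerDeUtils implementation:)

    // Sketch only: the real SerDeUtils.shouldGetColsFromSerDe may consult a
    // different set of serde classes.
    public static boolean shouldGetColsFromSerDe(String serializationLib) {
      // True for serdes whose schema comes from reflecting on the
      // serialization class rather than from columns in the metastore.
      return "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer"
          .equals(serializationLib);
    }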

Ashutosh Chauhan
added a comment - 02/Jul/12 21:45

Travis,
I liked your first patch better than this one. The semantics are that when you add a partition, you store the current table's schema in the Partition's storage descriptor (which is what your first patch is doing). Your second approach would return the table schema for the Partition at read time, by which time the table's schema might have changed. I will test and commit your first patch.
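
A sketch of those write-time semantics; the helper shape and names are illustrative, not the actual patch:

    import java.util.List;

    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class WriteTimeSchemaCopy {
      /**
       * Freeze the table's current schema into a new partition's storage
       * descriptor. fullSchema is whatever the caller resolved at write time:
       * the stored columns, or the serde-reported columns for thrift tables.
       */
      public static Partition newPartition(Table table, List<FieldSchema> fullSchema) {
        Partition part = new Partition();
        part.setDbName(table.getDbName());
        part.setTableName(table.getTableName());
        // Start from a copy of the table storage descriptor so the location,
        // serde info, and input/output formats carry over.
        part.setSd(table.getSd().deepCopy());
        // Then overwrite the copied columns with the full schema, which for a
        // serde-reported schema is wider than table.getSd().getCols().
        part.getSd().setCols(fullSchema);
        return part;
      }
    }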

Travis Crawford
added a comment - 02/Jul/12 22:25

In some cases (storing thrift/protobufs) reporting the read-time schema is preferable to the write-time schema. For example, let's say you're storing thrift structs and add a new optional field with a default value. In that case all your old records would be automatically upgraded when using read-time field reporting.
I played around with this over the weekend and think it could look something like this. If the partition storage descriptor has a serde that you should get the cols from, then do that. Otherwise, use the fields stored in the metastore. Since this is per-partition you could have some partitions using serde-reported fields, and other partitions using hard-coded ones.
ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java

-  public List<FieldSchema> getCols() {
-    return tPartition.getSd().getCols();
+  public List<FieldSchema> getCols() throws HiveException {
+    if (SerDeUtils.shouldGetColsFromSerDe(
+        tPartition.getSd().getSerdeInfo().getSerializationLib())) {
+      return Hive.getFieldsFromDeserializer(table.getTableName(), getDeserializer());
+    } else {
+      return tPartition.getSd().getCols();
+    }
   }
Thoughts? I still think the original patch (copy table columns into the partition storage descriptor at write time) is an improvement, but this dynamic approach would be awesome. Some of our schemas are pretty large, so this would save a lot of metastore space.
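
A usage sketch of that dynamic approach, assuming the patched getCols() signature above; the surrounding setup is illustrative:

    import java.util.List;

    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.metadata.Partition;

    public class DynamicCols {
      // Per partition, the same call would either reflect on the serde
      // (thrift/protobuf partitions always see the newest schema) or fall
      // back to the columns frozen in the metastore.
      static void printCols(List<Partition> partitions) throws HiveException {
        for (Partition p : partitions) {
          for (FieldSchema fs : p.getCols()) { // now declared to throw HiveException
            System.out.println(p.getName() + ": " + fs.getName() + " " + fs.getType());
          }
        }
      }
    }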

Travis Crawford
added a comment - 17/Jul/12 18:08

After looking at this further, this change is not actually needed.
The confusion arises from Hive having two sets of classes to represent the main objects (tables, partitions, ...). If you use metastore.api classes the fields are not available unless stored in the metastore. However, if using the ql.metadata classes, Partition copies the table fields to the partition if they're empty. This works great for thrift-based tables.
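
In other words, the resolution rests on behavior roughly like the following in ql.metadata.Partition (a paraphrase of the behavior described above, not the literal Hive source):

    // When the partition storage descriptor has no columns of its own, fall
    // back to the table's columns, which for a thrift table come from
    // reflecting on the serialization class.
    public List<FieldSchema> getCols() {
      List<FieldSchema> cols = tPartition.getSd().getCols();
      if (cols == null || cols.isEmpty()) {
        return table.getCols();
      }
      return cols;
    }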