DUPLICATE: Add TIMESTAMP column type for thrift dynamic_type

Description

create table something2 (test timestamp);
ERROR: DDL specifying type timestamp which has not been defined
java.lang.RuntimeException: specifying type timestamp which has not been defined
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.FieldType(thrift_grammar.java:1879)
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Field(thrift_grammar.java:1545)
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.FieldList(thrift_grammar.java:1501)
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Struct(thrift_grammar.java:1171)
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.TypeDefinition(thrift_grammar.java:497)
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Definition(thrift_grammar.java:439)
at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Start(thrift_grammar.java:101)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.initialize(DynamicSerDe.java:97)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:180)
at org.apache.hadoop.hive.ql.metadata.Table.initSerDe(Table.java:141)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:202)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:641)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:98)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:174)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:207)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:305)

Namit Jain
added a comment - 10/Mar/09 18:33 For the first cut, we don't need so much flexibility. Let us only support one default format.
We can have a to_timestamp() function which takes in the format; that function can be enhanced to support more formats on demand.
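The suggested to_timestamp() could be sketched roughly as below. This is only an illustration of what such a function might compute (a string parsed with a caller-supplied pattern, returned as seconds since the 1970 UTC epoch); the class and method names are hypothetical and this is not Hive's actual UDF API.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical sketch of a to_timestamp(value, format) function: parse a
// string in the given pattern and return seconds since the 1970 UTC epoch.
public class ToTimestampSketch {
    public static long toTimestamp(String value, String format) {
        SimpleDateFormat fmt = new SimpleDateFormat(format);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // interpret input as UTC
        fmt.setLenient(false);                        // reject malformed input
        try {
            return fmt.parse(value).getTime() / 1000L; // millis -> seconds
        } catch (ParseException e) {
            throw new IllegalArgumentException("cannot parse '" + value + "'", e);
        }
    }
}
```

Supporting more formats on demand then amounts to passing a different pattern string, e.g. toTimestamp("2009/03/10", "yyyy/MM/dd").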

Shyam Sundar Sarkar
added a comment - 10/Mar/09 19:59 OK. After a lot of inspection of the MySQL 6.0 spec and the Hive grammar, I decided to implement TIMESTAMP as
a primitive type which will be stored and retrieved as a string of characters. We also have to implement a
UDF to retrieve TIMESTAMP values in specific formats from the stored data. Eventually we will have to implement
something like a TimeZone facility inside the Hive server for storage and retrieval of UTC time. For now we can use system time.

Ashish Thusoo
added a comment - 10/Mar/09 21:16 I went through the write up and it looks quite comprehensive from the DDL point of view. Here are a few comments:
1. I think we can drop ON UPDATE altogether as we do not support UPDATE in Hive at this point.
2. Auto increment is a nice-to-have feature; it can be achieved by providing a now() UDF in Hive, but it is definitely a good feature to have.
3. It would be great if you could also add the UDFs that could be applied to timestamps. MySQL has a bunch of these. Some that directly come to mind are day(ts), month(ts), year(ts), etc., and the to_timestamp function, as Namit pointed out.
As far as storage is concerned, that would depend on the serde in question...
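As an illustration of the kind of UDFs mentioned in point 3, here is a minimal sketch in plain Java (not Hive's actual UDF classes — the class and method names are assumptions) of what day(ts), month(ts), and year(ts) might compute for a 'YYYY-MM-DD HH:MM:SS' string:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

// Illustrative sketch of timestamp field-extraction functions, assuming the
// timestamp is a 'yyyy-MM-dd HH:mm:ss' string interpreted in UTC.
public class TimestampFieldSketch {
    private static Calendar parse(String ts) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        try {
            cal.setTime(fmt.parse(ts));
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + ts, e);
        }
        return cal;
    }

    public static int year(String ts)  { return parse(ts).get(Calendar.YEAR); }
    // Calendar months are 0-based, so add 1 for the SQL-style month number
    public static int month(String ts) { return parse(ts).get(Calendar.MONTH) + 1; }
    public static int day(String ts)   { return parse(ts).get(Calendar.DAY_OF_MONTH); }
}
```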

Shyam Sundar Sarkar
added a comment - 27/Mar/09 22:32 I do not have permission to work on this issue. Can someone explain why I have no permission on the worklog? I am currently working on it.
Also, what is the timeline for the 0.3.0 release, in terms of weeks or months?

Neil Conway
added a comment - 02/Apr/09 00:54 I'd suggest also looking at the SQL spec for the timestamp type (and/or the implementation in PostgreSQL), assuming SQL compliance is something you guys care about.
Why are you storing timestamps as strings, rather than as a numeric offset from some epoch time?

Shyam Sundar Sarkar
added a comment - 16/Apr/09 19:19 I was following the Hive Developer Guide and found that one important section is missing.
The section "3.4. Adding new unit tests" has no instructions about how to add a new unit test.
I had to go through trial and error (with Velocity templates) to add a new unit test to the test suite.
I request that someone from the original test suite design team write a few words for this important subsection.
Regards,
shyam_sarkar@yahoo.com

Shyam Sundar Sarkar
added a comment - 24/Apr/09 13:13 Hello,
I was debugging Hive SQL and found that Hive talks to the Thrift software layer. From the documentation it is clear that there is no
TIMESTAMP type in the Thrift type system.
I am curious whether we need to add TIMESTAMP to Thrift as a basic type; a TIMESTAMP type exists in almost all software systems.
If we do not add a TIMESTAMP type to Thrift, should we map it to a long integer or a struct type?
Any suggestions?
Thanks,
shyam_sarkar@yahoo.com

Ashish Thusoo
added a comment - 24/Apr/09 14:19 Hi Shyam,
1. About unit tests: you should not have to play with Velocity at all to add a new test. You can just add an xyz.q file with the query in the ql/test/queries/clientpositive directory (or ql/test/queries/clientnegative for negative tests). You can capture the test results with
ant -lib testlibs clean-test test -Dtestcase=TestCliDriver -Dqfile=xyz.q -Doverwrite=true
for positive tests and
ant -lib testlibs clean-test test -Dtestcase=TestNegativeCliDriver -Dqfile=xyzneg.q -Doverwrite=true
and then run these tests again after removing the -Doverwrite part and after checking the results in the associated .q.out file formed in ql/test/results... directory. Will add these instructions to the Developer Guide.
2. Native timestamp support in Thrift would help a lot in supporting timestamps with Thrift-based serdes. Note that with Zheng's work on lazy serdes we would remove the dependency on Thrift for serializing and deserializing data, so at this point we should just say that TIMESTAMPs are not supported with the Thrift serde. As for the use of Thrift as a client-server transport, there should not be any issue, as any data passing that boundary is also serialized/deserialized using serdes.

Shyam Sundar Sarkar
added a comment - 11/May/09 21:05 - edited Dear Ashish and others,
After inspecting many different options for implementation, following implementation seems to have the right direction::
(1) TIMESTAMP values cannot be earlier than 1970 UTC or later than '2038-01-19 03:14:07' UTC. This means that
a date such as '1968-01-01', while legal as a DATETIME or DATE value, is not valid as a TIMESTAMP value
and is converted to 0.
(2) The display width is fixed at 19 characters, and the format is 'YYYY-MM-DD HH:MM:SS'. TIMESTAMP values
are converted from the current time zone to UTC for storage, and converted back from UTC to the current
time zone for retrieval. By default, the current time zone for each connection is the server's time.
(3) The implementation converts the TIMESTAMP string into a long integer representing the number of seconds
since 1970 UTC before storing it through SerDe2. Similarly, on retrieval through SerDe2, the long integer
is converted back into a TIMESTAMP string.
(4) In this implementation the SerDe2 and Thrift software layers will not be touched at all. The metadata store
will maintain the information about the TIMESTAMP type; the actual storage will keep long integer values.
Please comment on this observation and suggest any corrections!
Thanks,
Shyam_sarkar@yahoo.com
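Points (1) through (3) above can be sketched in plain Java. This is only a sketch of the proposed conversion, assuming the fixed 'yyyy-MM-dd HH:mm:ss' display format and UTC interpretation described above; the class and method names are illustrative, and the SerDe2 wiring is omitted.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of the proposed TIMESTAMP storage scheme: a 19-character display
// string is converted to seconds since 1970 UTC for storage, and back again
// on retrieval. Values outside the TIMESTAMP range collapse to 0.
public class TimestampCodecSketch {
    // largest 32-bit signed epoch value: 2038-01-19 03:14:07 UTC
    static final long MAX_SECONDS = 2147483647L;

    private static SimpleDateFormat utcFormat() {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt;
    }

    // storage direction: display string -> long (seconds since 1970 UTC)
    public static long toStorage(String ts) {
        try {
            long seconds = utcFormat().parse(ts).getTime() / 1000L;
            // out-of-range values (e.g. '1968-01-01 ...') are converted to 0
            return (seconds < 0 || seconds > MAX_SECONDS) ? 0L : seconds;
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + ts, e);
        }
    }

    // retrieval direction: long -> 19-character display string
    public static String fromStorage(long seconds) {
        return utcFormat().format(new Date(seconds * 1000L));
    }
}
```

A per-connection time zone, as in point (2), would replace the fixed UTC zone in utcFormat() with the session's zone on the retrieval side.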

Shyam Sundar Sarkar
added a comment - 06/Aug/10 18:38 I converted a TIMESTAMP type into a string type as part of the createTable method in the DDLTask.java class. Can anyone suggest how I can retain the TIMESTAMP type in the Hive metadata repository while sending only a string type to the db.createTable() method, so that the SerDe treats it as a string? I want this for back-and-forth conversions during SELECT, UPDATE, etc. It seems that the metadata is permanently modified to string type in Hive.
Any suggestions will help.
-S. Sarkar

Shyam Sundar Sarkar
added a comment - 18/Aug/10 22:22 These are just the initial changes for others to look at and comment on. I need suggestions about string-to-Timestamp conversion within the DynamicSerDe layer.