Looking at the code it seems you're right, it just dumps the output of the command to a file and then runs the cmd line diff against it and the predefined correct test output file. Since the unix timestamp in the query is GMT the UDFFromUnixTime class will format it differently for all the timezones. Any suggestions on how to solve this? Having a unit test tied to a specific timezone doesn't sound like a good idea.

Johan Oskarsson
added a comment - 06/Jan/09 16:00 Looking at the code it seems you're right, it just dumps the output of the command to a file and then runs the cmd line diff against it and the predefined correct test output file. Since the unix timestamp in the query is GMT the UDFFromUnixTime class will format it differently for all the timezones. Any suggestions on how to solve this? Having a unit test tied to a specific timezone doesn't sound like a good idea.

There are two ways to solve this, and I think we should do both in the long run:

1. Add a variable to hive-site.xml to indicate the timezone. This variable if present would override the timezone set on the machine. Then we can set that variable for the hive-site.xml for the tests.\
2. Make all the datetime functions timezone aware like they are for other RDBMs (mysql, oracle etc.) and then change the test query to take the timezone.

Both of these would solve the problem and would be desirable feature extensions as well.

Ashish Thusoo
added a comment - 07/Jan/09 13:09 There are two ways to solve this, and I think we should do both in the long run:
1. Add a variable to hive-site.xml to indicate the timezone. This variable if present would override the timezone set on the machine. Then we can set that variable for the hive-site.xml for the tests.\
2. Make all the datetime functions timezone aware like they are for other RDBMs (mysql, oracle etc.) and then change the test query to take the timezone.
Both of these would solve the problem and would be desirable feature extensions as well.

Johan Oskarsson
added a comment - 09/Jan/09 11:27 For whatever reason the three other tests in this ticket description have started failing for me again in a similar fashion to the log already attached to this ticket. Am I the only one to see this?

I'm starting to suspect that these are related to File.list ordering just like HIVE-90. I had a closer look at input3_limit.q. The important bits:
<snip>
LOAD DATA LOCAL INPATH '../data/files/kv1.txt' INTO TABLE T1;
LOAD DATA LOCAL INPATH '../data/files/kv2.txt' INTO TABLE T1;
<snip>
INSERT OVERWRITE TABLE T2 SELECT a.key, a.value from T1 a LIMIT 20;
<snip>

It seems that the insert select statement with a 20 limit is meant to select from the kv1.txt file, but on my system it picks the kv2.txt file first. Thus the final output will be incorrect. Could someone confirm that this differs from a run where it passes?

Johan Oskarsson
added a comment - 12/Jan/09 17:10 I'm starting to suspect that these are related to File.list ordering just like HIVE-90 . I had a closer look at input3_limit.q. The important bits:
<snip>
LOAD DATA LOCAL INPATH '../data/files/kv1.txt' INTO TABLE T1;
LOAD DATA LOCAL INPATH '../data/files/kv2.txt' INTO TABLE T1;
<snip>
INSERT OVERWRITE TABLE T2 SELECT a.key, a.value from T1 a LIMIT 20;
<snip>
It seems that the insert select statement with a 20 limit is meant to select from the kv1.txt file, but on my system it picks the kv2.txt file first. Thus the final output will be incorrect. Could someone confirm that this differs from a run where it passes?

Zheng Shao
added a comment - 12/Jan/09 22:55 I believe LIMIT statement does take the data from the first file first. So the file ordering does matter.
The other 3 tests should be fixed by Ashish's patch correct? If it does, please let me know and I will commit Ashish's patch and close this jira (and please open a new one for the limit problem).

Ashish Thusoo
added a comment - 13/Jan/09 02:00 The problematic code is in
FetchTask.java:getNextPath()
We are using fs.listStatus there to get the list of files and I do not think that gaurantees order.

Zheng: Only one of the tests are fixed by the patch, the udf5 test. The others are working sometimes but mostly failing, I suspect they might all be related to File list sort order, but I have only confirmed it with one of them so far as mentioned above.
There are a few options as far as I can see, commit the udf5 and open new ticket for the other three tests or try to fix all the tests in this ticket. I don't mind either way.

Johan Oskarsson
added a comment - 13/Jan/09 10:14 Zheng: Only one of the tests are fixed by the patch, the udf5 test. The others are working sometimes but mostly failing, I suspect they might all be related to File list sort order, but I have only confirmed it with one of them so far as mentioned above.
There are a few options as far as I can see, commit the udf5 and open new ticket for the other three tests or try to fix all the tests in this ticket. I don't mind either way.

Ashish Thusoo
added a comment - 14/Jan/09 18:29 On thinking more about this, I think we should change the test for input3_limit.q to make it deterministic. Instead of
INSERT OVERWRITE TABLE T2 SELECT a.key, a.value from T1 a LIMIT 20;
We should have
INSERT OVERWRITE TABLE T2 SELECT * FROM (SELECT * FROM T1 DISTRIBUTE BY key SORT BY key, value) T LIMIT 20;
SELECT * FROM T2
That would ensure that the rows that we get are top 20 rows.
The order in which these files are processed are controlled by hadoop, so I don't think we have much control on whether kv1.txt gets processed first or kv2.txt.
Thoughts?

Johan Oskarsson
added a comment - 14/Jan/09 18:32 Sounds like a good solution to me.
Do you have any idea what might be happening in testCliDriver_sample5 and testCliDriver_sample3? Similar issue with file sorting?

Ashish Thusoo
added a comment - 20/Jan/09 19:10 sample3 and sample5 are suffering from the same sorting issue. I think the related code there is at
getBucketPath in
ql/metadata/Partition.java
I think we should be sorting the files there as well...
This is picked up in
TableSample.java to create the paths needed..

Prasad Chakka
added a comment - 21/Jan/09 01:17 ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java:216 i think FileStatus default comparator is same the one you defined here. so custom comparator is not necessary.

Johan Oskarsson
added a comment - 21/Jan/09 13:11 Tried out the latest patch but unfortunately both of the sample methods still fail. The rest works fine though. I don't have time to look deeper at why right now. Attaching log file for the test run.

Ashish Thusoo
added a comment - 23/Jan/09 11:41 From the diffs I do not actually see any difference in the lines that are reported as diffining in the two test cases. Is there extra space at the end of one of these results? Puzzled...

Ashish Thusoo
added a comment - 04/Feb/09 00:01 Ok. This one should work. The problem was the ordering the sample queries and I have fixed those. Please verify and let me know. I will checkin once this passes for you.