Edward Capriolo
added a comment - 30/Jul/13 05:05 We currently do not have qualifiers like (20,2) for types in the Hive language. This sounds like a fairly involved change; I am very curious how these types will interact with the already existing system.

Xuefu Zhang
added a comment - 30/Jul/13 05:33 These two are very good questions, for which I don't have definite answers yet. For the first one (precision/scale notation), I guess we will have to introduce it. For the second one, I'd think there may be a way to keep data backward compatible (new Hive reading old data). I agree this is quite complex, nevertheless.

Jason Dere
added a comment - 30/Jul/13 09:07 Hi Ed/Xuefu, yeah, I have similar issues with setting the length parameter for char/varchar types for https://issues.apache.org/jira/browse/HIVE-4844. I've got some prototype code for this; not entirely sure if I have the best approach. I was going to try to work through this a bit more, but if you'd like I can post a patch for you guys to take a look at and comment on.
Basically I've added parameterized versions of PrimitiveTypeEntry, PrimitiveTypeInfo, and ObjectInspector, with additional factory methods for these types so that the caller can fetch the TypeEntry/TypeInfo/ObjectInspector based on PrimitiveCategory + type parameters. Will definitely need to work out how this interacts with the existing system, as currently all of those types make liberal use of pointer-based equality, and there seem to be some instances where it may not be possible to have access to type params when trying to get the TypeEntry/TypeInfo/ObjectInspector.
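
For illustration, a minimal sketch of the factory approach described above - class and method names here are hypothetical and the actual prototype code may differ; it assumes a parameterized DecimalTypeInfo carrying precision/scale:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ParameterizedTypeInfoFactory {
  // Cache keyed by the qualified type name so that equal parameters yield the
  // same instance, keeping existing pointer-based equality checks working.
  private static final Map<String, DecimalTypeInfo> cache =
      new ConcurrentHashMap<String, DecimalTypeInfo>();

  public static DecimalTypeInfo getDecimalTypeInfo(int precision, int scale) {
    String name = "decimal(" + precision + "," + scale + ")";
    DecimalTypeInfo info = cache.get(name);
    if (info == null) {
      info = new DecimalTypeInfo(precision, scale); // hypothetical parameterized TypeInfo
      DecimalTypeInfo existing = cache.putIfAbsent(name, info);
      if (existing != null) {
        info = existing;
      }
    }
    return info;
  }
}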

Xuefu Zhang
added a comment - 30/Jul/13 17:26 Thanks, Jason.
Please do attach a patch so that we can see how you plan to do it, even if it's incomplete. I think our issues belong to the same category, so a generic approach works best.

Jason Dere
added a comment - 29/Aug/13 19:10 Here is a patch containing the instances where I've removed precision/scale support from the HIVE-4844 patch, in case you are interested in re-applying these changes on your side.

Gunther Hagleitner
added a comment - 30/Aug/13 06:30 This is cool. Looking forward to having a complete decimal data type. I think it'd be neat to have some sort of spec for this, though. Some thoughts I had when doing the initial decimal (hope this is helpful):
The current decimal type is of "mixed scale" and by default has the full maximum precision (38). Are you going to change this? A smaller default has benefits, as does a fixed default scale. For performance reasons it would also be good to keep the default precision below 19. This way it'll fit in a single long (and we can in the future switch to a faster implementation). Fixed scale means that you can compute the basic operations more efficiently (and it's more sane than mixed scale). Eric Hanson was the one who pointed that out to me.
HIVE-5022 is related to this, and it'd be good to keep any breaking changes in a single release.
Have you thought about the rules for precision + scale in arithmetic operations? Here are some sensible and compliant rules from Microsoft: http://technet.microsoft.com/en-us/library/ms190476.aspx
Currently, when we read data and the value has a precision greater than the max, we turn it into null. Have you thought about what to do when you read a decimal that doesn't fit/doesn't conform with the column specification? Round, truncate, error, or null?
Propagating precision and scale changes throughout arithmetic expressions seems difficult. Jason's patch might lay some groundwork, but numeric operations are different from the char/varchar case. I think you need to keep track of them, though, so you can return them to the user (e.g., through jdbc/odbc), and probably also for CTAS.
Same thing for UDFs. That might have some wrinkles too, i.e.: how do you know what precision/scale a UDF is going to return?
I'm not sure whether you need to worry about "insert into" - maybe you can just write whatever and handle the data when you read it again.
If you switch everything to fixed scale, I think BinarySortableSerde can be greatly simplified for decimals.
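
As a concrete reading of those Microsoft rules, here is a small sketch (the helper names are illustrative, not Hive API; results would still be capped at the maximum precision):

public class DecimalResultTypes {
  // +/- result per the Microsoft rules: scale = max(s1, s2),
  // precision = scale + max(p1 - s1, p2 - s2) + 1
  public static int[] plusMinus(int p1, int s1, int p2, int s2) {
    int scale = Math.max(s1, s2);
    int precision = scale + Math.max(p1 - s1, p2 - s2) + 1;
    return new int[] { precision, scale };
  }

  // multiplication result: precision = p1 + p2 + 1, scale = s1 + s2
  public static int[] multiply(int p1, int s1, int p2, int s2) {
    return new int[] { p1 + p2 + 1, s1 + s2 };
  }

  public static void main(String[] args) {
    // decimal(5,2) + decimal(7,4) -> decimal(8,4)
    System.out.println(java.util.Arrays.toString(plusMinus(5, 2, 7, 4)));
    // decimal(5,2) * decimal(7,4) -> decimal(13,6)
    System.out.println(java.util.Arrays.toString(multiply(5, 2, 7, 4)));
  }
}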

Jason Dere
added a comment - 30/Aug/13 10:22 One note about propagating precision/scale throughout the expressions, especially if we want to have them available through jdbc/odbc. All of the add/sub/mult/div operations are implemented as old-style UDFs, which is a bit problematic. The old-style UDFs use reflection to determine the return-type TypeInfos/ObjectInspectors, based on the return type of the evaluate() method chosen for the expression. The way this is being done, we cannot customize the precision/scale of the TypeInfo representing the result - the resulting TypeInfo would just be the default decimal type with no parameters, and whatever default precision/scale information comes with that type. So if you want the type metadata to have the correctly set precision/scale, all of the arithmetic operators would need to be redone as GenericUDFs, which allow you to customize the return-type ObjectInspector during the initialize() method. I had to do the same thing with a few string UDFs to get the varchar length reported back correctly in the TypeInfos.
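
To make that concrete, here is a rough sketch of a GenericUDF whose initialize() computes a parameterized decimal return type. This is a sketch only: it assumes the parameterized DecimalTypeInfo/ObjectInspector machinery discussed above, the arithmetic in evaluate() is elided, and the exact factory calls may differ from the real patches.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo;

public class GenericUDFDecimalAdd extends GenericUDF {
  private DecimalTypeInfo resultType;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentException("decimal + expects two arguments");
    }
    // Unlike reflection-based UDFs, the result type can be derived from the
    // argument types rather than fixed by an evaluate() signature.
    DecimalTypeInfo t1 = (DecimalTypeInfo) ((PrimitiveObjectInspector) arguments[0]).getTypeInfo();
    DecimalTypeInfo t2 = (DecimalTypeInfo) ((PrimitiveObjectInspector) arguments[1]).getTypeInfo();
    int scale = Math.max(t1.scale(), t2.scale());
    int precision = scale + Math.max(t1.precision() - t1.scale(), t2.precision() - t2.scale()) + 1;
    resultType = new DecimalTypeInfo(precision, scale);
    return PrimitiveObjectInspectorFactory.getPrimitiveWritableObjectInspector(resultType);
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // Addition elided - the point of the sketch is the return-type
    // computation in initialize().
    throw new HiveException("sketch only");
  }

  @Override
  public String getDisplayString(String[] children) {
    return "(" + children[0] + " + " + children[1] + ")";
  }
}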

Xuefu Zhang
added a comment - 03/Sep/13 20:12 Jason Dere Thanks for posting your code regarding precision/scale, and your comments about the related UDFs. There seems to be a lot of work, but we hope the outcome is worth the effort. It's good that you have gained insight with your char/varchar work. It will be valuable.
Gunther Hagleitner Thanks for sharing your thoughts. I agree that this is complex enough to warrant a spec; with one, the issue may be closed more quickly and easily thanks to a large community. The questions you posted are valid and yet to be answered. The existing decimal feature seems incomplete and non-standard in many ways. With this task, we can hope to put it in good shape. In principle, I think we should follow the standard where available, and follow some existing implementation or define Hive's own behavior where the standard doesn't define one. Doing that seems to make breaking backward compatibility unavoidable. But how much can we break? For instance, can we say that a decimal without precision and scale specified defaults to (10, 0) (as MySQL does) rather than the current (38, ?)? It would be great if we can redefine everything and do it right, once and for all.

Jason Dere
added a comment - 11/Sep/13 00:54 Thanks for posting the doc.
In regards to the default precision/scale behavior of DECIMAL values which have no explicit precision/scale set, I wonder if we may be better off deviating from the SQL standard, which defines that the default scale should be 0. This has a couple of pluses:
1) If users have existing decimal data in Hive, applying the SQL standard behavior means that their existing data is now truncated to integer values. Having a default precision/scale here could allow most users to continue using decimal as they are. Though users do have the workaround of using ALTER TABLE to set the precision/scale of their existing decimal columns so that the values will be read properly.
2) Having a default precision/scale might allow non-generic UDFs to continue to work with decimal types, to some degree, if a non-generic UDF has decimal arguments and return values with a useful default precision/scale.
Would definitely be good to see what people think on the default precision/scale, since changing to SQL compliance would mean a big change in behavior. On the other hand, if no one is using decimal then it might not matter.

Xuefu Zhang
added a comment - 11/Sep/13 20:13 Thanks for sharing the thoughts. I think both 1) and 2) concern backward compatibility. If decimal is redefined to have fixed precision/scale, then there is no way to be fully backward compatible. That's because decimal, as it is in Hive today, has fixed precision and variable scale. No matter what the new default is for scale, there will be compatibility issues. To make things worse, users don't even know the scale of the existing data.
As mentioned, if the adoption is low, it might make better sense for us to take the pain and do it right. This is the standpoint that the document takes. Validity of the assumption is subject to discussion, of course.
As to non-generic UDFs, I think we need to educate users to avoid them unless whatever the default is happens to be exactly what they need, which is unlikely in my opinion: if they need decimal(5,2), decimal(6,3) will not make them happier.

Mark Grover
added a comment - 12/Sep/13 03:59 Hi Xuefu,
Thanks for posting the document. Overall, looks good! It's great that you are working on this.
One minor question: in the document you mention "Specifically, for +/-, the result scale is s1 + s2, and for multiplication, max(s1, s2)". Did you mean the other way around (max for +/-, s1+s2 for multiplication)?
In the error handling case, you mention rounding instead of using null. That definitely seems like a plausible idea. However, I think it may be a slippery slope. For example, we may pick a rounding policy by default. This policy may not work for all users, and then we may have to expose configuration/table-specific options to specify such rounding policies. There might be value in masking the values that don't conform as nulls and putting the onus on the users to explicitly round them to conform to the column type. However, if MySQL or another popular database has set a precedent, what you are suggesting may be OK.
Big +1 on changing arithmetic operation UDFs to GenericUDFs like you and Jason already suggested.

Xuefu Zhang
added a comment - 12/Sep/13 06:28 Mark Grover Thanks for your feedback.
One minor question: in the document you mention "Specifically, for +/-, the result scale is s1 + s2, and for multiplication, max(s1, s2)". Did you mean the other way around (max for +/-, s1+s2 for multiplication)?
You're right. I will correct this in the doc.
Regarding error handling, some DBs have different modes that control the behavior. Since Hive doesn't have such a setting yet, I think we can implement a default behavior but keep the door open to more options in the future.
As to rounding behavior, I think providing a config on the rounding method could be overkill. In fact, MySQL is very clear on the rounding method. For decimal, it uses "round half up". More information can be found at http://dev.mysql.com/doc/refman/5.1/en/precision-math-rounding.html. The document borrows MySQL's way of decimal rounding, i.e., rounding is always allowed for the fractional part regardless of server mode. I think this is more reasonable and practical. I will update the doc to make this more explicit.
I have to say that the error handling regarding numeric types is very complex. It appears to me that in most cases how a DB does it doesn't really matter. What matters is documenting it clearly. Hive needs to do this as well.
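
To pin down "round half up" (ties round away from zero), a tiny self-contained example using plain java.math - illustrative only, not Hive code:

import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundHalfUpDemo {
  public static void main(String[] args) {
    System.out.println(new BigDecimal("2.345").setScale(2, RoundingMode.HALF_UP));  // 2.35
    System.out.println(new BigDecimal("-2.345").setScale(2, RoundingMode.HALF_UP)); // -2.35
    System.out.println(new BigDecimal("2.344").setScale(2, RoundingMode.HALF_UP));  // 2.34
  }
}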

Jason Dere
added a comment - 26/Sep/13 00:58 A comment about the change in decimal precision. The original precision of 38 was chosen because it fits within 16 bytes, and the choice involved some discussion with the folks doing the vectorization work. They might also have some comments about these changes, as this may also affect the vectorization work.
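
For the size arithmetic behind this, a quick check in plain java.math (nothing Hive-specific): any 38-digit value fits in a signed 128-bit (16-byte) integer, and keeping the default precision at 18 or below would let values fit in a single 64-bit long, as Gunther noted earlier.

import java.math.BigInteger;

public class DecimalWidthCheck {
  public static void main(String[] args) {
    BigInteger max38 = BigInteger.TEN.pow(38).subtract(BigInteger.ONE);
    System.out.println(max38.bitLength()); // 127 -> fits a signed 128-bit integer

    BigInteger max18 = BigInteger.TEN.pow(18).subtract(BigInteger.ONE);
    System.out.println(max18.bitLength()); // 60 -> well within a 64-bit long
    System.out.println(Long.MAX_VALUE);    // 9223372036854775807, ~9.2 * 10^18
  }
}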

Jason Dere
added a comment - 26/Sep/13 01:17 Took a look at the patch, pretty good so far. Some comments/questions:
1. Patch is missing HiveDecimalUtils.java
2. LazyHiveDecimalObjectInspector.getPrimitiveJavaObject(): We should only be returning null when the integral portion of the HiveDecimal exceeds the integral portion of the typeParams, so I think here it should be checking (result.precision - result.scale) <= (params.precision() - params.scale()). Otherwise we would return null here with type params of (5, 2), if the result was 0.123456 (decimal(6,6)), because the precision of the result is larger than 5.
3. HiveDecimal.enforcePrecisionScale(): I think here we're allowing integer values larger than should be allowed, because it's shrinking the scale based on the size of the value. I was under the impression that for a decimal(5, 3) we can't assign any values larger than 99.999.
4. Looks like there is potentially a difference between HiveDecimal.bd.scale() and the defined scale of the decimal type params. Just wondering if there are any issues that might arise from that.
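
A tiny sketch of the check point 2 suggests (the method name is illustrative, not the actual patch code):

public class DecimalFitCheck {
  // Null out a value only when its integer digits exceed what the declared
  // type can hold; excess fractional digits can be rounded away instead.
  public static boolean integerPartFits(int valuePrecision, int valueScale,
                                        int typePrecision, int typeScale) {
    return (valuePrecision - valueScale) <= (typePrecision - typeScale);
  }

  public static void main(String[] args) {
    // 0.123456 is decimal(6,6): for a decimal(5,2) column this returns true
    // (0 <= 3), so the value should be rounded to 0.12 rather than nulled.
    System.out.println(integerPartFits(6, 6, 5, 2));
  }
}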

Xuefu Zhang
added a comment - 26/Sep/13 04:03 Jason Dere Thank you very much for your early feedback.
Re: #1. I can see HiveDecimalUtils.java in patch #1.
#2. Thanks for noticing that. I was consolidating this logic but still forgot this class. It should be synched with other parts of the code.
#3. A type decimal(5,3) signifies that at most two digits are allowed for integer part because 3 is allocated for decimal part. With this, the biggest value it can take is 99.999. Thus, a value greater than 99.999 shouldn't be allowed for this type. Here is what mysql does in default mode:
mysql> CREATE TABLE t3 (d DECIMAL(5,3));
Query OK, 0 rows affected (0.12 sec)
mysql> insert into t3 values (123.45);
Query OK, 1 row affected, 1 warning (0.04 sec)
mysql> select * from t3;
+--------+
| d |
+--------+
| 99.999 |
+--------+
1 row in set (0.00 sec)
I believe that under strict mode, the value will be rejected. Since Hive currently only has a strict mode, we choose null in this case. With other modes in the future, we may have different behaviour. Regardless, I don't think decimal(5,3) should hold more than 2 integer digits.
#4. I will check what the difference is between BigDecimal.scale() and decimal.scale(). In the meantime, could you be more specific about the difference you have in mind?

Jason Dere
added a comment - 18/Oct/13 19:10 The semantics so far look in line with what was discussed, though most of the semantics regarding the resulting precision/scale during arithmetic will not be in place until HIVE-5356.
A little surprised that default "decimal" won't work though - could you manually register decimal(10,0) using "decimal", in TypeInfoFactory?
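
(A hypothetical sketch of that suggestion, assuming TypeInfoFactory keeps a name-keyed cache of TypeInfo instances - the actual field name and registration mechanism may differ:)

// In TypeInfoFactory: map the bare legacy name to a concrete parameterized
// type so that old "decimal" columns still resolve.
cachedPrimitiveTypeInfo.put("decimal", new DecimalTypeInfo(10, 0));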

Xuefu Zhang
added a comment - 18/Oct/13 19:30 Jason Dere Default "decimal" as a type name works out of the box, which means you can do "create table dec (d decimal)". At semantic analysis, the type name is converted to "decimal(10,0)" internally. After that point there is no difference; it works as if the user had started with "decimal(10,0)" at the beginning. Let me know if this doesn't answer your question.

Jason Dere
added a comment - 18/Oct/13 19:44 Actually my comment was supposed to be about HIVE-5564 - would doing what I described allow legacy decimals to work without the error? Since "decimal" should only ever show up as a typeinfo string for legacy decimal columns, you could even register decimalTypeInfo (currently decimal(65,30) in your patch) as the typeinfo for "decimal". This would give backward-compatible behavior for legacy decimal columns, and sql-compliant behavior for new decimal columns due to the behavior you describe in your previous comment.

Xuefu Zhang
added a comment - 18/Oct/13 20:27 Thanks for clarifying your question.
If the comment is about HIVE-5564, I think it might be better to move the discussion there. Nevertheless, as a short response to your comments:
1. The question of backward compatibility was brought up on day 1 and detailed in the functional document. I think there is an agreement that no known customer is seriously using decimal, and breaking backward compatibility isn't a concern for this project. Unless you want to bring up that discussion again, let's move forward without further concern.
2. Internally, the typeinfo object for "decimal(65,30)" is not user-facing (i.e., not for specifying a column type), but is an internal default for when the server doesn't know the type (for instance, the decimal type returned by a non-generic udf). To further clarify: externally, when a user specifies "decimal" as a column type, the server interprets it as "decimal(10, 0)"; internally, if the server cannot determine the exact type, then decimal(65,30) is assumed.
3. In my opinion, fixing the error in HIVE-5564 is better than assuming (65,30) as the user default, due to its usability problem. BTW, the fix could have been included in this patch, so the "error" would not have come into the picture at all.
I think I should put the clarification in #2 in the document as well. Let me know if you have more questions.

Xuefu Zhang
added a comment - 22/Oct/13 18:07 Hi Ashutosh Chauhan, I'm wondering if you could provide your review feedback. The patch is fairly big, so the earlier we get it in, the easier the merge will be. Thanks.