HASH function (CORE-4436)

HASH function (CORE-4436)

We have HASH function that returns a 64 bit integer. The algorithm is
bad and the result too small for a hash.

I propose to extend the same function with a second parameter with the
algorithm name.

When only one parameter is passed, things works as now.

When two parameters are passed, it will return a VARCHAR(64) CHARACTER
SET OCTETS. That's sufficient for a SHA-256, for example.

We implement (preferably getting code with compatible license or public
domain - like was done for SHA-1 used in authentication - instead of use
a bloated library) some algorithms (MD5, some useful SHA-*). No need for
"hash plugins" for now.

Re: HASH function (CORE-4436)

> We have HASH function that returns a 64 bit integer. The algorithm is bad and
> the result too small for a hash.
>
> I propose to extend the same function with a second parameter with the
> algorithm name.
>
> When only one parameter is passed, things works as now.
>
> When two parameters are passed, it will return a VARCHAR(64) CHARACTER
> SET OCTETS. That's sufficient for a SHA-256, for example.
>
> We implement (preferably getting code with compatible license or public
> domain - like was done for SHA-1 used in authentication - instead of use a
> bloated library) some algorithms (MD5, some useful SHA-*). No need for
> "hash plugins" for now.

Re: HASH function (CORE-4436)

18.07.2017 21:55, Adriano dos Santos Fernandes wrote:
>
> We have HASH function that returns a 64 bit integer. The algorithm is
> bad and the result too small for a hash.

Yes, and we have CORE-4436 with related complains.

> I propose to extend the same function with a second parameter with the
> algorithm name.

Agreed.

> When two parameters are passed, it will return a VARCHAR(64) CHARACTER
> SET OCTETS. That's sufficient for a SHA-256, for example.

Would it make sense to reserve more bytes, 256 or 1024 octets for
example? Just to avoid extending the result every five years...

> We implement (preferably getting code with compatible license or public
> domain - like was done for SHA-1 used in authentication - instead of use
> a bloated library) some algorithms (MD5, some useful SHA-*). No need for
> "hash plugins" for now.

Re: HASH function (CORE-4436)

On 18/07/2017 16:34, Dmitry Yemanov wrote:
> 18.07.2017 22:00, Leyne, Sean wrote:
>>
>> Would this approach have any performance advantages over using UDFs?
>
> People hate writing UDFs for common tasks. And IMHO getting a robust
> hash belongs to this category.
>
Moreover, we already have HASH and it's bad (can't be removed, but can
be improved).

Re: HASH function (CORE-4436)

On 18/07/2017 16:32, Dmitry Yemanov wrote:
>
>
>> When two parameters are passed, it will return a VARCHAR(64) CHARACTER
>> SET OCTETS. That's sufficient for a SHA-256, for example.
>
> Would it make sense to reserve more bytes, 256 or 1024 octets for
> example? Just to avoid extending the result every five years...

Ok for me, I believe 256 would be good then, as very large column is
difficult to read in ISQL.

Re: HASH function (CORE-4436)

On 18/07/2017 16:48, Dimitry Sibiryakov wrote:
> 18.07.2017 21:32, Dmitry Yemanov wrote:
>> Would it make sense to reserve more bytes, 256 or 1024 octets for
>> example? Just to avoid extending the result every five years...
>
> Just take care not to overflow index limit.
>
>

It will be described as VARCHAR, so real length depends on the algorithm
used by the user.

Re: HASH function (CORE-4436)

> On 18/07/2017 16:32, Dmitry Yemanov wrote:
> >
> >
> >> When two parameters are passed, it will return a VARCHAR(64)
> >> CHARACTER SET OCTETS. That's sufficient for a SHA-256, for example.
> >
> > Would it make sense to reserve more bytes, 256 or 1024 octets for
> > example? Just to avoid extending the result every five years...
>
> Ok for me, I believe 256 would be good then, as very large column is difficult
> to read in ISQL.

If we are going to support common HASH functions then we need to support all currently/commonly available.

In which case a length of at least 512 bits/128 bytes (for SHA-512 and SHA3-512) is required -- difficulty reading is the last thing that should be of concern.

Re: HASH function (CORE-4436)

> Hi!
>
> We have HASH function that returns a 64 bit integer. The algorithm is
> bad and the result too small for a hash.
>
> I propose to extend the same function with a second parameter with the
> algorithm name.
> > When only one parameter is passed, things works as now.
>
> When two parameters are passed, it will return a VARCHAR(64) CHARACTER
> SET OCTETS. That's sufficient for a SHA-256, for example.

Algorithm name could define result length. If alg name is passed as
parameter, we could use maximum known length (of all known algs) at
prepare time.

Re: HASH function (CORE-4436)

Em 18/07/2017 16:58, Dimitry Sibiryakov escreveu:
> 18.07.2017 21:52, Adriano dos Santos Fernandes wrote:
>> It will be described as VARCHAR, so real length depends on the algorithm
>> used by the user.
>
> Index is built on full data length, not real one.
>

People will not be crazy to create computed index on hashes.

They will store hash on field of appropriate length for the given hash
and index it.

Re: HASH function (CORE-4436)

We need to decide whether the algorithm name can be passed dynamically
(and thus be presented as "value" in the grammar) or must be predefined
(via a string literal or maybe token). The latter gives us more
flexibility regarding the result type. Tricks with runtime type coercion
are possible but require careful programming at the user side (proper
CASTs must present), I'd rather avoid that.

Re: HASH function (CORE-4436)

On 19/07/2017 05:27, Dimitry Sibiryakov wrote:
> 19.07.2017 2:47, Adriano dos Santos Fernandes wrote:
>> They will store hash on field of appropriate length for the given hash
>> and index it.
>
> On contrary, there is no point to store hash value if its purpose is
> to speed up lookup only.
>

I hope you don't need to recreate your index and recompute whole table
hashes with a slow algorithm.

In most cases original string (password) is not even stored and hashes
are used to check if a password matches a previously one.

Re: HASH function (CORE-4436)

> 19.07.2017 01:32, Adriano dos Santos Fernandes wrote:
>>
>>>
>>> Algorithm name could define result length.
>>
>> Yes, for constants.
>
> We need to decide whether the algorithm name can be passed dynamically
> (and thus be presented as "value" in the grammar) or must be
> predefined (via a string literal or maybe token). The latter gives us
> more flexibility regarding the result type.

This is an interesting idea.

We can use special syntax (like some others system functions):

HASH( <value> [WITH <algorithm>] )

<algorithm> will be an identifier (not keyword) interpreted at compile time.

Re: HASH function (CORE-4436)

On 19.07.2017 14:07, Adriano dos Santos Fernandes wrote:

> On 19/07/2017 02:29, Dmitry Yemanov wrote:
>> 19.07.2017 01:32, Adriano dos Santos Fernandes wrote:
>>>> Algorithm name could define result length.
>>> Yes, for constants.
>> We need to decide whether the algorithm name can be passed dynamically
>> (and thus be presented as "value" in the grammar) or must be
>> predefined (via a string literal or maybe token). The latter gives us
>> more flexibility regarding the result type.
> This is an interesting idea.
>
> We can use special syntax (like some others system functions):
>
> HASH( <value> [WITH <algorithm>] )
>
> <algorithm> will be an identifier (not keyword) interpreted at compile time.

Re: HASH function (CORE-4436)

> > We need to decide whether the algorithm name can be passed dynamically
> > (and thus be presented as "value" in the grammar) or must be
> > predefined (via a string literal or maybe token). The latter gives us
> > more flexibility regarding the result type.
>
> This is an interesting idea.
>
> We can use special syntax (like some others system functions):
>
> HASH( <value> [WITH <algorithm>] )

Why do we need to extend the current function?

Why not create separate, built-in, functions for each hash type with names* that align with the common algorithm name?

Re: HASH function (CORE-4436)

On 19/07/2017 11:44, Leyne, Sean wrote:

>
>>> We need to decide whether the algorithm name can be passed dynamically
>>> (and thus be presented as "value" in the grammar) or must be
>>> predefined (via a string literal or maybe token). The latter gives us
>>> more flexibility regarding the result type.
>> This is an interesting idea.
>>
>> We can use special syntax (like some others system functions):
>>
>> HASH( <value> [WITH <algorithm>] )
> Why do we need to extend the current function?
>
> Why not create separate, built-in, functions for each hash type with names* that align with the common algorithm name?
>

Because "HASH" is an excellent namespace/container for hash functions.