Feature #12142

Hash tables with open addressing

Hello, the following patch contains a new implementation of hash
tables (the major files are st.c and include/ruby/st.h).
Modern processors have several levels of cache. Usually, the CPU
reads one or a few cache lines from memory (or another level of
cache), so the CPU is much faster at reading data stored close
together. The current implementation of Ruby hash tables does not fit
modern processor cache organization well; faster programs require
better data locality.
The new hash table implementation achieves better data locality
mainly by
o switching to open addressing hash tables for access by keys.
Removing hash collision lists lets us avoid *pointer chasing*, a
common source of bad data locality. I see a tendency to move from
chaining hash tables to open addressing hash tables because they
fit modern CPU memory organizations better.
CPython recently made such a switch
(https://hg.python.org/cpython/file/ff1938d12240/Objects/dictobject.c).
PHP did this a bit earlier
(https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html).
GCC has used such hash tables widely
(https://gcc.gnu.org/svn/gcc/trunk/libiberty/hashtab.c) internally
for more than 15 years.
o removing doubly linked lists and putting the elements into an array
for access by inclusion order. This also removes the pointer chasing
on the doubly linked lists previously used for traversing elements
in inclusion order.
A more detailed description of the proposed implementation can be
found in the comment at the top of st.c.
The new implementation was benchmarked on 21 MRI hash table benchmarks
on the two most widely used targets, x86-64 (Intel 4.2GHz i7-4790K) and
ARM (Exynos 5410 - 1.6GHz Cortex-A15):
make benchmark-each ITEM=bm_hash OPTS='-r 3 -v' COMPARE_RUBY='<trunk ruby>'
Here are the results for x86-64:
hash_aref_dsym 1.094
hash_aref_dsym_long 1.383
hash_aref_fix 1.048
hash_aref_flo 1.860
hash_aref_miss 1.107
hash_aref_str 1.107
hash_aref_sym 1.191
hash_aref_sym_long 1.113
hash_flatten 1.258
hash_ident_flo 1.627
hash_ident_num 1.045
hash_ident_obj 1.143
hash_ident_str 1.127
hash_ident_sym 1.152
hash_keys 2.714
hash_shift 2.209
hash_shift_u16 1.442
hash_shift_u24 1.413
hash_shift_u32 1.396
hash_to_proc 2.831
hash_values 2.701
The average performance improvement is more than 50%. The ARM results
are analogous: no benchmark shows a performance degradation, and the
average improvement is about the same.
The patch can be seen at
https://github.com/vnmakarov/ruby/compare/trunk...hash_tables_with_open_addressing.patch
or, less conveniently, as pull request changes at
https://github.com/ruby/ruby/pull/1264/files
This is my first patch for MRI, and my proposal and implementation may
have pitfalls. But I am keen to learn and to work on getting this code
into MRI.

Associated revisions

You can see all of code history here:
<https://github.com/vnmakarov/ruby/tree/hash_tables_with_open_addressing>
This improvement is discussed at
<https://bugs.ruby-lang.org/issues/12142>
with many people, especially with Yura Sokolov.
* st.c: improve st_table.
* include/ruby/st.h: ditto.
* internal.h, numeric.c, hash.c (rb_dbl_long_hash): extract a function.
* ext/-test-/st/foreach/foreach.c: catch up this change.


Thank you for your quick response. I am not a Rubyist, but I like the
MRI code.

Do you compare memory usages?

Sorry, I did not, although I estimated it. By my estimate, in the
worst-case scenario memory usage will be about the same as for the
current hash tables, taking into account that the element size is now
half the old size and the element array is always at least 50% used.
This is because, when the hash table is rebuilt, the element array
size is doubled only if the array was 100% used before the rebuild;
if usage was lower, the new array will be the same size or smaller.

This estimate excludes cases where the current hash table uses packed
elements (up to 6), but I consider that a pathological case. The
proposed hash tables can use the same approach; it is even more natural
there, because the packed elements of the current hash tables have
exactly the same structure as the proposed table elements.

So the packed-element approach could be implemented for the proposed
implementation too. It means avoiding creation of the entries array
for small tables. I don't see it as necessary unless hash tables are
again used for method tables, where most tables are small. Hash tables
would be faster than the binary search currently used there. But that
is not critical code (at least for the benchmarks in MRI), since we
search the method table once per method and all other calls of the
method skip this search. I am sure you know this much better than I do.

Speaking of measurements: could you recommend credible benchmarks for
them? I have been in the benchmarking business for a long time, and I
know benchmarking can be an evil: it is possible to create benchmarks
which prove opposite things. In the compiler field we use
SPEC2000/SPEC2006, which is a consensus of most parties involved in
the compiler business. Does Ruby have something analogous?

(rare case) so many deletions can leave unused space (is it collected? I need to read the code more)

In the proposed implementation, the table size can be decreased, so in
a way the space is collected.

Reading the responses (all of which I am going to answer), I see people
are worried about memory usage. Smaller memory usage is important for
better data locality too (although better locality does not
automatically mean faster code; the access pattern matters as well).
But I consider speed the first priority these days, especially when
memory is cheap and will become much cheaper with upcoming memory
technology.

In many cases speed is achieved by methods which require more memory.
For example, the Intel compiler generates much bigger code than GCC to
achieve better performance (this is the most important competitive
advantage of their compiler).

This is actually the seventh variant of hash tables I have tried in
MRI. Only this variant achieved the best average improvement with no
benchmark performing worse.

I think the benefits outweigh the drawbacks.

Thanks, I really appreciate your opinion. I'll work on the issues
found, although I am a bit busy right now with the GCC 6 release.
I'll have more time to work on this in April.

We can generalize the last issue as "compaction".
I have not touched this issue yet (maybe it is not a big problem).

Trivial comments

At first, you (or we) should introduce st_num_entries() (or some better name) to wrap access to num_entries/num_elements, before your patch.

I'm not sure we should continue to use the name st; at least, st.c can be renamed.

Ok. I'll think about the terminology. Yura Sokolov wrote that changing
entries to elements can affect all rubygems. I did not know about
that; I was reckless in using terminology more familiar to me.

num_entries should remain num_entries. It is easier for you to change the naming than to fix all rubygems.

Thanks for pointing this out.

Do not change the formatting of code you do not change; it is awful to read and check that part of your patch.

I'll restore it. It is probably because I work on other projects
which use different formatting.

The speed improvement is not from open addressing but from array storage of elements. You could use chaining with a "next_index" with the same result. But perhaps open addressing is simpler to implement.

Sorry, I doubt this conclusion unless you prove it with benchmarks.
Open addressing removes a pointer, making each element smaller, which
increases the probability that it is read from memory in the same
cache line and/or stays in a cache.

In case of collisions, I believe probing another entry in the same
array improves data locality compared with following pointers through
scattered elements (which may have been allocated originally for
different tables).

On the other hand, you are right: I think the biggest improvement
comes from putting the elements into an array.

Open addressing is not simpler to implement either, though the code
might be smaller. For example, I suspect CPython's implementation of
hash tables is wrong and can cycle in extremely rare cases. For open
addressing, one should implement a full-cycle linear congruential
generator. I use X_next = (5 * X_prev + 1) mod 2^n, since it satisfies
the requirements of the Hull-Dobell theorem. The CPython function
lookdict violates the theorem requirement 0 <= X < m (the modulus) and
consequently is not a full-cycle linear congruential generator. So
implementing correct open addressing is not easy. It is easy if you
use prime table sizes (which is what the GCC hash tables use), but
dividing big values (hashes) by prime numbers is even slower
(> 100 cycles) than an access to main memory. By the way, I tried
using prime numbers too: on average that implementation was faster
than the current tables, but there were a few benchmarks where it
was slower.

If you stick with open addressing, then it could be even faster if you store the hash sum in st_entry.

I tried something analogous and it did not work. Storing the hash in
the entries makes each entry bigger, which increases the table size.

st_foreach is broken in case the table was rebuilt.

Sorry, I did not catch what you meant. st_foreach works fine in the
proposed implementation even if the table is rebuilt. Moreover,
st_foreach in the proposed implementation can run forever adding and
removing elements, whereas the current implementation will run out of
memory in that case.

Anyway, it is a great attempt that I wished to do myself but didn't find time for. I hope you'll fix all the issues and the change will be accepted.

I don't think the faster hash tables are that important for general Ruby speed.
I simply needed to start with some simpler project on MRI.

Quadratic probing is most probably not faster on modern superscalar
OOO CPUs than the secondary hash function I use. Quadratic probing
will traverse all entries for sure if the number of entries is a prime
number. As I wrote, division of big numbers (hashes) by a prime number
(for the primary hash function) takes > 100 cycles according to the
official Intel documents (and some researchers say it is much more
than the official figures). So using prime numbers is out of the
question (I really tried this approach too; it did not work). So for
fast tables, the size should be a power of 2.

As for not using perturbation: it means ignoring part of the hash (the
biggest part in most cases). I don't see that as logical. Continuing
this reasoning, we could generate a 32-bit or even 16-bit hash instead
of the 64-bit one we use now.

Thanks for the numbers. Is this a real-world scenario, I mean using
huge numbers of only small tables? I can imagine it when hash tables
were used for Ruby method tables, but that is not the case now.

Maybe I am wrong, but I don't see how it helps with the memory
problems (mostly GC) which dynamic languages experience.

But if you think that memory for such cases is important, we can
implement the same approach the current hash tables use when they are
small. The code will simply be more complicated. I started this work
once but did not like the code complication (the code might also be a
bit slower, as we would need to check in many places which variant of
the table is in use). Still, I can implement it if people think it is
important. Please just let me know the overall opinion about the
necessity of this.

I can imagine many cases: loading big data such as JSON and YAML, some kinds of data structures such as a Trie, data processing/aggregation, etc.

And I can confirm this problem not only with a micro benchmark but also with an actual program. My local program (not published yet) consumes 76684kB without your patch and 106600kB with your patch. I don't think it uses Hash that heavily, though.


Thanks for the cases.

OK. Then I should try to implement a compact representation of small tables even if it complicates the code.

But if you think that memory for such cases is important we can implement
the same approach as for the current hash tables when they are small.

I tried 4 different parameters for the test: 100000 1000000 10000000 100000000

The trunk ruby gives on my machine

15576kb
73308kb
650104kb
6520868kb

ruby with the proposed hash tables gives

15260kb
58268kb
795764kb
6300612kb

In 3 cases out of 4, the proposed hash tables are more compact than
the trunk ones. So you were unlucky with your test parameter.

This tells me that the size of big tables is probably not an important problem.

But the size of small tables might be, if it results in slower code.
Unfortunately it does: I confirm the slowdown problem exists for
tables of size <= 3 in the test from your first email. So I will work
on the small table problem.

Improving performance is a hard job (I have experienced this well,
doing such work on GCC for the last 20 years). There are always tests
where the result gets worse. The average results (or, even better, a
geometric mean) on credible benchmarks should be the major criterion.
If the requirement is that everything must improve, there will be no
progress on the performance front after some short initial period of
the project. I wrote this long paragraph to set other people's
expectations: whatever I do with the tables, someone can find a test
where the new tables are worse in speed and/or memory consumption.

Quadratic probing is most probably not faster on modern super-scalar OOO CPUs
than the secondary hash function I use. Quadratic probing will traverse all
entries for sure if # of entries is a prime number.

For m = 2^n, a good choice for the constants is c1 = c2 = 1/2, as the values of h(k,i) for i in [0, m-1] are all distinct. This leads to a probe sequence of h(k), h(k)+1, h(k)+3, h(k)+6, ..., where the values increase by 1, 2, 3, ...

Then you could use closed addressing with the same effect as your open addressing.
But instead of probing the next element in the entries array, you would probe it by a next index.
Cache locality will remain the same, because you'll probe "random" entries anyway
(unless you store the hash in st_entry and use quadratic probing, which has better cache locality).

This way the entries array could be smaller than the elements array. And a small table may have
no entries array at all, just chaining all elements while the hash has fewer than n elements (n = 6, for example).


I strongly believe st_index_t should be uint32_t. It will limit hash size to 2^32 elements, but even with this "small" table, the hash will consume at least 100GB of memory before the limit is reached.

Quadratic probing is most probably not faster on modern super-scalar OOO CPUs
than the secondary hash function I use. Quadratic probing will traverse all
entries for sure if # of entries is a prime number.

For m = 2^n, a good choice for the constants is c1 = c2 = 1/2, as the values of h(k,i) for i in [0, m-1] are all distinct. This leads to a probe sequence of h(k), h(k)+1, h(k)+3, h(k)+6, ..., where the values increase by 1, 2, 3, ...

I believe your code above is incorrect for tables whose size is a power of 2. The function should look like h(k,i) = (h(k) + c1 * i + c2 * i^2) mod m, where "c1 = c2 = 1/2 is a good choice". You cannot simplify it. The same Wikipedia article contains:

With the exception of the triangular number case for a power-of-two-sized hash table, there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not *prime*.

I don't see that the quadratic function for power-of-2 sizes is simpler than what I use.

It probes all elements of a table of size 2^n and has good cache locality for the first few probes. So if you store a 32-bit hash sum there, it will be very fast to check.

The only idea I like in your proposal is the better locality argument. Also, as I wrote before, your proposal means throwing away the biggest part of the hash value, even if it is a 32-bit hash. I don't think ignoring the big part of the hash is a good idea, as it probably worsens collision avoidance. Better locality also means a higher collision probability. Only benchmarking can tell this for sure, but I have reasonable doubts that it will be better.

Also, about storing only part of the hash: can it affect rubygems? It may be part of the API, but I don't know anything about it.

In any case, trying your proposal is a very low-priority task for me (the high-priority task is a compact small table representation). Maybe somebody else will try it. It is not a wise approach to try everything and then stop; I prefer step-by-step improvements.

I strongly believe st_index_t should be uint32_t. It will limit hash size to 2^32 elements, but even with this "small" table, the hash will consume at least 100GB of memory before the limit is reached.

I routinely use a few machines with 128GB of memory for my development. In a few years there is a big probability of non-volatile memory with the capacity of modern hard disks and access times comparable to current memory, and CPUs will have instructions for accessing it. At least some companies are working on these new technologies.

So, opposite to you, I don't believe we should make it 32-bit.

Besides, I tried it. My goal was to make the array entry smaller for small and medium tables in the hope of improving performance. Unfortunately, when I made the index 32 bits, I did not see a visible performance improvement (just noise) on the 21 hash table performance benchmarks on x86-64.

I don't know how this thread switched from table speed to memory usage. Memory usage is important, especially if it results in a speed improvement, but I have always considered it secondary (although it was quite different when I started programming on old supercomputers with 32K 48-bit words).

Next note: sizeof(struct st_table) was 48 bytes before the patch on x86_64; after it, it is 88 bytes (which will be rounded up to 96 bytes by jemalloc).

Small hashes are very common despite your expectations, because keyword arguments are passed as hashes, and most methods have just a couple of arguments.

OK. As I wrote, I am going to implement a compact representation of small tables. This will have a bigger effect on size than the header optimization. The current table's minimal hash size (a compact representation), if I remember correctly, is 48 + 6x3x8 = 192B. By my estimate, the minimal size of my tables will be 96 + 4x3x8 = 192B. So I will focus on the small table compact representation instead of header optimizations, as it also solves the slowdown problem for tables with fewer than 4 elements which I wrote about recently.

If you'll change st_index_t to uint32_t, then sizeof(struct st_table) will be 56 bytes.

Please see my previous email about my opinion on this proposal.

hash_mask == allocated_entries - 1, so there is no need to store it.

I can and will do this; I missed it. That is what the other implementations I mentioned do. Thanks for pointing it out.

If you use closed addressing, then there is no need for deleted_entries, and the entries array could be no bigger than the elements array.

Vladimir, you act as if I said rubbish or am trying to cheat you. It makes me angry.

You wrote:

I believe your code above is incorrect for tables of sizes of power of 2.
The function should look like h(k,i) = (h(k) + c1 * i + c2 * i^2) mod m,
where "c1 = c2 = 1/2 is a good choice". You can not simplify it.

And you cited Wikipedia

With the exception of the triangular number case for a power-of-two-sized hash table,
there is no guarantee of finding an empty cell once the table gets more than half full

But a couple of lines above you cited my quote from Wikipedia:

This leads to a probe sequence of h(k), h(k)+1, h(k)+3, h(k)+6, ...
where the values increase by 1, 2, 3, ...

Do you read carefully before answering? It is an implementation of the triangular number sequence - a single quadratic probing
sequence which walks across all elements of a 2^n table.

Please check yourself before saying another man is mistaken.
Or at least say it with less confidence.

Also, as I wrote before, your proposal means just throwing away the biggest part of the hash value, even if it is a 32-bit hash.
I don't think ignoring the big part of the hash is a good idea, as it probably worsens collision avoidance.

Please read about the Birthday Paradox carefully: https://en.wikipedia.org/wiki/Birthday_problem
Yes, it will certainly increase the probability of full hash value collisions, but only for very HUGE hash tables.
And it doesn't affect the length of a collision chain (because 2^n tables use only the low bits).
It just affects the probability of an excess call to the equality check on a value, and not by much:

In other words, only 2% of full hash collisions on a Hash with 100_000_000 elements, and 7% for 300_000_000 elements.
Can you measure how much time insertion of 100_000_000 elements into a Hash (current or your implementation) will take,
and how much memory it will consume? Int=>Int? String=>String?

At my work we use huge in-memory hash tables (hundreds of millions of elements) in a custom in-memory DB (not Ruby),
and it uses a 32-bit hash sum. No problems at all.

Also, about storing only part of the hash: can it affect rubygems? It may be part of the API, but I don't know anything about it.

Gems would have to be recompiled, but no code changes are needed.

I routinely use a few machines for my development with 128GB memory.

But you wouldn't use a Ruby process which consumes 100GB of memory through a Ruby Hash; otherwise you'll get into big trouble (with GC, for example).
If you need to store that amount of data within a Ruby process, you'd better build your own data structure.
I've made one for my needs: https://rubygems.org/gems/inmemory_kv (https://github.com/funny-falcon/inmemory_kv).
It can also store only 2^31 elements, but I hardly believe you will ever store more inside a Ruby process.

Could you imagine that a Hash with 1M elements starts to rebuild?
I can. The current tables do it all the time already, and it means traversing all the elements, as in the proposed tables' case.

The current st_table rebuilds only if its size grows. Your table will rebuild even if the size has not changed much but elements are inserted and deleted repeatedly (1 add, 1 delete, 1 add, 1 delete).

Maybe it is better to keep st_index_t prev, next in struct st_table_entry (or struct st_table_element, as you called it)?
Sorry, I cannot catch what you mean. What should prev and next be used for?
How can it avoid table rebuilding, which always means traversing all elements to find a new entry or bucket for each element?

Yeah, it is inevitable to maintain a free list for finding a free element.
But prev/next indices will allow inserting new elements in random places (deleted before),
because iteration will go through these pseudo-pointers.

Perhaps it is better to make a separate LRU hash structure in the standard library
and keep the Hash implementation as you suggest.
I really like this case, but it means Ruby will have two hash tables - one for Hash and one for LRU.

I'm on Debian wheezy, x86-64, gcc-4.7.real (Debian 4.7.2-5) 4.7.2.
There may be other tests which fail, but I didn't investigate
further.

Other notes (mostly reiterating other comments):

In my experience, more users have complained about memory usage than
about performance since the Ruby 1.9/2.x days. Maybe it's because,
traditionally, the Ruby ecosystem did not have great non-blocking
I/O or thread support; users find it easier to fork processes
instead.

Small hashes are common in Ruby and important for parameter
passing, ivar indices in small classes, short-lived statistics,
and many other cases. Because hashes are so easy to create,
Rubyists tend to create many of them.

st.h is unfortunately part of our public C API, so num_entries
shouldn't change. I propose we hide the new struct fields
somehow, in similar fashion to private_list_head, or at least give
them scary names which discourage public use.

Anyway, I'm excited about these changes and hope we can get
most of the benefits without the downsides.

Vladimir, you act as if I said rubbish or am trying to cheat you. It makes me angry.

You wrote:

I believe your code above is incorrect for tables of sizes of power of 2.
The function should look like h(k,i) = (h(k) + c1 * i + c2 * i^2) mod m,
where "c1 = c2 = 1/2 is a good choice". You can not simplify it.

And you cited Wikipedia

With the exception of the triangular number case for a power-of-two-sized hash table,
there is no guarantee of finding an empty cell once the table gets more than half full

But a couple of lines above you cited my quote from Wikipedia:

This leads to a probe sequence of h(k), h(k)+1, h(k)+3, h(k)+6, ...
where the values increase by 1, 2, 3, ...

Do you read carefully before answering? It is an implementation of the triangular number sequence - a single quadratic probing
sequence which walks across all elements of a 2^n table.

Please check yourself before saying another man is mistaken.
Or at least say it with less confidence.

I am really sorry. People make mistakes. I should not have written this in a hurry while running errands.

Still, I cannot see how using the function (p + d) & h instead of (i << 2 + i + p + 1) & m visibly speeds up hash tables (which are memory-bound code) on modern OOO superscalar CPUs. That is besides the advantage (even if tiny, as you write below) of decreasing collisions by using the whole hash.

Actually, I did an experiment. I tried these two functions on Intel Haswell, with a memory access to the same address (so the value stays in the L1 cache) after each function calculation. I used -O3 for GCC and ran the functions 10^9 times. The result is a bit strange: the code with the function (i << 2 + i + p + 1) & m is about 7% faster than the one with the simpler function (p + d) & h (14.5s vs. 15.7s). It is sometimes hard to predict the outcome, as modern x86-64 processors are black boxes and actually interpreters inside. But even if it were the opposite, the absence of the value in the cache, and the fact that this function is a small part of the code for access by a key, would probably make the difference insignificant.

Also, as I wrote before, your proposal means just throwing away the biggest part of the hash value, even if it is a 32-bit hash.
I don't think ignoring the big part of the hash is a good idea, as it probably worsens collision avoidance.

Please read about the Birthday Paradox carefully: https://en.wikipedia.org/wiki/Birthday_problem
Yes, it will certainly increase the probability of full hash value collisions, but only for very HUGE hash tables.
And it doesn't affect the length of a collision chain (because 2^n tables use only the low bits).
It just affects the probability of an excess call to the equality check on a value, and not by much:

In other words, only 2% of full hash collisions on a Hash with 100_000_000 elements, and 7% for 300_000_000 elements.
Can you measure how much time insertion of 100_000_000 elements into a Hash (current or your implementation) will take,
and how much memory it will consume? Int=>Int? String=>String?

On a machine currently available to me,

./ruby -e 'h = {}; 100_000_000.times {|n| h[n] = n }'

takes 7 minutes with my implementation.

The machine does not have enough memory for 300_000_000 elements, so I did not try.

In the GCC community, where I am from, we are happy if all of us together improve SPEC2000/SPEC2006 by 1%-2% during a year. So if I can use the whole hash without a visible slowdown, even if it decreases the number of collisions by only 1% on big tables, I'll take that chance.

At my work we use huge in-memory hash tables (hundreds of millions of elements) in a custom in-memory DB (not Ruby),
and it uses a 32-bit hash sum. No problems at all.

Also, about storing only part of the hash: can it affect rubygems? It may be part of the API, but I don't know anything about it.

Gems would have to be recompiled, but no code changes are needed.

Thanks for the answer.

I routinely use a few machines for my development with 128GB memory.

But you wouldn't use a Ruby process which consumes 100GB of memory through a Ruby Hash; otherwise you'll get into big trouble (with GC, for example).
If you need to store that amount of data within a Ruby process, you'd better build your own data structure.
I've made one for my needs: https://rubygems.org/gems/inmemory_kv (https://github.com/funny-falcon/inmemory_kv).
It can also store only 2^31 elements, but I hardly believe you will ever store more inside a Ruby process.

IMHO, it is better to fix the problems that can occur when using tables with > 2^32 elements than to introduce that hard constraint. There are machines today with enough memory to hold tables with > 2^32 elements. You are right that it is probably not wise to use MRI for regular work with such big tables, because MRI is currently slow at it, but it can be used occasionally for prototyping. And who knows, maybe MRI will become much faster.

But my major argument is that using a 32-bit index does not speed up work with hash tables. As I wrote, I tried it, and the 32-bit index did not improve performance. So why should we create such a hard constraint?

Could you imagine that a Hash with 1M elements starts to rebuild?
I can. The current tables do it all the time already, and it means traversing all the elements, as in the proposed tables' case.

The current st_table rebuilds only if its size grows. Your table will rebuild even if the size has not changed much but elements are inserted and deleted repeatedly (1 add, 1 delete, 1 add, 1 delete).

Maybe it is better to keep st_index_t prev, next in struct st_table_entry (or struct st_table_element, as you called it)?
Sorry, I cannot catch what you mean. What should prev and next be used for?
How can it avoid table rebuilding, which always means traversing all elements to find a new entry or bucket for each element?

Yeah, it is inevitable to maintain a free list for finding a free element.
But prev/next indices will allow inserting new elements in random places (deleted before),
because iteration will follow these pseudo-pointers.
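If I read the suggestion right, it could be sketched like this (a Ruby model with hypothetical names, not C code from either patch): entries live in one array and carry prev/next *indices* (pseudo-pointers), so iteration follows insertion order, while deleted slots go on a free list and are reused on insert instead of forcing a rebuild of the whole table.

```ruby
class OrderedTable
  Entry = Struct.new(:key, :value, :prev, :next)

  def initialize
    @entries = []          # all entries in one array, good for locality
    @index   = {}          # key -> slot; stands in for the open-addressing part
    @head = @tail = nil    # insertion-order list, linked by indices
    @free = []             # slots of deleted entries, reused on insert
  end

  def []=(key, value)
    if (slot = @index[key])
      @entries[slot].value = value
      return
    end
    slot = @free.pop || @entries.size          # reuse a hole instead of growing
    @entries[slot] = Entry.new(key, value, @tail, nil)
    @entries[@tail].next = slot if @tail
    @tail = slot
    @head ||= slot
    @index[key] = slot
  end

  def delete(key)
    slot = @index.delete(key) or return nil
    e = @entries[slot]
    if e.prev then @entries[e.prev].next = e.next else @head = e.next end
    if e.next then @entries[e.next].prev = e.prev else @tail = e.prev end
    @entries[slot] = nil
    @free << slot
    e.value
  end

  def each_pair                                # traversal by insertion order
    slot = @head
    while slot
      e = @entries[slot]
      yield e.key, e.value
      slot = e.next
    end
  end
end
```

With this layout, the repeated 1-add/1-delete pattern mentioned above never grows the entries array: each insert pops a hole left by the previous delete.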

There is no rebuilding if you use the hash tables as a stack. The same can also be achieved for a queue with some minor changes. As I wrote, there will always be a test where a new implementation behaves worse. It is a decision in choosing what is better: say, a 50% improvement on some access patterns or an n% improvement on other access patterns. I don't know how big n is. I tried an approach analogous to what you proposed. According to my notes (at that time MRI had only 17 tests), the results were the following:

name                  time
hash_aref_dsym        0.811
hash_aref_dsym_long   1.360
hash_aref_fix         0.744
hash_aref_flo         1.123
hash_aref_miss        0.811
hash_aref_str         0.836
hash_aref_sym         0.896
hash_aref_sym_long    0.847
hash_flatten          1.149
hash_ident_flo        0.730
hash_ident_num        0.812
hash_ident_obj        0.765
hash_ident_str        0.797
hash_ident_sym        0.807
hash_keys             1.456
hash_shift            0.038
hash_values           1.450

But unfortunately they are not representative, as I used prime numbers as the sizes of the tables and just mod for mapping a hash value to an entry index.

Perhaps it is better to make a separate LRU hash structure in the standard library instead,
and keep the Hash implementation as you suggest.
I really like this case, but it means Ruby would have two hash table implementations: one for Hash and one for LRU.

I don't know.

In any case, it is not up to me to decide the size of the index and some other things discussed here. That is probably why I should not have participated in this discussion.

We spent a lot of time arguing. But what we should do is try. Only real experiments can prove or disprove our speculations. Maybe I'll try your ideas, but only after adding my code to the MRI trunk. I still have to solve the small table problem people wrote me about before doing this.

I'm on Debian wheezy, x86-64 gcc-4.7.real (Debian 4.7.2-5) 4.7.2
There may be other tests which fail, but I didn't investigate
further.

I am a novice to MRI. When I used test-all recently I had some errors on the trunk both with and without my code. So I used only test. I guess I should try what you did.

Other notes (mostly reiterating other comments):

In my experience, more users complain about memory usage than
performance since Ruby 1.9/2.x days. Maybe it's because
traditionally, the Ruby ecosystem did not have great non-blocking
I/O or thread support; users find it easier to fork processes,
instead.

Interesting. I got a different impression: that people complain more about MRI being slow. Moving to a VM was a really great improvement (2-3 times, as I remember) and made Ruby actually faster than CPython. But it is still much slower than PyPy, JS, or LuaJIT. That is probably why people are complaining. Getting this impression, I decided to try to help improve MRI performance. MRI is the language definition, so I think working on alternative Ruby implementations would not be wise. I have some ideas and hope my management permits me to spend part of my time working on their implementation.

Small hashes are common in Ruby and important to parameter
passing, ivar indices in small classes, short-lived statistics,
and many other cases. Because hashes are so easy-to-create,
Rubyists tend to create many of them.

Thanks. I completely realize now that compact hash tables are important.

st.h is unfortunately part of our public C API; so num_entries
shouldn't change. I propose we hide the new struct fields
somehow in similar fashion to private_list_head or at least give
them scary names which discourage public use.

Yes, I'll fix it and use the old names or try to figure out a better terminology which does not change names in st.h.

Anyways, I'm excited about these changes and hope we can get
the most of the benefits without the downsides.

Haven't you forgotten that d should be incremented at every step? Otherwise it is linear probing instead of quadratic probing.
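The difference can be seen in a small sketch (illustrative, not the actual st.c code): with a power-of-two table, incrementing the step d on every probe gives quadratic (triangular-number) probing, which still visits every slot exactly once; without the increment it collapses into plain linear probing.

```ruby
# Probe-sequence sketch for a table of size mask + 1 (a power of two).
def probe_sequence(start, mask, quadratic: true)
  seq = []
  ind = start & mask
  d = 1
  (mask + 1).times do
    seq << ind
    ind = (ind + d) & mask
    d += 1 if quadratic   # forgetting this increment degenerates to linear probing
  end
  seq
end

probe_sequence(0, 7)                    # => [0, 1, 3, 6, 2, 7, 5, 4] -- all 8 slots
probe_sequence(0, 7, quadratic: false)  # => [0, 1, 2, 3, 4, 5, 6, 7]
```

Triangular-number probing over a power-of-two table is a permutation of all slots, which is why the increment matters beyond just spreading out collisions.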

So if I can use the full hash without a visible slowdown, even if it decreases the number of collisions by only 1% on big tables, I'll take that chance.

But my major argument is that using a 32-bit index does not speed up work with hash tables. As I wrote, I tried it, and using a 32-bit index did not improve the performance. So why should we create such a hard constraint then?

If you stick with open addressing, then storing a 32-bit hash in the 'entries' array and using a 32-bit index may speed up the hash.

If you switch to closed addressing, then a 32-bit hash + 32-bit next pointer will allow us not to increase the element size.

Prototyping with 2^31 elements in memory?
I work at a small social network (monthly audience 30M, daily 5M) that has existed for 9 years, and some in-memory tables sharded across 30 servers are just getting close to 2^30 in total (i.e. summed over all servers).

Please test the time for inserting 200_000_000 and 300_000_000 elements; does the time grow linearly?
And you didn't measure String keys or/and values.
You said your computer has memory for 300_000_000 elements; how much is that? How much memory will 1_000_000_000 Int=>Int elements consume? How much will 1_000_000_000 String=>String consume?

Concerning prev,next

I tried an approach analogous to what you proposed.
hash_shift 0.038

The hash_shift result shows that your implementation had flaws, so the performance numbers are not representative. I do not expect a performance change of less than 0.95.

What if it is not LRU? What if it is a general-purpose 100_000_000-element table?
You said a lot about a 1% improvement on such a table.
What will you say about an excessive 2-second pause for rebuilding such a table?
How long will the pause be for a 1_000_000_000-element table?

7 hours to test a prototype? And that is just Int=>Int! With strings it will be several times more.

At this level people switch to Go, Clojure or something else.

Yura, after reading your last two emails my first impulse was to answer them. But doing so I've realized that I would need to repeat my arguments a second or third time, do some experiments trying to figure out why the test with big parameters is slow in Ruby, and think about how it could be fixed in MRI or at the language level. This discussion has become unproductive for me and is wasting my time. The direction of the discussion just confirms to me once again that I should not have participated in it.

I did not change my opinion, and you did not change yours. As I wrote, it is not up to me to decide what size of index and hash (32-bit or 64-bit on 64-bit machines) we should use. I don't know how such decisions are made in the Ruby/MRI community. If the community decides to use 32-bit indexes and hashes, I am ready to change my code. You just need to convince the decision makers to do such a change. As for me, I am stopping discussing these issues. I am sorry for doing this.

When you want to use a hash as an in-memory key-value store, it is quite natural for it to experience lots of random additions / deletions. I think this situation happens in real-world programs. Is this intentional or just a work-in-progress?

I'm also unsure about the worst-case time complexity of your strategy. At a glance it seems not impossible for an attacker to intentionally collide entries (I hope I'm wrong). In such situations, the old hash we had did rebalancing, called rehash, to avoid long chains. Is there anything similar, or you don't have to, or something else?

Anyways, this is a great contribution. I wish I could do this. I hope you revise the patch to reflect other discussions and get it merged.

I don't think it is a leak. What you measure is the maximum resident set size. I think the table is rebuilt many times, and memory for previous versions of the table is freed, but it is not returned to the OS by MRI (or glibc; I don't know yet what allocation library is used by MRI). Still, this is very bad. I should definitely investigate and fix it. I believe I know how to fix it: I should reuse the array elements when it is possible. Thanks for pointing this out.

Would you mind if I include your test in the final version of the patch as a benchmark?

When you want to use a hash as an in-memory key-value store, it is quite natural for it to experience lots of random additions / deletions. I think this situation happens in real-world programs. Is this intentional or just a work-in-progress?

The proposed hash tables will work with random additions/deletions. I just did not know what the exact performance would be in comparison with the current tables. As I became aware of your case now (Yura Sokolov also wrote about it), it will be a work-in-progress for me.

I am not sure your case is a real-world scenario (removing the last element is), but it is definitely the worst case for my implementation.

I'm also unsure about the worst-case time complexity of your strategy. At a glance it seems not impossible for an attacker to intentionally collide entries (I hope I'm wrong). In such situations, the old hash we had did rebalancing, called rehash, to avoid long chains. Is there anything similar, or you don't have to, or something else?

The worst case is probably the same as for the current tables. It is theoretically possible to create test data which results in usage of the same entry for both the current and proposed tables. But in practice it is impossible for a medium-size table, even though murmur hash is not a cryptography-level hash function like, for example, SHA-2.

I've specifically chosen a very small hash table load factor (0.5) to make the chance of collisions very small and rebuilding less frequent (such a parameter still results in about the same non-small hash table sizes). I think even a maximal load factor of 3/4 would work well to avoid collision impact. People can experiment with such parameters and change them later if something works better, of course, if the proposed tables get into the trunk.

But still if there are a lot of collisions the same strategy can be used -- table rebuilding. I'll think about this.

Anyways, this is a great contribution. I wish I could do this. I hope you revise the patch to reflect other discussions and get it merged.

Yes, I'll work on the revised patch. The biggest change, as I see it right now, will be a compact representation of small tables. Unfortunately, I cannot promise the revised patch will be ready soon, as I am busy fixing bugs for GCC 6. But this work on GCC 6 will be finished in a month.

I don't think it is a leak. What you measure is the maximum resident set size. I think the table is rebuilt many times, and memory for previous versions of the table is freed, but it is not returned to the OS by MRI (or glibc; I don't know yet what allocation library is used by MRI). Still, this is very bad. I should definitely investigate and fix it. I believe I know how to fix it: I should reuse the array elements when it is possible. Thanks for pointing this out.

So the packed element approach could be implemented too for the proposed
implementation.

I agree.

I don't see it is necessary unless the hash tables will
be again used for method tables where most of them are small.

As some people said, there are many small Hash objects, like this:

def foo(**opts)
  do_something opts[:some_option] || default_value
end

foo(another_option: customized_value)

BTW, from Ruby 2.2, most keyword parameter passing does not create a
Hash object. In the above case, a hash object is created explicitly (using a ** keyword hash parameter).

Hash tables will be faster than the binary search used now. But it is not
critical code (at least for the benchmarks in MRI), as we search the method table
once for a method and all other calls of the method skip this search.
I am sure you know it much better.

Maybe we will continue to use id_table for the method table, or something. It is
specialized for ID-keyed tables.

BTW (again), I (intuitively) think linear search is faster than using a
hash table for a small number of elements. We don't need to touch the entries table.
(But I have no evidence.)

However, if we collect hash values (making a hash value array), we only
need to load 8 * 8B = 64B.

... sorry, it is not simple :p

Speaking of measurements: could you recommend credible benchmarks for
the measurements? I have been in the benchmarking business for a long time, and I
know benchmarking can be an evil. It is possible to create benchmarks
which prove opposite things. In the compiler field, we use
SPEC2000/SPEC2006, which is a consensus of most parties involved in the
compiler business. Does Ruby have something analogous?

As other people said, I agree, and Ruby does not have enough benchmarks :(
I think the discourse benchmark can help.
I think discourse benchmark can help.

In the proposed implementation, the table size can be decreased. So in
some way it is collected.

Reading the responses, to all of which I am going to answer, I see people
are worrying about memory usage. Smaller memory usage is important
for better locality too (although better locality does not mean
faster code automatically -- the access pattern is important too). But
I consider speed the first priority these days (especially when memory
is cheap and will become much cheaper with newly coming memory
technology).

In many cases speed is achieved by methods which require more memory.
For example, the Intel compiler generates much bigger code than GCC to
achieve better performance (this is the most important competitive
advantage of their compiler).

Case by case.
For example, Heroku's smallest dyno only provides 512MB.

I think the goods overcome the bads.

Thanks, I really appreciate your opinion. I'll work on the found
issues, although I am a bit busy right now with work on the GCC 6 release.
I'll have more time to work on this in April.

I don't think it is a leak. What you measure is the maximum resident set size. I think the table is rebuilt many times, and memory for previous versions of the table is freed, but it is not returned to the OS by MRI (or glibc; I don't know yet what allocation library is used by MRI). Still, this is very bad. I should definitely investigate and fix it. I believe I know how to fix it: I should reuse the array elements when it is possible. Thanks for pointing this out.

Thank you. I tried this too and see it is a real problem I should solve.

(rare case) many deletions can leave unused space (is it collected? I need to read the code more)

...

We can generalize the last issue as "compaction".
This is an issue I haven't touched yet (maybe not a big problem).

And you didn't respond about that :)

No, I did not. Sorry. I answered about another way the proposed tables can reclaim memory. Now, understanding what you meant, I can say that there is no compaction. But you already know that :)

My thought behind skipping compaction was that in any case we would have to traverse all elements, and it would take practically the same time as table rebuilding, while this approach simplifies the code considerably. Getting responses from people providing small tests, I see that I was wrong. I missed the effect of the memory allocation layer. So we should definitely have the compaction.

Another thought I had recently is that compaction, even though it traverses all elements, has a better chance to work with cached data than rebuilding does.

So the packed element approach could be implemented too for the proposed
implementation.

I agree.

I don't see it is necessary unless the hash tables will
be again used for method tables where most of them are small.

As some people said, there are many small Hash objects, like this:

def foo(**opts)
  do_something opts[:some_option] || default_value
end

foo(another_option: customized_value)

BTW, from Ruby 2.2, most keyword parameter passing does not create a
Hash object. In the above case, a hash object is created explicitly (using a ** keyword hash parameter).

To be honest, I did not know that some parameters are passed as hash objects. I am not a Ruby programmer, but I am learning.

Hash tables will be faster than the binary search used now. But it is not
critical code (at least for the benchmarks in MRI), as we search the method table
once for a method and all other calls of the method skip this search.
I am sure you know it much better.

Maybe we will continue to use id_table for the method table, or something. It is
specialized for ID-keyed tables.

BTW (again), I (intuitively) think linear search is faster than using a
hash table for a small number of elements. We don't need to touch the entries table.
(But I have no evidence.)

However, if we collect hash values (making a hash value array), we only
need to load 8 * 8B = 64B.

... sorry, it is not simple :p

I agree it is not simple. Especially if we take into account the effect of the execution of other parts of the program on caches (how code before, after, and in parallel with the given code accesses memory). In general, the environment effect can be important. For example, I read a lot of research papers about compiler optimizations. They always claim big or modest improvements. But when you try the same algorithm in the GCC environment (not in a toy compiler), the effect is frequently smaller, and in rare cases the effect is even the opposite (worse performance). Therefore, something affirmative can be said only after the final implementation.

If I correctly understood the MRI VM code, you need to search for an id table element usually once, and after that the found value is stored in the corresponding VM call insn. The search is done again if an object method or a class method is changed. Although this pattern ("monkey patching") exists, I don't consider it frequent. So I conclude the search in the id table is not critical (maybe I am wrong).

For non-critical code, in my opinion the best strategy is to minimize cache changes for other parts of a program. From this point of view, linear or binary search is probably a good approach, as the data structure used has a minimal footprint.
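The "collected hash values" idea quoted above could be sketched like this (a hypothetical layout, not MRI's id_table): a small table keeps a flat array of hash codes beside its keys/values, and a lookup scans the hash array linearly; for 8 entries that is 8 * 8B = 64B, about one cache line, with no bin array to touch at all.

```ruby
# Hypothetical small-table layout: parallel arrays of hash codes, keys, values.
SmallTable = Struct.new(:hashes, :keys, :values) do
  def store(key, value)
    hashes << key.hash
    keys   << key
    values << value
  end

  def find(key)
    h = key.hash
    hashes.each_index do |i|
      # cheap integer comparison first; the full eql? check only on a hash match
      return values[i] if hashes[i] == h && keys[i].eql?(key)
    end
    nil
  end
end
```

Usage: `t = SmallTable.new([], [], []); t.store(:a, 1); t.find(:a)` returns 1. Past some small size threshold, such a table would be converted into the full hash representation.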

Speaking of measurements: could you recommend credible benchmarks for
the measurements? I have been in the benchmarking business for a long time, and I
know benchmarking can be an evil. It is possible to create benchmarks
which prove opposite things. In the compiler field, we use
SPEC2000/SPEC2006, which is a consensus of most parties involved in the
compiler business. Does Ruby have something analogous?

As other people said, I agree, and Ruby does not have enough benchmarks :(
I think the discourse benchmark can help.

As I know, there are a lot of applications written in Ruby. Maybe it is possible to adapt a few non-IO- or network-bound programs for benchmarking. It would be really useful. The current MRI benchmarks are micro-benchmarks; they don't let us see a bigger picture. Some Red Hat people recommended that I use fluentd for benchmarking. But I am not sure about this.

In the proposed implementation, the table size can be decreased. So in
some way it is collected.

Reading the responses, to all of which I am going to answer, I see people
are worrying about memory usage. Smaller memory usage is important
for better locality too (although better locality does not mean
faster code automatically -- the access pattern is important too). But
I consider speed the first priority these days (especially when memory
is cheap and will become much cheaper with newly coming memory
technology).

In many cases speed is achieved by methods which require more memory.
For example, the Intel compiler generates much bigger code than GCC to
achieve better performance (this is the most important competitive
advantage of their compiler).

Case by case.
For example, Heroku's smallest dyno only provides 512MB.

I think the goods overcome the bads.

Thanks, I really appreciate your opinion. I'll work on the found
issues, although I am a bit busy right now with work on the GCC 6 release.
I'll have more time to work on this in April.

I don't think it is a leak. What you measure is the maximum resident set size. I think the table is rebuilt many times, and memory for previous versions of the table is freed, but it is not returned to the OS by MRI (or glibc; I don't know yet what allocation library is used by MRI). Still, this is very bad. I should definitely investigate and fix it. I believe I know how to fix it: I should reuse the array elements when it is possible. Thanks for pointing this out.

OK. Looking forward to the fix.

Would you mind if I include your test in the final version of the patch as a benchmark?

No problem. Do so, please.

When you want to use a hash as an in-memory key-value store, it is quite natural for it to experience lots of random additions / deletions. I think this situation happens in real-world programs. Is this intentional or just a work-in-progress?

The proposed hash tables will work with random additions/deletions. I just did not know what the exact performance would be in comparison with the current tables. As I became aware of your case now (Yura Sokolov also wrote about it), it will be a work-in-progress for me.

I am not sure your case is a real-world scenario (removing the last element is), but it is definitely the worst case for my implementation.

(WIP is definitely OK with me.) Let me think of a more realistic use case.

Often, a key-value store comes with expiration of keys. This is typically done by storing the expiration time in each hash element, then periodically calling st_foreach() with a pointer argument to an expiration-checking function. To simulate this, I wrote the following snippet:
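The snippet itself is not reproduced here; the workload described might be simulated with something like this (an illustrative reconstruction, not the original code): every key carries an expiration time, and a periodic sweep walks the whole table deleting expired entries, so the table sees a steady mix of additions and batched deletions.

```ruby
# Illustrative reconstruction of the described workload (not the original
# snippet): values stand in for expiration times, and a periodic pass
# deletes everything that has expired.
store = {}
now = 0
100_000.times do |i|
  now += 1
  store[i] = now + 100                    # entry expires 100 ticks after insertion
  if (now % 1_000).zero?                  # periodic expiration sweep
    store.delete_if { |_, expires_at| expires_at < now }
  end
end
```

The steady insert/sweep rhythm is what keeps the table's live size small while continually churning its slots.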

It ran with your patch and resulted in 37,668kb max resident memory. Without the patch, it took 8,988kb.

The above might still be a bit artificial, but I believe this situation happens in the wild.

The worst case is probably the same as for the current tables. It is theoretically possible to create test data which results in usage of the same entry for both the current and proposed tables. But in practice it is impossible for a medium-size table, even though murmur hash is not a cryptography-level hash function like, for example, SHA-2.

I've specifically chosen a very small hash table load factor (0.5) to make the chance of collisions very small and rebuilding less frequent (such a parameter still results in about the same non-small hash table sizes). I think even a maximal load factor of 3/4 would work well to avoid collision impact. People can experiment with such parameters and change them later if something works better, of course, if the proposed tables get into the trunk.

But still if there are a lot of collisions the same strategy can be used -- table rebuilding. I'll think about this.

Yes. Please do. "People are not evil(smart) enough to do this"-kind assumptions tend to fail.

make benchmark-each ITEM=bm_hash OPTS='-r 3 -v' COMPARE_RUBY='<trunk ruby>' is broken :-(
It shows a speedup even when comparing with the same ruby after make install.
I suppose it measures startup time, which increases after install, or it measures miniruby vs ruby.

As you will see, only creating huge Hashes is improved by both my and Vladimir's patches.
(OK, Vladimir's one also has better iteration time.)

Second:

I've made my version of the patch:

o closed addressing + doubly linked list for insertion order

o 32-bit indices instead of pointers, and 32-bit hashes

o some tricks to reduce the size of st_table and fit arrays into sizes comfortable for allocators

Memory usage is less than Vladimir's and comparable to trunk
(sometimes less than trunk, sometimes bigger).
Usage may be bigger because the array is preallocated with room for empty elements.
My patch has the ability to specify more intermediate steps of increasing capacities,
but allocator behavior should be considered.

So, as I said, the main advantage comes from storing st_table_entry in an array, so
fewer calls to malloc are performed, there are fewer TLB misses, and less memory is used.
Open addressing gives almost nothing to performance in this case, because
it is not a case where open addressing plays well.

Open addressing plays well when your whole key-value structure is small and stored
inside the hash array itself. But in the case of Ruby's Hash we store st_table_entry outside of
the open-addressing array, so a jump is performed and the main benefit (cache locality) is lost.

Vladimir's proposal of storing insertion order by position in an array can still
benefit memory usage, if carefully designed.
By the way, PHP's array does exactly this (but uses closed addressing).

My opinion is the same -- we should stay with 64-bit for perspective in the future and fix the slowness of work with huge tables if it is possible.

You showed many times that most languages permit a maximum of 2^31 elements, but probably there are exceptions.

I know only one besides MRI -- Dino (http://dino-lang.github.io). I use it to do my research on the performance of dynamic languages. Here are examples of working with huge tables on an Intel machine with 128GB memory:

Dino uses a worse implementation of hash tables than the proposed tables (I have wanted to rewrite it for a long time). As I wrote, MRI took 7min on the same test for 100_000_000 elements on the same machine. So I guess it is possible to make MRI faster at working with huge hash tables too.

About 32-bit - it is about cache locality and memory consumption.
A million people will pay for 64-bit hashes and indices, but only three will ever use this ability.
One of these three will be you, showing:
"look, my hash may store > 2^32 Int=>Int pairs (if you have at least 256GB memory)"
The second will be a man who will check you, and he will say:
"Hey, Java can't do it, but Ruby can! (if you have at least 256GB)"
And only the third will ever use it for something useful.
But everyone will pay for it.

I am aware of it; therefore I put ??? in the comments. I am still thinking about the strategy for changing table sizes. My current variant is far from final. Actually, I wanted to publish this work in April, but some circumstances forced me to do it earlier. I am glad that I did it earlier, as I am getting a great response and now have time to think about issues I was not aware of. That is what happens when you publish your work. In the GCC community practically all patches are publicly reviewed. I am used to it.

Also, I am still going to work on new table improvements which were never discussed here.

OK, thanks for pointing this out. Yesterday's patch is buggy. I'll work on fixing it.

About 32-bit - it is about cache locality and memory consumption.

32-bit is important for your implementation because it permits decreasing the element size from 48B to 32B.

For my implementation, the size of elements will still be the same 24B after switching to 32-bit, although it permits decreasing the overall memory allocated for tables by 20% by decreasing the size of the entries array.

I don't like lists (through pointers or indexes). They are a dispersed data structure, hurting locality and performance on modern CPUs for the most frequently used access patterns. Lists were cool long ago, when the gap between memory and CPU speed was small.

GCC has an analogous problem with RTL, on which all back-end optimizations are done. RTL was designed more than 25 years ago, and it is a list structure. It is a common opinion in the GCC community that it hurts the compiler speed a lot. But the GCC community cannot change it (it has tried several times) because it is everywhere, even in target descriptions. One of my major motivations to start this hash table work was exactly removing the lists.

A million people will pay for 64-bit hashes and indices, but only three will ever use this ability.
One of these three will be you, showing:
"look, my hash may store > 2^32 Int=>Int pairs (if you have at least 256GB memory)"

I would not underestimate the power of such an argument in drawing more people to Ruby.

Especially when it seems that Python does not have the 2^31 constraint, and CPython can create a dictionary of about 2^30 size on a 128GB machine in less than 4 min.

It is unfortunate that MRI needs 7min for creation of a dictionary 10 times smaller. I believe it should be fixed, although I have no idea how.

The second will be a man who will check you, and he will say:
"Hey, Java can't do it, but Ruby can! (if you have at least 256GB)"
And only the third will ever use it for something useful.
But everyone will pay for it.

Although I am repeating myself, memory sizes and prices can change in the near future.

In any case, you made your point about hash and table sizes, and now it will be discussed at a Ruby committers meeting.

I don't like lists (through pointers or indexes). They are a dispersed data structure, hurting locality and performance on modern CPUs for the most frequently used access patterns. Lists were cool long ago, when the gap between memory and CPU speed was small.

But you destroy cache locality with the secondary hash and by not storing the hash sum in the entries array.

To overcome this issue you have to use a fill factor of 0.5.
Provided you don't use 32-bit indices, you spend at least 24+8*2=40 bytes per element - just before rebuilding.
And just after rebuilding the entries together with the table, you spend 24*2+8*2*2=80 bytes per element!
That is why your implementation doesn't provide memory savings either.

My current implementation uses at least 32+4/1.5=34 bytes, and at most 32*1.5+4=52 bytes.
And I'm looking for the possibility to not allocate the doubly linked list until necessary, so it will be at most 24*1.5+4=40 bytes for most hashes.

Lists are slow when every element is allocated separately. Then there is also a TLB miss together with a cache miss for every element.
When elements are allocated from an array per hash, there are fewer of both cache and TLB misses.

And I repeat again: you do not understand when and why open addressing may save cache misses.
For open addressing to be effective, one needs to store everything needed to check a hit in the array itself (so at least the hash sum ought to be stored).
And the second probe should be in the same cache line, which limits you:

to simple schemes: linear probing, quadratic probing,

or to custom schemes, where you explicitly check neighbours before a long jump,
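One exotic scheme of this kind, Robin Hood hashing (it comes up again below), keeps probe sequences short by letting an insert displace any resident entry that sits closer to its home slot. A minimal sketch with illustrative names, not code from any patch in this thread:

```ruby
# Robin Hood insertion into a power-of-two table with linear probing.
# slots is an Array of [key, hash] pairs or nil; no resizing or lookup shown.
def robin_hood_insert(slots, key, h)
  mask = slots.size - 1
  ind  = h & mask
  dist = 0                                  # probe distance of the entry in hand
  cur  = [key, h]
  loop do
    if slots[ind].nil?
      slots[ind] = cur
      return
    end
    resident_dist = (ind - (slots[ind][1] & mask)) & mask
    if resident_dist < dist                 # "rob the rich": swap with the entry
      slots[ind], cur = cur, slots[ind]     # that sits closer to its home slot
      dist = resident_dist
    end
    ind  = (ind + 1) & mask
    dist += 1
  end
end
```

The swap bounds the variance of probe distances, which is what makes storing a per-slot distance (or hash) in the array itself pay off.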

I don't like lists (through pointers or indexes). They are a dispersed data structure, hurting locality and performance on modern CPUs for the most frequently used access patterns. Lists were cool long ago, when the gap between memory and CPU speed was small.

But you destroy cache locality with the secondary hash and by not storing the hash sum in the entries array.

What is missing in the above calculations is the probability of collisions for the same-size table. The result may not be so obvious (e.g. in the first case we have collisions in 20% of accesses but in the second one only 10%): 2 + 3/5 vs 2 + 4/10, or 2.6 vs 2.4 cache misses.

But I am writing about it below.

To overcome this issue you have to use a fill factor of 0.5.
Provided you don't use 32-bit indices, you spend at least 24+8*2=40 bytes per element - just before rebuilding.
And just after rebuilding the entries together with the table, you spend 24*2+8*2*2=80 bytes per element!
That is why your implementation doesn't provide memory savings either.

One test mentioned in this thread showed that in 3 cases out of 4 my tables are more compact than the current ones.

My current implementation uses at least 32+4/1.5=34 bytes, and at most 32*1.5+4=52 bytes.
And I'm looking for the possibility to not allocate the doubly linked list until necessary, so it will be at most 24*1.5+4=40 bytes for most hashes.

It would be a fair comparison if you used 64-bit vs 64-bit. As I wrote, 32-bit hashes and indexes are important for your implementation. So for 64-bit the numbers should look like:

at least 48 + 8 / 1.5 = 51 bytes (vs. 40 for my approach)

and what number of collisions would there be in the above case? Pretty big if you are planning on average 1.5 elements per bin.

at most 48 * 1.5 + 8 = 80 bytes (vs. 80)

analogously, a big number of collisions if you plan on average 1 element per bin

If the community decides that we should constrain the table sizes, then I might reconsider my design. When I started work on the tables, I assumed that I could not put additional constraints on the existing tables, in other words, that I should not change the sizes. To be honest, at the start I also thought about what if the index were 32-bit (although I did not think further about also changing the hash size, as you did).

Lists are slow when every element is allocated separately. Then there is also a TLB miss together with a cache miss for every element.
When elements are allocated from an array per hash, there are fewer of both cache and TLB misses.

It is still the same cache miss when the next element in the bin list is outside the cache line, and that is very probable if you have moderate or big hash tables. It is an improvement that you put elements in the array; at least they will be closer to each other and will not be some freed elements created for other tables. Still, IMHO, it is practically the same pointer-chasing problem.

And I repeat again: you do not understand when and why open addressing may save cache misses.

I think I do understand "why open addressing may save cache misses".

My understanding is that it permits removing the bin (bucket) lists and decreases the size of an element. As a consequence, you can increase the number of array entries and have a very healthy load factor, practically excluding collisions while keeping the same table size.

For open addressing to be effective, one needs to store everything needed to check a hit in the array itself (so at least the hash sum ought to be stored).

That will increase the entry size. It means that for the same size table I will have a bigger load factor. Such a solution decreases cache misses in case of collisions but increases the collision probability for tables of the same size.

And the second probe should be in the same cache line, which limits you:

to simple schemes: linear probing, quadratic probing,

As I wrote, the only thing interesting for me in quadratic probing is better data locality in case of collisions. I'll try it and decide whether it goes into my final patch for trunk.

or to custom schemes, where you explicitly check neighbours before a long jump,

or exotic schemes, like Robin-Hood hashing.

You just break every best-practice of open-addressing.

No, I don't. IMHO.

You also omitted table traversal operations here. I believe my approach will work better there. But I guess we could argue a lot about that, too.

When the meeting decides about the sizes, we will have more clarity. There are no ideal solutions, and speculation can sometimes be interesting, but the final results on benchmarks should be the major criterion. Although our discussions are sometimes emotional, the competition will help to improve Ruby hash tables, which is good for the MRI community.

What is missing in the above calculations is the probability of collisions for the same size table. The result can be not so obvious (e.g. in first case we have collision in 20% but in the second one only 10%): 2 + 3/5 vs 2 + 4/10 or 2.6 vs 2.4 cache misses.

Sorry, I made a mistake in the calculations for the given conditions. It should be 2*0.8 + 3*0.2 vs 2*0.9 + 4*0.1, or 2.2 vs 2.2. It is just an illustration of my thesis: different collision probabilities can tip the balance in different directions.

I'm presenting my (pre)final version of patch.
It passes all tests and redmine works on it without issues.

Vladimir, I must admit your original idea about rebuilding st_table_entries was correct,
so there is no need for a doubly linked list. Ruby is not for huge hashes in low-latency
applications, and GC still has to traverse the whole hash to mark referenced objects.

Differences from Vladimir's approach:

closed addressing, with a fill factor close to 1
(so no need to keep a counter for deleted entries and rebuild the main hash array),

32-bit hashes and indices on 64-bit platforms,

size is stored as an index into a static table, so allocation does not always
grow 2x, but instead 1.5x and 1.33x,

a single allocation for both entries and bins;
the pointer points into the middle of the allocation,
with the bins array on one side and the entries on the other.

Here is my work update. This is far from the final version. I'll
continue to work on it. As I wrote I can spend more time on MRI work
after mid-April (currently I'd like to inspect and optimize my code
and investigate more about hash functions and tune them).

I am glad that I submitted my patch earlier than I planned
originally. The discussion was useful for me.

The discussion reveals that I used a wrong benchmarking procedure (I
picked it up from reading emails on the ruby developer mailing lists):

make benchmark-each ITEM=bm_hash OPTS='-r 3 -v' COMPARE_RUBY=

This command measures the installed ruby against the current miniruby. So
the results I reported were much better than in reality. Thanks
to Koichi Sasada, now I am using the right one:

ruby ../gruby/benchmark/driver.rb -p hash -r 3 -e -e
current::

So I realized that I should work more to improve the performance
results, because an average 15% improvement is far from the 50% reported
the first time.

Since the first patch submission, I did the following (I still use
64-bit hashes and indices):

I changed terminology to keep the same API. What I called elements
is now called entries, and what I called entries is now called bins.

I added code for table consistency checking, which helps to debug
the code (at least it helped me a lot to find a few bugs).

I implemented the compaction of the array entries and fixed a
strategy for table size change. It fixes the reported memory leak.

I made the entries array cyclical to exclude the overhead of table
compaction and/or table size change when a hash table is used as a
queue.

I implemented a compact representation of small tables of up to 8
elements. I also added tests for small tables of sizes 2, 4, and 8 to
check small hash table performance.

I also tried to place hashes inside bins to improve data locality in
case of collisions, as Yura wrote. It did not help; the average
results were a bit worse. I used the same number of elements in the
bins, so the bins array became 2 times bigger and probably that
worsened the locality. I guess tuning the ratio of #bin
elements to #entries could improve the results, but I believe the
improvement would not be worth it. Also, implementing better
hashing will probably make the improvement impossible at all.

Working on the experiment described above, I noticed that sometimes
MRI hash functions produce terrible hashes and the collisions
reach 100% on some benchmarks. This is bad for open addressing
hash tables, where a big number of collisions results in more cache
line reads than for tables with chains. Yura Sokolov already wrote
about this.

I ran ruby murmur and sip24 hashing on the smhasher tests
(https://github.com/aappleby/smhasher). The MurMur hashing in st.c is
about 3 times faster than sip24 on the bulk speed test (6GB/s vs
2GB/s), but murmur performs poorly on practically all hashing quality
tests except the differential tests and the Keyset 'Cyclic' tests.
E.g. on the avalanche tests the worst bias is 100% (vs 0.6% for
sip24). It is very strange, because all murmur hash functions from
smhasher behave well.

I did not start to figure out the reason, because I believe we should
use the City64 hash (distributed under the MIT license), currently the
fastest high-quality non-crypto hash function. Its speed reaches
12GB/s. So I replaced murmur with City64.

I also changed the specialized hash function (rb_num_hash_start)
used for the bm_hash_ident tests. It improves the collision rate on
these tests, e.g. from 73% to 0.3% for hash_ident_num.

I believe using City64 will help to improve the table performance
for the most widely used case, when the keys are strings.

Examining siphash24, I came to the conclusion that although using a
fast crypto-level hash function is a good thing, there is a simpler
solution to the problem of a possible denial-of-service attack based
on hash collisions.

When a new table element is inserted, we just need to count
collisions with entries having the same hash (more accurately, the
same part of the hash) but different keys, and when some threshold
is reached, we rebuild the table and start to use a crypto-level hash
function. In reality, such a function will never be used unless
someone really tries to mount the denial-of-service attack.

Such an approach permits using faster non-crypto-level hash functions
in the majority of cases. It also makes it easy to switch to other,
slower crypto-level functions (without losing speed in real-world
scenarios), e.g. SHA-2 or SHA-3. Siphash24 is a pretty new function
and not as time-tested as older ones.

So I implemented this approach.

I also tried double probing as Yura proposed. Although the
performance of some benchmarks looks better, it makes the code worse
on average (the average performance decrease is about 1%, but by
geometric mean it is about 14% because of a huge degradation of
hash_aref_flo). I guess it means that double probing can produce
better results because of better data locality, but also more
collisions for small tables, as it always uses only small portions of
hashes, e.g. the 5-6 lower bits. It might also mean that the
specialized ruby hash function still behaves poorly on flonums,
although using all 64 bits of the hash avoids collisions well. Some
hybrid scheme using double probing for big tables and a secondary
hash function using other hash bits for small tables might improve
performance more. But I am a bit skeptical about such a scheme
because of the additional overhead code.

All the described work allowed me to achieve about 35% and 53% better
average performance than the trunk (using the right measurements),
respectively, on x86-64 (Intel i7-4790K) and ARM (Exynos 5410).

I've just submitted my changes to the github branch (again, it is far
from the final version of the patch). The current version of the patch
can be seen as

I made the entries array cyclical to exclude the overhead of table
compaction and/or table size change when a hash table is used as a
queue.
I doubt using a hash table as a queue is a useful case. But I could be mistaken.
And cyclic allocation doesn't solve the LRU use case at all :-(

I mean, I really think st_table_entry compaction is the right way to go
(so I agree with your initial proposal in this area).

In my implementation I just check whether at least 1/4 of the entries are deleted.
If so, then I compact the table, else grow it.

The single call to st_hash in hash.c is just to combine two hash sums (calculated with SipHash) into one.

All usages of st_init_strtable are not performance critical.

Yura, I suspect you missed my point. What I proposed is to also use City64 in cases where siphash24 is currently used, and use siphash24 only when an attack is under way. Therefore I added code to the tables to recognize the attack and switch to a crypto-level hash function, which in reality will happen extremely rarely. City64 is 6 times faster than siphash24, therefore Hash will be faster and still crypto-strong. Moreover, you can use other crypto-level hash functions without losing speed in real-world scenarios.

I made the entries array cyclical to exclude the overhead of table
compaction and/or table size change when a hash table is used as a
queue.

I doubt using a hash table as a queue is a useful case. But I could be mistaken.
And cyclic allocation doesn't solve the LRU use case at all :-(

Still, it is nice to have because it costs practically nothing (one additional mask operation, whose tiny latency will be hidden among instructions executed in parallel on modern superscalar OOO processors).

also changed the specialized hash function (rb_num_hash_start) used for bm_hash_ident tests

I've seen you do it. Great catch!
I also fixed it in another way (because I don't use perturbation, I must ensure all bits
are mixed into the lower bits).

I also tried double probing

It took me a while to realize that you mean quadratic probing :-) Double probing would be: test a slot, then test its neighbour, do a long jump and repeat.
It is not a widely used technique.

Sorry, my memory failed me again. That was quadratic probing, which you proposed to use for open-addressing tables.

Do you configure with --with-jemalloc? Trunk is much faster with jemalloc linked, and it is hard to beat its performance.

Not yet but will do.

Unfortunately, Redmine still doesn't work with your branch, so I can't benchmark it.

Thanks. I'll investigate this. Unfortunately, I am too far from web development and have no experience with such applications, but I will learn.

The most exciting thing for me is that better hash tables can improve the performance of real applications. Before this, I looked at the hash table work more as an exercise to become familiar with MRI before deciding to do real performance work.

The most exciting thing for me is that better hash tables can improve the performance of real applications.

So then you should benchmark with real application.

For example, you will be surprised how small an effect changing siphash to a faster function gives.

By the way, do you run make check regularly?

Not really. I regularly do make test. I did make check at my branch point before committing my first patch to the branch, but I got a lot of messages about leaked file descriptors. So I thought make check was broken.

I guess I should not have ignored it. Just running it now, I see a failure in test_numhash.rb. I am going to start work on fixing it.

You may take the same function from my patch: because my hash table uses the same closed addressing as the current trunk st, I ought to mix the high bits into the lower ones.
Also, there is a fix for the current murmur_finish function.

I also changed the specialized hash function (rb_num_hash_start)
used for the bm_hash_ident tests. It improves the collision rate on
these tests, e.g. from 73% to 0.3% for hash_ident_num.

Nice. Do you think it's worth splitting this change out for
use with the current st implementation?

Thank you. Right now it makes little sense, especially for small tables. The current implementation simply ignores a lot of the higher bits; e.g. a table with 32 bins uses only the 5 lower bits of the 64-bit hash. The change I've made is good when you use all 64 bits of the hash, and my implementation eventually consumes all of these 64 bits, although several collisions have to occur for that.

To be a good hash function when you use only a small lower part of the hash, the function should behave well on the avalanche test. It means that when you change just one bit in any part of the key, an "avalanche" of changes happens in the resulting hash (ideally, half of the hash bits across the whole hash change). I guess any simple specialized function will be bad on such a test. Their major advantage is high speed. Murmur, City64, and SipHash24 are quite good on such tests, but they are much slower.

I think I'll have a few patches when I am done with the hash tables: the hash table itself, the hash functions, and code for recognizing a denial-of-service attack and switching to stronger hash functions. I am not sure that all (or any) will finally be accepted.

About 64bit versus 32bit index: several developers discussed this on this month's developer meeting. Consensus there was that we do not want to limit Hash size, while it is true that over 99% of hashes are smaller than 4G entries, and it is definitely a good idea to optimize to them. We did not reach a consensus as to how should we achieve that, though. Proposed ideas include: switch index type using configure, have 8/16/32/64bit versions in parallel and switch smoothly with increasing hash size.

About 64bit versus 32bit index: several developers discussed this on this month's developer meeting. Consensus there was that we do not want to limit Hash size, while it is true that over 99% of hashes are smaller than 4G entries, and it is definitely a good idea to optimize to them. We did not reach a consensus as to how should we achieve that, though. Proposed ideas include: switch index type using configure, have 8/16/32/64bit versions in parallel and switch smoothly with increasing hash size.

Complement:

(1) switch index type using configure

Easy. However, we cannot test both versions.

(2) have 8/16/32/64bit versions in parallel and switch smoothly with
increasing hash size.

There is one more:

(3) st only supports 32bit, but Hash object support 64bit using another
technique (maybe having multiple tables).

Anyway, we concluded that this optimization should be introduced for Ruby
2.4 even if it only supports 64-bit. Optimization with a 32-bit length index
is nice, but we can optimize later.

About 64bit versus 32bit index: several developers discussed this on this month's developer meeting. Consensus there was that we do not want to limit Hash size, while it is true that over 99% of hashes are smaller than 4G entries, and it is definitely a good idea to optimize to them. We did not reach a consensus as to how should we achieve that, though.

Matz also said, at the beginning of the discussion of this issue, that
currently he would emphasize speed over size, so if e.g. the 32-bit
version led to a 10% speed increase, going with 32-bit would be okay.

We also agreed that currently, a Hash with more than 4G elements is
mostly theoretical, but that will change in the future. So it may be
okay to stay with 32-bit if we can change it easily to 64-bit in the future.

I strongly prefer a configure option with a default of 32-bit.
The 99% should not pay for the 1% ability.

The 32-bit version doesn't lead to a performance improvement (maybe 1%, not more).
But it leads to a memory consumption reduction (up to 10%).

(3) st only supports 32bit, but Hash object support 64bit using another
technique (maybe having multiple tables).

Is it OK for Hash to have a different implementation than st_table?
There is a source-level dependency in some gems on RHASH_TBL :-(
May we force library authors to fix their libraries?

It would be very promising to switch the implementation to something that needs
one allocation for small hashes instead of the current two allocations
(one for st_table and one for entries).
Also, the implementation could be more tightly bound to Hash's needs and
usage patterns.

About 64bit versus 32bit index: several developers discussed this on this month's developer meeting. Consensus there was that we do not want to limit Hash size, while it is true that over 99% of hashes are smaller than 4G entries, and it is definitely a good idea to optimize to them. We did not reach a consensus as to how should we achieve that, though. Proposed ideas include: switch index type using configure, have 8/16/32/64bit versions in parallel and switch smoothly with increasing hash size.

Complement:

(1) switch index type using configure

Easy. However, we can not test both versions.

(2) have 8/16/32/64bit versions in parallel and switch smoothly with
increasing hash size.

There is one more:

(3) st only supports 32bit, but Hash object support 64bit using another
technique (maybe having multiple tables).

Thank you for the clarification.

I'll investigate the possibility of using fewer bits for indices (and maybe hashes) without introducing constraints on the table size. Right now I can say that using 32 bits on 64-bit Haswell improves the average performance of my tables by about 2% on the MRI hash benchmarks. By my estimate, the size should decrease by 12.5%. So it is worth investigating.

Anyway, we concluded that this optimization should be introduced for Ruby
2.4 even if it only supports 64-bit. Optimization with a 32-bit length index
is nice, but we can optimize later.

I'll work on it. I thought my patch was the final variant of the table when I first submitted it. But given all the feedback, I believe I should work more on the code.

As I wrote, I'll probably submit a few patches, because some work I've already done might seem controversial. I am thinking of a new hash table patch, a new hash functions patch, and a patch for the tables dealing with a denial-of-service attack. Yura Sokolov could submit his patch for using an alternative allocation.

I am planning to do some research and actual coding and to submit the patches definitely before the summer, but it might happen earlier (it depends on how busy I am with my main job). I believe we will have enough time to introduce the new hash tables in 2.4.

I think I'll have a few patches when I am done with the hash
tables: the hash table itself, hash functions, code for
recognizing a denial attack and switching to stronger hash
functions. I am not sure that all (or any) will be finally
accepted.

I look forward to these and would also like to introduce some
new APIs for performance.

One feature would be the ability to expose and reuse calculated
hash values between different tables to reduce hash function
overheads.

This can be useful in fstring and symbol tables where the same
strings are repeatedly reused as hash keys.

For example, I was disappointed we needed to revert r43870 (git
commit 0c3b3e9237e8) in [Misc #9188] which deduplicated all
string keys for Hash#[]=:

https://bugs.ruby-lang.org/issues/9188

So, perhaps being able to reuse hash values between different
tables can improve performance to the point where we can dedupe
all string hash keys to save memory.

I am also holding off on committing a patch to dedupe keys
from Marshal.load [Feature #12017] because I hope to resurrect r43870 before Ruby 2.4:

https://bugs.ruby-lang.org/issues/12017

I had another feature in mind, but can't remember it at the
moment :x

I doubt I'll have time to work on any of this until you're done
with your work. My time for ruby-core is limited for the next
2-3 months due to vacations and other projects.

I think I'll have a few patches when I am done with the hash
tables: the hash table itself, hash functions, code for
recognizing a denial attack and switching to stronger hash
functions. I am not sure that all (or any) will be finally
accepted.

I look forward to these and would also like to introduce some
new APIs for performance.

Thank you for sharing this. I am still working on the hash tables when I have spare time.

I have already implemented variable-width indices (8-, 16-, 32-, 64-bit). It gave about a 3% average improvement on the MRI hash benchmarks. I tried a lot of other changes too. __builtin_prefetch gave a good improvement, while __builtin_expect did nothing. So the current average improvement on the MRI hash benchmarks is close to 40% on Intel Haswell and over 55% on ARMv7 (using the right comparison of one miniruby vs. another miniruby).

One feature would be the ability to expose and reuse calculated
hash values between different tables to reduce hash function
overheads.

This can be useful in fstring and symbol tables where the same
strings are repeatedly reused as hash keys.

For example, I was disappointed we needed to revert r43870 (git
commit 0c3b3e9237e8) in [Misc #9188] which deduplicated all
string keys for Hash#[]=:

So, perhaps being able to reuse hash values between different
tables can improve performance to the point where we can dedupe
all string hash keys to save memory.

The hash reuse would be nice; hash calculation can be expensive. Maybe it will even permit removing the hash from the entries, at least without losing performance. I actually tried removing the hash and recalculating it when necessary, but it gave worse performance (I also tried putting smaller hashes into the bins to decrease cache misses; it worked better, but still worse than storing the full hash in the entries).

I am also holding off on committing a patch to dedupe keys
from Marshal.load [Feature #12017] because I hope to resurrect r43870 before Ruby 2.4:

I doubt I'll have time to work on any of this until you're done
with your work. My time for ruby-core is limited for the next
2-3 months due to vacations and other projects.

This project is holding me back too (it can be quite addictive, but I feel I am already repeating the same ideas with small variations). I'd like to move on to another project. So I'll try to submit a major patch before the end of April so that you have time to work on the new code.

I am sending the current overall patch. I think it is pretty close to what I'd like to see in the end. I hope the patch will help with your work.

The patch is relative to trunk at 7c14876b2fbbd61bda95e42dea82c954fa9d0182 which was the head about 1 month ago.

Since submitting my first patch for new Ruby hash tables, a lot of
code was reworked and a lot of new implementation details were tried.
I feel that I reached a point from where I can not improve the hash
table performance anymore. So I am submitting the final patches (of
course I am ready to change code if/after people review it and propose
meaningful changes).

I broke the code into 4 patches.

The first patch is the base patch. Most of the work I did for the new
hash table implementation is in this patch. The patch alone gives a
32-48% average performance improvement on the Ruby hash table benchmarks.

The second patch changes some simple hash functions to make them
faster. It gives an additional 5-8% speedup.

The third patch might seem controversial. It implements code to
use faster but not strong hash functions, recognize hash table
denial-of-service attacks, and switch to a stronger hash function
(currently to Ruby's siphash). It gives an additional 6-7% average
performance improvement.

The fourth patch replaces the MurMur hash with Google's City64 hash.
Although on very long keys City64 is two times faster than MurMur in
Ruby, its usage actually makes the Ruby hash table benchmarks slower.
So I do not propose adding this patch to the Ruby trunk. I put it here
only for people who might be interested in playing with it.

All patches were tested on x86/x86-64, ARM, and PPC64 with make
check. I did not find any additional test regressions.

The patches were also benchmarked on the Ruby hash table benchmarks on
an Intel 4.2GHz i7-4790K, a 2GHz ARM Exynos5422, and a 3.55GHz Power7.
This time I used more accurate measurements with the following script:

Anyways, I've taken a light look at patches 1-3 and applied them
to some of my own machines. There were some whitespace warnings from
git, so I also pushed them out in the "f-12142-hash-160429" branch
to git://bogomips.org/ruby.git (or https://bogomips.org/ruby.git)

Will wait for others to have time to review/test, too, since it's
the weekend :>

It is relatively easy to setup Redmine http://www.redmine.org/ - the same software which powers this site (bugs.ruby-lang.org).
Add project, some issues and use ab (apache benchmark) to stress load it.

Anyways, I've taken a light look at patches 1-3 and applied them
to some of my own machines. There was some whitespace warnings from
git, so I also pushed them out in the "f-12142-hash-160429" branch
to git://bogomips.org/ruby.git (or https://bogomips.org/ruby.git)

Will wait for others to have time to review/test, too, since it's
the weekend :>

Thank you, Eric. I don't expect a quick response. In the GCC community, a review of patches of this size can take a few weeks. I promised to send you the patches at the end of April so that you could work on them (by the way, the previous patch I sent you was buggy).

I read the documents but could not find anything about a copyright assignment procedure. Is there one? It is not a problem for me, because I am willing to give the code to whoever is the copyright holder of MRI.

As Yura noted, I missed the new benchmarks. I am adding them here. They should be part of the base patch.

It is relatively easy to setup Redmine http://www.redmine.org/ - the same software which powers this site (bugs.ruby-lang.org).
Add project, some issues and use ab (apache benchmark) to stress load it.

Well, I am too far from the world of web developers. I tried, and it was not easy for me. After spending some time, I decided that if it is not easy for me, it will not be easy for some other people either. The most important part of benchmarking is its reproducibility. For example, it was interesting when you reported a big improvement on Redmine; I was interested in what part of the improvement came from the new hash tables and what part came from jemalloc.

After that I tried an easier setup with Sinatra using puma and wrk (a benchmark client) on the i7-4790K. I used 4 threads and 4 workers. The results varied over a big range, so I did 10 runs of 30 seconds each for both hash table implementations. The best speed I got was 11450 reqs/sec for the old tables and 11744 reqs/sec for the new ones. But again, I'd take the results with a grain of salt, since as I wrote the results varied too much between runs.

I'd only like to add that, IMHO, switching to jemalloc by default should be a well-tested decision. I read that some people reported that for their load jemalloc used >2GB of memory where glibc malloc used only 900MB. A typical Facebook or Google workload can be very different from other people's loads. For example, GCC has features which nobody except Google needs (WHOPR, a link-time optimization mode for huge C++ programs whose IR cannot fit into the memory of even very beefy servers).

I know that the glibc people are well aware of the jemalloc/tcmalloc competition and recently started investigating allocator behavior, and probably malloc modernization.

Overall performance is on par with Vladimir's patch.
(Note: Vladimir's patch removes seeding for fixnum/double/symbol hashing,
so it could be faster for keys of such types.)
(I've tested against Eric Wong's mailbox applied on trunk, so no CityHash.)

I want to help to merge these excellent results into trunk.
I'm now reviewing both codebases and trying to devise other benchmarks to provide more solid evidence toward a merge and to compare the three implementations.

Firstly, Clang reports errors (they seem to be caused by -Wshorten-64-to-32).
I sent PRs to both repos.

To compare apples to apples, I believe you should use this non-zero
macro. It will definitely change the result, as it makes Yura's table
elements 33% bigger compared with the element size of Yura's smaller
default tables and of my implementation.

Hashing numbers on the trunk is based on shifts and logical
operations. My hash for numbers is faster not only on 64-bit but also
on 32-bit targets, even with an embedded shifter as on ARM, where an
operand of an arithmetic instruction can be shifted during the
instruction's execution.

I think there is, potentially, a security issue if an application gets
strings, transforms them into numbers, and uses them as keys. But I
don't know of such Ruby applications.

My implementation itself does not introduce the security issue, as
the original code on trunk has the same problem. Moreover, producing
conflicting integer keys for the trunk code is very easy, while for my
hash of integers it would take several months on modern supercomputers
to produce such keys on 64-bit machines.

Still, I can quickly make a patch which completely eliminates this issue
without any slowdown in my hashing for numbers, if people decide
that it is a real security issue. Instead of the constant seed used now,
I can produce a seed once per MRI execution from a crypto-level PRNG. Moreover,
I can speed up hashing for numbers even more for the last 4 generations
of Intel CPUs.

Still, I can quickly make a patch which completely eliminates this issue
without any slowdown in my hashing for numbers

Please, do it now. So no one will ever argue.

while for my
hash of integers it will take several months on modern supercomputers

Could you give a link to a paper studying the problem of cracking the algorithm you use? I would really like to read it for self-education.

BTW, I doubt it's faster (than st_hash_uint) when compiled as 32-bit x86. But I agree, it doesn't really matter which is faster, because almost no one runs Ruby on 32-bit these days.

Still, my fixes to st_hash (especially st_hash_uint and st_hash_end) are valuable, and they can coexist with your patch.
Changing SipHash to SipHash13 is also valid and likewise independent of the hash table algorithm.

Still I can make a patch quickly which completely eliminates this issue
without any slowdown in my hashing for numbers

Please, do it now. So no one will ever argue.

After some thoughts, I believe the security issue (collision
exploitation) we are discussing is not a problem for my implementation.

First of all, we have a translation of an N-bit value into an N-bit hash,
where N=64 is the most interesting case. For a quality hash function, it
is mostly a 1-to-1 function. I cannot measure the maximum number of
collisions for N=64, but for N=32, after half an hour on my desktop machine,
I got a maximum collision count of 15 for the 32-bit variant of my
function using 32x32->64-bit multiplication. So even if an attacker
hypothetically spends huge CPU time on a 64-bit machine to get keys with
the same hash, I guess he will produce only a small number of collisions
for the full hash.

There is still an issue for a table with chains, as it uses a small
part of the hash (M bits) to choose the right bin; this effectively makes
the hash function map 2^N values onto 2^M values, where N > M. It is easy to
generate a lot of keys with the same least significant M bits for the original
function (k>>11 ^ k<<3).
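For illustration, here is a sketch of such a flood, assuming the original prehash really was (k >> 11) ^ (k << 3); both function names are made up for the example:

```c
#include <stdint.h>

/* The assumed original prehash: (k >> 11) ^ (k << 3). */
uint64_t old_prehash(uint64_t k) {
    return (k >> 11) ^ (k << 3);
}

/* For any i, the key i << (m_bits + 11) has a prehash whose low
   m_bits are all zero: k >> 11 == i << m_bits and
   k << 3 == i << (m_bits + 14), so neither term contributes
   anything below bit m_bits.  All such keys land in the same bin
   of a table indexed by the low m_bits. */
uint64_t colliding_key(uint64_t i, unsigned m_bits) {
    return i << (m_bits + 11);
}
```

Enumerating i = 1, 2, 3, ... therefore yields arbitrarily many distinct keys that share one bin, which is exactly the flood scenario described above.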

In a table with open addressing, the M bits of the N-bit hash are used
only initially; after each collision, the other bits of the hash are
used, until all hash bits are consumed.
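The probing idea above can be sketched as follows. This is an illustrative open-addressing probe sequence in the spirit of CPython's "perturb" scheme, not the actual st.c code; all names are made up:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative probe sequence: the low M bits (mask = 2^M - 1) pick
   the first bin, and every collision shifts more of the original
   hash into the index, so colliding in the low bits alone is not
   enough to keep colliding. */
size_t first_probe(uint64_t hash, uint64_t *perturb, size_t mask) {
    *perturb = hash;                  /* remember the full hash */
    return (size_t)(hash & mask);     /* start from the low M bits */
}

size_t next_probe(size_t ind, uint64_t *perturb, size_t mask) {
    *perturb >>= 5;                   /* consume 5 more hash bits */
    return (ind * 5 + *perturb + 1) & mask;
}
```

After enough steps the perturb value reaches zero and the sequence degenerates to the pure `ind * 5 + 1` cycle, which visits every slot of a power-of-two table, so the probe always terminates.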

Changing SipHash to SipHash13 is also valid and also independent of hash table algo.

I don't think it is that important. I tried your tables with
siphash13 and siphash24, and with siphash13 the average improvement on the
hash table benchmarks increased only by 1% (out of 37%).

Does my version lead to increased memory consumption? I change the array size a bit more often than 2x, so there should not be much unused allocated space. (It could be done with Vladimir's patch as well after some refactoring.)

"attack" detection is not by full hash equality,
but with "collision chain is statistically too long" (simpler to implement).

improvements to the st_hash* functions and siphash (they could be applied to Vladimir's patch as well).

Cons of my version:

It doesn't mark deleted entries in the bins array.
This allows simplifying the code significantly, and usually doesn't matter,
but it may lead to a few more cache misses on insertion into a hash with deleted entries.

A single pointer to bins and entries needs a single allocation. This leads to coarser allocations, which could be a disadvantage.
And it is a trick that could be considered disgusting.

I do not make a separate "reserving" search for st_insert, because st_insert is not used in Hash.
(st_update in Vladimir's version also does not use reservation.)
It simplifies the code, but could be slower in some other places.
It is not hard to implement if considered necessary.

"statistically too long collision chain" may happen occasionally, though, it is unlikely.

I didn't run the benchmarks.
I measured by building the documentation for Ruby, and my version was on par with Vladimir's.
(And yes, it was really hard to get it that close.)
(I hope the benchmark results will not be shameful :-) )

I'm adding alternative patch version for st_table.
It is my compromise with Vladimir and his proposals.

Yura, I am glad that you have finally accepted open-addressing tables. Now
there are no major differences between my original table proposal and
your current implementation, although there are differences in the details.

I don't like the idea of adding more and more new things to the patch without
evaluating them separately. Therefore I stopped changing my patch
five months ago and only did a rebase. The effects of the 4 major parts of
the patch on performance were evaluated and posted by me to confirm
their inclusion or not.

The base patch could be what they have already evaluated
or agreed to evaluate. Otherwise, you and I will keep creating new "better" versions
of the patch, creating stress for them. Adding new features separately is also
good for keeping this thread readable (I suspect we have set a record for the longest
discussion on this Ruby discussion board).

Your latest new features are interesting, but some of them are not obvious
to me, for example, using tables with less than 2^32 entries by default, or
faster hash table growth, as you wrote, to suit the jemalloc allocation pattern.
(As far as I know, MRI Ruby does not use jemalloc yet, and I am not sure it
should be used, because it uses more memory for some loads and also
because the glibc community is serious about improving their malloc -- https://gcc.gnu.org/wiki/cauldron2016#WholeSysTrace).

We (mainly ko1) had a deeper look at this issue in the developer meeting today. Ko1 showed his microbenchmark results for the three implementations (the two proposed and the current one). I don't have the results at hand right now, so let me refrain from pointing out the winner. I believe he will reveal his benchmark script (and hopefully the results).

What I can say now is that he found that the most common use of hash tables in a Rails application was lookups in small (entries <= 64) hashes. So to speed up real-world applications, that kind of usage shall be treated carefully.

Anyway, we (at least I myself) would like to accept one of the proposed hash implementations, and in order to do so, ko1 is measuring which one is better. Thank you for all your efforts. Please give us a little more time.

You wrote "I agree to determine the implementation with a coin toss"
in your report. I think it is better not to do this.

My implementation is the original one. Yura just repeated my major
ideas: putting the elements in an array, open-addressing hashing (although he
fiercely argued with me that buckets are better), and using a
weaker hash function when there is no suspicion of a collision attack.
I did not use any of his ideas.

I believed the competition was unhealthy, and it seems I was right:
Yura spent his time on his implementation, you had to spend your time on your thorough
investigation, and still you need "a coin toss" to decide.

I'd like to propose a plan which is to use my code as the base and
Yura can add his own original code later:

performance of "many" tables is usually better
(because of the smaller header and the single pointer to bins+entries,
which leads to fewer TLB misses)

Unexpected behaviour:

unstable performance of a single small table.

my attempts to make murmur stronger make it remarkably slower.

Using this information I added couple of commits:

The first commit switches a table to the version with "bins" earlier.
It improves the performance of tables of size 5-8
and reduces the size of tables of 9-10 elements.
However, tables of size 8 are a bit larger,
and tables of size 10 could be a bit slower.

The second commit improves the performance of "murmur".
"Murmur" still remains seeded with a random seed,
and it is not as trivial to flood as the original murmur,
but it becomes weaker than the previous variant.
Additionally, reading of the string tail is improved for CPUs with unaligned word access.

(The branch on github is updated and rebased on trunk: https://github.com/funny-falcon/ruby/commits/st_table_with_array3
It contains an additional commit that adds rb_str_hash_weak
for symhash and rb_fstring_hash_type.
It improves "#{dynamic}".to_sym a bit.
But this commit doesn't affect your benchmark or bm_hash*,
so I didn't include it in the mbox.)

I believe chaining is better.
But the difference in size, and the quirks of maintaining "optimized" bins on deletion,
convinced me to switch to open addressing.

If you wish, I may attempt to implement "chaining" again on top of the current code.
There is no big difference between the code bases.
I think that, using the same trick on deletion as in open addressing (not bothering with
maintaining the exact bin), the code could be as clear.
But then the table will use 30% more memory on 64-bit with HUGE=1, and always on 32-bit.
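As a rough illustration of where such a size difference comes from, compare per-entry layouts (these structs are made up for the sketch, not the actual st.c types): adding a 32-bit chain link plus alignment padding to a 3-word entry grows it from 24 to 32 bytes, i.e. about a third more, in the ballpark of the 30% mentioned above:

```c
#include <stdint.h>

/* Hypothetical open-addressing entry: 3 machine words. */
typedef struct {
    uint64_t hash;   /* cached full hash of the key */
    uint64_t key;
    uint64_t value;
} oa_entry;          /* 24 bytes */

/* Hypothetical chaining entry: same fields plus a chain link. */
typedef struct {
    uint64_t hash;
    uint64_t key;
    uint64_t value;
    uint32_t next;   /* index of the next entry in the same bin */
    uint32_t pad;    /* alignment padding on a 64-bit ABI */
} chain_entry;       /* 32 bytes */
```

Whether the real overhead is exactly 30% depends on the bins array and the index widths chosen, so these structs only show the shape of the trade-off.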

Vladimir:

I'm glad you returned to competition. Together we will make st_table great!

I'm glad you returned to competition. Together we will make st_table great!

To be honest, I am not glad. I am working on RTL VM insns and JIT, and this is a big distraction for me.

Instead of accepting one (the original) variant and then improving it (as I proposed yesterday), here we have more variants, and more variants coming, as there is no deadline for this stupid competition without exact rules.

Koichi did a big job comparing the already existing tables. Now this job should be done again and again, especially when you come back saying "I believe, chaining is better" (which means at least 2 more variants from you).

(It is applied after the last two additional commits in hash_improvements_additional.mbox.)

It has better performance than both my and Vladimir's open-addressing variants.
It will consume the same amount of memory on 64-bit with HUGE=0 as my open-addressing variant.
It will consume 30% more memory on 32-bit or with HUGE=1.

I did not find any meaningful changes in average performance on MRI hash table benchmarks.

I also merged last changes on the trunk to my branch.

I'll post the table size graphs tomorrow (my benchmarking machine is currently busy with gcc benchmarking). It takes about 20 minutes to build the tables of sizes from 1 to 60K.

I don't know how we will benchmark the speed, as by my estimation Koichi's script will take about 2 hours on my fastest machine for one variant. As I understand it, we now need to test 7 variants (trunk, my April and my current variants, and 4 of Yura's). I probably will not do it, as I cannot reserve my machine exclusively for this work for such a long time. Sorry.

'vlad april' means my April variant without the last two commits to
st.c.

trunk means the current trunk (03c9bc2)

yura means the latest Yura's variant
(https://github.com/funny-falcon/ruby.git, branch
st_table_with_array3, 39d9b2c3658c285fbe4b82edc1c7aebaabec6aaf) of
tables with open addressing. I used the 64-bit variant; there is
practically no difference in performance and memory consumption
compared with the 32-bit variant.

Here are the tables speed improvements relative to the trunk on MRI hash benchmarks:

My current variant requires less memory than the April one. The current
variant is close to the average memory consumption of Yura's (compare
the integrals, i.e. the areas under the curves).

Yura's curve is smoother because of more frequent and smaller
rebuildings, but that probably results in the speed decrease.

There is a trade-off in my implementation (and probably in Yura's)
between table speed and table size.

Further size reduction in my tables by increasing the table fill rate
(e.g. from 0.5 to 3/4) is not worth it, as one bin is 1/24 to
1/12 of one entry's size for tables with less than 2^15 elements.
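The ratio can be checked with simple constants (illustrative sizes, not the actual st.c types): a 24-byte entry (hash, key, value) versus the 1- or 2-byte bin indices usable for small tables:

```c
#include <stdint.h>

/* Illustrative sizes: an entry holds the full hash, the key and the
   value (3 machine words), while a bin holds only an index into the
   entries array, sized by the table's capacity. */
enum {
    ENTRY_SIZE = 3 * sizeof(uint64_t),  /* 24 bytes per entry */
    SMALL_BIN  = sizeof(uint8_t),       /* 1-byte index, tables < 2^8  */
    MID_BIN    = sizeof(uint16_t)       /* 2-byte index, tables < 2^15 */
};
```

So even doubling the number of bins for a better fill rate costs at most one or two bytes per entry's worth of memory in this size range.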

Unfortunately, it is not known what part of the overall footprint
in Rails belongs to the tables. Hash tables might be a small part
in comparison with other Rails data. Even if the tables are the
lion's share, their elements and keys might require more memory than
the table entries.

Therefore it is hard to say what the trade-off between performance and
table size should be for Rails. I personally still prefer my April variant.

Vladimir,
Yes, I excluded the tables, and I measured after the preprocessor,
so all #ifs are removed:

removed the function that checks table consistency (I also have such a function, and it is also removed)

removed all calls to that function and all assertions
Here is the result after removing a single 'break;' and the function definition types
(and I had to redefine NULL and assert, because they contained newlines,
and your st_assert, because mine is a no-op):
https://gist.github.com/funny-falcon/581ecf363c4aa7d8ac1d805473080a1b
392 lines vs 546 lines.
Is it fair now?

Anyway,
I apologize again for my messages after the decision.
They were inappropriate.
It was just an emotional reaction.
I was ready to lose, but I really thought my version was simpler, so it was a reaction only to that remark.

You can see all of code history here:
<https://github.com/vnmakarov/ruby/tree/hash_tables_with_open_addressing>
This improvement is discussed at
<https://bugs.ruby-lang.org/issues/12142>
with many people, especially with Yura Sokolov.
* st.c: improve st_table.
* include/ruby/st.h: ditto.
* internal.h, numeric.c, hash.c (rb_dbl_long_hash): extract a function.
* ext/-test-/st/foreach/foreach.c: catch up this change.

Sorry for not choosing yours. Both are equally good, the difference is so slight, and I had to make a decision at that moment.
I felt Vlad's version has some room for improvement, which we now see in yours. That's the reason I said "baseline".

I was disappointed not because you chose Vlad's version,
but because you called it "simpler" than mine.
I'm glad that Ruby will have new implementation.
I do respect your choice.
I do respect Vladimir's work. I learned a lot from it.

(1) Your last version is not open addressing.
It is hard to compare so many evaluation points.
This was the main reason.

(2) You mentioned that the chained version will increase memory usage.
At this time (before Ruby 2.4), we need to decide
(maybe this is the last chance to merge),
so I decided to watch memory usage more closely for Ruby 2.4.

(3) Your patch is not on github but comes as an attached patch file.
That is slightly tough for me.
(I made a script to build Ruby from an svn/git branch.)

(4) I measured them last week and I don't have plenty of time
(this is why I post this updated version today).

(3) and (4) are not fair reasons, only my laziness (if I had
shortened my sleeping time, I could have done it).