JuliaCon 2024

Unlocking the lightning speed of Julia's Dict with Robin Hood Hashing
2024-07-10, For Loop (3.2)

Rooted in the brilliance of Google's Swiss Tables, the current implementation stores 1 byte of metadata per entry, elevating successful lookups to impressive efficiency. Unsuccessful queries, however, see none of that benefit. Enter the paradigm shift: fusing that metadata with the precision of Robin Hood hashing. Visualize this: rehashing without redundant hash recalculations, and lightning-fast queries in both the success and failure scenarios. Brace for a Dict metamorphosis!


Problems with Julia's Dict

Julia's Dict currently uses a variation of Swiss Tables, an open-addressing hash table with linear probing, but the implementation has certain shortcomings.
1. For each (key, value) pair in the Dict, we store one byte of metadata: 7 bits hold a fragment of the key's hash and 1 bit indicates whether the slot is valid (sketched just after this list). This metadata speeds up successful lookups, but unsuccessful lookups do not benefit from it.
2. When the hash table rehashes, we have to recompute the hash of every key in order to insert it into its proper location, and computing the hash can be expensive for complex data types. The metadata can be used to remove this shortcoming and improve Dict's performance during rehash.
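
To make the current scheme concrete, here is a rough sketch of that metadata byte; the helper names are mine and this is not Base's actual code:

    # Hypothetical helpers: one metadata byte per slot, with the top bit
    # marking occupancy and the low 7 bits caching a fragment of the key's hash.
    slot_meta(h::UInt) = 0x80 | ((h % UInt8) & 0x7f)

    is_occupied(meta::UInt8) = (meta & 0x80) != 0x00

    # A lookup can reject most non-matching slots by comparing this cheap byte
    # before doing the (possibly expensive) full key comparison.
    maybe_match(meta::UInt8, h::UInt) =
        is_occupied(meta) && (meta & 0x7f) == ((h % UInt8) & 0x7f)

As noted in point 1, an unsuccessful lookup gains little from this byte: it still has to keep probing until it reaches an empty slot.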

Proposed Solution

We can achieve all of this with a simple technique called Robin Hood hashing. Metadata lets us improve both rehash performance and the time taken by unsuccessful lookups, without compromising the lightning speed that we all crave.

Put simply, Robin Hood hashing is a variation of linear probing that fixes some of its shortcomings. The basic idea is to take ordinary open addressing and apply one clever trick that drastically reduces the variance of probe lengths, and with it the expected average and maximum probe length. Shorter expected probe lengths lead to faster insert, delete, and lookup operations on the hash table.

The trick works as follows: while probing for a position to insert a new element, if the probe length of the existing element at a slot is less than the current probe length of the element being inserted, swap the two elements and keep going. This small operation drastically changes the distribution of probe lengths, and we can exploit it to speed up unsuccessful lookups. How? Intuitively, the swapping keeps probe lengths non-decreasing along a run of keys that suffer from hash collisions. While searching for a particular key, we can therefore terminate the search early as soon as our current probe length exceeds the probe length of the element we are currently examining.
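
As an illustration, here is a minimal, self-contained sketch of the insertion step over a linear-probing table. The layout (a probe-length array plus parallel key/value arrays) and all names are illustrative assumptions, not the proposed RobinDict internals:

    # slots[i] stores the probe length of the entry at slot i (0 = empty).
    # Assumes the table has a free slot; growth/rehash is omitted.
    function rh_insert!(slots::Vector{Int}, keys::Vector{Any}, vals::Vector{Any},
                        key, val)
        sz = length(slots)
        idx = Int(hash(key) % sz) + 1       # home bucket (1-based)
        probe = 1                           # probe length of the element in hand
        while true
            if slots[idx] == 0              # empty slot: claim it
                slots[idx], keys[idx], vals[idx] = probe, key, val
                return
            elseif isequal(keys[idx], key)  # key already present: overwrite
                vals[idx] = val
                return
            elseif slots[idx] < probe       # resident is "richer": swap and carry on
                slots[idx], probe = probe, slots[idx]
                keys[idx], key = key, keys[idx]
                vals[idx], val = val, vals[idx]
            end
            idx = idx == sz ? 1 : idx + 1   # linear probing with wrap-around
            probe += 1
        end
    end

Here slots = zeros(Int, n) together with keys = Vector{Any}(undef, n) and vals = Vector{Any}(undef, n) would form an empty table of capacity n.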

Keeping all of the above in mind, the idea I am proposing is to use 4 bytes of metadata per hash table entry: 3 bytes for the hash and 1 byte for the probe sequence length. Storing 3 bytes of the hash saves the time needed to rehash the table, while the 1 byte of probe length lets us compute the "cost" of occupying a particular location cheaply, allowing us to terminate lookups faster.
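
One possible packing of that 4-byte metadata, together with the early-exit test it enables, might look like the following (all names here are hypothetical, and the probe sequence length is capped at 255 by the single byte):

    # Upper 3 bytes: cached hash fragment (reused during rehash, so the full
    # hash need not be recomputed). Low byte: probe sequence length (PSL).
    pack_meta(h::UInt, psl::Integer) = ((h % UInt32) & 0xffffff00) | UInt32(psl & 0xff)
    hash_bits(meta::UInt32) = meta & 0xffffff00
    probe_len(meta::UInt32) = Int(meta & 0x000000ff)

    # Unsuccessful lookups terminate early: once our running probe length
    # exceeds the PSL stored at the slot we are visiting, the key cannot
    # appear any further along the probe sequence. Occupied slots always
    # have PSL >= 1, so their metadata word is nonzero.
    function rh_haskey(metas::Vector{UInt32}, keys::Vector{Any}, key)
        sz = length(metas)
        h = hash(key)
        idx = Int(h % sz) + 1
        probe = 1
        while true
            m = metas[idx]
            m == 0x00000000 && return false     # empty slot: key absent
            probe > probe_len(m) && return false  # Robin Hood early exit
            if hash_bits(m) == ((h % UInt32) & 0xffffff00) && isequal(keys[idx], key)
                return true
            end
            idx = idx == sz ? 1 : idx + 1
            probe += 1
        end
    end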

Another important aspect of hash tables is the load factor. The load factor is usually expressed as a percentage and determines how full the table may become before it must rehash; a maximum load factor of 90% means the table can be 90% full before it needs to rehash again. This figure counts both valid keys and "tombstone" entries. Tombstones are dummy entries left behind for recently deleted keys; they make deletions faster, but they count against the load factor.
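
In code, the load factor and the rehash trigger described above amount to something like this (the 0.9 threshold is just the example figure from the text):

    # count = live entries, ndeleted = tombstones, nslots = table size.
    load_factor(count, ndeleted, nslots) = (count + ndeleted) / nslots

    # Grow/rehash once the table is more than 90% "occupied" in this sense.
    needs_rehash(count, ndeleted, nslots; maxlf = 0.9) =
        load_factor(count, ndeleted, nslots) > maxlf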

I experimented with the backward-shift deletion technique, which keeps no tombstone entries: it removes the entry from the table and then shifts the neighboring entries back to fill the "void" created by the deletion. On one hand, this simplifies the branch conditions for insertions and lookups (there are no "missing" entries to worry about); on the other hand, shifting the adjacent entries back increases deletion time.
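
For completeness, here is a sketch of backward-shift deletion on the same illustrative probe-length layout used in the insertion sketch above:

    # Remove the entry at slot idx, then pull displaced neighbours back one
    # position until we hit an empty slot or an entry already in its home
    # bucket (PSL == 1). No tombstone is left behind.
    function rh_delete_at!(slots::Vector{Int}, keys::Vector{Any}, vals::Vector{Any},
                           idx::Int)
        sz = length(slots)
        slots[idx] = 0
        nxt = idx == sz ? 1 : idx + 1
        while slots[nxt] > 1                 # displaced entry: shift it back
            slots[idx] = slots[nxt] - 1      # one step closer to its home bucket
            keys[idx]  = keys[nxt]
            vals[idx]  = vals[nxt]
            slots[nxt] = 0
            idx, nxt = nxt, (nxt == sz ? 1 : nxt + 1)
        end
        return nothing
    end

The extra copying in this loop is exactly the deletion-time cost mentioned above.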

Benchmarks

  1. Although the mean probe length of both Dict and RobinDict stays close to 1 for 10K entries, the variance for RobinDict is much smaller: roughly one third of the variance shown by Dict. We also see a ~3x improvement in the maximum probe length for 10K entries: Dict's maximum probe length can reach 50, whereas RobinDict's never exceeds 15.

  2. When inserting entries into Dict, we see large spikes in insertion time as more and more (key, value) pairs are added; these spikes occur whenever the table rehashes. With the metadata in place, the rehashing spike disappears completely.

  3. Deleting entries from RobinDict is a tradeoff: the lower probe lengths keep the lookup part of a deletion fast, but backward-shift deletion adds extra work per removal. In my benchmarks, RobinDict outperforms Dict on deletion about 60% of the time.

  4. For successful lookups, RobinDict outperforms the current Dict implementation about 80% of the time, while for unsuccessful lookups it wins about 60% of the time. Honestly, I expected the latter figure to be higher (around 90%), but the lookup code path gains an extra branch condition compared to Dict, which accounts for the added time. (One way to run such lookup comparisons is sketched below.)
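
For readers who want to try such comparisons themselves, a minimal sketch along these lines should work, assuming the RobinDict type from DataStructures.jl; the benchmark setup here is my own, not the one used to produce the numbers above:

    using BenchmarkTools
    using DataStructures: RobinDict   # assumed home of the RobinDict type

    n  = 10_000
    ks = rand(Int, n)
    d  = Dict{Int,Int}();      foreach(k -> d[k]  = k, ks)
    rd = RobinDict{Int,Int}(); foreach(k -> rd[k] = k, ks)

    # Successful lookups: keys that are in the table.
    @btime haskey($d,  k) setup = (k = rand(ks))
    @btime haskey($rd, k) setup = (k = rand(ks))

    # Unsuccessful lookups: random keys that are almost surely absent.
    @btime haskey($d,  k) setup = (k = rand(Int))
    @btime haskey($rd, k) setup = (k = rand(Int))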