Data locality is very important. Keeping data close together in memory is a huge win on basically every modern computing device due to our systems’ inherently multi-level memory hierarchy. Contiguous data leads to speed, and even some perhaps counter-intuitive results (like insert-and-shift into the middle of a dynamic array actually being faster than the equivalent linked list insertion1).
Despite these facts, interestingly enough the C++11 (and beyond) standard
library provides a hash table implementation in std::unordered_map
that
is a separate chaining hash table, with linked list buckets. This
implementation implies a strong trade-off: while elements inserted into the
map are guaranteed to be stable in memory (never needing to be moved or
copied), we now must chase pointers when walking through the buckets in the
hash table.
What if we allow for keys to be moved/copied around as the hash table grows? If we relax the requirement that we never move key/value pairs after insertion, we now open the door for implementing a hash table using open addressing. Here, the table is stored as a single huge array containing either a key/value pair or nothing. On an insertion where a collision occurs, an empty slot is located by “probing” through the array looking for an empty slot, following some probing strategy. The most well known is linear probing, which has developed a bit of a bad reputation. But is the disdain deserved?
In this post, I’m going to walk you through some results that motivated the development of an open source (insertion-only) probing hash table framework that’s distributed as part of the MeTA toolkit, and attempt to prove to you that naive, linear probing (or, at least, blocked probing) strategies are not nearly so bad as you might initially think. In particular, we’ll be benchmarking hash tables with both integer and string keys, taking a look at how they perform in terms of building time, memory consumption, and query throughput.
Integer Hash Functions
First, let’s address something about hashing. Hash tables depend very strongly on their hash function behaving close enough to a random function (as in, the output of the function over the key space is uniform over the output space) in order to give their nice amortized guarantees2.
Furthermore, if your hash function is predictable (as in, someone can guess what your hash function might be), people will find out and will craft inputs to take your server/application down. Having a good (nearly uniform), unpredictable hash function is very important.
With these criteria in mind, let’s see how two C++ standard library implementations fare in the case of hashing integers. I tested with the latest libstdc++ from GCC (packaged with 5.3.0 as of the time of writing) and libc++ (packaged with llvm/clang 3.7.1 as of the time of writing).
Here’s a very basic snippet:
Can you guess what the output is? To my surprise, it was:
0
1
2
3
for both standard library implementations I have available.
Every run.
Yes my friends, the standard library hash function for integral types for the most commonly used standard libraries on Linux and OS X, respectively, is the identity function and has no randomization whatsoever, making it perfectly predictable.
This is capital-B Bad and is a wide open door for attacks.
OK, so fine, we’ll just not use std::hash
and write one ourselves, right?
And we’ll definitely be able to write a good one, right?
And we’ll definitely not forget one of the 30 different specializations, right?
And you’ll definitely notice that I’m strongly hinting to you that this is a Very Bad Idea(TM), right?
So what do we do? We certainly need to replace the hash function that’s being used by default if we care about minimizing our hash table attack surface, first and foremost. But you should never, ever write your own hash function3, if you can help it.
Fortunately, the C++ standard proposal paper N3980: Types Don’t Know # has got this problem figured out (seriously, read that paper if you’re a C++ person: it’s absolutely brilliant). Hinnant et al. provide an easily extensible framework that decouples the hashing problem into two steps: providing to the implementation what to hash and then hashing those elements with a generic, byte-oriented hash function. This allows you to easily substitute your hash algorithm with a different one without having to rewrite all of your 30+ specializations to integrate it properly.
You write, once, the definition of what it means for your specific type
to be hashed (via a namespace-local, potentially friend
, overload of
hash_append()
for your type, implemented in terms of hash_append()
for
the member elements of that type that need to contribute to generating the
hash code for it).
You write, once, the definition of what it means to has a sequence of
bytes using your chosen hash function. Changing the hash function to
another one at a later date is trivial, and you do not have to write
any specializations beyond providing the overload for hash_append()
to
specify how to hash your type.
As our starting point, I implemented this framework and provided
several randomly-seeded hash functions that we can use for implementing our
hash tables. By default, we’re currently using FarmHash for the default
hash function on 64-bit machines, and Murmur34 as the default
hash function on 32-bit machines. But the wonderful thing is that I can
change my mind at any date, change one line of code, and be off to the
races using any of the other hash function’s I’ve implemented. This is
impossible if you’re using std::hash
.
Probing Hash Tables for Integers
Now that we’ve settled on a hash function, let’s do some benchmarking,
shall we? The code for this benchmark is available on Github.
For each hash table I compared against, I generated data for insertion
calls by randomly generating uint64_t
elements from the range ,
where was varied from 1,000,000 to 100,000,000 in increments of
1,000,000.
To evaluate the tables, we’ll take three looks on efficiency: building time, memory consumption, and query throughput.
Our probing hash table in this case consists of a single probing array that
contains the key/value pairs directly, ensuring that (in this case) the
cells are exactly 16 bytes. However, we do need some way of telling
whether a particular cell in that array is occupied by a valid key/value
pair. To do this, and not waste any space, we sacrifice a single key value
(called the sentinel) that will be stored in the cells that are not
occupied by valid key/value data. The particular value that’s sacrificed to
the hash table deities can be configured via key_traits
. (And, if
sacrificing a specific key value is not an option, you can use an
alternative storage strategy that I’ll talk about later in the section on
string hash tables.)
Building Time
For the running time, there’s an interesting trade-off here. If you aren’t
concerned about std::hash<uint64_t>
being the identity function (maybe
you can carefully control the input to your hash table, for example, to
guarantee no evilness), it’s clear that just using unordered_map<uint64_t,
uint64_t>
is fine. In fact, it’s the fastest among the tables we tested.
But if you’re at least a little concerned about those DoS attacks I
mentioned earlier, you can see that the probing hash table (probe_map
in
the graph) is the clear winner.
So then I asked the question: what if I use std::hash
in a probe_map
?
Clearly, evaluating the hash function is more expensive when using
meta::hashing::hash<>
(FarmHash, by default), so maybe I can beat out
unordered_map
if I use a faster hash function?
…and after about 10 hours of running what was supposed to be a “quick” test, I realized the problem: as soon as the table expands to larger than my random number input range, the keys are essentially guaranteed to coalesce into the front half of the table. The only case where elements can end up in cells and beyond is if there was a collision at cell . So, because of our naive hash function, we’ve got the primal clustering problem on steroids. Not good.
So don’t do that.
I also compared gcc
against clang
here, and there was no discernible
difference (the lines are literally right on top of one another in the
graph, so I don’t present these results here).
Memory Consumption
Next, let’s have a look at the memory consumption during building. I
measured this by using getrusage()
on my Linux-based testing environment
and extracting the ru_maxrss
field from the struct to get the number of
kilobytes used by the program at its peak memory consumption.
Here we can see one clear advantage of using the probe_map
: it (most of
the time) consumes less memory than the equivalent unordered_map
. You’ll
also see the obvious resizing jumps occurring more often, reflecting the
library’s choice of a default resizing ratio of 1.5x instead of the
commonly used 2x (though you have full control over this with probe_map
,
unlike unordered_map
’s hardcoded default).
I also compared against gcc
again and, to no one’s surprise, it is
exactly the same as clang
here, so I didn’t bother cluttering the graph.
Query Throughput
So far so good. Now let’s consider the query throughput: how quickly can I look up every element (including the duplicates) that I inserted in the table? We report the number of queries one can perform in a second here, in millions.
Since you can finally tell the difference between gcc
and clang
here, I
report both in the graph (though their differences are insignificant). At
the very top we have probe_map
, which seems to remain relatively constant
at around 1.56 million lookups/second after it dips around the 10 million
insertion mark. Below that, perhaps predictably, are unordered_map
with
std::hash
and unordered_map
with meta::hashing::hash<>
. The
throughput is hurt significantly by having a more sophisticated hash
function. Fortunately, it seems that you can easily get the best of both
worlds by using probe_map
.
Probing Hash Tables with String Keys
Strings are perhaps the most common key type used in hash tables, and
represent a significantly different challenge compared to hashing integer
keys. A C++11 compliant std::string
implementation is significantly
larger than a uint64_t
: as of the time of writing libc++ uses 24 bytes
and libstdc++ uses 32 bytes. This makes using a simplistic probing hash
table significantly less efficient. When using integer keys and values, we
could get away with just storing them directly in the array used for
probing due to their small size (the probing elements were only 16 bytes).
If we store a probing element that contains a std::string
key and a
uint64_t
value, we waste somewhere between 32 and 40 bytes per empty
cell, and there will typically be a lot of empty cells in a probing hash
table (common wisdom suggests maximum load factors of around 0.7 for
probing tables5; we’ve opted for a more aggressive 0.85
default, but these values are configurable in probe_map
just as they are
for unordered_map
.).
To get around this, the probe_map
implementation detects whether your key
is “inlineable” or not using a key_traits
class. Right now, only integral
values are considered “inlineable” and everything else is not. If a key
type is not “inlineable”, we change our representation of the table up
slightly. We still have a large array for probing, which contains two
integers: one that indicates the hash code of the element that occupies
that cell, and one that indicates the index into another array where that
key/value pair can be found, plus 1 (if the index value is 0, this
signifies an empty cell).
This has a couple advantages. First, these kinds of nontrivial keys tend to
have more complicated equality operators than just a single integer
comparison. In the case of a std::string
, this is a standard
lexicographic comparison against a byte buffer that could be located in a
non-local part of memory provided the string is long enough. If we call
operator==
on every string as we probe, each call could potentially
result in a cache-miss even if the cells are right next to each other
since the “real” data (the byte buffer) is elsewhere. Instead, we’ll first
check whether the hash code of the element we’re looking for is equal to
the hash code that we’ve cached at that cell in the probing array (which
won’t incur an additional cache miss). Only if those hash codes are equal
do we go ahead and perform potentially expensive operator==
call.
Second, it allows the wasted space for empty cells in the probing table to
continue to be only 16 bytes as opposed to 32 to 40. This can save quite a
bit of space since the probing table is likely to have a lot of empty
cells. Unfortunately, it’s not all space savings, since the secondary array
storing the key/value pairs is also likely to have many empty cells, since
we use an exponential-resizing std::vector
for their storage.
With that explanation out of the way, let’s compare our probe_map
against
unordered_map
again, using the same criteria as we did for the integer
case: building speed, memory consumption, and query throughput. Here, we’ve
generated string keys by taking the exact same data as we used in the
integer benchmarks, but converting the integers to their string
representations first.
Building Time
Building time is essentially the same across the tables, with probe_map
performing slightly worse in some spots due to the more frequent resizing
cost. However, it’s interesting to see that unordered_map
with
std::hash
actually starts to perform worse than unordered_map
with
meta::hashing::hash<>
towards the very end of the benchmark (at around
85,000,000 insertions). It’s also interesting to note that unordered_map
and probe_map
converge to about the same performance around 95,000,000
insertions if they are both using FarmHash.
Memory Consumption
probe_map
loses the memory battle to unordered_map
by significant
margins at many insertion counts, and we suspect the secondary array’s
blank cells account for a lot of the extra weight. Fortunately, the extra
memory usage is frequently followed by a long period of growth that is at a
significantly lower rate than unordered_map
, allowing it to “catch up” at
a couple of insertion count points.
Query Throughput
Here we can start to see some interesting differences between clang
and
gcc
. At the top of the graph is probe_map
, which has a higher query
throughput than any of the standard library implementations by a good
margin. Interestingly, though, gcc
seems to deteriorate faster than
clang
starting around the 50,000,000 insertion mark (and I have no
explanation for why it suddenly jumps back up at the very last 100,000,000
insertion count, especially because it doesn’t appear to be due to
variance).
gcc
, however, does redeem itself when comparing the unordered_map
implementations. When both are using std::hash
, gcc
starts to pull away
from clang
at around the 55,000,000 insertion mark. Here, clang
with
std::hash
begins a steady decline that makes it even slower than the
unordered_map
implementations with FarmHash at around 70,000,000
insertions.
Conclusions
These results to me speak the following trade-offs for inlineable and non-inlineable keys:
-
For small (integer) keys,
probe_map
trades build time for query throughput. It would be interesting to see whether increasing the resize ratio or maximum load factor could change the incline of the build time graph. -
For large (string) keys,
probe_map
trades memory usage for query throughput. Investigating different secondary storage types might be helpful in reducing the wasted space, but may imply other trade-offs.
If you have a need for a hash table with very fast query speed, using
probe_map
over unordered_map
might make sense (provided you won’t miss
erase()
, which is currently unimplemented because we simply don’t use it
often enough to care).
Acknowledgments
I’d like to thank Sean Massung and Chelsea Neely for their feedback on a draft version of this post, and Jeff Erickson’s Advanced Data Structures course for inspiring the work.
Edits
- 2016-01-31: Added a paragraph that discusses the use of a sentinel value for inlineable keys to signify the empty cell.
-
Seriously. Check out Bjarne Stroustrup’s GoingNative 2012 keynote at about 44 minutes. ↩
-
For linear probing (and blocked probing strategies in general), it’s been shown that your hash function needs to be five-way independent in order to guarantee good performance. Fortunately, one can construct hash functions that satisfy this property relatively easily. Unfortunately, nobody really seems to be using functions that guarantee this property in practice (including me). ↩
-
Or your own crypto system. These are both Very Bad Ideas(TM). ↩
-
Although, since then, it’s been discovered that Murmur3 hash has seed-independent collision problems, meaning that even in the face of a randomly-generated seed at program startup, people can still break your stuff. The best alternative thus far seems to be SipHash, which can be easily integrated into this framework. ↩
-
With the exception of quadratic probing, which isn’t guaranteed to find an empty slot in the table if it’s more than half full! ↩