Five Myths about Hash Tables

A hash table is data structure that is used to search for exact matches to a search key. For searching, they work like this:

  1. Take a search key (example: the word “cat”)
  2. Hash the search key: pass the key to a function that returns an integer value (example: the hash function returns 47 when the input key is “cat”)
  3. Use the integer value to inspect a slot to see if it contains the search key (example: look in an array at element 47)
  4. If the key matches, great, you’ve found what you’re looking for and you can retrieve whatever metadata you’re looking for (example: in array element 47, we’re found the word “cat”. We retrieve metadata that tells us “cat” is an “animal”)
  5. If the key doesn’t match and the slot contains something besides the key, carry out a secondary search process to make sure the search key really isn’t stored in the hash table; the secondary search process varies depending on the type of hashing algorithm used (example: in array element 47, we found the word “dog”. So, let’s look in slot 48 and see if “cat” is there. Slot 48 is empty, so “cat” isn’t in the hash table. This is called linear probing, it’s one kind of secondary search)

I had a minor obsession with hash tables for ten years.

The worst case is terrible

Engineers irrationally avoid hash tables because of the worst-case O(n) search time. In practice, that means they’re worried that everything they search for will hash to the same value; for example, imagine hashing every word in the English dictionary, and the hash function always returning 47. Put differently, if the hash function doesn’t do a good job of randomly distributing search keys throughout the hash table, the searching will become scanning instead.

This doesn’t happen in practice. At least, it won’t happen if you make a reasonable attempt to design a hash function using some knowledge of the search keys. Indeed, hashing is usually in the average case much closer to O(1). In practice, that means almost always you’ll find what you’re looking for on the first search attempt (and you will do very little secondary searching).

Hash tables become full, and bad things happen

The story goes like this: you’ve created a hash table with one million slots. Let’s say it’s an array. What happens when you try and insert the 1,000,001st item? Bad things: your hash function points you to a slot, it’s full, so you look in another slot, it’s full, and this never ends — you’re stuck in an infinite loop. The function never returns, CPU goes through the roof, your server stops responding, and you take down the whole of _LARGE_INTERNET_COMPANY.

This shouldn’t happen if you’ve spent time on design.

Here’s one solve: there’s a class of hash tables that deal with becoming full. They work like this: when the table becomes x% full, you create a new hash table that is (say) double the size, and move all the data into the new hash table by rehashing all of the elements that are stored in it. The downside is you have to rehash all the values, which is an O(n) [aka linear, scanning, time consuming] process.

Here’s another solve: instead of storing the metadata in the array, make your array elements pointers to a linked list of nodes that contain the metadata. That way, your hash table can never get full. You search for an element, and then traverse the linked list looking for a match; traversing the linked list is your secondary hash function. Here’s a diagram that shows a so-called chained hash table:

A chained hash table, showing “John Smith” and “Sandra Dee” both in slot 152 of the hash table. You can see that they’re chained together in a linked list. Taken from Wikipedia, http://en.wikipedia.org/wiki/Hash_table

Chained hash tables are a very good idea. Problem solved.

By the way, I recommend you create a hash table that’s about twice the size of the number of elements you expect to put into the table. So, suppose you’re planning on putting one million items in the table, go ahead and create a table with two million slots.

Trees are better

In general, trees aren’t faster for searching for exact matches. They’re slower, and here’s peer-reviewed evidence that compares B-trees, splay trees, and hashing for a variety of string types. So, why use a tree at all? Trees are good for inexact matches — say finding all names that begin with “B” — and that’s a task that hash tables can’t do.

Hash functions are slow

I’ve just pointed you to research that shows that must not be true. But there is something to watch out for: don’t use the traditional modulo based hash function you’ll find in your algorithms text book; for example, here’s Sedgewick’s version, note the “% M” modulo operation that’s performed once per character in the input string. The modulo is used to ensure that the hash value falls within the size of the hash table — for example, if we have 100 slots and the hash function returns 147, the modulo turns that into 47. Instead, do the modulo once, just before you return from the hash function.

Here’s the hash function I used in much of my research on hashing. You can download the code here.

A very fast hash function written in C. This uses bit shifts, and a single modulo at the end of the function. If you find a faster one, would love to hear about it.

Hash tables use too much memory

Again, not true. The same peer-reviewed evidence shows that a hash table uses about the same memory as an efficiently implemented tree structure. Remember that if you’re creating an array for a chained hash table, you’re just creating an array of pointers — you don’t really start using significant space until you’re inserting nodes into the linked lists that are chained from the hash table (and those nodes typically take less space than nodes in a tree, since they only have one pointer).

One Small Trick

If you want your hash tables to really fly for searching, move any node that matches your search to the front of the chained list that it’s in. Here’s more peer-reviewed evidence that shows this works great.

Alright. Viva Hash Tables! See you next time.

18 thoughts on “Five Myths about Hash Tables

  1. Ardent Logophile

    Hugh,

    Thank you for the post! Love it! I am one of those engineers who is obsessed with Hash Tables.

    Isn’t it true that every hash function has its pathological data set, and hence the use of universal hashing? I believe its not easy to come up with one best hash function that performs well with every possible data set in the universe.

  2. pixl8ed

    Great post! How about a followup talking about one of my favorites — perfect hash functions, for the special case when you know all of your inputs?

  3. Brian Hurt (@bhurt42)

    The problem is that if you don’t resize your hash table (the O(N) operation), you can end up with O(N) searches if you greatly misjudge the initial size. If you create a hash table with 100 entries, and stick a million elements into it, the average node is going to have 10,000 entries in it’s linked list- and a search takes O(10,000) time. In general, search takes O(N/k) time for a k-bucket hash table. The nice thing about trees is that they “fail safe”- a million element tree is only ~3x slower than a 100 element tree. The base speed is slower, but you don’t fall off a cliff if you screw up (are the subject of an exploit).

    Argent Logophile: it’s easy to prove that all hash functions have pathological data sets, via the pigeon hole principle.

  4. Alan

    Does anyone actually believe any of those 5 myths? I often ask interview candidates—at a well-known Internet company—to implement a hashtable, and I’ve never encountered anyone who could implement a hashtable yet suffered from these misconceptions.

    (1) The worst-case O(n) performance is only really relevant in adversarial situations, when an attacker is trying to deny service; it never happens by chance unless you’ve screwed up your hash functions.

    (2) The use of linked lists for buckets is nearly universal, AFAICT.

    (3) Hash functions may be slow for custom objects, but they can often be computed on first use and saved in an object field.

    Also, it’s better not to do “mod M” in your hash function at all: the hashtable logic should be doing this, since it knows the current size of the table, and it may vary as resizes occur. Sedgwick’s style dates from a time when users would roll their own type-specific fixed-size (M) hashtable rather than use a generic library implementation.

    (4) Trees have log(n) insert/update/delete time, so obviously they’re slower unless they’re very small. One uses a tree when you want an ordered set or map.

    (5) Empty hash tables usually preserve some space (e.g. 10 words) so they do use more space than an empty tree, and since “most collections are empty” this can be a waste. I suspect that’s why C++ programmers still prefer std::map when they don’t expect it to get very big.

    Java, Python and Go programmers use hashtables like there’s no tomorrow—far more than they use red/black tree alternatives. C++ users seem to use both about equally in my experience.

    From which group did you hear these myths?

  5. Ardent Logophile

    Just finished reading both the papers cited as evidence in the post. Really enjoyed reading the extensive treatment of burst tries! Thank you!

  6. Anon Ymous

    As has been pointed out, the issue arises when you have an adversarial input. Read the relevant BlackHat talks on computational complexity attacks. A lot of software systems are used in security critical applications nowadays, and you can’t just have a controller in a 747 stall because of nasty input values — and yes, 747’s also run off the shelf unixes nowadays

  7. Jorge

    Normally small and medium size applications should not use hash tables with millions of rows in their memory space. The design/architecture of this type of applications should include a Database Engine behind in order to handle big tables. Under this conditions you won’t have issues you are mentioning here with hash tables (and others) because the DB Engine will do that for (and better than) you.

    Could someone please illustrate an example of an application in a valid scenario that requires handling a big hash table by itself? But please, make sure that adding a database engine was not the best answer for your example.

  8. terry

    mmmm… Jorge. The article is about hashtable myths, no? Not about DB engines vs hashtables.

    Nevertheless, to anwer your question… twenty-plus years ago I wrote a full text search for a CD product (a collection of articles). The text and index were fixed. The base target computer was a 386 with a 2X CD reader, running Windows 3.1. I used a large closed hashtable with two finely tuned hashing functions to minimize the chain length. I was able to achieve subsecond full text search on this ridiculously slow platform, generating O(n) I/O to the CD drive, where n signified the number of non-noise words in the phrase being searched. The second hash value let me avoid strcmp on search values, (vastly) far enough from the primary hash to let me measurably guarantee no primary+secondary hash collisions.

    And Chris…
    Mind you, the example hashing function above is just a starting point when developing like I did, with a fixed dataset. I experimented for days to reach an optimal chain length < 3 compared to table size.

    In other words, you are allowed stand on someone else's shoulders and improve on their work when you need to. I rarely need to tune hashing functions now since they're rarely a bottleneck. (Don't optimize the fast parts, ya know what I mean?)

  9. Mike

    return((unsigned int)((h&0x7fffffff) % tsize));

    It concerns me when folks use return as if it were a function. No harm, no foul, tho.

    When I have a hash conflict, I just put the item into the next free slot. So at that point future lookups become a linear search with small N — quite simple and effective. The key is to keep N small for your dataset.

  10. Leonid Boytsov (@srchvrs)

    I would note that there are ways to avoid worst-case search time, e.g.: cuckoo hashing and perfect hashing. These are not ideal, of course. A perfect hash function may not materialize (a small probability, but I have seen this in practice). And it can in fact be slower than regular hashing. Cuckoo hashing, AFAIK can also suffer from adversarial input, but only at the time of construction.

    I agree about different hashing functions being essentially same: computation of a hash function takes very little. A random access to a memory cell takes long: hundreds of CPU cycles.

    Final note: you can replace a hash with a trie. Runtime is comparable and I saw some tries being faster than hashes.

  11. Pingback: Five Myths about Hash Tables « thoughts…

  12. Admin

    Great post indeed. Being a java developer, most people struggle to choose between HashMap and HashTable and that’s it.
    Your research will force them to think more, including me.

  13. Pingback: Hash Tables | Jeff Plumb

  14. Pingback: Reflecting on my hash table post | Hugh E. Williams

  15. Pingback: Armchair Guide to Data Compression | Hugh E. Williams

  16. Toth Arpad

    Some of these “myths” are actually very true. What you say is true only for string keys, but what about hashing variable length int arrays? Then all your myths may become reality, and the sad thing is that it is usual to hash custom keys.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s