“Let’s do it Haskell!”, a friend said to me as we discussed the various choices surrounding implementing hash tables. Ultimately we opted for a quick python implementation, but the idea stuck: later I did exactly that.

Definitions

Haskell is a highly abstract programming language that leans heavily on mathematical concepts. By organizing around (mathematical!) functions rather than objects, it helps the programmer avoid a common boogeyman: shared mutable state. Since I’ve always loved mathematics and had already been refreshing my knowledge before I got to RC, it seemed a natural language to study.

Hash tables are a common computing data structured used for fast storage/lookup of data. The basic concept of a hash table involves “hashing” a part of the data into a value that is indistinguishable from random, and storing the data in a position derived from this hash value. Beyond that minimal definition, there are a variety of techniques that can be used to build hash tables.

Questions…

Starting out with a blank source file, I had some important questions on my mind before I could even deal with “how” to implement the hash table.

  1. Haskell values are all “immutable”, so how can I create a “mutable” data structure?

    • Write insert/delete functions in such a way that they return new/rebuilt versions of the data structure! My insert function declaration ended up looking like this: insert :: (k, v) -> HashTable k v -> HashTable k v
    • The above can be read as: given a key/value pair and a HashTable, return a HashTable with that pair inserted.
  2. How can I have constant time (aka O(1)) access to my hash table, if the main Haskell data structure I’ve seen so far is a linked list?

    • Lean on existing libraries that provide random access rather than sequential access!
    • Haskell does have an Array type, although the version I used isn’t perfectly ideal.
  3. I can have polymorphism?

    • Yes! Haskell makes polymorphism (handling any type of data) easy but still safe. So far I’ve only needed a few constraints on the data to be stored: keys must be Hashable and Equal, and values must be Equal.
    • The key constrains fall pretty naturally from how hash tables are intended to work; the constraint on data is unfortunately driven by an implementation detail and is something I’d like to remove eventually.

This is what I wrote to search for a key value in the table and return the data assocatiated with that key if it exists.

search :: (Hashable k, Eq k) => k -> HashTable k v -> Maybe (k, v)
search key (HashTable array) =
  List.find (\(k,v) -> k == key) bucket
  where
    position = mHash (List.length array) key
    bucket = array ! position

Collisions in space and time

Something discussed exhaustively in a recent conversation was what kind of strategy to use for hash collisions. The Birthday Paradox means that any hash table implementation must deal with collisions. In other words, what happens if the keys “Alice” and “Bob” happen to hash (modulo the size of the table) down to the same value?

There are two basic approaches: chaining and probing.

Chaining creates a “bucket” at each position of the table. Rather than storing key/value pairs in the table directly, there is another data structure (commonly a linked list) in each bucket that handles all the data hashed to that bucket.

Probing continues to store key/value pairs directly in the table, but collisions mean the table keeps checking the “next” slot, until an empty slot is available. To better spread the data and avoid too many future collisions, this is typically not done in a straight line.

For the sake of simplicity I chose the chaining strategy. The implications of this choice are fascinating and deserve a separate discussion.

Problems!

After getting a minimally correct implementation working, I started digging into the time complexity of what I’d done. Unfortunately since the Data.Array incremental update operator (//) must build a new copy of the underlying array, my insert and delete operations also operate in linear O(n) time.

This is very bad! It breaks the expected performance of hash tables, where all three basic operations should complete in constant time.

To fix this, I’ll need to replace the underlying Data.Array with something like Data.Array.MArray. And at least figure out how to work with monads, if not “what they are” quite yet.

Future Proofing

Just the process of writing this simple/naive library has proven highly educational, driving me to learn some of the details of the stack tool, how modules are organized, and “smart” constructors

mkHashTable :: Eq a => Int -> HashTable a b
mkHashTable n =
  HashTable (array(0,n-1) [(i,[]) | i <- [0..n-1]])

that simplify the process of actually creating a HashTable to mkHashTable 10, where you want 10 buckets to begin with. My goal was to abstract away the details of array creation for the user.

While I’d also like to dig into something more exciting, I’m hoping to continue expanding this little library as I learn more about Haskell:

  • change underlying immutable Array to a mutable MArray
    • …but ideally still provide a mechanism for safe concurrent access
  • dynamic resizing when the HashTable is overloaded
  • replace chaining with quadratic probing
    • trying out several different options for “tombstones”
  • implement sensible typeclasses: Functor and Foldable at least, so that a client can transform or collapse all the data in the table
  • QuickCheck to implement property testing/fuzzing
  • investigate Google style “sparsehash”