07 Jun 2014, 14:29

B-trees are overrated; try hashing instead


There’s nothing wrong with B-trees as a data structure. In fact, there’s so little wrong that even 30 years ago, they counted as “ubiquitous”. That paper gives a few reasons why they rose to their present dominance, but the gist of it is:

  • Relatively consistent performance across a range of data sizes and key distributions
  • Good performance on range queries (in fact, if you want range queries, you usually need some kind of tree structure)
  • OK performance on random lookups
  • Reasonably compact, reasonably resistant to fragmentation
  • Well-understood concurrency mechanisms

In short, they’re generally good for a lot of very general database requirements. Hence, they’re used heavily in multi-purpose databases.

But pretty good isn’t all it’s cracked up to be; B-trees (and I’m going to use B-tree as a synecdoche for the entire family of B-trees; most databases use something at least slightly more advanced) have some problems. IO speed relative to CPU speed has been dropping steadily for decades; other than a network ping just about the most expensive thing you can do on a computer is a random disk seek - and B-trees are potentially laden with random disk seeks as you traverse the tree structure. You can avoid some of these if the interior nodes happen to be cached, except for when you do something like writing a bunch of data, a table scan, or using a bunch of other indexes. Let’s not forget that writes get written eventually - updating a dozen tree data structures in memory eventually results in disk seeks, and potentially a catastrophic cascade of writes as all the nodes split. Some extensions to B-trees (notably TokuTek’s fractal tree indexes) and some filesystems do their best to linearize those writes, but they’re at best trying to save you from yourself.

When following pointers gets too slow, the alternative is packed storage - and the packed alternative to a tree is a hash table. They are ubiquitous for in-memory storage and have well-understood performance characteristics, with a bevy of addressing and conflict-resolution strategies. The key is making them amenable to on-disk storage.

That line of research has a rich genealogy. One of the most promising variants is linear hashing. Across a dozen or so papers (culminating here), the underlying insight is that as long as your table’s base size is a power of two, you can incrementally rehash each element to either its present position, or that position + table size. Rehash them sequentially, and your table expands incrementally, maintaining validity and average load the entire way.

Of course, there are complications. Expanding front-to-back unevenly distributes the load (prior buckets have less conflicts than subsequent buckets, which may or may not be a problem depending on your conflict resolution method and hash function). You need to actually choose a conflict resolution strategy - chaining introduces random disk seeks (and the need to track allocations), and linear probing gets a little complicated to manage in combination with incremental rehashing. Range queries are right out.

There are of course other implementations with other tradeoffs, but these kinds of complications are probably the reason hash tables haven’t seized Global Durable Storage Hegemony from their cousins. Remember the part about B-trees having consistent performance? Other than high volumes of random inserts, which are merely very slow, there really aren’t a lot of good ways to break B-trees, and there are actual, effective attacks that break improperly implemented hash tables, as well as non-intentional pathological failures.

Despite the caveats, if you have a good hash algorithm, an appropriate collision resolution mechanism, and a load factor that admits decent performance given your algorithms, it’s possible to make hash tables work very well (i.e. faster than a B-tree) on disk, especially for data stores that have lots of random inserts. Immodestly, I claim to be doing so so here, in a project called lash.

The repo has details on the implementation, but I’ll mention a few highlights. Relative computing costs, and the sheer boundary of possibility, has shifted substantially since the 80s. The systems most on-disk hash algorithms were originally developed on didn’t have modern affordances like sparse files, delayed allocation, or in some cases even memory-mapped files. One thing lash gets a lot of mojo out of is the ability to do a sparse expansion of a file followed by quasi-random writes to the new space, relying on the filesystem to allocate the actual blocks lazily and schedule the writes relatively efficiently.

Essentially it’s a bog-standard power-of-two sized hash table, with incremental rehashing. When load reaches a threshold, we sparsely double the size of the backing file, and lazily rehash buckets’ worth of data from the first half into the whole space. There is some magic around the locking scheme to coordinate concurrent lazy updaters, readers, and writers - but in general it’s not anything that would be unrecognizable to the paper authors.

Performance is pretty good for my use cases, but seems to be very system dependent - fiddling with sysctl.conf’s vm.* settings I was able to bring insertion time for 30M small records from 9 minutes to 2, and whatever OS X’s default settings are resulted in a Macbook Air trouncing a Thinkpad running Ubuntu, despite having much less RAM available. It could be the PCIe SSD is just that much better than mSATA, or some qualitative difference in the way the page caches are managed - but, across environments, I see far fewer IOPs being generated per insert than with the equivalent B-tree.

The archaeology of computer science leaves plenty more good ideas to be rediscovered.