I’ve made a couple of enhancements to the lash on-disk hash table library I’ve been working on. The major addition is a BucketDiskMap class, that organizes record pointers into page-sized (4k) buckets. The previous implementation, VarSizeDiskMap, has significantly simpler internals by virtue of organizing pointers in a flat table, rather than a 2-level structure.
Why is the additional complexity worth it? Speed, of course. By organizing pointers into buckets and chaining the buckets, rather than chaining single pointers, we have a better chance that hash collisions will end up being stored together, ideally in the same memory page / disk sector. In the single-pointer case, collision resolution requires jumping all over a mmap’d file, reading more pages into RAM and thrashing more cache than we should. If instead we pack them into buckets, we’re guaranteed that the first N (in this case 128-170, depending on load) collisions will be stored in the same page, which is more or less guaranteed to be in cache already.
Writes are faster as well, mostly due to faster rehashing. VarSizeDiskMap spends a solid 60% of its time rehashing, and because it is following and rewriting pointers all over the place, an “incremental” rehash often results in a more or less complete rewrite of the backing files. Altering the rehash scheme (doing traditional full rehashes rather than quasi-linear hashing, or rehashing by evenly divided chunks rather than every-Nth-element stripes) could increase that performance, but either adds complexity or makes latency even more inconsistent. In contrast, rehashing by bucket rather than by single pointer mostly keeps us from dirtying more pages than we have to. (Side note: if you configure your system’s page cache so that it can avoid flushing dirty pages to disk over the life of your benchmark, this difference shrinks quite a bit).
So we should expect a better constant factor for the BucketDiskMap, but there is a tradeoff: now we must actually search entire buckets for a given record. Linear scans of a single memory page aren’t actually that expensive compared to a disk seek, but if the entire table is in cache, it ends up being a bottleneck. To reduce this cost, each bucket is really a mini hash table that uses the top N bits of the hash for placement within the bucket (the bottom N bits are used to choose the bucket). We use linear probing for collision resolution. Buckets are still chained into another file when they overflow, but with a decent hash function that evenly distributes keys between buckets we can make a pretty good guarantee that those chains will be rare.
The actual speed impact is downright pleasant. On a benchmark where we repeatedly putIfAbsent string keys whose frequency follows a Zipfian distribution, the results are as follows:
- 320M keys, 36M distinct values, average size of 10 bytes
- VarSizeDiskMap: 47 minutes, 23 seconds
- BucketDiskMap: 10 minutes, 1 seconds
There is still a really large and odd performance disparity between OS X and Linux (a Macbook Air smokes my Thinkpad), which is probably a combination of the IO scheduler and faster hardware.
The second enhancement is to provide a wrapper DiskMap
Having finally run some large benchmarks locally and written some decent unit tests, I’m going to call lash ready for actual usage.