06 Apr 2014, 10:29

Designing a persistent Bloom filter

Bloom filters are handy data structures, particularly for applications where data sets regularly exceed RAM if stored literally. For instance, they're useful as a way to implement an inner join or filter: you can stream the restrictor dataset into a Bloom filter, and then stream your restrictee through the Bloom filter, propagating only those elements that match, and taking only 1.44 * log2(1 / errorRate) bits per entry in your restrictor dataset. This is why databases like Cassandra use them extensively.
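To make that pattern concrete, here's a minimal sketch of the stream-and-filter join, using Guava's mutable BloomFilter as a stand-in for the persistent version this post is after; the data sets and the 1% error rate are made-up examples:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class BloomJoin {
    public static void main(String[] args) {
        // Restrictor: the (smaller) set of keys we want to join against.
        List<String> restrictor = Arrays.asList("alice", "bob", "carol");
        // Restrictee: the stream we want to filter down.
        List<String> restrictee = Arrays.asList("alice", "dave", "carol", "erin");

        // Size the filter for the restrictor and the error rate we can tolerate.
        BloomFilter<CharSequence> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), restrictor.size(), 0.01);
        for (String key : restrictor) {
            filter.put(key);
        }

        // Propagate only the elements that might be in the restrictor.
        for (String candidate : restrictee) {
            if (filter.mightContain(candidate)) {
                System.out.println(candidate);
            }
        }
    }
}

False positives let a few extra elements through (about 1% here), but true matches are never dropped, which is what makes this safe as a pre-filter for a join.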

Usually they’re formulated as mutable data structures over a bit-array, which is in turn implemented (at least in Java) on top of an array of longs. But there’s no reason why we have to use a mutable array; persistent data structures are desirable for many reasons, not least because they can be used with the same idioms as builtin Clojure data structures if we’re operating in that environment. How do we implement them persistently? On top of a persistent vector / array, of course.

Standard Clojure persistent vectors have object overhead, and the point of a Bloom filter is to reduce memory usage, so they're right out. You could implement it on top of a (vector-of :long) with less overhead, but there's a speed disadvantage: currently gvec / vector-of doesn't support transients, so with K hash functions in your Bloom filter, you're doing K sequential modifications to your vector, resulting in many more allocations and copies than we'd like.

Basically, designing a persistent Bloom filter comes down to the problem of representing a persistent fixed-size array in such a way that we can do K “modifications” per step in the minimum number of operations. Essentially all persistent data structures are built on top of a tree-like data structure, so we need to figure out the optimal tree layout. How do we do this?
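Before tackling that, it helps to have a concrete picture of the structure in question. Here's a minimal sketch of a fixed-size persistent long array as an F-ary tree with path copying; the class and method names are mine, not SOAC's:

// A fixed-size persistent array of longs: an F-ary tree whose leaves are
// F-wide long[] blocks. Setting one long copies every F-wide node along a
// single root-to-leaf path and shares the rest of the tree.
public final class PersistentLongArray {
    static final int F = 8;          // fanout; the parameter we're trying to choose

    private final Object root;       // Object[] of children, or long[] at a leaf
    private final int depth;         // 0 => the root itself is a long[] leaf

    private PersistentLongArray(Object root, int depth) {
        this.root = root;
        this.depth = depth;
    }

    // Build an all-zero array with capacity for at least numLongs longs.
    public static PersistentLongArray create(int numLongs) {
        int depth = 0;
        long span = F;
        while (span < numLongs) {
            span *= F;
            depth++;
        }
        return new PersistentLongArray(build(depth), depth);
    }

    private static Object build(int level) {
        if (level == 0) {
            return new long[F];
        }
        Object[] children = new Object[F];
        for (int i = 0; i < F; i++) {
            children[i] = build(level - 1);   // a real version would share empty subtrees
        }
        return children;
    }

    public long get(int idx) {
        Object node = root;
        for (int level = depth; level > 0; level--) {
            int childSpan = (int) Math.pow(F, level);   // longs per child subtree
            node = ((Object[]) node)[idx / childSpan];
            idx %= childSpan;
        }
        return ((long[]) node)[idx];
    }

    // Path copying: returns a new array sharing everything except one path.
    public PersistentLongArray set(int idx, long value) {
        return new PersistentLongArray(copyPath(root, depth, idx, value), depth);
    }

    private static Object copyPath(Object node, int level, int idx, long value) {
        if (level == 0) {
            long[] leaf = ((long[]) node).clone();      // copy the whole F-wide leaf
            leaf[idx] = value;
            return leaf;
        }
        int childSpan = (int) Math.pow(F, level);       // longs per child subtree
        Object[] copy = ((Object[]) node).clone();      // copy the whole F-wide interior node
        int slot = idx / childSpan;
        copy[slot] = copyPath(copy[slot], level - 1, idx % childSpan, value);
        return copy;
    }
}

A Bloom filter sits on top of this with the usual idx / 64 and idx % 64 bit twiddling, so the K bits set per insert become K calls to set, each of which copies one root-to-leaf path of F-wide nodes.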

When we’re constructing a Bloom filter, we provide the expected number of elements and the desired false-positive rate, which in turn give us parameters for the optimal number of hash functions and the total number of bits, so we know how many modifications per insert (K) and how many longs we’re going to need to store them (number of bits / 64). In order to make this tree persistent, we need the guarantee that we always copy a whole node if we need to modify any of its members. So, if we have a fan-out of F, we basically accumulate a cost of F for each node along each of the K paths, without double-counting shared ancestors.
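For reference, the standard sizing formulas that produce those parameters (these are the textbook derivations, not anything specific to this implementation):

// Standard Bloom filter sizing; n = expected elements, p = target false-positive rate.
// m = -n * ln(p) / (ln 2)^2 bits (equivalently n * 1.44 * log2(1/p)), k = (m/n) * ln 2.
public static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

public static int optimalHashes(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

// The tree then needs ceil(m / 64.0) longs at the leaves.
public static int numLongs(long m) {
    return (int) Math.ceil(m / 64.0);
}

That fixes K and the number of leaf longs; the per-insert copying cost then depends on how much the K root-to-leaf paths share ancestors.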

But how do we estimate the overlap? This is trickier than it seems; one of the assumptions of a Bloom filter is that the bit positions are randomly distributed. But “randomly distributed” doesn’t mean “evenly distributed”; if I drop 100 balls into 100 random buckets, I’ll end up with more than a third of the buckets empty (the expected fraction of empty buckets is (1 - 1/100)^100 ≈ 1/e ≈ 0.37). Simulate it yourself and see:

(loop [v (vec (repeat 100 0)) ct 100]
  (if (zero? ct)
    (do (println "Buckets empty:" (count (filter zero? v)))
        (println "Final bucket counts:" v))
    (recur (update-in v [(rand-int 100)] inc)
           (dec ct))))

The number of empty buckets (and conversely the number of buckets with at least one entry) is described by the classical occupancy distribution; with n balls thrown into m buckets, the expected number of empty buckets is m * (1 - 1/m)^n. Now, we could use that to figure out the number of nodes likely to be hit on each layer, or we could simulate it, which is more fun.

import java.util.Random;

//Does a Monte Carlo simulation of the cost of an insert.
//m = total bits in the filter, k = hash functions per insert, f = tree fanout.
public static double cost(long m, int k, int f){
    final Random rng = new Random();
    final int trials = 250;
    //Each long holds 64 bits
    final int n = (int)Math.ceil(m / 64.0);
    double cost=0;
    //Simulate from the top down, averaging over all trials
    for(int trial=0; trial<trials; trial++){
        //Walk the levels: widths f, f^2, ..., up to the smallest power of f >= n
        for(int width=f; width < n*f; width *= f){
            final int[] hit = new int[width];
            for(int i=0; i<k; i++){
                hit[rng.nextInt(hit.length)] = 1;
            }
            //Each distinct node hit at this level costs a copy of f slots.
            int localCost = 0;
            for(int i=0; i<hit.length; i++){
                localCost += hit[i];
            }
            //We may want to add a fudge factor to account for array header,
            //pointer access & function call overhead.
            cost += localCost * f;
        }
    }
    cost /= trials;
    return cost;
}
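Dropped into the same class as cost(), a quick sweep over candidate fanouts looks something like this; the parameters are illustrative (about 100 million bits with k = 7 is roughly what you'd get for ten million entries at around a 1% false-positive rate):

//Compare simulated per-insert copy costs across fanouts.
public static void main(String[] args) {
    final long m = 100000000L;   //~100M bits
    final int k = 7;             //hashes per insert
    for (int f : new int[]{2, 4, 8, 16, 32}) {
        System.out.println("fanout " + f + ": simulated cost " + cost(m, k, f));
    }
}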

It would be more than a bit overkill to run a Monte Carlo simulation in the constructor (although I’ve seen worse), and still a little overkill to estimate fresh parameters each time using statistics. The cost isn’t convex in the fanout because of integer effects (sometimes a higher branching factor lets you get rid of another level, and sometimes it just makes each copy costlier), so we’d have to calculate the cost for many different fanouts, and we’ll have some error in our estimate anyway since we’re running on a real machine in the physical world.

It turns out that a fanout factor of 8 has the advantage of being a power of 2, corresponds roughly to a cache line (8 longs is 64 bytes, though of course Java object headers mess this up), and gives reasonable simulated costs for large Bloom filters. I’m building a persistent Bloom filter implementation into SOAC and benchmarking, so we can see if the real world agrees with the back-of-the-envelope calculations. A fixed-size primitive array with inherent support for multi-node edits would also be a good base for a different hash table implementation that doesn’t depend on gvec.