Consider the problem of finding JSON objects/documents that satisfy a complex query, with elaborate boolean combinations of required key-value pairs. Let’s restrict to the one-level case for now (ie, no nested data structures - or, equivalently, representing nested keys on a single level via the tuple of their ancestor keys). Indexing all of our KV pairs in a giant tree / trie / hash data structure is expensive; that’s essentially the size of our whole dataset. And, that only gives us one-item searches out-of-the-box - we must retrieve the (potentially large) list of objects that have a certain KV pair, then the next list, combining them as appropriate, etc.
An alternative to storing a list (or even more expensively, set) of literal objects IDs is a Bloom filter. In return for a small and configurable false positive rate, we can filter out most documents before we scan the remainder for an explicit match. Using one giant Bloom filter over
(key, value, object ID) tuples seems ugly though, since we’d have to compute the hashes with every potential object to do the check. It’s quite possibly cheaper than hitting disk, but probably not fast enough.
We could have Bloom filter for each key, and add the value / object ID pair to the Bloom filter. Xiao and Hua essentially take that approach. However, if we have an uneven distribution of keys (ie, our key space is sparse; most documents don’t have most keys) we don’t know how long to make each Bloom filter. With any constant Bloom filter length, for relatively common keys we’ll end up with lots of false positives, and for relatively uncommon keys we’ll end up with lots of wasted space.
Our proposed solution is a different two-level Bloom filter scheme. We have an array of
P Bloom filters, each containing
M bits. Based on the KV pair we’re searching for, we choose the Bloom filter at index
I = hash(key,value) % P. Indexing by hash will create collisions, particularly so if
P is much less than the total number of distinct keys or key-value pairs. This is intentional - the idea is to smooth the population of each Bloom filter so we have approximately even membership count. Essentially this associates each Bloom filter with some random shard of key-value pairs.
Within each Bloom filter, we track the presence of particular object IDs, via standard Bloom filter techniques (eg, we could use a counting Bloom filter to support removals). The entire query can be evaluated over the bitsets underlying the Bloom filters, efficiently leaving us with a (hopefully small) number of set bits, corresponding to the bit-patterns of objects which may satisfy the query. We just have to get the object IDs back out of those bit positions. We can do this by hashing all object IDs on demand, or storing a rainbow table of
hash -> object ID.
Or, consider the case when we have a relatively dense range of IDs (for instance, integers [0,whatever]), and a quasi-reversible “hash” function (eg, “rotate and XOR with random number” or “encrypt & take bottom M bits”). One would have to generate M / N keys per bit (since the high bits could be in any range, up to our max inserted key), and see if the key corresponds to the other set bits, but if the number of bits set in the output is very low, or M ends up being near N, or fetching is really expensive compared to computing the hash, it can come out worthwhile. Using quotient filters or a variant, where the keys are actually stored (and incidentally behaves better for disk access) would make this unnecessary.
The optimal number of bits
M, and number of hash functions
K, can be estimated based on the approximate average number of key-value pairs per object, the distribution of the number of key-value pairs per query, and the desired per-query false positive rate.
The two problems are:
Because the Bloom filters will “fill up” over time, they unfortunately do require periodic reconstruction. If you’re using an on-disk hash as your backing storage layer, you’re periodically rehashing every record anyway.
The need to empirically determine optimal indexing parameters.
As far as I can tell, this is a novel data structure (horray!). Eventually I will write this up “properly” with math and theoretical and empirical performance results, and it will become part of full document-DB support for Lash.