Bryce Nyeggen's Blog

A "web presence" as they say

Two-level Bloom Filters for Complex Query Support

Consider the problem of finding JSON objects/documents that satisfy a complex query, with elaborate boolean combinations of required key-value pairs. Let’s restrict to the one-level case for now (ie, no nested data structures – or, equivalently, representing nested keys on a single level via the tuple of their ancestor keys). Indexing all of our KV pairs in a giant tree / trie / hash data structure is expensive; that’s essentially the size of our whole dataset. And, that only gives us one-item searches out-of-the-box – we must retrieve the (potentially large) list of objects that have a certain KV pair, then the next list, combining them as appropriate, etc.

An alternative to storing a list (or even more expensively, a set) of literal object IDs is a Bloom filter. In return for a small and configurable false positive rate, we can filter out most documents before we scan the remainder for an explicit match. Using one giant Bloom filter over (key, value, object ID) tuples seems ugly though, since we’d have to compute the hashes for every potential object to do the check. It’s quite possibly cheaper than hitting disk, but probably not fast enough.

We could have a Bloom filter for each key, and add each value / object ID pair to that key’s Bloom filter. Xiao and Hua essentially take that approach. However, if we have an uneven distribution of keys (ie, our key space is sparse; most documents don’t have most keys) we don’t know how long to make each Bloom filter. With any constant Bloom filter length, for relatively common keys we’ll end up with lots of false positives, and for relatively uncommon keys we’ll end up with lots of wasted space.

Our proposed solution is a different two-level Bloom filter scheme. We have an array of P Bloom filters, each containing M bits. Based on the KV pair we’re searching for, we choose the Bloom filter at index I = hash(key,value) % P. Indexing by hash will create collisions, particularly so if P is much less than the total number of distinct keys or key-value pairs. This is intentional – the idea is to smooth the population of each Bloom filter so we have approximately even membership count. Essentially this associates each Bloom filter with some random shard of key-value pairs.

Within each Bloom filter, we track the presence of particular object IDs, via standard Bloom filter techniques (eg, we could use a counting Bloom filter to support removals). The entire query can be evaluated over the bitsets underlying the Bloom filters, efficiently leaving us with a (hopefully small) number of set bits, corresponding to the bit-patterns of objects which may satisfy the query. We just have to get the object IDs back out of those bit positions. We can do this by hashing all object IDs on demand, or storing a rainbow table of hash -> object ID.
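As a sketch (the class, hash mixing, and constants are mine, for illustration only), the whole scheme fits in a page of Java:

```java
import java.util.BitSet;

public class TwoLevelBloom {
    final int P, M, K;
    final BitSet[] filters;

    public TwoLevelBloom(int p, int m, int k) {
        P = p; M = m; K = k;
        filters = new BitSet[P];
        for (int i = 0; i < P; i++) filters[i] = new BitSet(M);
    }

    // First level: hash the KV pair to one of P filters (a "random shard").
    int filterFor(String key, String value) {
        return Math.floorMod((key + "\u0000" + value).hashCode(), P);
    }

    // Second level: set K bits for the object ID within that filter.
    int bitFor(long objectId, int i) {
        long h = (objectId + 1) * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
        h ^= h >>> 33;
        return (int) Math.floorMod(h, (long) M);
    }

    public void add(String key, String value, long objectId) {
        BitSet f = filters[filterFor(key, value)];
        for (int i = 0; i < K; i++) f.set(bitFor(objectId, i));
    }

    // Might this object have the KV pair? False positives are possible.
    public boolean mightContain(String key, String value, long objectId) {
        BitSet f = filters[filterFor(key, value)];
        for (int i = 0; i < K; i++)
            if (!f.get(bitFor(objectId, i))) return false;
        return true;
    }

    // Boolean AND of two KV predicates, evaluated directly on the bitsets.
    public BitSet candidatesForAnd(String k1, String v1, String k2, String v2) {
        BitSet out = (BitSet) filters[filterFor(k1, v1)].clone();
        out.and(filters[filterFor(k2, v2)]);
        return out;
    }
}
```

OR is just as easy (swap `and` for `or`), which is what makes arbitrary boolean combinations cheap at this level.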

Or, consider the case when we have a relatively dense range of IDs (for instance, integers [0,whatever]), and a quasi-reversible “hash” function (eg, “rotate and XOR with a random number” or “encrypt & take bottom M bits”). One would have to generate M / N keys per bit (since the high bits could be in any range, up to our max inserted key), and see if the key corresponds to the other set bits, but if the number of bits set in the output is very low, or M ends up being near N, or fetching is really expensive compared to computing the hash, it can come out worthwhile. Using quotient filters or a variant, where the keys are actually stored (and which incidentally behave better for disk access), would make this unnecessary.
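For illustration, a “rotate and XOR” function really is cheaply invertible (the rotation amount and constant here are arbitrary choices of mine):

```java
public class ReversibleHash {
    static final long C = 0x9E3779B97F4A7C15L;

    // Rotation and XOR are both bijective on 64-bit words...
    static long hash(long x) {
        return Long.rotateLeft(x, 17) ^ C;
    }

    // ...so we can walk backwards from a hash output to the ID that produced it.
    static long unhash(long h) {
        return Long.rotateRight(h ^ C, 17);
    }
}
```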

The optimal number of bits M, and number of hash functions K, can be estimated based on the approximate average number of key-value pairs per object, the distribution of the number of key-value pairs per query, and the desired per-query false positive rate.
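For reference, the standard single-filter sizing formulas apply per shard, taking n as the expected number of object IDs landing in one of the P filters:

```java
public class BloomSizing {
    // Optimal number of hash functions: K = (M/n) ln 2.
    static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Expected false positive rate per membership check: (1 - e^(-Kn/M))^K.
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    // Bits per filter needed to hit a target false positive rate p.
    static long bitsFor(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
}
```

The per-query rate then compounds across the terms of the boolean expression, which is why the distribution of key-value pairs per query enters into the estimate.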

The two problems are:

  • Because the Bloom filters will “fill up” over time, they unfortunately do require periodic reconstruction. If you’re using an on-disk hash as your backing storage layer, you’re periodically rehashing every record anyway.

  • The need to empirically determine optimal indexing parameters.

As far as I can tell, this is a novel data structure (hooray!). Eventually I will write this up “properly” with math and theoretical and empirical performance results, and it will become part of full document-DB support for Lash.

Lash Update: Speed and Friendliness

I’ve made a couple of enhancements to the lash on-disk hash table library I’ve been working on. The major addition is a BucketDiskMap class, that organizes record pointers into page-sized (4k) buckets. The previous implementation, VarSizeDiskMap, has significantly simpler internals by virtue of organizing pointers in a flat table, rather than a 2-level structure.

Why is the additional complexity worth it? Speed, of course. By organizing pointers into buckets and chaining the buckets, rather than chaining single pointers, we have a better chance that hash collisions will end up being stored together, ideally in the same memory page / disk sector. In the single-pointer case, collision resolution requires jumping all over a mmap’d file, reading more pages into RAM and thrashing more cache than we should. If instead we pack them into buckets, we’re guaranteed that the first N (in this case 128-170, depending on load) collisions will be stored in the same page, which is more or less guaranteed to be in cache already.

Writes are faster as well, mostly due to faster rehashing. VarSizeDiskMap spends a solid 60% of its time rehashing, and because it is following and rewriting pointers all over the place, an “incremental” rehash often results in a more or less complete rewrite of the backing files. Altering the rehash scheme (doing traditional full rehashes rather than quasi-linear hashing, or rehashing by evenly divided chunks rather than every-Nth-element stripes) could increase that performance, but either adds complexity or makes latency even more inconsistent. In contrast, rehashing by bucket rather than by single pointer mostly keeps us from dirtying more pages than we have to. (Side note: if you configure your system’s page cache so that it can avoid flushing dirty pages to disk over the life of your benchmark, this difference shrinks quite a bit).

So we should expect a better constant factor for the BucketDiskMap, but there is a tradeoff: now we must actually search entire buckets for a given record. Linear scans of a single memory page aren’t actually that expensive compared to a disk seek, but if the entire table is in cache, it ends up being a bottleneck. To reduce this cost, each bucket is really a mini hash table that uses the top N bits of the hash for placement within the bucket (the bottom N bits are used to choose the bucket). We use linear probing for collision resolution. Buckets are still chained into another file when they overflow, but with a decent hash function that evenly distributes keys between buckets we can make a pretty good guarantee that those chains will be rare.
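The index arithmetic looks something like this (a sketch with made-up names and sizes, not lash’s literal internals):

```java
public class BucketAddressing {
    static final int SLOTS_PER_BUCKET = 128; // ~4k page / 32-byte entries

    // Bottom bits of the hash choose the bucket...
    static int bucketIndex(long hash, int numBuckets) {
        // numBuckets assumed to be a power of two
        return (int) (hash & (numBuckets - 1));
    }

    // ...top bits choose the starting slot within the bucket, so the slot
    // choice is independent of the bucket choice.
    static int slotInBucket(long hash) {
        return (int) (hash >>> (64 - 7)) & (SLOTS_PER_BUCKET - 1);
    }

    // Linear probing within the bucket: slot, slot+1, ... mod SLOTS_PER_BUCKET.
    static int probe(int slot, int attempt) {
        return (slot + attempt) & (SLOTS_PER_BUCKET - 1);
    }
}
```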

The actual speed impact is downright pleasant. On a benchmark where we repeatedly putIfAbsent string keys whose frequency follows a Zipfian distribution, the results are as follows:

  • 320M keys, 36M distinct values, average size of 10 bytes
  • VarSizeDiskMap: 47 minutes, 23 seconds
  • BucketDiskMap: 10 minutes, 1 second

There is still a really large and odd performance disparity between OS X and Linux (a Macbook Air smokes my Thinkpad), which is probably a combination of the IO scheduler and faster hardware.

The second enhancement is to provide a wrapper DiskMap<K,V> that fully satisfies the ConcurrentMap interface, and can deal with actual Java objects beyond just raw byte arrays (as long as you provide a de/serializer; implementations are already provided for Strings and primitives). This makes the project significantly more usable.

Having finally run some large benchmarks locally and written some decent unit tests, I’m going to call lash ready for actual usage.

B-trees Are Overrated; Try Hashing Instead

There’s nothing wrong with B-trees as a data structure. In fact, there’s so little wrong that even 30 years ago, they counted as “ubiquitous”. That paper gives a few reasons why they rose to their present dominance, but the gist of it is:

  • Relatively consistent performance across a range of data sizes and key distributions
  • Good performance on range queries (in fact, if you want range queries, you usually need some kind of tree structure)
  • OK performance on random lookups
  • Reasonably compact, reasonably resistant to fragmentation
  • Well-understood concurrency mechanisms

In short, they’re generally good for a lot of very general database requirements. Hence, they’re used heavily in multi-purpose databases.

But pretty good isn’t all it’s cracked up to be; B-trees (and I’m going to use B-tree as a synecdoche for the entire family of B-trees; most databases use something at least slightly more advanced) have some problems. IO speed relative to CPU speed has been dropping steadily for decades; other than a network ping just about the most expensive thing you can do on a computer is a random disk seek – and B-trees are potentially laden with random disk seeks as you traverse the tree structure. You can avoid some of these if the interior nodes happen to be cached, except for when you do something like writing a bunch of data, a table scan, or using a bunch of other indexes. Let’s not forget that writes get written eventually – updating a dozen tree data structures in memory eventually results in disk seeks, and potentially a catastrophic cascade of writes as all the nodes split. Some extensions to B-trees (notably TokuTek’s fractal tree indexes) and some filesystems do their best to linearize those writes, but they’re at best trying to save you from yourself.

When following pointers gets too slow, the alternative is packed storage – and the packed alternative to a tree is a hash table. They are ubiquitous for in-memory storage and have well-understood performance characteristics, with a bevy of addressing and conflict-resolution strategies. The key is making them amenable to on-disk storage.

That line of research has a rich genealogy. One of the most promising variants is linear hashing. Across a dozen or so papers (culminating here), the underlying insight is that as long as your table’s base size is a power of two, you can incrementally rehash each element to either its present position, or that position + table size. Rehash them sequentially, and your table expands incrementally, maintaining validity and average load the entire way.
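In code, the invariant is tiny: for a power-of-two table, an element’s new home after doubling is determined by one additional hash bit.

```java
public class LinearHashSplit {
    // Bucket index in a power-of-two sized table.
    static int bucket(long hash, int tableSize) {
        return (int) (hash & (tableSize - 1));
    }

    // When the table doubles from oldSize to 2*oldSize, an element either
    // stays put or moves up by exactly oldSize, depending on one hash bit.
    static int newBucket(long hash, int oldSize) {
        return ((hash & oldSize) == 0)
                ? (int) (hash & (oldSize - 1))
                : (int) (hash & (oldSize - 1)) + oldSize;
    }
}
```

Rehashing bucket 0, then 1, and so on, each element is touched once, and the table is valid (at its interim mixed size) the entire way.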

Of course, there are complications. Expanding front-to-back unevenly distributes the load (prior buckets have fewer conflicts than subsequent buckets, which may or may not be a problem depending on your conflict resolution method and hash function). You need to actually choose a conflict resolution strategy – chaining introduces random disk seeks (and the need to track allocations), and linear probing gets a little complicated to manage in combination with incremental rehashing. Range queries are right out.

There are of course other implementations with other tradeoffs, but these kinds of complications are probably the reason hash tables haven’t seized Global Durable Storage Hegemony from their cousins. Remember the part about B-trees having consistent performance? Other than high volumes of random inserts, which are merely very slow, there really aren’t a lot of good ways to break B-trees, and there are actual, effective attacks that break improperly implemented hash tables, as well as non-intentional pathological failures.

Despite the caveats, if you have a good hash algorithm, an appropriate collision resolution mechanism, and a load factor that admits decent performance given your algorithms, it’s possible to make hash tables work very well (i.e. faster than a B-tree) on disk, especially for data stores that have lots of random inserts. Immodestly, I claim to be doing so here, in a project called lash.

The repo has details on the implementation, but I’ll mention a few highlights. Relative computing costs, and the sheer boundary of possibility, have shifted substantially since the 80s. The systems most on-disk hash algorithms were originally developed on didn’t have modern affordances like sparse files, delayed allocation, or in some cases even memory-mapped files. One thing lash gets a lot of mojo out of is the ability to do a sparse expansion of a file followed by quasi-random writes to the new space, relying on the filesystem to allocate the actual blocks lazily and schedule the writes relatively efficiently.

Essentially it’s a bog-standard power-of-two sized hash table, with incremental rehashing. When load reaches a threshold, we sparsely double the size of the backing file, and lazily rehash buckets’ worth of data from the first half into the whole space. There is some magic around the locking scheme to coordinate concurrent lazy updaters, readers, and writers – but in general it’s not anything that would be unrecognizable to the paper authors.

Performance is pretty good for my use cases, but seems to be very system dependent – fiddling with sysctl.conf’s vm.* settings I was able to bring insertion time for 30M small records from 9 minutes to 2, and whatever OS X’s default settings are resulted in a Macbook Air trouncing a Thinkpad running Ubuntu, despite having much less RAM available. It could be the PCIe SSD is just that much better than mSATA, or some qualitative difference in the way the page caches are managed – but, across environments, I see far fewer IOPs being generated per insert than with the equivalent B-tree.

The archaeology of computer science leaves plenty more good ideas to be rediscovered.

Book Review: Clojure for Machine Learning

Clojure for Machine Learning is a good basic introduction to some diverse libraries (and even a few underlying implementations, like a pretty good from-scratch implementation of a neural network), and it gives a nice, concise, and accurate intro to the math behind some of the techniques. If you are coming at it from the other direction, reasonably familiar with some bestiary of ML techniques and wondering how to idiomatically apply them in Clojure (not easily – persistent data structures and high-performance matrix math don’t really jibe) you’re likely to be disappointed.

On the other hand, the biggest issues with applying ML techniques aren’t things like “how do I run a logistic regression”; they’re things like “my data doesn’t fit in memory anymore”, “how do I get fitting time under 8 hours”, “how do I get the last 2% of accuracy I need”, or “should I be running a logistic regression in the first place”. This is the sort of thing that’s very difficult to approach holistically in a single book, especially a brisk 270-page book that covers ten or so technique variants. To be fair, the author does bring up some meta-level issues like overfitting, regularization, and precision / recall tradeoffs, but it’s not really aimed at giving you a deep understanding of the tricky parts.

So in sum, this is a nice book to put in an office library for an occasional bootstrap, or if you’re using Clojure already and you want to dip your toes in the ML realm. Look at the table of contents, and see if there’s a good amount of “stuff” that looks intriguing (there is a really good range of coverage). But, if you have an interest in a particular technique or problem you’re better off implementing it from scratch or diving deeply into a solid library and learning the nuts and bolts on your own.

This review has been cross-posted to Amazon.

Memory-mapping >2gb of Data in Java

Memory-mapping is your friend if you’re doing IO. You avoid the relatively expensive system calls entailed in repeatedly calling raw read or write (or even their stdlib-based buffered counterparts) in favor of transparent access to an array of bytes – the system’s page cache handles actual disk access if and only if it is necessary. Just about the worst thing that can be said about it is that it may not be measurably faster than the system calls, given the right access pattern (for example, linearly copying an entire file) – but if it is a better abstraction for you, there is rarely a negative performance impact.

The use of the OS memory facilities can be a double-edged sword, though, in a 32-bit environment. If you only have 32 bits of memory space, you can only map a 4gb file to memory (or 2gb if you’re using signed integers). In contrast, you can do a raw fread() from an arbitrary offset, as long as you’re not reading more than 2^32 bytes worth of data.

Fortunately, we’ve had 64 bit operating systems for quite a while – depending on how consumer-grade you’re counting, since the mid-90s or mid-00s. On a modern machine you don’t have to worry about running out of address space and can map as large a file as you’re likely to encounter. That is, unless your programming environment doesn’t give you access to it.

Java standard library APIs have always been a little bit uneven, and unfortunately their mmap abstraction is to return a MappedByteBuffer. All ByteBuffers are constrained (same as arrays, unfortunately) to have at most 2^31-1 elements. Even worse, they’re not threadsafe. You can only copy a particular chunk of bytes by calling position(int newPosition), then get(byte[] dst). In between those, another caller may have altered your position to somewhere else. That means you need to do a duplicate() call before everything to get an independently-positionable shared-content “copy”.

One approach is to wrap your buffer in a class that handles the duplicate-before-anything calls, and maps a series of buffers sufficient to cover the entire file (mindful of the fact that you might have to read across the boundary). This is a corner case waiting to happen, and slower than it needs to be. Mmap is a beautiful abstraction and Java has seemingly sucked the fun right out of it with its 1995-vintage design decisions.
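For concreteness, here’s a minimal sketch of that wrapper approach (chunk size is a parameter so the boundary corner case is easy to exercise; a real implementation would use chunks near Integer.MAX_VALUE):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BigMappedFile {
    final MappedByteBuffer[] chunks;
    final long chunkSize;
    final long length;

    public BigMappedFile(Path path, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            length = ch.size();
            int n = (int) ((length + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long off = i * chunkSize;
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, off,
                                   Math.min(chunkSize, length - off));
            }
        }
    }

    // Fill dst from file offset pos; assumes pos + dst.length <= file length.
    public void get(long pos, byte[] dst) {
        int copied = 0;
        while (copied < dst.length) {
            int chunk = (int) ((pos + copied) / chunkSize);
            int within = (int) ((pos + copied) % chunkSize);
            // duplicate() gives an independently positionable view, so
            // concurrent readers don't clobber each other's position.
            ByteBuffer view = chunks[chunk].duplicate();
            int toCopy = (int) Math.min(dst.length - copied, chunkSize - within);
            view.position(within);
            view.get(dst, copied, toCopy);
            copied += toCopy;
        }
    }
}
```

Every read pays for a duplicate() and the offset arithmetic, and reads straddling a boundary turn into two copies — workable, but exactly the sort of ceremony mmap was supposed to spare us.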

The question arises – if the Java standard library designers did it wrong the first time, can we do it right instead? Presumably they’re wrapping the underlying system call that does what we want – we should be able to unwrap the cruft and use it directly.

In fact, deep in the bowels of a non-public class, we can find mmap lurking as if it was some kind of Lovecraftian artifact. Let’s take it out to play:
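A reconstruction of the general shape of the trick (not the original listing): the non-public class in question is sun.nio.ch.FileChannelImpl, and both the method signature and its accessibility vary by JDK version — this matches JDK 8, and will need --add-opens flags or fail outright on newer releases.

```java
import java.io.RandomAccessFile;
import java.lang.reflect.Method;

public class RawMmap {
    // Invoke the private native map0(prot, position, length) and get back the
    // raw address of the mapping. prot 0 is FileChannelImpl's MAP_RO constant;
    // position must be page-aligned, and length is rounded up to a page.
    public static long map(RandomAccessFile file, long position, long length) throws Exception {
        Method map0 = Class.forName("sun.nio.ch.FileChannelImpl")
                .getDeclaredMethod("map0", int.class, long.class, long.class);
        map0.setAccessible(true);
        return (long) map0.invoke(file.getChannel(), 0, position, length);
    }
}
```

Reads then go through sun.misc.Unsafe at the returned address, and (on the same JDK 8 internals) the matching private static unmap0(long, long) is reachable the same way to tear the mapping down.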

Ahhh, much better. Native-byte-order integers, raw memory-copies at known offsets, reflection calls to invoke what the compiler informs us is That Which Should Not Be Summoned – it all adds up to a remarkably simple and efficient interface.

Mapping Distinct Objects to a Dense Range of Integers

Let’s say you’re trying to turn distinct strings (or some other objects) into integers. This pops up all the time – for instance, if you want to analyze a social network, it’s faster if you can access your adjacency list via integer offsets into an array rather than looking up name strings in a hash table.

The desired properties are that:

  • You get a distinct integer for each input
  • The output is as dense (sequential, with no gaps) as possible, but 100% density is unnecessary
  • You can construct the output online (ie, you never need to re-number)
  • Support get-or-insert, get-or-fail, and remove operations
  • It’s wicked fast

The straightforward strategy is a read-check-generate pattern, secured with a ReadWriteLock:

final ReadWriteLock lock = new ReentrantReadWriteLock();
final Map<String, Integer> map = new HashMap<String, Integer>();
for(final String s : data) {
    lock.readLock().lock();
    try {
        if(map.containsKey(s)) continue;
    } finally {
        lock.readLock().unlock();
    }
    lock.writeLock().lock();
    try {
        if(map.containsKey(s)) continue;
        map.put(s, map.size());
    } finally {
        lock.writeLock().unlock();
    }
}
This will produce an appropriately dense, distinct mapping. Unfortunately, it fails the last couple of criteria – there’s no way to support removal, and that single write lock adds a fair amount of contention that slows things down. We can fix the first one:

public class StringMapper {
    final ReadWriteLock lock = new ReentrantReadWriteLock();
    final Map<String, Integer> map = new HashMap<String, Integer>();
    int lastIdx = 0;
    //Any collection would work here; this has the advantage of concentrating
    //"holes" at the end, and being unbounded
    final LinkedList<Integer> freelist = new LinkedList<Integer>();

    public void remove(String s){
        lock.writeLock().lock();
        try {
            Integer current = map.remove(s);
            if(current == null) return;
            freelist.add(current);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer get(String s){
        lock.readLock().lock();
        try {
            return map.get(s);
        } finally {
            lock.readLock().unlock();
        }
    }

    public Integer add(String s){
        Integer out = get(s);
        if(out != null) return out;

        lock.writeLock().lock();
        try {
            out = map.get(s);
            if(out != null) return out;
            out = !freelist.isEmpty() ? freelist.poll() : lastIdx++;
            map.put(s, out);
            return out;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
A little more ceremony here – just like in ye olde malloc, we have to maintain a freelist, and we have to track the last insertion point not on the freelist so we can make fresh allocations.

That write lock is still really annoying, though. We have concurrent maps and atomic integers built into the standard library, it seems like we should be able to figure this out without any of our own locking.

public class StringMapper{
    final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();
    final AtomicInteger lastIdx = new AtomicInteger(0);
    final Queue<Integer> freelist = new ConcurrentLinkedQueue<Integer>();

    public void remove(String s){
        Integer current = map.remove(s);
        if (current != null) freelist.add(current);
    }

    public Integer get(String s){
        return map.get(s);
    }

    public Integer add(String s){
        //If we're doing mostly fresh insertions, the initial check can be skipped
        Integer out = map.get(s);
        if(out != null) return out;
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        out = map.putIfAbsent(s, in);
        if(out != null) {
            //Lost the race; recycle our integer for a later insert
            freelist.add(in);
            return out;
        } else return in;
    }
}
The remove and get methods are straightforward; the add method is where the magic happens. On each attempted insert, we first check the map. Since we re-do the check later, if we’re likely to do mainly insertions (because we have a mostly distinct dataset as input) we can skip the initial one if we want. If we actually intend on inserting, we first poll from the freelist, and if it is empty, get a fresh integer from our counter variable. If we’re successfully able to insert this value into the map, we also return it; otherwise, it goes back into the freelist.

This isn’t the best mechanism if you really need 100% density (but if you really need 100% density, you can’t support removals either), because there is a race condition between getting a new value and the putIfAbsent call. In between, another thread may have gotten a later value and added it to the map, generating a “hole” at your index and forcing you to add it to the free list. However, the number of total “holes” should be roughly bounded by the number of concurrent threads you’re running. If you have, say, 16 threads simultaneously inserting the same key, in the worst case you end up with an inserted value of, say, 15, and 0-14 in the freelist. Subsequent iterations check the freelist first though, so even if we continue that worst case scenario, we’ll be churning the freelist (repeatedly draining it, replacing the same elements, and adding an integer derived from the counter instead) but ending up with only 15 total gaps in our output.

Book Review: Red-Blooded Risk by Aaron Brown

TLDR: This book is recommended for anyone who uses math to make money, whether in finance, or (in my case) the areas of the tech sector that rely on modeling customer behavior to make a buck.

This is not a how-to manual, or even a book that tries to make a monolithic argument (although if you take it seriously, one seems to emerge). It’s more of a conversation with Aaron Brown, a prominent (I’m told) risk-manager, old-school quant, and all-around interesting guy in the financial field. It’s easy for these kinds of conversational “what I’ve learned, what I think” books to come out sideways – a barely-edited transcript of an “author” and their ghostwriter (the plague of political memoirs), a platform for the airing of grievances, a journey into the depths of self-ego-stroking, or a collection of platitudes. Fortunately, Brown is better than this.

The book starts to heat up with an interesting interpretation of the Dutch “tulip mania”, which he contends was not a classical mania at all, but a somewhat rational investment in a commodity with characteristics making it a good money substitute. Tulip bulbs (the commodity actually invested in, not tulips per se) are portable, somewhat durable, have predictable inflation rates via an annuity in the form of more bulbs, are impossible to counterfeit or debase, and have an aesthetically pleasing, branded output. It makes at least as much sense to use them for money as it does wampum, and possibly even more than gold or silver, especially when you have a massive inflationary influx of bullion from the New World and a government that debases coins. Their exchange rate for classical coin is somewhat explainable by purely economic and legal factors.

After some digression into the nature of “money” per se, this is used as an entry to a discussion of “money” vs. options / futures / derivatives. In Brown’s formulation, the functional purpose of commodity futures, like an agreement to buy or sell a quantity of wheat for a certain price at a certain time, is not a way to “lock in a price” in the sense of hedging price risk. A miller typically is not “long wheat” in the futures market to guard against price spikes – a price spike impacts him in unpredictable ways (for instance, it may be due to either an increase in demand, or a fall in supply, which would have opposite effects on his business). Instead he contracts for purchase on a more flexible retail basis (not in the futures market), and is short wheat in the futures market (using the proceeds, in fact, to buy the actual wheat he is processing). These offsetting transactions, one long, one short, have the effect of borrowing wheat directly, without the necessary use of money (the contracts themselves can be used as collateral for each other since they nearly offset). When he has sold his flour, he buys out his short with the proceeds and repeats the process. Historically, money itself was a scarce and volatile commodity on the frontier, and eliminating the use of money eliminated a substantial source of risk. Instead of an interest rate linked to public debt levels, bank stability, foreign currency flows, etc., one has an interest rate more closely linked to the inherent properties of the markets actually transacted in.

As a digression of my own, it is plainly inaccurate to say that “continually borrowing with no intention of ever paying anyone back is a totally modern development”. If the miller is doing his job right, his debt will be greater each year, and he may very well be borrowing from the same person each time – if it made sense then, why not now? The question is whether he is doing something productive with the proceeds, and what the ratio of his debt to the value of his business is (in fact, if he can borrow more easily and does so, this in and of itself causes his business to increase in value). It gets dramatically more complicated, and the analogy rather breaks down, if the miller’s business is incredibly difficult to value accurately, and he owes most of the debt to his own subsidiaries, heirs, tenants, people legally obliged to hold it & never redeem… In any case, if one must analogize, it’s a far more suitable analogy than grotesque “country as household” rhetoric.

The final, and most generalizable, part of the book focuses on the notion of risk and probabilities itself, how it relates to models, and how the impact of these risks (like the unfortunate “going bankrupt” thing) manifests and can be controlled in the real world. The Kelly criterion is introduced as a way to think about investment strategy, and the warring camps of frequentists and Bayesians are reconciled into a neat package by explicitly considering what a machine learning practitioner would call a loss function and Brown considers as a choice of numeraire (this synthesizes nicely with the question of borrowing bushels of wheat vs. cash, and it is only poor editing that leaves this connection to the discovery of the reader). Dollars are an easy choice that usually has a neat quasi-Bayesian interpretation – you may consider it in terms of conversion rates or eyeballs, but be careful when the relationship between those and the almighty dollar breaks down. Dollars aren’t always appropriate either, especially when there is no free market setting prices for the underlying events – if you’re trying to build an accurate speech-to-text engine, it’s foolish to try to put a dollar price on each error.

When models break down, Brown has an engaging explanation of the concepts of value-at-risk and risk management in general. Not being an expert in the field, I find it difficult to judge his approaches to risk, and many of them seem inapplicable to my line of work. The technology sector doesn’t have “traders” to manage at quite the level he describes, but the notions of rigorously evaluating performance, taking appropriate levels of risk, and planning for what happens when you fail are universal.

Ultimately, reading this book has convinced me that there is a massive mismatch in technical sophistication between models in the financial and technology sectors. The high-end predictive models being developed by the technology sector, for image processing, natural language processing, collaborative filtering, click modeling, fraud detection, etc., cover much more ground and seem vastly more sophisticated than those of the financial sector. They incorporate more data, make higher-dimensional predictions, and generally use heavier machinery. But the superstructure, the way models are actually used, thought of, and evaluated on a meta level, lags badly behind. Most of this is probably due to the decision time horizon – it would be a different story if the financial sector didn’t require sub-millisecond latencies for their predictions, or if a slight increase in face recognition accuracy was worth the billions of dollars a consistent edge in the financial markets is worth. It may be, with time, that we will see the financialization of the technology sector, securitizing and selling derivatives on click rates or ad impressions in the same way we securitize timber reserves or the profits from Facebook in toto. Already the startup sector leads the way – the only way to understand something like Instagram is as a purchase, not of a revenue stream per se, but of a nevertheless valuable commodity, financed by transforming monetary capital into users, engagement, a software platform, or whatever you wish to call their end result.

The unfortunate aspect of this book, which is not cleanly separable from the thing that makes it interesting, is that it’s clearly a very personal work. The syncretic aspects sometimes diverge into tangential areas, some of the main ideas and interesting connections are scattered, and it could generally use a better editing job. Fortunately, no one will actually force you to read the whole thing – unless they intrigue you, skip the comics, the social history of Wall Street, and the disclaimers, and enjoy a very fresh look at the interface between predictive models, risk, and decision-making.

A/B Tests and the Kelly Criterion

When you’re testing a sequence of changes in an A/B test, the question naturally arises, how large of a group should I test on? Normal practice is to wing it, usually based on some function of the number of fingers you have. This is suboptimal; a 1% test group for a Facebook-sized organization, or in the context of millions of ad impressions, is likely to be incredibly overpowered.

You can deal more rigorously with the issue by recognizing that an A/B test is a wager: we take, say, 1% of our income stream for the time period, and we spin the wheel, seeing if the payoff is negative or not.

There is a “multi-armed bandit” problem that makes this analogy explicit. However, it’s dealing with a slightly different formulation: given a number of possible “plays”, which sequence do we test & how many spins do we give each one? The vanilla posing of the problem doesn’t give a direct answer to the question of how much money we should be dumping into the slot each time (although math composes nicely and one can get a coherent solution for all these problems combined with some effort).

The question of appropriate bet size has a claimed definitive answer in the Kelly criterion. It is remarkably simple:

Proportion to bet = (prob. of success) / (cost of failure) 
                  - (prob. of failure) / (payoff of success)

To pick some contrived numbers, if I have an experiment that I guesstimate to have a 70% chance of a 2% gain, and a 30% chance of a 4.5% loss, I should put ~55% of my bankroll into that wager. Essentially what this decision gets you is maximum expected long-term total payoff if you fold your winnings back into more wagers. This means that it’s not actually the appropriate criterion if you can’t continually reinvest your winnings: for instance, if you’re measuring the effect of switching from one JVM garbage collector to another, you get a choice of three and you’re done, and your “winnings” are something like having to deploy a constant proportion fewer machines or a constant reduction in latency (eg, a one-time static payoff). On the other hand, if you’re trying to turn ads into customers into ads into customers, the analogy is pretty apt.
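Plugging the contrived numbers above into code makes a handy sanity check on the arithmetic (a sketch, nothing more):

```java
public class Kelly {
    // Kelly fraction for a bet that gains `gain` per unit wagered on
    // success and loses `loss` per unit wagered on failure.
    static double kelly(double pSuccess, double gain, double loss) {
        return pSuccess / loss - (1 - pSuccess) / gain;
    }

    public static void main(String[] args) {
        // 70% chance of a 2% gain, 30% chance of a 4.5% loss
        System.out.println(kelly(0.7, 0.02, 0.045)); // ~0.556, i.e. ~55% of bankroll
    }
}
```

Note that the formula can go negative; a negative fraction means the edge is against you and the correct bet is zero.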

A few questions rear their ugly heads immediately:

  • How can I possibly know the expected payoffs if that’s what I’m trying to measure in the first place?
  • How does statistical significance play into this?

The first is more of an art than a science, but you can get an estimate by looking at the results of previous experiments. If all your previous futzing with your site’s fonts only shifted conversion by 0.5% in one direction or the other, your custom Lobster substitute is unlikely to change it by an order of magnitude. But still, it’s difficult to have reasoned estimates, especially at low probabilities of highly variable magnitudes from a fickle and ever-changing customer base. It might help if you thought of the bet as not “shift N% of our traffic to an A/B test of button color”, but as “shift N% of our traffic to an A/B testing team with this track record”.

The second is trickier. As I mentioned above, it is possible to reconcile these ideas seamlessly, but it does take some math and some assumptions about what your “real” utility function and beliefs are. The Kelly criterion is fundamentally forward-looking, and statistical confidence is fundamentally backwards-looking, so they need not be in conflict, and one of the nice things about the Kelly criterion is that there is no explicit term for “variance”. In particular, because Kelly depends only on your current estimates of the outcome probabilities and payoffs, you can update those estimates in a Bayesian fashion as results come in, and adjust your proportions in real time.
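To make the updating concrete, here is a minimal sketch using a Beta prior over the success probability, with the payoff magnitudes assumed known; all class and method names and all numbers here are illustrative, not any standard library’s API:

```java
public class KellyUpdate {
    // Posterior mean of the success probability under a Beta(alpha, beta) prior
    static double posteriorMean(double alpha, double beta) {
        return alpha / (alpha + beta);
    }

    // Kelly fraction; a negative result means don't bet at all
    static double kelly(double p, double gain, double loss) {
        return p / loss - (1 - p) / gain;
    }

    public static void main(String[] args) {
        double alpha = 7, beta = 3;        // prior roughly encoding "70% success"
        double gain = 0.02, loss = 0.045;  // payoff magnitudes, assumed known
        System.out.println(kelly(posteriorMean(alpha, beta), gain, loss));

        // Each failed trial increments beta, dragging the posterior mean
        // (and hence the bet size) down in real time
        for (int i = 0; i < 3; i++) {
            beta++;
        }
        System.out.println(kelly(posteriorMean(alpha, beta), gain, loss));
    }
}
```

With these numbers, three consecutive failures are enough to push the Kelly fraction negative, i.e. to tell you to stop betting on this experiment entirely.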

If this sounds like it might get a bit complicated, it’s because it does. Unfortunately there is no way around it: the more rigorously you try to model a complex system of probabilistic payoffs, the more elaborate your model has to be. That’s why the usual solutions are to stick with ye olde A/B tests (which are for the most part predictably suboptimal), or to hire some consultants or in-house statisticians.

Designing a Persistent Bloom Filter

Bloom filters are handy data structures, particularly for applications where data sets regularly exceed RAM if stored literally. For instance, they’re useful as a way to implement an inner join or filter. You can stream the restrictor dataset into a Bloom filter, and then stream your restrictee through the Bloom filter, propagating only those elements that match, and taking only 1.44 * log2(1 / errorRate) bits per entry in your restrictor dataset. This is why databases like Cassandra use them extensively.
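To put numbers on that sizing formula (the error rate and entry count here are made up for illustration):

```java
public class BloomSizing {
    public static void main(String[] args) {
        double errorRate = 0.01;  // target 1% false positives
        long n = 10_000_000L;     // entries in the restrictor dataset
        // bits per entry = 1.44 * log2(1 / errorRate)
        double bitsPerEntry = 1.44 * (Math.log(1.0 / errorRate) / Math.log(2.0));
        // optimal number of hash functions k = ln(2) * (bits per entry)
        long k = Math.round(bitsPerEntry * Math.log(2.0));
        long totalMB = (long) (bitsPerEntry * n) / 8 / 1024 / 1024;
        System.out.printf("%.2f bits/entry, %d hashes, ~%d MB total%n",
                bitsPerEntry, k, totalMB);
    }
}
```

For a 1% error rate this works out to roughly 9.6 bits per entry and 7 hash functions – about 11 MB to filter against ten million entries, versus hundreds of MB for a literal set of keys.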

Usually they’re formulated as mutable data structures over a bit-array, which is in turn implemented (at least in Java) on top of an array of longs. But there’s no reason why we have to use a mutable array; persistent data structures are desirable for many reasons, not least because they can be used with the same idioms as builtin Clojure data structures if we’re operating in that environment. How do we implement them persistently? On top of a persistent vector / array, of course.

Standard Clojure persistent vectors have object overhead, and the point of a Bloom filter is to reduce memory usage, so they’re right out. You could implement it on top of a (vector-of :long) with less overhead, but there is a speed disadvantage; currently gvec / vector-of doesn’t support transients, so with K hash functions in your Bloom filter, you’re doing K sequential modifications to your vector, resulting in many more allocations and copies than we’d like.

Basically, designing a persistent Bloom filter comes down to the problem of representing a persistent fixed-sized array in such a way that we can do K “modifications” per step in the minimum number of operations. Essentially all persistent data structures are built on top of a tree-like data structure, so we need to figure out the optimal tree layout. How do we do this?

When we’re constructing a Bloom filter, we provide the expected number of elements and the desired false-positive rate, which in turn gives us parameters for the optimal number of hash functions and number of total bits, so we know how many modifications per insert (K) and how many total elements we’re going to need to represent (number of bits / 64). In order to make this tree persistent, we need the guarantee that we always copy a whole node if we need to modify any of its members. So, if we have a fan-out of F, we basically accumulate a cost of F for each node along each of the K paths, without double-counting shared ancestors.

But how do we estimate the overlap? This is trickier than it seems; one of the assumptions of a Bloom filter is that the bit positions are randomly distributed. But “randomly distributed” doesn’t mean “evenly distributed”; if I drop 100 balls into 100 random buckets, I’ll end up with more than a third of buckets empty. Simulate it yourself and see:

(loop [v (vec (repeat 100 0)) ct 100]
  (if (zero? ct)
    (do (println "Buckets empty:" (count (filter zero? v)))
        (println "Final bucket counts:" v))
    (recur (update-in v [(rand-int 100)] inc)
           (dec ct))))

The number of empty buckets (and conversely the number of buckets with at least one entry) is dictated by the classical occupancy problem: after k throws into w buckets, each bucket is empty with probability (1 - 1/w)^k. Now, we could use that to figure out the number of nodes likely to be hit on each layer, or we could simulate it, which is more fun.

//Does a Monte Carlo simulation of the cost of an insert.
public static double cost(long m, int k, int f){
    final Random rng = new Random();
    final int trials = 250;
    //Each long holds 64 bits
    final int n = (int)Math.ceil(m / 64.0);
    double cost = 0;
    //Simulate from top down, averaging over the trials
    for(int trial = 0; trial < trials; trial++){
        //Layer widths go f, f^2, ..., up to the smallest power of f
        //that covers the n leaves
        for(int width = f; width < n * f; width *= f){
            //Mark the nodes in this layer that the k paths pass through
            final int[] hit = new int[width];
            for(int i = 0; i < k; i++){
                hit[rng.nextInt(hit.length)] = 1;
            }
            //Each distinct node hit must be copied, at a cost of f slots.
            //We may want to add a fudge factor to account for array header,
            //pointer access & function call overhead.
            int localCost = 0;
            for(int i = 0; i < hit.length; i++){
                localCost += hit[i];
            }
            cost += localCost * f;
        }
    }
    cost /= trials;
    return cost;
}
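The simulation can also be sanity-checked against the closed-form expectation, assuming each of the k hits lands uniformly and independently over a layer of w nodes:

```java
public class Occupancy {
    // Expected number of distinct buckets hit by k uniform throws into w buckets:
    // w * (1 - (1 - 1/w)^k)
    static double expectedHit(int w, int k) {
        return w * (1 - Math.pow(1.0 - 1.0 / w, k));
    }

    public static void main(String[] args) {
        // 100 balls into 100 buckets: ~63.4 buckets occupied, ~36.6 empty
        double hit = expectedHit(100, 100);
        System.out.println(hit + " occupied, " + (100 - hit) + " empty");
    }
}
```

Multiplying the expected distinct hits per layer by the fanout, and summing over layers, gives the same copy-cost estimate the Monte Carlo run converges to.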

It would be more than a bit overkill to run a Monte Carlo simulation in the constructor (although I’ve seen worse), and still a little overkill to estimate fresh parameters each time using statistics. The cost isn’t convex due to integer issues (sometimes a higher branching factor lets you get rid of another level, and sometimes it just makes each copy costlier), so we’d have to calculate the results of many different fanouts, and we’ll have some error in our estimate anyway since we’re running on a real machine in the physical world.

It turns out that a fanout factor of 8 has the advantage of being a power of 2, corresponds roughly to a cache line (of course Java object headers mess this up), and gives reasonable simulated costs for large Bloom filters. I’m building a persistent Bloom filter implementation into SOAC and benchmarking, so we can see if the real world agrees with the back-of-the-envelope calculations. A fixed-size primitive array with inherent support for multi-node edits would also be a good base for a different hash table implementation that doesn’t depend on gvec.

Full Disk Encryption With Btrfs and Multiple Drives in Ubuntu

At this point, encryption is an issue of social responsibility. It is important to establish a norm that data should not live or travel in plaintext without an affirmative reason, especially if you have nothing to hide, because it provides cover to the people who do. Besides the normative aspect, if you intend on doing any international travel, have any interesting friends, or do any interesting or profitable work on your machine, you owe it to yourself to secure your environment.

Ubuntu makes the single-disk encryption scenario relatively easy, but it doesn’t allow a lot of customization at install time, and has no GUI for extending encryption to multiple disks. Fortunately it’s only a bit more CLI work to set it up to work transparently with multiple disks, so you only need to enter the one passphrase. I’ve tested this in a VM with the latest Ubuntu 14.04 beta, but it should work for other versions of Ubuntu, or any distro with support for cryptsetup.

The default Ubuntu “encrypt my whole hard drive” installer stacks its layers like so:

  1. The physical disk
  2. An extended physical partition
  3. A LUKS wrapper
  4. A LVM physical volume
  5. Two LVM logical volumes: one for swap, and one EXT4 filesystem for your root directory.

Whew! This is probably fine for your system drive, if a little complex; it’s nice being able to use LVM to resize your swap partition if your needs change dramatically, and if your system drive is a different size / speed than those in your storage array (eg, a 32GB SSD vs. an array of 4TB spinny disks) it wouldn’t make sense to have it as part of the same filesystem anyway. We’ll accept that default for our root partition and swap, and focus on our secondary data drives.

We’ll assume your main system has been installed successfully on /dev/sda , and we have 2 other disks /dev/sdb and /dev/sdc that we want to set up as an encrypted, Btrfs-managed mirror.

First, let’s blow away the existing disks, and create some fresh partitions. You can do this graphically or through any partition editor. The key thing is to end up with one unformatted partition on each disk; /dev/sdb1 and /dev/sdc1 respectively.

# You’ll need to be superuser for all of this
sudo -i
# For these commands, select "o" to zero the partition table,
# "n" to create a new partition (follow the defaults for a single primary
# partition that fills all space), then "w" to write to disk.
fdisk /dev/sdb
fdisk /dev/sdc

We’re going to use a keyfile so we only have to enter the passphrase that unlocks our root partition. Let’s generate one.

# 512 bit / 64 byte keyfile
dd if=/dev/random of=/etc/keyfile bs=1 count=64
# Keep the keyfile readable by root only, since it unlocks the data drives
chmod 0400 /etc/keyfile

Create a couple of LUKS wrappers inside those partitions, using the keyfile we just generated.

cryptsetup --key-file /etc/keyfile -v luksFormat /dev/sdb1
cryptsetup --key-file /etc/keyfile -v luksFormat /dev/sdc1

Now we load the encrypted mapping, to /dev/mapper/enc1 and /dev/mapper/enc2 respectively, again using the keyfile. We write plaintext into the mapper, and it comes out encrypted on the raw device.

cryptsetup --key-file /etc/keyfile luksOpen /dev/sdb1 enc1
cryptsetup --key-file /etc/keyfile luksOpen /dev/sdc1 enc2

Now we make a choice. Btrfs has its own LVM-esque capabilities, so rather than layer in more complexity by using logical volumes, we use Btrfs directly inside the LUKS wrapper.

# Btrfs isn’t installed by default
apt-get install btrfs-tools
# A label makes management slightly easier.
mkfs.btrfs -L vol1 -m raid1 -d raid1 /dev/mapper/enc1 /dev/mapper/enc2
# The final mount point
mkdir /mnt/enc
# We can mount any of the component devices manually, and have access to the full array
mount /dev/mapper/enc1 /mnt/enc

OK, let’s modify our crypttab and our fstab so our system knows to decrypt and mount these drives. This should be added to your crypttab (optionally replacing the devices with their UUIDs, which you can get via “sudo blkid”):

# Optionally add "discard" to options to support TRIM on SSDs
enc1 /dev/sdb1   /etc/keyfile  luks
enc2 /dev/sdc1   /etc/keyfile  luks

And this should be added to your fstab (again optionally using UUID, or the label of the array):

/dev/mapper/enc1 /mnt/enc  btrfs defaults 0 2
# Optionally, using label:
# LABEL=vol1     /mnt/enc btrfs defaults 0 2

Now, when you boot, you will be asked for your root passphrase first. The keyfile will be decrypted, and used to decrypt your Btrfs component drives. They will then be mounted, and your data will be secure.


I’m told that some people’s setups required a bit more massaging to get up and running with automount. Their fix involved adding a “device” parameter to the fstab, something like the following:

/dev/mapper/enc1 /mnt/enc  btrfs device=/dev/mapper/enc1,device=/dev/mapper/enc2 0 2