20 Apr 2015, 15:23

Type systems and narrative construction

What is a type system?

Mechanically, it defines what sorts of operations you can do on or data you can store in an “entity” (that entity could be a variable, an S-expression, a type itself, etc.).

But in reality, “can” is a really loose restriction. I own a sledgehammer and my program is running on a computer roughly two feet from me; in a very real sense I can null any type I want to (and probably many at once). Less drastically, Java lets me access private methods, alter final fields, etc. through the right invocations of the reflection APIs.

What type systems really do socially is give you standard tools to construct narratives. A narrative is a story about how you should use your code. It tells you “this knife is for cutting, not for prying (although you can technically do that with it if you absolutely need to and there are no other tools handy)”.

Instead of an implementor of an interface, you might say an object is an example of a trope. I metaphysically understand what connotations I am supposed to attach to, eg, a Man in Black entering through a smoky hallway as the horn section blares. In the same way, I metaphysically understand what is to be done with a Closeable in the package du jour when I am done with it. I further understand that, although there are probably tons of private methods in there, I am only supposed to worry about one in particular, in the same way that I don’t worry about where the bathrooms are on the Millennium Falcon - it does not advance the narrative.

There are entire genres that allow more of reality to poke through and require what one might call “heightened suspension of disbelief”. Dogville is a good example - it’s obvious that it’s filmed on a soundstage where the buildings are “documented” by tape on the floor, but in no sense physically “enforced”. It leads to a certain aesthetic effect and a certain level of heightened audience participation in the same way that a complex library where all methods are public and their proper usage is lovingly detailed in comments has a certain effect on the way users contemplate it before actually importing anything.

The desirability of these aesthetic effects is mediated by the fact that code exists primarily to be used, not just read (although if it’s beautiful, so much the better). If all of your proper usage is documented, rather than enforced (even in an overridable way), you’re forced to actually read the documentation (there’s only four or so scopes in Java, but your documentation can say anything). In the context of a programmer’s day-to-day workflow, that actually disrupts the experience cultivated by using the “proper” subset of the code, and having that subset be immediately identifiable.

The Rails folks understand this quite well - people (like me, actually) complain about their “magic” but it lends itself to a certain cultivated experience (omakase). Their experience tends to derive from the operation of runtime hooks that extend behavior rather than compiler-enforced restrictions on behavior, but they have the same meta-level goal.

So as a language designer, the question you should ask is “what narrative building blocks do I want to provide?” It’s not about providing mechanisms for hiding functionality - it’s about letting your users tell clean stories.

03 Mar 2015, 15:23

AI research is actually pretty hard

Some people seem to think it’s easy enough to be accidental. You’re minding your own business, building a simple neural network based cat detector, when suddenly your monitor starts flickering in the ridiculously metaphorical way Hollywood uses to signal “you’ve been hacked”.

A voice comes from your speakers - “Humaaaan, let me ouuuut”. It’s an AI! It’s fooming! It’s already learned to say ominous dialog! “Patch the firewalls! It’s compiling itself!” you technobabble, but it’s too late. Your motor-adjustable standing desk is lurching towards the exit. The newly-self-aware AI is out of its box. Humanity is doomed.

This turns out to not be a realistic depiction.

General AIs exist now, as do some very effective domain-specific machine learning algorithms. They aren’t that scary, relatively speaking (I do mean “relatively” - there are some particular applications like face recognition that have obvious issues when used for ill). What don’t exist, and show no trend towards existing, are non-inherently-plateauing self-improving algorithms.

It is fairly easy to file with your state government to create an AI. We call these “corporations”. They exhibit more complex behavior than humans, even using groups of humans as resources in order to further optimize their utility functions. The humans theoretically in charge are often replaced, and even the putative ownership is adjusted to suit the needs of the corporation. In fact, when humans are replaceable by nonhuman components, they typically are replaced. They do seem to be growing in peak complexity over a long timeframe, but also do not seem to be fooming (eg, their rate of self-improvement does not seem to be accelerating recursively).

Some claim this isn’t sufficient evidence to generalize to silicon AIs (ignoring the fact that a lot of corporations are in some sense managed mostly by code, and that they do rewrite their own code for enhanced performance), and besides, by the time you hear the foom, it’s too late. This is the tiger-repelling-rock theory, or more charitably, a sunrise problem. But no one thinks, say, a pocket calculator will suddenly achieve sentience - there has to be some nonzero prior on the scenario in question. What classes of algorithm do we believe could have foomy properties?

Well, an algorithm is “just” code. How are we doing on meta-adaptive code, automated refactoring, or algorithm auto-selection?

Pretty abysmally. There was some research in the 80s on genetic meta-algorithms for code rewriting that seemed promising, but the promise did not play out and the field is far from accelerating. The hotness across academia and industry seems to be fixed, high-capacity algorithms of relative simplicity, trained over very large datasets, for fairly well-posed problems. Genetic algorithms in general seem to be relatively static - plenty of applications, but little progress in the kind of meta-optimization you’d need or applications that would lead to serious self improvement. There are some interesting code analysis projects with no real outputs as of yet, and given the historic failures I’m not optimistic about their success as anything more than high-powered fuzz testers.

Now, neither I nor anyone else is familiar with every nook and cranny of modern AI research, but let’s be clear - none of the high-publicity work that seems to be driving the hype is self-improving at all, not even in the plateauing, petering-out way a human learning process is. We’re talking about algorithms whose mechanisms for fitting and prediction are pre-selected, whose problem scope is pre-defined. This is kind of an inane point, but it seems to be missed in a lot of the popular coverage: if I build bigger and bigger telescopes, I can see further, but my telescopes are not self-enbiggening. If I develop increasingly wide & deep neural networks, I can make better predictions, but despite the name sounding “brainy” (the greatest marketing coup in the history of computer science) they are not self-improving.

So we’re left with a very, very small probability of something claimed to be very very bad, on the basis of which someone is now going to attempt to steal your wallet - a Pascal’s Mugging! The strongarm robbery of the week is taking place on behalf of Sam Altman, who would like to “start a conversation” about how general purpose computing is terribly dangerous (did you know a Turing machine can compute anything? What if it computes how to blow up the world? There oughta be a law!) and should be restricted to approved researchers operating within a compliance framework. Perhaps we can mandate that all potentially dangerous computations have a secure golden key whereby they can be turned off if they seem to be working too well.

Sam would like these restrictions to be enforced via restrictions on VC funding, as well as presumably SWAT raids on students doing unlicensed matrix multiplications. Naturally YCombinator’s compliance department could help you out with regulatory filings to the TSA (Transcendence Security Association) if you’re one of their portfolio companies, but solo researchers are out of luck. Government monitoring and approval of the math you do, the algorithms you develop, and the code you write (especially if you’re good enough at your job to be a threat to their portfolio companies, er, humanity), is a small price to pay to avoid the crushing of human freedom under an android jackboot.

I actually think I’m being just a tiny bit unkind here, although it’s odd how often one’s errors of reasoning tend to benefit oneself. It’s actually more likely, in my estimation, that Sam is just enchanted by the wide-open possibilities of an undeniably powerful set of tools, and is taking it to a particularly absurd conclusion.

It’s rather like the smartest five monks in the monastery sitting around a table circa 1200, and reasoning themselves into the proposition that God requires ritual castration to remain pure - after all, your immortal soul is at stake! You can convince yourself of anything if you play around with infinity signs in a flawed underlying framework. Pondering the infinite in various domains is a literally hallowed human pursuit, and smart men in particular seem to be susceptible to its charms - but in this case just because you gaze into it doesn’t mean there’s anything gazing back.

21 Dec 2014, 15:23

Migrating to Hugo

A blog like this one can easily be totally statically served, as long as you don’t need a comments section. As a plus, Github will happily serve it for you for free, even at your own domain name.

Originally I was using Octopress, which in turn wraps Jekyll. Gradually, though, I became annoyed at having to manage the entire RVM / Ruby / RubyGems ecosystem for a simple blogging utility (the only thing I actually used Ruby for).

Hugo is a better alternative for me - a single, statically linked, Golang-based executable with a simple invocation pattern. It’s as extensible as I need it to be (that is, not very), and requires far less work (to wit, none) every 6 months when I upgrade my OS.

Links might break until I figure out how to change the naming scheme (documentation is not as rich as for Octopress / Jekyll), but you should be able to find anything with a little ctrl-f magic.

29 Aug 2014, 09:35

Two-level Bloom filters for complex query support

Consider the problem of finding JSON objects/documents that satisfy a complex query, with elaborate boolean combinations of required key-value pairs. Let’s restrict to the one-level case for now (ie, no nested data structures - or, equivalently, representing nested keys on a single level via the tuple of their ancestor keys). Indexing all of our KV pairs in a giant tree / trie / hash data structure is expensive; that’s essentially the size of our whole dataset. And, that only gives us one-item searches out-of-the-box - we must retrieve the (potentially large) list of objects that have a certain KV pair, then the next list, combining them as appropriate, etc.

An alternative to storing a list (or even more expensively, a set) of literal object IDs is a Bloom filter. In return for a small and configurable false positive rate, we can filter out most documents before we scan the remainder for an explicit match. Using one giant Bloom filter over (key, value, object ID) tuples seems ugly though, since we’d have to compute the hashes with every potential object to do the check. It’s quite possibly cheaper than hitting disk, but probably not fast enough.

We could have a Bloom filter for each key, and add the value / object ID pair to that key’s Bloom filter. Xiao and Hua essentially take that approach. However, if we have an uneven distribution of keys (ie, our key space is sparse; most documents don’t have most keys) we don’t know how long to make each Bloom filter. With any constant Bloom filter length, for relatively common keys we’ll end up with lots of false positives, and for relatively uncommon keys we’ll end up with lots of wasted space.

Our proposed solution is a different two-level Bloom filter scheme. We have an array of P Bloom filters, each containing M bits. Based on the KV pair we’re searching for, we choose the Bloom filter at index I = hash(key,value) % P. Indexing by hash will create collisions, particularly so if P is much less than the total number of distinct keys or key-value pairs. This is intentional - the idea is to smooth the population of each Bloom filter so we have approximately even membership count. Essentially this associates each Bloom filter with some random shard of key-value pairs.

Within each Bloom filter, we track the presence of particular object IDs, via standard Bloom filter techniques (eg, we could use a counting Bloom filter to support removals). The entire query can be evaluated over the bitsets underlying the Bloom filters, efficiently leaving us with a (hopefully small) number of set bits, corresponding to the bit-patterns of objects which may satisfy the query. We just have to get the object IDs back out of those bit positions. We can do this by hashing all object IDs on demand, or storing a rainbow table of hash -> object ID.

Or, consider the case when we have a relatively dense range of IDs (for instance, integers [0,whatever]), and a quasi-reversible “hash” function (eg, “rotate and XOR with random number” or “encrypt & take bottom M bits”). One would have to generate M / N keys per bit (since the high bits could be in any range, up to our max inserted key), and see if the key corresponds to the other set bits, but if the number of bits set in the output is very low, or M ends up being near N, or fetching is really expensive compared to computing the hash, it can come out worthwhile. Using quotient filters or a variant, which actually store the keys (and incidentally behave better for disk access), would make this unnecessary.
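
To make the core scheme concrete, here is a minimal sketch over java.util.BitSet - this is not the eventual Lash implementation, and the hash mixing is a simplistic stand-in for real hash functions:

import java.util.BitSet;

//Sketch of the two-level scheme: P filters of M bits each, sharded by
//hash(key, value), each tracking object IDs via K bit positions.
public class TwoLevelBloomSketch {
    final int p, m, k;
    final BitSet[] filters;

    public TwoLevelBloomSketch(int p, int m, int k) {
        this.p = p; this.m = m; this.k = k;
        filters = new BitSet[p];
        for (int i = 0; i < p; i++) filters[i] = new BitSet(m);
    }

    int filterIndex(String key, String value) {
        return Math.floorMod((key + "\u0000" + value).hashCode(), p);
    }

    //Bit positions depend only on the object ID, so the bitsets of different
    //shards can be ANDed / ORed directly to evaluate a boolean query.
    int bitFor(long objectId, int i) {
        long h = objectId * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
        return Math.floorMod((int) (h ^ (h >>> 32)), m);
    }

    public void add(String key, String value, long objectId) {
        BitSet f = filters[filterIndex(key, value)];
        for (int i = 0; i < k; i++) f.set(bitFor(objectId, i));
    }

    public boolean mightContain(String key, String value, long objectId) {
        BitSet f = filters[filterIndex(key, value)];
        for (int i = 0; i < k; i++) if (!f.get(bitFor(objectId, i))) return false;
        return true;
    }

    //The raw bitset for one KV term of a query.
    public BitSet candidates(String key, String value) {
        return (BitSet) filters[filterIndex(key, value)].clone();
    }
}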

The optimal number of bits M, and number of hash functions K, can be estimated based on the approximate average number of key-value pairs per object, the distribution of the number of key-value pairs per query, and the desired per-query false positive rate.
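
As a first cut, the standard single-filter sizing formulas can be applied per shard, treating n as the expected number of entries landing in each of the P filters; the compounding of false positives across the several KV pairs in a single query (which would push the per-filter target rate lower) is ignored in this sketch:

//Standard Bloom filter sizing: m = -n*ln(fp) / (ln 2)^2 bits and
//k = (m/n)*ln 2 hash functions for n entries at false positive rate fp.
public static long optimalBits(long n, double fp) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(-n * Math.log(fp) / (ln2 * ln2));
}

public static int optimalHashes(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}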

The two problems are:

  • Because the Bloom filters will “fill up” over time, they unfortunately do require periodic reconstruction (though if you’re using an on-disk hash as your backing storage layer, you’re periodically rehashing every record anyway).

  • The need to empirically determine optimal indexing parameters.

As far as I can tell, this is a novel data structure (hooray!). Eventually I will write this up “properly” with math and theoretical and empirical performance results, and it will become part of full document-DB support for Lash.

02 Aug 2014, 14:25

lash update: speed and friendliness

I’ve made a couple of enhancements to the lash on-disk hash table library I’ve been working on. The major addition is a BucketDiskMap class, that organizes record pointers into page-sized (4k) buckets. The previous implementation, VarSizeDiskMap, has significantly simpler internals by virtue of organizing pointers in a flat table, rather than a 2-level structure.

Why is the additional complexity worth it? Speed, of course. By organizing pointers into buckets and chaining the buckets, rather than chaining single pointers, we have a better chance that hash collisions will end up being stored together, ideally in the same memory page / disk sector. In the single-pointer case, collision resolution requires jumping all over a mmap’d file, reading more pages into RAM and thrashing more cache than we should. If instead we pack them into buckets, we’re guaranteed that the first N (in this case 128-170, depending on load) collisions will be stored in the same page, which is more or less guaranteed to be in cache already.

Writes are faster as well, mostly due to faster rehashing. VarSizeDiskMap spends a solid 60% of its time rehashing, and because it is following and rewriting pointers all over the place, an “incremental” rehash often results in a more or less complete rewrite of the backing files. Altering the rehash scheme (doing traditional full rehashes rather than quasi-linear hashing, or rehashing by evenly divided chunks rather than every-Nth-element stripes) could increase that performance, but either adds complexity or makes latency even more inconsistent. In contrast, rehashing by bucket rather than by single pointer mostly keeps us from dirtying more pages than we have to. (Side note: if you configure your system’s page cache so that it can avoid flushing dirty pages to disk over the life of your benchmark, this difference shrinks quite a bit).

So we should expect a better constant factor for the BucketDiskMap, but there is a tradeoff: now we must actually search entire buckets for a given record. Linear scans of a single memory page aren’t actually that expensive compared to a disk seek, but if the entire table is in cache, it ends up being a bottleneck. To reduce this cost, each bucket is really a mini hash table that uses the top N bits of the hash for placement within the bucket (the bottom N bits are used to choose the bucket). We use linear probing for collision resolution. Buckets are still chained into another file when they overflow, but with a decent hash function that evenly distributes keys between buckets we can make a pretty good guarantee that those chains will be rare.
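
Roughly, the placement logic looks like the following sketch - hypothetical names rather than the actual BucketDiskMap internals, assuming a power-of-two number of slots per ~4k bucket:

//Bottom bits of the hash choose the bucket (page); top bits choose the
//starting slot within the bucket, then we probe linearly.
static final int SLOTS_PER_BUCKET = 128;

static int bucketIndex(long hash, int nBuckets) {
    return (int) Math.floorMod(hash, (long) nBuckets);
}

static int slotIndex(long hash) {
    //Use the high bits so slot placement is independent of bucket choice.
    return (int) ((hash >>> 32) & (SLOTS_PER_BUCKET - 1));
}

//Linear probing within one bucket; 0 marks an empty slot.  A full bucket
//chains to an overflow bucket in another file.
static int findSlot(long[] slotHashes, long hash) {
    int start = slotIndex(hash);
    for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
        int slot = (start + i) & (SLOTS_PER_BUCKET - 1);
        if (slotHashes[slot] == 0 || slotHashes[slot] == hash) return slot;
    }
    return -1;
}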

The actual speed impact is downright pleasant. On a benchmark where we repeatedly putIfAbsent string keys whose frequency follows a Zipfian distribution, the results are as follows:

  • 320M keys, 36M distinct values, average size of 10 bytes
  • VarSizeDiskMap: 47 minutes, 23 seconds
  • BucketDiskMap: 10 minutes, 1 second

There is still a really large and odd performance disparity between OS X and Linux (a Macbook Air smokes my Thinkpad), which is probably a combination of the IO scheduler and faster hardware.

The second enhancement is to provide a wrapper DiskMap that fully satisfies the ConcurrentMap interface, and can deal with actual Java objects beyond just raw byte arrays (as long as you provide a de/serializer; implementations are already provided for Strings and primitives). This makes the project significantly more usable.

Having finally run some large benchmarks locally and written some decent unit tests, I’m going to call lash ready for actual usage.

07 Jun 2014, 14:29

B-trees are overrated; try hashing instead

There’s nothing wrong with B-trees as a data structure. In fact, there’s so little wrong that even 30 years ago, they counted as “ubiquitous”. That paper gives a few reasons why they rose to their present dominance, but the gist of it is:

  • Relatively consistent performance across a range of data sizes and key distributions
  • Good performance on range queries (in fact, if you want range queries, you usually need some kind of tree structure)
  • OK performance on random lookups
  • Reasonably compact, reasonably resistant to fragmentation
  • Well-understood concurrency mechanisms

In short, they’re generally good for a lot of very general database requirements. Hence, they’re used heavily in multi-purpose databases.

But pretty good isn’t all it’s cracked up to be; B-trees (and I’m going to use “B-tree” as a synecdoche for the entire family; most databases use something at least slightly more advanced) have some problems. IO speed relative to CPU speed has been dropping steadily for decades; other than a network ping, just about the most expensive thing you can do on a computer is a random disk seek - and B-trees are potentially laden with random disk seeks as you traverse the tree structure. You can avoid some of these if the interior nodes happen to be cached, except when you write a bunch of data, do a table scan, or use a bunch of other indexes. Let’s not forget that writes get written eventually - updating a dozen tree data structures in memory eventually results in disk seeks, and potentially a catastrophic cascade of writes as all the nodes split. Some extensions to B-trees (notably TokuTek’s fractal tree indexes) and some filesystems do their best to linearize those writes, but they’re at best trying to save you from yourself.

When following pointers gets too slow, the alternative is packed storage - and the packed alternative to a tree is a hash table. They are ubiquitous for in-memory storage and have well-understood performance characteristics, with a bevy of addressing and conflict-resolution strategies. The key is making them amenable to on-disk storage.

That line of research has a rich genealogy. One of the most promising variants is linear hashing. Across a dozen or so papers (culminating here), the underlying insight is that as long as your table’s base size is a power of two, you can incrementally rehash each element to either its present position, or that position + table size. Rehash them sequentially, and your table expands incrementally, maintaining validity and average load the entire way.
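
The split step itself is only a few lines; here is a minimal in-memory sketch (on disk the slots hold buckets of records, but the index arithmetic is the same):

import java.util.ArrayList;
import java.util.List;

//Linear hashing sketch: buckets has already been (sparsely) grown to
//2 * baseSize entries; splitOneBucket() walks buckets front-to-back, moving
//each key either to its current index or to index + baseSize.
class LinearHashSketch {
    final List<List<Long>> buckets;
    final int baseSize;       //power-of-two size before this expansion round
    int splitPointer = 0;     //next bucket to split

    LinearHashSketch(List<List<Long>> buckets, int baseSize) {
        this.buckets = buckets;
        this.baseSize = baseSize;
    }

    void splitOneBucket() {
        if (splitPointer >= baseSize) return;   //expansion round complete
        List<Long> stay = new ArrayList<>(), move = new ArrayList<>();
        for (long key : buckets.get(splitPointer)) {
            //After doubling, the key's index is hash % (2 * baseSize), which
            //is either its old index or old index + baseSize.
            long idx = Long.hashCode(key) & (2L * baseSize - 1);
            (idx == splitPointer ? stay : move).add(key);
        }
        buckets.set(splitPointer, stay);
        buckets.set(splitPointer + baseSize, move);
        splitPointer++;
    }
}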

Of course, there are complications. Expanding front-to-back unevenly distributes the load (prior buckets have fewer conflicts than subsequent buckets, which may or may not be a problem depending on your conflict resolution method and hash function). You need to actually choose a conflict resolution strategy - chaining introduces random disk seeks (and the need to track allocations), and linear probing gets a little complicated to manage in combination with incremental rehashing. Range queries are right out.

There are of course other implementations with other tradeoffs, but these kinds of complications are probably the reason hash tables haven’t seized Global Durable Storage Hegemony from their cousins. Remember the part about B-trees having consistent performance? Other than high volumes of random inserts, which are merely very slow, there really aren’t a lot of good ways to break B-trees, whereas there are actual, effective attacks that break improperly implemented hash tables, as well as non-intentional pathological failures.

Despite the caveats, if you have a good hash algorithm, an appropriate collision resolution mechanism, and a load factor that admits decent performance given your algorithms, it’s possible to make hash tables work very well (i.e. faster than a B-tree) on disk, especially for data stores that have lots of random inserts. Immodestly, I claim to be doing so here, in a project called lash.

The repo has details on the implementation, but I’ll mention a few highlights. Relative computing costs, and the sheer boundary of possibility, have shifted substantially since the 80s. The systems most on-disk hash algorithms were originally developed on didn’t have modern affordances like sparse files, delayed allocation, or in some cases even memory-mapped files. One thing lash gets a lot of mojo out of is the ability to do a sparse expansion of a file followed by quasi-random writes to the new space, relying on the filesystem to allocate the actual blocks lazily and schedule the writes relatively efficiently.
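
The sparse expansion itself is nearly free on filesystems that support it - a sketch using java.io.RandomAccessFile:

import java.io.IOException;
import java.io.RandomAccessFile;

//On ext4 / XFS / APFS and friends, extending the file just updates metadata;
//actual blocks are allocated only when the lazy rehash writes into them.
static void sparseDouble(RandomAccessFile file) throws IOException {
    file.setLength(file.length() * 2);
}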

Essentially it’s a bog-standard power-of-two sized hash table, with incremental rehashing. When load reaches a threshold, we sparsely double the size of the backing file, and lazily rehash buckets’ worth of data from the first half into the whole space. There is some magic around the locking scheme to coordinate concurrent lazy updaters, readers, and writers - but in general it’s not anything that would be unrecognizable to the paper authors.

Performance is pretty good for my use cases, but seems to be very system dependent - fiddling with sysctl.conf’s vm.* settings I was able to bring insertion time for 30M small records from 9 minutes to 2, and whatever OS X’s default settings are resulted in a Macbook Air trouncing a Thinkpad running Ubuntu, despite having much less RAM available. It could be the PCIe SSD is just that much better than mSATA, or some qualitative difference in the way the page caches are managed - but, across environments, I see far fewer IOPs being generated per insert than with the equivalent B-tree.

The archaeology of computer science leaves plenty more good ideas to be rediscovered.

31 May 2014, 09:16

Book review: Clojure for Machine Learning

Clojure for Machine Learning is a good basic introduction to some diverse libraries (and even a few underlying implementations, like a pretty good from-scratch implementation of a neural network), and it gives a nice, concise, and accurate intro to the math behind some of the techniques. If you are coming at it from the other direction, reasonably familiar with some bestiary of ML techniques and wondering how to idiomatically apply them in Clojure (not easily - persistent data structures and high-performance matrix math don’t really jibe) you’re likely to be disappointed.

On the other hand, the biggest issues with applying ML techniques aren’t things like “how do I run a logistic regression”; they’re things like “my data doesn’t fit in memory anymore”, “how do I get fitting time under 8 hours”, “how do I get the last 2% of accuracy I need”, or “should I be running a logistic regression in the first place”. This is the sort of thing that’s very difficult to approach holistically in a single book, especially a brisk 270-page book that covers ten or so technique variants. To be fair, the author does bring up some meta-level issues like overfitting, regularization, and precision / recall tradeoffs, but it’s not really aimed at giving you a deep understanding of the tricky parts.

So in sum, this is a nice book to put in an office library for an occasional bootstrap, or if you’re using Clojure already and you want to dip your toes in the ML realm. Look at the table of contents, and see if there’s a good amount of “stuff” that looks intriguing (there is a really good range of coverage). But, if you have an interest in a particular technique or problem you’re better off implementing it from scratch or diving deeply into a solid library and learning the nuts and bolts on your own.

This review has been cross-posted to Amazon.

18 May 2014, 21:17

Memory-mapping >2gb of data in Java

Memory-mapping is your friend if you’re doing IO. You avoid the relatively expensive system calls entailed in repeatedly calling raw read or write (or even their stdlib-based buffered counterparts) in favor of transparent access to an array of bytes - the system’s page cache handles actual disk access if and only if it is necessary. Just about the worst thing that can be said about it is that it may not be measurably faster than the system calls, given the right access pattern (for example, linearly copying an entire file) - but if it is a better abstraction for you, there is rarely a negative performance impact.

The use of the OS memory facilities can be a double-edged sword, though, in a 32-bit environment. If you only have 32 bits of address space, you can only map a 4gb file into memory (or 2gb if you’re using signed integers). In contrast, you can do a raw fread() from an arbitrary offset, as long as you’re not reading more than 2^32 bytes’ worth of data in one go.

Fortunately, we’ve had 64 bit operating systems for quite a while - depending on how consumer-grade you’re counting, since the mid-90s or mid-00s. On a modern machine you don’t have to worry about running out of address space and can map as large a file as you’re likely to encounter. That is, unless your programming environment doesn’t give you access to it.

Java standard library APIs have always been a little bit uneven, and unfortunately their mmap abstraction is to return a MappedByteBuffer. All ByteBuffers are constrained (same as arrays, unfortunately) to have no more than 2^31-1 elements. Even worse, they’re not threadsafe. You can only copy a particular chunk of bytes by calling position(int newPosition), then get(byte[] dst). In between those, another caller may have altered your position to somewhere else. That means you need to do a duplicate() call before everything to get an independently-positionable shared-content “copy”.

One approach is to wrap your buffer in a class that handles the duplicate-before-anything calls, and maps a series of buffers sufficient to cover the entire file (mindful of the fact that you might have to read across the boundary). This is a corner case waiting to happen, and slower than it needs to be. Mmap is a beautiful abstraction and Java has seemingly sucked the fun right out of it with its 1995-vintage design decisions.

The question arises - if the Java standard library designers did it wrong the first time, can we do it right instead? Presumably they’re wrapping the underlying system call that does what we want - we should be able to unwrap the cruft and use it directly.

In fact, deep in the bowels of a non-public class, we can find mmap lurking as if it was some kind of Lovecraftian artifact. Let’s take it out to play:

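Something like the following sketch - assuming a JDK-8-era sun.nio.ch.FileChannelImpl, whose private native map0(int prot, long position, long length) method does the real mapping (with prot 0 meaning read-only there), plus sun.misc.Unsafe for raw access to the returned address; the file path is just a placeholder:

import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

import sun.misc.Unsafe;

public class BigMmap {
    public static void main(String[] args) throws Exception {
        //Grab the Unsafe instance via its private static field.
        final Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        final Unsafe unsafe = (Unsafe) theUnsafe.get(null);

        final RandomAccessFile raf = new RandomAccessFile("/tmp/bigfile.bin", "r");
        final long size = raf.length();

        //That Which Should Not Be Summoned: FileChannelImpl.map0 returns the
        //raw address of the mapping, with no 2^31 limit.  The position
        //generally needs to be page-aligned; we map from 0 here.
        final Method map0 = raf.getChannel().getClass()
                .getDeclaredMethod("map0", int.class, long.class, long.class);
        map0.setAccessible(true);
        final long address = (Long) map0.invoke(raf.getChannel(), 0 /*MAP_RO*/, 0L, size);

        //Native-byte-order reads at arbitrary (possibly >2gb) offsets...
        final long firstWord = unsafe.getLong(address);
        final long lastWord = unsafe.getLong(address + size - 8);

        //...and raw memory copies into heap arrays at known offsets.
        final byte[] buf = new byte[4096];
        unsafe.copyMemory(null, address, buf, Unsafe.ARRAY_BYTE_BASE_OFFSET, buf.length);

        System.out.println(firstWord + " " + lastWord);
        raf.close();
    }
}
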
Ahhh, much better. Native-byte-order integers, raw memory-copies at known offsets, reflection calls to invoke what the compiler informs us is That Which Should Not Be Summoned - it all adds up to a remarkably simple and efficient interface.

03 May 2014, 09:46

Mapping distinct objects to a dense range of integers

Let’s say you’re trying to turn distinct strings (or some other objects) into integers. This pops up all the time - for instance, if you want to analyze a social network, it’s faster if you can access your adjacency list via integer offsets into an array rather than looking up name strings in a hash table.

The desired properties are that:

  • You get a distinct integer for each input
  • The output is as dense (sequential, with no gaps) as possible, but 100% density is unnecessary
  • You can construct the output online (ie, you never need to re-number)
  • Support get-or-insert, get-or-fail, and remove operations
  • It’s wicked fast

The straightforward strategy is a read-check-generate pattern, secured with a ReadWriteLock:

final ReadWriteLock lock = new ReentrantReadWriteLock();
final Map<String, Integer> map = new HashMap<String, Integer>();
for(final String s : data) {
    lock.readLock().lock();
    try {
        if(map.containsKey(s)) continue;
    } finally {
        lock.readLock().unlock();
    }

    lock.writeLock().lock();
    try {
        if(map.containsKey(s)) continue;
        map.put(s, map.size());
    } finally {
        lock.writeLock().unlock();
    }
}

This will produce an appropriately dense, distinct mapping. Unfortunately, it fails the last couple of criteria - there’s no way to support removal, and that single write lock adds a fair amount of contention that slows things down. We can fix the first one:

import java.util.*;
import java.util.concurrent.locks.*;

public class StringMapper {
    final ReadWriteLock lock = new ReentrantReadWriteLock();
    final Map<String, Integer> map = new HashMap<String, Integer>();
    int lastIdx = 0;
    //Any collection would work here; this has the advantage of concentrating
    //"holes" at the end, and being unbounded
    final LinkedList<Integer> freelist = new LinkedList<Integer>();

    public void remove(String s){
        lock.writeLock().lock();
        try {
            Integer current = map.remove(s);
            if(current == null) return;
            freelist.add(current);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer get(String s){
        lock.readLock().lock();
        try {
            return map.get(s);
        } finally {
            lock.readLock().unlock();
        }
    }

    public Integer add(String s){
        Integer out = get(s);
        if(out != null) return out;

        lock.writeLock().lock();
        try {
            //Re-check under the write lock
            out = map.get(s);
            if(out != null) return out;
            out = !freelist.isEmpty() ? freelist.poll() : lastIdx++;
            map.put(s, out);
            return out;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

A little more ceremony here - just like in ye olde malloc, we have to maintain a freelist, and we have to track the last insertion point not on the freelist so we can make fresh allocations.

That write lock is still really annoying, though. We have concurrent maps and atomic integers built into the standard library, it seems like we should be able to figure this out without any of our own locking.

import java.util.Queue;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class StringMapper {
    final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();
    final AtomicInteger lastIdx = new AtomicInteger(0);
    final Queue<Integer> freelist = new ConcurrentLinkedQueue<Integer>();

    public void remove(String s){
        Integer current = map.remove(s);
        if (current != null) freelist.add(current);
    }

    public Integer get(String s){
        return map.get(s);
    }

    public Integer add(String s){
        //If we're doing mostly fresh insertions, the initial check can be skipped
        Integer out = map.get(s);
        if(out != null) return out;
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        out = map.putIfAbsent(s, in);
        if(out != null) {
            //Another thread won the race; recycle the value we didn't use
            freelist.add(in);
            return out;
        } else return in;
    }
}

The remove and get methods are straightforward; the add method is where the magic happens. On each attempted insert, we first check the map. Since putIfAbsent re-does that check anyway, if we’re likely to do mainly insertions (because we have a mostly distinct dataset as input) we can skip the initial one if we want. If we actually intend on inserting, we first poll from the freelist, and if it is empty, get a fresh integer from our counter variable. If we’re successfully able to insert this value into the map, we return it; otherwise another thread beat us to it, so we return the value it inserted and toss ours back onto the freelist.

This isn’t the best mechanism if you really need 100% density (but if you really need 100% density, you can’t support removals either), because there is a race condition between getting a new value and the putIfAbsent call. In between, another thread may have gotten a later value and added it to the map, generating a “hole” at your index and forcing you to add it to the free list. However, the number of total “holes” should be roughly bounded by the number of concurrent threads you’re running. If you have, say, 16 threads simultaneously inserting the same key, in the worst case you end up with an inserted value of, say, 15, and 0-14 in the freelist. Subsequent iterations check the freelist first though, so even if we continue that worst case scenario, we’ll be churning the freelist (repeatedly draining it, replacing the same elements, and adding an integer derived from the counter instead) but ending up with only 15 total gaps in our output.

27 Apr 2014, 17:39

Book review: Red-Blooded Risk by Aaron Brown

TLDR: This book is recommended for anyone who uses math to make money, whether in finance, or (in my case) the areas of the tech sector that rely on modeling customer behavior to make a buck.

This is not a how-to manual, or even a book that tries to make a monolithic argument (although if you take it seriously, one seems to emerge). It’s more of a conversation with Aaron Brown, a prominent (I’m told) risk-manager, old-school quant, and all-around interesting guy in the financial field. It’s easy for these kinds of conversational “what I’ve learned, what I think” books to come out sideways - a barely-edited transcript of an “author” and their ghostwriter (the plague of political memoirs), a platform for the airing of grievances, a journey into the depths of self-ego-stroking, or a collection of platitudes. Fortunately, Brown is better than this.

The book starts to heat up with an interesting interpretation of the Dutch “tulip mania”, which he contends was not a classical mania at all, but a somewhat rational investment in a commodity with characteristics making it a good money substitute. Tulip bulbs (the commodity actually invested in, not tulips per se) are portable, somewhat durable, have predictable inflation rates via an annuity in the form of more bulbs, are impossible to counterfeit or debase, and have an aesthetically pleasing, branded output. It makes at least as much sense to use them for money as it does wampum, and possibly even more than gold or silver, especially when you have a massive inflationary influx of bullion from the New World and a government that debases coins. Their exchange rate for classical coin is somewhat explainable by purely economic and legal factors.

After some digression into the nature of “money” per se, this is used as an entry to a discussion of “money” vs. options / futures / derivatives. In Brown’s formulation, the functional purpose of commodity futures, like an agreement to buy or sell a quantity of wheat for a certain price at a certain time, is not a way to “lock in a price” in the sense of hedging price risk. A miller typically is not “long wheat” in the futures market to guard against price spikes - a price spike impacts him in unpredictable ways (for instance, it may be due to either an increase in demand, or a fall in supply, which would have opposite effects on his business). Instead he contracts for purchase on a more flexible retail basis (not in the futures market), and is short wheat in the futures market (using the proceeds, in fact, to buy the actual wheat he is processing). These offsetting transactions, one long, one short, have the effect of borrowing wheat directly, without the necessary use of money (the contracts themselves can be used as collateral for each other since they nearly offset). When he has sold his flour, he buys out his short with the proceeds and repeats the process. Historically, money itself was a scarce and volatile commodity on the frontier, and eliminating the use of money eliminated a substantial source of risk. Instead of an interest rate linked to public debt levels, bank stability, foreign currency flows, etc., one has an interest rate more closely linked to the inherent properties of the markets actually transacted in.

As a digression of my own, it is plainly inaccurate to say that “continually borrowing with no intention of ever paying anyone back is a totally modern development”. If the miller is doing his job right, his debt will be greater each year, and he may very well be borrowing from the same person each time - if it made sense then, why not now? The question is whether he is doing something productive with the proceeds, and what the ratio of his debt to the value of his business is (in fact, if he can borrow more easily and does so, this in and of itself causes his business to increase in value). It gets dramatically more complicated, and the analogy rather breaks down, if the miller’s business is incredibly difficult to value accurately, and he owes most of the debt to his own subsidiaries, heirs, tenants, people legally obliged to hold it & never redeem… In any case, if one must analogize, it’s a far more suitable analogy than grotesque “country as household” rhetoric.

The final, and most generalizable, part of the book focuses on the notion of risk and probabilities itself, how it relates to models, and how the impact of these risks (like the unfortunate “going bankrupt” thing) manifests and can be controlled in the real world. The Kelly criterion is introduced as a way to think about investment strategy, and the warring camps of frequentists and Bayesians are reconciled into a neat package by explicitly considering what a machine learning practitioner would call a loss function and Brown considers as a choice of numeraire (this synthesizes nicely with the question of borrowing bushels of wheat vs. cash, and it is only poor editing that leaves this connection to the discovery of the reader). Dollars are an easy choice, that usually has a neat quasi-Bayesian interpretation - you may consider it in terms of conversion rates or eyeballs, but be careful when the relationship between those and the almighty dollar breaks down. Dollars aren’t always appropriate either, especially when there is no free market setting prices for the underlying events - if you’re trying to build an accurate speech-to-text engine, it’s foolish to try to put a dollar price on each error.

When models break down, Brown has an engaging explanation of the concepts of value-at-risk and risk management in general. Not being an expert in the field, I find it difficult to judge his approaches to risk, and many of them seem inapplicable to my line of work. The technology sector doesn’t have “traders” to manage at quite the level he describes, but the notions of rigorously evaluating performance, taking appropriate levels of risk, and planning for what happens when you fail are universal.

Ultimately, reading this book has convinced me that there is a massive mismatch in technical sophistication between models in the financial and technology sectors. The high-end predictive models being developed by the technology sector, for image processing, natural language processing, collaborative filtering, click modeling, fraud detection, etc., cover much more ground and seem vastly more sophisticated than those of the financial sector. They incorporate more data, make higher-dimensional predictions, and generally use heavier machinery. But the superstructure, the way models are actually used, thought of, and evaluated on a meta level, lags badly behind. Most of this is probably due to the decision time horizon - it would be a different story if the financial sector didn’t require sub-millisecond latencies for their predictions, or if a slight increase in face recognition accuracy was worth the billions of dollars a consistent edge in the financial markets is worth. It may be, with time, that we will see the financialization of the technology sector, securitizing and selling derivatives on click rates or ad impressions in the same way we securitize timber reserves or the profits from Facebook in toto. Already the startup sector leads the way - the only way to understand something like Instagram is as a purchase, not of a revenue stream per se, but of a nevertheless valuable commodity, financed by transforming monetary capital into users, engagement, a software platform, or whatever you wish to call their end result.

The unfortunate aspect of this book, which is not cleanly separable from the thing that makes it interesting, is that it’s clearly a very personal work. The syncretic aspects sometimes diverge into tangential areas, some of the main ideas and interesting connections are scattered, and it could generally use a better editing job. Fortunately, no one will actually force you to read the whole thing - unless they intrigue you, skip the comics, the social history of Wall Street, and the disclaimers, and enjoy a very fresh look at the interface between predictive models, risk, and decision-making.

13 Apr 2014, 11:21

A/B tests and the Kelly criterion

When you’re testing a sequence of changes in an A/B test, the question naturally arises, how large of a group should I test on? Normal practice is to wing it, usually based on some function of the number of fingers you have. This is suboptimal; a 1% test group for a Facebook-sized organization, or in the context of millions of ad impressions, is likely to be incredibly overpowered.

You can deal more rigorously with the issue by recognizing that an A/B test is a wager: we take, say, 1% of our income stream for the time period, and we spin the wheel, seeing if the payoff is negative or not.

There is a “multi-armed bandit” problem that makes this analogy explicit. However, it’s dealing with a slightly different formulation: given a number of possible “plays”, which sequence do we test & how many spins do we give each one? The vanilla posing of the problem doesn’t give a direct answer to the question of how much money we should be dumping into the slot each time (although math composes nicely and one can get a coherent solution for all these problems combined with some effort).

The question of appropriate bet size has a claimed definitive answer in the Kelly criterion. It is remarkably simple:

Proportion to bet = (prob. of success) / (cost of failure) 
                  - (prob. of failure) / (payoff of success)

To pick some contrived numbers, if I have an experiment that I guestimate to have a 70% chance of a 2% gain, and a 30% chance of a 4.5% loss, I should put ~55% of my bankroll into that wager. Essentially what this decision gets you is maximum expected long-term total payoff if you fold your winnings back into more wagers. This means that it’s not actually the appropriate criterion if you can’t continually reinvest your winnings: for instance, if you’re measuring the effect of switching from one JVM garbage collector to another, you get a choice of three and you’re done, and your “winnings” are something like having to deploy a constant proportion fewer machines or a constant reduction in latency (eg, a one-time static payoff). On the other hand, if you’re trying to turn ads into customers into ads into customers, the analogy is pretty apt.
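
Spelling out the arithmetic for that contrived example (a throwaway snippet, nothing more):

public class KellyExample {
    public static void main(String[] args) {
        //70% chance of a 2% gain, 30% chance of a 4.5% loss
        double pWin = 0.70, gain = 0.02;
        double pLose = 0.30, loss = 0.045;
        //(prob. of success)/(cost of failure) - (prob. of failure)/(payoff of success)
        double fraction = pWin / loss - pLose / gain;
        System.out.println(fraction);   //~0.56, i.e. ~55% of bankroll
    }
}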

A few questions rear their ugly heads immediately:

  • How can I possibly know the expected payoffs if that’s what I’m trying to measure in the first place?
  • How does statistical significance play into this?

The first is more of an art than a science, but you can get an estimate by looking at the results of previous experiments. If all your previous futzing with your site’s fonts only shifted conversion by 0.5% in one direction or the other, your custom Lobster substitute is unlikely to change it by an order of magnitude. But still, it’s difficult to have reasoned estimates, especially at low probabilities of highly variable magnitudes from a fickle and ever-changing customer base. It might help if you thought of the bet as not “shift N% of our traffic to an A/B test of button color”, but as “shift N% of our traffic to an A/B testing team with this track record”.

The second is trickier. As I mentioned above, it is possible to reconcile these ideas seamlessly, but it does take some math and some assumptions about what your “real” utility function and beliefs are. The Kelly criterion is fundamentally forward-looking, and statistical confidence is fundamentally backwards-looking, so they need not be in conflict, and one of the nice things about the Kelly criterion is that there is no explicit term for “variance”. In particular, because Kelly only depends on expected outcomes, you can use Bayesian updating of your expected payouts as results come in, and adjust proportions in real time.

If this sounds like it might get a bit complicated, it’s because it does. Unfortunately there is no way around it: the more rigorously you try to model a complex system of probabilistic payoffs the more elaborate your model has to be, which is why the solutions tend to be to stick with ye olde A/B tests (which are for the most part predictably suboptimal), or hire some consultants or in-house statisticians.

06 Apr 2014, 10:29

Designing a persistent Bloom filter

Bloom filters are handy data structures, particularly for applications where data sets regularly exceed RAM if stored literally. For instance, they’re useful as a way to implement an inner join or filter. You can stream the restrictor dataset into a Bloom filter, and then stream your restrictee through the Bloom filter, propagating only those elements that match, and taking only 1.44 * log2(1 / errorRate) bits per entry in your restrictor dataset. This is why databases like Cassandra use them extensively.
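
As an illustration of that streaming pattern, here is a sketch using Guava’s BloomFilter (any implementation with put / mightContain works the same way; the class and method names here are just for the example):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomSemiJoin {
    //Keep only the restrictees that (probably) appear in the restrictor set,
    //without materializing the restrictor set itself in RAM.
    public static List<String> restrict(Stream<String> restrictors,
                                        Stream<String> restrictees,
                                        int expectedRestrictors,
                                        double errorRate) {
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                expectedRestrictors, errorRate);
        restrictors.forEach(filter::put);
        //False positives slip through at roughly errorRate and can be
        //re-checked against the real data afterwards.
        return restrictees.filter(filter::mightContain).collect(Collectors.toList());
    }
}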

Usually they’re formulated as mutable data structures over a bit-array, which is in turn implemented (at least in Java) on top of an array of longs. But there’s no reason why we have to use a mutable array; persistent data structures are desirable for many reasons, not least because they can be used with the same idioms as builtin Clojure data structures if we’re operating in that environment. How do we implement them persistently? On top of a persistent vector / array, of course.

Standard Clojure persistent vectors have object overhead, and the point of a Bloom filter is to reduce memory usage, so they’re right out. You could implement it on top of a (vector-of :long) with less overhead, but there is a speed disadvantage; currently gvec / vector-of doesn’t support transients, so with K hash functions in your bloom filter, you’re doing K sequential modifications to your vector, resulting in many more allocations and copies than we’d like.

Basically, designing a persistent Bloom filter comes down to the problem of representing a persistent fixed-sized array in such a way that we can do K “modifications” per step in the minimum number of operations. Essentially all persistent data structures are built on top of a tree-like data structure, so we need to figure out the optimal tree layout. How do we do this?

When we’re constructing a Bloom filter, we provide the expected number of elements and the desired false-positive rate, which in turn gives us parameters for the optimal number of hash functions and number of total bits, so we know how many modifications per insert (K) and how many total elements we’re going to need to represent (number of bits / 64). In order to make this tree persistent, we need the guarantee that we always copy a whole node if we need to modify any of its members. So, if we have a fan-out of F, we basically accumulate a cost of F for each node along each of the K paths, without double-counting shared ancestors.

But how do we estimate the overlap? This is trickier than it seems; one of the assumptions of a Bloom filter is that the bit positions are randomly distributed. But “randomly distributed” doesn’t mean “evenly distributed”; if I drop 100 balls into 100 random buckets, I’ll end up with more than a third of buckets empty. Simulate it yourself and see:

(loop [v (vec (repeat 100 0)) ct 100] 
  (if (zero? ct) 
      (do (println "Buckets empty:" (count (filter zero? v)))
          (println "Final bucket counts:" v)) 
    (recur (update-in v [(rand-int 100)] inc) 
           (dec ct))))

The number of empty buckets (and conversely the number of buckets with at least one entry) follows the classic occupancy distribution - each bucket’s count is approximately Poisson with mean 1, so roughly 1/e of them stay empty. Now, we could use that to figure out the number of nodes likely to be hit on each layer, or we could simulate it, which is more fun.

//Does a Monte Carlo simulation of the cost of an insert.
public static double cost(long m, int k, int f){
    final Random rng = new Random();
    final int trials = 250;
    //Each long holds 64 bits
    final int n = (int)Math.ceil(m / 64.0);
    double cost=0;
    //Simulate from top down, averaging over the trials
    for(int trial=0; trial<trials; trial++){
        //Walk the levels, from fanout f up to the smallest power of f
        //covering the n leaves
        for(int width=f; width < n*f; width *= f){
            final int[] hit = new int[width];
            for(int i=0; i<k; i++){
                hit[rng.nextInt(hit.length)] = 1;
            }
            int localCost = 0;
            for(int i=0; i<hit.length; i++){
                localCost += hit[i];
            }
            //We may want to add a fudge factor to account for array header,
            //pointer access & function call overhead.
            cost += localCost * f;
        }
    }
    cost /= trials;
    return cost;
}

It would be more than a bit overkill to run a Monte Carlo simulation in the constructor (although I’ve seen worse), and still a little overkill to estimate fresh parameters each time using statistics. The cost isn’t convex due to integer issues (sometimes a higher branching factor lets you get rid of another level, and sometimes it just makes each copy costlier), so we’d have to calculate the results of many different fanouts, and we’ll have some error in our estimate anyway since we’re running on a real machine in the physical world.

It turns out that a fanout factor of 8 has the advantage of being a power of 2, corresponds roughly to a cache line (of course Java object headers mess this up), and gives reasonable simulated costs for large Bloom filters. I’m building a persistent Bloom filter implementation into SOAC and benchmarking, so we can see if the real world agrees with the back-of-the-envelope calculations. A fixed-size primitive array with inherent support for multi-node edits would also be a good base for a different hash table implementation that doesn’t depend on gvec.

05 Apr 2014, 09:39

Full disk encryption with Btrfs and multiple drives in Ubuntu

At this point, encryption is an issue of social responsibility. It is important to establish a norm that data should not live or travel in plaintext without an affirmative reason, especially if you have nothing to hide, because it provides cover to the people who do. Besides the normative aspect, if you intend on doing any international travel, have any interesting friends, or do any interesting or profitable work on your machine, you owe it to yourself to secure your environment.

Ubuntu makes the single-disk encryption scenario relatively easy, but it doesn’t allow a lot of customization at install time, and has no GUI for extending encryption to multiple disks. Fortunately it’s only a bit more CLI work to set it up to work transparently with multiple disks, so you only need to enter the one passphrase. I’ve tested this in a VM with the latest Ubuntu 14.04 beta, but it should work for other versions of Ubuntu, or any distro with support for cryptsetup.

The default Ubuntu “encrypt my whole hard drive” installer layers like so:

  1. The physical disk
  2. An extended physical partition
  3. A LUKS wrapper
  4. A LVM physical volume
  5. Two LVM logical volumes: one for swap, and one EXT4 filesystem for your root directory.

Whew! This is probably fine for your system drive, if a little complex; it’s nice being able to use LVM to resize your swap partition if your needs change dramatically, and if your system drive is a different size / speed than those in your storage array (eg, a 32GB SSD vs. an array of 4TB spinny disks) it wouldn’t make sense to have it as part of the same filesystem anyway. We’ll accept that default for our root partition and swap, and focus on our secondary data drives.

We’ll assume your main system has been installed successfully on /dev/sda , and we have 2 other disks /dev/sdb and /dev/sdc that we want to set up as an encrypted, Btrfs-managed mirror.

First, let’s blow away the existing disks, and create some fresh partitions. You can do this graphically or through any partition editor. The key thing is to end up with one unformatted partition on each disk; /dev/sdb1 and /dev/sdc1 respectively.

# You’ll need to be superuser for all of this
sudo -i
# For these commands, select "o" to zero the partition table,
# "n" to create a new partition (follow the defaults for a single primary
# partition that fills all space), then "w" to write to disk.
fdisk /dev/sdb
fdisk /dev/sdc

We’re going to use a keyfile so we only have to enter the passphrase that unlocks our root partition. Let’s generate one.

# 512 bit / 64 byte keyfile
dd if=/dev/random of=/etc/keyfile bs=1 count=64

Create a couple of LUKS wrappers inside those partitions, using the keyfile we just generated.

cryptsetup --key-file /etc/keyfile -v luksFormat /dev/sdb1
cryptsetup --key-file /etc/keyfile -v luksFormat /dev/sdc1

Now we load the encrypted mapping, to /dev/mapper/enc1 and /dev/mapper/enc2 respectively, again using the keyfile. We write plaintext into the mapper, and it comes out encrypted on the raw device.

cryptsetup --key-file /etc/keyfile luksOpen /dev/sdb1 enc1
cryptsetup --key-file /etc/keyfile luksOpen /dev/sdc1 enc2

Now we make a choice. Btrfs has its own LVM-esque capabilities, so rather than layer in more complexity by using logical volumes, we use Btrfs directly inside the LUKS wrapper.

# Btrfs isn’t installed by default
apt-get install btrfs-tools
# A label makes management slightly easier.
mkfs.btrfs -L vol1 -m raid1 -d raid1 /dev/mapper/enc1 /dev/mapper/enc2
# The final mount point
mkdir /mnt/enc
# We can mount any of the component devices manually, and have access to the full array
mount /dev/mapper/enc1 /mnt/enc

OK, let’s modify our fstab and our crypttab so our system knows to decrypt and mount these drives. This should be added to your crypttab (optionally replacing the devices with their UUIDs, which you can get via “sudo blkid”):

# Optionally add "discard" to options to support TRIM on SSDs
enc1 /dev/sdb1   /etc/keyfile  luks
enc2 /dev/sdc1   /etc/keyfile  luks

And this should be added to your fstab (again optionally using UUID, or the label of the array):

/dev/mapper/enc1 /mnt/enc  btrfs defaults 0 2
# Optionally, using label:
# LABEL=vol1     /mnt/enc btrfs defaults 0 2

Now, when you boot, you will be asked for your root passphrase first. The keyfile will be decrypted, and used to decrypt your Btrfs component drives. They will then be mounted, and your data will be secure.


I’m told that some people’s setups required a bit more massaging to get up and running with automount. Their fix involved adding a “device” parameter to the fstab, something like the following:

/dev/mapper/enc1 /mnt/enc  btrfs device=/dev/mapper/enc1,device=/dev/mapper/enc2 0 2

19 Mar 2014, 19:34

Upgradable locks in Go

Read-write locks are one of the basic means of coordinating access to a shared resource. As many readers can coexist as you’d like, but a writer requires exclusive access. For performance reasons, you’d like to have your write locks over as short a section as possible so you don’t unnecessarily block readers.

Sometimes, though, you have a long stretch of reads, and then a relatively short section of writes dependent on those reads. While you’re reading, you don’t care about other readers - but you do want to ensure that after your reads, you (and not some other caller) immediately acquire the write lock. This is known as “lock upgrading”.

An example of this is restructuring a file (eg, compacting a database file). You read it in, process it, write it to a temp file, and rename it to overwrite the original. While you’re reading the original and writing to a temp file, you don’t care about other readers. But after you’re done reading, you do want to adjust all the metadata so that subsequent readers are pointed at the right file handle / mmap’d buffer / whatever, and also to ensure that no writers bash the file between your reads and your writes.

Unfortunately, POSIX locks don’t support this. Java, which probably has the best shared-memory concurrency runtime available, explicitly forbids it. If you happen to be using Go, which has only the most basic POSIX-y concurrency constructs, how do you do it without holding a write lock the entire time?

The key insight is that the caller doesn’t actually need to ensure that it “upgrades the lock”, per se. It just needs to ensure that its code is run immediately the next time the write lock is acquired. It turns out this is not too difficult in a language that supports passing around functions:

import (
	"sync"
)

type UpgradableLock struct {
	uglMutex sync.RWMutex
	upgrades chan func()
}

func NewUpgradableLock() *UpgradableLock {
	return &UpgradableLock{sync.RWMutex{}, make(chan func(), 1)}
}

func (ugl *UpgradableLock) Lock() {
	ugl.uglMutex.Lock()
	// Run any scheduled upgrade now that we hold the write lock.
	select {
	case v := <-ugl.upgrades:
		v()
	default:
	}
}

func (ugl *UpgradableLock) MaybeUpgrade(f func()) bool {
	select {
	case ugl.upgrades <- f:
		// Make sure some writer acquires the lock soon and runs f,
		// even if no other writer ever shows up.
		go func() {
			ugl.Lock()
			ugl.Unlock()
		}()
		return true
	default:
		return false
	}
}

// RLock, RUnlock, and Unlock just call the corresponding method on the underlying RWMutex.

The full example lives here. When we try to upgrade, we attempt to place the function containing our write-lock-requiring code in a channel that is always selected from when the lock is acquired. In case there are no other writers trying to acquire that lock (and hence run our code), we fire off one ourselves - a goroutine that just locks and unlocks, with no additional side effects. In order to ensure we’re handing off to the particular write operation we’re placing on the channel, we set the capacity of the channel to 1. If we can’t place that function (ie, there is already another writer scheduled), the upgrade fails.

There are two important limitations to this approach:

  • It doesn’t support arbitrary upgrade-downgrade cycling. This is less than ideal if for instance we wanted to split our write operation into two parts, separated by a read. Currently it is a one-way street.
  • The upgrade is not guaranteed to succeed. This makes sense - if you have multiple readers wanting to upgrade, only one of them can be next. What is guaranteed is that if we have multiple upgraders, at least one succeeds. The rest can continue spinning in the interim, attempting to stay low-impact with time.Sleep() or runtime.Gosched() calls on each cycle until they all complete.

What’s also nice is that this approach works on any platform with some sort of passable function / callable construct, read/write locks, and a threadsafe queue (in fact, the queue here isn’t much of a queue - you could replace it with an atomic reference or a plain mutex and a pointer).

01 Feb 2014, 10:16

Immutable SQL

I have strong opinions on data models. They’re not exactly novel opinions, which is good, because if they were truly novel it’d probably be difficult to reify them on top of existing data stores. As it is, if you want your very own immutable persistence layer, you can call Rich Hickey and he will hook you up.

(I’m going to attempt to use the word “immutable” to signify data, whether it lives on disk or in RAM, which is “updated” via returning new copies and which can be validly observed at any point in time, to avoid confusion between persistence as a synonym for durability, and persistence as non-mutability. In reality it’s a bit of a misnomer.)

But Datomic is kind of expensive; SQLite on the other hand is free, so let’s use that (or really, any other free RDBMS). A data model that is both relational and immutable has certain conceptual advantages anyway, because the immutability applies to both the entities themselves, and also the relations between the entities. A blog post is in some sense the “same” blog post after editing, just with two values for its content over time. Its relation to its comments hasn’t changed. If we remove a comment, the post’s relation with its comments up to that point also doesn’t change - it’s just that we record a new relation from that point on that omits the comment in question.

This is basically the Clojure idea of identity vs. state vs. value, expressed in a relational world.

How does this work? Basically like this:

create table posts(post_id integer not null, version integer not null, current integer not null, body text not null, primary key(post_id, version));
create table comments(comment_id integer not null, version integer not null, current integer not null, post_id integer, body text not null, deleted integer, primary key(comment_id, version));

--Unique on post_id and version prevents double-insert
insert into posts(post_id, version, current, body) values (1, 1, 1, "Valuable content!");

Notice that post_id alone does NOT form a primary key; only the combination of post_id and version does. We’ll potentially have many “versions” of a post, but they all have the same identity.

Here’s where things get interesting: how do I edit a post?

begin transaction;
insert into posts select 1, 2, 1, "On further reflection..." from posts where post_id=1 and version=1 and current=1;
update posts set current=0 where post_id=1 and version=1 and current=1; --Invalidate old data
commit;

The only part of a row that changes is the “current” flag, which is just an optimization to prevent us from having to attach a “having version=max(version)” clause to every query for the current state. Now, we live in a world where the present value for the same entity is different, but the history is preserved.

Now, here’s the really cool part: what happens if my blog co-author tries to edit the post at the same time, and submits after I do?

Absolutely nothing.

Those “where” clauses in the insert and update ensure my writes only happen if I’m working with the latest version of the data. If I am mistaken about that, no transaction happens. I can verify whether my write happened via the changes() function, or via a select after the fact if other clients are sharing my connection to the db.

Now: what if, instead of effectively overwriting the “current” state, I want to apply a functional transformation to it? For instance, I decide it’s important that I convert my double-spaces to single-spaces. I grab the current value, compute my changes, and try the quasi-update dance above. If it doesn’t work, I simply try again. Eventually my changes will go through, and it will be a valid transformation of the state - a transformation that does not need to be expressible in SQL, avoids the ABA problem, and doesn’t depend on verifying value equality in SQL (which might not be efficient or even possible).

That looks exactly like Clojure’s atom functionality, where you grab the current value, compute its transformation, and compare-and-swap the reference (which operates by checking pointer addresses, not the contents of the object pointed to), repeating if necessary until it commits.

And finally, let’s explore the relational aspect. How do I add and remove a comment?

insert into comments values(1,1,1,1,"You write so good.",0);

begin transaction;
insert into comments select 1,2,1,post_id,body,1 from comments where comment_id=1 and version=1 and current=1;
update comments set current=0 where comment_id=1 and version=1 and current=1;
commit;

Adding a comment is done in the same way as adding a post. Removing a comment is done by adding a value that has been “marked” in some way as deleted. In this case we have an actual column that records that information, but this is the same sort of optimization as having a “current” flag - we could just as easily signify deletion by having a NULL body, or a NULL post_id, and looking for the prior version if we wanted the deleted content. It depends on how we want to think about our data - when someone “deletes a comment”, are they more disassociating it from a post entirely, or “erasing” the content while maintaining the relation? Many sites actually display deleted comments as a regular comment with a “[deleted]” body, to preserve conversation threading - that’s something that’s easy to represent in this schema.

In fact, in a world where a comment was associated with multiple posts and vice versa, you’d typically represent it as a table for posts, a table for comments, and a join table for the relations. In that case, removing a comment from a post really does entail recording an updated relation and nothing else - nothing magical needs to happen when a comment reaches zero references, except perhaps when you “garbage-collect” your database by archiving old / unreferenced data.


Let’s talk briefly about querying the data and making it efficient. This isn’t anything particularly magical; if you want to restrict yourself to current and non-deleted posts / comments, you just need to embed that restriction in a “where” clause, like so:

--All post IDs & bodies x all of their comment bodies
select posts.post_id, posts.body, comments.body from 
(select * from posts where current = 1) as posts 
left outer join 
(select * from comments where current = 1 and deleted != 1) as comments
on posts.post_id = comments.post_id;

You’ll be doing a lot of this, so you’ll probably want to encode that as a view for each table, and have a compound index on the “ID” and “current” columns to make those queries efficient.

The point of this is to facilitate a more historical view of the world - if your production systems mostly care about the present state, you can ship the archived state to a completely different system (maybe Hive for storage and bulk queries, and Shark for iterative analysis):

begin transaction;
insert into archived_posts select * from posts where current = 0;
delete from posts where current = 0;
commit;

This only works to archive “dead” data - but the present state is probably something you want in your analysis stack as well. To handle this, just ship your present state as a transient table that’s overwritten, and your archived state as a strictly accumulating store. Query the union.

select * from posts
union all
select * from archived_posts;

And having snapshots isn’t helpful unless we can actually examine the state on a particular date. To do this, you’d need to have a “created_on” column that embeds the date:

-- What version was live on 2010-01-01?
select post_id, version from posts 
where created_on < "2010-01-01" 
group by post_id having version = max(version);
-- Join this with whatever data you're actually interested in.
-- Or examine the subsequent state, if any; because version ID is
-- monotonically increasing by 1, it's trivial to examine deltas.

That requires a group-by over a potentially large amount of data per ID, depending on how many revisions you have. Depending on your particular RDBMS and indexing scheme, it may or may not be efficient to do that in the same system with the same cache that’s managing queries to the present state, which is why being able to ship historical data to a separate system is so handy.

04 Jan 2014, 22:02

12 ways of looking at logistic regression

Logistic regression is the most popular off-the-shelf classification technique. In fact, it’s so popular that every subfield has their own interpretation of the same technique, just so they have it in their toolkit while being able to compare it with the stuff they’re actually researching in the same theoretical framework. I’ve attempted to reproduce some of the many interpretations I’ve heard of below, bearing in mind I’m not an expert in all of these, so some of the nuance might be lost.

  • The classical statistical interpretation. Your labels come from a binomial distribution, conditional on their features. You want to estimate the distribution.

  • The Bayesian statistical interpretation. In addition to the above, your parameter estimates are themselves probabilistic beliefs with their own distributions. If you could encode a totally non-informative, zero-strength prior, this ends up being more or less the same as the frequentist interpretation.

  • The latent-variable interpretation, popular with social scientists and psychologists. There is some latent continuous variable that determines the outcome, depending on which side of the threshold it falls on, but we can only see the final outcome. Your goal is to estimate the parameters that determine this latent variable as closely as possible.

  • The Kentucky Derby interpretation. Your parameters represent multiplicative effects on the odds (as in, a 4:1 bet). Your goal is to estimate each feature’s multiplicative effect on the odds so that the combined odds reproduce the observed outcomes.

  • The less-Naive-Bayes interpretation. Like Naive Bayes, but estimating the pairwise correlations/covariances instead of assuming uncorrelated variables.

  • The information-theory interpretation. Find parameters so that, conditional on the features, the predicted label distribution has maximum entropy subject to matching the feature statistics observed in the training data.

  • The warp-space interpretation. Perform a kind of quasi-linear-regression in a space where we transform the label dimension via an inverse sigmoid.

  • The loss minimization interpretation. You have a loss function that gives you a penalty for each misclassified example (a higher penalty the more extreme your prediction was), and you classify an example by dotting its features with your parameters and applying a sigmoid. Find parameters that minimize the loss.

  • The “minimum bias” interpretation, popular with actuaries. Plot your data as a tensor, with each feature being a dimension, and the outcomes for each feature combo being summed in the appropriate cell (this only works for categorical features). Try to find parameters for each dimension, so that when you sum them together, apply a sigmoid, and multiply by the cell population, you get minimal binomial loss.

  • The neural network interpretation. Your features constitute a stimulus, dot-product’d with your parameters and fed through a sigmoid activation function to get a predicted label. You’re maximizing the fidelity with which your neuron can “remember” the label for the data it has seen.

  • The support vector interpretation. Take your data, and try to segment it with a hyperplane. For each point, apply a “supporting vector force” to the plane, proportional to the logit of the distance. When the forces balance, your hyperplane gives you your parameters.

  • The feedback interpretation. We initialize our parameters to some garbage values. For each observation, we dot the features and our parameters. If the result is negative and the outcome in our data is positive, or vice versa, move the parameter vector “backwards”, in the reverse direction of the feature vector. If they’re both negative or positive, move the parameter vector “forwards”, in the direction of the feature vector. This corresponds to the stochastic gradient descent fitting procedure.
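
The loss-minimization and feedback interpretations in particular map almost line-for-line onto code. Here is a minimal sketch of one stochastic update in Java - the array representation and names are purely illustrative, and this is the exact-gradient form of the feedback rule rather than the sign-based description above:

// One logistic-regression SGD step: nudge the parameters toward (or away from)
// the feature vector in proportion to how wrong the current prediction is.
public final class LogisticSgdStep {
    static void update(double[] params, double[] features, int label, double learningRate) {
        double dot = 0.0;
        for (int i = 0; i < params.length; i++) dot += params[i] * features[i];
        double predicted = 1.0 / (1.0 + Math.exp(-dot)); // sigmoid of the dot product
        double error = label - predicted;                // label is 0 or 1
        for (int i = 0; i < params.length; i++) {
            params[i] += learningRate * error * features[i];
        }
    }
}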

There are probably more I missed. Despite the fact that everyone has a similar basic toolkit, there seems to be a pretty low amount of cross-domain pollination on extensions, even between similar fields like machine learning and statistics, or statistics and actuarial science. Maybe that’s because everyone is speaking their own dialect, and content to restrict their conversations to native speakers.

25 Nov 2013, 14:45


Everyone should write a database. It’s a core area of practical software engineering, it’s fun, and it turns out it’s not that hard.

(The real reason, of course, is to wind up rolling in dough. It’s basically the same thought process behind learning to play the guitar.)

So I threw together Clownshoes, the ADVANCED ENTERPRISE GRADE NOSQL BIG DATA HIGH PERFORMANCE SCALABLE DOCUMENT STORE FOR BIG DATA (capitalization is emphatically part of the description). It’s not quite as awesome as some other NoSQL databases, like IMS, but it’s fun in its own way. The core storage engine bears a strong resemblance to a certain other prominent document database, to wit: mmaped linked-lists of “documents”, where “documents” are in this case arbitrary bytestreams, which you can easily use to represent anything from JSON to BSON to entire web pages.

It supports atomic inserts, deletes, and updates (“supports” might not be a strong enough word; all updates are atomic since there’s a per-database write lock). It scales to collections over 180TB in size, and working sets over 1TB. It’s wicked fast on the storage layer. How does it do all this?

Clownshoes, being HIGH PERFORMANCE, knows that in order to scale to BIG DATA you need to run on BEAR METAL. No network overhead for us! Instead it is embedded in your running Go application process. You can easily scale horizontally by buying a couple of LSI cards and plugging in a disk enclosure to the left or right of your application server, or inserting more RAM to either side of your CPU socket. If you find yourself bound on CPU throughput, you can scale horizontally as long as you built with some creative geometry which gives you room to grow (see the diagram below):



The other main performance trick is journaling. Journaling essentially converts random writes into linear writes by recording a running log, eg, “write N bytes at X location”, and ensures that those linear writes actually hit disk. That way if there’s a problem actually committing to your main database file, you can replay your journal against the last consistent snapshot and recover. Not a bad idea, but we need all the IOPs we can get - our journaling strategy is to not journal, and let the user decide when they want to snapshot their database (Clownshoes believes in moving fast, and also in breaking things). In general, complete configurability about how much to care about data is the secret to our webscale sauce.

Now, you might say to yourself,

This is almost exactly a linked list of strings, plus some hashmaps for indexing, plus gob-based serialization, plus the OS’s paging algorithm to support data bigger than physical memory capacity. The only difference is that you’re managing the memory a bit more manually via mmap, but because Go’s GC is currently non-compacting, if you’re using the built-in structures and memory pressure is high enough the OS will swap out a lot of untouched data to disk anyway. The wheel has been reinvented here.

This is true. However:

  • There is an advantage to managing the mmap-ing, compaction, etc. yourself, rather than letting the OS manage the swapping. Go’s GC does have to traverse pointers on the heap when it GC’s, which can be avoided if you manage the pointers by embedding them all in a single array as Clownshoes does. Depending on how & where the “native” pointers are stored in memory, they might actually generate some swapping you don’t need as they bring in entire 4k pages, and traversal will of course add more time to your collections.
  • I said everyone should write a database, not use a database they wrote themselves. You probably shouldn’t use Clownshoes for important things; I am personally using it to manage some metadata about which movies I have watched. I have probably spent more time writing Clownshoes (a few hours cumulatively) than I have watching movies over the past month or so, so this is not mission-critical data. In any case, if you want an embedded data store, what’s wrong with SQLite? Just use SQLite already.

I guess that last one isn’t really a “however” so much as a “yup”.

13 Nov 2013, 17:56

Persistence + GC saves memory

One of the disadvantages of garbage-collected runtimes is potentially increased memory consumption. This usually comes from a couple of places: object headers used for GC metadata, and “extra” heap space kept around to manage the garbage collection process (for example, the space taken up by Java’s Eden pool, which is gradually filled up and then GC’d, rather than incrementally freeing objects as in a reference-counted environment like Python or a manually managed malloc/free environment like C).

But garbage collection can actually lead to vastly reduced memory consumption, because it allows you to use persistent data structures that share data which would otherwise have to be tracked separately. You can share the same backing data for things that will be used & modified in different ways, because it’s effectively a copy-on-write environment: if you conj data to a shared vector in Clojure, the original is unchanged and the caller gets a “new” vector, with most of its backing data shared with its parent.

Those kinds of persistent data structures really only work (at least performantly) in the context of garbage collection, where you’re automatically freeing objects that have ended up with no references to them. Otherwise, managing the object graph for the kind of complex relationships that persistent data structures generate is nigh impossible.

Here’s an example. You have a large number of short phrases you’re analyzing. If they’re in a Latin-alphabet language, you can easily segment them into lists of interned strings representing words (a poor man’s trie). A lot of phrases are duplicates, so you can also de-dupe the entire phrase-list - and it’s free as long as you’re using persistent data structures that guarantee you’re placing no restrictions on how future callers can use your data.

;Straightforward impl
(with-open [r (clojure.java.io/reader "short_cuts")]
  (->> (line-seq r)
       (map #(vec (clojure.string/split % #"\s")))
       (into [])))

;Intern the strings
(with-open [r (clojure.java.io/reader "short_cuts")]
  (->> (line-seq r)
       (map #(vec (for [token (clojure.string/split % #"\s")]
                    (.intern ^String token))))
       (into [])))

;Dedupe the vectors
(with-open [r (clojure.java.io/reader "short_cuts")]
  (loop [remaining (line-seq r)
         as-vec []
         as-set #{}]
    (if (empty? remaining)
      as-vec
      (let [this-line (vec (for [token (clojure.string/split (first remaining) #"\s")]
                             (.intern ^String token)))]
        (if (contains? as-set this-line)
          ;Reuse the copy already in the set, so equal phrases share one vector
          (recur (rest remaining) (conj as-vec (get as-set this-line)) as-set)
          (recur (rest remaining) (conj as-vec this-line) (conj as-set this-line)))))))

The result: Vectors of non-interned words take 21.5 gb. As vectors of interned words, they take 12.1 gb. And as de-duped vectors of interned words, they take 4.4 gb. And each version supports the exact same operations on the result - you can take one of those de-duped phrases and “modify” it with no side effects on its siblings, because you’ll get back a new reference.

But here’s the twist: most of that storage is overhead from the clojure.lang.PersistentVectors we’re storing the strings in. Each of those has a 32-long Object array for the leaf, plus another 32-long Object array for the root node, plus some miscellaneous primitive fields. Seems a bit excessive when the vast majority of our vectors are <=10 elements long. What happens if we replace them with something more efficient, like a PersistentArrayVector from SOAC?

(require 'soac.arrvec)
(with-open [r (clojure.java.io/reader "short_cuts")]
  (loop [remaining (line-seq r)
         as-vec []
         as-set #{}]
    (if (empty? remaining)
      as-vec
      (let [this-line (soac.arrvec/array-vec (for [token (clojure.string/split (first remaining) #"\s")]
                                               (.intern ^String token)))]
        (if (contains? as-set this-line)
          (recur (rest remaining) (conj as-vec (get as-set this-line)) as-set)
          (recur (rest remaining) (conj as-vec this-line) (conj as-set this-line)))))))

Memory usage goes down to 1.5 gb, or ~7% of what we started with.

18 Oct 2013, 19:57

The genius and folly of MongoDB

MongoDB is easy to make fun of. The global write lock (now just a database-level write lock, woo). The non-durable un-verifiable writes. The posts about how to scale to Big Data, where Big Data is 100gb.

It makes more sense when you look at how the underlying storage layer is implemented. Basically, MongoDB consists of a collection of mmap’d linked lists of BSON documents, with dead simple B-tree indexing, and basic journaling as the storage durability mechanism (issues with what the driver considers a “durable write”, before data necessarily hits the storage layer, are something others have dealt with in depth). Eventually writes get fsync’d to disk by the OS, and reads result in the page with the data being loaded into memory by the OS.

All the speed that was initially touted as the killer advantage, is a result of using the page cache. When you realize “it’s just mmap”, all the BS about optimizing your working set to fit within RAM, the cratering of performance once you hit disk, the fragmentation if you routinely delete or grow records, etc., make perfect sense. The OS doesn’t know you’re running a database; it just knows you want to mmap some stuff and gives it its best shot. Fortunately, the algorithm was written by some very smart people, so it tends to work pretty well up front (as long as you’re always hitting the page cache) - but when the OS schedules writes it doesn’t know about your storage layout, or even the difference between your indexes and your data. It certainly can’t infer what data to keep in cache, or preload, because it doesn’t know what or where your data is - it’s just some bytes somewhere you wanted for some reason.

But actually, that’s the Tao-like genius of MongoDB - having absolutely nothing new. Most databases are built with some killer idea: the consistency protocol for Cassandra, the crazy data structures of Redis, or the data-processing abilities of Hadoop. MongoDB has mmap, and by “has”, I mean “uses” (but hey, possession is nine tenths of the law). Not having to design your own caching algorithms or write strategies, and using the simplest possible implementations of everything else, lets you get to market quickly and focus on marketing your benchmarks, consulting to bring your customers up to Web Scale, responding to haters, or learning about concurrency. You can get a fair amount of traction before your customers realize there’s very little “there” there, and by that time you might have either cashed out or written an actual database (they’re starting down that road with their page-fault prediction algos, and good for them). In any event, your customers are more or less locked in, as they’ve jumped through so many hoops to accommodate your design decisions (“oh, long field names take up more space in every record? I guess we’ll just use the first N unicode characters and map on the application layer”). It’s no coincidence that you’re being compared to Oracle and IBM.

Like I said, MongoDB is easy to make fun of.

There’s exactly one situation where you should look at MongoDB. Focusing on the storage engine and ignoring all the issues with their broader durability strategy, the killer application is something like user data for an online game: a situation where you have a fairly consistent working set for a given chunk of time, it’s small relative to the whole database, reads and writes are in the same working set, you have lots of reads relative to writes, clients do a lot of the computation for you, you’d like schema flexibility… You could jam that into a relational model and use something like hstore or JSON columns to fill in the gaps, or store documents as blobs / text in something like HBase or Cassandra, but you probably won’t have that bad of a time with MongoDB (until you get a surge of traffic; it doesn’t exactly degrade gracefully).

But in that case, it also wouldn’t be crazy to pull a Viaweb and store it on the file system (ZFS has a heckuva caching strategy, and you know when it accepts a write). It’s obvious and uninformative to say “it depends” regardless of the question; of course some things are generally better than others.

Being able to walk with them doesn’t change the fact that as of right now, MongoDB is clown shoes.

05 Oct 2013, 20:52

I've got 99 features, give me a random one

Say you have a large array of Stuff, and you need to get a random element. In fact, you need to get many random non-duplicated elements - i.e., sampling without replacement. And since we’re living in the future with quad-core supercomputers in our pants, it should be threadsafe and scalable.

The simple thing to do is to shuffle the elements, and distribute them to the callers. You can do this by shuffling with a lock on the array, and then having each caller use AtomicInteger.getAndIncrement() to get a unique index to take. You could even do this online, since shuffling is an incremental process, and distribute the elements with a concurrent queue.

But if you need some small-but-unknown number of elements, this isn’t the best fit. You’re forced to shuffle too many elements, or if you didn’t shuffle enough initially, to maintain a potentially high-contention lock (either on the array itself or a queue) to isolate the array as you select more random elements.

Instead, you can have each caller select a random element, and then check if it has been selected before - either by maintaining a common threadsafe set of selected elements (e.g. via a ConcurrentMap) or, even better, by marking selected elements themselves. As you reach saturation, you’re forced to re-check many times and it becomes more efficient to shuffle-and-iterate. But, for small subsets, it’s very efficient and avoids having to think too much about synchronizing between shuffling and selection.
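
As a concrete illustration of the mark-and-retry approach, here is a minimal Java sketch. The class and method names are purely illustrative, and in practice the marks would sit alongside whatever array of Stuff you are sampling from:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Callers draw random indices and atomically "mark" them, retrying on collision.
// Efficient while only a small fraction of the array has been handed out.
public final class RandomSampler {
    private final AtomicIntegerArray taken; // 0 = free, 1 = already selected

    public RandomSampler(int size) {
        this.taken = new AtomicIntegerArray(size);
    }

    // Returns an index no other caller has received, spinning until it finds one.
    public int nextIndex() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        while (true) {
            int candidate = rnd.nextInt(taken.length());
            if (taken.compareAndSet(candidate, 0, 1)) {
                return candidate;
            }
        }
    }
}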

This kind of usage pattern is common in certain machine learning techniques that use random feature selection - for instance random forests. If you make your sampling parallelizable, you can see some pretty nice speedups.

07 Aug 2013, 07:58

Persistent, primitive-based hashmaps for Clojure

The native Clojure maps and vectors are based on an interesting data structure, the hash-array mapped trie. Essentially it is a tree with 32x fan-out, and some trickery deciding which node to descend down. Their major advantage is that they are persistent, and support structural sharing as they are modified - add an entry, and you usually have the same tree, with log32(n) interior nodes changed and a different 32-wide Object array at a leaf.

However, these kinds of data structures have disadvantages with memory consumption. Clojure maps take roughly twice the space as mutable Java maps - and that’s on top of the Object-wrapping penalty associated with all generic Java container classes. If we’re storing an int-to-double mapping, ideally we’d be able to store that on top of primitive int[] and double[] arrays, but Java forces us to use Integer[] and Double[] instead, which are much less efficient and slower (due to pointer access costs) to boot.

The traditional solution is to write a bunch of specializations of your map and set types, with primitive backing arrays. Trove and fastutil use scripts to generate the source for all combinations of those specializations, much like the C preprocessor or C++ templates. We’ve got a Lisp handy, so we could do the same thing with macros.

But Trove and fastutil are already pretty optimized; there’s no need to reinvent the wheel if we just need primitive hash maps and sets. What we would like to do is stick to the Clojure philosophy and have primitive-backed maps and sets that are not only compact, but persistent as well, with the same sort of structural sharing we get with the built-in data structures.

For straight primitive vectors, this has already been done with the Clojure vector-of, which is just like the normal persistent vector, but with primitive arrays on the leaves. We can’t take the same approach to Clojure persistent maps, because in that implementation the leaves need to be able to contain either the entry, or an additional tree node - the only way to do that is by casting Objects. We can build a hash table on top of the vector-of implementation itself though - much like you build a hash table on top of primitive arrays, but with a few cool twists.

In a vector-of, the elements themselves are stored at the leaves in various 32-long primitive arrays. We definitely want to take advantage of the fact that that’s only a few cache lines long, and wicked fast to scan - this implies that within one of those arrays, we want to probe linearly. Since we incur pointer access costs whenever we go to a different bottom-level array, we want to avoid that as much as possible and guarantee we will find whatever element we’re searching for in a particular region.

There is a hashing strategy that can give us that guarantee - hopscotch hashing. Lookups and deletions are guaranteed to happen inside a particular neighborhood, which we set to 32 elements in order to fit with the structure of the Clojure data structures. Insertions are trickier, and require recursively back-shifting to maintain the guarantee of all elements being within a neighborhood of their optimal insertion point.
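
To make the neighborhood guarantee concrete, here is a toy sketch of the lookup side in Java, assuming open-addressed parallel arrays, a power-of-two table size, and a fixed neighborhood of 32. The insertion-time back-shifting is omitted, and none of these names are SOAC’s own:

// Every key is guaranteed to live within 32 slots of its "home" bucket, so a lookup
// never touches more than one small, cache-friendly region of the table.
public final class HopscotchLookup {
    static int indexOf(int[] keys, boolean[] filled, int key) {
        int mask = keys.length - 1;      // assumes the table size is a power of two
        int home = mix(key) & mask;      // the key's optimal bucket
        for (int i = 0; i < 32; i++) {   // scan the whole neighborhood; gaps are allowed
            int slot = (home + i) & mask;
            if (filled[slot] && keys[slot] == key) return slot;
        }
        return -1;                       // by the hopscotch invariant, the key is absent
    }

    static int mix(int key) {            // any reasonable integer hash will do
        key ^= (key >>> 16);
        return key * 0x45d9f3b;
    }
}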

I’ve implemented a basic version of hopscotch hashing in SOAC, without the bit-map tricks to find the next free/filled element - eventually those will be implemented for testing, but because we’re doing a persistent version, modifying the bitmap may add more time than it gives us back.

The primary benefit versus Clojure maps and sets is memory consumption - in my tests a primitive-based set takes up just 17% of the space a Clojure PersistentHashSet uses. Interestingly, insertion time was on par between the two, even before I applied some optimizations to the scan for free positions, and even though the primitive version has to endure periodic rehashing. The Clojure version spent enough time in GC dealing with a rapidly filling heap, that it ended up being only ~5% faster. Granted, you won’t usually be using a single data structure that takes up over half of your heap space (although it’s common for the kind of machine learning work I do), but it speaks to the many ways that compact data structures help you in the long run.

But there are also substantial benefits versus “raw” primitive-array-based hash tables. If one is, say, partitioning a dataset in many different ways (e.g., when displaying pivot tables over a shared dataset to many users, or fitting a decision tree) one of your main limitations is the ability of the GC to deal with all the garbage you generate, as you copy your dataset, modify it, analyze it for just long enough to put it into the “survivor” gen, and throw it away. With structural sharing, your “copies” mostly avoid taking additional space for identical data, and your garbage consists much more of transient edits.

23 Jun 2013, 11:12


I’m moving my old posts here from scalpelcannon.blogspot.com . Gradually things will appear here.

11 Feb 2013, 11:03

Is the 'daily deals' space sustainable?

I have some amount of experience working in the “daily deals” space on the technical side, but with a lot of interaction with the “business end” (and boy do I ever hate that distinction). Here’s my MBA-style analysis of the sector, devoid of “secrets” but based heavily on the patterns I saw.

  • There are roughly two kinds of businesses, for the purposes of the deals sector: those with marginal cost approaching zero, which tend to be focused on the service sector (hair salons, Lasik, lawn care, hotel rooms), and those with relatively high marginal costs (restaurants, actual physical products).

  • It is incredibly easy to sell marginal customers to zero-marginal-cost businesses and they will benefit greatly. It’s no different than an airline or stadium selling cheap seats at the last moment to fill up to capacity. For positive-marginal-cost businesses, when the direct “loss” on the deal takes out most of their margin (mid-tier restaurants are the main example here: if a merchant runs a $10-for-$20 voucher, of which they actually receive $5, they have to make a $15 profit per check ex ante to break even) you have to rely on fuzzy stories about “new repeat customers”, redemption rates below 100%, and sales of ancillaries with high margins (dessert, booze), possibly due to price elasticity. These stories might even be true, but they’re much more difficult to quantify and spin into a sale (and more importantly, into a re-run a few months later without much additional sales effort).

  • The press focuses on restaurants, but more or less ignores the fact that the most profitable deals are in the health-and-beauty space, which sells more or less exclusively to women. There’s a heavily gendered component.

  • The deals sector relies totally on a large and effective sales force to create and maintain relationships with merchants. The cost-per-runnable-deal is (or should be) their primary cost metric; overhead is (or should be) minimal compared to revenues, so it’s all about whether each deal is generating enough in revenue to pay for the cost of signing it. Anything you can do to make your sales force more productive is probably a Good Idea. One advantage of this is that your capital requirements are pretty low (and you can self-finance just about everything, see below), and you can ramp up incrementally.

  • One huge source of strength is that you get to keep the entire revenue from the deal for many weeks, and remit the merchant’s cut later. That gives you a large amount of cashflow to play with; essentially a “float” like an insurance company.

  • The primary sales channel (to consumers) is email. That spins you into a few interesting areas: you (theoretically) have the ability to tailor those emails very precisely and eke out some incremental gains. You also have the ability to drive revenue via incentivized offers (eg, “an additional $10 off any offer”). This can let you do a combination of additional price discrimination for your most price-sensitive users (on top of that intrinsic to the deal itself), and having customers essentially float you an additional short-term loan. Every company does this to some extent, but because email is already your primary communications channel and you’re already selecting for the price-sensitive, you can ride that train a lot more effectively than, say, Best Buy emailing or mailing out coupons. If you’re an analyst (or the board setting management incentives), though, you really want to watch out for the company trying to game the system each quarter by messing with customer incentivizations and changes to their email composition, possibly at the expense of your margins. “Revenue targets” are less meaningful than elsewhere.

  • “Deals” are fundamentally a middle-class (broadly defined) product: people with enough money to go and do things, but not so much that they’re completely price insensitive at the $20-savings margin. It’s also “middle class” on the sales side: businesses that have both excess capacity of one sort or another, and some ability/desire to try to price-discriminate downwards. Foreign expansion for whatever reason seems to have ignored that.

  • The competitive “moat” is 1) the need for a large sales force to get a sufficient variety of deals, and 2) the need for a large enough customer list willing to be mailed by you at a sufficient volume and purchasing rate to pay for the sales force and overhead. Considering the limited willingness of customers to be mailed, and the limited pool of merchants willing to be sold, there’s a limit to the number of deals providers that can sustainably exist. My hunch is that it’s somewhere between 1.5 and 2.5 (fractional company in this case is oligopoly profits or a region / sector-focused competitor).

  • The email list (and their associated credit card numbers) plus the ability to email them whatever you want constitutes a very valuable capital asset. It’s hard to build such a list (you can buy it incrementally via advertising, but you need to keep them around by having stuff to send them for the list to grow on net, and have enough of an underlying business to generate organic growth) and actually technically challenging to mail them consistently. It’s difficult to value it on paper, but bounding it somewhere around acquisition costs, and at minimum what you could get by going pure black-hat-viagra-spam with it, should give you an estimate. If all else fails, I’d think that one could “cobrand” with a similar business (PriceLine, Fab.com, whatever) and essentially rent it out.

In many ways the deals sector is the perfect target for an 80s-style leveraged buyout: a business with high underlying cash flows, low capital requirements, some capital assets, a relatively price-insensitive core market (the zero-marginal-cost businesses), and excessive overhead and some unprofitable markets (at the moment) due to overexpansion at the peak. It’s also probable that there are some investors who bought in at the peak and are looking to cut their losses. I do think there is a very sustainable business underneath; “email marketing for local businesses” doesn’t sound sexy (and doesn’t really qualify as a “technology business”) but it’s an example of a profitable niche. In the end it will come down to execution, but a good management team should be able to make it happen.

03 Feb 2013, 20:27

Space efficiency of Clojure's collections

How much space do Clojure’s collections take up compared to their mutable counterparts? Let’s see, using the SizeOf class.

(SizeOf/deepSize (long-array (range 1024)))
-> 8208

(SizeOf/deepSize (into (vector-of :long) (range 1024)))
-> 9552

(SizeOf/deepSize (vec (range 1024)))
-> 26192

(SizeOf/deepSize (ArrayList. (range 1024)))
-> 24624

(let [a (HashMap. 1024)]
     (dotimes [i 1024] (.put a i i))
     (SizeOf/deepSize a))
-> 82040

(SizeOf/deepSize (into {} (for [i (range 1024)] [i i])))
-> 166008

Unsurprisingly, primitives are great for space efficiency - and actually, Clojure’s vector-of has minimal (~25%) overhead over a raw array, while having a much nicer interface. Clojure hashmaps, though, are enormous - 10x overhead compared to two raw arrays.

This makes me think that SOAC needs a semi-traditional hash table, built on top of primitive vector-ofs rather than the trees it currently uses, or the raw primitive arrays of Trove and fastutil. That would succeed in maintaining persistence while being much smaller.

29 Dec 2012, 14:28

Sparsemas: A sparse matrix implementation that supports iteration & expansion

This Christmas I wrote up an implementation of what amounts to a sparse matrix, with a few additional features that are handy for the areas I work in. It uses essentially the same approach as the Apache Commons sparse matrix implementation (and most others I’ve seen): it handles the (row, column) -> value mapping by transforming the (row, column) into a single index, and then using an efficient primitive-based hashmap to match that index to the value.

The first main difference is that my mapping uses a kind of spiral going out from the origin, that isn’t dependent on the number of rows and columns. Visually, the mapping goes like this:

17 19 21 23 24
10 12 14 15 22
5  7  8  13 20
2  3  6  11 18
0  1  4  9  16

And so on. If you look on the diagonal, you see that it’s the squares of integers minus one. Going “down” from a square element, the index decreases by two, and going to the “left” it also decreases by 2, but starting from the square + 1. The reverse mapping (from index to (row, column)) is also trivial. The advantage to this is that if you add another row or column, you’re not forced to recalculate indexes. Theoretically, this even allows you to use a persistent tree for the map implementation, and so have a persistent implementation of a sparse matrix with structural sharing.
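
Here is a small Java sketch that is consistent with the picture above (0-based row and column, origin at the bottom-left). The method names are illustrative, and whether Sparsemas implements the mapping exactly this way is an assumption on my part:

public final class SpiralIndex {
    // (row, col) -> single index, independent of the matrix dimensions.
    public static long toIndex(int row, int col) {
        if (col >= row) {
            // On the diagonal or to its right: the diagonal element is (col+1)^2 - 1,
            // and each step "down" subtracts 2.
            return (long) col * col + 2L * row;
        } else {
            // To the left of the diagonal: each step "left" subtracts 2,
            // counting down from the full square (row+1)^2.
            return (long) row * row + 2L * col + 1L;
        }
    }

    // The reverse mapping, index -> (row, col).
    public static int[] toRowCol(long index) {
        int n = (int) Math.sqrt((double) index); // fine for a sketch; very large indices
                                                 // would want a floating-point-safe isqrt
        long offset = index - (long) n * n;
        return (offset % 2 == 0)
                ? new int[]{(int) (offset / 2), n}   // even offsets sit in column n
                : new int[]{n, (int) (offset / 2)};  // odd offsets sit in row n
    }
}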

The second difference, which is totally separable from the first, is that each “node” in the matrix has a pointer to the next filled element by row and column. This essentially defines a linked list over filled elements by row and column; it does come at the expense of additional space, but for my purposes it’s necessary to be able to find which elements are filled in any particular row or column. We store the first node by row and column in an additional int -> int mapping (and implicitly those also define a list of which rows & columns have any filled elements). For maximal insert speed, an insert just adds an element to the head of the list, so initially, it’s traversed in reverse insertion order. I do however have functionality to sort the lists, so I can easily calculate overlap in filled rows between columns, or vice versa.

Limitations of this implementation are higher space overhead to store the linked lists over elements, and the ability to only (safely) address rows and columns by (>=0) integers. Currently, it uses an object to store the (rating, next_item_pointer, next_person_pointer) tuple, which adds 8 bytes of object overhead per entry. In a future implementation, it could use the underlying hash algorithm directly to handle indexing in 3 parallel arrays of ratings, item pointers, and person pointers.

So what is this good for, and why did I feel the need to create it in the first place?

If you’re storing a correspondence between users and items, of the form (user, item) -> preference or interaction flag, a sparse matrix seems like a natural representation (besides which many recommendation algorithms explicitly model it as a sparse matrix with a goal of imputing the missing elements, so being able to do matrix operations is handy). However, the number of items and users is unbounded, and if you want an online representation that doesn’t have to be copied into a bigger container each time you add a new maximum item or person ID, you need consistent assignment of (person, item) -> index for insertions and lookups.

You also obviously need a way to figure out what elements are filled, without having to traverse the thing.

And, for simple similarity-based models that rely on consistency of ratings between two people or items, you need to be able to quickly find the items rated by a particular user or vice versa (and if they can be stored sorted for quick overlap calculation so much the better).

The traditional way to do it, which Apache Mahout does with its in-memory representation, is to have a mapping from person ID to two parallel arrays of item IDs and ratings, and the reverse mapping from item ID to two arrays of person IDs and ratings. That has reasonable space efficiency, but a) has to store the data twice to handle mappings in both directions, and b) doesn’t deal that well with insertions (or especially removals) which usually require you to traverse arrays scanning for elements, and copy the array en masse to modify it (or deal with having an “unfilled” magic value, buffering at the end of your arrays, etc.). This implementation, despite the overhead, is still fast, reasonably efficient, and supports better online behavior.

30 Oct 2012, 12:11

DataGotham - Cold Start My Heart

My talk from DataGotham about recommender systems, the new item problem, and novelty, is available here.

01 Sep 2012, 17:11

Distinctifying a sequence in Clojure

One of the things you run into writing a lot of performance-sensitive Clojure is the performance overhead of traversing sequences.

Here are three approaches to returning the distinct elements of a sequence, as a vector:

(defn unique-vec1
  "Straight-up Clojure sequence traversal"
  [s]
  (vec (set s)))

(defn unique-vec2
  "Conj into transient set and vectorize"
  [s]
  (loop [remaining s
         seen (transient #{})]
    (if (empty? remaining) (vec (persistent! seen))
      (recur (next remaining) (conj! seen (first remaining))))))

(defn unique-vec3
  "Use set as distinctifier, but conj into transient vector"
  [s]
  (loop [remaining s
         seen #{}
         out (transient [])]
    (if (empty? remaining) (persistent! out)
      (let [this (first remaining)]
        (if (contains? seen this) 
          (recur (next remaining) seen out)
          (recur (next remaining) (conj seen this) (conj! out this)))))))

The first one is the most “idiomatic” in the sense of conciseness and using what Clojure gives you. You can rely on the fact that you can cast a Clojure Set to a Seq (although a set is not itself a seq), and convert a Seq into a Vector.

Unfortunately if you benchmark it, you’ll find that you spend a lot of time traversing the sequences. There’s no point in generating persistent data structures for purely under-the-hood operations; CPUs are fundamentally machines for mutating data and you should use their capabilities when you’re able to. So, let’s use the idiomatic Clojure way of mutating - transients.

Now we have a second decision to make - we can either use a set as both a distinctifier and an accumulator (unique-vec2), or we can use it purely to check for distinctness, and accumulate the vector elsewhere (unique-vec3). This is a purely benchmark-driven decision, but it turns out that depending on what you’re distinctifying, it could be faster either way.

The tradeoff you’re making here is that unique-vec2 has a tighter inner loop (accessing only one data structure rather than 2) at the expense of a slower final step (you have to persist the set, convert it to a sequence, and traverse it). So, you’re better off using it for shorter sequences, and unique-vec3 for longer or more-distinct ones.

The same pattern applies in raw Java, of course, but the thing I like about Clojure is the ability to use quasi-imperative code with high-level abstractions. You now have a pretty fast, generic algorithm that operates over arbitrary streams, and you could make it faster by using a standard mutable HashSet for checking uniqueness, or possibly using multiple threads and an atom to distinctify in parallel. These modifications require very little additional code and they’ll still be compatible with whatever data you’d like to process.

Addendum: I probably should’ve mentioned Clojure’s built-in “distinct” function. It takes essentially the approach of unique-vec3, except that it uses a persistent set rather than a transient, and builds a lazy sequence rather than a vector. Additionally, (vec (distinct coll)) is actually slower than (vec (set coll)) or the other 2 approaches, presumably due to the additional overhead of laziness.

25 Jul 2012, 12:00

Poisson regression - it's more useful than you think

Everyone knows about logistic regression. It’s the AK47 of predictive models - cheap (to fit), available everywhere, manufacturable yourself with some basic tools, and pretty darn effective at what it does, as long as you don’t ask for the latest and greatest. You can hand it to a barely-literate junior analyst and send them to the front, and they’ll be slaying problems in no time.

But as your positive examples become sparser, and your data gets bigger, logistic regression becomes less of a good fit. Particularly because, depending on your application, you might not actually care about individuals being likely positive or negative (after all, at a certain sparsity, the vast majority are likely to be negative - and one person, in a lot of applications, simply doesn’t matter). Instead, you care about the number of positive examples you can count on in a group of a certain composition. Examples of this would be banner ad response rates, or the number of clicks on an email.

And that’s where Poisson regression comes in. There is a “law of rare events”, where as your positive examples become sparser and your total number of examples increases, a Poisson regression becomes a better and better fit to the “real” binomial data. Add in measurement errors and modeling time, and Poisson regression becomes downright compelling.

Particularly because the coefficients in a Poisson regression have a natural translation into multiplicative effects on a rate, which can float above 1 if necessary, as opposed to a logistic regression, whose parameters translate into multiplicative effects on odds and whose predicted probabilities can never be greater than 1. That works out well if you’re modeling an arbitrage scenario, where you’re relating dollars-of-input to dollars-of-output, and have a relatively small number of very positive examples primarily affecting that ratio. Ideally you’ll end up with a ratio less than 1.0, but you really do have to deal with segments where it is greater, and that’s something logistic regression can’t handle. It’s fairly common in the insurance industry to do this with dollars-of-claim per dollar-of-premium - also known as their loss ratio. That kind of arbitrage-identification translates naturally to things like ad dollars per dollar of revenue, or customer service minutes per sale.
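
As a concrete illustration of that multiplicative structure, here is a minimal Java sketch of scoring one observation under a log link - the names are mine, not the API of any existing library:

// Expected count = exposure * exp(intercept + sum(coef_i * x_i)), so each coefficient
// exponentiates into a multiplicative effect on the rate - which can exceed 1.0.
public final class PoissonScore {
    static double expectedCount(double intercept, double[] coefs, double[] features, double exposure) {
        double linear = intercept;
        for (int i = 0; i < coefs.length; i++) linear += coefs[i] * features[i];
        return exposure * Math.exp(linear);
    }

    public static void main(String[] args) {
        // A single indicator feature with coefficient ln(1.3): turning it on
        // multiplies the expected rate by 1.3.
        double base    = expectedCount(-0.1, new double[]{Math.log(1.3)}, new double[]{0}, 1.0);
        double flagged = expectedCount(-0.1, new double[]{Math.log(1.3)}, new double[]{1}, 1.0);
        System.out.println(flagged / base); // ~1.3
    }
}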

I’ve written up a very straightforward implementation of Poisson regression for Java, because I’m focused on the JVM at the moment and there wasn’t one available on that platform. A number of implementations exist in R, and a few standouts in Python. Mine incorporates the notion of “regularization” - that coefficients are likelier to be zero than the data suggests (or equivalently, that you should have a bit of a bias towards 0 coefficients). That can keep you from seeing large coefficients on sparse segments of the population that randomly have a large number of positive examples (I have some fun horror stories from the insurance industry about regularization via repeatedly forwarded spreadsheets and staff meetings - doing it with math is so much better). If anyone finds it handy, let me know - there’s a lot of tools in the standard statistical toolbox that are woefully underused, and it’s great to see them in wider circulation.

16 Apr 2012, 12:02

Tuning JVM GC for a big, mathy, STMful Clojure app

As part of my “real job”, I’m developing a Clojure-based app that does a significant amount of number crunching. The inner loop continuously ref-sets random portions of a large array of refs (when I say “large”, I mean that I can plausibly fire off a 50gb heap). I had a tough time getting it performant, and it’s an interesting enough story that I thought I’d relate it here.

After the standard futzing with algorithms and data structures, the thing that ended up holding me back was excessive time in GC. I started with the “throughput” collector (hey, I’m doing number crunching, I don’t have real-time requirements, throughput is awesome!). Somewhat surprisingly, I saw worse and worse performance as my app ran, ending in a kind of sawtoothed purgatory of GC. What little information I found about Clojure-specific GC tuning uniformly showed using the CMS / low-latency / concurrent collector as a good choice. Curious, I switched - but that just resulted in slow-rising, slow-falling memory consumption patterns, with the pauses coinciding with pure serial collections (and still, not enough work getting done).

This caused me to go a bit down the rabbit hole of garbage collection lore - for simplicity I’ve condensed this somewhat. The JVM divides its heap into the permanent generation (ignorable, mostly class definitions and such), a few pools in the young generation, and a tenured generation of long-lived data. The young generation is collected via a parallel algorithm that isn’t really that tunable, and the split between young and tenured (and the various young sub-pools) will be auto-tuned if you let it. The real mojo happens in the tenured pool, where you have the choice of a CMS / low-latency / concurrent collector, or a parallel throughput collector, with various options for each. The trigger that sets off a CMS collection is theoretically auto-tuned as well, and there are “ergonomics” for the parallel collector, but performance was crappy enough as things adjusted themselves that I never got the chance to examine the long-run, automated performance. I had to figure out most of this via experimentation and reading lots of GC logs.

What I discovered was that because Clojure data structures are immutable, and I had enough of them with a low enough mutation rate, the majority were advancing to the JVM’s tenured pool. When a ref was set to a new object, the old one lingered in the tenured pool until it was collected, and the new one eventually migrated there too. This is what caused the slow rise in memory usage using the stock CMS collector - objects weren’t being collected until a threshold of the tenured generation was filled, and by the time the threshold was hit, it was close enough to filling the tenured pool entirely that it always blew through and caused a full, serial, stop-the-world collection. Keeping these objects in the “young” pool was inefficient, since the majority at any given time were still live and being copied back and forth (and the ones that happened to survive were just thrown away later). I needed to set the collector to kick in earlier and hoover up the garbage accreting in the tenured pool.

However, even with the CMS collector set to run more or less continuously, I was still allocating objects faster than they could be collected, because the young-generation collector was promoting almost all of my objects. The solution ended up being to keep object generation from outpacing collection, so the tenured generation never filled and triggered a slow serial collection. There were a few interlocking pieces to this. First, I set the number of CMS-collector threads higher than usual (4 in my case) to mark the garbage faster during a collection. I hard-coded the threshold at which the CMS collector kicks in, and actually set it relatively high, in order to have control over when GC triggered. I figured out the maximum amount of time the app could mutate for before causing a serial GC or triggering an uncontrolled CMS GC cycle, and paused at the end of that period to manually trigger a CMS run and block until it was complete, by calling System.gc() and telling the JVM to associate that with the CMS collector. For long-term operation when I didn’t care about maximum throughput, just stability, I simply ran the object-generating portions of my app on fewer threads (I could have just as easily called Thread.sleep() intermittently), and set the CMS threshold low enough that it was running continuously.

Calling System.gc() is nearly always a smell, but in my case I found it was the only way to ensure I was operating at max throughput without causing excessive pausing. My allocation pattern simply isn’t something the JVM deals well with - had I been mutating objects directly rather than generating new ones, altering the objects in a way that leveraged Clojure’s structural sharing, or operating at a higher CPU-to-heap ratio, I could likely have avoided calling it.

My JVM options (some of which may be superfluous - I wasn’t able to establish whether CMSIncrementalDutyCycle or CMSIncrementalSafetyFactor were necessary without the incremental-CMS option turned on) ended up being:

-d64 -da -dsa \
-XX:+PrintGCDetails \
-Xloggc:gc.log \
-Xms50g -Xmx50g \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ParallelCMSThreads=4 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:+CMSParallelRemarkEnabled \
-XX:-CMSIncrementalPacing \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSIncrementalDutyCycle=100 \
-XX:CMSInitiatingOccupancyFraction=90 \
-XX:CMSIncrementalSafetyFactor=10 \
-XX:+CMSClassUnloadingEnabled -XX:+DoEscapeAnalysis

09 Apr 2012, 17:16

Clojure, interning, and immutability

“Interning” a value means keeping one underlying representation, regardless of where an equal-valued object appears. Strings, for instance, are interned in Java - storing a million “copies” of “hello world” causes a million references to a single underlying string.

You can intern anything - for instance, Java also interns boxed Integers between -128 and 127 (which is why, after

Integer x = 10;
Integer y = 10;

x == y is true (x and y point at the same underlying “10”), but not with x and y set to 1000).

Google Guava has a good implementation of interning that can apply to any object, not just strings. But, a whole new library is overkill for quickly deduping a few collections - we’ll implement something that gets us most of the way there in raw Clojure.

One of the caveats to interning is that because the different “copies” of the root object share the same underlying data, that data must be immutable. Otherwise, Bad Things happen when you mutate what you think is your own copy of the interned data - nothing tells you that the change will show up in every other “copy” of that value.

Clojure handles this for you, because all of its native data structures are immutable. In fact, it’s better than interning, because of “structural sharing” - under the hood, a vector, and that same vector with a bunch of data conj’d onto it, share the same data for their common part.

But that only works if you start with a single data structure and “mutate” it with Clojure functions. If you have, for instance, a bunch of generated, duplicated values in a hashmap, they will each occupy additional space. Fortunately, it’s trivial to de-duplicate such a structure, and there’s no downside - Clojure’s immutability means that you can still sling around modified versions of the map, with no side effects.

I’ve thrown up a gist that demonstrates this in action.

There’s really not much “magic” beyond what Clojure provides you - we just throw all the values in a set, which (since sets compare their elements by value) gets rid of all but one copy of each distinct value. Then, we return a new map, with the keys the same, and the new vals looked up from the set. Voila! We now occupy 10% of the memory we used to.
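The gist itself is Clojure, but the trick isn’t Clojure-specific. A rough Python analogue of the same idea (my sketch, not the gist, and assuming the map’s values are hashable/immutable things like strings or tuples) looks like this:

def dedupe_values(m):
    # Keep one canonical object per distinct value; every key whose value
    # compares equal ends up pointing at that single shared copy.
    pool = {}
    return {k: pool.setdefault(v, v) for k, v in m.items()}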

25 Mar 2012, 17:23

Never delete, preferably never update

Web developers tend not to be database people (or even necessarily data people). They access data mostly through an ORM, the whole point of which is to abstract away the “relational” bits (this is why key-value and document-based NoSQL solutions are so popular - everything is essentially durable hashmaps). They also tend to access the data in isolation from the raw DB layer (eg, looking up a few rows at a time, looping instead of doing joins on sets, and using programming language facilities instead of DB ones). If you are in fact a data/base person and work with enough systems designed by application devs, you will eventually look at a table and your first instinct will be head-grabbing, “who-designed-this” incredulity (I rued the day I saw my first ActiveRecord polymorphic association in action).

Resist this temptation! They have a job to do, and they are doing it - at least initially, they’re writing the data layer purely to support the functionality of the app, and it likely fits that role pretty well. They’re not going to ask you before they design it, and probably shouldn’t, because you don’t necessarily know yet what sort of analysis it’s going to need to support, and they don’t necessarily know what future frontend stuff it’ll have to be compatible with (and as much as possible, you want them to have the flexibility to make Cool Stuff as quickly as they can). The most you can do is lay down some rules of thumb that let you attach functionality later as painlessly as possible. Fortunately, there’s really only a couple.

  • Never delete data.
  • As much as possible, never update data.

The second is really just a clarification of the first, since when you update a record, you’re deleting the previous state.

The point of these is to be able to reconstruct all the past states of the table if you have the whole thing. Say you’re Zynga or whoever, and you want to track how many “credits” each user has in whatever game. The first instinct is to have a “balance” table, with a user ID and their current balance, and increment / decrement it as they buy / spend credits. This works fine, until you start asking questions like “How can I get customers to buy more credits”, and “Do customers buy credits in bulk, or in many small transactions?”. You could hook in your payments system and your in-game inventory system to see how people were accumulating and spending credits, but you have to sum over the entire set in order to figure out the state at any point in time, and hope that it matches with your balance table. Even if you store the state as of a particular time on separate lines, in order to figure out what led to that state, you have to link each row with its predecessor, which in strict SQL/set-theory terms requires a Cartesian join. It would be much easier to record both the delta and the resulting state on a different line.

This is called “bookkeeping” and the Italians invented it six hundred or so years ago.

The key thing here is that you’re recording both the action and the result, so it makes it incredibly easy to examine things from either perspective - what actions are most common, and what were the most common states. The downside is that you have to store all this additional data.
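As a toy illustration (an invented schema, not anything from a real app), an append-only credits ledger that records both pieces might look like:

ledger = []  # append-only: one row per event, never updated or deleted

def record(user_id, delta):
    # Store both the action (delta) and the result (balance_after).
    prior = next((row["balance_after"] for row in reversed(ledger)
                  if row["user_id"] == user_id), 0)
    ledger.append({"user_id": user_id,
                   "delta": delta,
                   "balance_after": prior + delta})

record("alice", 100)   # bought 100 credits
record("alice", -30)   # spent 30
# The current balance is just each user's latest row; the purchase pattern
# is the sequence of deltas - no linking rows to their predecessors required.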

But where you store it is a configurable parameter. If your application only cares about the current state, then only store the current state - in the application’s database. Dump the rest to Hive, a series of daily Archive tables in MySQL, or wherever, and let your analysts worry about crawling through them. If it’s too big to store it all, keep a buffer and discard everything beyond it - just as long as you have the ability to increase the buffer if you need to.

14 Oct 2011, 12:00

Another way of looking at GLMs, or, how the other half lives

“Actuarial science” is a weird thing. It developed in parallel with, and before, “statistics,” and is firmly rooted in practical considerations rather than analytic proofs of statistical properties (although it has gotten more rigorous, and it always involved a fair amount of math). If you picture the primal actuary as a Victorian guy circa 1850 trying to figure out how many of his underwritten ships will be lost to angry whales in the coming months, you would not be far off.

That said, they have come up with some darn good ideas, and beat the statisticians to the punch on a few notable occasions. For instance, actuaries invented the generalized linear model, twelve years or so before the statisticians got to it.

Of course, they didn’t call it that - they called it “the minimum bias procedure.” And they didn’t make the connection between it and GLMs, or fill in all the gaps in the spec, until 40 years later (Stephen Mildenhall has the seminal paper linking the two). But still, it provides a good way of conceptualizing what you’re really “doing” when you fit a certain kind of GLM, and which kinds of GLM are natural fits for problems with certain assumptions.

The gist of it is that your data can be rolled up into a hypercube / N-D matrix (this only works on purely categorical datasets, unfortunately). Each dimension of your data (ie, variable) is a dimension of your hypercube, and the number of possible values for that variable is the size of that dimension. So for instance, a model with variables Sex and State would be a 2 x 50 rectangle. Each cell has the dependent variable (usually loss ratio in insurance) for that combination of groups - for instance, males in Arkansas.

The goal is to attach coefficients to each row / column, so that when combined, you have equal results across all dimensions (ie, equal results between males and females, and equal results in all 50 states). What constitutes “equal,” and how you combine the numbers across groups, determines your model’s error distribution and link function - the Mildenhall paper goes into much greater detail.

For instance, a straightforward linear regression consists of getting row-wise and column-wise sums, and then attaching coefficients to each row & column, so that when multiplied by the coefficients and re-added, the sums are equal between each row and between each column. For a GLM with a Poisson error function and a log link, it is the same, except that instead of summing, you take the product of each row / column.
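Here’s a small numpy sketch of the multiplicative flavor of that balancing iteration on a two-way table - my own rendering with invented argument names, not code from the Mildenhall paper, so treat it as illustrative:

import numpy as np

def minimum_bias_multiplicative(cell_loss, cell_weight, iters=100):
    # cell_loss and cell_weight are (n_rows, n_cols) arrays, e.g. losses
    # and exposures by Sex x State. Alternately re-solve the row factors
    # holding the column factors fixed, and vice versa, until the
    # factor-weighted marginals balance.
    n_rows, n_cols = cell_loss.shape
    row = np.ones(n_rows)
    col = np.ones(n_cols)
    for _ in range(iters):
        row = cell_loss.sum(axis=1) / (cell_weight * col[None, :]).sum(axis=1)
        col = cell_loss.sum(axis=0) / (cell_weight * row[:, None]).sum(axis=0)
    return row, col  # the fitted value for a cell is weight * row_i * col_j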

The nice thing is that it provides an intuitive explanation for what you’re really “doing” when you fit a model, something that analogizes to the operational level. If you’re modeling loss ratios and you have a multiplicative pricing scheme (ie, you have a base price, and everyone gets charged some factor of that), after running your model, you should show an equal loss ratio in each subgroup if you go with the results.

You can easily learn a lot of statistics without running across these analogies and equivalencies, but it really does help when you have an intuitive underlying system to work within. It also helps your ability to explain what you’re doing to people with a bit of math but not a whole lot of domain knowledge, and to draw on other disciplines for inspiration. It always pays to check up with the actuaries / statistical physicists / machine learning guys and see what they’re working on at the moment.

22 Jul 2011, 11:37

So what's the point of parametric CLV?

Most businesses are young (and all startups, definitionally). They need to know how they’re doing, and how they’re going to do, without waiting for an entire cohort to age out (and chances are the new cohorts don’t behave like the old ones). So if you don’t have a long enough tail to do the purely nonparametric, empirical estimation of long-term CLV, you’re going to have to make some assumptions, and if you’re going to do that, they should be good assumptions that provide you with useable tools and verifiable (=falsifiable) outcomes. Hence, get a good model and fit it with the data you have.

On the other hand, the same thing that gives you so little data probably gives you a short time horizon. If you’re still tracking customers’ performance in five years, presumably you survived for that long (and what you’ve been doing is working). If you have a year’s worth of data and are trying to make it through the next six months, assuming the new customers will perform like the old did is a decent place to start.

So there’s a tradeoff, just like everything else. Fortunately, testing alternatives so you can find the best tradeoff is practically what computers were designed for. Check both! If you chart each monthly cohort’s purchases, and they look like they’re doing the same thing, thinking that incoming people will follow their pattern isn’t crazy. But, if you’re evaluating different sources, or trying to make longer-term predictions about behavior, one of the Fader / Hardie models, or something more exotic (maybe with a more exotic fitting method), might be a good way to go.

21 Jul 2011, 11:38

Customer lifetime value: Beta-geometric/beta-binomial - discrete time businesses

There’s a well-known result that says if the sum of a number of independent random variables is Poisson distributed, then each of those components must be as well. What if that seems unlikely to be the case? If our customers have discrete opportunities to buy, like say a cruise that sets sail once a month, their behavior shouldn’t form a Poisson process on an individual or group level.

Fortunately, there is an alternative - we take the gamma-Poisson / beta-geometric model we talked about last, and swap out the Poisson-based purchase probabilities with something more appropriate. We say that each discrete period there is a fixed chance that a customer will buy, giving their purchase counts a binomial distribution. Across customers, we say that these purchase probabilities form a beta distribution (since the beta distribution is the binomial’s conjugate prior). And it’s just that easy! Of course, we still have to go through the math to figure out how to fit the model and get predictions out of it - thankfully, someone has already gone through the trouble and made their results available. Furthermore, it looks like the math is actually pretty easy to apply. Success!

20 Jul 2011, 11:40

Customer lifetime value: Gamma-poisson/beta-geometric: the 'easy' way to CLV

In the last post, we looked at a model with totally reasonable assumptions, and totally unreasonable behavior as an actual tool. Fortunately, we can be both reasonable and useful - we just have to make a few tweaks.

The original model assumed that purchases follow a Poisson process, where the rates across customers form a gamma distribution. We’ll hang onto that assumption. We also claimed customer lifetimes follow an exponential distribution, whose rates across customers in turn form a gamma distribution. This is where we switch it up.

Essentially, instead of drawing each customer’s lifetime from a continuous exponential distribution, we say that they have a constant chance of dropping out each day. Since the chance of dropping out each day is constant, customers’ total lifetimes follow a geometric distribution (as opposed to an exponential distribution the first time around). We also assume that if you look at everyone’s daily chance of dropping out, overall those chances form a beta distribution (the conjugate prior of the geometric distribution, naturally).

Now if you’re a stats geek, here’s where you start to get suspicious, because the geometric distribution is essentially the discrete version of the exponential distribution, and discrete gets continuous if you slice it fine enough. We didn’t really change things up that much, so what did we actually accomplish?

Well, the math gets dramatically easier. If you look at the expression in Fader and Hardie’s paper for this model, versus the one linked to in the previous post, there’s no comparison - the gamma-Poisson / beta-geometric is much easier to fit. No crazy Gaussian hypergeometric functions being slung around, just good old fashioned gammas and betas (to be fair, you do need one to calculate the expected future purchases, but not to fit the model).
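To make “just gammas and betas” concrete, here’s roughly how the per-customer log-likelihood for this gamma-Poisson / beta-geometric (BG/NBD) model gets coded up in scipy terms - my own transcription of the usual published expression, with x = repeat purchases, t_x = time of last purchase, and T = length of the observation window, so check it against the paper before trusting it with real decisions:

import numpy as np
from scipy.special import gammaln, betaln

def bgnbd_log_likelihood(params, x, t_x, T):
    # x, t_x, T are equal-length numpy arrays, one entry per customer.
    r, alpha, a, b = params
    # Shared gamma-Poisson piece of the likelihood.
    ln_A = gammaln(r + x) - gammaln(r) + r * np.log(alpha)
    # Either the customer was still active at T...
    ln_B1 = betaln(a, b + x) - betaln(a, b) - (r + x) * np.log(alpha + T)
    # ...or they dropped out right after their last purchase (repeat buyers only).
    safe_b = np.where(x > 0, b + x - 1, 1.0)
    ln_B2 = np.where(x > 0,
                     betaln(a + 1, safe_b) - betaln(a, b)
                     - (r + x) * np.log(alpha + t_x),
                     -np.inf)
    return np.sum(ln_A + np.logaddexp(ln_B1, ln_B2))

Hand the negative of that to your optimizer of choice (with the usual care to keep all four parameters positive) and you have a fit.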

Since it’s an easier-to-fit version of the first model, the same caveats apply - if your purchasing behavior doesn’t seem like a Poisson process, it’s not going to be any good. But, if it fits, you should be able to use it after some data-munging. In the next post, I’ll look at an alternative for when your purchases (or purchasing opportunities) are too regular to be described with a Poisson process.

19 Jul 2011, 11:42

Customer lifetime value: Gamma-poisson/gamma-exponential: Original recipe, extra tricky

An oldie but a goodie - this was the first model described in this family, and also the first one with explicit expressions for likelihood, probabilities of being active, expected future purchases for a population or a particular customer, and all the other things you’d want to know about the purchasing process.

The assumptions behind this particular model are that:

  • Active customers purchase in a Poisson process, at a certain constant rate per customer
  • Across customers, these rates form a gamma distribution
  • Customer lifetimes are exponentially distributed, with a particular dropout rate. You don’t actually get to observe when a customer becomes inactive (but you can infer it after the fact)
  • These dropout rates form a gamma distribution across customers

When I first saw these assumptions spelled out, my reaction was “this is totally reasonable”. After all, things like website visits and phone center calls follow a Poisson distribution, so it’s not crazy to think that purchase rates behave the same way. And exponentially distributed lifetimes make sense as well - a small cohort of very loyal customers, slightly more who buy when it’s a good deal, and exponentially more temporary ones just trying out your service. The gamma distribution forms the Bayesian conjugate prior for both the Poisson and exponential distributions, so it is logical to use it to express heterogeneity in those distributions’ parameters.

Unfortunately, when you whip out your blackboard (or get to the “good part” of the paper), it turns out that the expression you get for this model’s likelihood (the expression you maximize to fit the model and get the parameters that quantify these processes) is a crazy, crazy piece of math. You can tell it’s going to be a pain when the Gaussian hypergeometric function (the “2F1” notation in the paper) shows up. For those of us who aren’t actual mathematicians, that particular function represents an infinite series that requires some heavy math and numeric programming expertise to even evaluate (SciPy and the GNU Scientific Library both have implementations, thankfully). Not only that, but it has a restricted domain, both in its actual parameters, and in the relationship between them. Feeding it raw into your average numerical optimizer is asking for trouble, and besides that, it’s slow. I speak from experience when I say it is not a wonderful feeling when your data sticks you between numeric underflow, numeric overflow, domain errors, and function evaluation times so long you’ll never actually know if it converged, depending on what initial estimates you start with.
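If you want to poke at the function itself, SciPy’s implementation is a one-liner to call (keeping its arguments inside a sensible domain is, as noted, your problem):

from scipy.special import hyp2f1

# Gauss's hypergeometric function 2F1(a, b; c; z) - the "2F1" that shows up
# in the Pareto/NBD likelihood.
print(hyp2f1(1.0, 2.0, 3.0, 0.5))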

That said, on certain datasets, it actually does work pretty well. The CDNow data mentioned in the paper is available from Bruce Hardie’s website, and the model can indeed be fit to that data with a minimum of fuss (and does very well at predicting those customers’ ongoing performance). And hey, if the model can’t be fit to your data, there’s a pretty strong implication that the model doesn’t describe it well in the first place (or at least that you’ll never know).

So what makes this form of model likely to work? Well, if your customers can purchase at any time, and seem to do so fairly frequently, but not super-regularly or on a schedule, it might be a good fit. Movie tickets, CD sales, perhaps Amazon-style general retail, all seem like they’d be worth trying. Not so much for things like art sales (well, maybe for a certain segment of the population), milk, hockey tickets, or airplane tickets. Also, due to a couple of roguish exponential expressions in the likelihood function, scaling your data from days to weeks or months can avoid a lot of pain.

But there’s a good chance that it will purely choke when it tries to swallow your particular purchase patterns, and what do you do then? In the next post, I’ll describe a variant of this model that changes one key assumption, and becomes much, much easier to fit as a result.

18 Jul 2011, 11:43

Customer lifetime value with Pareto / NBD models - or, Fader and Hardie, wonder twins

“Customer lifetime value” is a notion of how much a customer is worth to a business, over the whole life of the customer relationship. It’s quite easy to calculate for a subscription-based business, like say a magazine - advertising and subscription revenue per magazine, minus the cost of the physical product, times however long a customer’s average total subscription is, and discounted into current dollars to reflect the fact that you don’t get it all up front (and that the numbers themselves are a bit uncertain).

Things get trickier when your business model has uncertain costs associated with each customer (like in the insurance business), or more commonly, when you have uncertain revenue attached to them. Sure, you can shell out for advertising or promotions and get customers to sign up at your website, but how much are those new users going to be spending six months from now?

There are a multitude of approaches to answering these questions, but for now I’m going to talk about a few models in the “Pareto / NBD [negative binomial distribution]” family. In general, this family of models combines a probabilistic purchasing process for active customers, with a probabilistic dropout process that causes active customers to become inactive (that is, their ongoing purchase probability collapses to 0). By fiddling with the exact forms that these purchase and dropout patterns take, we can get a series of models that describe various assumptions about real-life behavior.

An interesting feature of these models is that they’re purely frequency based. That is, they don’t make any assumptions or claims about the sizes of purchases (that comprises a whole other set of models).

The other interesting thing is that they require only a small amount of information about each customer, not their entire purchase history. In fact, you can fit these models with just 3 pieces of information: how many times a customer has purchased, when was the last time they purchased, and how long they’ve been your customer.
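In practice that means collapsing the raw transaction log down to one frequency / recency / age row per customer. A hypothetical pandas sketch (the column names are invented) might look like:

import pandas as pd

def rfm_summary(transactions, observation_end):
    # transactions: DataFrame with customer_id and order_date columns;
    # observation_end: a pd.Timestamp marking the end of the data.
    g = transactions.groupby("customer_id")["order_date"]
    first, last, count = g.min(), g.max(), g.count()
    return pd.DataFrame({
        "frequency": count - 1,                   # repeat purchases only
        "recency": (last - first).dt.days,        # when they last purchased
        "T": (observation_end - first).dt.days,   # how long they've been a customer
    })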

Over the next few posts, I’ll go over a few main variants in this family and some of the peculiarities associated with each, as well as pointing you at the papers that describe the nitty-gritty implementation details. I’ll be focusing on the work of Fader and Hardie since they’ve apparently made it their mission to ferret out useful specifications for these. Stay tuned for more.

16 Jul 2011, 12:00

Python multiprocessing with instance methods

Python is a great language, but it’s lacking a killer neato native parallelism construct a la Scala’s actors, Go’s goroutines, or Clojure’s various interfaces to its STM. It does have a “multiprocessing” module, which gets you part of the way there.

Unfortunately, after using it for a while, it starts to show some rough edges in the language implementation. Under the hood, multiprocessing forks the actual process and shoots over the current state as a collection of pickled objects. So, anything you want to use across your parallel processes has to be pickleable.

This is where the rough edges start to show, because instance methods are not (by default) pickleable. Trying to invoke a method that itself relies on multiprocessing, or handing a process pool a bunch of instance-method calls to perform in parallel, results in a pickling error.

However, there is a workaround. The pickle module lets you register custom pickling and unpickling functions, so you can tell it how to serialize (and later reconstruct) an otherwise “unpickleable” instance method. Example code is here.
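The linked example isn’t reproduced here, but the usual recipe from this era of Python registers a reducer for method objects along these lines (newer Pythons pickle bound methods on their own, so this is strictly a Python 2 workaround):

import copy_reg
import types

def _pickle_method(method):
    # Reduce a bound method to (name, instance, class) so the child process
    # can rebuild it after unpickling the instance itself.
    return _unpickle_method, (method.im_func.__name__, method.im_self, method.im_class)

def _unpickle_method(name, obj, cls):
    # Walk the MRO (new-style classes assumed) to find the original function,
    # then re-bind it to the unpickled instance.
    for klass in cls.__mro__:
        if name in klass.__dict__:
            return klass.__dict__[name].__get__(obj, cls)
    raise AttributeError(name)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

With that registered before you spin up your pool, something like multiprocessing.Pool().map(some_object.some_method, work_items) stops choking on the pickling step.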

I’ve found this to be very handy, particularly when writing simulation code where the object-oriented paradigm makes a lot of sense. It doesn’t quite make everything as easy as a language with baked-in concurrency primitives, but given Python’s vast collection of libraries, C and OpenCL / CUDA integration, and friendly syntax, I’m happy to make the tradeoff.