03 May 2014, 09:46

Mapping distinct objects to a dense range of integers

Let’s say you’re trying to turn distinct strings (or some other objects) into integers. This pops up all the time - for instance, if you want to analyze a social network, it’s faster if you can access your adjacency list via integer offsets into an array rather than looking up name strings in a hash table.

The desired properties are that:

  • You get a distinct integer for each input
  • The output is as dense (sequential, with no gaps) as possible, but 100% density is unnecessary
  • You can construct the output online (ie, you never need to re-number)
  • It supports get-or-insert, get-or-fail, and remove operations
  • It’s wicked fast

The straightforward strategy is a read-check-generate pattern, secured with a ReadWriteLock:

final ReadWriteLock lock = new ReentrantReadWriteLock();
final Map<String, Integer> map = new HashMap<String, Integer>();
for(final String s : data) {
    lock.readLock().lock();
    try {
        if(map.containsKey(s)) continue;
    } finally {
        lock.readLock().unlock();
    }

    lock.writeLock().lock();
    try {
        //Re-check; another writer may have gotten here first
        if(map.containsKey(s)) continue;
        map.put(s, map.size());
    } finally {
        lock.writeLock().unlock();
    }
}
This will produce an appropriately dense, distinct mapping. Unfortunately, it fails the last couple of criteria - there’s no way to support removal, and that single write lock adds a fair amount of contention that slows things down. We can fix the first one:

public class StringMapper {
    final ReadWriteLock lock = new ReentrantReadWriteLock();
    final Map<String, Integer> map = new HashMap<String, Integer>();
    int lastIdx = 0;
    //Any collection would work here; this has the advantage of concentrating
    //"holes" at the end, and being unbounded
    final LinkedList<Integer> freelist = new LinkedList<Integer>(); 

    public void remove(String s){
        lock.writeLock().lock();
        try {
            Integer current = map.remove(s);
            if(current == null) return;
            freelist.add(current);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer get(String s){
        lock.readLock().lock();
        try {
            return map.get(s);
        } finally {
            lock.readLock().unlock();
        }
    }

    public Integer add(String s){
        Integer out = get(s);
        if(out != null) return out;

        lock.writeLock().lock();
        try {
            out = map.get(s);
            if(out != null) return out;
            out = !freelist.isEmpty() ? freelist.poll() : lastIdx++;
            map.put(s, out);
            return out;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
A little more ceremony here - just like in ye olde malloc, we have to maintain a freelist, and we have to track the last insertion point not on the freelist so we can make fresh allocations.

That write lock is still really annoying, though. We have concurrent maps and atomic integers built into the standard library; it seems like we should be able to figure this out without any of our own locking.

public class StringMapper{
    final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();
    final AtomicInteger lastIdx = new AtomicInteger(0);
    final Queue<Integer> freelist = new ConcurrentLinkedQueue<Integer>();

    public void remove(String s){
        Integer current = map.remove(s);
        if (current != null) freelist.add(current);
    }

    public Integer get(String s){
        return map.get(s);
    }

    public Integer add(String s){
        //If we're doing mostly fresh insertions, the initial check can be skipped
        Integer out = map.get(s);
        if(out != null) return out;
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        out = map.putIfAbsent(s, in);
        if(out != null) {
            //Another thread beat us to it; recycle the index we reserved
            freelist.add(in);
            return out;
        } else return in;
    }
}
The remove and get methods are straightforward; the add method is where the magic happens. On each attempted insert, we first check the map. Since we re-do the check later, we can skip this initial check if we expect to do mainly insertions (because we have a mostly distinct dataset as input). If we actually intend on inserting, we first poll from the freelist, and if it is empty, get a fresh integer from our counter variable. If we successfully insert this value into the map, we return it; otherwise another thread won the race, so the value goes back into the freelist and we return the winner’s.

This isn’t the best mechanism if you really need 100% density (but if you really need 100% density, you can’t support removals either), because there is a race condition between getting a new value and the putIfAbsent call. In between, another thread may have gotten a later value and added it to the map, generating a “hole” at your index and forcing you to add it to the freelist. However, the total number of “holes” should be roughly bounded by the number of concurrent threads you’re running. If you have, say, 16 threads simultaneously inserting the same key, in the worst case you end up with an inserted value of, say, 15, and 0-14 in the freelist. Subsequent iterations check the freelist first though, so even if we continue that worst case scenario, we’ll be churning the freelist (repeatedly draining it, replacing the same elements, and adding an integer derived from the counter instead) but ending up with only 15 total gaps in our output.
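For concreteness, here’s a condensed, single-threaded sketch of that final version (the class name DenseMapper and the sample keys are my own), showing both the dense allocation and the freelist recycling after a removal:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DenseMapper {
    final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();
    final AtomicInteger lastIdx = new AtomicInteger(0);
    final Queue<Integer> freelist = new ConcurrentLinkedQueue<Integer>();

    public void remove(String s) {
        Integer current = map.remove(s);
        if (current != null) freelist.add(current);
    }

    public Integer get(String s) {
        return map.get(s);
    }

    public Integer add(String s) {
        Integer out = map.get(s);
        if (out != null) return out;
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        out = map.putIfAbsent(s, in);
        if (out != null) {
            //Lost the race: recycle the index we reserved
            freelist.add(in);
            return out;
        }
        return in;
    }

    public static void main(String[] args) {
        DenseMapper m = new DenseMapper();
        System.out.println(m.add("alice"));  // 0
        System.out.println(m.add("bob"));    // 1
        System.out.println(m.add("alice"));  // still 0
        m.remove("alice");                   // 0 goes onto the freelist
        System.out.println(m.add("carol"));  // reuses 0
    }
}
```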

27 Apr 2014, 17:39

Book review: Red-Blooded Risk by Aaron Brown

TLDR: This book is recommended for anyone who uses math to make money, whether in finance, or (in my case) the areas of the tech sector that rely on modeling customer behavior to make a buck.

This is not a how-to manual, or even a book that tries to make a monolithic argument (although if you take it seriously, one seems to emerge). It’s more of a conversation with Aaron Brown, a prominent (I’m told) risk-manager, old-school quant, and all-around interesting guy in the financial field. It’s easy for these kinds of conversational “what I’ve learned, what I think” books to come out sideways - a barely-edited transcript of an “author” and their ghostwriter (the plague of political memoirs), a platform for the airing of grievances, a journey into the depths of self-ego-stroking, or a collection of platitudes. Fortunately, Brown is better than this.

The book starts to heat up with an interesting interpretation of the Dutch “tulip mania”, which he contends was not a classical mania at all, but a somewhat rational investment in a commodity with characteristics making it a good money substitute. Tulip bulbs (the commodity actually invested in, not tulips per se) are portable, somewhat durable, have predictable inflation rates via an annuity in the form of more bulbs, are impossible to counterfeit or debase, and have an aesthetically pleasing, branded output. It makes at least as much sense to use them for money as it does wampum, and possibly even more than gold or silver, especially when you have a massive inflationary influx of bullion from the New World and a government that debases coins. Their exchange rate for classical coin is somewhat explainable by purely economic and legal factors.

After some digression into the nature of “money” per se, this is used as an entry to a discussion of “money” vs. options / futures / derivatives. In Brown’s formulation, the functional purpose of commodity futures, like an agreement to buy or sell a quantity of wheat for a certain price at a certain time, is not a way to “lock in a price” in the sense of hedging price risk. A miller typically is not “long wheat” in the futures market to guard against price spikes - a price spike impacts him in unpredictable ways (for instance, it may be due to either an increase in demand, or a fall in supply, which would have opposite effects on his business). Instead he contracts for purchase on a more flexible retail basis (not in the futures market), and is short wheat in the futures market (using the proceeds, in fact, to buy the actual wheat he is processing). These offsetting transactions, one long, one short, have the effect of borrowing wheat directly, without the necessary use of money (the contracts themselves can be used as collateral for each other since they nearly offset). When he has sold his flour, he buys out his short with the proceeds and repeats the process. Historically, money itself was a scarce and volatile commodity on the frontier, and eliminating the use of money eliminated a substantial source of risk. Instead of an interest rate linked to public debt levels, bank stability, foreign currency flows, etc., one has an interest rate more closely linked to the inherent properties of the markets actually transacted in.

As a digression of my own, it is plainly inaccurate to say that “continually borrowing with no intention of ever paying anyone back is a totally modern development”. If the miller is doing his job right, his debt will be greater each year, and he may very well be borrowing from the same person each time - if it made sense then, why not now? The question is whether he is doing something productive with the proceeds, and what the ratio of his debt to the value of his business is (in fact, if he can borrow more easily and does so, this in and of itself causes his business to increase in value). It gets dramatically more complicated, and the analogy rather breaks down, if the miller’s business is incredibly difficult to value accurately, and he owes most of the debt to his own subsidiaries, heirs, tenants, people legally obliged to hold it & never redeem… In any case, if one must analogize, it’s a far more suitable analogy than grotesque “country as household” rhetoric.

The final, and most generalizable, part of the book focuses on the notion of risk and probabilities itself, how it relates to models, and how the impact of these risks (like the unfortunate “going bankrupt” thing) manifests and can be controlled in the real world. The Kelly criterion is introduced as a way to think about investment strategy, and the warring camps of frequentists and Bayesians are reconciled into a neat package by explicitly considering what a machine learning practitioner would call a loss function and Brown treats as a choice of numeraire (this synthesizes nicely with the question of borrowing bushels of wheat vs. cash, and it is only poor editing that leaves this connection to the discovery of the reader). Dollars are an easy choice that usually has a neat quasi-Bayesian interpretation - you may consider it in terms of conversion rates or eyeballs, but be careful when the relationship between those and the almighty dollar breaks down. Dollars aren’t always appropriate either, especially when there is no free market setting prices for the underlying events - if you’re trying to build an accurate speech-to-text engine, it’s foolish to try to put a dollar price on each error.

When models break down, Brown has an engaging explanation of the concepts of value-at-risk and risk management in general. Not being an expert in the field, I find it difficult to judge his approaches to risk, and many of them seem inapplicable to my line of work. The technology sector doesn’t have “traders” to manage at quite the level he describes, but the notions of rigorously evaluating performance, taking appropriate levels of risk, and planning for what happens when you fail are universal.

Ultimately, reading this book has convinced me that there is a massive mismatch in technical sophistication between models in the financial and technology sectors. The high-end predictive models being developed by the technology sector, for image processing, natural language processing, collaborative filtering, click modeling, fraud detection, etc., cover much more ground and seem vastly more sophisticated than those of the financial sector. They incorporate more data, make higher-dimensional predictions, and generally use heavier machinery. But the superstructure, the way models are actually used, thought of, and evaluated on a meta level, lags badly behind. Most of this is probably due to the decision time horizon - it would be a different story if the financial sector didn’t require sub-millisecond latencies for their predictions, or if a slight increase in face recognition accuracy were worth the billions of dollars a consistent edge in the financial markets is worth. It may be, with time, that we will see the financialization of the technology sector, securitizing and selling derivatives on click rates or ad impressions in the same way we securitize timber reserves or the profits from Facebook in toto. Already the startup sector leads the way - the only way to understand something like Instagram is as a purchase, not of a revenue stream per se, but of a nevertheless valuable commodity, financed by transforming monetary capital into users, engagement, a software platform, or whatever you wish to call their end result.

The unfortunate aspect of this book, which is not cleanly separable from the thing that makes it interesting, is that it’s clearly a very personal work. The syncretic aspects sometimes diverge into tangential areas, some of the main ideas and interesting connections are scattered, and it could generally use a better editing job. Fortunately, no one will actually force you to read the whole thing - unless they intrigue you, skip the comics, the social history of Wall Street, and the disclaimers, and enjoy a very fresh look at the interface between predictive models, risk, and decision-making.

13 Apr 2014, 11:21

A/B tests and the Kelly criterion

When you’re testing a sequence of changes in an A/B test, the question naturally arises, how large of a group should I test on? Normal practice is to wing it, usually based on some function of the number of fingers you have. This is suboptimal; a 1% test group for a Facebook-sized organization, or in the context of millions of ad impressions, is likely to be incredibly overpowered.

You can deal more rigorously with the issue by recognizing that an A/B test is a wager: we take, say, 1% of our income stream for the time period, and we spin the wheel, seeing if the payoff is negative or not.

There is a “multi-armed bandit” problem that makes this analogy explicit. However, it’s dealing with a slightly different formulation: given a number of possible “plays”, which sequence do we test & how many spins do we give each one? The vanilla posing of the problem doesn’t give a direct answer to the question of how much money we should be dumping into the slot each time (although math composes nicely and one can get a coherent solution for all these problems combined with some effort).

The question of appropriate bet size has a claimed definitive answer in the Kelly criterion. It is remarkably simple:

Proportion to bet = (prob. of success) / (cost of failure) 
                  - (prob. of failure) / (payoff of success)

To pick some contrived numbers, if I have an experiment that I guesstimate to have a 70% chance of a 2% gain, and a 30% chance of a 4.5% loss, I should put ~55% of my bankroll into that wager. Essentially, this choice gives you the maximum expected long-term total payoff if you fold your winnings back into more wagers. This means that it’s not actually the appropriate criterion if you can’t continually reinvest your winnings: for instance, if you’re measuring the effect of switching from one JVM garbage collector to another, you get a choice of three and you’re done, and your “winnings” are something like having to deploy a constant proportion fewer machines or a constant reduction in latency (eg, a one-time static payoff). On the other hand, if you’re trying to turn ads into customers into ads into customers, the analogy is pretty apt.
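As a sanity check on those contrived numbers, the formula above is easy to evaluate directly (the method and class names here are my own):

```java
public class Kelly {
    //Kelly fraction for a wager that gains `gain` (as a fraction of the
    //amount staked) with probability p, and loses `loss` with probability 1-p:
    //p / loss - (1 - p) / gain
    public static double fraction(double p, double gain, double loss) {
        return p / loss - (1 - p) / gain;
    }

    public static void main(String[] args) {
        //70% chance of a 2% gain, 30% chance of a 4.5% loss
        System.out.println(fraction(0.7, 0.02, 0.045)); // ~0.5556, ie ~55%
    }
}
```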

A few questions rear their ugly heads immediately:

  • How can I possibly know the expected payoffs if that’s what I’m trying to measure in the first place?
  • How does statistical significance play into this?

The first is more of an art than a science, but you can get an estimate by looking at the results of previous experiments. If all your previous futzing with your site’s fonts only shifted conversion by 0.5% in one direction or the other, your custom Lobster substitute is unlikely to change it by an order of magnitude. But still, it’s difficult to have reasoned estimates, especially at low probabilities of highly variable magnitudes from a fickle and ever-changing customer base. It might help if you thought of the bet as not “shift N% of our traffic to an A/B test of button color”, but as “shift N% of our traffic to an A/B testing team with this track record”.

The second is trickier. As I mentioned above, it is possible to reconcile these ideas seamlessly, but it does take some math and some assumptions about what your “real” utility function and beliefs are. The Kelly criterion is fundamentally forward-looking, and statistical confidence is fundamentally backwards-looking, so they need not be in conflict, and one of the nice things about the Kelly criterion is that there is no explicit term for “variance”. In particular, because Kelly only depends on expected outcomes, you can use Bayesian updating of your expected payouts as results come in, and adjust proportions in real time.
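As a toy sketch of that real-time updating loop (entirely my own construction: a Beta prior over the probability of the good outcome, with the payoff magnitudes held fixed), it might look like:

```java
public class KellyUpdate {
    //Posterior mean of a Beta(a, b) prior on the success probability,
    //after observing s successes and f failures
    public static double posteriorMean(double a, double b, int s, int f) {
        return (a + s) / (a + b + s + f);
    }

    //Kelly fraction, as above
    public static double kelly(double p, double gain, double loss) {
        return p / loss - (1 - p) / gain;
    }

    public static void main(String[] args) {
        //Start from the guesstimate above: a Beta(7, 3) prior has mean 0.7.
        //Suppose early results come in at 12 good outcomes to 8 bad:
        double p = posteriorMean(7, 3, 12, 8); //(7+12)/(7+3+12+8) ~= 0.633
        //Re-derive the bet size from the updated belief; with payoffs this
        //lopsided the answer goes negative, ie, stop betting entirely
        System.out.println(kelly(p, 0.02, 0.045));
    }
}
```

Note how sensitive the fraction is: a few points of probability swing the recommendation from betting half the bankroll to not betting at all, which is itself an argument for updating continuously rather than committing up front.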

If this sounds like it might get a bit complicated, it’s because it does. Unfortunately there is no way around it: the more rigorously you try to model a complex system of probabilistic payoffs, the more elaborate your model has to be. This is why the usual solutions are either to stick with ye olde A/B tests (which are for the most part predictably suboptimal), or to hire some consultants or in-house statisticians.

06 Apr 2014, 10:29

Designing a persistent Bloom filter

Bloom filters are handy data structures, particularly for applications where data sets regularly exceed RAM if stored literally. For instance, they’re useful as a way to implement an inner join or filter. You can stream the restrictor dataset into a Bloom filter, and then stream your restrictee through the Bloom filter, propagating only those elements that match, and taking only 1.44 * log2(1 / errorRate) bits per entry in your restrictor dataset. This is why databases like Cassandra use them extensively.
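To make that space figure concrete, here’s a quick evaluation of the bits-per-entry formula (the 1% error rate is just an example of mine):

```java
public class BloomBits {
    //Bits per entry for an optimally-sized Bloom filter:
    //1.44 * log2(1 / errorRate)
    public static double bitsPerEntry(double errorRate) {
        return 1.44 * (Math.log(1.0 / errorRate) / Math.log(2));
    }

    public static void main(String[] args) {
        //A 1% false-positive rate costs under 10 bits per entry,
        //regardless of how large the entries themselves are
        System.out.println(bitsPerEntry(0.01)); // ~9.57
    }
}
```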

Usually they’re formulated as mutable data structures over a bit-array, which is in turn implemented (at least in Java) on top of an array of longs. But there’s no reason why we have to use a mutable array; persistent data structures are desirable for many reasons, not least because they can be used with the same idioms as builtin Clojure data structures if we’re operating in that environment. How do we implement them persistently? On top of a persistent vector / array, of course.

Standard Clojure persistent vectors have object overhead, and the point of a Bloom filter is to reduce memory usage, so they’re right out. You could implement it on top of a (vector-of :long) with less overhead, but there is a speed disadvantage; currently gvec / vector-of doesn’t support transients, so with K hash functions in your bloom filter, you’re doing K sequential modifications to your vector, resulting in many more allocations and copies than we’d like.

Basically, designing a persistent Bloom filter comes down to the problem of representing a persistent fixed-sized array in such a way that we can do K “modifications” per step in the minimum number of operations. Essentially all persistent data structures are built on top of a tree-like data structure, so we need to figure out the optimal tree layout. How do we do this?

When we’re constructing a Bloom filter, we provide the expected number of elements and the desired false-positive rate, which in turn gives us parameters for the optimal number of hash functions and number of total bits, so we know how many modifications per insert (K) and how many total elements we’re going to need to represent (number of bits / 64). In order to make this tree persistent, we need the guarantee that we always copy a whole node if we need to modify any of its members. So, if we have a fan-out of F, we basically accumulate a cost of F for each node along each of the K paths, without double-counting shared ancestors.

But how do we estimate the overlap? This is trickier than it seems; one of the assumptions of a Bloom filter is that the bit positions are randomly distributed. But “randomly distributed” doesn’t mean “evenly distributed”; if I drop 100 balls into 100 random buckets, I’ll end up with more than a third of buckets empty. Simulate it yourself and see:

(loop [v (vec (repeat 100 0)) ct 100] 
  (if (zero? ct) 
      (do (println "Buckets empty:" (count (filter zero? v)))
          (println "Final bucket counts:" v)) 
    (recur (update-in v [(rand-int 100)] inc) 
           (dec ct))))
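The simulation should agree, on average, with the closed-form expectation (in Java, to match the later snippet; the formula assumes independent uniform drops):

```java
public class Occupancy {
    //Expected number of empty buckets after dropping m balls uniformly
    //at random into n buckets: n * (1 - 1/n)^m
    public static double expectedEmpty(int n, int m) {
        return n * Math.pow(1.0 - 1.0 / n, m);
    }

    public static void main(String[] args) {
        //100 balls into 100 buckets: ~36.6 buckets empty on average,
        //ie more than a third
        System.out.println(expectedEmpty(100, 100));
    }
}
```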

The number of empty buckets (and conversely the number of buckets with at least one entry) is dictated by the classical occupancy distribution (each individual bucket’s count is binomial, approximately Poisson for a large number of buckets). Now, we could use that to figure out the number of nodes likely to be hit on each layer, or we could simulate it, which is more fun.

//Does a Monte Carlo simulation of the cost of an insert.
public static double cost(long m, int k, int f){
    final Random rng = new Random();
    final int trials = 250;
    //Each long holds 64 bits
    final int n = (int)Math.ceil(m / 64.0);
    double cost=0;
    //Simulate from top down, averaging over the trials
    for(int trial=0; trial<trials; trial++){
        //Each level's width is a power of f, up to the smallest
        //power of f greater than n
        for(int width=f; width < n*f; width *= f){
            final int[] hit = new int[width];
            for(int i=0; i<k; i++){
                hit[rng.nextInt(hit.length)] = 1;
            }
            int localCost = 0;
            for(int i=0; i<hit.length; i++){
                localCost += hit[i];
            }
            //We may want to add a fudge factor to account for array header,
            //pointer access & function call overhead.
            cost += localCost * f;
        }
    }
    cost /= trials;
    return cost;
}
It would be more than a bit overkill to run a Monte Carlo simulation in the constructor (although I’ve seen worse), and still a little overkill to estimate fresh parameters each time using statistics. The cost isn’t convex due to integer issues (sometimes a higher branching factor lets you get rid of another level, and sometimes it just makes each copy costlier), so we’d have to calculate the results of many different fanouts, and we’ll have some error in our estimate anyway since we’re running on a real machine in the physical world.

It turns out that a fanout factor of 8 has the advantage of being a power of 2, corresponds roughly to a cache line (of course Java object headers mess this up), and gives reasonable simulated costs for large Bloom filters. I’m building a persistent Bloom filter implementation into SOAC and benchmarking, so we can see if the real world agrees with the back-of-the-envelope calculations. A fixed-size primitive array with inherent support for multi-node edits would also be a good base for a different hash table implementation that doesn’t depend on gvec.

05 Apr 2014, 09:39

Full disk encryption with Btrfs and multiple drives in Ubuntu

At this point, encryption is an issue of social responsibility. It is important to establish a norm that data should not live or travel in plaintext without an affirmative reason, especially if you have nothing to hide, because it provides cover to the people who do. Besides the normative aspect, if you intend on doing any international travel, have any interesting friends, or do any interesting or profitable work on your machine, you owe it to yourself to secure your environment.

Ubuntu makes the single-disk encryption scenario relatively easy, but it doesn’t allow a lot of customization at install time, and has no GUI for extending encryption to multiple disks. Fortunately it’s only a bit more CLI work to set it up to work transparently with multiple disks, so you only need to enter the one passphrase. I’ve tested this in a VM with the latest Ubuntu 14.04 beta, but it should work for other versions of Ubuntu, or any distro with support for cryptsetup.

The default Ubuntu “encrypt my whole hard drive” installer layers like so:

  1. The physical disk
  2. An extended physical partition
  3. A LUKS wrapper
  4. A LVM physical volume
  5. Two LVM logical volumes: one for swap, and one EXT4 filesystem for your root directory.

Whew! This is probably fine for your system drive, if a little complex; it’s nice being able to use LVM to resize your swap partition if your needs change dramatically, and if your system drive is a different size / speed than those in your storage array (eg, a 32GB SSD vs. an array of 4TB spinny disks) it wouldn’t make sense to have it as part of the same filesystem anyway. We’ll accept that default for our root partition and swap, and focus on our secondary data drives.

We’ll assume your main system has been installed successfully on /dev/sda , and we have 2 other disks /dev/sdb and /dev/sdc that we want to set up as an encrypted, Btrfs-managed mirror.

First, let’s blow away the existing disks, and create some fresh partitions. You can do this graphically or through any partition editor. The key thing is to end up with one unformatted partition on each disk; /dev/sdb1 and /dev/sdc1 respectively.

# You’ll need to be superuser for all of this
sudo -i
# For these commands, select "o" to zero the partition table,
# "n" to create a new partition (follow the defaults for a single primary
# partition that fills all space), then "w" to write to disk.
fdisk /dev/sdb
fdisk /dev/sdc

We’re going to use a keyfile so we only have to enter the passphrase that unlocks our root partition. Let’s generate one.

# 512 bit / 64 byte keyfile
dd if=/dev/random of=/etc/keyfile bs=1 count=64

Create a couple of LUKS wrappers inside those partitions, using the keyfile we just generated:

cryptsetup --key-file /etc/keyfile -v luksFormat /dev/sdb1
cryptsetup --key-file /etc/keyfile -v luksFormat /dev/sdc1

Now we load the encrypted mapping, to /dev/mapper/enc1 and /dev/mapper/enc2 respectively, again using the keyfile. We write plaintext into the mapper, and it comes out encrypted on the raw device.

cryptsetup --key-file /etc/keyfile luksOpen /dev/sdb1 enc1
cryptsetup --key-file /etc/keyfile luksOpen /dev/sdc1 enc2

Now we make a choice. Btrfs has its own LVM-esque capabilities, so rather than layer in more complexity by using logical volumes, we use Btrfs directly inside the LUKS wrapper.

# Btrfs isn’t installed by default
apt-get install btrfs-tools
# A label makes management slightly easier.
mkfs.btrfs -L vol1 -m raid1 -d raid1 /dev/mapper/enc1 /dev/mapper/enc2
# The final mount point
mkdir /mnt/enc
# We can mount any of the component devices manually, and have access to the full array
mount /dev/mapper/enc1 /mnt/enc

OK, let’s modify our fstab and our crypttab so our system knows to decrypt and mount these drives. This should be added to your crypttab (optionally replacing the devices with their UUIDs, which you can get via “sudo blkid”):

# Optionally add "discard" to options to support TRIM on SSDs
enc1 /dev/sdb1   /etc/keyfile  luks
enc2 /dev/sdc1   /etc/keyfile  luks

And this should be added to your fstab (again optionally using UUID, or the label of the array):

/dev/mapper/enc1 /mnt/enc  btrfs defaults 0 2
# Optionally, using label:
# LABEL=vol1     /mnt/enc btrfs defaults 0 2

Now, when you boot, you will be asked for your root passphrase first. The keyfile will be decrypted, and used to decrypt your Btrfs component drives. They will then be mounted, and your data will be secure.


I’m told that some people’s setups required a bit more massaging to get up and running with automount. Their fix involved adding a “device” parameter to the fstab, something like the following:

/dev/mapper/enc1 /mnt/enc  btrfs device=/dev/mapper/enc1,device=/dev/mapper/enc2 0 2

19 Mar 2014, 19:34

Upgradable locks in Go

Read-write locks are one of the basic means of coordinating access to a shared resource. As many readers can coexist as you’d like, but a writer requires exclusive access. For performance reasons, you’d like to have your write locks over as short a section as possible so you don’t unnecessarily block readers.

Sometimes, though, you have a long stretch of reads, and then a relatively short section of writes dependent on those reads. While you’re reading, you don’t care about other readers - but you do want to ensure that after your reads, you (and not some other caller) immediately acquire the write lock. This is known as “lock upgrading”.

An example of this is restructuring a file (eg, compacting a database file). You read it in, process it, write it to a temp file, and rename it to overwrite the original. While you’re reading the original and writing to a temp file, you don’t care about other readers. But after you’re done reading, you do want to adjust all the metadata so that subsequent readers are pointed at the right file handle / mmap’d buffer / whatever, and also to ensure that no writers bash the file between your reads and your writes.

Unfortunately, POSIX locks don’t support this. Java, which probably has the best shared-memory concurrency runtime available, explicitly forbids it. If you happen to be using Go, which has only the most basic POSIX-y concurrency constructs, how do you do it without holding a write lock the entire time?

The key insight is that the caller doesn’t actually need to ensure that it “upgrades the lock”, per se. It just needs to ensure that its code is run immediately the next time the write lock is acquired. It turns out this is not too difficult in a language that supports passing around functions:

import (
	"sync"
)

type UpgradableLock struct {
	uglMutex sync.RWMutex
	upgrades chan func()
}

func NewUpgradableLock() *UpgradableLock {
	return &UpgradableLock{sync.RWMutex{}, make(chan func(), 1)}
}

func (ugl *UpgradableLock) Lock() {
	ugl.uglMutex.Lock()
	select {
	case v, _ := <-ugl.upgrades:
		v()
	default:
	}
}

func (ugl *UpgradableLock) MaybeUpgrade(f func()) bool {
	select {
	case ugl.upgrades <- f:
		//Make sure some writer arrives to run f, in case no
		//other writers are waiting
		go func() {
			ugl.Lock()
			ugl.Unlock()
		}()
		return true
	default:
		return false
	}
}

// RLock, RUnlock, and Unlock just call the corresponding method on the underlying RWMutex.

The full example lives here. When we try to upgrade, we attempt to place the function containing our write-lock-requiring code in a channel that is always selected from when the lock is acquired. In case there are no writers trying to acquire that lock (and hence run our code), we fire off one with no additional side effects. In order to ensure we’re handing off to the particular write operation we’re placing on the channel, we set the capacity of the channel to 1. If we can’t place that function (ie, there is already another writer scheduled), the upgrade fails.

There are two important limitations to this approach:

  • It doesn’t support arbitrary upgrade-downgrade cycling. This is less than ideal if for instance we wanted to split our write operation into two parts, separated by a read. Currently it is a one-way street.
  • The upgrade is not guaranteed to succeed. This makes sense - if you have multiple readers wanting to upgrade, only one of them can be next. What is guaranteed is that if we have multiple upgraders, at least one succeeds. The rest can continue spinning in the interim, attempting to stay low-impact with time.Sleep() or runtime.Gosched() calls on each cycle until they all complete.

What’s nice in addition is that this approach works on any platform with some sort of passable function / callable construct, read/write locks, and a threadsafe queue (in fact, the queue here isn’t much of a queue - you could replace it with an atomic reference or a plain mutex and a pointer).

01 Feb 2014, 10:16

Immutable SQL

I have strong opinions on data models. They’re not exactly novel opinions, which is good, because if they were truly novel it’d probably be difficult to reify them on top of existing data stores. As it is, if you want your very own immutable persistence layer, you can call Rich Hickey and he will hook you up.

(I’m going to attempt to use the word “immutable” to signify data, whether it lives on disk or in RAM, which is “updated” via returning new copies and which can be validly observed at any point in time, to avoid confusion between persistence as a synonym for durability, and persistence as non-mutability. In reality it’s a bit of a misnomer.)

But Datomic is kind of expensive; SQLite on the other hand is free, so let’s use that (or really, any other free RDBMS). A data model that is both relational and immutable has certain conceptual advantages anyway, because the immutability applies both to the entities themselves, and to the relations between the entities. A blog post is in some sense the “same” blog post after editing, just with two values for its content over time. Its relation to its comments hasn’t changed. If we remove a comment, the post’s relation with its comments up to that point also doesn’t change - it’s just that we record a new relation from that point on that omits the comment in question.

This is basically the Clojure idea of identity vs. state vs. value, expressed in a relational world.

How does this work? Basically like this:

create table posts(post_id integer not null, version integer not null, current integer not null, body text not null, primary key(post_id, version));
create table comments(comment_id integer not null, version integer not null, current integer not null, post_id integer, body text not null, deleted integer, primary key(comment_id, version));

--Unique on post_id and version prevents double-insert
insert into posts(post_id, version, current, body) values (1, 1, 1, 'Valuable content!');

Notice that post_id alone does NOT form a primary key; only the combination of post_id and version does. We’ll potentially have many “versions” of a post, but they all share the same identity.

Here’s where things get interesting: how do I edit a post?

begin transaction;
insert into posts select 1, 2, 1, 'On further reflection...' from posts where post_id=1 and version=1 and current=1;
update posts set current=0 where post_id=1 and version=1 and current=1; --Invalidate old data
commit;

The only part of a row that changes is the “current” flag, which is just an optimization to prevent us from having to attach a “having version=max(version)” clause to every query for the current state. Now, we live in a world where the present value for the same entity is different, but the history is preserved.

Now, here’s the really cool part: what happens if my blog co-author tries to edit the post at the same time, and submits after I do?

Absolutely nothing.

Those “where” clauses in the insert and update ensure my writes only happen if I’m working with the latest version of the data. If I am mistaken about that, no transaction happens. I can verify whether my write happened via the changes() function, or via a select after the fact if other clients are sharing my connection to the db.

Now: what if, instead of effectively overwriting the “current” state, I want to apply a functional transformation to it? For instance, I decide it’s important that I convert my double-spaces to single-spaces. I grab the current value, compute my changes, and try the quasi-update dance above. If it doesn’t work, I simply try again. Eventually my changes will go through, and it will be a valid transformation of the state - a transformation that does not need to be expressible in SQL, avoids the ABA problem, and doesn’t depend on verifying value equality in SQL (which might not be efficient or even possible).

That looks exactly like Clojure’s atom functionality, where you grab the current value, compute its transformation, and compare-and-swap the reference (which operates by checking pointer addresses, not the contents of the object pointed to), repeating if necessary until it commits.

And finally, let’s explore the relational aspect. How do I add and remove a comment?

insert into comments(comment_id, version, current, post_id, body) values (1, 1, 1, 1, 'You write so good.');

begin transaction;
insert into comments select 1, 2, 1, post_id, body, 1 from comments where comment_id=1 and version=1 and current=1;
update comments set current=0 where comment_id=1 and version=1 and current=1;
commit;

Adding a comment is done in the same way as adding a post. Removing a comment is done by adding a value that has been “marked” in some way as deleted. In this case we have an actual column that records that information, but this is the same sort of optimization as having a “current” flag - we could just as easily signify deletion by having a NULL body, or a NULL post_id, and looking for the prior version if we wanted the deleted content. It depends on how we want to think about our data - when someone “deletes a comment”, are they more disassociating it from a post entirely, or “erasing” the content while maintaining the relation? Many sites actually display deleted comments as a regular comment with a “[deleted]” body, to preserve conversation threading - that’s something that’s easy to represent in this schema.

In fact, in a world where a comment was associated with multiple posts and vice versa, you’d typically represent it as a table for posts, a table for comments, and a join table for the relations. In that case, removing a comment from a post really does entail recording an updated relation and nothing else - nothing magical needs to happen when a comment reaches zero references, except perhaps when you “garbage-collect” your database by archiving old / unreferenced data.


Let’s talk briefly about querying the data and making it efficient. This isn’t anything particularly magical; if you want to restrict yourself to current and non-deleted posts / comments, you just need to embed that restriction in a “where” clause, like so:

--All posts IDs & bodies x all of their comment bodies
select posts.post_id, posts.body, comments.body from 
(select * from posts where current = 1) as posts 
left outer join 
(select * from comments where current = 1 and deleted != 1) as comments
on posts.post_id = comments.post_id;

You’ll be doing a lot of this, so you’ll probably want to encode that as a view for each table, and have a compound index on the “ID” and “current” columns to make those queries efficient.

The point of this is to facilitate a more historical view of the world - if your production systems mostly care about the present state, you can ship the archived state to a completely different system (maybe Hive for storage and bulk queries, and Shark for iterative analysis):

begin transaction;
insert into archived_posts select * from posts where current = 0;
delete from posts where current = 0;
commit;

This only works to archive “dead” data - but the present state is probably something you want in your analysis stack as well. To handle this, just ship your present state as a transient table that’s overwritten, and your archived state as a strictly accumulating store. Query the union.

select * from posts
union all
select * from archived_posts;

And having snapshots isn’t helpful unless we can actually examine the state on a particular date. To do this, you’d need to have a “created_on” column that embeds the date:

-- What version was live on 2010-01-01?
select post_id, version from posts 
where created_on < '2010-01-01' 
group by post_id having version = max(version);
-- Join this with whatever data you're actually interested in.
-- Or examine the subsequent state, if any; because version ID is
-- monotonically increasing by 1, it's trivial to examine deltas.

That requires a group-by over a potentially large amount of data per ID, depending on how many revisions you have. Depending on your particular RDBMS and indexing scheme, it may or may not be efficient to do that in the same system with the same cache that’s managing queries to the present state, which is why being able to ship historical data to a separate system is so handy.

04 Jan 2014, 22:02

12 ways of looking at logistic regression

Logistic regression is the most popular off-the-shelf classification technique. In fact, it’s so popular that every subfield has its own interpretation of the same technique, just so they have it in their toolkit while being able to compare it with the stuff they’re actually researching in the same theoretical framework. I’ve attempted to reproduce some of the many interpretations I’ve heard of below, bearing in mind I’m not an expert in all of these, so some of the nuance might be lost.

  • The classical statistical interpretation. Your labels come from a binomial distribution, conditional on their features. You want to estimate the distribution.

  • The Bayesian statistical interpretation. In addition to the above, your parameter estimates are themselves probabilistic beliefs with their own distributions. If you could encode a totally non-informative, zero-strength prior, this ends up being more or less the same as the frequentist interpretation.

  • The latent-variable interpretation, popular with social scientists and psychologists. There is some latent continuous variable that determines the outcome, depending on which side of the threshold it falls on, but we can only see the final outcome. Your goal is to estimate the parameters that determine this latent variable as closely as possible.

  • The Kentucky Derby interpretation. Your parameters represent multiplicative effects on the odds (as in, a 4:1 bet). Your goal is to find each feature’s multiplicative contribution to the odds that best reproduces the observed outcomes.

  • The less-Naive-Bayes interpretation. Like Naive Bayes, but estimating the pairwise correlations/covariances instead of assuming uncorrelated variables.

  • The information-theory interpretation. Among all models whose predicted feature expectations match the empirical ones, find the one where, conditional on the features, the output label distribution has maximum entropy.

  • The warp-space interpretation. Perform a kind of quasi-linear-regression in a space where we transform the label dimension via an inverse sigmoid.

  • The loss minimization interpretation. You have a loss function that gives you a penalty for each misclassified example (a higher penalty the more extreme your prediction was), and you classify an example by dotting its features with your parameters and applying a sigmoid. Find parameters that minimize the loss.

  • The “minimum bias” interpretation, popular with actuaries. Plot your data as a tensor, with each feature being a dimension, and the outcomes for each feature combo being summed in the appropriate cell (this only works for categorical features). Try to find parameters for each dimension, so that when you sum them together, apply a sigmoid, and multiply by the cell population, you get minimal binomial loss.

  • The neural network interpretation. Your features constitute a stimulus, dot-product’d with your parameters and fed through a sigmoid activation function to get a predicted label. You’re maximizing the fidelity with which your neuron can “remember” the label for the data it has seen.

  • The support vector interpretation. Take your data, and try to segment it with a hyperplane. For each point, apply a “supporting vector force” to the plane, proportional to the logit of the distance. When the forces balance, your hyperplane gives you your parameters.

  • The feedback interpretation. We initialize our parameters to some garbage values. For each observation, we dot the features and our parameters. If the result is negative and the outcome in our data is positive, or vice versa, move the parameter vector “backwards”, in the reverse direction of the feature vector. If they’re both negative or positive, move the parameter vector “forwards”, in the direction of the feature vector. This corresponds to the stochastic gradient descent fitting procedure.
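That last, feedback interpretation maps almost directly onto code. A toy sketch in Go (the data, step size, and epoch count are made up for illustration; no regularization):

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid squashes a dot product into a probability.
func sigmoid(z float64) float64 { return 1 / (1 + math.Exp(-z)) }

// sgdLogistic implements the feedback interpretation: for each example,
// dot the features with the parameters, then nudge the parameters along
// the feature vector, scaled by how wrong the prediction was.
func sgdLogistic(xs [][]float64, ys []float64, epochs int, rate float64) []float64 {
	w := make([]float64, len(xs[0]))
	for e := 0; e < epochs; e++ {
		for i, x := range xs {
			z := 0.0
			for j, xj := range x {
				z += w[j] * xj
			}
			// Positive error: move "forwards" along x; negative: "backwards".
			err := ys[i] - sigmoid(z)
			for j, xj := range x {
				w[j] += rate * err * xj
			}
		}
	}
	return w
}

func main() {
	// Column 0 is a constant bias feature.
	xs := [][]float64{{1, 0}, {1, 1}}
	ys := []float64{0, 1}
	w := sgdLogistic(xs, ys, 1000, 0.5)
	fmt.Println(sigmoid(w[0]) < 0.5, sigmoid(w[0]+w[1]) > 0.5) // → true true
}
```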

There are probably more I missed. Despite the fact that everyone has a similar basic toolkit, there seems to be a pretty low amount of cross-domain pollination on extensions, even between similar fields like machine learning and statistics, or statistics and actuarial science. Maybe that’s because everyone is speaking their own dialect, and content to restrict their conversations to native speakers.

25 Nov 2013, 14:45


Everyone should write a database. It’s a core area of practical software engineering, it’s fun, and it turns out it’s not that hard.

(The real reason, of course, is to wind up rolling in dough. It’s basically the same thought process behind learning to play the guitar.)

So I threw together Clownshoes, the ADVANCED ENTERPRISE GRADE NOSQL BIG DATA HIGH PERFORMANCE SCALABLE DOCUMENT STORE FOR BIG DATA (capitalization is emphatically part of the description). It’s not quite as awesome as some other NoSQL databases, like IMS, but it’s fun in its own way. The core storage engine bears a strong resemblance to a certain other prominent document database, to wit: mmaped linked-lists of “documents”, where “documents” are in this case arbitrary bytestreams, which you can easily use to represent anything from JSON to BSON to entire web pages.

It supports atomic inserts, deletes, and updates (“supports” might not be a strong enough word; all updates are atomic since there’s a per-database write lock). It scales to collections over 180TB in size, and working sets over 1TB. It’s wicked fast on the storage layer. How does it do all this?

Clownshoes, being HIGH PERFORMANCE, knows that in order to scale to BIG DATA you need to run on BARE METAL. No network overhead for us! Instead it is embedded in your running Go application process. You can easily scale horizontally by buying a couple of LSI cards and plugging in a disk enclosure to the left or right of your application server, or inserting more RAM to either side of your CPU socket. If you find yourself bound on CPU throughput, you can scale horizontally as long as you built with some creative geometry which gives you room to grow (see the diagram below):



The other main performance trick is journaling. Journaling essentially converts random writes into linear writes by recording a running log, e.g., “write N bytes at X location”, and ensures that those linear writes actually hit disk. That way if there’s a problem actually committing to your main database file, you can replay your journal against the last consistent snapshot and recover. Not a bad idea, but we need all the IOPS we can get - our journaling strategy is to not journal, and let the user decide when they want to snapshot their database (Clownshoes believes in moving fast, and also in breaking things). In general, complete configurability about how much to care about data is the secret to our webscale sauce.

Now, you might say to yourself,

This is almost exactly a linked list of strings, plus some hashmaps for indexing, plus gob-based serialization, plus the OS’s paging algorithm to support data bigger than physical memory capacity. The only difference is that you’re managing the memory a bit more manually via mmap, but because Go’s GC is currently non-compacting, if you’re using the built-in structures and memory pressure is high enough the OS will swap out a lot of untouched data to disk anyway. The wheel has been reinvented here.

This is true. However:

  • There is an advantage to managing the mmap-ing, compaction, etc. yourself, rather than letting the OS manage the swapping. Go’s GC does have to traverse pointers on the heap when it GC’s, which can be avoided if you manage the pointers by embedding them all in a single array as Clownshoes does. Depending on how & where the “native” pointers are actually stored in memory, they might actually generate some swapping you don’t need as they bring in entire 4k pages, and traversal will of course add more time to your collections.
  • I said everyone should write a database, not use a database they wrote themselves. You probably shouldn’t use Clownshoes for important things; I am personally using it to manage some metadata about which movies I have watched. I have probably spent more time writing Clownshoes (a few hours cumulatively) than I have watching movies over the past month or so, so this is not mission-critical data. In any case, if you want an embedded data store, what’s wrong with SQLite? Just use SQLite already.

I guess that last one isn’t really a “however” so much as a “yup”.

13 Nov 2013, 17:56

Persistence + GC saves memory

One of the disadvantages of garbage-collected runtimes is potentially increased memory consumption. This usually comes from a couple of places: object headers used for GC metadata, and “extra” heap space kept around to manage the garbage collection process (for example, the space taken up by Java’s Eden pool, which is gradually filled up and then GC’d, rather than incrementally freeing objects as in a reference-counted environment like Python or a manually managed malloc/free environment like C).

But garbage collection can actually lead to vastly reduced memory consumption, because it allows you to use persistent data structures that share data which would otherwise have to be tracked separately. You can share the same backing data for things that will be used & modified in different ways, because it’s effectively a copy-on-write environment: if you conj data to a shared vector in Clojure, the original is unchanged and the caller gets a “new” vector, with most of its backing data shared with its parent.

Those kinds of persistent data structures really only work (at least performantly) in the context of garbage collection, where you’re automatically freeing objects that have ended up with no references to them. Otherwise, managing the object graph for the kind of complex relationships that persistent data structures generate is nigh impossible.

Here’s an example. You have a large number of short phrases you’re analyzing. If they’re in a Latin-alphabet language, you can easily segment them into lists of interned strings representing words (a poor man’s trie). A lot of phrases are duplicates, so you can also de-dupe the entire phrase-list - and it’s free as long as you’re using persistent data structures that guarantee you’re placing no restrictions on how future callers can use your data.

;Straightforward impl
(with-open [r (clojure.java.io/reader "short_cuts")]
  (->> (line-seq r)
       (map #(vec (clojure.string/split % #"\s")))
       (into [])))

;Intern the strings
(with-open [r (clojure.java.io/reader "short_cuts")]
  (->> (line-seq r)
       (map #(vec (for [token (clojure.string/split % #"\s")]
                    (.intern ^String token))))
       (into [])))

;Dedupe the vectors
(with-open [r (clojure.java.io/reader "short_cuts")]
  (loop [remaining (line-seq r)
         as-vec []
         as-set #{}]
    (if (empty? remaining)
      as-vec
      (let [this-line (vec (for [token (clojure.string/split (first remaining) #"\s")]
                             (.intern ^String token)))]
        (if (contains? as-set this-line)
          ;Reuse the copy already in the set, so duplicates share storage
          (recur (rest remaining) (conj as-vec (get as-set this-line)) as-set)
          (recur (rest remaining) (conj as-vec this-line) (conj as-set this-line)))))))

The result: Vectors of non-interned words take 21.5 gb. As vectors of interned words, they take 12.1 gb. And as de-duped vectors of interned words, they take 4.4 gb. And each version supports the exact same operations on the result - you can take one of those de-duped phrases and “modify” it with no side effects on its siblings, because you’ll get back a new reference.

But here’s the twist: most of that storage is overhead from the clojure.lang.PersistentVectors we’re storing the strings in. Each of those has a 32-long Object array for the leaf, plus another 32-long Object array for the root node, plus some miscellaneous primitive fields. Seems a bit excessive when the vast majority of our vectors are <=10 elements long. What happens if we replace them with something more efficient, like a PersistentArrayVector from SOAC?

(require 'soac.arrvec)
(with-open [r (clojure.java.io/reader "short_cuts")]
  (loop [remaining (line-seq r)
         as-vec []
         as-set #{}]
    (if (empty? remaining)
      as-vec
      (let [this-line (soac.arrvec/array-vec (for [token (clojure.string/split (first remaining) #"\s")]
                                               (.intern ^String token)))]
        (if (contains? as-set this-line)
          (recur (rest remaining) (conj as-vec (get as-set this-line)) as-set)
          (recur (rest remaining) (conj as-vec this-line) (conj as-set this-line)))))))

Memory usage goes down to 1.5 gb, or ~7% of what we started with.