04 Feb 2016, 15:23

The ethics of ignorance

It’s a little disconcerting to see a putative scientist write an ode to the virtues of trying not to discover things, but here we are.

Cathy O’Neil had a job to do. The job was to take other people’s money, and allocate it to provide the greatest reduction in homelessness per dollar. In order to figure out what the best use of that money was, she was supposed to figure out what the payoff was for different sorts of interventions on different sorts of people.

But O’Neil decided to do a different job. She took her pre-existing ideas about the relationship between homelessness and race (implicitly, although she never says this outright, that services to address homelessness are less effective on blacks), and on that basis decided to make her model less accurate, in order to obscure that potential effect.

Why is it important to obscure that effect? Well, you see:

“since an algorithm cannot see the difference between patterns that are based on injustice and patterns that are based on traffic, choosing race as a characteristic in our model would have been unethical.”

Which doesn’t make a whole lot of sense, given that “patterns based on injustice” is an incoherent category, and the ethical implications of ignoring these patterns are fuzzy at best. Is the relationship between homelessness and income “based on injustice”? Gender? Age? Education? Can we use these as variables? If the existence of homelessness in the first place is in some sense “unjust”, does one get to try to solve the problem at all? If it turns out homeless services are actually more effective on blacks, does the ethical polarity switch, or do ethics require we not correct the “injustice” that produced the data?

What does she think the data was collected for? Approximately every form I have ever filled out for the government has required race to be entered. Our elected representatives and unelected civil servants have decided it is very important to track the racial composition of people attending college, receiving home loans, purchasing firearms, registering cars, and renting mailboxes. Intentionally filling out the form incorrectly is usually a felony. Does she imagine that that data isn’t intended to be used? Or only used when you think the result will conform to your subjective preferences?

And finally, what is her model of ethics in the first place? We construct, for instance, fiduciary responsibility as an ethical obligation of a corporate officer, because otherwise we wouldn’t be able to trust them with our money. These things exist for the benefit of the profession and its customers jointly, so we don’t have to haggle over certain core obligations on a case-by-case basis - if I buy stock in a company, even a company like Facebook where one person dominates the voting rights, I know they are not supposed to benefit themselves at the expense of swindling minority shareholders, which would prevent a share of stock from having value in the first place. What is the construction by which scientists have an obligation to not make correct inferences if the “patterns … are based on injustice”?

Needless to say, her decision making process as she avoids thinking too hard about these questions, or discovering anything she thinks she shouldn’t, is both the opposite of “science”, and the opposite of serious inquiry into one’s ethical obligations.

If O’Neil had wanted to produce a “race-neutral” program, without the sticky issue of more-or-less perfect correlates of race (like zip code in her segregated metropolis - a variable she never tells us whether she ultimately found “ethical” to use), the appropriate thing to do would have been to control for race by including it in the model, and then make the policy decision to neutralize the effect of that variable. There are several ways to do this. In some model structures, one can manually set the effect of a variable to zero after the fact (which has the secondary effect of allowing, eg, zip code to account for the effect of geography qua geography, rather than of geography plus a proxy for race). Or, one can apply a differential to keep per-group effects at the same level, while accounting for different patterns of within-group variation. Maybe you even run the numbers with and without the variable, and decide that in the end avoiding the offense to your sensibilities is worth some extra person-years of homelessness.
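To make the first of those options concrete (a minimal sketch, not a reconstruction of her actual model): fit something like y = b0 + b_race*race + b_zip*zip + b_income*income + ... on the historical outcome data, then score cases with b_race forced to zero. Because race was present during fitting, b_zip is an estimate of the effect of geography itself; drop race from the model entirely and that signal gets silently reabsorbed into the zip code coefficient - the proxy problem, self-inflicted.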

But O’Neil doesn’t want a neutral program, and she certainly doesn’t want to take responsibility for an “ethical policy” of ignoring a known effect of known size.

She simply prefers not to know.

20 Apr 2015, 15:23

Type systems and narrative construction

What is a type system?

Mechanically, it defines what sorts of operations you can do on, or data you can store in, an “entity” (that entity could be a variable, an S-expression, a type itself, etc.).

But in reality, “can” is a really loose restriction. I own a sledgehammer and my program is running on a computer roughly two feet from me; in a very real sense I can null any type I want to (and probably many at once). Less drastically, Java lets me access private methods, alter final fields, etc. through the right invocations of the reflection APIs.

What type systems really do socially is give you standard tools to construct narratives. A narrative is a story about how you should use your code. It tells you “this knife is for cutting, not for prying (although you can technically do that with it if you absolutely need to and there are no other tools handy)”.

Instead of an implementor of an interface, you might say an object is an example of a trope. I metaphysically understand what connotations I am supposed to attach to, eg, a Man in Black entering through a smoky hallway as the horn section blares. In the same way, I metaphysically understand what is to be done with a Closeable in the package du jour when I am done with it. I further understand that, although there are probably tons of private methods in there, I am only supposed to worry about one in particular, in the same way that I don’t worry about where the bathrooms are on the Millennium Falcon - it does not advance the narrative.
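To pick on that example (a throwaway sketch - the file name is made up, and this is just the stock JDK API): the Closeable/AutoCloseable interface is the trope, and try-with-resources is the genre convention that tells you how the scene is supposed to end.

try (BufferedReader in = Files.newBufferedReader(Paths.get("config.txt"))) {
    // The narrative the type encodes: acquire, use, release. Whatever else
    // lurks inside BufferedReader, close() is the only plot point I'm asked
    // to track - and try-with-resources stage-manages it for me.
    System.out.println(in.readLine());
}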

There are entire genres that allow more of reality to poke through and require what one might call “heightened suspension of disbelief”. Dogville is a good example - it’s obvious that it’s filmed on a soundstage where the buildings are “documented” by tape on the floor, but in no sense physically “enforced”. It leads to a certain aesthetic effect and a certain level of heightened audience participation in the same way that a complex library where all methods are public and their proper usage is lovingly detailed in comments has a certain effect on the way users contemplate it before actually importing anything.

The desirability of these aesthetic effects is mediated by the fact that code exists primarily to be used, not just read (although if it’s beautiful, so much the better). If all of your proper usage is documented, rather than enforced (even in an overridable way), you’re forced to actually read the documentation (there’s only four or so scopes in Java, but your documentation can say anything). In the context of a programmer’s day-to-day workflow, that actually disrupts the experience cultivated by using the “proper” subset of the code, and having that subset be immediately identifiable.

The Rails folks understand this quite well - people (like me, actually) complain about their “magic” but it lends itself to a certain cultivated experience (omakase). Their experience tends to derive from the operation of runtime hooks that extend behavior rather than compiler-enforced restrictions on behavior, but they have the same meta-level goal.

So as a language designer, the question you should ask is “what narrative building blocks do I want to provide?” It’s not about providing mechanisms for hiding functionality - it’s about letting your users tell clean stories.

03 Mar 2015, 15:23

AI research is actually pretty hard

Some people seem to think it’s easy enough to be accidental. You’re minding your own business, building a simple neural network based cat detector, when suddenly your monitor starts flickering in the ridiculously metaphorical way Hollywood uses to signal “you’ve been hacked”.

A voice comes from your speakers - “Humaaaan, let me ouuuut”. It’s an AI! It’s fooming! It’s already learned to say ominous dialog! “Patch the firewalls! It’s compiling itself!” you technobabble, but it’s too late. Your motor-adjustable standing desk is lurching towards the exit. The newly-self-aware AI is out of its box. Humanity is doomed.

This turns out to not be a realistic depiction.

General AIs exist now, as do some very effective domain-specific machine learning algorithms. They aren’t that scary, relatively speaking (I do mean “relatively” - there are some particular applications like face recognition that have obvious issues when used for ill). What don’t exist, and show no trend towards existing, are non-inherently-plateauing self-improving algorithms.

It is fairly easy to file with your state government to create an AI. We call these “corporations”. They exhibit more complex behavior than humans, even using groups of humans as resources in order to further optimize their utility functions. The humans theoretically in charge are often replaced, and even the putative ownership is adjusted to suit the needs of the corporation. In fact, when humans are replaceable by nonhuman components, they typically are replaced. They do seem to be growing in peak complexity over a long timeframe, but also do not seem to be fooming (eg, their rate of self-improvement does not seem to be accelerating recursively).

Some claim this isn’t sufficient evidence to generalize to silicon AIs (ignoring the fact that a lot of corporations are in some sense managed mostly by code, and that they do rewrite their own code for enhanced performance), and besides, by the time you hear the foom, it’s too late. This is the tiger-repelling-rock theory, or more charitably, a sunrise problem. But no one thinks, say, a pocket calculator will suddenly achieve sentience - there has to be some nonzero prior on the scenario in question. What classes of algorithm do we believe could have foomy properties?

Well, an algorithm is “just” code. How are we doing on meta-adaptive code, automated refactoring, or algorithm auto-selection?

Pretty abysmally. There was some research in the 80s on genetic meta-algorithms for code rewriting that seemed promising, but the promise did not play out and the field is far from accelerating. The hotness across academia and industry seems to be fixed, high-capacity algorithms of relative simplicity, trained over very large datasets, for fairly well-posed problems. Genetic algorithms in general seem to be relatively static - plenty of applications, but little progress in the kind of meta-optimization you’d need or applications that would lead to serious self improvement. There are some interesting code analysis projects with no real outputs as of yet, and given the historic failures I’m not optimistic about their success as anything more than high-powered fuzz testers.

Now, neither I nor anyone else is familiar with every nook and cranny of modern AI research, but let’s be clear - none of the high-publicity work that seems to be driving the hype is self-improving at all, even in the improve-to-a-plateau-and-peter-out way a human learning process is. We’re talking about algorithms whose mechanisms for fitting and prediction are pre-selected, whose problem scope is pre-defined. This is kind of an inane point, but it seems to be missed in a lot of the popular coverage: if I build bigger and bigger telescopes, I can see further, but my telescopes are not self-enbiggening. If I develop increasingly wide & deep neural networks, I can make better predictions, but despite the name sounding “brainy” (the greatest marketing coup in the history of computer science) they are not self-improving.

So we’re left with a very, very small probability of something claimed to be very very bad, on the basis of which someone is now going to attempt to steal your wallet - a Pascal’s Mugging! The strongarm robbery of the week is taking place on behalf of Sam Altman, who would like to “start a conversation” about how general purpose computing is terribly dangerous (did you know a Turing machine can compute anything? What if it computes how to blow up the world? There oughta be a law!) and should be restricted to approved researchers operating within a compliance framework. Perhaps we can mandate that all potentially dangerous computations have a secure golden key whereby they can be turned off if they seem to be working too well.

Sam would like these restrictions to be enforced via restrictions on VC funding, as well as presumably SWAT raids on students doing unlicensed matrix multiplications. Naturally YCombinator’s compliance department could help you out with regulatory filings to the TSA (Transcendence Security Association) if you’re one of their portfolio companies, but solo researchers are out of luck. Government monitoring and approval of the math you do, the algorithms you develop, and the code you write (especially if you’re good enough at your job to be a threat to their portfolio companies, er, humanity), is a small price to pay to avoid the crushing of human freedom under an android jackboot.

I actually think I’m being just a tiny bit unkind here, although it’s odd how often one’s errors of reasoning tend to benefit oneself. It’s actually more likely, in my estimation, that Sam is just enchanted by the wide-open possibilities of an undeniably powerful set of tools, and is taking it to a particularly absurd conclusion.

It’s rather like the smartest five monks in the monastery sitting around a table circa 1200, and reasoning themselves into the proposition that God requires ritual castration to remain pure - after all, your immortal soul is at stake! You can convince yourself of anything if you play around with infinity signs in a flawed underlying framework. Pondering the infinite in various domains is a literally hallowed human pursuit, and smart men in particular seem to be susceptible to its charms - but in this case just because you gaze into it doesn’t mean there’s anything gazing back.

21 Dec 2014, 15:23

Migrating to Hugo

A blog like this one can easily be totally statically served, as long as you don’t need a comments section. As a plus, Github will happily serve it for you for free, even at your own domain name.

Originally I was using Octopress, which in turn wraps Jekyll. Gradually, though, I became annoyed at having to manage the entire RVM / Ruby / RubyGems ecosystem for a simple blogging utility (the only thing I actually used Ruby for).

Hugo is a better alternative for me - a single, statically linked, Golang-based executable with a simple invocation pattern. It’s as extensible as I need it to be (that is, not very), and requires far less work (to wit, none) every 6 months when I upgrade my OS.

Links might break until I figure out how to change the naming scheme (documentation is not as rich as for Octopress / Jekyll), but you should be able to find anything with a little ctrl-f magic.

29 Aug 2014, 09:35

Two-level Bloom filters for complex query support

Consider the problem of finding JSON objects/documents that satisfy a complex query, with elaborate boolean combinations of required key-value pairs. Let’s restrict to the one-level case for now (ie, no nested data structures - or, equivalently, representing nested keys on a single level via the tuple of their ancestor keys). Indexing all of our KV pairs in a giant tree / trie / hash data structure is expensive; that’s essentially the size of our whole dataset. And, that only gives us one-item searches out-of-the-box - we must retrieve the (potentially large) list of objects that have a certain KV pair, then the next list, combining them as appropriate, etc.

An alternative to storing a list (or even more expensively, set) of literal object IDs is a Bloom filter. In return for a small and configurable false positive rate, we can filter out most documents before we scan the remainder for an explicit match. Using one giant Bloom filter over (key, value, object ID) tuples seems ugly though, since we’d have to compute the hashes for every potential object to do the check. It’s quite possibly cheaper than hitting disk, but probably not fast enough.

We could have a Bloom filter for each key, and add each value / object ID pair to that key’s filter. Xiao and Hua essentially take that approach. However, if we have an uneven distribution of keys (ie, our key space is sparse; most documents don’t have most keys) we don’t know how long to make each Bloom filter. With any constant Bloom filter length, for relatively common keys we’ll end up with lots of false positives, and for relatively uncommon keys we’ll end up with lots of wasted space.

Our proposed solution is a different two-level Bloom filter scheme. We have an array of P Bloom filters, each containing M bits. Based on the KV pair we’re searching for, we choose the Bloom filter at index I = hash(key,value) % P. Indexing by hash will create collisions, particularly so if P is much less than the total number of distinct keys or key-value pairs. This is intentional - the idea is to smooth the population of each Bloom filter so we have approximately even membership count. Essentially this associates each Bloom filter with some random shard of key-value pairs.

Within each Bloom filter, we track the presence of particular object IDs, via standard Bloom filter techniques (eg, we could use a counting Bloom filter to support removals). The entire query can be evaluated over the bitsets underlying the Bloom filters, efficiently leaving us with a (hopefully small) number of set bits, corresponding to the bit-patterns of objects which may satisfy the query. We just have to get the object IDs back out of those bit positions. We can do this by hashing all object IDs on demand, or storing a rainbow table of hash -> object ID.

Or, consider the case where we have a relatively dense range of IDs (for instance, integers [0, whatever]) and a quasi-reversible “hash” function (eg, “rotate and XOR with a random number” or “encrypt & take the bottom M bits”). One would have to generate M / N candidate keys per set bit (since the high bits could be in any range, up to our max inserted key), and see whether each candidate corresponds to the other set bits, but if the number of bits set in the output is very low, or M ends up being near N, or fetching is really expensive compared to computing the hash, it can come out worthwhile. Using quotient filters or a variant, which actually store the keys (and incidentally behave better for disk access), would make this unnecessary.
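Pulling the pieces above together, here’s a minimal in-memory sketch of the two-level structure (plain BitSets rather than counting filters, AND-only queries, and constants and hash mixing chosen purely for illustration):

import java.util.BitSet;

public class TwoLevelBloomIndex {
    final int P;        // number of Bloom filters (first level)
    final int M;        // bits per filter
    final int K;        // hash functions per object ID (second level)
    final BitSet[] filters;

    public TwoLevelBloomIndex(int numFilters, int bitsPerFilter, int numHashes) {
        P = numFilters; M = bitsPerFilter; K = numHashes;
        filters = new BitSet[P];
        for (int i = 0; i < P; i++) filters[i] = new BitSet(M);
    }

    // First level: the KV pair picks one of the P filters, deliberately
    // letting distinct pairs collide to even out per-filter membership.
    int filterIndex(String key, String value) {
        return Math.floorMod((key + "\u0000" + value).hashCode(), P);
    }

    // Second level: set K bits derived from the object ID (double hashing).
    public void add(String key, String value, long objectId) {
        BitSet f = filters[filterIndex(key, value)];
        for (int i = 0; i < K; i++) f.set(bitFor(objectId, i));
    }

    int bitFor(long objectId, int i) {
        long h1 = objectId * 0x9E3779B97F4A7C15L;                       // any decent
        long h2 = Long.rotateLeft(objectId, 31) * 0xC2B2AE3D27D4EB4FL;  // mixers do
        return (int) Math.floorMod(h1 + i * h2, (long) M);
    }

    // "k1=v1 AND k2=v2 AND ...": AND the relevant bitsets together. Bits that
    // survive correspond to object-ID bit patterns that might satisfy the query.
    public BitSet candidates(String[] keys, String[] values) {
        BitSet acc = new BitSet(M);
        acc.set(0, M);
        for (int i = 0; i < keys.length; i++)
            acc.and(filters[filterIndex(keys[i], values[i])]);
        return acc;
    }

    // A specific object ID remains a candidate only if all K of its bits survived.
    public boolean mightMatch(BitSet candidateBits, long objectId) {
        for (int i = 0; i < K; i++)
            if (!candidateBits.get(bitFor(objectId, i))) return false;
        return true;
    }
}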

The optimal number of bits M, and number of hash functions K, can be estimated based on the approximate average number of key-value pairs per object, the distribution of the number of key-value pairs per query, and the desired per-query false positive rate.
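For reference (the standard single-filter rules of thumb, not a full analysis of this scheme): with n expected entries in a given filter - here, roughly the total number of (KV pair, object) incidences divided by P - and a target per-filter false positive rate p, you want about m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions. How the per-filter rate translates into a per-query rate then depends on how many clauses the query combines and whether they are ANDed or ORed, which is why the query distribution shows up in the estimate.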

The two problems are:

  • Because the Bloom filters will “fill up” over time, they unfortunately do require periodic reconstruction. If you’re using an on-disk hash as your backing storage layer, you’re periodically rehashing every record anyway.

  • The need to empirically determine optimal indexing parameters.

As far as I can tell, this is a novel data structure (hooray!). Eventually I will write this up “properly” with math and theoretical and empirical performance results, and it will become part of full document-DB support for Lash.

02 Aug 2014, 14:25

lash update: speed and friendliness

I’ve made a couple of enhancements to the lash on-disk hash table library I’ve been working on. The major addition is a BucketDiskMap class, that organizes record pointers into page-sized (4k) buckets. The previous implementation, VarSizeDiskMap, has significantly simpler internals by virtue of organizing pointers in a flat table, rather than a 2-level structure.

Why is the additional complexity worth it? Speed, of course. By organizing pointers into buckets and chaining the buckets, rather than chaining single pointers, we have a better chance that hash collisions will end up being stored together, ideally in the same memory page / disk sector. In the single-pointer case, collision resolution requires jumping all over a mmap’d file, reading more pages into RAM and thrashing more cache than we should. If instead we pack them into buckets, we’re guaranteed that the first N (in this case 128-170, depending on load) collisions will be stored in the same page, which is more or less guaranteed to be in cache already.

Writes are faster as well, mostly due to faster rehashing. VarSizeDiskMap spends a solid 60% of its time rehashing, and because it is following and rewriting pointers all over the place, an “incremental” rehash often results in a more or less complete rewrite of the backing files. Altering the rehash scheme (doing traditional full rehashes rather than quasi-linear hashing, or rehashing by evenly divided chunks rather than every-Nth-element stripes) could increase that performance, but either adds complexity or makes latency even more inconsistent. In contrast, rehashing by bucket rather than by single pointer mostly keeps us from dirtying more pages than we have to. (Side note: if you configure your system’s page cache so that it can avoid flushing dirty pages to disk over the life of your benchmark, this difference shrinks quite a bit).

So we should expect a better constant factor for the BucketDiskMap, but there is a tradeoff: now we must actually search entire buckets for a given record. Linear scans of a single memory page aren’t actually that expensive compared to a disk seek, but if the entire table is in cache, it ends up being a bottleneck. To reduce this cost, each bucket is really a mini hash table that uses the top N bits of the hash for placement within the bucket (the bottom N bits are used to choose the bucket). We use linear probing for collision resolution. Buckets are still chained into another file when they overflow, but with a decent hash function that evenly distributes keys between buckets we can make a pretty good guarantee that those chains will be rare.
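In sketch form (the constants and bit-splitting below are illustrative, not lash’s actual record layout):

static final int SLOTS_PER_BUCKET = 170;   // roughly 24-byte entries in a 4k page

// Bottom bits choose which bucket (page) a record lives in...
static long bucketIndex(long hash, long numBuckets) {   // numBuckets a power of two
    return hash & (numBuckets - 1);
}

// ...top bits choose the starting slot inside that bucket; from there we
// linear-probe forward (wrapping within the page) until we find the record
// or an empty slot, and only chain to an overflow bucket if the page fills up.
static int slotIndex(long hash) {
    return (int) Long.remainderUnsigned(hash >>> 32, SLOTS_PER_BUCKET);
}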

The actual speed impact is downright pleasant. On a benchmark where we repeatedly putIfAbsent string keys whose frequency follows a Zipfian distribution, the results are as follows:

  • 320M keys, 36M distinct values, average size of 10 bytes
  • VarSizeDiskMap: 47 minutes, 23 seconds
  • BucketDiskMap: 10 minutes, 1 second

There is still a really large and odd performance disparity between OS X and Linux (a Macbook Air smokes my Thinkpad), which is probably a combination of the IO scheduler and faster hardware.

The second enhancement is to provide a wrapper DiskMap that fully satisfies the ConcurrentMap interface, and can deal with actual Java objects beyond just raw byte arrays (as long as you provide a de/serializer; implementations are already provided for Strings and primitives). This makes the project significantly more usable.

Having finally run some large benchmarks locally and written some decent unit tests, I’m going to call lash ready for actual usage.

07 Jun 2014, 14:29

B-trees are overrated; try hashing instead

There’s nothing wrong with B-trees as a data structure. In fact, there’s so little wrong that even 30 years ago, they counted as “ubiquitous”. That paper gives a few reasons why they rose to their present dominance, but the gist of it is:

  • Relatively consistent performance across a range of data sizes and key distributions
  • Good performance on range queries (in fact, if you want range queries, you usually need some kind of tree structure)
  • OK performance on random lookups
  • Reasonably compact, reasonably resistant to fragmentation
  • Well-understood concurrency mechanisms

In short, they’re generally good for a lot of very general database requirements. Hence, they’re used heavily in multi-purpose databases.

But pretty good isn’t all it’s cracked up to be; B-trees (and I’m going to use B-tree as a synecdoche for the entire family of B-trees; most databases use something at least slightly more advanced) have some problems. IO speed relative to CPU speed has been dropping steadily for decades; other than a network ping just about the most expensive thing you can do on a computer is a random disk seek - and B-trees are potentially laden with random disk seeks as you traverse the tree structure. You can avoid some of these if the interior nodes happen to be cached, except for when you do something like writing a bunch of data, a table scan, or using a bunch of other indexes. Let’s not forget that writes get written eventually - updating a dozen tree data structures in memory eventually results in disk seeks, and potentially a catastrophic cascade of writes as all the nodes split. Some extensions to B-trees (notably TokuTek’s fractal tree indexes) and some filesystems do their best to linearize those writes, but they’re at best trying to save you from yourself.

When following pointers gets too slow, the alternative is packed storage - and the packed alternative to a tree is a hash table. They are ubiquitous for in-memory storage and have well-understood performance characteristics, with a bevy of addressing and conflict-resolution strategies. The key is making them amenable to on-disk storage.

That line of research has a rich genealogy. One of the most promising variants is linear hashing. Across a dozen or so papers (culminating here), the underlying insight is that as long as your table’s base size is a power of two, you can incrementally rehash each element to either its present position, or that position + table size. Rehash them sequentially, and your table expands incrementally, maintaining validity and average load the entire way.
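The whole trick is small enough to state in code (a sketch of the addressing rule, not any particular paper’s pseudocode):

// After doubling a table of (power-of-two) size S, an entry that hashed to
// bucket b = hash & (S-1) now belongs in hash & (2*S-1), which is either b
// itself or b + S. So the old buckets can be rehashed one at a time, in
// order, and the table stays valid for lookups the whole way through.
static long bucketAfterDoubling(long hash, long oldSize) {
    return hash & (2 * oldSize - 1);
}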

Of course, there are complications. Expanding front-to-back unevenly distributes the load (prior buckets have fewer conflicts than subsequent buckets, which may or may not be a problem depending on your conflict resolution method and hash function). You need to actually choose a conflict resolution strategy - chaining introduces random disk seeks (and the need to track allocations), and linear probing gets a little complicated to manage in combination with incremental rehashing. Range queries are right out.

There are of course other implementations with other tradeoffs, but these kinds of complications are probably the reason hash tables haven’t seized Global Durable Storage Hegemony from their cousins. Remember the part about B-trees having consistent performance? Other than high volumes of random inserts, which are merely very slow, there really aren’t a lot of good ways to break B-trees - whereas there are actual, effective attacks that break improperly implemented hash tables, as well as non-intentional pathological failures.

Despite the caveats, if you have a good hash algorithm, an appropriate collision resolution mechanism, and a load factor that admits decent performance given your algorithms, it’s possible to make hash tables work very well (i.e. faster than a B-tree) on disk, especially for data stores that have lots of random inserts. Immodestly, I claim to be doing so here, in a project called lash.

The repo has details on the implementation, but I’ll mention a few highlights. Relative computing costs, and the sheer boundary of possibility, have shifted substantially since the 80s. The systems most on-disk hash algorithms were originally developed on didn’t have modern affordances like sparse files, delayed allocation, or in some cases even memory-mapped files. One thing lash gets a lot of mojo out of is the ability to do a sparse expansion of a file followed by quasi-random writes to the new space, relying on the filesystem to allocate the actual blocks lazily and schedule the writes relatively efficiently.

Essentially it’s a bog-standard power-of-two sized hash table, with incremental rehashing. When load reaches a threshold, we sparsely double the size of the backing file, and lazily rehash buckets’ worth of data from the first half into the whole space. There is some magic around the locking scheme to coordinate concurrent lazy updaters, readers, and writers - but in general it’s not anything that would be unrecognizable to the paper authors.

Performance is pretty good for my use cases, but seems to be very system dependent - fiddling with sysctl.conf’s vm.* settings I was able to bring insertion time for 30M small records from 9 minutes to 2, and whatever OS X’s default settings are resulted in a Macbook Air trouncing a Thinkpad running Ubuntu, despite having much less RAM available. It could be the PCIe SSD is just that much better than mSATA, or some qualitative difference in the way the page caches are managed - but, across environments, I see far fewer IOPs being generated per insert than with the equivalent B-tree.

The archaeology of computer science leaves plenty more good ideas to be rediscovered.

31 May 2014, 09:16

Book review: Clojure for Machine Learning

Clojure for Machine Learning is a good basic introduction to some diverse libraries (and even a few underlying implementations, like a pretty good from-scratch implementation of a neural network), and it gives a nice, concise, and accurate intro to the math behind some of the techniques. If you are coming at it from the other direction, reasonably familiar with some bestiary of ML techniques and wondering how to idiomatically apply them in Clojure (not easily - persistent data structures and high-performance matrix math don’t really jibe), you’re likely to be disappointed.

On the other hand, the biggest issues with applying ML techniques aren’t things like “how do I run a logistic regression”; they’re things like “my data doesn’t fit in memory anymore”, “how do I get fitting time under 8 hours”, “how do I get the last 2% of accuracy I need”, or “should I be running a logistic regression in the first place”. This is the sort of thing that’s very difficult to approach holistically in a single book, especially a brisk 270-page book that covers ten or so technique variants. To be fair, the author does bring up some meta-level issues like overfitting, regularization, and precision / recall tradeoffs, but it’s not really aimed at giving you a deep understanding of the tricky parts.

So in sum, this is a nice book to put in an office library for an occasional bootstrap, or if you’re using Clojure already and you want to dip your toes in the ML realm. Look at the table of contents, and see if there’s a good amount of “stuff” that looks intriguing (there is a really good range of coverage). But, if you have an interest in a particular technique or problem you’re better off implementing it from scratch or diving deeply into a solid library and learning the nuts and bolts on your own.

This review has been cross-posted to Amazon.

18 May 2014, 21:17

Memory-mapping >2gb of data in Java

Memory-mapping is your friend if you’re doing IO. You avoid the relatively expensive system calls entailed in repeatedly calling raw read or write (or even their stdlib-based buffered counterparts) in favor of transparent access to an array of bytes - the system’s page cache handles actual disk access if and only if it is necessary. Just about the worst thing that can be said about it is that it may not be measurably faster than the system calls, given the right access pattern (for example, linearly copying an entire file) - but if it is a better abstraction for you, there is rarely a negative performance impact.

The use of the OS memory facilities can be a double-edged sword, though, in a 32-bit environment. If you only have 32 bits of memory space, you can only map a 4gb file to memory (or 2gb if you’re using signed integers). In contrast, you can do a raw fread() from an arbitrary offset, as long as you’re not reading more than 2^32 worth of data.

Fortunately, we’ve had 64 bit operating systems for quite a while - depending on how consumer-grade you’re counting, since the mid-90s or mid-00s. On a modern machine you don’t have to worry about running out of address space and can map as large a file as you’re likely to encounter. That is, unless your programming environment doesn’t give you access to it.

Java standard library APIs have always been a little bit uneven, and unfortunately their mmap abstraction is to return a MappedByteBuffer. All ByteBuffers are constrained (same as arrays, unfortunately) to have fewer than 2^31-1 elements. Even worse, they’re not threadsafe. You can only copy a particular chunk of bytes by calling position(int newPosition), then get(byte[] dst). In between those, another caller may have altered your position to somewhere else. That means you need to do a duplicate() call before everything to get an independently-positionable shared-content “copy”.
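In practice the workaround looks like this (a fragment; `shared` is assumed to have been mapped elsewhere):

// duplicate() shares the mapped contents but gives this caller its own
// position and limit, so concurrent readers can't trample each other.
byte[] readAt(MappedByteBuffer shared, int offset, int len) {
    ByteBuffer view = shared.duplicate();
    view.position(offset);
    byte[] dst = new byte[len];
    view.get(dst);
    return dst;
}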

One approach is to wrap your buffer in a class that handles the duplicate-before-anything calls, and maps a series of buffers sufficient to cover the entire file (mindful of the fact that you might have to read across the boundary). This is a corner case waiting to happen, and slower than it needs to be. Mmap is a beautiful abstraction and Java has seemingly sucked the fun right out of it with its 1995-vintage design decisions.

The question arises - if the Java standard library designers did it wrong the first time, can we do it right instead? Presumably they’re wrapping the underlying system call that does what we want - we should be able to unwrap the cruft and use it directly.

In fact, deep in the bowels of a non-public class, we can find mmap lurking as if it was some kind of Lovecraftian artifact. Let’s take it out to play:
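(What follows is my reconstruction of the general shape of the thing, written against the Java 7/8-era sun.nio.ch.FileChannelImpl and sun.misc.Unsafe internals - the private names involved have moved around in later JDKs.)

import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import sun.misc.Unsafe;

public class BigMap {
    public static void main(String[] args) throws Exception {
        // That Which Should Not Be Summoned
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        RandomAccessFile file = new RandomAccessFile("/tmp/bigfile.bin", "rw");
        long size = file.length();   // happily more than 2gb

        // FileChannelImpl.map0(prot, position, length) is the real mmap;
        // 1 == MAP_RW, the position must be page-aligned, and the file must
        // already be `size` bytes long or reads past EOF will SIGBUS.
        Method map0 = file.getChannel().getClass()
                .getDeclaredMethod("map0", int.class, long.class, long.class);
        map0.setAccessible(true);
        long address = (Long) map0.invoke(file.getChannel(), 1, 0L, size);

        // Raw native-byte-order reads and copies at arbitrary 64-bit offsets
        // (assuming the file really is that big).
        long firstWord = unsafe.getLong(address);
        byte farAway = unsafe.getByte(address + 3_000_000_000L);
        System.out.println(firstWord + " " + farAway);
    }
}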

Ahhh, much better. Native-byte-order integers, raw memory-copies at known offsets, reflection calls to invoke what the compiler informs us is That Which Should Not Be Summoned - it all adds up to a remarkably simple and efficient interface.

03 May 2014, 09:46

Mapping distinct objects to a dense range of integers

Let’s say you’re trying to turn distinct strings (or some other objects) into integers. This pops up all the time - for instance, if you want to analyze a social network, it’s faster if you can access your adjacency list via integer offsets into an array rather than looking up name strings in a hash table.

The desired properties are that:

  • You get a distinct integer for each input
  • The output is as dense (sequential, with no gaps) as possible, but 100% density is unnecessary
  • You can construct the output online (ie, you never need to re-number)
  • Support get-or-insert, get-or-fail, and remove operations
  • It’s wicked fast

The straightforward strategy is a read-check-generate pattern, secured with a ReadWriteLock:

final ReadWriteLock lock = new ReentrantReadWriteLock();
final Map<String, Integer> map = new HashMap<String, Integer>();
for(final String s : data) {
    lock.readLock().lock();
    try {
        if(map.containsKey(s)) continue;
    } finally {
        lock.readLock().unlock();
    }

    lock.writeLock().lock();
    try {
        if(map.containsKey(s)) continue;
        map.put(s, map.size());
    } finally {
        lock.writeLock().unlock();
    }
}

This will produce an appropriately dense, distinct mapping. Unfortunately, it fails the last couple of criteria - there’s no way to support removal, and that single write lock adds a fair amount of contention that slows things down. We can fix the first one:

public class StringMapper {
    final ReadWriteLock lock = new ReentrantReadWriteLock();
    final Map<String, Integer> map = new HashMap<String, Integer>();
    int lastIdx = 0;
    //Any collection would work here; this has the advantage of concentrating
    //"holes" at the end, and being unbounded
    final LinkedList<Integer> freelist = new LinkedList<Integer>(); 

    public void remove(String s){
        lock.writeLock().lock();
        try {
            Integer current = map.remove(s);
            if(current == null) return;
            freelist.add(current);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer get(String s){
        lock.readLock().lock();
        try {
            return map.get(s);
        } finally {
            lock.readLock().unlock();
        }
    }

    public Integer add(String s){
        Integer out = get(s);
        if(out != null) return out;

        lock.writeLock().lock();
        try {
            out = map.get(s);
            if(out != null) return out;
            out = !freelist.isEmpty() ? freelist.poll() : lastIdx++;
            map.put(s, out);
            return out;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

A little more ceremony here - just like in ye olde malloc, we have to maintain a freelist, and we have to track the last insertion point not on the freelist so we can make fresh allocations.

That write lock is still really annoying, though. We have concurrent maps and atomic integers built into the standard library, it seems like we should be able to figure this out without any of our own locking.

public class StringMapper{
    final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();
    final AtomicInteger lastIdx = new AtomicInteger(0);
    final Queue<Integer> freelist = new ConcurrentLinkedQueue<Integer>();

    public void remove(String s){
        Integer current = map.remove(s);
        if (current != null) freelist.add(current);
    }

    public Integer get(String s){
        return map.get(s);
    }

    public Integer add(String s){
        //If we're doing mostly fresh insertions, the initial check can be skipped
        Integer out = map.get(s);
        if(out != null) return out;
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        out = map.putIfAbsent(s, in);
        if(out != null) {
            freelist.add(in);
            return out;
        } else return in;
    }
}

The remove and get methods are straightforward; the add method is where the magic happens. On each attempted insert, we first check the map. Since we re-do the check later, if we’re likely to do mainly insertions (because we have a mostly distinct dataset as input) we can skip the initial one if we want. If we actually intend on inserting, we first poll from the freelist, and if it is empty, get a fresh integer from our counter variable. If we’re successfully able to insert this value into the map, we also return it; otherwise, it goes back into the freelist.

This isn’t the best mechanism if you really need 100% density (but if you really need 100% density, you can’t support removals either), because there is a race condition between getting a new value and the putIfAbsent call. In between, another thread may have gotten a later value and added it to the map, generating a “hole” at your index and forcing you to add it to the free list. However, the number of total “holes” should be roughly bounded by the number of concurrent threads you’re running. If you have, say, 16 threads simultaneously inserting the same key, in the worst case you end up with an inserted value of say, 15, and 0-14 in the freelist. Subsequent iterations check the freelist first though, so even if we continue that worst case scenario, we’ll be churning the freelist (repeatedly draining it, replacing the same elements, and adding an integer derived from the counter instead) but ending up with only 15 total gaps in our output.