18 May 2014, 21:17

Memory-mapping >2gb of data in Java


Memory-mapping is your friend if you’re doing IO. You avoid the relatively expensive system calls entailed in repeatedly calling raw read or write (or even their stdlib-based buffered counterparts) in favor of transparent access to an array of bytes - the system’s page cache handles actual disk access if and only if it is necessary. Just about the worst thing that can be said about it is that it may not be measurably faster than the system calls, given the right access pattern (for example, linearly copying an entire file) - but if it is a better abstraction for you, there is rarely a negative performance impact.
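In Java the standard entry point for this is FileChannel.map. A minimal sketch (the temp file and its contents are just illustrative):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mapdemo", ".bin");
        Files.write(path, new byte[] {1, 2, 3, 4});
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The page cache backs this buffer; no explicit read() calls needed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(buf.get(2)); // prints 3
        }
    }
}
```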

The use of the OS memory facilities can be a double-edged sword, though, in a 32-bit environment. With only 32 bits of address space, you can map at most 4gb of a file into memory (or 2gb if you're using signed integers). In contrast, you can do a raw fread() from an arbitrary offset, as long as you're not reading more than 2^32 bytes at a time.

Fortunately, we’ve had 64 bit operating systems for quite a while - depending on how consumer-grade you’re counting, since the mid-90s or mid-00s. On a modern machine you don’t have to worry about running out of address space and can map as large a file as you’re likely to encounter. That is, unless your programming environment doesn’t give you access to it.

Java standard library APIs have always been a little bit uneven, and unfortunately their mmap abstraction returns a MappedByteBuffer. All ByteBuffers are constrained (same as arrays, unfortunately) to at most 2^31-1 elements. Even worse, they’re not thread-safe. You can only copy a particular chunk of bytes by calling position(int newPosition), then get(byte[] dst). In between those calls, another thread may have moved your position somewhere else. That means you need to do a duplicate() call before everything to get an independently-positionable, shared-content “copy”.
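The duplicate()-per-read pattern looks roughly like this (class and method names are mine) - each reader positions its own cheap view, so the shared buffer’s position is never mutated:

```java
import java.nio.ByteBuffer;

public class DuplicateDemo {
    // Copy `len` bytes starting at `offset` without touching the shared
    // buffer's position - duplicate() shares content but gives the caller
    // an independent position/limit.
    static byte[] copyAt(ByteBuffer shared, int offset, int len) {
        ByteBuffer view = shared.duplicate(); // cheap: no data is copied
        view.position(offset);
        byte[] dst = new byte[len];
        view.get(dst);
        return dst;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {10, 20, 30, 40, 50});
        byte[] chunk = copyAt(buf, 1, 3);
        System.out.println(chunk[0] + " " + chunk[2]); // prints 20 40
        System.out.println(buf.position());            // prints 0 - untouched
    }
}
```

Note that duplicate() itself only reads the source buffer, so concurrent readers each duplicating their own view don’t race with one another.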

One approach is to wrap your buffer in a class that handles the duplicate-before-anything calls and maps a series of buffers sufficient to cover the entire file (mindful of the fact that you might have to read across a buffer boundary). This is a corner case waiting to happen, and slower than it needs to be. Mmap is a beautiful abstraction, and Java has seemingly sucked the fun right out of it with its 1995-vintage design decisions.
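Such a wrapper might look roughly like this - a hypothetical class, not from any library, that covers the file with a series of read-only mappings and stitches together reads that cross a boundary (the tiny chunk size in main exists only to exercise that boundary case; in practice you’d use something like 1gb per mapping):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BigMappedFile {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;
    private final long size;

    public BigMappedFile(Path path, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            size = ch.size();
            int n = (int) ((size + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long off = i * chunkSize;
                // Each mapping covers at most chunkSize bytes; mappings stay
                // valid after the channel is closed.
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, off,
                                   Math.min(chunkSize, size - off));
            }
        }
    }

    // Copy len bytes starting at an arbitrary long offset into dst,
    // possibly spanning two (or more) mappings.
    public void get(long pos, byte[] dst, int dstOff, int len) {
        while (len > 0) {
            ByteBuffer view = chunks[(int) (pos / chunkSize)].duplicate();
            view.position((int) (pos % chunkSize));
            int n = Math.min(len, view.remaining());
            view.get(dst, dstOff, n);
            pos += n; dstOff += n; len -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("big", ".bin");
        Files.write(p, new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        BigMappedFile f = new BigMappedFile(p, 4); // tiny chunks force a boundary
        byte[] out = new byte[4];
        f.get(2, out, 0, 4);                       // spans chunks 0 and 1
        System.out.println(out[0] + " " + out[3]); // prints 2 5
    }
}
```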

The question arises - if the Java standard library designers did it wrong the first time, can we do it right instead? Presumably they’re wrapping the underlying system call that does what we want - we should be able to unwrap the cruft and use it directly.

In fact, deep in the bowels of a non-public class, we can find mmap lurking as if it were some kind of Lovecraftian artifact. Let’s take it out to play:
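The original listing isn’t reproduced here, but the technique looks roughly like the sketch below as it worked in the JDK 8 era. Everything in it is internal API: the private native map0(int prot, long position, long length) and static unmap0(long address, long length) on sun.nio.ch.FileChannelImpl, plus sun.misc.Unsafe for the raw reads. Later JDKs changed these signatures and lock the package down (you’d need --add-opens at minimum), so treat this as version-specific, and remember that positions passed to map0 must be page-aligned:

```java
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

import sun.misc.Unsafe;

public class RawMmap {
    public static void main(String[] args) throws Exception {
        // The standard JDK-8-era incantation for getting at Unsafe.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        Path p = Files.createTempFile("raw", ".bin");
        Files.write(p, new byte[] {42, 43, 44, 45});

        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r")) {
            FileChannel ch = raf.getChannel();
            // map0 returns the raw mapped address; prot 0 means read-only.
            Method map0 = ch.getClass()
                            .getDeclaredMethod("map0", int.class, long.class, long.class);
            map0.setAccessible(true);
            long addr = (long) map0.invoke(ch, 0, 0L, ch.size());

            // Raw reads at arbitrary long offsets: no 2^31-1 ceiling, no
            // shared position, native byte order.
            System.out.println(unsafe.getByte(addr + 1));

            // unmap0 is static on JDK 8, hence the null receiver.
            Method unmap0 = ch.getClass()
                              .getDeclaredMethod("unmap0", long.class, long.class);
            unmap0.setAccessible(true);
            unmap0.invoke(null, addr, ch.size());
        }
    }
}
```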

Ahhh, much better. Native-byte-order integers, raw memory-copies at known offsets, reflection calls to invoke what the compiler informs us is That Which Should Not Be Summoned - it all adds up to a remarkably simple and efficient interface.