03 May 2014, 09:46

Mapping distinct objects to a dense range of integers


Let’s say you’re trying to turn distinct strings (or some other objects) into integers. This pops up all the time - for instance, if you want to analyze a social network, it’s faster if you can access your adjacency list via integer offsets into an array rather than looking up name strings in a hash table.

The desired properties are that:

  • You get a distinct integer for each input
  • The output is as dense (sequential, with no gaps) as possible, but 100% density is unnecessary
  • You can construct the output online (i.e., you never need to re-number)
  • It supports get-or-insert, get-or-fail, and remove operations
  • It’s wicked fast

The straightforward strategy is a read-check-generate pattern, secured with a ReadWriteLock:

final ReadWriteLock lock = new ReentrantReadWriteLock();
final Map<String, Integer> map = new HashMap<String, Integer>();
for(final String s : data) {
    lock.readLock().lock();
    try {
        if(map.containsKey(s)) continue;
    } finally {
        lock.readLock().unlock();
    }

    lock.writeLock().lock();
    try {
        if(map.containsKey(s)) continue;
        map.put(s, map.size());
    } finally {
        lock.writeLock().unlock();
    }
}
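As a quick sanity check that this pattern really does produce a dense mapping, here's a self-contained sketch (the class name and sample data are illustrative):

```java
import java.util.*;
import java.util.concurrent.locks.*;

public class DenseMapDemo {
    public static void main(String[] args) {
        final ReadWriteLock lock = new ReentrantReadWriteLock();
        final Map<String, Integer> map = new HashMap<>();
        String[] data = {"alice", "bob", "alice", "carol", "bob"};
        for (final String s : data) {
            lock.readLock().lock();
            try {
                if (map.containsKey(s)) continue;
            } finally {
                lock.readLock().unlock();
            }
            lock.writeLock().lock();
            try {
                // Re-check: another thread may have inserted between locks
                if (!map.containsKey(s)) map.put(s, map.size());
            } finally {
                lock.writeLock().unlock();
            }
        }
        // Three distinct strings map to exactly the values 0, 1, 2
        Set<Integer> values = new TreeSet<>(map.values());
        System.out.println(map.size() + " " + values);
    }
}
```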

This will produce an appropriately dense, distinct mapping. Unfortunately, it fails the last couple of criteria - there’s no way to support removal, and that single write lock adds a fair amount of contention that slows things down. We can fix the first one:

public class StringMapper {
    final ReadWriteLock lock = new ReentrantReadWriteLock();
    final Map<String, Integer> map = new HashMap<String, Integer>();
    int lastIdx = 0;
    //Any collection would work here; this has the advantage of concentrating
    //"holes" at the end, and being unbounded
    final LinkedList<Integer> freelist = new LinkedList<Integer>();

    public void remove(String s){
        lock.writeLock().lock();
        try {
            Integer current = map.remove(s);
            if(current == null) return;
            freelist.add(current);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer get(String s){
        lock.readLock().lock();
        try {
            return map.get(s);
        } finally {
            lock.readLock().unlock();
        }
    }

    public Integer add(String s){
        Integer out = get(s);
        if(out != null) return out;

        lock.writeLock().lock();
        try {
            out = map.get(s);
            if(out != null) return out;
            out = !freelist.isEmpty() ? freelist.poll() : lastIdx++;
            map.put(s, out);
            return out;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

A little more ceremony here - just like in ye olde malloc, we have to maintain a freelist of reusable indices, and track the last insertion point beyond the freelist so we can make fresh allocations.
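To see the freelist behavior in isolation, here's a single-threaded sketch (not the full locked class above - the names and helper are just for illustration): removing a key frees its index, and the next insertion reuses it before allocating a fresh one.

```java
import java.util.*;

public class FreelistDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        Deque<Integer> freelist = new ArrayDeque<>();
        int[] lastIdx = {0};
        // get-or-insert: prefer a recycled index, else allocate a fresh one
        java.util.function.Function<String, Integer> add = s ->
            map.computeIfAbsent(s, k ->
                !freelist.isEmpty() ? freelist.poll() : lastIdx[0]++);

        int a = add.apply("a");          // fresh allocation: 0
        int b = add.apply("b");          // fresh allocation: 1
        freelist.add(map.remove("a"));   // "a" removed; index 0 is freed
        int c = add.apply("c");          // reuses 0 from the freelist
        System.out.println(a + " " + b + " " + c);
    }
}
```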

That write lock is still really annoying, though. We have concurrent maps and atomic integers built into the standard library, so it seems like we should be able to figure this out without any of our own locking.

public class StringMapper{
    final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();
    final AtomicInteger lastIdx = new AtomicInteger(0);
    final Queue<Integer> freelist = new ConcurrentLinkedQueue<Integer>();

    public void remove(String s){
        Integer current = map.remove(s);
        if (current != null) freelist.add(current);
    }

    public Integer get(String s){
        return map.get(s);
    }

    public Integer add(String s){
        //If we're doing mostly fresh insertions, the initial check can be skipped
        Integer out = map.get(s);
        if(out != null) return out;
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        out = map.putIfAbsent(s, in);
        if(out != null) {
            freelist.add(in);
            return out;
        } else return in;
    }
}

The remove and get methods are straightforward; the add method is where the magic happens. On each attempted insert, we first check the map. Since we re-do the check later, if we’re likely to be doing mainly insertions (because we have a mostly distinct dataset as input), we can skip the initial one if we want. If we actually intend to insert, we first poll from the freelist, and if it is empty, get a fresh integer from our counter variable. If we’re successfully able to insert this value into the map, we also return it; otherwise, it goes back into the freelist.

This isn’t the best mechanism if you really need 100% density (but if you really need 100% density, you can’t support removals either), because there is a race condition between getting a new value and the putIfAbsent call. In between, another thread may have gotten a later value and added it to the map, generating a “hole” at your index and forcing you to add it to the freelist. However, the total number of “holes” should be roughly bounded by the number of concurrent threads you’re running. If you have, say, 16 threads simultaneously inserting the same key, in the worst case you end up with an inserted value of, say, 15, and 0-14 in the freelist. Subsequent iterations check the freelist first, though, so even if we continue that worst-case scenario, we’ll be churning the freelist (repeatedly draining it, replacing the same elements, and adding an integer derived from the counter instead) but ending up with only 15 total gaps in our output.
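That bound is easy to exercise: every value handed out by the counter ends up either in the map or on the freelist, so after the dust settles, counter = keys + holes. Here's a condensed stress test of the lock-free mapper (thread and key counts are arbitrary; the initial map check is skipped, as the post notes it can be):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class HoleBoundDemo {
    // Condensed version of the lock-free mapper above
    static final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<>();
    static final AtomicInteger lastIdx = new AtomicInteger(0);
    static final Queue<Integer> freelist = new ConcurrentLinkedQueue<>();

    static Integer add(String s) {
        Integer in = freelist.poll();
        if (in == null) in = lastIdx.getAndIncrement();
        Integer out = map.putIfAbsent(s, in);
        if (out != null) { freelist.add(in); return out; } // lost the race
        return in;
    }

    public static void main(String[] args) throws Exception {
        int threads = 8, keys = 10_000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++)
            pool.submit(() -> { for (int i = 0; i < keys; i++) add("key" + i); });
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Every key got a distinct id...
        Set<Integer> ids = new HashSet<>(map.values());
        System.out.println(ids.size() == keys);
        // ...and every counter value is accounted for: counter = keys + holes
        System.out.println(lastIdx.get() == keys + freelist.size());
    }
}
```

Note that the holes left on the freelist aren't wasted - they're the first candidates for the next round of insertions.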