18 Oct 2013, 19:57

The genius and folly of MongoDB

MongoDB is easy to make fun of. The global write lock (now just a database-level write lock, woo). The non-durable un-verifiable writes. The posts about how to scale to Big Data, where Big Data is 100gb.

It makes more sense when you look at how the underlying storage layer is implemented. Basically, MongoDB consists of a collection of mmap’d linked lists of BSON documents, with dead-simple B-tree indexing and basic journaling as the storage durability mechanism (issues with what the driver considers a “durable write”, before data necessarily hits the storage layer, are something others have dealt with in depth). Eventually writes get fsync’d to disk by the OS, and reads result in the page with the data being loaded into memory by the OS.

All the speed that was initially touted as the killer advantage is a result of using the page cache. When you realize “it’s just mmap”, all the BS about optimizing your working set to fit within RAM, the cratering of performance once you hit disk, the fragmentation if you routinely delete or grow records, etc., make perfect sense. The OS doesn’t know you’re running a database; it just knows you want to mmap some stuff and gives it its best shot. Fortunately, the algorithm was written by some very smart people, so it tends to work pretty well up front (as long as you’re always hitting the page cache) - but when the OS schedules writes it doesn’t know about your storage layout, or even the difference between your indexes and your data. It certainly can’t infer what data to keep in cache, or preload, because it doesn’t know what or where your data is - it’s just some bytes somewhere you wanted for some reason.

But actually, that’s the Tao-like genius of MongoDB - having absolutely nothing new. Most databases are built with some killer idea: the consistency protocol for Cassandra, the crazy data structures of Redis, or the data-processing abilities of Hadoop. MongoDB has mmap, and by “has”, I mean “uses” (but hey, possession is nine tenths of the law). Not having to design your own caching algorithms or write strategies, and using the simplest possible implementations of everything else, lets you get to market quickly and focus on marketing your benchmarks, consulting to bring your customers up to Web Scale, responding to haters, or learning about concurrency. You can get a fair amount of traction before your customers realize there’s very little “there” there, and by that time you might have either cashed out or written an actual database (they’re starting down that road with their page-fault prediction algos, and good for them). In any event, your customers are more or less locked in, as they’ve jumped through so many hoops to accommodate your design decisions (“oh, long field names take up more space in every record? I guess we’ll just use the first N Unicode characters and map on the application layer”). It’s no coincidence that you’re being compared to Oracle and IBM.

Like I said, MongoDB is easy to make fun of.

There’s exactly one situation where you should look at MongoDB. Focusing on the storage engine and ignoring all the issues with their broader durability strategy, the killer application is something like user data for an online game: a situation where you have a fairly consistent working set for a given chunk of time, it’s small relative to the whole database, reads and writes are in the same working set, you have lots of reads relative to writes, clients do a lot of the computation for you, you’d like schema flexibility… You could jam that into a relational model and use something like hstore or JSON columns to fill in the gaps, or store documents as blobs / text in something like HBase or Cassandra, but you probably won’t have that bad of a time with MongoDB (until you get a surge of traffic; it doesn’t exactly degrade gracefully).

But in that case, it also wouldn’t be crazy to pull a Viaweb and store it on the file system (ZFS has a heckuva caching strategy, and you know when it accepts a write). It’s obvious and uninformative to say “it depends” regardless of the question; of course some things are generally better than others.

Being able to walk with them doesn’t change the fact that as of right now, MongoDB is clown shoes.

05 Oct 2013, 20:52

I've got 99 features, give me a random one

Say you have a large array of Stuff, and you need to get a random element. In fact, you need to get many random non-duplicated elements - i.e., sampling without replacement. And since we’re living in the future with quad-core supercomputers in our pants, it should be threadsafe and scalable.

The simple thing to do is to shuffle the elements, and distribute them to the callers. You can do this by shuffling with a lock on the array, and then having each caller use AtomicInteger.getAndIncrement() to get a unique index to take. You could even do this online, since shuffling is an incremental process, and distribute the elements with a concurrent queue.
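For concreteness, here is a minimal Clojure sketch of that approach (the names are illustrative, and it shuffles an immutable copy up front instead of locking a shared array):

(import '(java.util.concurrent.atomic AtomicInteger))

(defn make-sampler
  "Shuffle once up front, then hand out unique indices via an atomic counter.
   Each call to the returned function yields a fresh element (from any thread),
   or nil once the pool is exhausted."
  [stuff]
  (let [shuffled (shuffle stuff)        ; clojure.core/shuffle returns a vector
        idx      (AtomicInteger. 0)]
    (fn []
      (let [i (.getAndIncrement idx)]
        (when (< i (count shuffled))
          (nth shuffled i))))))

Calling (make-sampler stuff) once gives you a single function that any number of threads can then call to pull distinct elements.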

But if you need some small-but-unknown number of elements, this isn’t the best fit. You either shuffle more elements than you end up needing, or, if you didn’t shuffle enough initially, you have to hold a potentially high-contention lock (on the array itself or on a queue) to isolate the array while you select more random elements.

Instead, you can have each caller select a random element, and then check if it has been selected before - either by maintaining a common threadsafe set of selected elements (e.g. via a ConcurrentMap) or, even better, by marking selected elements themselves. As you reach saturation, you’re forced to re-check many times and it becomes more efficient to shuffle-and-iterate. But, for small subsets, it’s very efficient and avoids having to think too much about synchronizing between shuffling and selection.
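Here is a minimal Clojure sketch of the pick-and-check version (assuming Clojure 1.9+ for swap-vals!), tracking taken indices in an atom-held set rather than marking the elements themselves:

(defn make-random-taker
  "Pick a random index and retry on duplicates, recording what has been handed
   out in a shared set. Cheap for small subsets of a large collection; degrades
   as the set saturates. Returns nil once everything has been taken."
  [stuff]
  (let [v     (vec stuff)
        taken (atom #{})]
    (fn []
      (loop []
        (if (>= (count @taken) (count v))
          nil
          (let [i          (rand-int (count v))
                [old _new] (swap-vals! taken conj i)]
            (if (contains? old i)
              (recur)              ; already claimed by someone - try again
              (nth v i))))))))

swap-vals! hands back the before/after pair, so claiming an index is a single atomic step rather than a separate check-then-act.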

This kind of usage pattern is common in certain machine learning techniques that use random feature selection - for instance random forests. If you make your sampling parallelizable, you can see some pretty nice speedups.

07 Aug 2013, 07:58

Persistent, primitive-based hashmaps for Clojure

The native Clojure maps and vectors are based on an interesting data structure, the hash-array mapped trie. Essentially it is a tree with 32x fan-out, and some trickery deciding which node to descend into. Their major advantage is that they are persistent, and support structural sharing as they are modified - add an entry, and you usually have the same tree, with log32(n) interior nodes changed and a different 32-wide Object array at a leaf.

However, these kinds of data structures have disadvantages with memory consumption. Clojure maps take roughly twice the space of mutable Java maps - and that’s on top of the Object-wrapping penalty associated with all generic Java container classes. If we’re storing an int-to-double mapping, ideally we’d be able to store that on top of primitive int[] and double[] arrays, but Java forces us to use Integer[] and Double[] instead, which are much less efficient and slower (due to pointer access costs) to boot.

The traditional solution is to write a bunch of specializations of your map and set types, with primitive backing arrays. Trove and fastutil use scripts to generate the source for all combinations of those specializations, much like the C preprocessor or C++ templates. We’ve got a Lisp handy, so we could do the same thing with macros.

But Trove and fastutil are already pretty optimized; there’s no need to reinvent the wheel if we just need primitive hash maps and sets. What we would like to do is stick to the Clojure philosophy and have primitive-backed maps and sets that are not only compact, but persistent as well, with the same sort of structural sharing we get with the built-in data structures.

For straight primitive vectors, this has already been done with the Clojure vector-of, which is just like the normal persistent vector, but with primitive arrays on the leaves. We can’t take the same approach to Clojure persistent maps, because in that implementation the leaves need to be able to contain either the entry, or an additional tree node - the only way to do that is by casting Objects. We can build a hash table on top of the vector-of implementation itself though - much like you build a hash table on top of primitive arrays, but with a few cool twists.

In a vector-of, the elements themselves are stored at the leaves in various 32-long primitive arrays. We definitely want to take advantage of the fact that that’s only a few cache lines long, and wicked fast to scan - this implies that within one of those arrays, we want to probe linearly. Since we incur pointer access costs whenever we go to a different bottom-level array, we want to avoid that as much as possible and guarantee we will find whatever element we’re searching for in a particular region.

There is a hashing strategy that can give us that guarantee - hopscotch hashing. Lookups and deletions are guaranteed to happen inside a particular neighborhood, which we set to 32 elements in order to fit with the structure of the Clojure data structures. Insertions are trickier, and require recursively back-shifting to maintain the guarantee of all elements being within a neighborhood of their optimal insertion point.
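To make that neighborhood guarantee concrete, here is a minimal Clojure sketch of a lookup over parallel key/value vector-ofs - the names and layout are illustrative assumptions rather than SOAC’s actual code, and insertion with its back-shifting is left out:

(defn neighborhood-get
  "Look up k in parallel key/value vector-ofs ks and vs. Assumes the capacity
   is a power of two (>= 32) and that the hopscotch invariant holds: a stored
   key always lives within the 32 slots starting at its home bucket."
  [ks vs k not-found]
  (let [cap  (count ks)
        mask (dec cap)
        home (bit-and (hash k) mask)]
    (loop [i 0]
      (if (= i 32)
        not-found                                ; outside the neighborhood => absent
        (let [slot (bit-and (+ home i) mask)]    ; wrap around the table
          (if (= (nth ks slot) k)
            (nth vs slot)
            (recur (inc i))))))))

A hit or a definitive miss costs at most one 32-slot scan, which in the vector-of layout touches at most two adjacent leaf arrays.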

I’ve implemented a basic version of hopscotch hashing in SOAC, without the bitmap tricks to find the next free/filled element - eventually those will be implemented for testing, but because we’re doing a persistent version, modifying the bitmap may add more time than it gives us back.

The primary benefit versus Clojure maps and sets is memory consumption - in my tests a primitive-based set takes up just 17% of the space a Clojure PersistentHashSet uses. Interestingly, insertion time was on par between the two, even before I applied some optimizations to the scan for free positions, and even though the primitive version has to endure periodic rehashing. The Clojure version spent enough time in GC dealing with a rapidly filling heap that it ended up being only ~5% faster. Granted, you won’t usually be using a single data structure that takes up over half of your heap space (although it’s common for the kind of machine learning work I do), but it speaks to the many ways that compact data structures help you in the long run.

But there are also substantial benefits versus “raw” primitive-array-based hash tables. If one is, say, partitioning a dataset in many different ways (e.g., when displaying pivot tables over a shared dataset to many users, or fitting a decision tree) one of your main limitations is the ability of the GC to deal with all the garbage you generate, as you copy your dataset, modify it, analyze it for just long enough to put it into the “survivor” gen, and throw it away. With structural sharing, your “copies” mostly avoid taking additional space for identical data, and your garbage consists much more of transient edits.

23 Jun 2013, 11:12

Hosted

I’m moving my old posts here from scalpelcannon.blogspot.com. Gradually things will appear here.

11 Feb 2013, 11:03

Is the 'daily deals' space sustainable?

I have some amount of experience working in the “daily deals” space on the technical side, but with a lot of interaction with the “business end” (and boy do I ever hate that distinction). Here’s my MBA-style analysis of the sector, devoid of “secrets” but based heavily on the patterns I saw.

  • There are roughly two kinds of businesses, for the purposes of the deals sector: those with marginal cost approaching zero, which tend to be focused on the service sector (hair salons, Lasik, lawn care, hotel rooms), and those with relatively high marginal costs (restaurants, actual physical products).

  • It is incredibly easy to sell marginal customers to zero-marginal-cost businesses and they will benefit greatly. It’s no different than an airline or stadium selling cheap seats at the last moment to fill up to capacity. For positive-marginal-cost businesses, when the direct “loss” on the deal takes out most of their margin (mid-tier restaurants are the main example here: if a merchant runs a $10-for-$20 voucher, of which they actually receive only $5 after the deal provider’s cut, they have to make a $15 profit per check ex ante to break even) you have to rely on fuzzy stories about “new repeat customers”, redemption rates below 100%, and sales of ancillaries with high margins (dessert, booze), possibly due to price elasticity. These stories might even be true, but they’re much more difficult to quantify and spin into a sale (and more importantly, into a re-run a few months later without much additional sales effort).

  • The press focuses on restaurants, but more or less ignores the fact that the most profitable deals are in the health-and-beauty space, which sells more or less exclusively to women. There’s a heavily gendered component.

  • The deals sector relies totally on a large and effective sales force to create and maintain relationships with merchants. The cost-per-runnable-deal is (or should be) their primary cost metric; overhead is (or should be) minimal compared to revenues, so it’s all about whether each deal is generating enough in revenue to pay for the cost of signing it. Anything you can do to make your sales force more productive is probably a Good Idea. One advantage of this is that your capital requirements are pretty low (and you can self-finance just about everything, see below), and you can ramp up incrementally.

  • One huge source of strength is that you get to keep the entire revenue from the deal for many weeks, and remit the merchant’s cut later. That gives you a large amount of cashflow to play with; essentially a “float” like an insurance company.

  • The primary sales channel (to consumers) is email. That spins you into a few interesting areas: you (theoretically) have the ability to tailor those emails very precisely and eke out some incremental gains. You also have the ability to drive revenue via incentivized offers (e.g., “an additional $10 off any offer”). This lets you combine additional price discrimination for your most price-sensitive users (on top of what’s intrinsic to the deal itself) with having customers essentially float you an additional short-term loan. Every company does this to some extent, but because email is already your primary communications channel and you’re already selecting for the price-sensitive, you can ride that train a lot more effectively than, say, Best Buy e/mailing out coupons. If you’re an analyst (or the board setting management incentives), though, you really want to watch out for the company trying to game the system each quarter by messing with customer incentives and email composition, possibly at the expense of your margins. “Revenue targets” are less meaningful than elsewhere.

  • “Deals” are fundamentally a middle-class (broadly defined) product: people with enough money to go and do things, but not so much that they’re completely price insensitive at the $20-savings margin. It’s also “middle class” on the sales side: businesses that have both excess capacity of one sort or another, and some ability/desire to try to price-discriminate downwards. Foreign expansion for whatever reason seems to have ignored that.

  • The competitive “moat” is 1) the need for a large sales force to get a sufficient variety of deals, and 2) the need for a large enough customer list willing to be mailed by you at a sufficient volume and purchasing rate to pay for the sales force and overhead. Considering the limited willingness of customers to be mailed, and the limited pool of merchants willing to be sold, there’s a limit to the number of deals providers that can sustainably exist. My hunch is that it’s somewhere between 1.5 and 2.5 (the fractional company being either oligopoly profits or a regional / sector-focused competitor).

  • The email list (and their associated credit card numbers) plus the ability to email them whatever you want constitutes a very valuable capital asset. It’s hard to build such a list (you can buy it incrementally via advertising, but you need to keep them around by having stuff to send them for the list to grow on net, and have enough of an underlying business to generate organic growth) and actually technically challenging to mail them consistently. It’s difficult to value it on paper, but bounding it somewhere around acquisition costs, and at minimum what you could get by going pure black-hat-viagra-spam with it, should give you an estimate. If all else fails, I’d think that one could “cobrand” with a similar business (PriceLine, Fab.com, whatever) and essentially rent it out.

In many ways the deals sector is the perfect target for an 80s-style leveraged buyout: a business with high underlying cash flows, low capital requirements, some capital assets, a relatively price-insensitive core market (the zero-marginal-cost businesses), and excessive overhead and some unprofitable markets (at the moment) due to overexpansion at the peak. It’s also probable that there are some investors who bought in at the peak and are looking to cut their losses. I do think there is a very sustainable business underneath; “email marketing for local businesses” doesn’t sound sexy (and doesn’t really qualify as a “technology business”) but it’s an example of a profitable niche. In the end it will come down to execution, but a good management team should be able to make it happen.

03 Feb 2013, 20:27

Space efficiency of Clojure's collections

How much space do Clojure’s collections take up compared to their mutable counterparts? Let’s see, using the SizeOf class.

(SizeOf/deepSize (long-array (range 1024)))
-> 8208

(SizeOf/deepSize (into (vector-of :long) (range 1024)))
-> 9552

(SizeOf/deepSize (vec (range 1024)))
-> 26192

(import '(java.util ArrayList HashMap))

(SizeOf/deepSize (ArrayList. (range 1024)))
-> 24624

(let [a (HashMap. 1024)]
     (dotimes [i 1024] (.put a i i))
     (SizeOf/deepSize a))
-> 82040

(SizeOf/deepSize (into {} (for [i (range 1024)] [i i])))
-> 166008

Unsurprisingly, primitives are great for space efficiency - and actually, Clojure’s vector-of has minimal (~16%) overhead over a raw array, while having a much nicer interface. Clojure hashmaps, though, are enormous - 10x overhead compared to two raw arrays.

This makes me think that SOAC needs a semi-traditional hash table built on top of primitive vector-ofs, rather than the trees it currently uses or raw primitive arrays a la Trove or fastutil. That would succeed in maintaining persistence while being much smaller.

29 Dec 2012, 14:28

Sparsemas: A sparse matrix implementation that supports iteration & expansion

This Christmas I wrote up an implementation of what amounts to a sparse matrix, with a few additional features that are handy for the areas I work in. It uses essentially the same approach as the Apache Commons sparse matrix implementation (and most others I’ve seen): it handles the (row, column) -> value mapping by transforming the (row, column) into a single index, and then using an efficient primitive-based hashmap to match that index to the value.

The first main difference is that my mapping uses a kind of spiral going out from the origin that isn’t dependent on the number of rows and columns. Visually, the mapping goes like this:

17 19 21 23 24
10 12 14 15 22
5  7  8  13 20
2  3  6  11 18
0  1  4  9  16

And so on (row 0 is the bottom row, column 0 the leftmost). If you look at the diagonal, you see the squares of successive integers minus one. Going down a column from the diagonal element, the index decreases by two at each step; going left along a row from the diagonal, it drops by one and then decreases by two at each step, ending at the row’s square plus one. The reverse mapping (from index to (row, column)) is also trivial. The advantage of this is that if you add another row or column, you’re not forced to recalculate indexes. Theoretically, this even allows you to use a persistent tree for the map implementation, and so have a persistent implementation of a sparse matrix with structural sharing.
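Reading the formula off the grid: cell (r, c) maps to c^2 + 2r when c >= r, and to r^2 + 2c + 1 when r > c. Here is a minimal Clojure sketch of both directions (the function names are just for illustration):

(defn rc->index
  "Spiral mapping from (row, column) to a single index; matches the grid above."
  [r c]
  (if (>= c r)
    (+ (* c c) (* 2 r))
    (+ (* r r) (* 2 c) 1)))

(defn index->rc
  "Inverse mapping: recover [row column] from the spiral index."
  [i]
  (let [s0 (long (Math/sqrt (double i)))
        s  (cond (> (* s0 s0) i)              (dec s0)  ; guard against
                 (<= (* (inc s0) (inc s0)) i) (inc s0)  ; floating-point rounding
                 :else                        s0)
        d  (- i (* s s))]
    (if (even? d)
      [(quot d 2) s]            ; c >= r case: index was c^2 + 2r with c = s
      [s (quot (dec d) 2)])))   ; r > c  case: index was r^2 + 2c + 1 with r = s

;; (rc->index 3 2) => 14, and (index->rc 14) => [3 2], matching the grid above.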

The second difference, which is totally separable from the first, is that each “node” in the matrix has a pointer to the next filled element by row and column. This essentially defines a linked list over filled elements by row and column; it does come at the expense of additional space, but for my purposes it’s necessary to be able to find which elements are filled in any particular row or column. We store the first node by row and column in an additional int -> int mapping (and implicitly those also define a list of which rows & columns have any filled elements). For maximal insert speed, an insert just adds an element to the head of the list, so initially, it’s traversed in reverse insertion order. I do however have functionality to sort the lists, so I can easily calculate overlap in filled rows between columns, or vice versa.
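As an illustration of the row/column linked lists (again with made-up names, not the actual Sparsemas internals), walking the filled entries of one row looks something like this:

(defn row-values
  "Walk the linked list of filled entries for row r. Assumes entries maps a
   spiral index to {:value v, :next-in-row idx-or-nil, :next-in-col idx-or-nil},
   and row-heads maps a row id to the spiral index of its first filled entry.
   Values come back in reverse insertion order, as described above."
  [{:keys [entries row-heads]} r]
  (loop [idx (get row-heads r)
         acc []]
    (if (nil? idx)
      acc
      (let [{:keys [value next-in-row]} (get entries idx)]
        (recur next-in-row (conj acc value))))))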

Limitations of this implementation are the higher space overhead of storing the linked lists over elements, and the fact that rows and columns can only (safely) be addressed by non-negative integers. Currently, it uses an object to store the (rating, next_item_pointer, next_person_pointer) tuple, which adds 8 bytes of object overhead per entry. In a future implementation, it could use the underlying hash algorithm directly to handle indexing into 3 parallel arrays of ratings, item pointers, and person pointers.

So what is this good for, and why did I feel the need to create it in the first place?

If you’re storing a correspondence between users and items, of the form (user, item) -> preference or interaction flag, a sparse matrix seems like a natural representation (besides which many recommendation algorithms explicitly model it as a sparse matrix with a goal of imputing the missing elements, so being able to do matrix operations is handy). However, the number of items and users is unbounded, and if you want an online representation that doesn’t have to be copied into a bigger container each time you add a new maximum item or person ID, you need consistent assignment of (person, item) -> index for insertions and lookups.

You also obviously need a way to figure out what elements are filled, without having to traverse the thing.

And, for simple similarity-based models that rely on consistency of ratings between two people or items, you need to be able to quickly find the items rated by a particular user or vice versa (and if they can be stored sorted for quick overlap calculation so much the better).

The traditional way to do it, which Apache Mahout does with its in-memory representation, is to have a mapping from person ID to two parallel arrays of item IDs and ratings, and the reverse mapping from item ID to two arrays of person IDs and ratings. That has reasonable space efficiency, but a) has to store the data twice to handle mappings in both directions, and b) doesn’t deal that well with insertions (or especially removals) which usually require you to traverse arrays scanning for elements, and copy the array en masse to modify it (or deal with having an “unfilled” magic value, buffering at the end of your arrays, etc.). This implementation, despite the overhead, is still fast, reasonably efficient, and supports better online behavior.

30 Oct 2012, 12:11

DataGotham - Cold Start My Heart

My talk from DataGotham about recommender systems, the new item problem, and novelty, is available here.

01 Sep 2012, 17:11

Distinctifying a sequence in Clojure

One of the things you run into writing a lot of performance-sensitive Clojure is the performance overhead of traversing sequences.

Here are three approaches to returning the distinct elements of a sequence, as a vector:

(defn unique-vec1
  "Straight-up Clojure sequence traversal"
  [s]
  (vec (set s)))

(defn unique-vec2
  "Conj into transient set and vectorize"
  [s]
  (loop [remaining s
         seen (transient #{})]
    (if (empty? remaining) (vec (persistent! seen))
      (recur (next remaining) (conj! seen (first remaining))))))

(defn unique-vec3
  "Use set as distinctifier, but conj into transient vector"
  [s]
  (loop [remaining s
         seen #{}
         out (transient [])]
    (if (empty? remaining) (persistent! out)
      (let [this (first remaining)]
        (if (contains? seen this) 
          (recur (next remaining) seen out)
          (recur (next remaining) (conj seen this) (conj! out this)))))))

The first one is the most “idiomatic” in the sense of conciseness and using what Clojure gives you. You can rely on the fact that a Clojure set can be treated as a seq (although a set is not itself a seq), and that a seq can be converted into a vector.

Unfortunately if you benchmark it, you’ll find that you spend a lot of time traversing the sequences. There’s no point in generating persistent data structures for purely under-the-hood operations; CPUs are fundamentally machines for mutating data and you should use their capabilities when you’re able to. So, let’s use the idiomatic Clojure way of mutating - transients.

Now we have a second decision to make - we can either use a set as both a distinctifier and an accumulator (unique-vec2), or we can use it purely to check for distinctness, and accumulate the vector elsewhere (unique-vec3). This is a purely benchmark-driven decision, but it turns out that depending on what you’re distinctifying, it could be faster either way.

The tradeoff you’re making here is that unique-vec2 has a tighter inner loop (accessing only one data structure rather than 2) at the expense of a slower final step (you have to persist the set, convert it to a sequence, and traverse it). So, you’re better off using it for shorter sequences, and unique-vec3 for longer or more-distinct ones.

The same pattern applies in raw Java, of course, but the thing I like about Clojure is the ability to use quasi-imperative code with high-level abstractions. You now have a pretty fast, generic algorithm that operates over arbitrary streams, and you could make it faster by using a standard mutable HashSet for checking uniqueness, or possibly using multiple threads and an atom to distinctify in parallel. These modifications require very little additional code and they’ll still be compatible with whatever data you’d like to process.
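For instance, a hypothetical unique-vec4 that swaps the persistent set for a mutable java.util.HashSet (single-threaded use only) might look like:

(defn unique-vec4
  "Like unique-vec3, but uses a mutable java.util.HashSet for the seen-check.
   Not thread-safe; purely an under-the-hood optimization."
  [s]
  (let [seen (java.util.HashSet.)]
    (loop [remaining s
           out (transient [])]
      (if (empty? remaining)
        (persistent! out)
        (let [this (first remaining)]
          (if (.add seen this)               ; .add returns false on duplicates
            (recur (next remaining) (conj! out this))
            (recur (next remaining) out)))))))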

Addendum: I probably should’ve mentioned Clojure’s built-in “distinct” function. It takes essentially the approach of unique-vec3, except that it uses a persistent set rather than a transient one, and builds a lazy sequence rather than a vector. Additionally, (vec (distinct coll)) is actually slower than (vec (set coll)) or the other 2 approaches, presumably due to the additional overhead of laziness.

25 Jul 2012, 12:00

Poisson regression - it's more useful than you think

Everyone knows about logistic regression. It’s the AK47 of predictive models - cheap (to fit), available everywhere, manufacturable yourself with some basic tools, and pretty darn effective at what it does, as long as you don’t ask for the latest and greatest. You can hand it to a barely-literate junior analyst and send them to the front, and they’ll be slaying problems in no time.

But as your positive examples become sparser, and your data gets bigger, logistic regression becomes less of a good fit. Particularly because, depending on your application, you might not actually care about individuals being likely positive or negative (after all, at a certain sparsity, the vast majority are likely to be negative - and one person, in a lot of applications, simply doesn’t matter). Instead, you care about the number of positive examples you can count on in a group of a certain composition. Examples of this would be banner ad response rates, or the number of clicks on an email.

And that’s where Poisson regression comes in. There is a “law of rare events”: as your positive examples become sparser and your total number of examples increases, a Poisson regression becomes a better and better fit to the “real” binomial data. Add in measurement errors and modeling time, and Poisson regression becomes downright compelling.

Particularly because the coefficients in a Poisson regression have a natural translation into multiplicative effects on a rate, and that rate can float above 1 if necessary - as opposed to a logistic regression, whose predictions are probabilities that can never be greater than 1. That works out well if you’re modeling an arbitrage scenario, where you’re relating dollars-of-input to dollars-of-output, and have a relatively small number of very positive examples primarily affecting that ratio. Ideally you’ll end up with a ratio less than 1.0, but you really do have to deal with segments where it is greater, and that’s something logistic regression can’t handle. It’s fairly common in the insurance industry to do this with dollars-of-claim per dollar-of-premium - also known as their loss ratio. That kind of arbitrage-identification translates naturally to things like ad dollars per dollar of revenue, or customer service minutes per sale.
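To make the rate interpretation concrete, here is a minimal Clojure sketch of the log-link prediction (just an illustration - not the Java implementation discussed below):

(defn expected-count
  "Poisson regression prediction with a log link: the expected count is
   exposure * exp(intercept + dot(coefs, xs)). Everything is multiplicative on
   the rate scale, so exp(coef) can be any positive number, including > 1."
  [intercept coefs exposure xs]
  (* exposure
     (Math/exp (reduce + intercept (map * coefs xs)))))

;; A coefficient of 0.1 multiplies the predicted rate by (Math/exp 0.1), about
;; 1.105, for each unit increase in the corresponding feature.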

I’ve written up a very straightforward implementation of Poisson regression for Java, because I’m focused on the JVM at the moment and there wasn’t one available on that platform. A number of implementations exist in R, and a few standouts in Python. Mine incorporates the notion of “regularization” - that coefficients are likelier to be zero than the data suggests (or equivalently, that you should have a bit of a bias towards 0 coefficients). That can keep you from seeing large coefficients on sparse segments of the population that randomly have a large number of positive examples (I have some fun horror stories from the insurance industry about regularization via repeatedly forwarded spreadsheets and staff meetings - doing it with math is so much better). If anyone finds it handy, let me know - there are a lot of tools in the standard statistical toolbox that are woefully underused, and it’s great to see them in wider circulation.