13 Nov 2013, 17:56

Persistence + GC saves memory

Share

One of the disadvantages of garbage-collected runtimes is potentially increased memory consumption. This usually comes from a couple of places: object headers used for GC metadata, and “extra” heap space kept around to manage the garbage collection process (for example, the space taken up by Java’s Eden pool, which is gradually filled up and then GC’d, rather than incrementally freeing objects as in a reference-counted environment like Python or a manually managed malloc/free environment like C).

But garbage collection can actually lead to vastly reduced memory consumption, because it allows you to use persistent data structures that share data which would otherwise have to be tracked separately. You can share the same backing data for things that will be used & modified in different ways, because it’s effectively a copy-on-write environment: if you conj data to a shared vector in Clojure, the original is unchanged and the caller gets a “new” vector, with most of its backing data shared with its parent.

Those kinds of persistent data structures really only work (at least performantly) in the context of garbage collection, where you’re automatically freeing objects that have ended up with no references to them. Otherwise, managing the object graph for the kind of complex relationships that persistent data structures generate is nigh impossible.

Here’s an example. You have a large number of short phrases you’re analyzing. If they’re in a Latin-alphabet language, you can easily segment them into lists of interned strings representing words (a poor man’s trie). A lot of phrases are duplicates, so you can also de-dupe the entire phrase-list - and it’s free as long as you’re using persistent data structures that guarantee you’re placing no restrictions on how future callers can use your data.

;Straightforward impl
(with-open [r (clojure.java.io/reader "short_cuts")]
  (->> (line-seq r)
       (map #(vec (clojure.string/split % #"\s")))
       (into [])))

;Intern the strings
(with-open [r (clojure.java.io/reader "short_cuts")]
  (->> (line-seq r)
       (map #(vec (for [token (clojure.string/split % #"\s")]
                    (.intern ^String token)))
       (into [])))

;Dedupe the vectors
(with-open [r (clojure.java.io/reader "short_cuts")]
  (loop [remaining (line-seq r)
         as-vec []
         as-set #{}]
    (let [this-line (vec (for [token (clojure.string/split % #"\s")] 
                           (.intern ^String token)))]
      (cond (empty? remaining) as-vec
            (contains? as-set this-line) 
              (recur (rest remaining) (conj as-vec this-line) as-set)
            :else 
              (recur (rest remaining) (conj as-vec this-line) (conj as-set this-line))))))

The result: Vectors of non-interned words take 21.5 gb. As vectors of interned words, they take 12.1 gb. And as de-duped vectors of interned words, they take 4.4 gb. And each version supports the exact same operations on the result - you can take one of those de-duped phrases and “modify” it with no side effects on its siblings, because you’ll get back a new reference.

But here’s the twist: most of that storage is overhead from the clojure.lang.PersistentVectors we’re storing the strings in. Each of those has a 32-long Object array for the leaf, plus another 32-long Object array for the root node, plus some miscellaneous primitive fields. Seems a bit excessive when the vast majority of our vectors are <=10 elements long. What happens if we replace them with something more efficient, like a PersistentArrayVector from SOAC?

(require 'soac.arrvec)
(with-open [r (clojure.java.io/reader "short_cuts")]
  (loop [remaining (line-seq r)
         as-vec []
         as-set #{}]
    (let [this-line (soac.arrvec/array-vec (for [token (clojure.string/split % #"\s")] 
                                             (.intern ^String token)))]
      (cond (empty? remaining) as-vec
            (contains? as-set this-line) 
              (recur (rest remaining) (conj as-vec (get as-set this-line)) as-set)
            :else 
              (recur (rest remaining) (conj as-vec this-line) (conj as-set this-line))))))

Memory usage goes down to 1.5 gb, or ~7% of what we started with.