A couple of weeks ago a comment on HN caught my eye:

I write Clojure for food, and Common Lisp for fun. One reason for the latter is CL's speed – awhile back I compared a bit of (non-optimized) Clojure code I wrote for a blog post with a rough equivalent in CL, and was stunned that the Common Lisp code ran about 10x faster. This made me curious as to how fast it could be made if I really tried, and was able to get nearly 30x more1 by optimizing it.

Clojure is definitely fast enough for everything I've done professionally for six years. But Common Lisp, while having plenty of rough edges, intrigues on the basis of performance alone. (This is on SBCL – I have yet to play with a commercial implementation.)

And I took that personally.

I wasn't the only one, either.

The Premise

Non Optimized Clojure Is Slow

This premise hides two assumptions:

  • Idiomatic Clojure is slow
  • Optimized Clojure is non idiomatic

Let us examine these assumptions by looking at the original implementation:

(defn smt-8 [times]
  (->> times
       (partition 8 1)
       (map (juxt identity
                  (comp (partial apply -)
                        (juxt last first))))
       (filter (comp (partial > 1000) second))))

There is absolutely nothing wrong with it. It's correct and demonstrates a good use of composition.

It is, however, far from optimal, even in terms of Clojure's performance.

As we'll see further on, by refactoring this function our code will be both more idiomatic and perform better.

Baseline

Setting up some test data, we can get a baseline measurement:

(require '[criterium.core :as cc])
(def times-v (into [] (take 1e6) (iterate #(+ % (rand-int 1000)) 0)))
(cc/quick-bench (doall (smt-8 times-v)))
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 1.159347 sec
;;     Execution time std-deviation : 17.835002 ms
;;    Execution time lower quantile : 1.143263 sec ( 2.5%)
;;    Execution time upper quantile : 1.186615 sec (97.5%)
;;                    Overhead used : 2.066166 ns

It's important to note an initial oversight by the author, who benchmarked his code against a lazily generated sequence.

I gave the code a head start by realizing it in a vector.

Throughout this post I benchmark with only one input collection size. You can either believe me that performance scales linearly for all implementations, or check for yourselves.

First Step - the missing transducer

This use case is perfect for transducers. They were written exactly for a series of sequence transformations:

(->> xs
     (map f)
     (filter g)
     (map h))
;; is equivalent to =>
(sequence
 (comp
  (map f)
  (filter g)
  (map h))
 xs)

With the added bonus of removing intermediary allocations. That's pretty neat. The only problem is we're missing a crucial component. There is no transducer equivalent to (partition n step coll).

Are we doomed? Not quite.

There is a close transducer, partition-all, which has no step arity. Let's look at how it's implemented:

(defn partition-all
  ([^long n]
   (fn [rf]
     (let [a (java.util.ArrayList. n)]
       (fn
         ([] (rf))
         ([result]
          (let [result (if (.isEmpty a)
                         result
                         (let [v (vec (.toArray a))]
                           ;;clear first!
                           (.clear a)
                           (unreduced (rf result v))))]
            (rf result)))
         ([result input]
          (.add a input)
          (if (= n (.size a))
            (let [v (vec (.toArray a))]
              (.clear a)
              (rf result v))
            result)))))))

If we wanted a sliding window, all we have to do was replace the ArrayList with a Queue!

(defn sliding
  ([^long n]
   (sliding n 1))
  ([^long n ^long step]
   (fn [rf]
     (let [a (java.util.ArrayDeque. n)] ;; Queue here
       (fn
         ([] (rf))
         ([result] (rf result)) ;; don't need leftovers
         ([result input]
          (.add a input)
          (if (= n (.size a))
            (let [v (vec (.toArray a))]
              ;; Remove `step` elements instead of clear
              (dotimes [_ step] (.removeFirst a))
              (rf result v))
            result)))))))

Let's convinces ourselves it works:

(sequence (sliding 3 1) '[a b c d e]);; => ([a b c] [b c d] [c d e])

Now we can define an equivalent transducer:

(def baseline-xf
  (comp
   (sliding 8 1)
   (map (juxt identity
              (comp (partial apply -)
                    (juxt last first))))
   (filter (comp (partial > 1000) second))))

(cc/quick-bench (doall (sequence baseline-xf times-v)))
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 462.921956 ms
;;     Execution time std-deviation : 20.213288 ms
;;    Execution time lower quantile : 453.931650 ms ( 2.5%)
;;    Execution time upper quantile : 497.963799 ms (97.5%)
;;                    Overhead used : 2.079753 ns

And we're already ~2.5x faster

De-composing

How much overhead is there to all this functional composition? Let's find out:

(def decomposed-xf
  (comp
   (sliding 8 1)
   (map (fn [v] [v (- (last v) (first v))]))
   (filter (fn [[_ t]] (> 1000 t)))))

(= (sequence decomposed-xf times-v) (smt-8 times-v)) ;; => true
(cc/quick-bench (doall (sequence decomposed-xf times-v)))
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 366.650954 ms
;;     Execution time std-deviation : 2.112047 ms
;;    Execution time lower quantile : 365.042052 ms ( 2.5%)
;;    Execution time upper quantile : 370.254096 ms (97.5%)
;;                    Overhead used : 2.066166 ns

25% faster? How come? The culprit is mainly apply. juxt returns a vector of two elements and apply takes it back apart, one element at a time. Iteration has its price.

Faster vector operations

first and last will work on pretty much everything, including Java arrays. It does not mean, however, it is a good idea. Vectors can be accessed faster using indexed access.

Since our last vector won't have 8 elements, we can generically get the last element by using peek. Looking at its docstring:

For a list or queue, same as first, for a vector, same as, but much more efficient than, last. If the collection is empty, returns nil.

(def vector-xf
  (comp
   (sliding 8 1)
   (map (fn [v] [v (- (peek v) (nth v 0))]))
   (filter (fn [[_ t]] (> 1000 t)))))

(= (sequence decomposed-xf times-v) (sequence vector-xf times-v)) ;; => true
(cc/quick-bench (doall (sequence vector-xf times-v)))
;; Evaluation count : 12 in 6 samples of 2 calls.
;;              Execution time mean : 88.566441 ms
;;     Execution time std-deviation : 350.326432 µs
;;    Execution time lower quantile : 88.235538 ms ( 2.5%)
;;    Execution time upper quantile : 89.062221 ms (97.5%)
;;                    Overhead used : 2.066166 ns

Now we're beginning to see some dramatic improvements. It is mostly due to the overhead of last, which always iterates over the input collection, and does not even take the most efficient code paths to do so.

Something between map and filter

Ideally, we would have liked to only allocate the vector in the map transducer if the condition in filter is satisfied. Can we? Enter keep:

Returns a lazy sequence of the non-nil results of (f item). Note, this means false return values will be included. f must be free of side-effects. Returns a transducer when no collection is provided.

Turns out, that's exactly what we needed. We can then discard the difference calculation and not allocate another vector:

(def keep-xf
  (comp
   (sliding 8 1)
   (keep (fn [v]
           (when (> 1000 (- (peek v) (nth v 0)))
             v)))))

(= (sequence keep-xf times-v) (map first (sequence vector-xf times-v))) ;; => true
(cc/quick-bench (doall (sequence keep-xf times-v)))
;; Evaluation count : 12 in 6 samples of 2 calls.
;;              Execution time mean : 80.411626 ms
;;     Execution time std-deviation : 1.211228 ms
;;    Execution time lower quantile : 79.822031 ms ( 2.5%)
;;    Execution time upper quantile : 82.508332 ms (97.5%)
;;                    Overhead used : 2.066166 ns

Slightly faster. Since we know the inputs will always be long, we can use unchecked maths:

(set! *unchecked-math* true)
(def unchecked-xf
  (comp
   (sliding 8 1)
   (keep (fn [v]
           (when (> 1000 (unchecked-subtract (long (peek v)) (long (nth v 0))))
             v)))))
(set! *unchecked-math* false)

(cc/quick-bench (doall (sequence unchecked-xf times-v)))
;; Evaluation count : 12 in 6 samples of 2 calls.
;;              Execution time mean : 66.642476 ms
;;     Execution time std-deviation : 356.808224 µs
;;    Execution time lower quantile : 66.323382 ms ( 2.5%)
;;    Execution time upper quantile : 67.047693 ms (97.5%)
;;                    Overhead used : 2.066166 ns

Another 20%!

Up to this point, we can pat ourselves on the back and say our code is still idiomatic on one hand, but performs way better on the other. About 15x faster, while non optimized CL was only 10x faster.

As an added bonus, this looks pretty idiomatic.

Aside: wasn't the sliding transducer an optimization?

You could argue that it is. Or that it's a missing piece. It should probably live in a library. I also don't think it was complicated to derive, although transducers might be unwieldy for beginners.

Compare the initial snippet with the final version:

(defn smt-8 [times]
  (->> times
       (partition 8 1)
       (map (juxt identity
                  (comp (partial apply -)
                        (juxt last first))))
       (filter (comp (partial > 1000) second))))

(def keep-xf
  (comp
   (sliding 8 1)
   (keep (fn [v]
           (when (> 1000 (- (peek v) (nth v 0)))
             v)))))

I would consider the latter way more idiomatic and concise.

Slightly less idiomatic

Do we have to get the results back as vectors? If we relax this requirement, we can skip over wrapping the results in the sliding transducer in a vector:

(defn sliding-array
  ([^long n ^long step]
   (fn [rf]
     (let [a (java.util.ArrayDeque. n)]
       (fn
         ([] (rf))
         ([result] (rf result))
         ([result input]
          (.add a input)
          (if (= n (.size a))
            (let [v (.toArray a)]
              ;; Remove `step` elements instead of clear
              (dotimes [_ step] (.removeFirst a))
              (rf result v))
            result)))))))

Then, modify the argument to keep to take an array:

(set! *unchecked-math* true)
(def array-xf
  (comp
   (sliding-array 8 1)
   (keep (fn [^objects arr]
           (when (> 1000 (unchecked-subtract
                          (long (aget arr 7))
                          (long (aget arr 0))))
             arr)))))
(set! *unchecked-math* false)
(cc/quick-bench (doall (sequence array-xf times-v)))
;; Evaluation count : 30 in 6 samples of 5 calls.
;;              Execution time mean : 23.127029 ms
;;     Execution time std-deviation : 192.325091 µs
;;    Execution time lower quantile : 22.890029 ms ( 2.5%)
;;    Execution time upper quantile : 23.325306 ms (97.5%)
;;                    Overhead used : 2.066166 ns

Another ~3x speedup, new 50x times faster than the original, without going crazy with interop, optimization or some unrolling.

But can we go faster?

Let us put aside our requirement for idiomatic Clojure. Let's settle on readable.

By converting the input to an array, we can work directly with indices and built up the results collection. We'll also take advantage of the fact that linked lists are pretty fast to allocate, so build those instead of a vector or set:

(set! *unchecked-math* true)
(defn unrolled
  [^longs arr]
  (let [l (unchecked-subtract (alength arr) 7)]
    (loop [idx (int 0) agg ()]
      (if (< idx l)
        (let [idx (int idx)]
          (recur
           (unchecked-inc-int idx)
           (if (> 1000 (unchecked-subtract (aget arr (unchecked-add-int idx 7)) (aget arr idx)))
             (.cons agg idx)
             agg)))
        agg))))
(set! *unchecked-math* false)

(let [arr (long-array times-v)]
  (cc/quick-bench (unrolled arr)))
;; Evaluation count : 912 in 6 samples of 152 calls.
;;              Execution time mean : 655.284194 µs
;;     Execution time std-deviation : 2.846323 µs
;;    Execution time lower quantile : 651.150750 µs ( 2.5%)
;;    Execution time upper quantile : 658.353038 µs (97.5%)
;;                    Overhead used : 2.066166 ns

Another 35x speedup!

This implementation is very different. It deals with a concrete array instead of a sequence abstraction, and explicitly builds up the result. It allocates a lot less and sequentially accesses memory.

It's more similar to a solution in Java then Clojure, but it's pretty readable. It might even be more readable to programmers unfamiliar with Clojure.

I feel pretty comfortable saying that Clojure is not slow. I did not even have to disassemble it, tweak anything, or write complicated code.

The gains here are a result of a few actions:

  • cutting down on allocation
    • partition -> sliding
    • lazy sequences -> transducers
    • map / filter -> keep
  • cutting down on iteration
    • last -> peek
    • juxt / apply -> direct function calls
  • Using primitives instead of collections
    • Vectors -> arrays make for faster access
    • Working directly with arrays and contiguous memory access

Final Scores

Let's take a moment to reflect on how far we've gone

step time (ms) improvement relative
baseline 1159 1 1
xf 463 2.5032397 2.5032397
decomposed 366 3.1666667 1.2650273
vector 88 13.170455 4.1590909
keep 80 14.4875 1.1
unchecked 66 17.560606 1.2121212
array 23 50.391304 2.8695652
iteration 0.655 1769.4656 35.114504

Should you try this at home?

premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3% – Donald Knuth

The answer as always is "it depends"; Things like using nth, peek and pop instead of first and last, using transducers instead of lazy sequences, and familiarity with Clojure's core (e.g. keep) are good and will probably produce more idiomatic code. They can be embraced as habits.

On the other hand, things like writing your own transducers, especially ones built with Java interop, and working directly with arrays should be reserved for special circumstances.

Do profile your code first, understand the problems and use cases, then optimize to your heart's content, secure in the knowledge that if you need to, Clojure can probably get there.

Happy hacking