Datahike is a durable database powered by an efficient Datalog query engine. It started as a fork of DataScript and grew from there. It's an interesting project I've been following in the past few months with interest.
My curiosity led me to a dive inside Datahike's Datalog parser. I've been wrangling with parsers for a while and have started to get a feel for them. I'm in no way an expert on the matter, but I've bumped into enough walls and stumbled on enough rocks to know they're there.
This is a short illustrative example on how one line can devastatingly impact an application's performance.
Datalog? What's That?
Without getting too deep into the theory, Datalog is a declarative logical programming subset of Prolog which can often be used as a query language for declarative databases, such as the case in this post.
The particular Clojure implementation, originally conceived by Rich Hickey in Datomic, represents queries as data structures:
(def q '[:find ?e :in $ ?fname ?lname :keys foo :where [?e :user/firstName ?fname] [?e :user/lastName ?lname]])
The fine details of the data model are interesting, but out of this post's scope. I recommend looking more into Datalog and what it has to offer.
Profiling a Parser
The parser's entry point is datalog.parser/parse
. It extracts the
query elements then validates them.
When I want to get a feel of how a complex function behaves I usually benchmark and profile it. When I benchmarked the parser for the sample query, I got some surprising results:
(require '[criterium.core :as cc]) (cc/bench (parser/parse q)) ;;; Execution time mean : 1.787494 ms
What's surprising about these results is the parser is incredibly slow.
It's true the parser uses a cache behind the scenes, but there's still no reason it should take this long. It's still a major hit on the first query, and makes the query parser non-viable for other usages, such as static analysis (see Kondo).
That's when I profiled the query using clj-async-profiler and saw something familiar, which I've actually seen before:
(require '[clj-async-profiler.core :as prof]) (def reps 1e6) (deftest profile (prof/profile (time (dotimes [_ reps] (parser/parse q)))))
Feel free to use the search functionality in the upper right hand side
of the SVG. Search for satisfies
. What you'll find is that the
parse
function spends 95% of its CPU time calling satisfies
.
What is satisfies?
and what does it do?
Handling The Null Case
When writing parsers, we need a way to specify our handling of the null case, default behavior. Depending on our data model it can be achieved in several ways.
The Datalog parser is implemented via protocols and records which implement it:
(defprotocol ITraversable (-collect [_ pred acc]) (-collect-vars [_ acc]) (-postwalk [_ f]))
This allows for fast invocation of the protocol functions on all
records which implement it, which is great. However, there is still a
need to know if a certain object is ITraversable
, for which one
usually writes a traversable?
function, which looks like:
(def traversable? (partial satisfies? ITraversable))
This implementation is 100% correct and covers all use cases. The
problem with it is that satisfies?
has terrible performance, making
it a poor choice for code which sits on any hot path, as can be seen
in the flame graph above.
The reason satisfies?
is so slow is it searches through all of an
object's supers
with isAssignableFrom
for one which implements the
protocol.
There's an open issue which addresses that, but even if it gets fixed in Clojure 1.11, everyone using a previous version won't benefit from the speedup.
All solutions will eventually boil down to creating some sort of
registry which specifies which types satisfy ITraversable
and a
default case, but the one presented here probably performs better than
multimethods or creating a sort of managed global state and managing
it ourselves.
Extend, Embrace, Satisfy
After reading Solving The Expression Problem and Clojure's history I
got a sense that the right way to solve it was to let objects answer
the question themselves, instead of being external to them. It could
have been done by extending ITraversable
or adding a new protocol, I
chose to add a new protocol which answers a simple question, is the
object traversable?
(defprotocol Traversable (-traversable? [_]))
Then, add to every record definition:
p/Traversable (-traversable? [_] true)
Luckily, the record definitions were already wrapped in a macro so I needed to change only one place in the code.
Finally, we need to ensure that for every other type, -traversable?
is false:
(extend-type #?(:clj Object :cljs object) Traversable (-traversable? [_] false)) (extend-type nil Traversable (-traversable? [_] false))
Dramatic Improvements
Running the same bench as before, we get the following results:
(require '[criterium.core :as cc]) (cc/bench (parser/parse q)) ;;; Execution time mean : 59.656711 µs ;;; 1.787494e-3 / 59.656711e-6 ~ 30
By getting rid of satisfies?
we sped the parser up 30x!
This doesn't and shouldn't reflect negatively on the code's authors.
The problem with satisfies?
is that it's too damn convenient and
easy to use. In the past I've had the (mis)fortune of seeing it in a
production environment, which was my first tussle with it, also in a
parser of sorts. Lucky for us, it's very easy to get rid of, as you've
seen in this post.
Using solutions like caching do help alleviate the problem in long
running application's contexts, however, when used for ephemeral
processes like clj-kondo
we can see great improvements.
Acknowledgments
- Nikita Prokopov, DataScript.
- Konrad Kühne, Christian Weilbach and the rest of the Datahike contributors.