this is the fifth entry in an n-part series explaining the compilation techniques of Clojure.
In honor of the upcoming Clojure’s Google Summer of Code projects I present some discussion of the ClojureScript compiler pipeline. I talked about this in my ClojureWest talk, but a little more discussion is welcomed. Much of what I say here also applies to Clojure’s pipeline except that the details around data structures and actual programmatic interfaces are different. I will attempt to stay at a high-level of abstraction.
This is the ClojureScript compiler pipeline:
You put Clojure code in one end and out comes JavaScript from the other end.
The compiler is made of numerous phases the first of which is devoted to reading strings and converting them into Clojure data structures.
You can see how the reading phase works by observing the following in a Clojure REPL:
(def E (read-string "(-> 42 (- 6) Math/sqrt)"))
(type E)
;=> clojure.lang.PersistentList
(type (last E))
;=> clojure.lang.Symbol
You can see E
is a data structure.
The next compilation phase is the macro expansion phase.
During this phase a form will be expanded until it reaches some fixed point, as shown below:
(-> E macroexpand-1)
;=> (clojure.core/-> (clojure.core/-> 42 (- 6)) Math/sqrt)
(-> E macroexpand-1 macroexpand-1)
;=> (Math/sqrt (clojure.core/-> 42 (- 6)))
(-> E macroexpand-1 macroexpand-1 macroexpand-1)
;=> (. Math sqrt (clojure.core/-> 42 (- 6)))
(-> E macroexpand-1 macroexpand-1 macroexpand-1 macroexpand-1)
;=> (. Math sqrt (clojure.core/-> 42 (- 6)))
In the illustrative case above at three levels of macro expansion the
form reached a fixed point (does not change from one level to the next).
Eventually the inner ->
macro will also expand, but that
happens as the form is traversed during AST generation. You’ll notice
that I made the macro expansion box a bit smaller. The meaning for this
difference is that macro expansion is interleaved with AST generation.
Apart from rote expansion of macros, this phase also shuffles arguments
to the .
(dot) operator into a consistent
(. target field/method args)
format.
The next phase is deemed the analysis phase and its primary purpose to generate the ClojureScript abstract syntax tree (AST).
If you’ve talked to me about ASTs you’ll probably had the unfortunate luck to hear all about my prejudice against the unfortunate phrase “Lisp syntax is the AST of the program itself” (or some such variation). This is junk. Actual ASTs are adorned with a boatload of additional information like local binding information, accessible bindings, arity information, and many other useful tidbits.
You’ve likely seen a diagram like this (and probably images similar to the pipeline also) in other blog posts, textbooks and papers. However, one advantage that the ClojureScript (and the Clojure compiler one day) provides is that between each compilation phase the interface consists solely of Clojure data! The product of the reader is a list, some other Clojure data type, or a nesting thereof. The product of the macro expansion is the same. The product of the analysis phase is an AST itself composed of nested Clojure maps. There is no product of the compilation phases that cannot be processed by the very tools that Clojure or ClojureScript themselves (and hundreds of libraries) handle directly. This is truly an amazing feature of Clojure and Lisps in general.
The final stage is the emission phase that takes an AST and generates JavaScript.
Emission is likely the most complicated part of the entire ClojureScript compiler - it’s about 700 lines of code.
The first obvious point of extension is at the backend of the analysis phase.
This is effectively the approach that Ambrose Bonnaire-Sergeant takes in his typed-clojure. The big difference is that his analyze project provides a ClojureScript-like AST using Clojure’s analysis engine. It’s very cool. Graphically, you can envision typed-clojure situated like the following:
With the type checker adorning and vetting a tree generated by the analysis phase. Question: what is the greatest limitation of Haskell’s type system? Think about it. I answer this in my talk, but it’s not a central point to this post.
There is no reason to stop there however as conceptually one can imagine additional checking occurring on the AST in sequence.
Therefore, the model derived is one where an AST is gradually adorned through a pipeline. This is a pluggable1 compilation system; one that is fully in line with Clojure’s ideal of open extension.
This is a very powerful model, but potentially fraught with complexity.
However, while powerful, the sequential model is not terribly interesting nor desired. Instead, a better model would be one that allows branching or tap logic into the pipeline. For example, if a program is isolated and fully typed then a branch (beta) may be an appropriate action. However, if a program includes the use of untyped libraries then perhaps partial static typing ornamented with runtime constraint checks2 (alpha) may be the more appropriate branch path.3
Designing an interface to the analysis phase and a set of tools for processing its structures is fairly straightforward. The hard part comes in their use, tapping into the pipeline, and branch logic. Careful design is required here.
The next point of extension is at the point of input to the emission phase.
By extension of course I mean that one can plug in their own emitter.
This is likely the most straightforward way to target new platforms for ClojureScript and is indeed the one taken by clojure-scheme. That is, with clojure-scheme Nathan Sorenson generates Gambit Scheme at the backend. The generated Scheme can then be further compiled to C and then finally compiled to machine code.
Gilad Bracha talks eloquently about pluggable type systems in his paper “Pluggable Type Systems” at http://bracha.org/pluggableTypesPosition.pdf↩︎
Gilad Bracha again with Types are anti-modular.↩︎
Although I have the docs phase situated along a separate branch it may more appropriately serve as a tap into all branches.↩︎