Last week I started playing around with the [Instaparse][instaparse] library, which makes writing parsers really easy in Clojure. What Instaparse does is take the formal definition of a language and give you a parser for that language. You can then use that parser on a string, possibly containing something in that language. You need to write the grammar in one of two forms: Extended Backus-Naur Form(EBNF) or Augmented Backus-Naur form(ABNF). You can read about these grammars in Matt Might’s excellent blog post: [The language of languages][lol]. I’ve picked EBNF as my grammar of choice for now, admittedly without having studied the nuances of each. Instaparse does let you mix up EBNF and ABNF in your grammar but you might end up unnecessarily complecting things for yourself.
If the string contains a syntactically correct string, the parser gives you a tree. Instaparse also provides you with the tools to transform that tree into something else, say a Clojure form, so that it can get evaluated. To get familiar with the Backus-Naur form(and its descendants, EBNF and ABNF), I recommend going through Matt’s blog post linked above.
In this post, we’ll write a grammar for parsing arithmetic expressions, so that we can turn something like "11*1-1/5"
into:
(clojure.core/- (clojure.core/* 11 1) (clojure.core// 1 5))
Notice that the parser has spit out the Clojure form with the correct precedence, that is multiplicative operators get evaluated before additive operators.
The grammar
Here is the grammar that I came up with. It handles the four basic arithmetic operators: +
, -
, *
and /
. +
and -
have the same precedence, as do *
and /
. The multiplicative operators(* and /) have higher precedence than the additive operators(+ and -).
(def arithmetic
(insta/parser
"<expr> = additive-expr | multiplicative-expr | number
additive-expr = expr (('+' | '-' ) expr)
multiplicative-expr = number ( ('*' | '/') (multiplicative-expr|number) )
number = #'-?[0-9]+'"))
Let us now take a closer look at the grammar. Starting from the bottom-up, a number
is any integer. A multiplicative-expr
is any number multiplied or divided by another multiplicative-expr
or a number. Notice that a multiplicative-expr
cannot contain an additive-expr
.
Next, an additive-expr
is any expr
added or subtracted with any other expr
. expr
is defined in the first line to be one of additive-expr
, multiplicative-expr
or number
. The angled brackets around expr
means that the :expr
tag will be hidden in the parse tree Instaparse produces.
The definitions of additive and multiplicative expressions implicitly specify that multiplicative operators have higher precedence.
Trying out the parser
Let us now try out the parser to see if it actually works:
First, let’s try the simplest possible operation:
user> (arithmetic "42")
([:number "42"])
A bit more compound expression:
user> (arithmetic "2+2")
([:additive-expr [:number "2"] "+" [:number "2"]])
And now a slightly ridiculous expression:
user> (clojure.pprint/pprint (arithmetic "2+2-2/2*2-1/1+1"))
([:additive-expr
[:number "2"]
"+"
[:additive-expr
[:number "2"]
"-"
[:additive-expr
[:multiplicative-expr
[:number "2"]
"/"
[:multiplicative-expr [:number "2"] "*" [:number "2"]]]
"-"
[:additive-expr
[:multiplicative-expr [:number "1"] "/" [:number "1"]]
"+"
[:number "1"]]]]])
Transforming the parse tree
Like I mentioned above, we really want to have the arithmetic expression parsed into a Clojure form. To do this we will specify what transformations we want:
(def parse-tree->sexp
{:additive-expr (fn [operand operator & rest]
(if (= operator "+")
`(+ ~operand ~@rest)
`(- ~operand ~@rest)))
:multiplicative-expr (fn [operand operator & rest]
(if (= operator "*")
`(* ~operand ~@rest)
`(/ ~operand ~@rest)))
:number read-string})
And finally to get the Clojure form we can run the insta/transform
functions like this:
user> (insta/transform parse-tree->sexp (arithmetic "2+2*2-2"))
((clojure.core/+ 2 (clojure.core/- (clojure.core/* 2 2) 2)))
[lol]:http://matt.might.net/articles/grammars-bnf-ebnf/[instaparse]:https://github.com/Engelberg/instaparse