Parsing TeX with Recursive Transducers

Conveyor belt panorama "Conveyor belt panorama", by Bertram Nudelbach CC-BY-SA 2.0 (cropped)

For the last weeks, I have been playing around with parsers of different kinds. The most useful I made is probably an EDN parser in Go, but I think the more interesting one is one of my toy parsers. It parses a very constrained subset of TeX in a single pass, by using a more generalised version of Clojure's transducers.

Differences Between Transducers and Reducers

It began with me playing around with Clojure's transducers, and as everyone else I came over all the common gotchas. The reducers returned from transducers can be stateful, and thus can be used only once. So, although a transducer takes a reducer and returns a reducer, you should not call a transducer yourself unless you know it is stateless:

;; A transducer is reducer -> reducer, so this is a reducer
user=> (def vec10 ((take 10) conj))
#'user/vec10
user=> (reduce vec10 [] (range 20))
[0 1 2 3 4 5 6 7 8 9]
user=> (reduce vec10 [] (range 20))
[]

This is partly why you should always use transduce, into, eduction, or any of the other functions in Clojure 1.7 that takes a transducer.

There are other reasons as well. A reducer's init function, the 0-arity call to a reducer, is more prevalent in transducers. In reduce, it would only occur if you have no init value and is passed in an empty collection. In transduce and friends, it will occur every time you do not provide an initial value:

;; (reduce (xf rf) coll) != (transduce xf rf coll)
user=> (reduce conj nil)
[]

user=> (reduce conj (range 10))
ClassCastException java.lang.Long cannot be cast to clojure.lang.IPersistentCollection  clojure.core/conj--4338 (core.clj:82)

user=> (transduce identity conj nil)
[]

user=> (transduce identity conj (range 10))
[0 1 2 3 4 5 6 7 8 9]

So, while it would be desirable that they act similar, they do not actually do so.

Another reason to not use transducers directly is tightly linked together with the first reason: All reducers used with a transducer function must have a 1-arity function that "completes" the result. This is so that some stateful reducers could "clean up" when there is no more input. But this doesn't happen when you use reduce:

;; again, (reduce (xf rf) coll) != (transduce xf rf coll)
user=> (reduce ((partition-by number?) conj) [] (range 10))
[]

user=> (transduce (partition-by number?) conj [] (range 10))
[[0 1 2 3 4 5 6 7 8 9]]

So while both "transducer" reducers and "old" reducers are called reducers, they are two somewhat different things. Some reducers work for both reducers and transducers (monoids mostly, like + and *). Some, like conj, will only work in reduce if you provide an initial value. And some, mostly stateful reducers, will only work for transduce functions because of the completing call.

The problem with this is not that they differ; backwards compatibility is super important in a programming language, and you can't just change how or what reducers are from one version to the next. However, the similar naming convention could cause people to believe that transducers are something you can magically apply on top of existing reducer functions. I'm guessing that this will cause bugs and confusion for newcomers to transducers, at least it did so for me.

Reducers/Transducers Revisited

I'm not sure if cleaning up reducers/transducers is worth its own library, but assume for a moment that it is. I've made a library that does this, among other things we'll see briefly, which is named rexf.

In rexf, a rexf reducer must implement all the three different arity functions: init (0-arity), complete (1-arity) and step (2-arity). A rexf reducer may be called at most once, and will always be called with complete when it is done. A reducer must also implement the Reducer protocol, which we will see in a short moment.

A rexf transducer is slightly different from its clojure.core cousin: It doesn't work with reducers directly, but with things that can produce reducers – what I called (in lack of a better name) a ReducerFactory. A rexf transducer is a function that takes a ReducerFactory and produces a ReducerFactory (ReducerFactory → ReducerFactory).

A ReducerFactory is a protocol that has a single function: init. init creates "fresh" stateful reducers, or returns a stateless reducer.

(defprotocol ReducerFactory
  (init [this]))

By adding in the indirection of "creation", we can merge transduce and reduce into a single function: In rexf you use reduce for everything. rexf/reduce takes in a ReducerFactory, an optional init value and a collection. It's used in the same way we use transduce in Clojure, but without a transducer:

(require '[com.hypirion.rexf :as rexf])

(def vec10 ((rexf/take 10) rexf/conj))
;; vec10 is a ReducerFactory

user=> (rexf/reduce vec10 [] (range 20))
[0 1 2 3 4 5 6 7 8 9]
user=> (rexf/reduce vec10 [] (range 20))
[0 1 2 3 4 5 6 7 8 9]
user=> (rexf/reduce vec10 (range 20))
[0 1 2 3 4 5 6 7 8 9]

Since folks have created a lot of reducers and transducers already, rexf comes with the functions stateless-rf, stateless-xf and stateful-xf. They take a normal Clojure reducer/transducer and produce rexf equivalents. But rexf also comes with rexf transducers of all the transducers in Clojure core for convenience; rexf/map is for example the same as (rexf/stateless-xf map):

user=> (rexf/reduce (rexf/stateless-rf +) (range 10))
45
user=> (rexf/reduce ((rexf/stateful-xf (take 10)) 
  #_=>               (rexf/stateless-rf conj))
  #_=>   (range 20))
[0 1 2 3 4 5 6 7 8 9]
;; or ((rexf/take 10) rexf/conj)

It's Reducers All the Way Down

turtle.two "turtle.two", by Karol Franks CC-BY-NC-ND 2.0

What we've seen so far isn't interesting in any particular way. It has just been cleaning up the reducer/transducer interface a little. But the idea that a transducer takes things that produce reducers instead of actual reducers itself is cool. So let's try to extend that even further: What would happen if a Reducer could create a fresh version of "itself"? That's what the rexf Reducer protocol says that it must be able to:

(defprotocol Reducer
  (reinit [this]))

(As previously mentioned, a Reducer must also implement the 0-, 1- and 2-arity IFn invoke calls.)

The reinit function a Reducer has to provide is the part of rexf which introduces recursion to reducers. It is essentially saying that every Reducer needs to be able to reinitialise themselves, for some definition of reinitialise[1]. Generally, we talk about creating a new stateful reducer with the state this reducer had when it was initialised, so that we can reuse it. This is convenient whenever you want the reduction to be recursive, and is the reason rexf is called rexf; it's short for recursive xforms. Although recursive reducers doesn't make sense for all problems (core.async, for example), it's interesting for some types of pipelines.

Rexf doesn't ship with that many batteries for recursion, but it has one notable function: group-with. It takes 2 predicates, when to start and when to stop, and a function that creates the value the grouping transducer will emit. Out from those, it will create a rexf transducer that uses the recursion capabilities of rexf:

user=> (def group-parens
  #_=>   (rexf/group-with (fn [item] (= item \())
  #_=>                    (fn [start stop] (= stop \)))
  #_=>                    (fn [start inner stop] inner)))
#'user/group-parens

user=> (def tree [1 \( 2 3 \( 4 \( \( 5 \) 6 \) 7 8 \) \) 9])
#'user/tree

user=> (rexf/reduce (group-parens rexf/conj!) tree)
[1 [2 3 [4 [[5] 6] 7 8]] 9]

user=> (def mul-num (rexf/map #(if (number? %) (* 10 %) %)))
#'user/mul-num

user=> (rexf/reduce ((comp group-parens mul-num)
  #_=>               rexf/conj!) tree)
[10 [20 30 [40 [[50] 60] 70 80]] 90]

By default, all stateless reducers will just return themselves. Stateful reducers will reinitialise themselves, but the reinitialisation step can return whatever it wants.

For example, rexf/toplevel takes a transducer and returns a new transducer. When reducers produced from the transducer are reinitialised, then the particular transducer step rexf/toplevel was given is just skipped:

user=> (rexf/reduce ((comp group-parens (rexf/toplevel mul-num)) 
  #_=>                 rexf/conj!) 
  #_=>    tree)
[10 [2 3 [4 [[5] 6] 7 8]] 90]

Compose, Compose, Compose

As hinted in the beginning of this text, the kind of pipeline where I find this interesting and potentially valuable is in parsers. We do have parser combinators which is a powerful tool for this kind of stuff, but we generally do not make it possible to replace parser parts. I therefore thought it would be interesting to see if a transducer-based parser strategy could be viable and enforce some kind of composability as well.

The thought came when I needed to modify a Markdown parser slightly for my blog. I wanted to parse local links and resources in a different manner depending on context, and so what I wanted was a way to "hook into" a Markdown parser and use a different URL renderer in certain cases. However, this turned out to be really hard without copypasting lots of code.

And so I tried this recursive transducer parser idea with a subset of TeX, as its semantics are less ambiguous and less complex than Markdown. After much tinkering and headscratching, I ended up with the Clojure library subtex.

While it's still a proof of concept, it seems to work fine on legal input. The read-string within the hiccup part of the library converts a string into a hiccup-style layout with the option to tag on some metadata if that is desired. For example, calling read-string with the (La)TeX snippet

\documentclass{article}
\newif\ifpost
\postfalse

\usepackage{hyperref}

\ifpost
\documentclass{blog}
\title{A title}
\tags{foo}{bar}{baz}
\fi

\begin{document}
A TeX-parser in \href{http://www.clojure.org}{Clojure}
using \emph{transducers}.
\begin{itemize}
\item
  \begin{enumerate}
  \item nested \texttt{results}
  \end{enumerate}
\end{enumerate}
\end{document}

will give you back

{:properties
 {:documentclass "blog"
  :title "A title"
  :tags #{"foo" "bar" "baz"}},
 :document
 [[:p
   " A TeX-parser in "
   [:a {:href "http://www.clojure.org"} "Clojure"]
   " using "
   [:em ("transducers")]
   ". "]
  [:ul
   ([:li
     (" " 
      [:ol ([:li (" nested "
                  [:code ("results")] " ")])]
      " ")])]
  [:p " "]]}

which is a good start for data you can massage into a blogpost with certain layouts.

One of the cool results is that it looks and feels very composable. Most of the main transducers are just compositions of smaller transducers and/or reducers. Here's a snippet from hiccup.clj, for example:

(def common-invokes
  (comp h1 h2 h3 h4 strong em code sub sup
        hrule newline url href img rm-maketitle))

(def ^:private rdoc
  (tex/read-document
   ((comp (tex/group-env #(get-in % [:args 0 0]))
          item-to-li
          itemize-to-ul
          enumerate-to-ol
          common-invokes
          text-value
          (footnote*)
          tex/shrink-text
          tex/minted-to-pre
          paragraphiphy)
    rexf/conj!)))

(def ^:private default-transducer
  (comp minted/process
        tex/remove-comments
        tex/match-braces
        tex/group-calls
        (tex/read-if-properties "post" props/common)
        rdoc
        (rexf/toplevel tex/filter-data)))

(defn read-string
  [s]
  (->>
   (iterator-seq (Tokenise. s))
   (rexf/into [] default-transducer)
   (into {})))

The result is that, if you want to modify the parser process in some way, you got the option to do so with relatively small effort.

One good example of this composability is shown in the default-transducer code: I want to handle minted environments as raw text instead of TeX code. To do that, I created a transducer that takes raw tokens, recognizes minted environments and emits them as a specific type, and just pass through all other tokens. This is run before comments are filtered out, so that you don't have to escape % inside the minted environment. The rdoc transducer then converts these values into pre tags. However, if I want to use pygments, I can replace that step with a transducer that calls pygmentize and returns the result instead.

Joy and Trouble at Recursion Lane

Is there anything of value within this little experiment? There might be. Subtex is, or at least feels, very composable. It's possible and easy to make own passes and add/remove passes that fit into the parser engine, and by using transducers, you are to some extent forced to make them composable.

But it felt hard to design the parser properly: You have to be very careful with inspecting output from a downstream reducer. In fact, you shouldn't do so at all if possible. This is also true – to some extent – with values you get passed in: If you add in a pass that modify some items sent in, they may act differently with passes downstream. So while it's tempting to assume this engine is super composable, the reality is that you have to know every step in the pipeline to avoid unexpected ramifications. For subtex, this happens very seldom. However, that was after I experienced the trouble it caused, and designed the parse steps to be as independent as possible.

It is also somewhat confusing to understand what will happen during recursion at times. reinit is super flexible, which is both a blessing and a curse. The fact that you can modify the functionality of reinit makes it complex, but at the same time very powerful.

Another issue in the parser/compiler department is that this is kind-of sort-of forward referencing only. If you need some way to change values after you've emitted them, you could probably use volatiles and use them as pointers. That being said, if you're using this for more complex tasks, I'd guess this is inferior compared to tools.analyzer's schedule function, or an architecture designed for a nanopass compiler.

So is this really a good idea? I think there's something here, but it is probably not in the exact same shape as transducers/reducers.

Why Not Just Use $THING?

If you just want composability, there are other choices out there. I mentioned tools.analyzer and the nanopass compiler, but you can go even smaller: We can just perform multiple maps/reduces/etc in sequence.

(->> coll
   (map step1)
   (filter step2)
   (reduce step3)
   ...)

This is straightforward, but it would mean also that we need to define a traversal function à la clojure.walk/walk to recursively perform operations. Since every step is executed in sequence, you end up with the sequence allocation problem transducers were to some extent trying to solve[2]. While this isn't super critical for my kind of work, it's a nice to be sure that it's not a concern.

There are other interesting properties by having a parser that is based upon transducers as well:

  1. It's incremental, like attoparsec. You can handle BIG™ files without problem, or you can make multiline repls less painful to design. It's generally not a big deal for most applications though, and in cases where you need to read a BIG™ file, it's usually a stream of JSON/EDN elements where you don't need a sophisticated parser to read one and one element.

  2. It also means that you can send data out incrementally, unlike attoparsec (as attoparsec is pure). If you cannot hold all the data in memory, you can dump it to disk or send it to other systems as a step in the "parser".

I have not made thorough attempts to find a parser library with these properties yet though, so I feel confident that somebody will inform me that $THING solves all of these problems in a more elegant manner. Which is cool and fine; For me this is mostly about experimenting and toying with ideas. If something comes out of this recursive transducer thing, then that's just a neat bonus.






[1] We can do a small optimisation with reinit: A stateless Reducer can be its own ReducerFactory, which returns itself. This is what is created when you call rexf/stateless-rf. This can also be used on top of statless transducers: If a stateless transducer receives a reducer factory that returns a stateless reducer, then this can also return a stateless reducer. This is also the only difference between rexf/stateless-xf and rexf/stateful-xf – when in doubt, just use the latter.

[2] It's not as bad in lazy languages like Haskell, but once you attempt to do nanopasses there and want to use monads over a transformation, you're in the same situation. I guess the Haskell folks probably have ways around this somehow – I haven't dug too deply into this though.

Tagged with: clojure.