Indentation sensitive parsing the easy way

Rob Zinkov

Recently, I had to write an parser for Hakaru. Writing parsers in Haskell is generally a treat as there are muliple parser libraries to choose from including Happy, parsec, attoparsec, megaparsec, trifecta, and many others. The trouble occurs when you want to parse and indentation-sensitive language like Python or Haskell. For that task the choices are more limited and far less documented. Which is unfortunate as my favorite library indentation of the bunch is the least documented. The following is how to use indentation to write an indentation-sensitive parser.

For this tutorial, I will use indentation and parsec.

cabal install indentation parsec

To get started import Parsec as you normally would

And then add the following modules from indentation

The key thing which needs to be changed is that the lexer needs to be indentation-sensitive. Sadly, there is no easy way to extend the existing LanguageDefs, so we make one from scratch.

Once you have an indentation-sensitive lexer, you can add the primitives you need in terms of it.

All of these are boilerplate except for parens. You will notice, for it we call localIndentation Any before passing the input. This function indicates that indentation rules can be ignored when using this combinator. This gives parentheses the meaning they have in python which is to suspend indentation rules. We will go into more detail how the indentation primitives work, but for now let’s define AST for our language.

Parsing this language doesn’t involve need to involve indentation rules

Let’s add some helper code.

And this parses programs just fine.

The issue is also things which feel invalid.

We need to change def so that its body must be indented at strictly greater than character where it starts.

If you now look, we have defined a function for the body, blockExpr, which says we must have the body strictly greater. Now when we parse test2 we get the following.

localIndentation takes two arguments, what the indentation of an expression should be relative to the current indentation, and the expression itself. Relative indentations can be greater-than and equal (Ge), strictly greater-than (Gt), equal (Eq), a specific amount (Const 5), or anything (Any).

While it seems like this is the only primitive you should need, sometimes the indentation level you want can’t be defined in terms of the parent.

For example, the following is a valid program

The issue is that “y” is indented greater than the “def” but, we really want it to be indented in terms of “add”. To do this we need to use absolute indentation. This mode says indentation is defined in terms of the first token parsed, and all indentation rules apply in terms of where that first token is found.

We define a function absBlockExpr. You’ll notice we also used a localIndentation. The reason for that is absolutionIndentation normally defaults to the first token of the parent. In our case, this is def and we want instead for it to choose add.

Now it works as expected

This library has other bits to it, but this should give enough to figure out, how to add indentation sensitivity to your language.

Special thanks to Aleksey Kliger for helping me understand this library.