Convex Optimized

Indentation sensitive parsing the easy way

by Rob Zinkov on 2016-01-12

Recently, I had to write a parser for Hakaru. Writing parsers in Haskell is generally a treat, as there are multiple parser libraries to choose from, including Happy, parsec, attoparsec, megaparsec, trifecta, and many others. The trouble starts when you want to parse an indentation-sensitive language like Python or Haskell. For that task the choices are more limited and far less documented, which is unfortunate, as my favorite library of the bunch, indentation, is the least documented. The following is how to use indentation to write an indentation-sensitive parser.

For this tutorial, I will use indentation and parsec.

cabal install indentation parsec

To get started, import Parsec as you normally would:

module Indent.Demo where

import           Data.Functor                  ((<$>), (<$))
import           Control.Applicative           (Applicative(..))
import qualified Control.Monad                 as M
import           Data.Functor.Identity
import           Data.Text                     (Text)
import qualified Data.Text                     as Text
import           Text.Parsec                   hiding (Empty)
import           Text.Parsec.Text              () -- instances only
import qualified Text.Parsec.Expr              as Ex
import qualified Text.Parsec.Token             as Tok

And then add the following modules from indentation:

import           Text.Parsec.Indentation
import           Text.Parsec.Indentation.Char
import qualified Text.Parsec.Indentation.Token as ITok

The key change is that the lexer needs to be indentation-sensitive. Sadly, there is no easy way to extend an existing LanguageDef, so we make one from scratch.

style :: Tok.GenLanguageDef ParserStream st Identity
style = ITok.makeIndentLanguageDef $ Tok.LanguageDef
    { Tok.commentStart    = ""
    , Tok.commentEnd      = ""
    , Tok.nestedComments  = True
    , Tok.identStart      = letter <|> char '_'
    , Tok.identLetter     = alphaNum <|> oneOf "_'"
    , Tok.opStart         = oneOf "!#$%&*+./<=>?@\\^|-~"
    , Tok.opLetter        = oneOf "!#$%&*+./<=>?@\\^|-~"
    , Tok.caseSensitive   = True
    , Tok.commentLine     = "#"
    , Tok.reservedOpNames = [":"]
    , Tok.reservedNames   = ["def", "add"]
    }

lexer :: Tok.GenTokenParser ParserStream () Identity
lexer = ITok.makeTokenParser style

Once you have an indentation-sensitive lexer, you can add the primitives you need in terms of it.

integer :: Parser Integer
integer = Tok.integer lexer

identifier :: Parser String
identifier = Tok.identifier lexer

reserved :: String -> Parser ()
reserved = Tok.reserved lexer

reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer

parens :: Parser a -> Parser a
parens = Tok.parens lexer . localIndentation Any

commaSep :: Parser a -> Parser [a]
commaSep = Tok.commaSep lexer

All of these are boilerplate except for parens. You will notice that for it we call localIndentation Any before handing the inner parser to the lexer. This combinator indicates that indentation rules can be ignored inside it, which gives parentheses the meaning they have in Python: they suspend indentation rules (we will see a small demonstration of this once some indentation constraints are in place). We will go into more detail about how the indentation primitives work, but for now let's define the AST for our language.

type Name = String
type Args = [Name]

type ParserStream    = IndentStream (CharIndentStream String)
type Parser          = ParsecT     ParserStream () Identity

data Expr =
     Func Name Args Expr
   | Var  Name
   | App  Expr [Expr]
   | Add  Expr Expr
   | Lit  Integer
   deriving (Show)

Parsing this language doesn't need to involve indentation rules:

int :: Parser Expr
int = Lit <$> integer

add :: Parser Expr
add = reserved "add" *> (Add <$> expr <*> expr)

var :: Parser Expr
var = Var <$> identifier

app :: Parser Expr
app = App <$> var <*> parens (commaSep expr)

def :: Parser Expr
def = do
  reserved "def"
  name <- identifier
  args <- parens (commaSep identifier)
  body <- reservedOp ":" *> expr
  return (Func name args body)

expr :: Parser Expr
expr = def
   <|> try app
   <|> try var
   <|> try add
   <|> int
   <|> parens expr

Let’s add some helper code.

indentConfig :: String -> ParserStream
indentConfig =
    mkIndentStream 0 infIndentation True Ge . mkCharIndentStream

parse :: String -> Either ParseError [Expr]
parse =
    runParser (many expr <* eof) () "[input]" . indentConfig
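If I read the library right, the four arguments to mkIndentStream line up with the minIndentation, maxIndentation, absMode and tokenRel fields you will see in the error messages below: we start out allowing any indentation from 0 up to infIndentation, in absolute mode, relating tokens with Ge. As far as I can tell, mkCharIndentStream simply turns the raw String into a stream that tracks column positions.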

And this parses programs just fine.

test1 = unlines
  [ "def foo(x,y):"
  , "    add x y"
  ]

parse test1
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]

The issue is that it also accepts things which feel like they should be invalid.

test2 = unlines
  [ "def foo(x,y):"
  , "add x y"
  ]

parse test2
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]

We need to change def so that its body must be indented strictly greater than the column where the def starts.

blockExpr :: Parser Expr
blockExpr = reservedOp ":" *> localIndentation Gt expr

def :: Parser Expr
def = do
  reserved "def"
  name <- identifier
  args <- parens (commaSep identifier)
  body <- blockExpr
  return (Func name args body)

We have defined a function for the body, blockExpr, which requires the body to be indented strictly greater than the surrounding indentation. Now when we parse test2 we get the following:

parse test2
-- Left "[input]" (line 2, column 2):
-- expecting identifier
--
-- Invalid indentation.
--   Found a token at indentation 1.
--   Expecting a token at an indentation greater than or equal to 2.
--   IndentStream { indentationState =
--                   IndentationState { minIndentation = 2
--                                    , maxIndentation = 9223372036854775807
--                                    , absMode = False
--                                    , tokenRel = Ge}
--                , tokenStream = ""}

localIndentation takes two arguments: how the indentation of an expression should relate to the current indentation, and the expression itself. The relation can be greater-than-or-equal (Ge), strictly greater-than (Gt), equal (Eq), a specific amount (Const 5), or anything (Any).
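To see Any at work, recall that parens wraps its contents in localIndentation Any. Something like the following should therefore still parse, even though the continuation line starts all the way back at column 1, to the left of the def body (test_parens and bar are my own names here, and the result shown is what I would expect rather than output reproduced from the original post):

test_parens = unlines
  [ "def foo(x,y):"
  , "    bar(x,"
  , "y)"
  ]

parse test_parens
-- expected: Right [Func "foo" ["x","y"] (App (Var "bar") [Var "x",Var "y"])]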

While it seems like this is the only primitive you should need, sometimes the indentation level you want can’t be defined in terms of the parent.

For example, the following is accepted as a valid program:

test3 = unlines
  [ "def foo(x, y):"
  , "    add x"
  , " y"
  ]

parse test3
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]

The issue is that “y” is indented greater than the “def”, but we really want it to be indented relative to the “add”. To do this we need to use absolute indentation. This mode says indentation is defined in terms of the first token parsed, and all indentation rules apply relative to where that first token is found.

absBlockExpr :: Parser Expr
absBlockExpr = reservedOp ":" *> localIndentation Gt (absoluteIndentation expr)


def :: Parser Expr
def = do
  reserved "def"
  name <- identifier
  args <- parens (commaSep identifier)
  body <- absBlockExpr
  return (Func name args body)

We define a function absBlockExpr. You'll notice we also kept a localIndentation. The reason for that is that absoluteIndentation normally defaults to the first token of the parent. In our case, that is def, and we want it to choose add instead.
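For contrast, here is a hedged sketch (my own variant, not from the post) of what the body parser would look like without the enclosing localIndentation Gt; following the explanation above, its absolute anchor would default to the first token of the parent, i.e. def, rather than add.

-- Hypothetical variant, for contrast only: without localIndentation Gt the
-- absolute anchor defaults to the parent's first token ("def"), which is
-- not what we want here.
absBlockExpr' :: Parser Expr
absBlockExpr' = reservedOp ":" *> absoluteIndentation expr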

parse test3
-- Left "[input]" (line 3, column 3):
-- expecting identifier
--
-- Invalid indentation.
--   Found a token at indentation 2.
--   Expecting a token at an indentation greater than or equal to 5.
--   IndentStream { indentationState =
--                   IndentationState { minIndentation = 5
--                                    , maxIndentation = 5
--                                    , absMode = False
--                                    , tokenRel = Ge}
--                , tokenStream = ""}

Now it works as expected

test4 = unlines
  [ "def foo(x, y):"
  , "    add x"
  , "     y"
  ]

parse test4
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]
parse test1
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]

This library has other bits to it, but this should be enough to figure out how to add indentation sensitivity to your language.

Special thanks to Aleksey Kliger for helping me understand this library.