Hackle's blog
between the abstractions we want and the abstractions we get.

# Parsing with Parsec: or, optional, retry and back-tracking

"Advent Of Code" is a lot of fun! To me, in no small part it's due to parsing, particularly when combining a few list functions doesn't cut it.

Of course, this should be done with Parsec (or its comparable), not regular expressions, yikes.

Let's start with a simple example, the input for the day 4 challenge, "Camp Cleanup".

``````2-4,6-8
2-3,4-5
5-7,7-9
2-8,3-7
6-6,4-6
2-6,4-8
``````

Sure, this would be done simply with something like `bimap (splitOn '-') (splitOn '-') (splitOn ',' "2-4,6-8")` barring any smarter concoctions that are destined to exist in abundance. Or, if we use `Parsec`.

``````type Range = (Int, Int)
type Parser a = Parsec String () a  -- and alias to save us some typing

int :: Parser Int
int = read @Int <\$> many1 digit

around :: Parser a -> Parser b -> Parser (a, a)
around p sep = do
r1 <- p
sep
r2 <- p
return (r1, r2)

range :: Parser Range
range = int `around` (char '-')

rangePair :: Parser (Range, Range)
rangePair = range `around` (char ',')

pairs :: Parser [(Range, Range)]
pairs = rangePair `sepBy1` endOfLine
``````

(There is small problem with `rangePair `sepBy1` endOfLine` that's not immediately obvious, read on!)

A reasonable first response would be: this is a lot of code, even with the type annotation stripped away!

However, due to the compositional nature of parser combinators, one of the obvious advantages is how structured the code looks like: `range = int `around` (char '-')` leaves nothing to confusion. It's not hard to see such clarity can be vital in dealing with more complex input, such as the Day-7 challenge "No Space Left On Device"

``````\$ cd /
\$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
\$ cd a
\$ ls
dir e
29116 f
2557 g
62596 h.lst
\$ cd e
\$ ls
584 i
\$ cd ..
\$ cd ..
\$ cd d
\$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
``````

This really justifies the use of a proper parser library such as `Parsec` for its dreaded statefulness, which would be a handful for any (especially) smart regular expressions.

Luckily this was not a problem as `Parsec` incorporates `State`, so I could `getState`, `putState` or `modifyState` pretty conveniently. Quite the power tool.

However, the problem I ran into was not that advanced, although quite infamous. When parsing the commands such as `\$ ls` or `\$ cd ..`, my code looks like this,

``````cd = do
string "\$ ls" <* endOfLine
-- ... puState for current paths
ls = do
string "\$ ls" <* endOfLine
(file <|> dir) `sepBy` endOfLine    -- problematic!

command = ls <|> cd
``````

When parsing the CLI output above, this parser gives the following error,

``````Left (line 7, column 1):
unexpected '\$'
expecting digit, "dir ", new-line or end of input
``````

a.k.a. the dreaded "back-tracking" problem, but how?

## Back-Tracking

We'll take it from the top again, with a happy-path parser.

``````ghci> parse (string "abc" <> string "def") "(any source)" "abcdef"
Right "abcdef"
``````

For "either / or" situations, the `<|>` combinator is handy.

``````ghci> parse (string "abc" <|> string "def") "(any source)" "def"
Right "def"
``````

Then it gets interesting: what if two parsers share the same prefix, such as `"alice" and "alba"`?

``````ghci> parse (string "alice" <|> string "alba") "(any source)" "alba"
Left (line 1, column 1):
unexpected "b"
expecting "alice"
``````

Ah, we get an error, which is confusing - the code clearly instructs to "parse either an 'alice' or an 'alba'", but it fails as soon as "alice" fails, without trying "alba". This is not how "or" normally works. How come?

This has to do with how `Parsec` consumes input char by char. For `string "alice"`, it consumes `"al"`, and fails at `'b'`; despite its "or" semantics, `<|>` does not back-track to the beginning for `"alba"`. Like many, I found this baffling if not also frustrating.

This creates the need for "back-tracking", for the parser to go back by two letters and parse `"alba"` from the beginning. To tell `Parsec` to do that, we use `try`.

``````ghci> parse (try (string "alice") <|> string "alba") "(any source)" "alba"
Right "alba"
``````

`try (string "alice")` will not consume any input in the case of failure, it starts over for `string "alba"` by going backwards, hence "back-tracking".

This behaviour may also be described as "to tell `Parsec` to undo parsing `string "alice"`". However I find this view imperative, and does not accurately reflect how `try` (or parser combinators in general) works. More of this at the end.

## Shared Separator

This issue may lay in ambush. For example, separators.

Again we start with the happy path: digits separated by commas. `sepBy` seems the perfect combinator for this scenario.

``````ghci> parse (digit `sepBy` char ',') "(any source)" "1,2,3,9"
Right "1239"
``````

Pretty intuitive stuff. What about "comma-separated digits followed by letters"?

``````ghci> parse (digit `sepBy` char ',' <> letter `sepBy` char ',') "(any source)" "1,2,3,9,a,b,c,d"
Left (line 1, column 9):
unexpected "a"
expecting digit
``````

It took me a fair bit of head scratching to figure out this is the same "back-tracking" issue, only this time the perpetrator is `sepBy`. After consuming `9`, `parse` goes on to eat up `,` happily, which is specified as the separator. Fair and square! However, by the time it reaches `a` and fails, it's "too late"!

Put differently, my choice of parsers results in ambiguity: `char ','` is used as separator for two consecutive sequences, one of digits and another of letters.

How do we tell `Parsec` to back out of its greedy behaviour?

`try` is no good here because we need both `digit`s and `letter`s; one workaround is to combine `letter` and `digit`, but it allows digits and letters to arrive out of order.

``````ghci> parse ((digit <|> letter) `sepBy` char ',') "(any source)" "1,2,3,9,a,b,c,d"
Right "1239abcd"
ghci> parse ((digit <|> letter) `sepBy` char ',') "(any source)" "1,a,2,b,3,c,9,d"
Right "1a2b3c9d"    -- out of order, not what we want
``````

Or there is `optional`, according to the documentation,

optional p tries to apply parser p. It will parse p or nothing. It only fails if p fails after consuming input. It discards the result of p.

``````ghci> parse (many (digit <* optional (char ',')) <> many (letter <* optional (char ','))) "(any source)" "1,2,3,9,a,b,c,d"
Right "1239abcd"
``````

Hurrah! Although there is still a catch...

## Optional, one and only

I am terrible at RTFM but the description of `optional` does scare me a little.

It only fails if p fails after consuming input.

Indeed, `optional` can still lead to back-tracking problems. If the separator is more than one character, such as `->`, then failure at the second character spells trouble.

``````ghci> parse (many (digit <* optional (string "->")) <> many (letter <* optional (string "->"))) "" "1->2->3->9->a->b->c->d"Right "1239abcd"

ghci> parse (many (digit <* optional (string "->")) <> string "-" <>  many (letter <* optional (string "->"))) "" "1->2->3->9-a->b->c->d"
Left (line 1, column 11):
unexpected "a"
expecting "->"
``````

See what trips it up? It's the `-` between the digits and letters. `Parsec` will try to get a `->` but fails after `-`, back-tracking required!

Applying the `try` trick fixes the problem.

``````ghci> parse (many (digit <* optional (try (string "->"))) <> string "-" <>  many (letter <* optional (string "->"))) "" "1->2->3->9-a->b->c->d"
Right "1239-abcd"
``````

Verbosity is the price we pay for precision and clarity. Of course, any code suited more than the REPL should be formatted for better reading experience.

## Don't "try" too hard

It would be obvious but worth calling out, using "try" too liberally will not only result in noisy code and confusing errors, but also performance issues. Consider,

``````ghci> parse (try (string "Hello World!") <|> string "Hello Space!") "(any source)" "Hello Space!"
Right "Hello Space!"
``````

Beautiful when it works, and when it doesn't? Utter confusion.

``````ghci> parse (try (string "Hello World!") <|> string "Hello Space!") "(any source)" "Hello Spase!"   -- spa(s)e
Left (line 1, column 1):
unexpected "s"
expecting "Hello Space!"
``````

The error message includes the whole string in `expecting "Hello Space!"`, it could take some eye-balling to find where exactly the parser fails (at letter `"s"`). People may blame `Parsec` for the inaccurate error message, but really, I am to blame for making ambiguous parsers!

Ambiguity arises when `"Hello World!" <|> "Hello Space!"` share the prefix `"Hello "`, but if we shuffle the words around, it needs not necessarily be the case: `"Hello " <> (World!" <|> "Space!")`.

Using the knowledge to my advantage, `try` isn't even needed!

``````ghci> parse (string "Hello " <> (string "World!" <|> string "Space!")) "(any source)" "Hello Space!"
Right "Hello Space!"

ghci> parse (string "Hello " <> (string "World!" <|> string "Space!")) "(any source)" "Hello Spase!"
Left (line 1, column 7):
unexpected "s"
expecting "Space!"
``````

Granted, the error message is still not accurate to the letter, but at least it's more localised to "spase".

The lesson learned: design the parsers around ambiguity, and use `try` judiciously, maybe only as the last resort.

## What back-tracking?

How does `try` work exactly? I mentioned previously there is no imperative operation such as "undo" or "putting characters back to the input". But how else could it be?

The definition of `try` can be found here

``````try :: ParsecT s u m a -> ParsecT s u m a
try p =
ParsecT \$ \s cok _ eok eerr ->  -- note the suspicious _
unParser p s cok eerr eok eerr  -- eerr is used twice, first time in the position of _, what is eerr?
``````

Whereas `ParsecT` is quite a handful, pay attention to the comments,

``````newtype ParsecT s u m a
= ParsecT {unParser :: forall b .
State s u
-> (a -> State s u -> ParseError -> m b) -- consumed ok
-> (ParseError -> m b)                   -- consumed err
-> (a -> State s u -> ParseError -> m b) -- empty ok
-> (ParseError -> m b)                   -- empty err
-> m b
}
``````

Do you see it? A pretty neat trick - `try` will ignore any "consumed err" and replace it with "empty err", which by the name does not consume any input. Nothing is "put back"; the parser being tried (`p` in `try p`) is simply ignored.