I found the code for Instaparse (relatively) easy to follow.
I had considered leaving a comment here like "hey could you cover combinators and PEGs?", but after thinking it over, it's important to limit the scope for a class like this.
It would be pretty great to offer a "201" edition, covering ALL*, GLR, GLL, combinators/PEGs, Earley, parsing-with-derivatives, Marpa, and anything else I might have forgotten: basically a survey of modern parsing algorithms, which, frankly, LR and LL are not.
But for your own learning, I bet you could take this course, and then spend some time with Instaparse and the GLL paper, and walk away with a solid understanding of GLL in practice.
Great point on combinators, PEG, and GLL -- this could potentially be covered in a 201 as suggested, since it's good to have a foundation in LL/LR first and then gradually move to combinators if needed. LALR(1) covers a pretty wide range of the most practical languages.
To a significant degree, the arrow of causality runs LALR(1) -> practical languages, not the other direction!
The languages and formats we use have been heavily shaped by the practical parsing algorithms of the 20th century. An example: you can't have a struct field called "while" in C, because once the lexer declares a token to be a keyword, that's that.
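A minimal C sketch of that point (the struct and field names are made up for illustration): once the lexer has tagged "while" as a keyword, the parser can never accept it where an identifier is required.

    /* Illustrative only: the C lexer classifies "while" as a keyword,
     * so it is rejected anywhere an identifier is expected. */
    struct loop_state {
        int while;   /* compile error: 'while' is a keyword, not a valid field name */
        int active;  /* fine: 'active' is an ordinary identifier */
    };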
Contextual keywords are a growing thing in modern languages. Modern compilers don't tend to have strictly separated lexers and parsers, but instead use a combined lexer/parser model that feeds information back and forth between them. If your language is designed such that you aren't often in multiple potential parse states, then it's easy to feed into the lexer "get me the next token, and by the way, expect function attributes to be keywords right now."
Note that the requirement to not be in multiple potential parse states also tends to boil down to "build a language that's usually LL(1) or LALR(1)."
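As a rough sketch of what that feedback loop can look like (all names here are invented, not taken from any real compiler): the parser hands its current context to the lexer, and the lexer treats a word like "override" as a keyword only when the parser says it is in attribute position.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical sketch: the parser passes its current context to the
     * lexer, and the lexer classifies a word as a contextual keyword or
     * a plain identifier based on that context. */
    typedef enum { MODE_NORMAL, MODE_EXPECT_ATTRIBUTE } lex_mode;
    typedef enum { TOK_IDENT, TOK_ATTR_KEYWORD } tok_kind;

    static tok_kind classify(const char *word, lex_mode mode) {
        if (mode == MODE_EXPECT_ATTRIBUTE && strcmp(word, "override") == 0)
            return TOK_ATTR_KEYWORD; /* keyword only in attribute position */
        return TOK_IDENT;            /* everywhere else: plain identifier */
    }

    int main(void) {
        /* "override" used as an ordinary name: parser is in normal mode. */
        printf("%d\n", classify("override", MODE_NORMAL));           /* prints 0 */
        /* Same word where the parser expects attributes: keyword now. */
        printf("%d\n", classify("override", MODE_EXPECT_ATTRIBUTE)); /* prints 1 */
        return 0;
    }

This only works cleanly if the parser isn't juggling several possible parse states at that point, which is exactly the LL(1)/LALR(1)-ish property mentioned above.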
TBH, that's not a problem of LALR(1), or any of the other, more old-fashioned methods. I've written an LL(1) parser generator that (generates a function that) parses modern awk, without semicolons, but with operator-less concatenation, and that can deal with tokens that can be either keywords or identifiers (which is also needed for e.g. FORTRAN). As long as your language is deterministic, it can be expressed as an LR grammar, although legibility might suffer.
Yes, to some degree -- the Syntax tool normally supports lexer states, and the same "while" token may mean a keyword or the property/field name of a struct, depending on the state. You can find more details on lexer states in the docs.
Ah okay. Note that GLR isn't exactly modern (1974). However, I would also remove "modern" as a requirement here... what really matters is how good the algorithm is, not how old it is. "Modern" algorithms easily end up being less powerful than the old ones... they just end up becoming popular due to other factors, e.g. simplicity.