
With your IDE example you need the full parser and type checker to be "tolerant". For recursive-descent parsing, there isn't much to say about theory. You try to pick reliable synchronization points and prevent cascading errors. Here's the classic example:

In a statement-oriented language like C#, synchronizing to the next statement upon finding an error by scanning for a semicolon token is a good place to start. (Statements like 'if' with nested statement blocks as arguments have their own sync logic.) Aside from having reliable sync points, statements (unlike expressions) don't have a type that you need to propagate, so you don't usually have to worry about cascading errors in the type checker from skipping a faulty statement. That said, if you skip a faulty statement that was the sole reference to a local variable, that might get flagged as an 'unreferenced variable' warning. It's often a good idea to disable sensitive warnings (and some errors) for the rest of the function as soon as an error is found.
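The semicolon-sync idea can be sketched in a few lines. This is a minimal, hypothetical recursive-descent fragment (the toy statement grammar and names are made up for illustration): on a parse error, skip tokens until the next ';' and resume, so one bad statement produces one diagnostic instead of a cascade.

```python
def parse_statement_list(tokens):
    """tokens: list of token strings; returns (statements, errors)."""
    stmts, errors, i = [], [], 0
    while i < len(tokens):
        try:
            stmt, i = parse_statement(tokens, i)
            stmts.append(stmt)
        except SyntaxError as e:
            errors.append(str(e))
            # Panic-mode sync: scan forward for ';' so the error
            # doesn't cascade into the following statements.
            while i < len(tokens) and tokens[i] != ";":
                i += 1
            i += 1  # consume the ';' itself

    return stmts, errors

def parse_statement(tokens, i):
    # Toy "statement" for the sketch: NAME '=' NAME ';'
    if i + 3 < len(tokens) and tokens[i + 1] == "=" and tokens[i + 3] == ";":
        return (tokens[i], tokens[i + 2]), i + 4
    raise SyntaxError(f"bad statement at token {i}")
```

A real parser would also treat tokens that can begin a statement ('if', 'while', '{', ...) as sync points, so recovery doesn't overshoot when a semicolon was the thing that went missing.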

Probably the most important sync point is at the level of symbol declarations. Even if a function's body is totally botched, as long as you can successfully resolve the function's signature (name, parameter types and return type) you don't get cascading errors from other functions that reference that function. Resyncing to top-level declarations is so important that if you're designing the syntax it's worth having a dedicated declaration keyword (especially for functions) which is only valid at top level. That way you have a reliable sync point even if everything else is out of whack (unbalanced braces, etc). Something like Go's 'func' keyword is close enough: it can appear in function literals and function types as 'func' followed by '(' but 'func' followed by a name is only valid in top-level declarations.

Fine-grained error recovery for expressions is the biggest problem from both a parsing and type checking perspective. There aren't any reliable expression sync points in a C#-like language, and you need to fabricate best-effort types for the faulty expressions (with an 'error' type as a fallback) and make sure that the various operators for combining expressions have heuristics so you don't get spurious cascading errors. It generally involves numerous special cases to get good results. If you have a choice in the matter, don't worry about expression error recovery: report the error, sync to the next statement and go for the lower-hanging fruit instead. In my experience it isn't worth it in a statement-oriented language.
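The 'error' type fallback is worth a tiny sketch. The checker shape and type names below are hypothetical; the point is the heuristic: once an operand already has the error type, operators stay quiet and propagate it instead of reporting again.

```python
ERROR, INT, STRING = "error", "int", "string"

def check_binary_add(left_type, right_type, report):
    """Type-check 'a + b'; report takes a diagnostic message string."""
    if ERROR in (left_type, right_type):
        # Swallow: the root cause was already reported upstream,
        # so re-reporting here would just be a cascading error.
        return ERROR
    if left_type == right_type == INT:
        return INT
    if left_type == right_type == STRING:
        return STRING
    report(f"cannot add {left_type} and {right_type}")
    return ERROR
```

Used on a chain like '(x + y) + z' where 'x + y' was already faulty, the outer '+' produces no second diagnostic.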

If you want to eliminate as many spurious warnings/errors like this as possible during error recovery, you do end up adding many special cases over time. But it's an incremental process and you can get good results immediately with basic recovery techniques.

Various comments:

On the lexer side of things, if you get to design the syntax it helps to avoid multi-line lexemes in the common cases. That's why I cringed a little when I learned that Rust's "..." string literals are multi-line. It's fine to have multi-line lexemes like /* ... */ comments in C and """...""" string literals in Python but make the default choices be single-line lexemes like "..." and // ... so you can sync to the newline.
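Here's a minimal sketch of why single-line string literals help (hypothetical lexer fragment, not from any real implementation): an unterminated "..." can be closed at the newline with one local error, instead of swallowing the rest of the file until the next stray quote.

```python
def lex_string(source, i, report):
    """i points at the opening quote; returns (lexeme, next_index)."""
    j = i + 1
    # Single-line rule: a string may not span a newline.
    while j < len(source) and source[j] not in '"\n':
        j += 1
    if j < len(source) and source[j] == '"':
        return source[i:j + 1], j + 1
    report("unterminated string literal")
    return source[i:j], j  # sync at the newline (or EOF)
```

With multi-line literals as the default, the same error would turn everything up to the next quote character into one giant bogus token.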

A benefit of a syntax with indentation-defined block structure is that you don't need to rely on balanced grouping tokens like { ... }, so it's easy and reliable to sync to the outer block levels. (In Python there's a caveat that the lexer suspends indentation tracking when the ([{ nesting level is nonzero, but it's still a robust heuristic even when recovering from an error in a nested state.) In particular, if you require top-level declarations to be at column 0 this avoids the need for dedicated keywords for reliable declaration resync. Then the only thing you have to worry about are unbalanced multi-line lexemes gobbling up chunks of your programs.

While I talked about error recovery, a lot of these syntactic properties help with fast symbol indexing. E.g. it's easy to write a fast symbol indexer when you can just sync to "\n<letter>" for top-level declarations and otherwise only need to worry about rare multi-line lexemes. This lets your outer scan loop avoid the byte-at-a-time bottleneck.
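A sketch of that fast outer scan loop (names are illustrative): instead of lexing byte-at-a-time, jump between newline-then-letter positions with str.find, which runs in optimized native code, and only drop into the slow path at those candidate offsets.

```python
def scan_top_level_starts(source):
    """Yield offsets of lines that start with a letter (candidate decls)."""
    if source and source[0].isalpha():
        yield 0
    pos = 0
    while True:
        # Fast skip: let the runtime's substring search find the next
        # newline rather than inspecting every byte in Python code.
        pos = source.find("\n", pos)
        if pos == -1:
            return
        pos += 1
        if pos < len(source) and source[pos].isalpha():
            yield pos
```

A real indexer would then check each candidate for a declaration keyword and handle the rare multi-line lexemes; the win is that the common case never looks at individual bytes.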



Great details, thanks!

> A benefit of a syntax with indentation-defined block structure is that you don't need to rely on balanced grouping tokens like { ... }

In fact, from the lexer's perspective there isn't a big difference: the matching INDENT/DEDENT pair is the same kind of token that { and } would be.


Indeed, the difference is that the lexer offers guarantees about the synthetic INDENT/DEDENT tokens. From an error sync perspective, the benefit is that the programmer (redundantly) re-asserts the block level on every line through its indentation. As a small addendum on Python's suspension of indentation tracking when nesting > 0: when I designed the syntax for indentation-based block structure in another language, I required such nested code to always be indented beyond the current block's level even though no INDENT/DEDENT/NEWLINE tokens are emitted in this state. So this was legal:

    x = (1 +
        2)

    x = (1 +
            2)

    x = (1 +
      2)

    x = (1 +
     2)
But this was illegal:

    x = (1 +
    2)
The legal variants are all identical to

    x = (1 + 2)
from the parser's perspective. Adding this restriction (which is already the idiomatic way to indent nested multi-line expressions) means that you can reliably sync to block levels even when recovering from an error in a nested state. If your lexer already strips leading indentation from multi-line string literals you could add a similar constraint for them.
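That continuation-indentation constraint is easy to sketch as a standalone check (simplified for illustration: it counts brackets naively and ignores string literals): inside brackets, every non-blank line must be indented beyond the enclosing block's level.

```python
def check_continuation_indent(lines, block_indent):
    """Return a list of (line_no, message) violations."""
    errors = []
    nesting = 0
    for n, line in enumerate(lines, 1):
        if nesting > 0 and line.strip():
            indent = len(line) - len(line.lstrip(" "))
            if indent <= block_indent:
                errors.append((n, "continuation line must be indented "
                                  "beyond the enclosing block"))
        # Naive bracket tracking; a real lexer would skip brackets
        # inside string literals and comments.
        for ch in line:
            if ch in "([{":
                nesting += 1
            elif ch in ")]}":
                nesting -= 1
    return errors
```

Because every continuation line is forced deeper than the block, a line at or below the block's indentation is a guaranteed sync point even while the lexer is in a nested (bracketed) state.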

The moral of a lot of these tricks is that by turning idioms and conventions into language-enforced constraints you can detect programmer errors more reliably and do a better job of error recovery. That said, even in a curly brace language like C# you could still use the indentation structure as a heuristic guide for error recovery--it's just going to be less reliable.


Yeah, this makes sense, thanks.



