> Too many compilers classes get bogged down in grammar classifications and parsing.
Oh so very this.
This has repercussions far beyond students who can't write compilers. It infects the entire culture of software engineering with people who think that syntax is everything, and who spend their lives designing grammars and writing parsers for them. The problem with that is for every new grammar, it's not just that you need a new parser, but some human needs to learn that grammar too if it's going to do you any good at all. So we have this proliferation of zillions of grammars, not just in programming languages, but in config files, data formats, etc. Two examples: XML and JSON are bad re-inventions of S-expressions. And the sieve language (which I am currently dealing with because I'm working on a new spam filter) is just an abomination from top to bottom.
I agree about XML, but I don't think JSON is a reinvention of S-expressions. JSON is great if you are representing a lot of (string) key/value data since it has a canonical way of representing string-keyed dictionaries. You can represent these as a list of pairs, but that is ambiguous with data that really is just a list of pairs. JSON doesn't have this problem: if you see {"name": ...} you know you are reading an object, which has a 1-1 mapping to a dictionary with string keys.
While this seems like an innocent change, it simplifies serialization a lot. A generic S-expression deserializer can't make nearly as many assumptions as a JSON deserializer can.
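A minimal Python sketch of that ambiguity (the list-of-pairs encoding here is just one illustrative convention, not any standard):

```python
import json

# A JSON object maps unambiguously to a dict: the {...} syntax
# itself tells the deserializer "string-keyed dictionary".
obj = json.loads('{"name": "Alice", "age": 30}')
assert isinstance(obj, dict)

# A plain s-expression has no dedicated dictionary syntax, so a dict
# must be encoded as, say, a list of pairs -- which is indistinguishable
# from data that really is just a list of two-element lists.
pairs = [["name", "Alice"], ["age", 30]]   # a dict? or a list of pairs?
as_dict = dict((k, v) for k, v in pairs)   # the caller must decide
assert as_dict == {"name": "Alice", "age": 30}
```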
S-expressions have first class support for primitives (i.e. strings, numbers, etc.) and lists but nothing else. On the other hand, JSON has first class support for the same plus string keyed dictionaries. It is no coincidence that most other generic serialization formats have included at least support for string dictionaries if not other key types as well.
Sure, but 1) s-expressions also have native symbols whereas JSON has only strings, and 2) it is trivial to extend standard s-expression syntax to include a native dictionary serialization, and to extend existing s-expression parsers to parse that extension. It is much, much harder to add symbols to JSON. JSON is also very profligate with its use of punctuation, with tons of unnecessary commas and colons all over the place. So I stand by my initial characterization.
Also, Common Lisp s-expressions include a standard syntax for structures, which are essentially dictionaries. Extending that to a generic dictionary (using, say, #D(key value key value ...), which is actually available) takes about three lines of code.
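For illustration only, here is a toy reader in Python rather than Lisp that treats #D(key value ...) as a dictionary form, in the spirit of the extension described above. It is a sketch, not a conforming s-expression parser: atoms stay strings, and there is no quoting, escaping, or error recovery.

```python
import re

# Tokens: the #D( opener, bare parens, or runs of non-space non-paren atoms.
TOKEN = re.compile(r"#D\(|[()]|[^\s()]+")

def parse(text):
    form, rest = read(TOKEN.findall(text))
    assert not rest, "trailing tokens"
    return form

def read(tokens):
    tok, rest = tokens[0], tokens[1:]
    if tok in ("(", "#D("):
        items = []
        while rest[0] != ")":
            item, rest = read(rest)
            items.append(item)
        rest = rest[1:]  # drop the closing ')'
        if tok == "#D(":
            # Pair up alternating keys and values into a dict.
            return dict(zip(items[0::2], items[1::2])), rest
        return items, rest
    return tok, rest  # an atom

assert parse("#D(name Alice age 30)") == {"name": "Alice", "age": "30"}
assert parse("(a b (c d))") == ["a", "b", ["c", "d"]]
```

The point being made above survives the translation: the only change from a plain list reader is the three-line branch for the `#D(` token.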
> 1) s-expressions also have native symbols whereas JSON has only strings,
This is inconsistent with the claim that s-exprs are better because of their simplicity. Having two very similar string-like things is unnecessarily complex for little (no?) benefit in return.
> 2) it is trivial to extend standard s-expression syntax to include a native dictionary serialization
Sure, and when you do you get something at about the level of complexity of JSON, so it's not clear what the value proposition is here.
> JSON is also very profligate with its use of punctuation, with tons of unnecessary commas and colons all over the place. So I stand by my initial characterization.
JSON uses a richer set of delimiters, which provides:
1. Greater brevity. Basic information theory says that it takes fewer characters to encode a given piece of data if your character set is larger.
2. Some level of redundancy. This isn't necessary for data transmission for JSON, but it does make it much easier for parsers to provide good localized error reporting.
3. Easier visual parsing. Most humans have functional eyeballs connected to optical processing neurons dedicated to detecting different shapes. Using characters with a greater variety of shapes to encode structure takes advantage of that.
Don't get me wrong, I like Lisps and s-exprs. But it seems like any time any notation comes up for discussion, a lisper appears to claim how s-exprs are clearly superior. This despite the fact that that notation is decades older than almost every other syntax out there and yet still lost the popularity contest.
I mean, greater brevity in theory, but having used both s-expressions and JSON for config files, I can tell you that the s-expression version ends up noticeably less verbose and noisy in practice.
> This despite the fact that that notation is decades older than almost every other syntax out there and yet still lost the popularity contest.
Beyond some minimal level, quality and popularity aren't all that heavily correlated; something not becoming popular tells us almost nothing about it.
> Having two very similar string-like things is unnecessarily complex for little (no?) benefit in return.
Using symbols (or :keywords) and strings together offers many benefits:
1. Like with JSON's richer syntax for aggregates, omitting the "" makes for much easier visual parsing. Especially if symbols are usually used for keys.
2. Symbols may be suitable for interning, perhaps into a set used to validate input data. Strings usually aren't.
3. Most importantly, with this distinction, symbols can be imbued with particular semantics: they can be commands, references, lookup keys, whatever the program needs. Strings are just data, usually to be displayed somewhere to be read by someone, or to be parsed by something nasty like an interpreter.
The most human-readable data notation I know of is EDN. It combines the best of JSON and s-expressions.
> s-expressions also have native symbols whereas JSON has only strings
I'm not sure “native symbols” are a good thing in an interchange format. If you are serializing constructs from a language (Lisp, Erlang, Ruby) where that's a fundamental type, sure, it's convenient, but that's largely a language implementation detail; from an interchange perspective there's not a lot of reason to distinguish symbols from (possibly format-constrained) strings. That they are immutable (in languages where strings are otherwise mutable) and/or interned doesn't mean anything at the interchange level.
> 1) s-expressions also have native symbols whereas JSON has only strings
The only standardization effort of generalized s-expressions that I know of is http://people.csail.mit.edu/rivest/Sexp.txt, which makes strings and symbols/tokens equivalent.
Most languages do not have a runtime symbol table or the concept of a symbol as a data atom. In languages outside Lisp, what would a symbol even mean?
> extend existing s-expression parsers to parse that extension. It is much, much harder to add symbols to JSON
This sounds like a form of "No True Scotsman" argument. If you're extending S-Exp parsing via Lisp, then you can extend JSON parsing too. Once you add code, then it's not a format. It's whatever you want it to be.
> Extending that to a generic dictionary (using, say, #D(key value key value ...), which is actually available) takes about three lines of code.
Three lines of code in what language? C? C++? Java? It should be obvious you're conflating two things here. Generalized formats vs. a full Lisp ecosystem.
And you need a gazillion things on top of a grammar: error recovery, an IDE, a language server, a formatter, all the memory bugs, all the semantics that isn't described in a formal language (so you can't auto-generate code to check or enforce it), all the binding generators for other languages and linking conventions... Ugh, I wish we had a standard tool for that, like a much, much expanded yacc + interpreter generator + batteries included... There's something with libadalang and its language-agnostic tooling langkit (https://github.com/AdaCore/langkit) that is very seductive (write once, generate much code).
Sorry, I'm trying to find out more but the project is a bit confusing. I found http://www.dotgnu.org/treecc_essay.html, which looks of historical value (nowadays Java code generation / bytecode engineering is used quite often with amazing success). And of course libjit, but I'm at a loss finding out about semantic attributes, functions, typechecking, and the rest.
You need a lot of predefined structure to capture attributes using brackets, much more than you do for dictionaries - which are just ordered pairs in a list when serialized.
As an informally trained writer of compilers, I believe parsing could be reduced to just these concepts:
1. Greedy matching
2. Minimal matching
3. Tokenization
Regular expressions, and implementing a subset of their operators, are a good way to introduce and discuss each; recursive descent is the elaboration on that, the graduation to a customized mechanism. The last step is to generalize it all to a form of constraint logic and introduce other forms of backtracking (e.g. pathfinding algorithms). Discussion of types follows from discussion of constraints and explains "smartness" in compilers, so it might be the last thing if I were teaching the course.
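A sketch of those three concepts, plus the recursive-descent "graduation", using nothing but Python's re module (the tiny expression grammar is invented purely for illustration):

```python
import re

text = "<a><b>"
# 1. Greedy matching: .* consumes as much input as it can.
assert re.match(r"<.*>", text).group() == "<a><b>"
# 2. Minimal (lazy) matching: .*? consumes as little as it can.
assert re.match(r"<.*?>", text).group() == "<a>"

# 3. Tokenization: turn a character stream into a token stream.
TOKEN = re.compile(r"\d+|[-+*()]")
def tokenize(src):
    return TOKEN.findall(src)

# Recursive descent as the customized mechanism, for the toy grammar:
#   expr := term (('+'|'-') term)*    term := NUMBER | '(' expr ')'
def parse_expr(toks, i=0):
    value, i = parse_term(toks, i)
    while i < len(toks) and toks[i] in "+-":
        op, (rhs, i) = toks[i], parse_term(toks, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, i

def parse_term(toks, i):
    if toks[i] == "(":
        value, i = parse_expr(toks, i + 1)
        return value, i + 1  # skip the closing ')'
    return int(toks[i]), i + 1

assert parse_expr(tokenize("1+(4-2)"))[0] == 3
```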
What I really think is at issue with our vast number of grammars is just the reliance on text as the interface. If we discuss parsing as a thing applicable to arbitrary data streams, we can start asking "well, what if we don't serialize it to characters, but to some other token format?" That is way more interesting today, since we seem to have dispensed with the premise that natural language is the way to program computers, which is what really got the whole parsing thing started.