I love how the SQLite author is so committed to the quality of his tools that he _wrote his own text editor_ as part of the project.
In about 99% of cases, for 99% of people, this is the wrong thing to do - but DRH pulled it off.
On the "why use flex/yacc ... when you can roll your own", I think the tradeoff is: if you have a deep understanding of theory of parsing and parser generators, then it's actually less mental effort for you to write your own from scratch, than it would be to learn the exact syntax and quirks and possibly bugs of someone else's implementation. Or to find out that some feature that's easy to implement from scratch, doesn't exist in the library.
I wonder, do we teach "classical" parsing that well anymore, in a way that could create the next generation's DRH?
Every UWaterloo CS student has to take CS241 (a 2nd-year class, 4 months after learning C). Each student must write a tokenizer and parser by hand. The full assignments include about 8 programs in binary and MIPS assembly, a MIPS assembler including support for labels, a scanner for a subset of C, SLR(1) parsing, type checking, and code gen including pointers and dynamic memory allocation.
In my day there was a unit on compiler optimization, with a code size contest: constant propagation, register allocation, dead code elimination. Now it seems they have replaced that with a section involving writing a loader for an ELF-like file format in assembly language.
It was a long time ago, but I'm pretty sure we had similar assignments at utwente. I'm not so sure it's such a must-know thing though. Lexers are fairly trivial to write by hand, and parsers are just annoying to write without code generation. We don't use tools like yacc because parsers are particularly difficult or complex; we use them because there's just a lot to gain from optimisations and fancy features, and without a generator or very powerful language features, parsers are too much work to maintain.
Frankly, I'd expect someone who's never implemented a compiler before to make the mistake of not using a parser generator because they're daunted by yacc and think their language won't need the complexity of an extra compile step.
By the way if you don't need very fancy features, writing parsers with parser combinators is very easy and fun and doesn't require any metaprogramming tools. You can even write your own little parser combinator library because the concept is so simple.
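To make that concrete, here's a minimal sketch of the idea in Python. The combinator names (char, many, seq) and the (value, position) convention are my own illustrative choices, not any particular library's API:

    # A minimal parser combinator sketch. A "parser" is just a function
    # taking (input, position) and returning (value, new_position) or None.

    def char(c):
        """Match a single expected character."""
        def parse(s, i):
            if i < len(s) and s[i] == c:
                return c, i + 1
            return None
        return parse

    def many(p):
        """Match p zero or more times, collecting results."""
        def parse(s, i):
            out = []
            while (r := p(s, i)) is not None:
                v, i = r
                out.append(v)
            return out, i
        return parse

    def seq(*ps):
        """Match each parser in order; fail if any fails."""
        def parse(s, i):
            out = []
            for p in ps:
                r = p(s, i)
                if r is None:
                    return None
                v, i = r
                out.append(v)
            return out, i
        return parse

    # Example: one 'a', then any further 'a's, then 'b'.
    ab = seq(char('a'), many(char('a')), char('b'))
    print(ab("aab", 0))  # (['a', ['a'], 'b'], 3)

A parser here is just a function from (input, position) to a result or None, so combining parsers is ordinary function composition; a real library would add error messages and backtracking control on top of this skeleton.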
I wonder if Rust started out hand-written, or whether they switched to a hand-written parser after they had gotten all the mechanisms they wanted in. There might be a point where you're no longer fiddling a lot with your parsing strategy and basic grammar, and from that point on the parser generator becomes less of a necessity and maybe even somewhat of a burden.
I don't think parser generators are ever really a necessity. Writing recursive descent parsers is easy. You are looking at around 10-20k lines of code for a robust parser for a typical programming language. That's nothing compared to all the rest of the code in the compiler.
Is it nothing? I agree it's easy, but why do it if you don't need to? Besides being "nothing" compared to all the rest of the code in the compiler, it's also the least interesting part of it, unless you're aiming for some really fancy programmer-comfort features like Rust and newer versions of GCC are. You can sketch your grammar out in EBNF, throw that into a parser generator, and bam, you're ready to spit out some bytecode.
And frankly, 10-20k lines of code is never nothing. That's months if not years of coding that has to be maintained and expanded upon as your project grows.
You can't generally just throw EBNF grammars into a parser generator and get good results. For example, YACC will only accept LALR(1) grammars, not any grammar that can be expressed in EBNF. Some newer parser generators are more flexible, but you'll still have to do some wrangling.
Once you get used to writing recursive descent parsers you can bash them out quite quickly. You can write a mostly-working parser for a programming language in a few days. 10-20k lines of code might sound like quite a lot, but it's mostly repetitive code that's easy to get right and easy to write tests for.
Recursive descent parsers can often be easier to maintain than parser generator based parsers because you don't have to keep hacking around the limitations of the generator. For example, C and C++ don't permit any kind of clean separation between lexing, parsing and symbol resolution.
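To illustrate both points (the wrangling and the hand-written alternative), here's a toy sketch in Python. The left-recursive EBNF rule expr = expr ('+'|'-') term | term is fine for an LALR generator but would recurse forever in naive recursive descent, so by hand you rewrite it as iteration. The grammar and names are my own example, not from any real compiler:

    # Recursive descent for:  expr = term (('+'|'-') term)*
    # i.e. the left-recursive rule rewritten as a loop, which also
    # gives the left-associativity the original grammar intends.

    def parse_expr(toks, i):
        left, i = parse_term(toks, i)
        while i < len(toks) and toks[i] in ('+', '-'):
            op = toks[i]
            right, i = parse_term(toks, i + 1)
            left = (op, left, right)
        return left, i

    def parse_term(toks, i):
        # Terms are bare numbers in this sketch.
        return int(toks[i]), i + 1

    print(parse_expr(['1', '-', '2', '-', '3'], 0)[0])
    # ('-', ('-', 1, 2), 3)   left-associative, as intended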
Even gcc's C parser and lexer combined are only about 30,000 LOC, depending on which files you count besides the main parser module: https://github.com/gcc-mirror/gcc/blob/master/gcc/c/c-parser... And this is code written in C-ish C++. In a higher-level language it could be significantly more concise.
Hey foldr, re: higher-level languages for parsers, I'd just like to say that at UWaterloo it was common to use Racket to write the compiler for the CS241 class mentioned above (you could use Racket or C).
Is this what you had in mind? Which language is best in your opinion for a parser?
If you're writing a traditional recursive descent parser, it doesn't make much difference (IMHO). Any vaguely modern programming language will work fine.
If you want to use parser combinators, then you'll want something with basic support for functional programming (particularly closures and a lightweight lambda syntax).
Using parser generators is not easier, it's just less work. Recursive descent parsing is very easy, and it's the best option if you want good error handling; it's just very time-consuming.
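A toy sketch of why error handling comes naturally in recursive descent: each parse function knows exactly what it expected and where, so targeted messages fall out almost for free. The names here are illustrative, not from any real tool:

    # Every call site can report precisely what was expected and where.

    class ParseError(Exception):
        pass

    def expect(toks, i, kind):
        """Consume the expected token or raise a pinpointed error."""
        if i >= len(toks) or toks[i] != kind:
            found = toks[i] if i < len(toks) else 'end of input'
            raise ParseError(f"expected '{kind}' at token {i}, found '{found}'")
        return i + 1

    # e.g. parsing "( 1" with a missing close paren:
    try:
        expect(['(', '1'], 2, ')')
    except ParseError as e:
        print(e)  # expected ')' at token 2, found 'end of input'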
Honestly, I struggled to understand these grammars. Plus, not being conversant with the SQL:2016 standard was a huge impediment. Just finding a succinct corpus of test cases was a huge hurdle for me.
My ANTLR grammar, which passes all of H2's tests, diverges significantly from the official and product-specific BNFs, mostly with respect to recursion.
Further, I found discrepancies between various products' BNFs and their implementations.
So a lot of trial & error is required for a "real world" parser. Which would explain why the professional SQL parsing tools ask for money.
I still think creating a parser for SQLite is a great project. Hand-made or using grammar toolkit of choice. (I hope to play around with Pratt and PEG parsers some day.)
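For anyone else curious about Pratt parsing, the core idea fits in a few lines. This is a toy Python sketch with an illustrative binding-power table (higher binds tighter), not any production implementation:

    # Minimal Pratt (precedence-climbing) expression parser.
    BP = {'+': 10, '-': 10, '*': 20, '/': 20}

    def parse(toks, i=0, min_bp=0):
        left = int(toks[i])          # atoms are numbers in this sketch
        i += 1
        while i < len(toks) and BP.get(toks[i], -1) >= min_bp:
            op = toks[i]
            # The +1 makes operators of equal power left-associative.
            right, i = parse(toks, i + 1, BP[op] + 1)
            left = (op, left, right)
        return left, i

    print(parse('1 + 2 * 3'.split())[0])  # ('+', 1, ('*', 2, 3))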
Just a note that DRH's doctorate was in computational linguistics. This is reflected in SQLite in the sense that the part he likes working on the most is the front-end, as opposed to features like concurrency, for example. So, I guess a deep dive into the topic is required.
The above comments about what is taught at universities seem like great starting points. Since he already knew classical stack-based VMs from the start, maybe that shows he learnt them at university. Would love to ask him about it!
> Richard liked the idea of a consortium. He started devising a plan of his own. Luckily, someone from the Mozilla Foundation reached out to him. They did not like the way he was setting up the framework around the consortium by giving members voting rights. They proposed keeping the direction of the project in the developers' hands. The friend from Mozilla, being a lawyer, was adamant on this point and saw the current setup through to implementation.
Thank you for the feedback. It is pulled from my notes and is an ongoing book aimed at encouraging contribution to the project. I have not yet run a spellcheck on it. The book is open source though, and you can have a go at improving sections.
I would really appreciate making the material available to a wider audience.
A note on styling. The font is rather large; on my phone, this means the column is narrow. Full justification is hard to read when you end up with gaps two ems wide between words.
Are we going to see this under every submission now?
The author comes from a completely different culture than most HN users, and English doesn't seem to be their first language. It can be hard to formulate your thoughts without sounding weird to native speakers; I know that feeling perfectly well.
Apologies, reading it I got stuck on the same sentence about 3 times and gave up. If you want, I don't mind giving more constructive feedback on what's wrong.
While the first bit is a little stilted at times, similar to some of the ChatGPT output we have seen recently, I can't see AI writing the rest of the article. It goes very deep into the technical architecture of SQLite; I don't think AI could do that.
@samwillis, it is converted from notes I wrote for a presentation I gave. This did not start as a book. Yes, you are right, I did not pass this through ChatGPT.
No worries, and I completely agree with the other reply suggesting it's disrespectful to comment on others' writing.
I'm dyslexic and have had to cope with people criticising my writing and judging my ability based on simple mistakes. It hurts, and I'm sorry you have had that today.
I haven't had the chance to fully read it yet, but I'm going to later as it's super interesting to me.
Great that you like it! I guess when you're typing fast and the mind is dazed, errors crop up. Re-reading, I saw I wrote "piece of mind" instead of "peace of mind". Thanks for the kind words!