SQLite internals: How the most-used database works (compileralchemy.com)
244 points by cube2222 on Dec 17, 2022 | 43 comments


I love how the SQLite author is so committed to the quality of his tools that he _wrote his own text editor_ as part of the project.

In about 99% of cases, for 99% of people, this is the wrong thing to do - but DRH pulled it off.

On the "why use flex/yacc ... when you can roll your own", I think the tradeoff is: if you have a deep understanding of theory of parsing and parser generators, then it's actually less mental effort for you to write your own from scratch, than it would be to learn the exact syntax and quirks and possibly bugs of someone else's implementation. Or to find out that some feature that's easy to implement from scratch, doesn't exist in the library.

I wonder, do we teach "classical" parsing that well anymore, in a way that could create the next generation's DRH?


Every UWaterloo CS student has to take CS241 (a 2nd-year class, 4 months after learning C). Each student must write a tokenizer and parser by hand. The full assignments include about 8 programs in binary and MIPS assembly, a MIPS assembler including support for labels, a scanner for a subset of C, SLR(1) parsing, type checking, and code gen including pointers and dynamic memory allocation.

In my day there was a unit on compiler optimization, with a code size contest: constant propagation, register allocation, dead code elimination. Now it seems they have replaced that with a section involving writing a loader for an ELF-like file format in assembly language.

https://student.cs.uwaterloo.ca/~cs241/assignments/


It was a long time ago, but I'm pretty sure we had similar assignments at utwente. I'm not so sure it's such a must-know thing though. Lexers are fairly trivial to write by hand, and parsers are just annoying without code generation. We don't use tools like yacc because parsers are particularly difficult or complex; we use them because there's just a lot to gain from optimisations and fancy features, and without a generator or very powerful language features parsers will be too much work to maintain.

Frankly, I'd expect someone who's never implemented a compiler before to make the mistake of not using a parser generator because they are daunted by yacc and think their language is not going to need the complexity of an extra compile step.

By the way, if you don't need very fancy features, writing parsers with parser combinators is very easy and fun and doesn't require any metaprogramming tools. You can even write your own little parser combinator library because the concept is so simple.
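For instance, here's a minimal sketch of the idea in Python (hypothetical, just to illustrate the concept - a parser is a function from (text, position) to either a result plus a new position, or None on failure):

    import re

    def lit(s):
        # Match a literal string at the current position.
        def parse(text, pos):
            if text.startswith(s, pos):
                return s, pos + len(s)
            return None
        return parse

    def regex(pattern):
        # Match a regular expression at the current position.
        compiled = re.compile(pattern)
        def parse(text, pos):
            m = compiled.match(text, pos)
            return (m.group(0), m.end()) if m else None
        return parse

    def seq(*parsers):
        # Run parsers one after another, collecting their results.
        def parse(text, pos):
            results = []
            for p in parsers:
                r = p(text, pos)
                if r is None:
                    return None
                value, pos = r
                results.append(value)
            return results, pos
        return parse

    def alt(*parsers):
        # Try alternatives in order; return the first that succeeds.
        def parse(text, pos):
            for p in parsers:
                r = p(text, pos)
                if r is not None:
                    return r
            return None
        return parse

    # Example: a tiny "NAME = NUMBER-or-NAME" assignment parser built from the pieces above.
    ident  = regex(r"[A-Za-z_]\w*")
    number = regex(r"\d+")
    ws     = regex(r"\s*")
    assign = seq(ident, ws, lit("="), ws, alt(number, ident))

    print(assign("answer = 42", 0))   # (['answer', ' ', '=', ' ', '42'], 11)

That's the whole trick; error reporting, backtracking strategy and so on are just layers built on top of functions like these.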


> and without a generator or very powerful language features parsers will be too much work to maintain.

Hand-written recursive descent parsers are quite commonly used in production compilers (e.g. GCC, Rust). See e.g. here: https://softwareengineering.stackexchange.com/questions/2546... They don’t seem to be particularly difficult to maintain.


I wonder if Rust started out hand-written, or whether they switched to it after they had gotten all the mechanisms they wanted in. There might be a point where you're no longer fiddling a lot with your parsing strategy and basic grammar, and from that point on the parser generator becomes less of a necessity and maybe even somewhat of a burden.


I don't think parser generators are ever really a necessity. Writing recursive descent parsers is easy. You are looking at around 10-20k lines of code for a robust parser for a typical programming language. That's nothing compared to all the rest of the code in the compiler.


Is it nothing? I agree it's easy, but why do it if you don't need to? Besides being "nothing" compared to all the rest of the code in the compiler, it's also the least interesting part of it, unless you're aiming for some really fancy programmer-comfort features like Rust and newer versions of GCC are. You can sketch your grammar out in EBNF, just throw that into a parser generator, and bam, you're ready to spit out some bytecode.

And in absolute terms, 10-20k lines of code is never nothing. That's months if not years of coding that has to be maintained and expanded upon as your project grows.


You can't generally just throw EBNF grammars into a parser generator and get good results. For example, YACC will only accept LALR(1) grammars, not any grammar that can be expressed in EBNF. Some newer parser generators are more flexible, but you'll still have to do some wrangling.

Once you get used to writing recursive descent parsers you can bash them out quite quickly. You can write a mostly-working parser for a programming language in a few days. 10-20k lines of code might sound like quite a lot, but it's mostly repetitive code that's easy to get right and easy to write tests for.
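Just to give a feel for the pattern, here's a hypothetical sketch in Python of a tiny expression grammar (one function per grammar rule, peeking at the next token; not taken from any real compiler):

    # expr   -> term (('+' | '-') term)*
    # term   -> factor (('*' | '/') factor)*
    # factor -> NUMBER | '(' expr ')'
    # Each grammar rule becomes one function; the shape repeats rule after rule.

    import re

    def tokenize(src):
        return re.findall(r"\d+|[()+\-*/]", src)

    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def expect(self, tok):
            if self.peek() != tok:
                raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
            self.pos += 1

        def expr(self):
            node = self.term()
            while self.peek() in ("+", "-"):
                op = self.tokens[self.pos]; self.pos += 1
                node = (op, node, self.term())
            return node

        def term(self):
            node = self.factor()
            while self.peek() in ("*", "/"):
                op = self.tokens[self.pos]; self.pos += 1
                node = (op, node, self.factor())
            return node

        def factor(self):
            if self.peek() == "(":
                self.expect("(")
                node = self.expr()
                self.expect(")")
                return node
            tok = self.peek()
            if tok is None or not tok.isdigit():
                raise SyntaxError(f"expected a number, got {tok!r}")
            self.pos += 1
            return int(tok)

    print(Parser(tokenize("1 + 2 * (3 + 4)")).expr())
    # ('+', 1, ('*', 2, ('+', 3, 4)))

A real language just has many more rules following the same template, which is where the line count comes from.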

Recursive descent parsers can often be easier to maintain than parser generator based parsers because you don't have to keep hacking around the limitations of the generator. For example, C and C++ don't permit any kind of clean separation between lexing, parsing and symbol resolution.

Even gcc's C parser and lexer combined are only about 30,000 LOC, depending on which files you count besides the main parser module: https://github.com/gcc-mirror/gcc/blob/master/gcc/c/c-parser... And this is code written in Cish C++. In a higher-level language it could be significantly more concise.


Hey foldr, re the higher-level language point: I'd just like to say that at uwaterloo it was common to use Racket to write the compiler for the above^5 mentioned class (you could use Racket or C).

Is this what you had in mind? Which language is best in your opinion for a parser?


If you're writing a traditional recursive descent parser, it doesn't make much difference (IMHO). Any vaguely modern programming language will work fine.

If you want to use parser combinators, then you want something with basic support for functional programming (particularly closures and a lightweight lambda syntax).


I'm surprised nobody has mentioned ANTLR yet. That's the modern way of doing this.


> if you have a deep understanding of theory of parsing and parser generators

No deep theory needed; the only thing you need to know is recursion.

Also, SQL is even better than that because every query is its own statement, so the parsing is dead simple. I totally get why he did it this way.
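Roughly speaking, the top level only has to look at the first keyword to decide what kind of statement it's dealing with. A hypothetical sketch in Python (not how SQLite itself is structured):

    # Each SQL query is one self-contained statement, so the top-level
    # parser can simply dispatch on the first keyword.

    def parse_statement(sql):
        first = sql.lstrip().split(None, 1)[0].upper()
        handlers = {
            "SELECT": parse_select,
            "INSERT": parse_insert,
            "UPDATE": parse_update,
            "DELETE": parse_delete,
        }
        if first not in handlers:
            raise SyntaxError(f"unknown statement type: {first}")
        return handlers[first](sql)

    # Placeholder per-statement parsers; a real implementation would
    # build a parse tree for each statement type here.
    def parse_select(sql): return ("select", sql)
    def parse_insert(sql): return ("insert", sql)
    def parse_update(sql): return ("update", sql)
    def parse_delete(sql): return ("delete", sql)

    print(parse_statement("SELECT name FROM users WHERE id = 1"))
    # ('select', 'SELECT name FROM users WHERE id = 1')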


Doing is the best way to learn. The mind glosses over gaps in understanding, which are revealed when trying to build.

I think DRH adopted a self-imposed constraint that SQLite would not have any external dependencies. A lot of his decisions are due to this constraint.


> he _wrote his own text editor_ as part of the project.

I’ve not spotted that - and I have a thing for bespoke editors! Could somebody link me up to the editor? A few cursory searches didn’t prove fruitful.


Based on the rest of GP's comment, I think he meant to say "parser" rather than "text editor". And that's referring to lemon: https://en.wikipedia.org/wiki/Lemon_(parser_generator)


Unfortunately, it just says

> And the text editor that I used to write SQLite is one that I wrote myself. [10]

But with no more details.


If you read reference [10] in the book, he says he did not release the source of the editor.


Hm yeah I didn’t, I left the book for when I’m on a computer - thanks for spotting it!

Obviously, now I’m curiouser about the secret editor :)


Oh, I wrote the book ^^, that's why.


Using parser generators is not easier, it's just less work. Recursive descent parsing is very easy and is the best option if you want good error handling, but it's very time-consuming.


> ...than it would be to learn the exact syntax and quirks and possibly bugs of someone else's implementation...

Yup. Also, having deep knowledge of the language is required.

SQLite's grammar is neat, modest. Creating a compatible parser would make a fun project. Here's a pretty good example: https://github.com/bkiers/sqlite-parser (Actual ANTLR 4 grammar: https://github.com/bkiers/sqlite-parser/blob/master/src/main... )

Postgres, which tries to be compliant with the latest standards, however...

SQL-2016 is a beast. Not to mention all the dialects.

I'm updating my personal (soon to be FOSS) SQL DML grammar from ANTLR 3 LL(k) to ANTLR 4 ALL(*).

I've long had a working knowledge of SQL-92, with some SQL-1999 (e.g. common table expressions).

But all the new structures and extensions are a bit overwhelming.

Fortunately, the ANTLR project has a ~dozen FOSS grammars to learn from. https://github.com/antlr/grammars-v4/tree/master/sql

They mostly mechanically translate BNFs to LL(k), with some ALL(*). Meaning few take advantage of left recursion. https://github.com/antlr/antlr4/blob/master/doc/left-recursi...

Honestly, I struggled to understand these grammars. Plus, not being conversant with SQL-2016 was a huge impediment. Just finding a succinct corpus of test cases was a huge hurdle for me.

Fortunately, the H2 Database project is a great resource. https://github.com/h2database/h2database/tree/master/h2/src/...

Now for the exciting conclusion...

My ANTLR grammar, which passes all of H2's tests, diverges significantly from the official or product-specific BNFs, mostly wrt recursion.

Further, I found discrepancies between misc products' BNFs and their implementations.

So a lot of trial & error is required for a "real world" parser, which would explain why the professional SQL parsing tools ask for money.

I still think creating a parser for SQLite is a great project. Hand-made or using grammar toolkit of choice. (I hope to play around with Pratt and PEG parsers some day.)


Thank you for the resources and insights. H2 is a great implementation!


Just a note that DRH's doctorate was in computational linguistics. This is reflected in SQLite in the sense that the part he likes working on the most is the front-end, as opposed to features like concurrency, for example. So, I guess a deep dive into the topic is required.

What the above comments showed in terms of what is taught at unis seems like a great starting point. Since he already knew classical stack-based VMs from the start, this maybe shows he learnt it at uni. Would love to ask him about it!


Really like this:

> Richard liked the idea of a consorsium. He started devising a plan of his own. Luckily someone from the Mozilla foundation reached out to him. They did not like the way he was setting up the framework around the consorsium by giving members voting rights. They proposed keeping the direction of the project in developers hand. The friend from Mozilla being a lawyer was adamant on this point and saw through the implementation of the current setup.


This is highly upvoted but I wouldn't say this document is well-written


Thank you for the feedback. It is pulled from my notes and is an ongoing book aimed at contributing to the project. I have not yet run a full spellcheck on it. The book is open source though, and you can have a go at improving sections.

I would really appreciate making the material available to a wider audience.


A note on styling: the font is rather large, which on my phone means the column is narrow. Full justification is hard to read when you end up with gaps two em dashes wide between words.


I added a "download if display is broken" option, because yes, it needs fixing on phones.


Quick question to the author: do you have any references for the "From First Principles" approach? Or is it more just an informal thing?


See the CoRecursive podcast episode in the references section. The phrase is Adam's, but I find it describes the whole attitude reasonably well.


Cool, thanks.


[flagged]


Are we going to see this under every submission now?

The author comes from a completely different culture than most HN users. English doesn't seem to be their first language. It can be hard to formulate your thoughts without sounding weird to native speakers; I know that feeling perfectly well.


The author could be from a country where English isn't the native language. It's disrespectful to attribute it to AI without any benefit of the doubt.


I am the author - no, it was written by a human!


Apologies - reading it, I got stuck on the same sentence about 3 times and gave up. If you want, I don't mind giving more constructive feedback on what's wrong.


I am open! Since new material is constantly being added, feel free to bookmark it and check back in about a week. You can also raise a pull request!


Hmmm, that's what a human-imitating robot would say... /s


While the first bit is a little stilted at times, similar to some of the ChatGPT stuff we have seen recently, I can't see AI writing the rest of the article. It goes very deep into the technical architecture of SQLite; I don't think AI could do that.

But maybe AI is better than I think.


@samwillis, it is converted from notes I wrote, from a presentation I gave. This did not start as a book. Yes, you are right, I did not pass this through ChatGPT.


No worries, and I completely agree with the other reply suggesting it's disrespectful to comment on others' writing.

I'm dyslexic and have had to cope with people criticising my writing and judging my ability based on simple mistakes. It hurts, and I'm sorry you have had that today.

I haven't had the chance to fully read it yet, but I'm going to later as it's super interesting to me.


Great that you like it! I guess when you're typing fast and your mind is dazed, errors crop up. Re-reading, I saw I wrote "piece of mind" instead of "peace of mind". Thanks for the kind words!


Any chance that presentation was recorded and is public? I would love to watch it.


The one I gave in person (more energetic) is lost forever, but the one I gave to the LibSQL community is up and online:

https://www.youtube.com/watch?v=poA497gSs-4

Edit: The book has more info.



