Hacker News

Every UWaterloo CS student has to take CS241 (a 2nd-year class, 4 months after learning C). Each student must write a tokenizer and parser by hand. The full assignments include about 8 programs in binary and MIPS assembly, a MIPS assembler including support for labels, a scanner for a subset of C, SLR(1) parsing, type checking, and code gen including pointers and dynamic memory allocation.

In my day there was a unit on compiler optimization, with a code size contest: constant propagation, register allocation, dead code elimination. Now it seems they've replaced that with a section on writing a loader for an ELF-like file format in assembly language.

https://student.cs.uwaterloo.ca/~cs241/assignments/



It was a long time ago, but I'm pretty sure we had similar assignments at UTwente. I'm not so sure it's such a must-know thing though. Lexers are fairly trivial to write by hand, and parsers are just annoying without code generation. We don't use tools like yacc because parsers are particularly difficult or complex; we use them because there's just a lot to gain from optimisations and fancy features, and without a generator or very powerful language features, parsers would be too much work to maintain.

Frankly, I'd expect someone who's never implemented a compiler before to make the mistake of not using a parser generator, because they're daunted by yacc and think their language won't need the complexity of an extra compile step.

By the way if you don't need very fancy features, writing parsers with parser combinators is very easy and fun and doesn't require any metaprogramming tools. You can even write your own little parser combinator library because the concept is so simple.
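To show just how simple the concept is, here's a minimal sketch of such a library in Python (hypothetical, not based on any existing package): a parser is just a function that takes input text and returns a (value, rest-of-input) pair, or None on failure.

```python
# A parser: str -> (value, remaining_input) | None

def char(c):
    """Match a single literal character."""
    def p(s):
        return (c, s[1:]) if s.startswith(c) else None
    return p

def seq(*parsers):
    """Run parsers one after another, collecting their results."""
    def p(s):
        values = []
        for parser in parsers:
            r = parser(s)
            if r is None:
                return None
            v, s = r
            values.append(v)
        return (values, s)
    return p

def alt(*parsers):
    """Try alternatives in order, returning the first success."""
    def p(s):
        for parser in parsers:
            r = parser(s)
            if r is not None:
                return r
        return None
    return p

def many(parser):
    """Apply a parser zero or more times."""
    def p(s):
        values = []
        while True:
            r = parser(s)
            if r is None:
                return (values, s)
            v, s = r
            values.append(v)
    return p

digit = alt(*[char(d) for d in "0123456789"])
number = many(digit)

print(number("42abc"))  # → (['4', '2'], 'abc')
```

That's the whole core; everything else (mapping results, error messages, backtracking control) is layered on top of these few functions.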


> and without a generator or very powerful language features parsers will be too much work to maintain.

Hand-written recursive descent parsers are quite commonly used in production compilers (e.g. GCC, Rust). See e.g. here: https://softwareengineering.stackexchange.com/questions/2546... They don’t seem to be particularly difficult to maintain.


I wonder if Rust started out hand-written, or whether they switched to it after they had gotten all the mechanisms they wanted in. There might be a point where you're no longer fiddling a lot with your parsing strategy and basic grammar, and from that point on the parser generator becomes less of a necessity and maybe even somewhat of a burden.


I don't think parser generators are ever really a necessity. Writing recursive descent parsers is easy. You are looking at around 10-20k lines of code for a robust parser for a typical programming language. That's nothing compared to all the rest of the code in the compiler.
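To illustrate how little machinery that takes, here's a toy recursive descent parser for arithmetic expressions, sketched in Python (illustrative only; a real one would add proper error reporting and a token stream abstraction):

```python
# Grammar:
#   expr   := term (('+'|'-') term)*
#   term   := factor (('*'|'/') factor)*
#   factor := NUMBER | '(' expr ')'
import re

def tokenize(src):
    return re.findall(r"\d+|[()+\-*/]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.next(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.next(), node, self.factor())
        return node

    def factor(self):
        if self.peek() == "(":
            self.next()
            node = self.expr()
            assert self.next() == ")", "expected ')'"
            return node
        return int(self.next())

print(Parser(tokenize("1 + 2 * 3")).expr())  # → ('+', 1, ('*', 2, 3))
```

Each nonterminal becomes one method, and precedence falls out of which method calls which, which is why these parsers scale up so mechanically.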


Is it nothing? I agree it's easy, but why do it if you don't need to? Besides being "nothing" compared to all the rest of the code in the compiler, it's also the least interesting part of it, unless you're aiming for some real fancy programmer-comfort features like Rust and newer versions of GCC are. You can sketch your grammar out in EBNF, just throw that into a parser generator, and bam, you're ready to spit out some byte code.

And in absolute terms, 10-20k lines of code is never nothing. That's months if not years of coding that has to be maintained and expanded upon as your project grows.


You can't generally just throw EBNF grammars into a parser generator and get good results. For example, yacc will only accept LALR(1) grammars, not any grammar that can be expressed in EBNF. Some newer parser generators are more flexible, but you'll still have to do some wrangling.

Once you get used to writing recursive descent parsers you can bash them out quite quickly. You can write a mostly-working parser for a programming language in a few days. 10-20k lines of code might sound like quite a lot, but it's mostly repetitive code that's easy to get right and easy to write tests for.

Recursive descent parsers can often be easier to maintain than parser generator based parsers because you don't have to keep hacking around the limitations of the generator. For example, C and C++ don't permit any kind of clean separation between lexing, parsing and symbol resolution.
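The classic illustration is C's "lexer hack": whether `a * b;` is a multiplication or a pointer declaration depends on whether `a` has been typedef'd, so the lexer has to consult the symbol table the parser is building. A toy Python sketch of that feedback (all names hypothetical):

```python
# Typedef names recorded by the parser as it processes declarations.
typedef_names = {"size_t", "uint32_t"}

def classify(word):
    """Tokenize an identifier differently if it names a type."""
    if word in typedef_names:
        return ("TYPE_NAME", word)
    return ("IDENTIFIER", word)

print(classify("size_t"))  # → ('TYPE_NAME', 'size_t')
print(classify("my_var"))  # → ('IDENTIFIER', 'my_var')
```

Fitting this kind of parser-to-lexer feedback into a generated parser's fixed pipeline is exactly the sort of hacking-around the parent comment describes; in a hand-written parser it's just an extra lookup.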

Even gcc's C parser and lexer combined are only about 30,000 LOC, depending on which files you count besides the main parser module: https://github.com/gcc-mirror/gcc/blob/master/gcc/c/c-parser... And this is code written in Cish C++. In a higher-level language it could be significantly more concise.


Hey foldr, re: parsers in a higher-level language, I'd just like to say that at UWaterloo it was common to use Racket to write the compiler for the class mentioned above^5 (you could use Racket or C).

Is this what you had in mind? Which language is best in your opinion for a parser?


If you're writing a traditional recursive descent parser, it doesn't make much difference (IMHO). Any vaguely modern programming language will work fine.

If you want to use parser combinators, then you'll want something with basic support for functional programming (particularly closures and a lightweight lambda syntax).


I'm surprised nobody has mentioned ANTLR yet. That's the modern way of doing this.



