Looking at something like https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... , this is very obviously specific to the underlying hardware's control model in all sorts of ways. If the hardware has a gazillion control registers, then the software will have to handle a gazillion registers, and will therefore need several gazillion definitions to do it. The hardware won't get any simpler just because the driver takes a long time to compile.
> Compile times increase superlinearly with program size. How many cycles and developer hours are wasted if this duplication is not necessary?
Care to weigh in on the impact on developer hours and cycles of having to debug problems caused by how the autogenerated headers are created?
If the code is not expected to change and has already been through verification and validation, ensuring that it stays the same is a good way to avoid wasting time chasing bugs in code paths that were already verified.
The other time that you saw it was also probably me. It's from this talk, which is about how a large amount of generated protocol buffer code at Google led to a quadratic increase in compile times: https://youtu.be/rHIkrotSwcc?t=720.
TL;DW: with a distributed build system, your compile time is gated by the file with the longest compile time (tail latency). The more files you have, the greater the chance that one of them takes a long time. Code generation tends to produce more files than handwritten source would.
Most users don't use a distributed build system to compile the kernel, so on further thought, in that case compile times probably scale closer to linear with the number of translation units. But wasted cycles are still wasted cycles, and regardless of how exactly compile times scale, you should still consider the cost of longer compile times when you duplicate code.
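The tail-latency effect is easy to see with a toy simulation (all numbers here are made up; lognormally distributed compile times are just an assumption to give the distribution a tail). A fully parallel build finishes when its slowest job does, so the expected wall-clock time is the expected maximum of N draws, which grows with N:

```python
import random

random.seed(0)

def build_time(n_files, trials=500):
    """Simulated wall clock of a fully parallel build: the slowest of
    n_files independent compile jobs, averaged over many trials.
    Per-file compile times are drawn from a lognormal (heavy-ish tail)."""
    total = 0.0
    for _ in range(trials):
        total += max(random.lognormvariate(0, 1) for _ in range(n_files))
    return total / trials

# Mean wall-clock time keeps climbing as the file count grows, even
# though the per-file distribution never changes.
results = {n: build_time(n) for n in (10, 100, 1000)}
for n, t in results.items():
    print(f"{n:5d} files -> simulated build time {t:.2f}")
```

The per-file cost is identical in all three runs; only the count changes, yet the expected slowest job (and hence the build) gets steadily worse.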
Not knowing the context of the quote, I can guess a few causes:
* Compiler optimizations mostly work on a per-function basis, so that the 'n' in O(n) or O(n²) is the size of a function and not the size of the entire codebase. Not all optimization algorithms are linear, and while good superlinear optimizations should have code threshold cutoffs, in practice these may be omitted. Function sizes follow something like a power law distribution, which means larger code bases contain larger functions that are more expensive to optimize.
* For C/C++, you compile code by pasting all the header definitions into your translation unit, and feeding that to the actual lexer/parser/codegen. Larger projects are probably going to have correspondingly larger headers, and files will probably include more of them, so that the actual translation unit sizes are going to be larger even than the on-disk source files would indicate.
* This is more speculative, but larger projects are also probably likelier to blow out the various caches on your computer, so that the operations you figure are O(1) (such as hashtable lookup!) are actually more like O(lg N) in practice. And as you get larger projects that are less likely to fit any cache, the effect of the non-constant time becomes more apparent.
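The first point can be made concrete with a toy cost model (the sizes and the "one linear pass, one quadratic pass" split are hypothetical, not measurements of any real compiler). Under a power-law size distribution, the handful of largest functions dominate the total cost of any quadratic pass:

```python
# Hypothetical codebase: 1000 functions with power-law-ish sizes
# (a few huge functions, a long tail of small ones).
sizes = [10_000 // k for k in range(1, 1001)]

# Model one linear optimization pass and one quadratic one.
linear_cost = sum(n for n in sizes)
quadratic_cost = sum(n * n for n in sizes)

# Share of the quadratic pass spent on the largest 1% of functions.
top10 = sizes[:10]
share = sum(n * n for n in top10) / quadratic_cost
print(f"largest 1% of functions account for {share:.0%} of the quadratic pass")
```

The linear pass is spread fairly evenly across the codebase, but the quadratic pass is almost entirely spent on the few giant functions — which is exactly why larger codebases (with larger outlier functions) compile disproportionately slowly when size cutoffs are missing.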
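The header-pasting point can be sketched the same way. This toy model (file names and line counts are invented) computes how many lines the compiler actually parses for one translation unit: the source file plus every header reached transitively, with include guards modeled by a visited set:

```python
# Hypothetical include graph: header -> (lines, headers it includes).
headers = {
    "util.h": (300, []),
    "core.h": (800, ["util.h"]),
    "api.h":  (500, ["core.h", "util.h"]),
}

def tu_size(source_lines, includes, seen=None):
    """Lines in the effective translation unit: the source plus every
    transitively included header, each counted once (include guards)."""
    seen = set() if seen is None else seen
    total = source_lines
    for h in includes:
        if h in seen:
            continue
        seen.add(h)
        lines, deps = headers[h]
        total += tu_size(lines, deps, seen)
    return total

# A 200-line .c file that includes api.h really compiles as 1800 lines.
print(tu_size(200, ["api.h"]))
```

A 200-line file on disk turns into a 1800-line translation unit once api.h drags in core.h and util.h — so project-wide header growth inflates every translation unit, not just the files that changed.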