A crash in a NIF circumvents the entire supervisor architecture that makes failing fast work so well in Erlang/Elixir. The idea is that a crash never pulls down the whole system, only the parts that are directly linked to it. So in the original telecommunications use case, you'd happily fail a single call if something unexpected happens, but that should never take down the entire telecom switch and all the other calls with it.
A bad NIF crashes the entire VM and breaks the Erlang/Elixir model of failure handling, so you really should not put code that crashes in there.
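To make the contrast concrete, here is a minimal sketch of the fail-fast side (the module and message names are mine, just for illustration): a one_for_one supervisor restarts only the crashed call-handling process, while its siblings and the rest of the VM keep running. A segfaulting NIF bypasses all of this and takes the whole node down.

```erlang
-module(call_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_call_handler/0, call_loop/0]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: only the crashed child is restarted; siblings,
    %% other calls and the rest of the VM are untouched.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => call_handler,
              start => {?MODULE, start_call_handler, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

start_call_handler() ->
    {ok, spawn_link(?MODULE, call_loop, [])}.

call_loop() ->
    receive
        {call, Msg} when is_binary(Msg) ->
            %% handle the expected case, then keep going
            call_loop();
        Other ->
            %% fail fast: anything outside the spec crashes this process
            %% only, and the supervisor restarts it
            exit({unexpected_message, Other})
    end.
```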
Yes, mostly; when an allocation fails, for the most part there is nothing reasonable to do other than produce a nice crash dump, especially from the point of view of a language/toolkit.
As an operator/user, you can certainly write some code to scan for large processes and kill them, with a fair bit of success, but that's not a great thing for the toolkit to do itself: it doesn't have a good way to tell what's too big from what's intended.
Nowadays in Erlang you can set a memory limit per process[1], ensuring a process gets killed if it tries to allocate more. Of course you still have to estimate how much is reasonable, which isn't always obvious...
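For reference, this is done with the `max_heap_size` process flag (available since OTP 19). A minimal sketch, with a made-up module name and a deliberately runaway worker:

```erlang
%% A minimal sketch: spawn a worker under a per-process heap limit.
%% The size is in machine words, and the right number is
%% workload-specific -- you still have to estimate it yourself.
-module(limited).
-export([start/0]).

start() ->
    Limit = #{size => 1000000, kill => true, error_logger => true},
    %% this worker blows past the limit, so the VM kills it and logs the
    %% event, while the caller and the rest of the node keep running
    spawn_opt(fun() -> lists:seq(1, 100000000) end,
              [{max_heap_size, Limit}]).
    %% inside an existing process the same limit can be set with
    %% erlang:process_flag(max_heap_size, Limit).
```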
Actually, that's not exactly what Joe Armstrong envisioned when he created Erlang, the system.
If you read his 2003 thesis[1], on page 36 he quotes a paper from Jim Gray (who worked at Tandem Computers and was a Turing Award winner)
Although compiler checking and exception handling provided by programming languages are real assets, history seems to have favored the run-time checks plus the process approach to fault-containment. It has the virtue of simplicity—if a process or its processor misbehaves, stop it
and then on page 37
The idea of “fail-fast” modules is mirrored in our guidelines for programming where we say that processes should only do what they are supposed to do according to the specification, otherwise they should crash.
and then on page 40
Error handling is non-local.
When we make a fault-tolerant system we need at least two physically separated computers. Using a single computer will not work, if it crashes, all is lost. The simplest fault-tolerant system we can imagine has exactly two computers, if one computer crashes, then the other computer should take over what the first computer was doing. In this simple situation even the software for fault-recovery must be non-local; the error occurs on the first machine, but is corrected by software running on the second machine.
So actually the VM crash is a signal: something's gone really bad, don't even try to recover, just let it die and let some other non-local process take over (the OS, some `forever`-like script, a process on another machine...).
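Erlang even ships a built-in version of that non-local watcher: start the node with `erl -heart -env HEART_COMMAND "..."` and the `heart` program, which runs outside the VM as a separate OS process, re-runs that command whenever the VM dies or stops responding. A minimal sketch, with a hypothetical restart script path:

```erlang
%% assuming the node was started with the -heart flag; the script path
%% below is hypothetical -- heart re-runs it if the VM dies or locks up
1> heart:set_cmd("/usr/local/bin/restart_node").
ok
```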