A crash in a NIF circumvents the entire supervisor architecture that makes failing fast work so well in Erlang/Elixir. The idea is that a crash never pulls down the whole system, only the parts that are directly linked to it. So in the original telecommunications use case, you'd happily fail a single call if something unexpected happens, but that should never take down the entire telecom switch and all the other calls with it.
A bad NIF crashes the entire VM and breaks the Erlang/Elixir model of failure handling, so you really should not put code that crashes in there.
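To make the contrast concrete, here is a minimal sketch of the fail-fast side (the module and message names are mine, just for illustration): a one_for_one supervisor restarts only the crashed call-handling process, while its siblings and the rest of the VM keep running. A segfaulting NIF bypasses all of this and takes the whole node down.

```erlang
-module(call_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_call_handler/0, call_loop/0]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: only the crashed child is restarted; siblings,
    %% other calls and the rest of the VM are untouched.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => call_handler,
              start => {?MODULE, start_call_handler, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

start_call_handler() ->
    {ok, spawn_link(?MODULE, call_loop, [])}.

call_loop() ->
    receive
        {call, Msg} when is_binary(Msg) ->
            %% handle the expected case, then keep going
            call_loop();
        Other ->
            %% fail fast: anything outside the spec crashes this process
            %% only, and the supervisor restarts it
            exit({unexpected_message, Other})
    end.
```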
Yes, mostly; when an allocation fails, for the most part there is nothing reasonable to do other than produce a nice crash dump, especially from the point of view of a language/toolkit.
As an operator/user, you can certainly write some code to scan for large processes and kill them, with a fair bit of success, but that's not a great thing for the toolkit to do itself: it doesn't have a good way to tell what's too big from what's intended.
Nowadays in Erlang you can set a memory limit per process[1], ensuring a process gets killed if it tries to allocate more. Of course you still have to estimate how much is reasonable, which isn't always obvious...
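For reference, this is done with the `max_heap_size` process flag (available since OTP 19). A minimal sketch, with a made-up module name and a deliberately runaway worker:

```erlang
%% A minimal sketch: spawn a worker under a per-process heap limit.
%% The size is in machine words, and the right number is
%% workload-specific -- you still have to estimate it yourself.
-module(limited).
-export([start/0]).

start() ->
    Limit = #{size => 1000000, kill => true, error_logger => true},
    %% this worker blows past the limit, so the VM kills it and logs the
    %% event, while the caller and the rest of the node keep running
    spawn_opt(fun() -> lists:seq(1, 100000000) end,
              [{max_heap_size, Limit}]).
    %% inside an existing process the same limit can be set with
    %% erlang:process_flag(max_heap_size, Limit).
```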
Actually, that's not exactly what Joe Armstrong envisioned when he created Erlang, the system.
If you read his 2003 thesis[1], on page 36 he quotes a paper from Jim Gray (who worked at Tandem Computers and was a Turing Award winner)
Although compiler checking and exception handling provided by programming languages are real assets, history seems to have favored the run-time checks plus the process approach to fault-containment. It has the virtue of simplicity—if a process or its processor misbehaves, stop it
and then on page 37
The idea of “fail-fast” modules is mirrored in our guidelines for programming where we say that processes should only do what they are supposed to do according to the specification, otherwise they should crash.
and then on page 40
Error handling is non-local.
When we make a fault-tolerant system we need at least two physically separated computers. Using a single computer will not work, if it crashes, all is lost. The simplest fault-tolerant system we can imagine has exactly two computers, if one computer crashes, then the other computer should take over what the first computer was doing. In this simple situation even the software for fault-recovery must be non-local; the error occurs on the first machine, but is corrected by software running on the second machine.
So actually the VM crash is a signal: something's gone really bad, don't even try to recover, just let it die and let some other non-local process take over (the OS, some `forever`-like script, a process on another machine...).
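Erlang even ships a built-in version of that non-local watcher: start the node with `erl -heart -env HEART_COMMAND "..."` and the `heart` program, which runs outside the VM as a separate OS process, re-runs that command whenever the VM dies or stops responding. A minimal sketch, with a hypothetical restart script path:

```erlang
%% assuming the node was started with the -heart flag; the script path
%% below is hypothetical -- heart re-runs it if the VM dies or locks up
1> heart:set_cmd("/usr/local/bin/restart_node").
ok
```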