It's not especially difficult to write a parallel array sum in CUDA, which is C++ with a couple of keywords bolted on. Haven't done that in a bit, but I wrote a SIMD hsum not long ago without much difficulty either.
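For a sense of scale, here's roughly what a SIMD horizontal sum looks like. This is a minimal sketch, not the code I actually wrote: the name `hsum` is made up for illustration, it assumes x86 with SSE2 (falling back to a plain scalar loop elsewhere), and it makes no attempt at unrolling or alignment tricks.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE/SSE2 intrinsics
#endif

// Sum n floats. `hsum` is a hypothetical helper name used only for this sketch.
float hsum(const float* data, std::size_t n) {
    std::size_t i = 0;
    float total = 0.0f;
#if defined(__SSE2__)
    // Vertical phase: accumulate four independent lane-wise partial sums.
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    // Horizontal phase: fold the four lanes down to one.
    __m128 hi   = _mm_movehl_ps(acc, acc);            // lanes 2,3 moved into 0,1
    __m128 sum2 = _mm_add_ps(acc, hi);                // lane0 = a0+a2, lane1 = a1+a3
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, 0x55);  // broadcast lane 1
    total = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));   // (a0+a2) + (a1+a3)
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i < n; ++i)
        total += data[i];
    return total;
}
```

Note that the vector path adds elements in a different order than a naive loop, so results can differ in the last bits for general floating-point input.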

C was of course originally designed for the PDP-11, but neither the standard nor the implementations have assumed that at any time this century. It would be quite a stretch to say that thread local storage, atomics, the weird restrictions on pointers to deal with segmented architectures, IEEE floats, and other "modern" additions have anything to do with PDP-11s. And obviously you can take C/C++ code and efficiently build it for a wildly different architecture, like you do every time you use a compiler (including NVCC).

I'm not even saying that C is the fastest possible language because it really shouldn't be. What I'm saying is that decades of HLL advocates saying that we just need a sufficiently smart compiler to beat C have failed to produce one. C-family languages remain the gold standard for performance, and there's not much that even reliably competes beyond Rust and Fortran. Fortran is also an interesting example of a "low level" language without many of the bad ideas of C that ends up not much faster these days.