
It's difficult to do because of how well matched they are to the hardware we have. They were partially designed to solve the mismatch between RNNs and GPUs, and they are way too good at it. If you come up with something truly new, it's quite likely you have to influence hardware makers to help scale your idea. That makes any new idea fundamentally coupled to hardware, and that's the lesson we should be taking from this: work on the idea as a simultaneous synthesis of hardware and software. But it also means that fundamental change happens on the scale of decades.

I get the impulse to do something new, to be radically different and stand out, especially when everyone is obsessing over transformers, but we are going to be stuck with them for a while.



This is backwards. Algorithms that can be parallelized are inherently superior, independent of the hardware. GPUs were built to take advantage of that superiority and to handle all kinds of parallel algorithms well - graphics, scientific simulation, signal processing, some financial calculations, and on and on.

There’s a reason so much engineering effort has gone into speculative execution, pipelining, multicore design, etc. - parallelism is universally good. Even when “computers” were human calculators, work was divided into independent chunks that could be done simultaneously. The efficiency comes from the math itself, not from the hardware it happens to run on.

RNNs are not parallelizable by nature. Each step depends on the output of the previous one. Transformers removed that sequential bottleneck.
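For concreteness, here's a toy sketch of that dependency structure (NumPy, made-up shapes and random weights, not any particular model): the recurrent update has to run as a serial loop because each state needs the previous one, while the attention mixing is a single batched matrix product over all positions at once.

    import numpy as np

    T, d = 6, 4                       # sequence length, hidden size
    x = np.random.randn(T, d)         # input sequence

    # RNN-style recurrence: step t cannot start until step t-1 is done,
    # so the T steps form a serial chain no matter how many cores you have.
    W, U = np.random.randn(d, d), np.random.randn(d, d)
    h = np.zeros(d)
    rnn_states = []
    for t in range(T):
        h = np.tanh(x[t] @ W + h @ U)   # depends on the previous h
        rnn_states.append(h)

    # Attention-style mixing: every position attends to every other position
    # in one batched matrix product, so all T outputs are computed together.
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)                              # (T, T), all pairs at once
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    attn_out = weights @ V                                     # (T, d), no step-by-step chain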


There are large classes of parallel workloads that GPUs can't run fast. Anything sparse (or even just shuffled) is one example. There are lots of architectures that are theoretically superior but aren't popular because they aren't GPU friendly.


That’s not a flaw in parallelism. The mathematical reality remains that independent operations scale better than sequential ones. Even if we were stuck with current CPU designs, transformers would have won out over RNNs.

Unless you are pushing back on my phrase "all kinds" - if so, I meant "all kinds" in the way someone might say "there are all kinds of animals in the forest"; it just means "lots of types".


I was pushing back against "all kinds". The reason is that I've been seeing a number of inherently parallel architectures, but existing GPUs don't like some aspect of them (usually the memory access pattern).
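To make the memory-access point concrete, here's a rough sketch (NumPy, made-up sizes and index pattern): both computations below are fully parallel arithmetically, but the second gathers rows through an arbitrary index array, so on a GPU it tends to be limited by scattered, uncoalesced memory reads rather than by compute.

    import numpy as np

    n, d = 100_000, 64
    table = np.random.randn(n, d)
    proj = np.random.randn(d, d)

    # Contiguous access: consecutive rows, the streaming pattern
    # GPU memory systems are designed around.
    dense_out = table[:1024] @ proj

    # Shuffled / sparse access: same arithmetic, but the rows are
    # pulled through an arbitrary index array, so the reads scatter
    # across memory instead of streaming through it.
    idx = np.random.permutation(n)[:1024]
    sparse_out = table[idx] @ proj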


yeah, bad writing on my part.


When you consider hardware-software co-design, the problem quits being an algorithms problem and becomes a computer engineering problem.



