
In a bringing-a-tank-to-a-knife-fight kind of way, could this be optimized to run on a GPU? Load the contents, do an "and" (a whitespace comparison) across the whole buffer in parallel, and then sum the matches?
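A minimal sketch of that idea in Python with CuPy (my own illustration, assuming a CUDA-capable GPU; not code from any repo mentioned here): copy the bytes to GPU memory, compare every byte against the whitespace set in parallel, then reduce the matches to a single count.

    import numpy as np
    import cupy as cp

    # The six ASCII whitespace bytes that wc-style counting usually cares about.
    WHITESPACE = cp.asarray(np.frombuffer(b" \t\n\r\x0b\x0c", dtype=np.uint8))

    def count_whitespace_gpu(path: str) -> int:
        with open(path, "rb") as f:
            host = np.frombuffer(f.read(), dtype=np.uint8)
        device = cp.asarray(host)           # copy host memory -> GPU memory
        mask = cp.isin(device, WHITESPACE)  # compare all bytes in parallel
        return int(mask.sum())              # sum the matches on the GPU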


These benchmarks are on 92-million-byte files, so we're into the range where bringing a tank is fair (and worth the startup cost).


I doubt you can make it faster on the GPU than on a CPU using SIMD, because the work done per byte is close to trivial. You'd be transferring the data from CPU memory to GPU memory only to do almost nothing with it.
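For contrast, a rough CPU-side sketch of the same pass with NumPy (again a hypothetical illustration): the element-wise comparison and reduction run in vectorized C loops (SIMD where available), and the data never leaves host memory.

    import numpy as np

    WHITESPACE = np.frombuffer(b" \t\n\r\x0b\x0c", dtype=np.uint8)

    def count_whitespace_cpu(path: str) -> int:
        with open(path, "rb") as f:
            data = np.frombuffer(f.read(), dtype=np.uint8)
        # One vectorized pass over the bytes, entirely in host memory.
        return int(np.isin(data, WHITESPACE).sum())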


I've got it working on a T4 via Google Colab. The PDF takes 178 milliseconds versus the 206 listed in the readme for the C version, so about 14% faster?

https://github.com/fragmede/wc-gpu/blob/main/wc_gpu.ipynb


It's only at a limit like that if you don't parallelize. And sure, you could use more cores, but you can go a lot faster on 20% of a GPU than on 20% of your CPU cores.


I got nerd sniped into doing it. https://github.com/fragmede/wc-gpu



