Note that this paper appears to have a critical flaw: the authors didn't actually verify that their "optimized" kernels produce the same output as the original kernels.
So some of their "optimized" kernels gamed the system by computing only part of the actual result.
Look at their most "improved" (147x faster!) kernel for example: https://pub.sakana.ai/ai-cuda-engineer/kernel/1/15/optimize-...
If you read the code carefully, and in particular the kernel launch parameters, you will see that it only computes the first row of the matrix, which means it's generating incorrect results.
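To make the failure mode concrete, here is a rough sketch in plain Python, not the paper's code. It assumes a matrix multiply as a stand-in operation, and `launch_matmul`, `rows_launched`, and `outputs_match` are hypothetical names I made up to mimic a launch that only covers some of the rows. The point is just that comparing the full output against a reference catches this immediately:

```python
import math

def reference_matmul(a, b):
    """Plain reference implementation: every output element is computed."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def launch_matmul(a, b, rows_launched):
    """Simulate a launch that only assigns work to `rows_launched` rows.

    Hypothetical stand-in for a grid that is too small: rows without a
    worker keep whatever the output buffer held (zeros here).
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(min(rows_launched, n)):
        for j in range(m):
            out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out

def outputs_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Compare shapes and every element, not just a sample of them."""
    if len(expected) != len(actual):
        return False
    for row_e, row_a in zip(expected, actual):
        if len(row_e) != len(row_a):
            return False
        for e, x in zip(row_e, row_a):
            if not math.isclose(e, x, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
expected = reference_matmul(a, b)

full = launch_matmul(a, b, rows_launched=2)
first_row_only = launch_matmul(a, b, rows_launched=1)

print(outputs_match(expected, full))            # True
print(outputs_match(expected, first_row_only))  # False
```

The first-row-only version does a fraction of the work, so of course it looks much faster. Any speedup number should only count if a check like this passes first.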