§ evals · storyline
ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)
The ParallelKernelBench benchmark shows that frontier LLMs solve fewer than a third of multi-GPU CUDA kernel tasks, though their kernels occasionally outperform public implementations.
ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads.
The best model solves fewer than a third of them, but a few generated kernels beat any public implementation.
§ sources · 1 publication · timeline below