Benchmark Results

Key findings from KernelGenBench evaluation experiments.

Multi-Source Results

Evaluation of 210 operators from three sources (ATen, vLLM, cuBLAS) on an NVIDIA A100.

Finding	Details
Highest accuracy	Claude Code (Opus-4.6) achieved 87%
Highest speedup	AutoKernel (Qwen3.5) achieved 1.02×
Most challenging	cuBLAS operators, for every method evaluated

Multi-Chip Results

Cross-platform evaluation of 110 ATen operators on 6 hardware platforms.

Cross-platform Performance

Finding	Details
Platform variance	Generation performance varies significantly across hardware
Cross-platform degradation	AutoKernel dropped from 87% (NVIDIA) to 25% (Platform E)
Compiler maturity impact	Non-NVIDIA platforms require at least 2× the tokens and time needed on NVIDIA
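
To put the cross-platform degradation row in perspective, the drop can be expressed both in percentage points and relative to the NVIDIA baseline. The sketch below is only arithmetic on the two figures quoted in the table. The helper name `score_drop` is hypothetical and is not part of KernelGenBench.

```python
def score_drop(baseline_pct: float, target_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = baseline_pct - target_pct
    relative = absolute / baseline_pct
    return absolute, relative


# Figures from the table: AutoKernel, 87% on NVIDIA vs 25% on Platform E.
abs_drop, rel_drop = score_drop(87.0, 25.0)
print(f"{abs_drop:.0f} percentage points, {rel_drop:.1%} relative")
# prints "62 percentage points, 71.3% relative"
```

Reporting both numbers avoids ambiguity: a "62-point drop" and a "71% relative drop" describe the same result.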

Warning

Large-scale agent evaluations may consume billions of tokens. Plan your budget accordingly.
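
A rough way to plan a budget is to multiply operators by platforms by an average token cost per operator, and to scale non-NVIDIA platforms by the 2× factor noted above. The sketch below is a back-of-the-envelope estimate under stated assumptions: `estimate_tokens` is a hypothetical helper, the per-operator cost of 1,000,000 tokens is a placeholder rather than a measured value, and exactly one of the 6 platforms is assumed to be NVIDIA.

```python
def estimate_tokens(
    num_operators: int,
    num_platforms: int,
    tokens_per_operator: int,
    non_nvidia_multiplier: float = 2.0,  # lower bound taken from the "2× or more" finding
    nvidia_platforms: int = 1,           # assumption: one NVIDIA platform in the set
) -> int:
    """Estimate total tokens for one evaluation sweep across all platforms."""
    other_platforms = num_platforms - nvidia_platforms
    nvidia_cost = num_operators * nvidia_platforms * tokens_per_operator
    other_cost = (
        num_operators * other_platforms * tokens_per_operator * non_nvidia_multiplier
    )
    return int(nvidia_cost + other_cost)


# 110 ATen operators on 6 platforms; the per-operator cost is a placeholder.
total = estimate_tokens(110, 6, tokens_per_operator=1_000_000)
print(f"{total:,} tokens")  # prints "1,210,000,000 tokens"
```

Replace the placeholder with a per-operator cost measured on a small pilot run before committing to a full sweep. Because the multiplier is a lower bound, treat the result as a floor.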