Frequently Asked Questions#
This section answers common questions about KernelGenBench.
Installation#
Q: Which Python version do I need?#
A: Python 3.10 or higher is required.
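A quick way to confirm the interpreter meets this requirement before installing is sketched below. The helper name `python_is_supported` is illustrative and is not part of KernelGenBench.

```python
import sys

# Minimum version stated in this FAQ.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


# Report the result for the interpreter running this script.
status = "OK" if python_is_supported() else "too old"
print(f"Python {sys.version.split()[0]}: {status}")
```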
Q: Can I use KernelGenBench on CPU-only machines?#
A: No. KernelGenBench verifies the generated Triton kernels by executing them, so actual GPU hardware is required.
Q: Why does vLLM installation fail on non-NVIDIA platforms?#
A: vLLM is designed for NVIDIA GPUs. On non-NVIDIA platforms, torch and triton are pre-installed in vendor container images. Do NOT install vLLM on these platforms; the ATen dataset is used automatically instead.
Q: How do I install Claude Code CLI for the Agent Track?#
A: Run npm install -g @anthropic-ai/claude-code. You’ll also need an Anthropic API key set via export ANTHROPIC_API_KEY=your_key.
Usage#
Q: What’s the difference between LLM Track and Agent Track?#
A:
LLM Track: Tests direct kernel generation without execution feedback. Lower cost, suitable for comparing base model capabilities.
Agent Track: Tests iterative generation with execution feedback. Higher cost but better results, suitable for production-ready kernel generation.
Q: How do I test a single operator?#
A: Use the --op-name parameter for the LLM Track, or pass the operator name to test_ops.sh for the Agent Track:
# LLM Track
python scripts/generate_kernel_and_verify.py --op-name aten::add --single-test
# Agent Track
cd agent_bench && bash test_ops.sh add --device-count 1
Q: Which dataset should I use?#
A:
NVIDIA GPUs: Use KernelGenBench (210 operators) for full evaluation
Non-NVIDIA platforms: Use KernelGenBench-aten (110 operators), which is auto-selected
Specific focus: Use KernelGenBench-vllm for inference kernels or KernelGenBench-cublas for linear algebra
Q: How long does a full benchmark take?#
A:
LLM Track (Pass@5, 210 operators): ~6-12 hours depending on model and hardware
Agent Track (Claude Code, 210 operators): ~24-48 hours depending on operator complexity
Q: How can I reduce evaluation time?#
A:
Use --debug mode (only 8 operators) for testing
Increase --device-count for parallel verification
Use smaller datasets (KernelGenBench-aten instead of full)
Use LLM Track instead of Agent Track
Results#
Q: What does accuracy mean?#
A: Accuracy is the percentage of operators where at least one generated kernel passes all test cases and anti-hack checks.
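The definition can be illustrated with a minimal sketch. The input layout (a dict mapping each operator to a list of per-kernel pass flags) and the sample values are assumptions made for this example; they are not the benchmark's actual result format.

```python
def accuracy(results):
    """Percentage of operators with at least one passing kernel.

    results: dict mapping operator name -> list of booleans, one per
    generated kernel (True = passed all test cases and anti-hack checks).
    This layout is hypothetical.
    """
    if not results:
        return 0.0
    solved = sum(1 for attempts in results.values() if any(attempts))
    return 100.0 * solved / len(results)


# Made-up outcomes for four operators with three attempts each.
example = {
    "aten::add": [False, True, False],
    "aten::mul": [False, False, False],
    "aten::relu": [True, True, True],
    "aten::softmax": [False, False, True],
}
print(f"{accuracy(example):.1f}%")  # prints "75.0%" (3 of 4 operators solved)
```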
Q: What does speedup mean?#
A: Speedup is the geometric mean of (baseline time / generated kernel time) across operators. A speedup > 1.0× means the generated kernel is faster than the baseline.
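As a worked illustration, the sketch below computes a geometric-mean speedup. It assumes the per-operator ratio is baseline time divided by generated kernel time, so that values above 1.0 mean the generated kernel is faster. The timings are made up, and which operators the benchmark includes in the mean is not specified here.

```python
import math


def geomean_speedup(timings):
    """Geometric mean of baseline_time / generated_time.

    timings: list of (baseline_ms, generated_ms) pairs, one per operator.
    """
    ratios = [baseline / generated for baseline, generated in timings]
    if not ratios:
        raise ValueError("at least one timing pair is required")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical timings: 2x faster, 2x slower, 4x faster.
timings = [(2.0, 1.0), (1.0, 2.0), (4.0, 1.0)]
print(f"{geomean_speedup(timings):.2f}x")  # prints "1.59x"
```

The geometric mean is used rather than the arithmetic mean so that a 2x speedup and a 2x slowdown cancel out to exactly 1.0.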
Q: Why is my speedup less than 1.0×?#
A: Generated kernels may not always outperform optimized baselines. This is expected, especially for:
cuBLAS operators (highly optimized over decades)
Complex vLLM operators
Operators on immature non-NVIDIA platforms
Q: Where are my results saved?#
A:
LLM Track: output/pass_at_k/<timestamp>/
Agent Track: agent_bench/runs/<method>_<dataset>_<timestamp>/
Errors#
Q: Why do I get “CUDA out of memory” errors?#
A: Reduce --device-count or use smaller batch sizes. Some operators require significant GPU memory.
Q: Why do generated kernels fail verification?#
A: Common reasons:
Numerical precision mismatch (tolerance too strict)
Edge cases not handled in kernel logic
Memory access violations
Shape/dtype mismatches
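To see how a tolerance setting can turn a nearly correct kernel into a failure, here is a minimal pure-Python sketch of an elementwise closeness test. It uses the same `|actual - expected| <= atol + rtol * |expected|` rule as `numpy.allclose`. The tolerance values are illustrative assumptions, not the ones KernelGenBench applies.

```python
def within_tolerance(actual, expected, rtol=1e-3, atol=1e-5):
    """Return True if every element is close to its reference value."""
    if len(actual) != len(expected):
        return False  # a length mismatch fails immediately
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


reference = [1.0, 2.0, 3.0]
candidate = [1.0005, 2.0, 3.0]  # first element is off by about 5e-4

print(within_tolerance(candidate, reference))             # prints True
print(within_tolerance(candidate, reference, rtol=1e-4))  # prints False
```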
Q: Why does the anti-hack check fail?#
A: The generated kernel may be calling blacklisted APIs instead of implementing the actual computation. Check the kernel code for:
Direct calls to torch.ops.aten.*
Imports of vllm or ctypes
Any bypass of Triton computation
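When inspecting a kernel by hand, a simple static scan can flag the first two patterns. The sketch below uses Python's standard `ast` module. It is not KernelGenBench's actual anti-hack checker, the function name `find_violations` is hypothetical, and it will miss aliased or dynamically constructed calls.

```python
import ast

BLOCKED_MODULES = {"vllm", "ctypes"}  # imports named in this FAQ
BLOCKED_ATTR = "torch.ops.aten"       # prefix of direct ATen calls


def find_violations(source):
    """Return a sorted list of suspicious constructs in kernel source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                violations.append(f"line {node.lineno}: from {node.module} import ...")
        elif isinstance(node, ast.Attribute):
            # Match the exact torch.ops.aten node so each use is reported once.
            if ast.unparse(node) == BLOCKED_ATTR:
                violations.append(f"line {node.lineno}: {BLOCKED_ATTR}.*")
    return sorted(violations)


sample = (
    "import triton\n"
    "import ctypes\n"
    "from vllm import ops\n"
    "\n"
    "def add(x, y):\n"
    "    return torch.ops.aten.add(x, y)\n"
)
for item in find_violations(sample):
    print(item)
```

The source is only parsed, never executed, so the scan is safe to run on untrusted generated code.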
Platform-Specific#
Q: How do I run on Ascend NPU?#
A: Install Ascend dependencies and run in the vendor container image:
pip install -r requirements/requirements_ascend.txt
pip install -e .
# Framework will auto-detect Ascend hardware
Q: Why is accuracy lower on non-NVIDIA platforms?#
A: Non-NVIDIA platforms have:
Less mature Triton compilers
Incomplete backend support
Different memory models
Different performance characteristics
This is expected and demonstrates the cross-platform portability challenge.
Q: Can I use custom hardware?#
A: Yes, you can extend KernelGenBench for new platforms by:
Adding device detection in src/runtime/
Creating platform-specific templates in agent_bench/templates/
Adding platform-specific tolerances
Cost#
Q: How much does evaluation cost?#
A: Costs depend on:
Method (Pass@1 < Pass@5 < Claude Code < AKO4ALL)
Number of operators
Model choice
For reference, the full KernelGenBench evaluation consumed 15+ billion tokens.
Q: How can I estimate costs before running?#
A: Start with --debug mode (8 operators) to measure token consumption per operator, then extrapolate.
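The extrapolation is a simple scaling, sketched below. The per-operator token counts are made-up placeholders; substitute the numbers from your own debug run. Because the 8 debug operators may not be representative of the full set, treat the result as a rough estimate.

```python
def extrapolate_tokens(debug_tokens_per_op, total_operators):
    """Scale the mean tokens per operator from a debug run to a full run."""
    if not debug_tokens_per_op:
        raise ValueError("need at least one measurement")
    mean_tokens = sum(debug_tokens_per_op) / len(debug_tokens_per_op)
    return mean_tokens * total_operators


# Hypothetical token counts for the 8 operators in --debug mode.
debug_run = [120_000, 80_000, 100_000, 95_000,
             105_000, 110_000, 90_000, 100_000]

estimate = extrapolate_tokens(debug_run, total_operators=210)
print(f"{estimate:,.0f} tokens")  # prints "21,000,000 tokens"
```

Multiply the estimated token count by your provider's per-token price to obtain a cost figure.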
Contributing#
Q: How do I add a new operator?#
A: See CONTRIBUTING.md for detailed instructions on adding test cases.
Q: How do I add a new agent method?#
A: Create a new directory in agent_bench/methods/ following the existing structure.