Examples

Commands for common LLM Track use cases.

Quick Verification

Check that the setup works by running a single test on one operator (aten::add) for one round:

python scripts/generate_kernel_and_verify.py \
    --op-name aten::add \
    --single-test \
    --server-type openai \
    --model-name gpt-4o \
    --max-rounds 1

Pass@1 Evaluation

Evaluate single-shot generation:

python scripts/generate_kernel_and_verify.py \
    --server-type openai \
    --model-name gpt-4o \
    --max-rounds 1 \
    --temperature 0

Pass@5 Evaluation

Evaluate best-of-5 generation:

python scripts/generate_kernel_and_verify.py \
    --server-type openai \
    --model-name gpt-4o \
    --max-rounds 5 \
    --temperature 0.8
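
For reference when comparing the Pass@1 and Pass@5 runs above, here is a minimal sketch of the widely used unbiased pass@k estimator: given n samples per operator, of which c pass verification, it estimates the chance that at least one of k samples passes. This is an assumption about what "Pass@k" means here; this page does not state how the scripts compute the metric, and `pass_at_k` and `mean_pass_at_k` are hypothetical helpers, not part of the repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for one operator
    c: samples that passed verification
    k: sampling budget being evaluated (k <= n)
    Hypothetical helper; not taken from the repository.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over operators, given one (n, c) pair per operator."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k equal to n, the estimate reduces to "did any sample pass": it is 1.0 if c is at least 1 and 0.0 otherwise.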

Cross-Platform Testing

On non-NVIDIA hardware:

# Dataset automatically set to KernelGenBench-aten
python scripts/generate_kernel_and_verify.py \
    --server-type openai \
    --model-name gpt-4o

Specific Operator Families

Test all GEMM variants:

python scripts/generate_kernel_and_verify.py \
    --dataset KernelGenBench-cublas \
    --server-type openai \
    --model-name gpt-4o

Debugging and Iteration

Start in debug mode, which tests only 8 operators, before committing to a full run:

# Test 8 operators
python scripts/generate_kernel_and_verify.py \
    --debug \
    --server-type openai

# If successful, run full benchmark
python scripts/generate_kernel_and_verify.py \
    --server-type openai
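
The debug-then-full workflow above can be chained so that the full benchmark starts only when the debug run succeeds. The sketch below assumes the script exits with a non-zero status on failure, which this page does not confirm; `build_command` and `debug_then_full` are hypothetical helpers, and only flags shown above are used.

```python
import subprocess

SCRIPT = "scripts/generate_kernel_and_verify.py"


def build_command(debug: bool) -> list[str]:
    """Build the argument list for a debug or full run."""
    cmd = ["python", SCRIPT]
    if debug:
        cmd.append("--debug")
    cmd += ["--server-type", "openai"]
    return cmd


def debug_then_full(run=subprocess.run) -> bool:
    """Run the debug pass, then the full benchmark only if it succeeded.

    Assumption: a non-zero exit status signals failure.
    `run` is injectable so the gating logic can be exercised without
    launching the real script.
    """
    debug_result = run(build_command(debug=True))
    if debug_result.returncode != 0:
        return False  # skip the expensive full run
    full_result = run(build_command(debug=False))
    return full_result.returncode == 0
```

Calling `debug_then_full()` from the repository root launches both stages in sequence; passing a stub as `run` lets the control flow be checked offline.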

Analyzing Results

Pass a run's output directory to the analysis script:

python scripts/analyze/analyze.py output/pass_at_k/<run_dir>/
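
When several runs accumulate, it can help to pick the newest one automatically. The sketch below assumes each run is a subdirectory of output/pass_at_k/ and that modification time reflects run order; both are assumptions, and `latest_run_dir` and `analyze_command` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(root: str = "output/pass_at_k") -> Optional[Path]:
    """Return the most recently modified subdirectory of root, or None."""
    base = Path(root)
    if not base.is_dir():
        return None
    runs = [p for p in base.iterdir() if p.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda p: p.stat().st_mtime)


def analyze_command(run_dir: Path) -> list[str]:
    """Build the analyze.py invocation for the given run directory."""
    return ["python", "scripts/analyze/analyze.py", str(run_dir)]
```

The returned list can be passed to `subprocess.run` once `latest_run_dir()` has returned a directory rather than `None`.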

Expected Results

Based on KernelGenBench experiments (NVIDIA A100):

Method               Accuracy (210 ops)
Pass@1 (Opus-4.6)    41%
Pass@5 (Opus-4.6)    57%
Pass@1 (GPT-4o)      ~35%
Pass@5 (GPT-4o)      ~50%
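
To read these percentages as operator counts, multiply by the 210-operator total. The sketch below does this with integer arithmetic. Because the reported figures are rounded (and the GPT-4o ones are approximate), the resulting counts are estimates, not reported results; `approx_ops_passed` is a hypothetical helper.

```python
TOTAL_OPS = 210  # benchmark size from the table header

# Accuracy percentages copied from the table above.
ACCURACY_PCT = {
    "Pass@1 (Opus-4.6)": 41,
    "Pass@5 (Opus-4.6)": 57,
    "Pass@1 (GPT-4o)": 35,
    "Pass@5 (GPT-4o)": 50,
}


def approx_ops_passed(pct: int, total: int = TOTAL_OPS) -> int:
    """Nearest whole number of operators for a percentage (halves round up)."""
    return (pct * total + 50) // 100


for method, pct in ACCURACY_PCT.items():
    print(f"{method}: about {approx_ops_passed(pct)} of {TOTAL_OPS} operators")
```

For example, 41% of 210 works out to roughly 86 operators.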