Glossary

This section defines technical terminology used throughout the KernelGenBench documentation.

Agent

A system that autonomously generates, executes, and iterates on code based on feedback. In KernelGenBench, coding agents such as Claude Code and OpenCode debug and optimize kernels through execution-driven feedback.

ATen

PyTorch’s native tensor library, providing fundamental operations for deep learning. KernelGenBench includes 110 ATen operators derived from real model training traces.

CUDA

NVIDIA’s proprietary parallel computing platform and programming model for GPU acceleration. CUDA is deeply tied to NVIDIA hardware architecture.

cuBLAS

NVIDIA’s closed-source Basic Linear Algebra Subprograms library, highly optimized for NVIDIA GPUs. KernelGenBench includes 50 cuBLAS operators representing extreme performance challenges.

GEMM

General Matrix Multiplication, a fundamental linear algebra operation. cuBLAS includes numerous GEMM variants across different precisions and batching modes.
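For readers unfamiliar with the operation, the standard BLAS form of GEMM computes C ← α·A·B + β·C. The sketch below is a plain-Python reference for that formula only; the function name `gemm` and the list-of-rows matrix layout are illustrative assumptions, not part of KernelGenBench or cuBLAS, and it ignores the transpose, precision, and batching options that real GEMM variants expose.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    Hypothetical reference helper: a is (m, k), b is (k, n), c is (m, n).
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (m, n)")
    return [
        [
            # Dot product of row i of a with column j of b, then scale and add.
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 0.5, c)
print(result)  # [[38.5, 44.5], [86.5, 100.5]]
```

A reference like this is useful only for checking correctness on small inputs; an optimized GPU kernel would tile the computation rather than loop element by element.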

Kernel

A function that executes on a GPU or other accelerator, written in a language such as CUDA or Triton. Kernels directly determine computational performance and must be optimized for specific hardware.

KernelGenBench

A comprehensive benchmark framework for evaluating LLM and agent-based Triton kernel generation across multiple hardware platforms. Part of the FlagOS ecosystem.

KernelGenBench-aten

A dataset subset containing 110 PyTorch ATen operators, used for cross-platform evaluation on all supported hardware.

KernelGenBench-cublas

A dataset subset containing 50 cuBLAS operators, available only on NVIDIA platforms due to library dependencies.

KernelGenBench-nocublas

A dataset subset containing 160 operators (110 ATen + 50 vLLM), used for NVIDIA evaluation without a cuBLAS dependency.

KernelGenBench-MS

The Multi-Source sub-benchmark evaluating 210 operators from three sources (ATen, vLLM, cuBLAS) on NVIDIA hardware.

KernelGenBench-MC

The Multi-Chip sub-benchmark evaluating 110 ATen operators across six hardware platforms to measure performance portability.

KernelGenBench-vllm

A dataset subset containing 50 vLLM operators, available only on NVIDIA platforms.
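The operator counts in the subset entries above follow from the three source counts (110 ATen, 50 vLLM, 50 cuBLAS). The sketch below makes that arithmetic explicit; the dictionary names and the `subset_size` helper are hypothetical and do not correspond to any KernelGenBench API.

```python
# Operator counts per source, as stated in this glossary.
SOURCE_OPERATOR_COUNTS = {"aten": 110, "vllm": 50, "cublas": 50}

# Which sources each subset or sub-benchmark draws from (illustrative mapping).
SUBSET_SOURCES = {
    "KernelGenBench-aten": ("aten",),
    "KernelGenBench-vllm": ("vllm",),
    "KernelGenBench-cublas": ("cublas",),
    "KernelGenBench-nocublas": ("aten", "vllm"),
    "KernelGenBench-MS": ("aten", "vllm", "cublas"),
    "KernelGenBench-MC": ("aten",),
}


def subset_size(name):
    """Total number of operators in the named subset."""
    return sum(SOURCE_OPERATOR_COUNTS[source] for source in SUBSET_SOURCES[name])


for name in SUBSET_SOURCES:
    print(f"{name}: {subset_size(name)} operators")
```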

LLM

Large Language Model, an AI model trained on vast amounts of text data. In KernelGenBench, LLMs are evaluated on their ability to generate GPU kernels.

Operator

A reusable computational unit in deep learning frameworks. Operators define “what” to compute (e.g., torch.add), while kernels define “how” to execute on hardware.

Pass@K

An evaluation metric measuring whether at least one correct solution exists among K generated samples. Pass@1 tests single-generation capability; Pass@5 allows up to five attempts.
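A minimal sketch of the metric as defined above: an operator counts as solved if any of its first K samples is correct. Averaging over operators to get a single score is an assumption made here for illustration, and the operator names and results are invented; KernelGenBench's own aggregation may differ.

```python
def pass_at_k(results, k):
    """Fraction of operators with at least one correct sample among the first k.

    results maps an operator name to a list of booleans, one per generated
    sample, in generation order (hypothetical data layout).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for samples in results.values() if any(samples[:k]))
    return solved / len(results)


# Invented correctness results for four operators, five samples each.
results = {
    "add": [True, False, False, False, False],
    "softmax": [False, False, True, False, False],
    "layer_norm": [False, False, False, False, False],
    "gemm": [False, True, True, False, False],
}

pass_at_1 = pass_at_k(results, 1)
pass_at_5 = pass_at_k(results, 5)
print(pass_at_1, pass_at_5)  # 0.25 0.75
```

In this example only `add` is solved by its first sample, while three of the four operators are solved within five samples, so Pass@5 is higher than Pass@1.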

PagedAttention

A memory-efficient attention mechanism used in vLLM for LLM inference. Part of the vLLM operator subset in KernelGenBench.

Speedup

The performance improvement ratio of a generated kernel over the baseline implementation; values above 1 mean the generated kernel is faster. Reported as the geometric mean across test cases and operators.
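The aggregation described above can be sketched as follows. Two things are assumptions here rather than documented KernelGenBench behavior: speedup is taken as baseline time divided by generated-kernel time, and the geometric mean is applied in two levels (per operator over its test cases, then across operators). The timings are invented.

```python
from statistics import geometric_mean


def speedup(baseline_ms, generated_ms):
    """Speedup of the generated kernel; above 1.0 means it beats the baseline."""
    if baseline_ms <= 0 or generated_ms <= 0:
        raise ValueError("timings must be positive")
    return baseline_ms / generated_ms


def overall_speedup(timings):
    """Geometric mean over test cases per operator, then over operators.

    timings maps an operator name to a list of
    (baseline_ms, generated_ms) pairs, one per test case (hypothetical layout).
    """
    per_operator = [
        geometric_mean(speedup(base, gen) for base, gen in cases)
        for cases in timings.values()
    ]
    return geometric_mean(per_operator)


# Invented timings in milliseconds.
timings = {
    "add": [(4.0, 1.0), (1.0, 1.0)],      # speedups 4.0 and 1.0
    "softmax": [(1.0, 2.0), (4.0, 8.0)],  # speedups 0.5 and 0.5
}

total = overall_speedup(timings)
print(f"overall speedup: {total:.2f}x")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel to 1x, which an arithmetic mean would not reflect.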

Triton

An open-source programming language for GPU kernels that abstracts low-level details while maintaining high performance. Triton code is portable across different GPU architectures.

vLLM

A high-throughput LLM inference engine with custom CUDA kernels. KernelGenBench includes 50 vLLM operators representing production inference workloads.


Acronyms

| Acronym | Full Name |
| --- | --- |
| AST | Abstract Syntax Tree |
| ATen | A Tensor Library |
| BLAS | Basic Linear Algebra Subprograms |
| CUDA | Compute Unified Device Architecture |
| DCU | Deep Computing Unit |
| GEMM | General Matrix Multiplication |
| GPU | Graphics Processing Unit |
| LLM | Large Language Model |
| MUSA | Meta-computing Unified System Architecture |
| NPU | Neural Processing Unit |


Hardware Platforms

| Platform | Vendor | Description |
| --- | --- | --- |
| NVIDIA | NVIDIA | A100 GPUs, primary evaluation baseline |
| Ascend | Huawei | Neural Processing Units |
| MUSA | Moore Threads | GPU architecture |
| Hygon | Hygon | Data Center Accelerators |
| Iluvatar | Iluvatar | AI accelerators |
| MetaX | MUXI | GPU accelerators |