FlagTensor Benchmark Policy#

Scope#

This document defines the acceptance-facing benchmark policy for FlagTensor performance validation.

Benchmark Goals#

Compare FlagTensor kernels against cuTensor baselines.
Produce reproducible operator-level and category-level benchmark artifacts.
Support smoke and acceptance-level execution modes.
Report benchmark results with mode-aware output handling.

Benchmark Modes#

| Mode | Meaning | Typical Use |
| --- | --- | --- |
| `kernel` | Kernel-focused measurement | Low-level performance analysis |
| `operator` | Operator-level measurement | Default acceptance reporting |
| `wrapper` | Wrapper/API-path measurement | Integration-level validation |
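
The three modes can be modelled as a small closed set so that mode strings are validated once and reused by runners and reports. The following is a minimal sketch, not FlagTensor's actual code: `BenchmarkMode`, `DEFAULT_MODE`, and `parse_mode` are hypothetical names, and treating `operator` as the default is an assumption drawn from its "Default acceptance reporting" use.

```python
from enum import Enum


class BenchmarkMode(str, Enum):
    """The three benchmark modes described in this policy."""

    KERNEL = "kernel"      # kernel-focused measurement
    OPERATOR = "operator"  # operator-level measurement
    WRAPPER = "wrapper"    # wrapper/API-path measurement


# Assumption: operator mode is the default because the policy lists it as the
# mode used for default acceptance reporting.
DEFAULT_MODE = BenchmarkMode.OPERATOR


def parse_mode(value: str) -> BenchmarkMode:
    """Convert a user-supplied string to a BenchmarkMode, rejecting unknown names."""
    try:
        return BenchmarkMode(value.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in BenchmarkMode)
        raise ValueError(
            f"unknown benchmark mode {value!r}; expected one of: {valid}"
        ) from None
```

Rejecting unknown mode names early keeps a typo from silently selecting the wrong output files later in the pipeline.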

Execution Levels#

Smoke Benchmark#

Reduced shape set
Reduced dtype set
Intended for CI turnaround speed
Triggered via `tools/run_flagtensor_ci.py --smoke --run-perf`

Acceptance Benchmark#

Full configured shape coverage
Full supported dtype coverage for the selected operator set
Intended for release and acceptance review
Triggered via `tools/run_flagtensor_ci.py --run-perf`

Weekly Benchmark#

Registry-driven scheduled or manual regression execution
Intended for broader drift tracking across operators and GPUs
Triggered via `tools/run_flagtensor_weekly.py`
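
The three execution levels differ only in which runner and flags are used. The sketch below collects the documented invocations in one place; `build_command` and the level names `smoke`, `acceptance`, and `weekly` are hypothetical helpers for illustration, while the script paths and the `--smoke` and `--run-perf` flags are taken from this section. Any further flags the runners may accept are not covered here.

```python
import sys

CI_RUNNER = "tools/run_flagtensor_ci.py"
WEEKLY_RUNNER = "tools/run_flagtensor_weekly.py"


def build_command(level: str) -> list[str]:
    """Return the argv for one execution level: 'smoke', 'acceptance', or 'weekly'."""
    if level == "smoke":
        # Reduced shape and dtype sets for fast CI turnaround.
        return [sys.executable, CI_RUNNER, "--smoke", "--run-perf"]
    if level == "acceptance":
        # Full configured shape and dtype coverage.
        return [sys.executable, CI_RUNNER, "--run-perf"]
    if level == "weekly":
        # Registry-driven regression run.
        return [sys.executable, WEEKLY_RUNNER]
    raise ValueError(f"unknown execution level: {level!r}")
```

Assuming the runners are executed from the repository root, a caller could pass the result to `subprocess.run(build_command("smoke"), check=True)`.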

Shape and Dtype Policy#

Benchmark shapes should be centrally managed where possible.
Category-level benchmark entry points are the formal acceptance interface. Legacy per-operator benchmark files are retained as debugging and migration compatibility shims, not as an acceptance requirement.
Benchmark dtypes default to `float16` and `float32` unless the operator requires a specialized dtype set.
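
One way to keep shapes central and dtypes defaulted is a single module that every category benchmark imports. This is a sketch under stated assumptions: the shape values, the category keys in `SHAPES`, the `DTYPE_OVERRIDES` table, and the operator name `hypothetical_complex_op` are all placeholders for illustration, not FlagTensor's real configuration. Only the `float16`/`float32` default comes from the policy.

```python
# Default dtype set stated by the policy.
DEFAULT_DTYPES = ("float16", "float32")

# Placeholder shapes, keyed by category; real values would live in one shared module.
SHAPES = {
    "unary": [(1024,), (256, 256)],
    "binary": [(256, 256)],
}

# Placeholder per-operator overrides for operators that need a specialized dtype set.
DTYPE_OVERRIDES = {
    "hypothetical_complex_op": ("complex64",),
}


def dtypes_for(op_name: str) -> tuple[str, ...]:
    """Return the operator's specialized dtype set, or the policy default."""
    return DTYPE_OVERRIDES.get(op_name, DEFAULT_DTYPES)


def benchmark_cases(category: str, op_name: str) -> list[tuple[tuple[int, ...], str]]:
    """Expand the central shape list and the dtype set into (shape, dtype) cases."""
    return [
        (shape, dtype)
        for shape in SHAPES[category]
        for dtype in dtypes_for(op_name)
    ]
```

Because every benchmark derives its cases from the same tables, changing shape coverage is a single edit rather than one per operator file.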

Timing Policy#

Warmup count and repetition count must be explicit and reproducible.
The current default warmup and repetition counts are set through environment variables and runner flags.
Future consolidation should move shared timing policy into a centralized benchmark utility layer.
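
A shared timing helper could make the warmup and repetition counts explicit by taking them as arguments and echoing them in its result. The sketch below is illustrative only: the environment variable names `FLAGTENSOR_BENCH_WARMUP` and `FLAGTENSOR_BENCH_REPEAT` and the fallback values 10 and 100 are assumptions, not the project's actual settings. It times with a host clock to stay self-contained; a real GPU benchmark would also need to synchronize the device before reading the clock.

```python
import os
import statistics
import time
from typing import Callable, Mapping, Optional


def timing_config(env: Optional[Mapping[str, str]] = None) -> tuple[int, int]:
    """Read (warmup, repeat) from the environment; names and defaults are hypothetical."""
    env = os.environ if env is None else env
    warmup = int(env.get("FLAGTENSOR_BENCH_WARMUP", "10"))
    repeat = int(env.get("FLAGTENSOR_BENCH_REPEAT", "100"))
    return warmup, repeat


def measure(fn: Callable[[], object], warmup: int, repeat: int) -> dict:
    """Run fn `warmup` times untimed, then time `repeat` runs with a host clock."""
    if warmup < 0 or repeat < 1:
        raise ValueError("warmup must be >= 0 and repeat must be >= 1")
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # Record the counts next to the timings so the run can be reproduced.
    return {
        "warmup": warmup,
        "repeat": repeat,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
    }
```

Storing `warmup` and `repeat` in the result means the counts travel with the measurement into any downstream artifact.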

Reporting Policy#

Benchmark CSV selection must be mode-aware.
Reports should distinguish kernel, operator, and wrapper outputs.
Acceptance reporting should include pass/fail status and speedup statistics.
HTML and XLSX reporting are supported.
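
Mode-aware selection and an acceptance summary might look like the following sketch. The row keys `mode`, `op`, `baseline_ms`, and `flagtensor_ms`, the `threshold` parameter, and its default of 1.0 are assumptions for illustration; the real CSV schema and pass criteria are not specified in this policy. Speedup is taken here as baseline time divided by FlagTensor time, so values above 1.0 mean FlagTensor is faster.

```python
import statistics


def speedup(row: dict) -> float:
    """Baseline time divided by FlagTensor time for one benchmark row."""
    return float(row["baseline_ms"]) / float(row["flagtensor_ms"])


def summarize(rows: list[dict], mode: str, threshold: float = 1.0) -> dict:
    """Keep only rows for `mode`, then report speedup statistics and pass/fail."""
    selected = [r for r in rows if r["mode"] == mode]
    if not selected:
        return {"mode": mode, "count": 0, "status": "no-data"}
    speedups = [speedup(r) for r in selected]
    # An operator fails when its speedup falls below the (hypothetical) threshold.
    failed = [r["op"] for r, s in zip(selected, speedups) if s < threshold]
    return {
        "mode": mode,
        "count": len(selected),
        "geomean_speedup": statistics.geometric_mean(speedups),
        "worst_speedup": min(speedups),
        "failed_ops": failed,
        "status": "fail" if failed else "pass",
    }
```

Rows loaded with `csv.DictReader` would fit this shape directly, since the helper converts the timing columns with `float`.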

Category Benchmark Entry Points (Acceptance Interface)#

Benchmark execution uses category-level files as the formal acceptance interface. Individual operators are selected with pytest marker expressions of the form `pytest -m <op>`.

Current category entry points (all four complete):

`benchmark/test_unary_perf.py`: 28 unary operators
`benchmark/test_binary_perf.py`: 4 binary operators
`benchmark/test_contraction_perf.py`: 5 contraction operators
`benchmark/test_sparse_perf.py`: 1 sparse operator

Legacy per-operator benchmark files (`benchmark/test_CUTENSOR_OP_*_perf.py`) are retained as implementation details for debugging and migration compatibility, but they are not part of the formal acceptance interface.

Source of Truth#

Registry metadata: `conf/operators.yaml`
Benchmark strategy overview: `docs/benchmark_strategy.md`
CI runner: `tools/run_flagtensor_ci.py`
Weekly runner: `tools/run_flagtensor_weekly.py`