FlagFFT Overview
FlagFFT is an experimental C++ FFT library with a cuFFT-style API and Triton/TLE-generated CUDA kernels. The public runtime interface is C; Python is retained only for Triton/TLE JIT source generation (internal codegen).
FlagFFT is part of the FlagOS ecosystem and provides high-performance FFT computations for scientific computing, signal processing, and machine learning workloads.
Features
cuFFT-style API: Familiar planning and execution interface for developers migrating from cuFFT.
JIT kernel compilation: Kernels are generated at plan creation time via Triton/TLE, so execution incurs no Python compilation latency.
Arbitrary-length 1D and 2D transforms: Supports arbitrary composite lengths via fused four-step routes, including very large sizes (e.g., n = 2^23), without falling back to Bluestein. 2D FFT supports all six transform types.
Multiple transform types: C2C and Z2Z (complex-to-complex), R2C and D2Z (real-to-complex), C2R and Z2D (complex-to-real).
Plan description: flagfftGetPlanDescription returns detailed information about the plan node tree, kernel names, and compilation details for performance debugging.
Native CLI: flagfft-cli provides benchmark measurement and plan inspection without Python overhead.
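To make the four-step route mentioned in the feature list concrete, here is a minimal CPU-only sketch of the textbook decomposition for n = n1 * n2: column DFTs, a twiddle multiply, row DFTs, and a transposed store. It is not FlagFFT's kernel code. The function names (`naive_dft`, `four_step_dft`, `max_abs_diff`) are hypothetical, and an O(n^2) DFT stands in for whatever leaf kernels the library actually generates.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Reference O(n^2) DFT: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n).
std::vector<cplx> naive_dft(const std::vector<cplx>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                -2.0 * pi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            acc += x[j] * std::polar(1.0, angle);
        }
        out[k] = acc;
    }
    return out;
}

// Four-step DFT of length n = n1 * n2 (hypothetical helper, illustration only).
// Input index j = j1 * n2 + j2, output index k = k1 + n1 * k2.
std::vector<cplx> four_step_dft(const std::vector<cplx>& x, std::size_t n1, std::size_t n2) {
    assert(n1 > 0 && n2 > 0 && x.size() == n1 * n2);
    const std::size_t n = n1 * n2;
    const double pi = std::acos(-1.0);

    // Step 1: for each column j2, a length-n1 DFT over j1 (input stride n2).
    std::vector<cplx> a(n);  // a[k1 * n2 + j2]
    std::vector<cplx> col(n1);
    for (std::size_t j2 = 0; j2 < n2; ++j2) {
        for (std::size_t j1 = 0; j1 < n1; ++j1) col[j1] = x[j1 * n2 + j2];
        const std::vector<cplx> c = naive_dft(col);
        for (std::size_t k1 = 0; k1 < n1; ++k1) a[k1 * n2 + j2] = c[k1];
    }

    // Step 2: twiddle multiply, a[k1][j2] *= exp(-2*pi*i*j2*k1/n).
    for (std::size_t k1 = 0; k1 < n1; ++k1) {
        for (std::size_t j2 = 0; j2 < n2; ++j2) {
            const double angle =
                -2.0 * pi * static_cast<double>(j2 * k1) / static_cast<double>(n);
            a[k1 * n2 + j2] *= std::polar(1.0, angle);
        }
    }

    // Step 3: for each row k1, a length-n2 DFT over j2.
    // Step 4: transposed store, X[k1 + n1 * k2] = row result at k2.
    std::vector<cplx> out(n);
    std::vector<cplx> row(n2);
    for (std::size_t k1 = 0; k1 < n1; ++k1) {
        for (std::size_t j2 = 0; j2 < n2; ++j2) row[j2] = a[k1 * n2 + j2];
        const std::vector<cplx> r = naive_dft(row);
        for (std::size_t k2 = 0; k2 < n2; ++k2) out[k1 + n1 * k2] = r[k2];
    }
    return out;
}

// Largest element-wise absolute difference between two equally sized buffers.
double max_abs_diff(const std::vector<cplx>& a, const std::vector<cplx>& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) worst = std::max(worst, std::abs(a[i] - b[i]));
    return worst;
}
```

Calling `four_step_dft(x, 3, 4)` on a 12-element buffer and comparing it with `naive_dft(x)` through `max_abs_diff` should agree to rounding error. A fused GPU route would presumably combine some of these steps to avoid extra passes over memory, but how FlagFFT fuses them is not shown here.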
Architecture
C++ Runtime
| Module | Description |
|---|---|
| | cuFFT-style opaque handle API, backend-neutral |
| | |
| | Maps |
| | Invokes Python Triton/TLE source generation during plan creation, compiles via libtriton_jit |
| | Pip-installable source generator and bundled codelets |
| | Device allocation, stream/event operations, target identity, capability queries (CUDA Driver backend) |
| | Shared request/key utilities, JSON/SQLite tuning support |
Raw Execution Nodes
CompiledRawLeafNode: Launches a contiguous leaf kernel with plan-owned twiddle and DFT table allocations.
CompiledRawFourStepFusedNode: Four-step routes with row and column leaf children; owns the twiddle and intermediate buffers.
CompiledRawBluesteinNode: Prime and awkward composite lengths via JIT prepare, pointwise, finalize, and convolution FFT child kernels.
CompiledRaw2DNode: Contiguous complex 2D plans with an RTRT route (row FFT → tiled transpose → row FFT → tiled transpose back).
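The RTRT route described for 2D plans can be sketched on the CPU as follows. This is an illustration of the row FFT, transpose, row FFT, transpose back sequence only, assuming a row-major complex matrix. The helper names (`dft_rows`, `transpose`, `dft2d_rtrt`, `dft2d_direct`) are hypothetical, the transposes are plain rather than tiled, and a naive DFT replaces the real row kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive DFT applied in place to every contiguous row of a rows x cols matrix.
void dft_rows(std::vector<cplx>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    const double pi = std::acos(-1.0);
    std::vector<cplx> tmp(cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t k = 0; k < cols; ++k) {
            cplx acc(0.0, 0.0);
            for (std::size_t j = 0; j < cols; ++j) {
                const double angle =
                    -2.0 * pi * static_cast<double>((j * k) % cols) / static_cast<double>(cols);
                acc += m[r * cols + j] * std::polar(1.0, angle);
            }
            tmp[k] = acc;
        }
        std::copy(tmp.begin(), tmp.end(), m.begin() + static_cast<std::ptrdiff_t>(r * cols));
    }
}

// Out-of-place transpose: rows x cols in, cols x rows out (not tiled).
std::vector<cplx> transpose(const std::vector<cplx>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<cplx> out(m.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c) out[c * rows + r] = m[r * cols + c];
    return out;
}

// RTRT route: row DFT -> transpose -> row DFT -> transpose back.
std::vector<cplx> dft2d_rtrt(std::vector<cplx> m, std::size_t rows, std::size_t cols) {
    dft_rows(m, rows, cols);          // transform along the contiguous dimension
    m = transpose(m, rows, cols);     // now cols x rows, old columns are contiguous
    dft_rows(m, cols, rows);          // transform along the original column dimension
    return transpose(m, cols, rows);  // restore rows x cols layout
}

// Direct 2D DFT from the definition, used only as a correctness reference.
std::vector<cplx> dft2d_direct(const std::vector<cplx>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(m.size());
    for (std::size_t k1 = 0; k1 < rows; ++k1) {
        for (std::size_t k2 = 0; k2 < cols; ++k2) {
            cplx acc(0.0, 0.0);
            for (std::size_t j1 = 0; j1 < rows; ++j1) {
                for (std::size_t j2 = 0; j2 < cols; ++j2) {
                    const double angle = -2.0 * pi *
                        (static_cast<double>((j1 * k1) % rows) / static_cast<double>(rows) +
                         static_cast<double>((j2 * k2) % cols) / static_cast<double>(cols));
                    acc += m[j1 * cols + j2] * std::polar(1.0, angle);
                }
            }
            out[k1 * cols + k2] = acc;
        }
    }
    return out;
}
```

The two transposes keep both 1D passes on contiguous rows, which is the usual motivation for this route on hardware that favors unit-stride access. Comparing `dft2d_rtrt` against `dft2d_direct` on a small matrix is a quick way to check the index bookkeeping.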
CLI Tools
src/cli_tools/common/ owns CaseSpec, deterministic buffer generation, FlagFFT/cuFFT dispatch, and comparison. cuFFT is used only in the CLI as the CUDA validation/performance oracle.
bench: Binds FlagFFT and cuFFT reference plans to one adaptor stream before warmup and timing.
tune: Placeholder; exits with FLAGFFT_NOT_SUPPORTED.
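As a rough illustration of what deterministic buffer generation and comparison against an oracle can look like, the sketch below fills a complex buffer from a fixed seed and computes a relative L2 error between a result and a reference. It does not reproduce CaseSpec or the CLI's actual comparison logic. The names `make_buffer` and `relative_l2_error` and the choice of error metric are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

using cplx = std::complex<float>;

// Fill a buffer from a fixed seed (hypothetical helper).
// The raw std::mt19937 sequence is fixed by the C++ standard, so mapping its
// output by hand avoids distribution classes whose results may vary by library.
std::vector<cplx> make_buffer(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    auto next = [&rng]() {
        // Keep the top 24 bits and map them to [-1, 1).
        return static_cast<float>(rng() >> 8) / 8388608.0f - 1.0f;
    };
    std::vector<cplx> buf(n);
    for (auto& v : buf) {
        const float re = next();  // separate statements fix the draw order
        const float im = next();
        v = cplx(re, im);
    }
    return buf;
}

// Relative L2 error ||got - ref|| / ||ref||, accumulated in double precision.
double relative_l2_error(const std::vector<cplx>& got, const std::vector<cplx>& ref) {
    assert(got.size() == ref.size());
    double num = 0.0;
    double den = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const std::complex<double> g(got[i].real(), got[i].imag());
        const std::complex<double> r(ref[i].real(), ref[i].imag());
        num += std::norm(g - r);  // std::norm is the squared magnitude
        den += std::norm(r);
    }
    if (den == 0.0) {
        return num == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return std::sqrt(num / den);
}
```

Two calls to `make_buffer` with the same size and seed return identical buffers, so a FlagFFT run and a reference run can be fed the same input. Any pass/fail threshold applied to `relative_l2_error` would depend on precision and transform size and is left open here.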
Build Options
| Option | Default | Description |
|---|---|---|
| | | Build |
| | | Build Google Test targets under |
| | | GPU backend selector (only |
Python Boundary
The native runtime invokes python -m flagfft_codegen.jit_source; the chosen Python environment must supply compatible Triton/TLE dependencies. Generated JIT source/metadata live in .flagfft beside the executable.
Tests
ctest/: Google Test accuracy tests for all operators.
tools/run_tests.py: Unified test runner orchestrating accuracy and performance testing across multiple GPUs.
tests/python/: Code generation tests.
Workflow
1. Create an FFT plan with flagfftPlan1d (or flagfftPlanMany for batched transforms).
2. Optionally attach a CUDA stream with flagfftSetStream.
3. Execute the transform with flagfftExec* functions.
4. Destroy the plan with flagfftDestroy.
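The ordering of these steps can be illustrated with a small RAII wrapper. Because the exact FlagFFT signatures are not given here, the sketch does not call the library. Everything in the `stub` namespace is a hypothetical stand-in for flagfftPlan1d, flagfftSetStream, a flagfftExec* function, and flagfftDestroy, and it only records the call order. `PlanGuard` and `run_workflow` are likewise invented for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in backend (NOT the FlagFFT API): each function only logs its name.
namespace stub {
using Handle = int;

std::vector<std::string>& call_log() {
    static std::vector<std::string> log;
    return log;
}

int plan1d(Handle* handle, int n, int batch) {
    if (handle == nullptr || n <= 0 || batch <= 0) return 1;  // nonzero means failure
    *handle = 1;
    call_log().push_back("plan1d");
    return 0;
}

int set_stream(Handle, void* /*stream*/) {
    call_log().push_back("set_stream");
    return 0;
}

int exec(Handle, const void* /*in*/, void* /*out*/) {
    call_log().push_back("exec");
    return 0;
}

int destroy(Handle) {
    call_log().push_back("destroy");
    return 0;
}
}  // namespace stub

// Owns one plan handle and destroys it exactly once when leaving scope.
class PlanGuard {
public:
    PlanGuard(int n, int batch) : ok_(stub::plan1d(&handle_, n, batch) == 0) {}
    ~PlanGuard() {
        if (ok_) stub::destroy(handle_);
    }
    PlanGuard(const PlanGuard&) = delete;
    PlanGuard& operator=(const PlanGuard&) = delete;

    bool ok() const { return ok_; }
    stub::Handle get() const { return handle_; }

private:
    stub::Handle handle_ = 0;
    bool ok_ = false;
};

// Runs create -> set stream -> execute -> destroy and returns the call order.
std::vector<std::string> run_workflow() {
    stub::call_log().clear();
    {
        PlanGuard plan(1024, 1);                   // step 1: create the plan
        if (plan.ok()) {
            stub::set_stream(plan.get(), nullptr);     // step 2: optional stream
            stub::exec(plan.get(), nullptr, nullptr);  // step 3: execute
        }
    }                                              // step 4: destroy on scope exit
    return stub::call_log();
}
```

Tying destruction to scope exit means the plan is released even when an early return or exception skips the rest of the function. With the real library, the stub calls would be replaced by the corresponding FlagFFT functions and their return codes checked.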