FlagTensor Known Issues#
This document tracks known issues and limitations in the current FlagTensor implementation.
Experimental Operators#
block_sparse_tensor_contraction#
Status: Experimental
Issue: Sparse tensor contraction support is still under active development
Impact: May have limited shape/dtype coverage compared to dense operators
Recommendation: Use for evaluation only; not for production workloads
Known Limitations#
Operator-Specific Numerical Issues#
CI Environment#
GPU Access: CI workflows run on ubuntu-latest (CPU) without GPU access
Actual GPU validation must be done via Slurm on cluster nodes
CI correctness/perf jobs currently validate structure and integration, not actual GPU correctness
Memory: CI runners have limited memory; large shape tests are reduced in smoke mode
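The distinction between CPU-only CI runs and Slurm GPU runs can be made explicit in test tooling. The following is a minimal sketch, not FlagTensor's actual implementation: the function names, the `nvidia-smi` probe, and the level labels `"gpu-correctness"` and `"structure-only"` are all assumptions chosen for illustration.

```python
import shutil


def gpu_available(probe=shutil.which):
    """Heuristic check: treat the host as GPU-capable if nvidia-smi is on PATH.

    The probe is injectable so the logic can be exercised on CPU-only runners.
    This is an assumed heuristic, not a FlagTensor API.
    """
    return probe("nvidia-smi") is not None


def validation_level(has_gpu):
    """Describe what a run on this host can actually verify (hypothetical labels)."""
    if has_gpu:
        # Cluster node reached via Slurm: numerical results can be checked.
        return "gpu-correctness"
    # ubuntu-latest CI runner: only structure and integration are exercised.
    return "structure-only"


if __name__ == "__main__":
    print(validation_level(gpu_available()))
```

Reporting the level alongside test results would make it harder to mistake a green CI run for a GPU correctness pass.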
Benchmark Mode Coverage#
kernel mode: Fully supported for most operators
operator mode: Supported for a subset of operators
wrapper mode: Limited support; mainly for operators where wrapper-level optimization is beneficial
Dtype Coverage#
float16: Fully supported across operators
float32: Fully supported across operators
bfloat16: Supported across unary and contraction operators; verified in correctness tests
complex64/complex128: Supported only for the conj operator. Triton's type system does not natively support complex dtypes; other operators reject complex inputs.
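The complex-dtype rule above can be expressed as a small guard. This is a sketch under stated assumptions: `check_complex_support` and the constant names are hypothetical, dtypes are represented as plain strings rather than framework dtype objects, and only the complex rule is modeled (per-category bfloat16 coverage is not).

```python
# Hypothetical names; dtypes are modeled as strings for illustration only.
COMPLEX_DTYPES = {"complex64", "complex128"}
COMPLEX_CAPABLE_OPS = {"conj"}  # per the list above, only conj accepts complex inputs


def check_complex_support(op_name, dtype):
    """Raise TypeError if a complex dtype is passed to an operator other than conj."""
    if dtype in COMPLEX_DTYPES and op_name not in COMPLEX_CAPABLE_OPS:
        raise TypeError(
            f"{op_name} does not accept {dtype}; "
            "complex dtypes are supported only for conj"
        )


if __name__ == "__main__":
    check_complex_support("conj", "complex64")  # accepted
    check_complex_support("add", "float32")     # accepted: not a complex dtype
    try:
        check_complex_support("add", "complex128")
    except TypeError as exc:
        print(f"rejected: {exc}")
```

Rejecting early with an explicit message is preferable to letting a kernel launch fail later with a less readable compiler error.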
Shape Coverage#
Small shapes: (1024,), (4096,) - covered in correctness and smoke benchmark
Medium shapes: (128, 128), (32, 64, 16) - covered in correctness tests
Large shapes: Up to 2^24 elements - covered in full benchmark runs
Contraction shapes: Specialized shapes for layout/chain validation
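The shape tiers above can be captured as data so that each run mode picks a consistent set. The sketch below is illustrative only: the mode names `"smoke"` and `"correctness"`, and the helper names, are assumptions; the shapes and the 2^24 element bound are taken from the list above.

```python
import math

SMALL_SHAPES = [(1024,), (4096,)]
MEDIUM_SHAPES = [(128, 128), (32, 64, 16)]
MAX_ELEMENTS = 2 ** 24  # upper bound for large shapes in full benchmark runs


def num_elements(shape):
    """Total element count of a tensor with the given shape."""
    return math.prod(shape)


def shapes_for(mode):
    """Return the shapes exercised in a given mode (hypothetical mode names)."""
    if mode == "smoke":
        return list(SMALL_SHAPES)
    if mode == "correctness":
        return SMALL_SHAPES + MEDIUM_SHAPES
    raise ValueError(f"unknown mode: {mode}")


def within_full_benchmark_limit(shape):
    """True if the shape does not exceed the 2^24-element bound."""
    return num_elements(shape) <= MAX_ELEMENTS


if __name__ == "__main__":
    for shape in shapes_for("correctness"):
        print(shape, num_elements(shape))
```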
Performance Notes#
Triton Autotuner: Current Triton version uses deprecated warmup/rep parameters
Deprecation warnings appear in benchmark output
Does not affect functionality; will be addressed in future Triton upgrade
cuTensor Baseline: Performance comparisons against cuTensor C API
Some operators may show speedup < 1x for certain shapes/dtypes
This is expected behavior and not necessarily an issue
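To make "speedup < 1x" concrete, the sketch below assumes the common convention that speedup is the baseline latency divided by the FlagTensor latency. The function name and the convention are assumptions; the benchmark harness may define the ratio differently.

```python
def speedup(baseline_ms, flagtensor_ms):
    """Speedup of FlagTensor over the baseline, assuming baseline / candidate latency.

    Values above 1.0 mean FlagTensor is faster; values below 1.0 mean the
    cuTensor baseline is faster for that shape/dtype.
    """
    if baseline_ms <= 0 or flagtensor_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / flagtensor_ms


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(speedup(baseline_ms=2.0, flagtensor_ms=1.0))  # FlagTensor faster
    print(speedup(baseline_ms=0.8, flagtensor_ms=1.0))  # baseline faster
```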
Migration Notes#
Directory Structure Transition#
ctests/: Legacy correctness test directory; being migrated to tests/
benchmark/: Single-operator perf files retained as implementation details; category-level entry points are the formal acceptance interface
tests/: Unified correctness entry with proxy layer for legacy tests
src/flagtensor/testing/: Centralized tolerance/assertion helpers
Registry Transition#
weekly_op_test.txt: Removed; operator list is generated from registry
discover_ops(): Legacy discovery function; being replaced by registry-based filtering
Manual exclusion: --exclude-op flags are still supported, but the registry is preferred
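Registry-based filtering combined with manual exclusion might look like the sketch below. It is not the actual registry API: `OpEntry`, `REGISTRY`, and `select_ops` are hypothetical names, and the entries are a small sample of the operator names and categories mentioned in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpEntry:
    """Hypothetical registry record: operator name plus its category."""
    name: str
    category: str


# Sample entries only; the real registry is generated elsewhere.
REGISTRY = [
    OpEntry("add", "binary"),
    OpEntry("mul", "binary"),
    OpEntry("max", "binary"),
    OpEntry("min", "binary"),
    OpEntry("gett", "contraction"),
    OpEntry("ttgt", "contraction"),
    OpEntry("block_sparse_tensor_contraction", "sparse"),
]


def select_ops(registry, category=None, exclude=()):
    """Filter registry entries by category, then drop manually excluded names."""
    excluded = set(exclude)
    return [
        entry.name
        for entry in registry
        if (category is None or entry.category == category)
        and entry.name not in excluded
    ]


if __name__ == "__main__":
    # Roughly what passing --exclude-op max would mean for the binary category.
    print(select_ops(REGISTRY, category="binary", exclude=["max"]))
```

Keeping exclusion as a post-filter on the registry means the operator list has a single source of truth, with manual flags acting only as an override.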
Future Work#
Migrate all correctness tests from ctests/ to tests/ with category organization
Category directories created (unary/, binary/, contraction/, sparse/)
Loader supports skipping migrated operators
Unary operators: 27 migrated
Binary operators: 4 migrated (add, mul, max, min - all complete)
Contraction operators: 4 migrated (gett, tgett, ttgt, tensor_contraction_trinary)
Sparse operators: 1 migrated (block_sparse_tensor_contraction, float16 now active)
Add category-level benchmark entry points (formal acceptance interface)
test_unary_perf.py
test_binary_perf.py
test_contraction_perf.py
test_sparse_perf.py
Upgrade Triton to remove deprecation warnings
Add GPU runner to CI for actual correctness validation
Expand bfloat16 dtype coverage
Improve wrapper mode coverage
Add acceptance-level performance regression detection