Project Structure

PyTorch-Plugin-FL/
├── include/                  # Public headers
│   ├── flagos.h              #   Unified runtime API (memory, stream, device)
│   └── macros.h              #   Common macros
├── accelerator/              # Hardware abstraction layer
│   ├── csrc/cuda/            #   CUDA runtime implementation
│   ├── csrc/maca/            #   MACA cudart shim (symbol version compatibility)
│   └── csrc/ascend/          #   Ascend runtime (ACL-based memory, stream, device)
├── csrc/
│   ├── aten/                 # ATen operator layer
│   │   ├── common.{h,cc}     #   Backend config loading, FlagosDevice enum
│   │   ├── dispatch_stub.h   #   Lightweight dispatch stub (replaces PyTorch DispatchStub)
│   │   ├── device_boxing.h   #   Zero-copy flagos↔CUDA tensor metadata conversion
│   │   ├── register.cc       #   PrivateUse1 dispatch key registration
│   │   ├── {op}.{h,cc}       #   Per-operator stub definitions (add, mm, silu, etc.)
│   │   ├── factory_ops/      #   Basic operators (empty, copy, contiguous, set, fallback)
│   │   ├── functional_ops/   #   Compute operators (mm, bmm, cat, embedding, softmax, etc.)
│   │   └── backends/         #   Backend-specific kernel implementations
│   │       ├── cuda/         #     CUDA kernels (cuBLAS, modified PyTorch kernels)
│   │       ├── flagos/       #     FlagGems C++ native API wrappers
│   │       └── ascend/       #     Ascend kernels (ACL NN API)
│   └── runtime/              # Device runtime
│       ├── device_allocator  #   Device memory allocator
│       ├── host_allocator    #   Pinned memory allocator
│       ├── guard             #   DeviceGuard implementation
│       ├── generator         #   RNG generator
│       ├── hooks             #   Runtime hooks
│       └── accelerator/      #   Hardware abstraction layer
│           ├── cuda/         #     CUDA runtime implementation
│           ├── maca/         #     MACA cudart shim (symbol version compatibility)
│           └── ascend/       #     Ascend runtime (ACL-based memory, stream, device)
├── torch_fl/
│   ├── __init__.py           # Plugin entry point: register device, load FlagGems operators
│   ├── flagos/               # Python device module (stream, event, RNG, AMP)
│   ├── accelerator/          # Python accelerator module (MACA shim loader)
│   ├── backends.conf         # Default backend routing config (CUDA/FlagGems)
│   ├── backends_ascend.conf  # Ascend backend routing config (all ops → ascend)
│   ├── distributed.py        # Distributed training support (DDP patch)
│   ├── integration.py        # FlagGems operator registration logic
│   ├── csrc/                 # C extension (module.cc, stub.c)
│   └── lib/                  # Compiled shared libraries (libtorch_fl.so, libflagos.so)
├── tests/
│   ├── integration/          # Automated integration tests
│   │   ├── ops/              #   Per-operator dispatch tests
│   │   ├── test_qwen3_*.py   #   End-to-end model tests
│   │   └── conftest.py       #   Pytest configuration
│   ├── manual/               # Manual test scripts
│   └── common/               # Test utilities
├── debug/                    # Development notes and debug scripts
├── cmake/                    # CMake modules
├── setup.py                  # CMake build entry point
└── pyproject.toml
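
The tree mentions per-operator dispatch stubs and backend routing configs. As a rough mental model only, the idea of routing an operator to a backend-specific kernel can be sketched in plain Python. This is not the plugin's code: `DispatchStub`, `register`, the routing dictionaries, and the `"*"` wildcard are all hypothetical names chosen for illustration, and the dictionaries merely stand in for whatever `backends.conf` and `backends_ascend.conf` actually contain, since their format is not shown here.

```python
from typing import Callable, Dict


class DispatchStub:
    """Hypothetical per-operator stub: maps a backend name to a kernel."""

    def __init__(self, op_name: str) -> None:
        self.op_name = op_name
        self._kernels: Dict[str, Callable] = {}

    def register(self, backend: str, kernel: Callable) -> None:
        # Each backend contributes its own kernel for this operator.
        self._kernels[backend] = kernel

    def __call__(self, routing: Dict[str, str], *args):
        # Look up the backend for this op; "*" is an assumed catch-all entry.
        backend = routing.get(self.op_name, routing.get("*"))
        if backend not in self._kernels:
            raise NotImplementedError(
                f"{self.op_name}: no kernel for backend {backend!r}"
            )
        return self._kernels[backend](*args)


# Toy kernels that tag their result with the backend that ran them.
add_stub = DispatchStub("add")
add_stub.register("cuda", lambda a, b: ("cuda", a + b))
add_stub.register("ascend", lambda a, b: ("ascend", a + b))

# Stand-ins for the two routing configs (the real file format is not assumed).
default_routing = {"add": "cuda"}
ascend_routing = {"*": "ascend"}  # mirrors the "all ops → ascend" note

result_default = add_stub(default_routing, 1, 2)
result_ascend = add_stub(ascend_routing, 1, 2)
```

In this sketch the stub stays backend-agnostic and only the routing table changes between configurations, which is one plausible reading of why the stub definitions and the `backends/` kernels sit in separate directories.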