Overview#
FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
FlagCX leverages native collective communication libraries to provide full single-chip communication support across platforms. Beyond its native x-CCL integrations, FlagCX introduces original device-buffer IPC and device-buffer RDMA technologies, enabling high-performance P2P operations for both cross-chip and single-chip scenarios. These mechanisms can be seamlessly combined with native x-CCL backends to deliver optimized performance for cross-chip collective communications.
Architecture#
FlagCX is organized into three layers:
User Interface Layer (UIL)#
The public C API defined in flagcx/include/flagcx.h. It exposes:
Communicator lifecycle:
flagcxCommInitRank,flagcxCommDestroy,flagcxCommFinalizeCollective operations: AllReduce, AllGather, ReduceScatter, Broadcast, Reduce, Gather, Scatter, AlltoAll, AlltoAllv, Send, Recv
One-sided RDMA operations:
flagcxGet,flagcxPutSignal,flagcxSignal,flagcxWaitSignalMemory registration:
flagcxMemAlloc,flagcxCommRegister,flagcxCommWindowRegisterGroup semantics:
flagcxGroupStart/flagcxGroupEnd
Communication Runtime Layer (CRL)#
The runtime implements four execution strategies (runners), selected based on communicator type and environment configuration:
Runner |
Mode |
Activation |
|---|---|---|
homoRunner |
Homogeneous communication (same chip type) |
Default for homogeneous communicators |
hostRunner |
Host-side (CPU) communication |
|
hybridRunner |
Multi-cluster heterogeneous communication |
|
uniRunner |
Unified heterogeneous communication |
|
The CRL also includes topology detection, proxy threads for async communication, P2P transport, an auto-tuner (flagcxTuner) for algorithm/protocol selection, and a cost model.
Portable Abstraction Layer (PAL)#
Hardware abstraction via the adaptor pattern. Each build selects exactly one device adaptor and two CCL adaptors (host + device):
CCL adaptors (
flagcx/adaptor/ccl/): One per vendor CCL library โ NCCL, HCCL, IXCCL, CNCL, MCCL, DUCCL, MUSACCL, RCCL, TCCL, ECCLDevice adaptors (
flagcx/adaptor/device/): One per hardware runtime โ CUDA, CANN, IXCUDA, MLU, MACA, MUSA, HIP, TOPS, etc.Net adaptors (
flagcx/adaptor/net/): Network transport โ InfiniBand, socket, UCXTuner adaptors (
flagcx/adaptor/tuner/): Tuning strategy plugins
Starting from v0.11, FlagCX supports adaptor plugins โ user-defined Device, CCL, and Net adaptor implementations that are dynamically loaded at runtime via dlopen. See the adaptor_plugin/ directory in the FlagCX repository for the SDK documentation and examples.