Release Notes#
[2026/03] Released v0.11:
Enables kernel-based communication on heterogeneous platforms, including NVIDIA and Hygon.
Adds support for both host-side and device-side one-sided communication semantics.
Introduces adaptor plugin support, enabling dynamic loading of user-defined Device, CCL, and Net adaptor implementations.
[2026/02] Released v0.10:
Implements 11 chip-decoupled collective communication algorithms in uniRunner mode.
Refactors Device Intra-/Inter-node API and integrates NCCL Device API support on NVIDIA platforms.
Enhances usability with pip install support for FlagCX and an NCCL wrapper plugin for seamless adoption on NVIDIA platforms.
[2026/01] Released v0.9:
Adds support for Enflame, including
topsAdaptorandecclAdaptor.Extends flagcxCCLAdaptor to support symmetric operations.
Introduces the NCCL Device API in ncclAdaptor to enable customized AllReduce operations.
Refactors
glooAdaptorto support both TCP and IB transports, with automatic NIC detection.
[2025/12] Released v0.8:
Enables intra-node zero-copy to improve data transfer efficiency for small messages.
Supports a naive AllReduce implementation in uniRunner mode using a CPU-centric, device-assisted algorithm.
Adds one-sided communication primitives via the new APIs flagcxHeteroPut and flagcxHeteroPutSignal.
[2025/11] Released v0.7:
Added support to TsingMicro, including device adaptor
tsmicroAdaptorand CCL adaptortcclAdaptor.Implemented an experimental kernel-free non-reduce collective communication (SendRecv, AlltoAll, AlltoAllv, Broadcast, Gather, Scatter, AllGather) using device-buffer IPC/RDMA.
Enabled auto-tuning on NVIDIA, MetaX, and Hygon platforms, achieving 1.02รโ1.26ร speedups for AllReduce, AllGather, ReduceScatter, and AlltoAll.
Enhanced
flagcxNetAdaptorwith one-sided primitives (put,putSignal,waitValue) and added retransmission support for reliability improvement.
[2025/10] Released v0.6:
Implemented device-buffer IPC communication to support intra-node SendRecv operations.
Introduced device-initiated, host-launched device-side primitives, enabling kernel-based communication directly from devices.
Enhanced auto-tuning with 50% performance improvement on MetaX platforms for the AllReduce operations.
[2025/09] Released v0.5:
Added support for AMD GPUs, including a device adaptor
hipAdaptorand a CCL adaptorrcclAdaptor.Introduced
flagcxNetAdaptorto unify network backends, currently supporting socket, IBRC, UCX and IBUC (experimental).Enabled zero-copy device-buffer RDMA (user-buffer RDMA) to boost performance for small messages.
Supported auto-tuning in homogeneous scenarios via
flagcxTuner.Added test automation in CI/CD for PyTorch APIs.
[2025/08] Released v0.4:
Supported heterogeneous training of ERNIE4.5 (Baidu) on NVIDIA and Iluvatar GPUs with Paddle + FlagCX.
Improved heterogeneous communication across arbitrary NIC configurations, with more robust and flexible deployments.
Introduced an experimental network plugin interface with extended supports for IBRC and SOCKET. Device buffer registration now can be done via DMA-BUF.
Added an InterOp-level DSL to enable customized C2C algorithm design.
Provided user documentation under
docs/.
[2025/07] Released v0.3:
Integrated three additional native communication libraries: HCCL (Huawei), MUSACCL (Moore Threads) and MPI.
Enhanced heterogeneous collective communication operations with pipeline optimizations.
Introduced device-side functions to enable device-buffer RDMA, complementing the existing host-side functions.
Delivered a full-stack open-source solution, FlagScale + FlagCX, for efficient heterogeneous prefilling-decoding disaggregation.
[2025/05] Released v0.2:
Integrated 3 additional native communications libraries, including MCCL (Moore Threads), XCCL (Mellanox) and DUCCL (BAAI).
Improved 11 heterogeneous collective communication operations with automatic topology detection and full support to single-NIC and multi-NIC environments.
[2025/04] Released v0.1:
Added 5 native communications libraries including CCL adaptors for NCCL (NVIDIA), IXCCL (Iluvatar), and CNCL (Cambricon), and Host CCL adaptors GLOO and Bootstrap.
Supported 11 heterogeneous collective communication operations using the C2C (Cluster-to-Cluster) algorithm.
Provided a full-stack open-source solution, FlagScale + FlagCX, for efficient heterogeneous training.
Natively integrated into PaddlePaddle v3.0.0, with support for both dynamic and static graphs.