verl-FL User Guide#
Overview#
verl-FL is a fork of verl designed to support diverse AI accelerators. It is built on top of FlagOS, a unified open-source AI system software stack, and integrates key components including the training engine Megatron-LM-FL, Transformer-Engine-FL, and the inference engine vllm-plugin-FL.
verl (Volcano Engine Reinforcement Learning for LLMs) is a flexible, efficient, and production-ready RL training framework for large language models (LLMs).
Key Features#
Diverse RL Algorithms: PPO, GRPO, DAPO, DrGRPO, GMPO, SPPO, SPIN, RLOO, ReMax, REINFORCE++, PRIME, and more
Multi-Backend Training: FSDP, FSDP2, and Megatron-LM (via Megatron-LM-FL) for training; vLLM, SGLang, and HF Transformers for inference/rollout
Multi-Hardware Support: NVIDIA (CUDA), AMD (ROCm), Huawei Ascend (NPU) via platform abstraction
Scalable Architecture: Single-controller design with Ray for orchestration, scaling from single-GPU to thousands of GPUs
Advanced Features: Multi-turn tool calling, VLM RL, sequence packing, LoRA RL, expert parallelism, async training (fully async / one-step off-policy), speculative decoding for RL
Model Support: HuggingFace models including Qwen-3, Qwen-2.5, Llama3.1, Gemma2, DeepSeek (up to 671B), and VLMs
Getting Started#
Requirements#
Python >= 3.10
CUDA >= 12.8
Docker Installation (Recommended)#
docker pull verlai/verl:latest
pip Installation#
# Create conda environment
conda create -n verl python=3.10
conda activate verl
# Install PyTorch with CUDA 12.8
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Clone and install verl-FL
git clone https://github.com/flagos-ai/verl-FL.git
cd verl-FL
pip install -e .
# Install vLLM for rollout (quote the specifier so the shell does not treat ">" as a redirect)
pip install "vllm>=0.8.5"
# Install Flash Attention
pip install flash-attn
Quick Installation Script#
# Installs vLLM, SGLang, and Megatron-Core backends
bash scripts/install_vllm_sglang_mcore.sh
Training Backends#
| Backend | Use Case | Installation |
|---|---|---|
| FSDP | Default, easy setup | Included with PyTorch |
| FSDP2 | Latest PyTorch distributed | Included with PyTorch |
| Megatron-LM | Large-scale training | Via Megatron-LM-FL |
Rollout Backends#
| Backend | Use Case | Installation |
|---|---|---|
| vLLM | High-throughput inference | |
| SGLang | Multi-turn, tool calling | |
| HF Transformers | Simple, no extra deps | Included |
Installation#
Custom Environment Setup#
# Create environment
conda create -n verl python=3.10
conda activate verl
# Install CUDA 12.8
conda install -c nvidia cuda-toolkit=12.8
# Install cuDNN
pip install nvidia-cudnn-cu12==9.10.1.2
# Install PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Install verl-FL (run from the parent directory of an existing clone)
cd verl-FL
pip install -e .
# Install Apex (optional, for Megatron backend)
pip install -v --disable-pip-version-check --no-cache-dir \
--no-build-isolation --config-settings="--build-option=--cpp_ext" \
--config-settings="--build-option=--cuda_ext" \
git+https://github.com/NVIDIA/apex
AMD ROCm Support#
verl-FL supports AMD GPUs via ROCm. See the docs/amd_tutorial/ directory in the repository for detailed setup instructions.
Ascend NPU Support#
verl-FL supports Huawei Ascend NPUs. Install NPU-specific dependencies:
pip install -r requirements-npu.txt
See docs/ascend_tutorial/ in the repository for detailed instructions.
RL Algorithms#
PPO (Proximal Policy Optimization)#
The standard RL algorithm for LLM post-training. It uses an actor-critic architecture and estimates advantages with GAE (Generalized Advantage Estimation).
Key configuration:
algorithm:
  kl_ctrl:
    type: fixed  # KL divergence control
    kl_coef: 0.001
trainer:
  train_batch_size: 256
  ppo_mini_batch_size: 64
  ppo_epochs: 1
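To make the GAE step concrete, here is a minimal sketch of the advantage recursion for a single trajectory. It is not verl-FL's implementation: the function name `compute_gae`, the default `gamma` and `lam` values, and the choice to treat the value after the final step as zero are all assumptions made for illustration.

```python
from typing import List


def compute_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Return GAE advantages for one trajectory (illustrative sketch).

    values[t] is the critic's estimate at step t. The value after the
    last step is assumed to be 0, i.e. the episode terminates there.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # A sparse reward at the final token, as is typical for rule-based rewards.
    print(compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

With `lam=1.0` the result reduces to the reward-to-go minus the critic's estimate; with `lam=0.0` it reduces to the one-step TD error.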
GRPO (Group Relative Policy Optimization)#
Critic-free algorithm that estimates baselines from group scores instead of a learned value function.
Key configuration:
rollout:
  n: 8  # Number of samples per prompt
trainer:
  train_batch_size: 256
  ppo_mini_batch_size: 64
algorithm:
  loss_agg_mode: token  # or "seq" for sequence-level
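The group-based baseline can be sketched as follows: each prompt is sampled `n` times, and every response's score is normalized against the mean and standard deviation of its own group. This is a simplified illustration rather than verl-FL's code; the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions (implementations may use the sample standard deviation instead).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the scores of one prompt's responses against their group.

    The group mean plays the role of the baseline, so no critic is needed.
    """
    if not scores:
        raise ValueError("scores must contain at least one response")
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population std; eps avoids division by zero
    return [(s - mu) / (sigma + eps) for s in scores]


if __name__ == "__main__":
    # Eight rule-based scores for one prompt, matching n: 8 above.
    group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    for score, adv in zip(group, group_relative_advantages(group)):
        print(f"score={score:.1f} advantage={adv:+.3f}")
```

When every response in a group receives the same score, all advantages are zero and the group contributes no gradient signal.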
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)#
Extension of GRPO with decoupled lower and upper clip epsilons, dynamic sampling, and overlong reward shaping.
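Two of these ideas can be sketched in a few lines: a per-token clipped surrogate with separate lower and upper clip ranges, and a dynamic-sampling filter that drops groups whose scores are all identical. The function names and the default clip values are illustrative assumptions, not verl-FL configuration keys, and overlong reward shaping is not covered here.

```python
import math
from typing import List


def decoupled_clip_objective(
    log_prob_new: float,
    log_prob_old: float,
    advantage: float,
    clip_low: float = 0.2,    # illustrative lower clip range
    clip_high: float = 0.28,  # illustrative, larger upper clip range
) -> float:
    """Per-token clipped surrogate with asymmetric clip bounds (to be maximized)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Pessimistic bound, as in PPO: take the smaller of the two surrogates.
    return min(ratio * advantage, clipped * advantage)


def keep_group(scores: List[float]) -> bool:
    """Dynamic sampling filter: keep a prompt only if its scores differ.

    A group where every response scores the same has zero group-relative
    advantage everywhere, so it carries no learning signal.
    """
    return len(set(scores)) > 1


if __name__ == "__main__":
    print(decoupled_clip_objective(math.log(1.5), 0.0, advantage=1.0))
    print(keep_group([1.0, 1.0, 1.0]), keep_group([1.0, 0.0, 1.0]))
```

A wider upper range lets low-probability tokens with positive advantage grow more before clipping takes effect, while the lower range still limits how far a token's probability can be pushed down in one update.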
Other Algorithms#
GMPO: Geometric-Mean Policy Optimization for stable training
SPPO: Self-Play Preference Optimization
SPIN: Self-Play Fine-Tuning with online DPO loss
DrGRPO: GRPO variant ("GRPO Done Right") that removes response-length and standard-deviation normalization to reduce optimization bias
Platform Abstraction#
verl-FL includes a platform abstraction layer (verl/plugin/platform/) that provides a hardware-agnostic interface for multi-accelerator support.
Supported Platforms#
| Platform | Device | Status |
|---|---|---|
| CUDA | NVIDIA GPUs | Full support |
| NPU | Huawei Ascend | Full support |
| CPU | CPU | Basic support |
Adding a New Accelerator#
To add support for a new accelerator (e.g., XPU, ROCm, MLU), implement the platform interface in verl/plugin/platform/. See the existing platform implementations as reference.
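As a rough picture of what a platform interface of this kind might look like, the sketch below defines an abstract base class, a small registry, and one concrete platform. Everything in it is hypothetical: the class names, method names, and registry helper are invented for illustration and are not the actual API in verl/plugin/platform/, which should be consulted directly.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class Platform(ABC):
    """Hypothetical hardware-agnostic interface (not verl-FL's real class)."""

    name: str  # short identifier used for lookup, e.g. "cpu"

    @abstractmethod
    def device_type(self) -> str:
        """Device string a framework would use to place tensors."""

    @abstractmethod
    def is_available(self) -> bool:
        """Whether this accelerator can be used on the current host."""

    @abstractmethod
    def communication_backend(self) -> str:
        """Name of the collective-communication backend to use."""


_REGISTRY: Dict[str, Type[Platform]] = {}


def register_platform(cls: Type[Platform]) -> Type[Platform]:
    """Class decorator that makes a platform discoverable by name."""
    _REGISTRY[cls.name] = cls
    return cls


@register_platform
class CpuPlatform(Platform):
    name = "cpu"

    def device_type(self) -> str:
        return "cpu"

    def is_available(self) -> bool:
        return True  # a CPU is always present

    def communication_backend(self) -> str:
        return "gloo"  # the CPU backend of torch.distributed


def get_platform(name: str) -> Platform:
    """Instantiate a registered platform, or fail with a clear error."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown platform: {name!r}") from None


if __name__ == "__main__":
    platform = get_platform("cpu")
    print(platform.device_type(), platform.communication_backend())
```

Under this design, supporting a new accelerator means adding one subclass and registering it, without touching the code that calls `get_platform`.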
Dataset Format#
verl-FL uses the following RLHF dataset schema:
{
  "data_source": "dataset_name",
  "prompt": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"}
  ],
  "ability": "math",
  "reward_model": {
    "style": "rule",
    "ground_truth": "4"
  }
}
Fields:
data_source: Dataset identifier
prompt: Chat-format messages (system/user/assistant roles)
ability: Task category tag
reward_model: Reward configuration (rule-based or model-based)
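The schema above can be built and checked with a few lines of standard-library Python. This is a sketch only: `make_record` and `validate_record` are hypothetical helper names, not part of verl-FL, and the checks cover just the four fields and three roles listed above.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("data_source", "prompt", "ability", "reward_model")
ALLOWED_ROLES = {"system", "user", "assistant"}


def make_record(
    data_source: str,
    question: str,
    ground_truth: str,
    ability: str = "math",
    system_prompt: str = "You are a helpful assistant.",
) -> Dict[str, Any]:
    """Build one record with a rule-based reward (hypothetical helper)."""
    return {
        "data_source": data_source,
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "ability": ability,
        "reward_model": {"style": "rule", "ground_truth": ground_truth},
    }


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    prompt = record.get("prompt", [])
    if not isinstance(prompt, list):
        return problems + ["prompt must be a list of messages"]
    for i, message in enumerate(prompt):
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            problems.append(f"prompt[{i}] needs 'role' and 'content'")
        elif message["role"] not in ALLOWED_ROLES:
            problems.append(f"prompt[{i}] has unknown role: {message['role']}")
    return problems


if __name__ == "__main__":
    record = make_record("dataset_name", "What is 2+2?", "4")
    print(json.dumps(record, indent=2))
    print(validate_record(record))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole dataset in a single pass.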