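End-to-End Example: TransformerEngine-FL + Megatron-LM-FL + FlagScale
This guide walks through an end-to-end training workflow using TransformerEngine-FL, Megatron-LM-FL, and FlagScale on the CUDA (NVIDIA) and MetaX platforms.
1. Docker Environment
FlagOS Release Image (v0.2.0-rc2, recommended)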
docker pull harbor.baai.ac.cn/flagos21-release/megatron-lm-fl:v0.2.0-rc2-nvidia
Includes torch 2.4.0a0, triton 3.0.0, and trans-engine 2.14.0. Suitable for pretraining models with hundreds of billions of parameters.
CUDA (NVIDIA)
docker pull harbor.baai.ac.cn/flagscale/flagscale-train:dev-cu128-py3.12-20260319182856
docker run -itd --gpus all --shm-size=500g --name <name> harbor.baai.ac.cn/flagscale/flagscale-train:dev-cu128-py3.12-20260319182856 /bin/bash
docker exec -it <name> /bin/bash
conda activate flagscale-train
pip install flash-attn==2.8.3 --no-build-isolation
pip install --upgrade wandb tensorboard
MetaX
docker pull harbor.baai.ac.cn/flagscale/megatron-lm-with-te:202603231839
docker run -itd --gpus all --shm-size=500g --name <name> --ulimit nofile=65535:65535 --device=/dev/dri --device=/dev/mxcd harbor.baai.ac.cn/flagscale/megatron-lm-with-te:202603231839
docker exec -it <name> /bin/bash
conda activate base
2. Prepare FlagScale
git clone https://github.com/flagos-ai/FlagScale.git
cd FlagScale
# CUDA only
pip install -r requirements/cuda/train.txt
git checkout <release-tag>
3. Prepare Megatron-LM-FL
git clone https://github.com/flagos-ai/Megatron-LM-FL.git
cd Megatron-LM-FL
git checkout <release-tag>
pip install . --no-build-isolation --root-user-action=ignore
4. Prepare TransformerEngine-FL
git clone https://github.com/flagos-ai/TransformerEngine-FL.git
cd TransformerEngine-FL
git checkout <release-tag>
git submodule update --init --recursive
MAX_JOBS=64 pip install -v . --no-build-isolation --root-user-action=ignore
# On MetaX (image: harbor.baai.ac.cn/flagscale/megatron-lm-with-te:202603231839), use this command instead:
TE_FL_SKIP_CUDA=1 MAX_JOBS=64 pip install -v . --no-build-isolation --root-user-action=ignore
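The two install commands above differ only in the TE_FL_SKIP_CUDA=1 prefix. The following is a minimal sketch of a wrapper that picks the right one. The function name te_install_cmd and the platform labels cuda and metax are hypothetical, introduced only for this illustration. It prints the command rather than running it.

```shell
# Hypothetical helper: print the TransformerEngine-FL install command for a platform.
# The labels "cuda" and "metax" are illustrative and not part of any tool.
te_install_cmd() {
    base='MAX_JOBS=64 pip install -v . --no-build-isolation --root-user-action=ignore'
    case "$1" in
        cuda)  printf '%s\n' "$base" ;;
        metax) printf '%s\n' "TE_FL_SKIP_CUDA=1 $base" ;;
        *)     echo "unknown platform: $1" >&2; return 1 ;;
    esac
}
```

From inside the TransformerEngine-FL checkout, the printed command could then be run with, for example, eval "$(te_install_cmd metax)".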
5. Prepare the Dataset and Tokenizer
Dataset
mkdir -p ./data && cd ./data
wget https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/datasets/enron_emails_demo_text_document_qwen/enron_emails_demo_text_document_qwen.idx
wget https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/datasets/enron_emails_demo_text_document_qwen/enron_emails_demo_text_document_qwen.bin
Tokenizer
mkdir -p ./qwentokenizer && cd ./qwentokenizer
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/tokenizers/qwentokenizer/tokenizer_config.json" -O tokenizer_config.json
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/tokenizers/qwentokenizer/qwen.tiktoken" -O qwen.tiktoken
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/tokenizers/qwentokenizer/qwen_generation_utils.py" -O qwen_generation_utils.py
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/tokenizers/qwentokenizer/tokenization_qwen.py" -O tokenization_qwen.py
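An interrupted download can leave a missing or zero-byte file that only surfaces later as a training failure. As a hedged sketch, the helper below (require_files is a hypothetical name, not part of FlagScale) checks that each expected file exists and is non-empty.

```shell
# Hypothetical helper: succeed only if every named file exists in the
# given directory and is non-empty. Usage: require_files DIR FILE...
require_files() {
    dir=$1
    shift
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
}
```

For the downloads above, it might be called as require_files ./data enron_emails_demo_text_document_qwen.idx enron_emails_demo_text_document_qwen.bin, and as require_files ./qwentokenizer tokenizer_config.json qwen.tiktoken qwen_generation_utils.py tokenization_qwen.py, with the directory arguments adjusted to wherever the files actually landed.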
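6. Qwen3 Training Test
Requires 4 GPUs. The te_fl_prefer parameter controls backend selection: use vendor to select the vendor-specific implementation, or reference to select the pure-PyTorch fallback.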
cd FlagScale
python run.py \
--config-path examples/qwen3/conf \
--config-name train_te_fl.yaml \
action=run \
experiment.exp_dir=./ \
experiment.runner.hostfile=null \
'~experiment.runner.ssh_port' \
'experiment.cmds.before_start="ulimit -n 1048576 && source /root/miniconda3/bin/activate base"' \
'experiment.envs.CUDA_VISIBLE_DEVICES="4,5,6,7"' \
train.system.tensor_model_parallel_size=4 \
train.model.num_layers=4 \
train.data.data_path=./data/enron_emails_demo_text_document_qwen \
train.data.tokenizer.tokenizer_path=./qwentokenizer \
train.model.te_fl_prefer=vendor \
train.model.distributed_backend=nccl \
+train.model.attention_backend=flash \
train.model.enable_flag_gems=False \
'~train.model.te_fl_allow_vendors' \
'~train.model.te_fl_deny_vendors' \
'~train.model.te_fl_per_op' \
'~train.model.flag_gems_log_path' \
'~train.model.flag_gems_unused'
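Key configuration options:
| Parameter | Description | Values |
|---|---|---|
| te_fl_prefer | Preferred backend | vendor, reference |
| attention_backend | Attention implementation | flash, unfused |
| enable_flag_gems | Enable FlagGems operator dispatch | True, False |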
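The Qwen3 command exposes four devices through CUDA_VISIBLE_DEVICES and sets tensor_model_parallel_size=4. A small pre-flight check can catch a mismatch between the two before launching. The sketch below assumes the device count should equal the tensor-parallel size, which matches this single-node example but is not a general rule. The names count_devices and check_parallel_size are hypothetical.

```shell
# Count the entries in a comma-separated device list such as "4,5,6,7".
count_devices() {
    # grep -c exits non-zero when the count is 0, so mask that status;
    # the printed count is still correct.
    printf '%s\n' "$1" | tr ',' '\n' | grep -c . || true
}

# Hypothetical pre-flight check: fail if the device list length differs
# from the expected parallel size. Usage: check_parallel_size LIST SIZE
check_parallel_size() {
    devices=$1
    tp_size=$2
    n=$(count_devices "$devices")
    if [ "$n" -ne "$tp_size" ]; then
        echo "device list has $n entries, expected $tp_size" >&2
        return 1
    fi
}
```

For the command above, check_parallel_size "4,5,6,7" 4 succeeds, while a shorter list such as "4,5" reports the mismatch and returns a non-zero status.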
7. DeepSeek-V3 Training Test
Requires 4 GPUs. Uses expert parallelism and a per-operator backend configuration for grouped GEMM.
cd FlagScale
python run.py \
--config-path examples/deepseek_v3/conf \
--config-name train.yaml \
action=run \
experiment.exp_dir=./ \
'experiment.cmds.before_start="ulimit -n 1048576 && source /root/miniconda3/bin/activate base"' \
'experiment.envs.CUDA_VISIBLE_DEVICES="4,5,6,7"' \
train.system.decoder_first_pipeline_num_layers=2 \
train.system.expert_model_parallel_size=2 \
train.model.num_layers=4 \
'train.model.moe_layer_freq="[0]+[1]*3"' \
train.data.data_path=./data/enron_emails_demo_text_document_qwen \
train.data.tokenizer.tokenizer_path=./qwentokenizer \
+train.model.enable_flag_gems=False \
+train.model.attention_backend=unfused \
+train.model.te_fl_prefer=vendor \
'+train.model.te_fl_per_op="te_general_grouped_gemm=vendor"'
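8. Notes
Change the conda environment to match your platform.
All training tests require 4 GPUs.
Parameters marked as optional can be adjusted to suit your setup.
Training logs are written to ./logs/host_0_localhost.output
For MetaX, set TE_FL_SKIP_CUDA=1 before installing TransformerEngine-FL.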
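After a run, the log file can be scanned for obvious failure markers. This is a rough sketch only: scan_log is a hypothetical helper, and the search patterns are a guess at common failure text rather than an official list of messages emitted by FlagScale or Megatron-LM-FL.

```shell
# Hypothetical helper: print log lines that look like failures, with line numbers.
# Returns 0 if nothing suspicious was found, 1 if matches were printed,
# and 2 if the log file does not exist.
scan_log() {
    log=$1
    if [ ! -f "$log" ]; then
        echo "log not found: $log" >&2
        return 2
    fi
    # grep exits 0 when it finds a match, so a match means "failure seen".
    if grep -n -i -E 'traceback|error|out of memory' "$log"; then
        return 1
    fi
    return 0
}
```

With the default experiment directory used in this guide, the call would be scan_log ./logs/host_0_localhost.output. A clean result does not prove the run succeeded, so the log is still worth reading directly.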