Install software for running an inference task
1. Install vLLM v0.18.1, either from the official release or from the vllm-FL fork. This step is optional if the correct version is already installed.
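Because this step can be skipped when the right version is already present, a quick check helps you decide. The following is a minimal bash sketch, not part of vllm-plugin-FL. The function name check_vllm_version is hypothetical. It assumes that a build from the vllm-FL fork reports either 0.18.1 or 0.18.1 followed by a local suffix such as +something.

```shell
# Hypothetical helper: succeed if the vllm importable by python3 is v0.18.1.
# Assumption: a vllm-FL fork build may report a local suffix such as
# "0.18.1+<something>", so that form is accepted as well.
check_vllm_version() {
  local want="0.18.1" have
  # vLLM exposes its version string as vllm.__version__.
  if ! have="$(python3 -c 'import vllm; print(vllm.__version__)' 2>/dev/null)"; then
    echo "vllm is not importable with python3; install v${want}" >&2
    return 1
  fi
  case "$have" in
    "$want" | "$want"+*)
      echo "vllm ${have} OK"
      return 0
      ;;
    *)
      echo "found vllm ${have}, expected ${want}" >&2
      return 1
      ;;
  esac
}

# Example: only install vLLM when the check fails.
#   check_vllm_version || echo "install vllm v0.18.1 before continuing"
```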
2. Install vllm-plugin-FL.
2.1 Clone the repository:
git clone https://github.com/flagos-ai/vllm-plugin-FL
2.2 Install the plugin:
cd vllm-plugin-FL
pip install --no-build-isolation .
# or, for an editable install:
pip install --no-build-isolation -e .
3. Install FlagGems.
3.1 Install the build dependencies:
pip install -U scikit-build-core==0.11 pybind11 ninja cmake
3.2 Install FlagGems:
git clone https://github.com/flagos-ai/FlagGems
cd FlagGems
git checkout v5.0.0
pip install --no-build-isolation .
# or, for an editable install:
pip install --no-build-isolation -e .
4. (Optional) Install FlagCX.
4.1 Clone the repository:
git clone https://github.com/flagos-ai/FlagCX.git
cd FlagCX
git checkout v0.9.0
git submodule update --init --recursive
4.2 Build the library with the flag for your platform. For example, on NVIDIA:
make USE_NVIDIA=1
4.3 Set the environment variable (run this from the FlagCX root directory):
export FLAGCX_PATH="$PWD"
4.4 Install FlagCX:
cd plugin/torch/
FLAGCX_ADAPTOR=[xxx] pip install . --no-build-isolation
# or, for an editable install:
FLAGCX_ADAPTOR=[xxx] pip install -e . --no-build-isolation
Note
Replace [xxx] with the adaptor for the current platform, for example nvidia or ascend.
If multiple vLLM plugins are installed in the current environment, select vllm-plugin-FL by setting VLLM_PLUGINS='fl'.
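You can partly automate the choice of the [xxx] adaptor in the FlagCX install step. The bash sketch below is only a heuristic, not a FlagCX feature. The function name detect_flagcx_adaptor is hypothetical. It assumes that finding nvidia-smi or npu-smi on PATH identifies an NVIDIA GPU or Huawei Ascend NPU machine. On other platforms, set the adaptor name by hand.

```shell
# Hypothetical helper: print a guess for FLAGCX_ADAPTOR on this machine.
# Assumption: the vendor management tool on PATH identifies the platform
# (nvidia-smi -> NVIDIA GPUs, npu-smi -> Huawei Ascend NPUs).
detect_flagcx_adaptor() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo nvidia
  elif command -v npu-smi >/dev/null 2>&1; then
    echo ascend
  else
    echo "cannot detect the platform; set FLAGCX_ADAPTOR by hand" >&2
    return 1
  fi
}

# Example, run from FlagCX/plugin/torch/:
#   adaptor="$(detect_flagcx_adaptor)" && FLAGCX_ADAPTOR="$adaptor" pip install . --no-build-isolation
```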
Additional setup steps for running an inference task on Huawei Ascend
Install FlagTree
RES="--index-url=https://resource.flagos.net/repository/flagos-pypi-hosted/simple --trusted-host=https://resource.flagos.net"
python3 -m pip install flagtree==0.4.0+ascend3.2 $RES
Set the required environment variable
export TRITON_ALL_BLOCKS_PARALLEL=1
Enable eager execution
Ascend requires eager execution. Add enforce_eager=True to the LLM constructor, or pass --enforce-eager on the command line.
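Here is a bash sketch of how the Ascend-specific settings might be combined into one launch wrapper. The function name serve_on_ascend is hypothetical. vllm serve and its --enforce-eager flag are part of the vLLM command line. Any extra arguments are passed through to vllm serve unchanged.

```shell
# Hypothetical wrapper: apply the Ascend settings from this section, then start
# vLLM's OpenAI-compatible server. Extra arguments go straight to vllm serve.
serve_on_ascend() {
  if [ "$#" -lt 1 ]; then
    echo "usage: serve_on_ascend <model-name-or-path> [extra vllm serve args]" >&2
    return 2
  fi
  local model="$1"
  shift
  # Environment variable required on Ascend.
  export TRITON_ALL_BLOCKS_PARALLEL=1
  # Ascend requires eager execution.
  vllm serve "$model" --enforce-eager "$@"
}

# Example (the model path is a placeholder):
#   serve_on_ascend /path/to/model --tensor-parallel-size 2
```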
Additional setup steps for running an inference task with CUDA
This section shows how to run an inference task with CUDA by setting environment variables.
For operator dispatch environment variables, see Environment variables.
Use CUDA communication library
To use the native CUDA communication library instead of FlagCX, unset FLAGCX_PATH:
unset FLAGCX_PATH
Use native CUDA operators
To use the native CUDA operators instead of the FlagGems operators, set the following environment variable:
export USE_FLAGGEMS=0
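The two CUDA fallbacks above are independent, so a small helper can apply either or both in the current shell. This bash sketch is illustrative only. The function name use_cuda_native and its --comm and --ops options are hypothetical. FLAGCX_PATH and USE_FLAGGEMS are the variables described in this section. Source the function into your shell, not a subshell, so that the changes persist.

```shell
# Hypothetical helper: switch to native CUDA components in the current shell.
#   --comm  native CUDA communication library instead of FlagCX
#   --ops   native CUDA operators instead of FlagGems
use_cuda_native() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --comm) unset FLAGCX_PATH ;;
      --ops) export USE_FLAGGEMS=0 ;;
      *)
        echo "unknown option: $arg (expected --comm and/or --ops)" >&2
        return 2
        ;;
    esac
  done
}

# Example: use both native communication and native operators.
#   use_cuda_native --comm --ops
```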
Dispatch operators
vllm-plugin-FL also lets you control how operators are dispatched.
For concepts, see vllm-plugin-FL Overview. For configuration details, see Operator dispatch user guide.