Use operators

After installing FlagGems-vLLM, you can use its optimized operators directly in your Python code.

For example, import the library and call the operators on CUDA tensors:

import torch
import flaggems_vllm

# Prepare a simple topk_ids tensor for MoE routing
num_tokens = 128
topk = 2
num_experts = 16
block_size = 32

topk_ids = torch.randint(
    low=0,
    high=num_experts,
    size=(num_tokens, topk),
    device='cuda',
    dtype=torch.int32,
)

# Group token slots by expert and pad each group to a multiple of block_size
sorted_ids, expert_ids, num_tokens_post_pad = flaggems_vllm.ops.moe_align_block_size(
    topk_ids=topk_ids,
    block_size=block_size,
    num_experts=num_experts,
)

print(sorted_ids.shape, expert_ids.shape, num_tokens_post_pad)
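
To make the three return values easier to interpret, here is a minimal pure-Python sketch of what block alignment for MoE routing typically computes. It is not the FlagGems-vLLM implementation: the function name `moe_align_block_size_reference` is hypothetical, and the padding sentinel (the total number of token slots) and the exact output lengths are assumptions modeled on the convention commonly used by vLLM-style fused MoE kernels. The real operator returns tensors and may differ in buffer sizes and details.

```python
def moe_align_block_size_reference(topk_ids, block_size, num_experts):
    """Illustrative reference only (hypothetical helper, not the library API).

    topk_ids: list of rows, one per token, each listing the chosen expert ids.
    Returns (sorted_ids, expert_ids, num_tokens_post_pad) as plain Python values.
    """
    # Flatten (num_tokens, topk) into a single list of token slots.
    flat = [expert for row in topk_ids for expert in row]

    # Assumed sentinel for padding: one past the last valid slot index.
    pad_value = len(flat)

    # Collect the slot indices routed to each expert, preserving order.
    buckets = [[] for _ in range(num_experts)]
    for slot, expert in enumerate(flat):
        buckets[expert].append(slot)

    sorted_ids, expert_ids = [], []
    for expert, slots in enumerate(buckets):
        num_blocks = -(-len(slots) // block_size)  # ceiling division
        padded_len = num_blocks * block_size
        # Pad the expert's group so it fills whole blocks.
        sorted_ids.extend(slots + [pad_value] * (padded_len - len(slots)))
        # Record which expert owns each block.
        expert_ids.extend([expert] * num_blocks)

    return sorted_ids, expert_ids, len(sorted_ids)


# Three tokens, top-2 routing, three experts, blocks of two slots.
ref_sorted, ref_experts, ref_total = moe_align_block_size_reference(
    topk_ids=[[0, 2], [1, 2], [2, 0]],
    block_size=2,
    num_experts=3,
)
print(ref_sorted, ref_experts, ref_total)
```

In this sketch, each entry of `expert_ids` names the expert that owns one block of `block_size` consecutive entries in `sorted_ids`, so a grouped matrix multiplication can process one block per expert at a time and skip the padded slots.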

For a full operator list, see Operator List.