Operator List
This page lists the operators exported by FlagGems-vLLM, sourced from src/flaggems_vllm/ops/__init__.py.
FlagGems-vLLM provides optimized implementations of common vLLM operators using the Triton programming language. The following 75 operators are currently exported:
Activation and Gating

| Operator | Description |
|---|---|
| | Backward pass for GEGLU activation |
| | Backward pass for ReGLU activation |
| | Backward pass for SwiGLU activation |
| | GEGLU (Gated Linear Unit with GELU) activation |
| | GELU activation combined with element-wise multiplication |
| | ReGLU (Gated Linear Unit with ReLU) activation |
| | SiLU (Swish) activation combined with element-wise multiplication |
| | SiLU activation with multiplication (out-of-place variant) |
| | SiLU activation with multiplication and clamping |
| | SiLU activation with multiplication and clamping (out-of-place variant) |
| | SwiGLU (Gated Linear Unit with SiLU) activation |
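
The gated activations above share one pattern: an activation is applied to a "gate" half of the input and multiplied element-wise by an "up" half. The following pure-Python sketch illustrates that arithmetic only. It assumes the gate and up halves are concatenated along the last dimension, a common convention that this page does not confirm for FlagGems-vLLM, and the names `silu`, `gelu`, `relu`, and `gated_unit` are hypothetical helpers, not the exported operators.

```python
import math


def silu(v: float) -> float:
    """SiLU (Swish): v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def gelu(v: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def relu(v: float) -> float:
    return max(v, 0.0)


def gated_unit(row, act):
    """Assumed layout: the first half of `row` is the gate, the second half is the up projection."""
    d = len(row) // 2
    return [act(g) * u for g, u in zip(row[:d], row[d:])]


# SwiGLU, GEGLU, and ReGLU differ only in the activation applied to the gate.
swiglu_out = gated_unit([1.0, -1.0, 2.0, 3.0], silu)
geglu_out = gated_unit([1.0, -1.0, 2.0, 3.0], gelu)
reglu_out = gated_unit([0.0, 1.0, 2.0, 3.0], relu)  # [0.0, 3.0]
```

A fused kernel computes the activation and the multiplication in a single pass, so the intermediate activated tensor is never written to memory.
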
Attention

| Operator | Description |
|---|---|
| | Apply rotary position embeddings |
| | Concatenate and cache for MLA (Multi-head Latent Attention) |
| | FlashAttention forward pass |
| | FlashAttention with variable-length sequences |
| | Optimized FlashAttention with variable-length sequences |
| | Flash Multi-head Latent Attention |
| | Sparse Flash MLA forward pass |
| | Flash MLA with KV cache support |
| | Reshape and cache KV cache |
| | Reshape and cache for FlashAttention |
| | Sparse attention computation via Triton |
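
As a reference for what "apply rotary position embeddings" computes, the sketch below rotates pairs of features by a position-dependent angle. It assumes the NeoX-style pairing (element `i` with element `i + d/2`) and a frequency base of 10000; the operator listed above may support other layouts, and `rope_neox` is a hypothetical name used only for illustration.

```python
import math


def rope_neox(vec, position, base=10000.0):
    """Rotate each (vec[i], vec[i + d/2]) pair by angle position * base**(-2i/d)."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


query = [1.0, 0.0, 0.0, 1.0]
rotated = rope_neox(query, position=3)
```

Because each pair undergoes a pure rotation, the vector norm is unchanged, and position 0 leaves the input as it is.
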
DeepSeek V4 Attention

| Operator | Description |
|---|---|
| | Combine top-K and sliding window attention indices for DeepSeek V4 |
| | Compute global top-K indices and lengths for DeepSeek V4 |
| | Dequantize and gather K cache for DeepSeek V4 |
| | Fused Q/KV RMSNorm for DeepSeek V4 |
| | Fused Q-norm, RoPE, KV-RoPE, quantize, and insert for DeepSeek V4 |
Mixture of Experts (MoE)

| Operator | Description |
|---|---|
| | Dispatch fused MoE kernel |
| | Fused MoE experts implementation |
| | Grouped top-K selection for MoE routing |
| | In-place fused MoE experts |
| | Invoke fused MoE Triton kernel |
| | MoE block size alignment |
| | MoE block size alignment (Triton variant) |
| | MoE expert output summation |
| | Out-of-place fused MoE experts |
| | Top-K per row for decode phase |
| | Top-K per row for prefill phase |
| | Top-K with softmax for MoE gating |
| | Top-K with softplus and sqrt for MoE gating |
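
To make "top-K with softmax for MoE gating" concrete, the following sketch routes one token: it converts router logits to probabilities, keeps the `k` most probable experts, and optionally renormalizes their weights. The function name, the `renormalize` flag, and the return order are assumptions for illustration, not the signature of the exported operator.

```python
import math


def topk_softmax(logits, k, renormalize=False):
    """Return (weights, expert_ids) for the k experts with the highest softmax probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in ids]
    if renormalize:
        s = sum(weights)
        weights = [w / s for w in weights]
    return weights, ids


weights, expert_ids = topk_softmax([1.0, 3.0, 2.0, 0.0], k=2, renormalize=True)
# expert_ids is [1, 2]; the two weights sum to 1 after renormalization
```
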
Linear and Matrix

| Operator | Description |
|---|---|
| | Scaled matrix multiplication via CUTLASS |
| | Element-wise multiplication |
| | Element-wise multiplication (in-place) |
| | Matrix-vector multiplication |
| | Outer product of two vectors |
Normalization

| Operator | Description |
|---|---|
| | Addition with RMSNorm |
| | Fused addition with RMSNorm |
| | Instance normalization |
| | Skip connection with LayerNorm |
| | Weight normalization |
| | Weight normalization interface |
| | Weight normalization interface backward pass |
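
The "fused addition with RMSNorm" entry combines a residual add with root-mean-square normalization. A minimal reference for one hidden vector is sketched below, assuming the common formulation in which the sum becomes the new residual and is then normalized and scaled by a learned weight. The function name, argument order, and default `eps` are illustrative assumptions.

```python
import math


def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Return (normalized output, updated residual) for one hidden vector."""
    new_residual = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return normed, new_residual


out, res = fused_add_rms_norm([1.0, 2.0], [2.0, 2.0], weight=[1.0, 1.0])
# res is [3.0, 4.0]; out is res divided by sqrt(mean(res**2) + eps)
```

Fusing the two steps avoids a separate read and write of the hidden state between the addition and the normalization.
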
Reduction and Utility

| Operator | Description |
|---|---|
| | Apply repetition penalties during generation |
| | Count frequency of each value |
| | Cross-entropy loss computation |
| | Pack sequences |
| | Unpack sequences |
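
For the repetition-penalty entry, the sketch below shows the widely used rule: for every token that has already appeared, a positive logit is divided by the penalty and a non-positive logit is multiplied by it, so repeated tokens become less likely when the penalty exceeds 1. Whether the exported operator follows exactly this rule is an assumption, and `apply_repetition_penalty` here is a hypothetical single-sequence helper.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with previously seen tokens penalized."""
    out = list(logits)
    for t in set(seen_token_ids):  # penalize each seen token once
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1, 1], penalty=2.0)
# penalized == [1.0, -2.0, 0.5]
```
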
Quantization

| Operator | Description |
|---|---|
| | Fused indexer Q with RoPE and quantization |
| | Fused inverse RoPE with FP8 quantization |
| | Per-token group FP8 quantization |
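
Per-token group quantization assigns one scale to each contiguous group of values within a token's hidden vector. The sketch below computes those scales and the scaled values for one token, assuming the FP8 E4M3 format (largest finite magnitude 448). The final rounding to the FP8 grid is omitted because pure Python has no FP8 type, and the function name and `eps` default are illustrative assumptions.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_token_group_quant(row, group_size, eps=1e-10):
    """Return (scaled values, per-group scales); scaled * scale approximates the input."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    scaled, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        absmax = max(max(abs(v) for v in group), eps)  # eps guards all-zero groups
        scale = absmax / FP8_E4M3_MAX
        scales.append(scale)
        # Clamp to the representable range; a real kernel would then cast to FP8.
        scaled.extend(min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in group)
    return scaled, scales


q, s = per_token_group_quant([1.0, -2.0, 0.5, 4.0], group_size=2)
# Dequantize by multiplying each value by the scale of its group.
dequant = [v * s[i // 2] for i, v in enumerate(q)]
```

Smaller groups track local dynamic range more closely at the cost of storing more scales.
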
DSA: DeepSeek Sparse Attention

| Operator | Description |
|---|---|
| | Bucket sort top-K selection |
| | Gather indexer K with quantized cache |
| | Indexer K with quantization and cache |
FLA: Flash Linear Attention

| Operator | Description |
|---|---|
| | Gated delta rule computation |
| | Gated delta rule forward pass |
| | Fused recurrent gated delta rule forward pass |
MHC: Manifold-Constrained Hyper-Connections

| Operator | Description |
|---|---|
| | Head-channel fused kernel |
| | Head-channel fused kernel (reference) |
| | MHC backward pass |
| | MHC backward pass (reference) |
| | MHC post-processing |
| | MHC pre-processing |
| | Sinkhorn forward computation |
RWKV

| Operator | Description |
|---|---|
| | RWKV key-attention fusion kernel |
| | RWKV matrix multiplication sparsity kernel |