Debugging and diagnostics

This section covers the tools for diagnosing op dispatch: the dispatch log, the ATen replacement log, and precision bisection.

Dispatch log

To see which backend each fused op resolved to, set SGLANG_FL_DISPATCH_LOG to a file path. The log is written at server startup:

rm -f /tmp/dispatch.log
SGLANG_FL_DISPATCH_LOG=/tmp/dispatch.log \
  python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph

sort -u /tmp/dispatch.log
# [OOT-DISPATCH] SiluAndMul → flagos(flagos)
# [OOT-DISPATCH] RMSNorm → flagos(flagos)
# [OOT-DISPATCH] RotaryEmbedding → flagos(flagos)
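
For larger models the log can be easier to read as a table than as raw lines. The following is a minimal sketch, not part of the plugin, that parses lines of the shape shown above and lists ops that did not resolve to the expected backend. The line format is an assumption inferred from the sample output, and `parse_dispatch_lines` and `ops_not_on` are hypothetical helper names.

```python
import re

# Assumed line shape, inferred from the sample output above:
#   [OOT-DISPATCH] <OpName> → <backend>(<impl>)
DISPATCH_LINE = re.compile(r"^\[OOT-DISPATCH\]\s+(\S+)\s+→\s+(\w+)\((\w+)\)\s*$")


def parse_dispatch_lines(lines):
    """Map each op name to the set of (backend, impl) pairs it resolved to."""
    resolved = {}
    for line in lines:
        match = DISPATCH_LINE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        op, backend, impl = match.groups()
        resolved.setdefault(op, set()).add((backend, impl))
    return resolved


def ops_not_on(resolved, expected_backend="flagos"):
    """Return sorted op names that resolved to any backend other than the expected one."""
    return sorted(
        op
        for op, targets in resolved.items()
        if any(backend != expected_backend for backend, _ in targets)
    )
```

To use it, pass an open file handle, for example `parse_dispatch_lines(open("/tmp/dispatch.log", encoding="utf-8"))`, then pass the result to `ops_not_on`. An empty list means every logged op resolved to `flagos`.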

ATen replacement log

Record which PyTorch ATen ops were replaced by FlagGems:

rm -f /tmp/gems_aten.txt
SGLANG_FLAGGEMS_RECORD=1 SGLANG_FLAGGEMS_LOG_PATH=/tmp/gems_aten.txt \
  python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph

# After first inference request:
sort -u /tmp/gems_aten.txt

Note

The log uses _AtenOnlyFilter to record only calls in the flag_gems.ops.* namespace. Internal FlagGems calls triggered by Layer 2 (fused op) implementations are excluded.
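
When comparing two runs (for example, with and without a whitelist), it helps to see which records appear in only one of the logs. The sketch below assumes the log holds one record per line. The record format itself is not specified in this section, so lines are treated as opaque strings. `count_records` and `new_records` are hypothetical helper names.

```python
from collections import Counter


def count_records(lines):
    """Count how often each distinct non-empty line appears."""
    return Counter(line.strip() for line in lines if line.strip())


def new_records(before_lines, after_lines):
    """Return the sorted records present in after_lines but absent from before_lines."""
    before = set(count_records(before_lines))
    after = set(count_records(after_lines))
    return sorted(after - before)
```

`count_records` gives the same information as `sort | uniq -c`, while `new_records` shows which replacements a configuration change introduced.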

Troubleshoot numerical precision issues with precision bisection

When outputs differ numerically from the baseline, work through the steps below to isolate the responsible layer. If the output diverges at Step N but not at Step N-1, the layer or op that changed between those two steps is the likely cause.
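
The comparison rule above can be sketched in Python. This assumes you have captured one output per step for the same prompt, with deterministic (greedy) decoding so that the baseline is reproducible. Outputs may be strings or lists of token IDs. `first_divergence` and `first_bad_step` are hypothetical helper names, not part of SGLang or the plugin.

```python
def first_divergence(reference, candidate):
    """Return the first index where two sequences differ, or None if they are identical."""
    for index, (ref_item, cand_item) in enumerate(zip(reference, candidate)):
        if ref_item != cand_item:
            return index
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


def first_bad_step(steps):
    """Find the first step whose output differs from the baseline.

    steps is an ordered list of (label, output) pairs; the first pair is the
    baseline. Returns (label, index_of_first_difference), or None if every
    step matches the baseline.
    """
    if not steps:
        return None
    _, baseline = steps[0]
    for label, output in steps[1:]:
        position = first_divergence(baseline, output)
        if position is not None:
            return label, position
    return None
```

Comparing token IDs rather than decoded text gives a more precise divergence position, since one differing token can change several characters.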

# Step 1: Disable everything — confirm vanilla SGLang works
SGLANG_PLUGINS="__none__" python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph

# Step 2: Enable only Layer 2 (fused ops), disable ATen replacement
USE_FLAGGEMS=0 python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph

# Step 3: Per-op isolation — only SiluAndMul uses flagos, RMSNorm uses reference
USE_FLAGGEMS=0 \
SGLANG_FL_PER_OP="silu_and_mul=flagos;rms_norm=reference" \
    python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph

# Step 4: Disable Layer 2, only ATen replacement active
SGLANG_FL_OOT_ENABLED=0 python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph

# Step 5: Gradually enable ATen ops with a whitelist
SGLANG_FL_OOT_ENABLED=0 SGLANG_FL_FLAGOS_WHITELIST=rms_norm,silu \
    python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --port 30000 --disable-piecewise-cuda-graph
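
Step 3 extends naturally to testing one op at a time. The sketch below generates one SGLANG_FL_PER_OP value per op, with only that op on `flagos` and the rest on `reference`. The `op=backend;op=backend` syntax is an assumption inferred from the Step 3 example. Only `silu_and_mul` and `rms_norm` are named in this section, so supply any other op names from your own dispatch configuration. All function names here are hypothetical.

```python
def parse_per_op(value):
    """Parse 'op=backend;op=backend' into an ordered dict of op -> backend."""
    mapping = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing or doubled separator
        op, separator, backend = item.partition("=")
        if not separator or not op.strip() or not backend.strip():
            raise ValueError(f"malformed entry: {item!r}")
        mapping[op.strip()] = backend.strip()
    return mapping


def format_per_op(mapping):
    """Serialize an op -> backend mapping back into the 'op=backend;...' form."""
    return ";".join(f"{op}={backend}" for op, backend in mapping.items())


def one_at_a_time(ops, under_test="flagos", fallback="reference"):
    """Yield (op, value) pairs where only that op uses the backend under test."""
    for target in ops:
        mapping = {op: (under_test if op == target else fallback) for op in ops}
        yield target, format_per_op(mapping)
```

Each yielded value can be exported as SGLANG_FL_PER_OP for one server launch. The first launch whose output diverges from the baseline points to the op to investigate.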

Common issues

Symptom	Cause & Fix
`dispatch.log` is empty	Plugin not loaded — check `pip show sglang_fl`
`gems_aten.txt` is empty	No inference request has been sent yet, `USE_FLAGGEMS=0` is set, or `SGLANG_FL_FLAGOS_WHITELIST` excludes the op
`forward_cuda` error on non-NVIDIA	An op lacks OOT registration — register it or add to whitelist
`ImportError: sgl_kernel`	Normal on non-CUDA — the OOT dispatch bypasses `forward_cuda`
`tp>1` hangs at startup	Check GPU count, NCCL env vars, model TP compatibility
OOM at engine startup	Reduce `--mem-fraction-static` (default 0.5)
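
Several of the symptoms above come down to an environment variable left over from an earlier bisection step. The following sketch checks for the settings mentioned in this section and for the installed plugin. It is an illustrative helper, not part of the plugin. The distribution name `sglang_fl` is taken from the `pip show` hint in the table, and `plugin_version` and `env_hints` are hypothetical names.

```python
from importlib import metadata


def plugin_version(dist_name="sglang_fl"):
    """Return the installed version of the distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def env_hints(env):
    """Return hints for environment settings that can explain an empty log."""
    hints = []
    if env.get("SGLANG_PLUGINS") == "__none__":
        hints.append("SGLANG_PLUGINS=__none__ disables all plugins")
    if env.get("SGLANG_FL_OOT_ENABLED") == "0":
        hints.append("SGLANG_FL_OOT_ENABLED=0 disables Layer 2 (fused ops)")
    if env.get("USE_FLAGGEMS") == "0":
        hints.append("USE_FLAGGEMS=0 disables ATen replacement")
    whitelist = env.get("SGLANG_FL_FLAGOS_WHITELIST")
    if whitelist:
        hints.append(f"SGLANG_FL_FLAGOS_WHITELIST limits ops to: {whitelist}")
    return hints
```

Call `env_hints(os.environ)` (after `import os`) in the shell environment you launch the server from. If `plugin_version()` returns `None`, the plugin is not installed in the active Python environment.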