How the plugin works

Load plugin

SGLang discovers and loads the plugin automatically at startup through setuptools entry points.

The plugin registers two entry points in pyproject.toml, one in the sglang.srt.plugins group and one in the sglang.srt.platforms group:

[project.entry-points."sglang.srt.plugins"]
sglang_fl = "sglang_fl:load_plugin"

[project.entry-points."sglang.srt.platforms"]
sglang_fl = "sglang_fl:activate_platform"

Dispatch hook

The core mechanism uses an AROUND hook on MultiPlatformOp.dispatch_forward() combined with a standardized dispatch system:

dispatch_forward() called for an op (e.g. RMSNorm)
  → AROUND hook intercepts
    → Check OOT_WHITELIST/OOT_BLACKLIST
    → Find bridge function via MRO (RMSNorm → rms_norm_bridge)
    → Return bridge function as the forward method
  → SGLang calls the bridge function with framework args:
      rms_norm_bridge(self, x, residual, post_residual_addition)
    → Bridge handles SGLang-specific params (post_residual_addition → merge into residual)
    → Bridge calls dispatch.call_op("rms_norm", obj, x, residual)
      → OpManager resolves best impl via policy (flagos > vendor > reference)
      → Calls the selected backend: rms_norm_flaggems(obj, x, residual)
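
The first half of this flow (hook, allow/deny check, MRO lookup) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the plugin's code: the class bodies, the BRIDGES registry, find_bridge, and around_dispatch_forward are hypothetical, and matching OOT_WHITELIST/OOT_BLACKLIST entries against class names is an assumed convention.

```python
import types
from typing import Callable, Optional


# Hypothetical stand-ins for SGLang's op classes.
class MultiPlatformOp:
    def forward_native(self, x):
        return x

    def dispatch_forward(self) -> Callable:
        return self.forward_native


class RMSNorm(MultiPlatformOp):
    pass


class FusedRMSNorm(RMSNorm):  # subclass with no bridge of its own
    pass


def rms_norm_bridge(self, x):
    return ("bridge", x)


# Hypothetical registry; the list semantics below are assumptions.
BRIDGES = {RMSNorm: rms_norm_bridge}
OOT_WHITELIST: Optional[set] = None  # None means "allow every op"
OOT_BLACKLIST: set = set()


def find_bridge(op) -> Optional[Callable]:
    """Walk the MRO so a subclass inherits the bridge of its nearest base."""
    for cls in type(op).__mro__:
        if cls in BRIDGES:
            return BRIDGES[cls]
    return None


def around_dispatch_forward(original: Callable) -> Callable:
    """Wrap dispatch_forward: return a bound bridge, else defer to the original."""
    def wrapper(self):
        name = type(self).__name__
        allowed = OOT_WHITELIST is None or name in OOT_WHITELIST
        if allowed and name not in OOT_BLACKLIST:
            bridge = find_bridge(self)
            if bridge is not None:
                return types.MethodType(bridge, self)  # bind to this instance
        return original(self)
    return wrapper


MultiPlatformOp.dispatch_forward = around_dispatch_forward(
    MultiPlatformOp.dispatch_forward
)
```

Because the lookup walks the MRO, FusedRMSNorm picks up rms_norm_bridge without registering anything itself, while an op with no bridge in its hierarchy falls through to the original dispatch_forward.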

The bridge layer decouples framework-specific parameters from the standardized op signatures. Vendor backends only need to implement the standard signatures, so the same implementation works for both sglang-plugin-FL and vllm-plugin-FL.
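
A bridge of this kind might look like the following toy version, which uses Python lists instead of tensors so it runs without PyTorch. Several details are assumptions made for illustration: "merge into residual" is taken to mean an element-wise add, call_op is a local stand-in for dispatch.call_op, and the backend's (output, updated residual) return shape mirrors a common fused add-norm convention rather than anything stated above.

```python
import math
from typing import Callable, Dict, List, Optional

Vector = List[float]

# Hypothetical stand-in for the dispatch registry: op name -> implementation.
_IMPLS: Dict[str, Callable] = {}


def call_op(name: str, obj, *args):
    """Toy stand-in for dispatch.call_op: look the op up by name and call it."""
    return _IMPLS[name](obj, *args)


def rms_norm_reference(obj, x: Vector, residual: Optional[Vector] = None):
    """Standard signature (obj, x, residual): add the residual, then normalize."""
    if residual is not None:
        x = [a + b for a, b in zip(x, residual)]
    rms = math.sqrt(sum(v * v for v in x) / len(x) + obj.eps)
    out = [v / rms * w for v, w in zip(x, obj.weight)]
    # Assumed convention: also return the updated residual when one was given.
    return out if residual is None else (out, x)


_IMPLS["rms_norm"] = rms_norm_reference


def rms_norm_bridge(self, x, residual=None, post_residual_addition=None):
    """Framework-facing signature; folds the extra argument away before dispatch."""
    # Assumption: merging means an element-wise add into the residual.
    if residual is not None and post_residual_addition is not None:
        residual = [r + p for r, p in zip(residual, post_residual_addition)]
    return call_op("rms_norm", self, x, residual)


class ToyLayer:
    eps = 0.0
    weight = [1.0, 1.0]


out, new_residual = rms_norm_bridge(ToyLayer(), [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

The backend never sees post_residual_addition, which is the point of the bridge: only the three standard arguments cross the dispatch boundary.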

Dispatch Architecture (shared with vllm-plugin-FL)

┌─────────────────────────────────────────────────────────────┐
│  SGLang AROUND Hook        │  vLLM forward_oot override     │
│  (bridge/rms_norm.py)      │  (vllm_fl/ops/layernorm.py)    │
└────────────┬───────────────┴────────────────┬───────────────┘
             │                                │
             ▼                                ▼
┌─────────────────────────────────────────────────────────────┐
│  dispatch.call_op("rms_norm", obj, x, residual)             │
│  OpManager → SelectionPolicy → OpRegistry → resolve impl    │
└──────────────────────────┬──────────────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   ┌─────────────┐  ┌───────────┐  ┌──────────────┐
   │ DEFAULT     │  │ VENDOR    │  │ REFERENCE    │
   │ (FlagGems)  │  │ (Ascend/  │  │ (PyTorch)    │
   │ priority=150│  │  CUDA)    │  │ priority=50  │
   │             │  │ priority= │  │              │
   │             │  │   100     │  │              │
   └─────────────┘  └───────────┘  └──────────────┘

Chip vendors implement the same backend interface for both frameworks. The only framework-specific code is the bridge layer, which is maintained by the plugin.
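
The priority-based resolution in the diagram can be approximated with a small registry. This is a sketch only: OpImpl, the registry's methods, and the "skip unavailable backends" behaviour are hypothetical and may differ from the real OpManager, SelectionPolicy, and OpRegistry; only the three priorities (150, 100, 50) come from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class OpImpl:
    kind: str               # "flagos", "vendor", or "reference"
    priority: int
    fn: Callable
    available: bool = True  # e.g. False when the backend library is missing


class OpRegistry:
    def __init__(self) -> None:
        self._impls: Dict[str, List[OpImpl]] = {}

    def register(self, op_name: str, impl: OpImpl) -> None:
        self._impls.setdefault(op_name, []).append(impl)

    def resolve(self, op_name: str) -> OpImpl:
        """Pick the available implementation with the highest priority."""
        candidates = [i for i in self._impls.get(op_name, []) if i.available]
        if not candidates:
            raise LookupError(f"no available implementation for {op_name!r}")
        return max(candidates, key=lambda i: i.priority)


registry = OpRegistry()
flagos = OpImpl("flagos", 150, lambda obj, x, residual: "flaggems")
registry.register("rms_norm", OpImpl("reference", 50, lambda obj, x, residual: "pytorch"))
registry.register("rms_norm", OpImpl("vendor", 100, lambda obj, x, residual: "vendor"))
registry.register("rms_norm", flagos)


def call_op(op_name: str, obj, *args):
    """Resolve the best implementation for op_name and invoke it."""
    return registry.resolve(op_name).fn(obj, *args)
```

With all three backends registered, call_op("rms_norm", ...) reaches the FlagGems entry; marking it unavailable makes the same call fall through to the vendor entry, then to the PyTorch reference.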

ATen replacement

Plugin loads → flag_gems.enable(record=True)
  → PyTorch dispatch table registers Triton kernels for ATen ops
  → On first inference call, each replaced op is logged
  → _AtenOnlyFilter ensures only flag_gems.ops.* calls are recorded
    (excludes internal FlagGems calls from Layer 2 flagos implementations)
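
One plausible shape for such a filter is a standard logging.Filter keyed on the logger name. This is an assumption-heavy sketch, not the plugin's _AtenOnlyFilter: it assumes the op log flows through Python's logging module and that logger names follow module paths under flag_gems.ops. The class and handler names are hypothetical.

```python
import logging


class AtenOnlyFilterSketch(logging.Filter):
    """Keep only records from loggers under the flag_gems.ops namespace."""

    PREFIX = "flag_gems.ops"

    def filter(self, record: logging.LogRecord) -> bool:
        name = record.name
        return name == self.PREFIX or name.startswith(self.PREFIX + ".")


class ListHandler(logging.Handler):
    """Collect formatted messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(AtenOnlyFilterSketch())

base = logging.getLogger("flag_gems")
base.setLevel(logging.DEBUG)
base.addHandler(handler)

# A record from an op module passes; one from elsewhere in the package is dropped.
logging.getLogger("flag_gems.ops.rms_norm").debug("rms_norm called")
logging.getLogger("flag_gems.runtime").debug("internal call")
```

Attaching the filter to the handler rather than to a logger matters here: handler filters also apply to records propagated from child loggers, whereas logger filters do not.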