Coverage for src/flag_gems/runtime/backend/_arm/int8/__init__.py: 0%
4 statements
coverage.py v7.6.9, created at 2026-06-05 07:36 +0800
"""ARM CPU INT8 model utilities.

Drop-in nn.Linear replacement that uses a decode-optimized TLE SDOT GEMV and
runs prefill via torch._int_mm (SVE2 i8mm), plus a helper to replace all
nn.Linear layers in a transformers model from a pre-quantized state dict.

Usage:
    import torch
    from safetensors.torch import load_file
    from transformers import AutoModelForCausalLM
    from flag_gems.runtime.backend._arm.int8 import replace_linears_with_tle_int8

    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", dtype=torch.bfloat16)
    state = load_file("Qwen3-1.7B-W8A8-INT8/model.safetensors")
    replace_linears_with_tle_int8(model, state)
"""

from .quantize_live import quantize_and_replace_linears  # noqa: F401
from .replace import replace_linears_with_tle_int8  # noqa: F401
from .tle_int8_linear import TLEInt8Linear, pack_weights_sdot  # noqa: F401

__all__ = [
    "TLEInt8Linear",
    "pack_weights_sdot",
    "replace_linears_with_tle_int8",
    "quantize_and_replace_linears",
]
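
The module docstring refers to INT8 weights and activations (W8A8) without showing what that arithmetic looks like. The following is a minimal pure-Python sketch of symmetric INT8 quantization and an integer dot product, assuming one scale per weight row and one per activation vector. It is not taken from this package: the function names `quantize_row_int8`, `dequantize_row`, and `int8_dot` are hypothetical, and the actual kernels (TLE SDOT GEMV, `torch._int_mm`) may use a different scheme.

```python
# Hypothetical sketch of symmetric INT8 quantization; not this package's code.


def quantize_row_int8(row):
    """Quantize a list of floats to integers in [-127, 127] with one scale.

    Returns (q, scale) such that row[i] is approximately q[i] * scale.
    """
    max_abs = max((abs(v) for v in row), default=0.0)
    # An all-zero row has no meaningful scale; use 1.0 to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]


def int8_dot(qw, w_scale, qx, x_scale):
    """Dot product accumulated in integers, rescaled to float once at the end."""
    acc = sum(a * b for a, b in zip(qw, qx))
    return acc * w_scale * x_scale
```

Accumulating in integers and applying both scales once at the end is what lets the inner loop use integer dot-product instructions. In this sketch the per-element rounding error is at most half a quantization step.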