Use Hints

Use Hints#

flagtree_hints allows users to provide optimization hints to the compiler through trailing comments in the Triton Kernel code.

You can simply add hints by placing a trailing comment with the format # @hint: <hint_name> on the same line as operations like tl.load.

Example 1

# Hints are embedded as trailing comments using the '@hint:' prefix.
mat_a_block = tl.load(mat_a + mat_a_offset, mask=mat_a_mask, other=0.0)  # @hint: dot_pad_only_k
x = tl.load(x_ptr + offsets, mask=mask) 
for s in range(0, 2):  # @hint: bind_sub_block
    # ... code ...

Example 2

import triton
import triton.language as tl

@triton.jit
def kernel(x_ptr, y_ptr, N):
    pid = tl.program_id(0)
    x = tl.load(x_ptr + pid)  #@hint: shared_memory
    y = x + 1
    tl.store(y_ptr + pid, y)

Supported hints#

The following tables list the optimization hints applicable to Triton operations for compilation on different backends.

NVIDIA#

Hint Name	Triton Operation	Description	Branch
shared_memory	tl.load	Converts a global memory load operation to an asynchronous copy to shared memory, then loads from shared memory. The load must be at least 4 bytes and convertible to an async load.	triton_v3.5.x and triton_v3.6.x

Huawei Ascend#

Hint Name	Triton Operation	Description	Branch
dot_pad_only_k	tl.load	Optimizes matrix multiplication performance by padding only the K dimension for dot operations. Equivalent to `tl.compile_hint(tensor, "dot_pad_only_k")` in triton-ascend.	triton_v3.2.x_ascend_hints
bind_sub_block	for loop	Optimizes parallel execution by binding loop iterations to sub-blocks. Equivalent to `tl.parallel(..., bind_sub_block=True)` in triton-ascend.	triton_v3.2.x_ascend_hints
multibuffer	tl.load	Enables multi-buffering optimization to overlap data transfer and computation (fixed to 2 buffer copies). Equivalent to `tl.multibuffer(tensor, 2)` or `tl.compile_hint(tensor, "multi_buffer", 2)` in triton-ascend.	triton_v3.2.x_ascend_hints

AIPU#

Hint Name	Triton Operation	Description	Branch
dma	tl.load	Enables asynchronous DMA transfers for improved performance. Lowers `memref.copy` operations to `memref.dma_start` and `memref.dma_wait` operations. Requires stride-1 memory access patterns.	triton_v3.3.x
shared_memory	tl.load	Loads data into shared memory for faster access. Allocates shared memory (memory space 8) with a size 4x (for AIPUcore parallel) the original tensor shape and copies data from global memory.	triton_v3.3.x

For Hints usage information, see Use Hints.