May 2026
DFlash + PFlash on AMD Strix Halo
PR #119 lands DFlash and PFlash on the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). End-to-end on Qwen3.6-27B Q4_K_M with the Luce DFlash drafter: 26.85 tok/s decode and 20.2 s PFlash prefill at 16K context. That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. The same box can host checkpoints up to ~100 GiB, an entire class of models a 24 GiB consumer GPU cannot touch.
TL;DR
- PR #119 ports lucebox's Phase 2 rocWMMA flashprefill kernels to HIP. The DFlash drafter, the DDTree verifier, the speculative prefill compress, and the sparse target prefill all run on the gfx1151 iGPU directly. Companion PR #159 bumps the compressed-prefill ubatch default from 16 to 512.
- Decode (Qwen3.6-27B Q4_K_M): 26.85 tok/s with our Q8_0 GGUF DFlash drafter and SWA=2048. 2.23x over llama.cpp HIP AR (12.02), 2.16x over llama.cpp Vulkan AR (12.45).
- Prefill (Qwen3.6-27B, 16K): 20.2 s TTFT vs llama.cpp HIP's 61.69 s. NIAH retrieval passes. Speedup grows with context: PFlash compress is O(S), AR prefill is O(S^2).
- End-to-end: at a 16K prompt + 1K generation workload, total wall clock drops from 147 s to 58 s. 2.5x faster.
- Tuning: --ddtree-budget=22 is the gfx1151 optimum. Higher budgets accept more tokens per step (AL keeps climbing), but each step gets more expensive on LPDDR5X. Bandwidth caps the benefit before tile utilization pays off.
- What is next: the BSA scoring kernel needs a rocWMMA-native port (currently CUDA/CUTLASS only). Closing that gap projects another 2-3x on prefill at long context.
The numbers
Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2. Target: Qwen3.6-27B Q4_K_M (15.65 GiB). Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 + DFLASH27B_DRAFT_SWA=2048. Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback.
The 3.6 SWA path is the canonical Qwen3.6 setup. We published the matching Q8_0 GGUF drafter at Lucebox/Qwen3.6-27B-DFlash-GGUF. DFLASH27B_DRAFT_SWA=2048 activates the sliding-window correction for the 3.6 drafter's full-attention layers. Without SWA the same path drops to 24.29 tok/s.
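For intuition, the correction boils down to a sliding-window attention mask with window 2048. A minimal sketch of that masking rule, assuming the standard causal-plus-window formulation (illustrative only, not the drafter's kernel code):

```python
# Illustrative only: the masking rule a sliding window of 2048 implies.
# Token i may attend to token j iff j is not in the future (causal) and
# lies within the last `window` positions.
def swa_allowed(i: int, j: int, window: int = 2048) -> bool:
    return j <= i and (i - j) < window

# A full-attention layer is the same predicate with an unbounded window.
assert swa_allowed(3000, 2999)        # recent token: visible
assert not swa_allowed(3000, 500)     # outside the 2048-token window: masked
assert not swa_allowed(500, 3000)     # future token: always masked
```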
Prefill: PFlash vs raw AR
Long-context TTFT is the second axis. Vanilla llama.cpp on gfx1151 prefills 16K tokens of Qwen3.6-27B Q4_K_M at 265.6 tok/s, which is 61.7 s of staring at a blank screen. PFlash compresses the prompt with a Qwen3-0.6B BF16 drafter, scores per-token importance, keeps a 5% slice, and feeds only that slice to the target. NIAH retrieval still passes at 16K with the WMMA fallback (BSA on HIP is the remaining piece).
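The selection step itself is simple to picture. A minimal sketch, assuming per-token importance scores from the drafter are already in hand (the function and the keep-in-order choice are illustrative; the 5% keep ratio is the figure above, and the measured run kept 1205 tokens at 16K, so the real selector is not a fixed ratio):

```python
# Illustrative sketch of the PFlash selection step: given per-token importance
# scores from the drafter, keep the top slice of positions in original order
# and prefill the target only on that slice.
def compress_prompt(tokens, scores, keep_ratio=0.05):
    assert len(tokens) == len(scores)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])          # restore document order
    return [tokens[i] for i in keep], keep

# e.g. a 16K prompt at a 5% keep ratio yields ~819 positions for the target;
# the run measured in this post kept 1205.
```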
The PFlash compress phase (drafter scoring + selection) is constant for any source length S below the daemon's KV cap; the dominant cost is the target prefill on the compressed prompt. PR #159 bumps the daemon's compressed-prefill ubatch default from 16 to 512, which cuts target_prefill from 12.4 s to 5.2 s at 1205 kept tokens. Zero kernel work, byte-identical commit stream.
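The ubatch effect is just launch-count arithmetic. A back-of-the-envelope sketch (the ceil model is our simplification; the 1205-token and 12.4 s / 5.2 s figures are the measured ones quoted above):

```python
# Back-of-the-envelope: the compressed prompt is prefilled in microbatches of
# size ubatch, so the target runs ceil(kept / ubatch) forward passes.
import math

kept = 1205                                  # kept tokens in the 16K run
for ubatch in (16, 512):
    passes = math.ceil(kept / ubatch)
    print(f"ubatch={ubatch:<3} -> {passes} target forward passes")
# ubatch=16  -> 76 passes  (measured: 12.4 s target prefill)
# ubatch=512 -> 3 passes   (measured:  5.2 s target prefill)
```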
End-to-end wall clock
Decode speedup matters most for long generation. Prefill speedup matters most for big prompts. The full request is both. Numbers below: PR #119 PFlash TTFT (with PR #159 ubatch=512) + PR #119 DFlash decode at 26.85 tok/s, both on Qwen3.6-27B Q4_K_M with the Lucebox Q8_0 drafter.
| Workload (prompt + gen) | llama.cpp HIP | PR #119 + #159 (Qwen3.6) | Speedup |
|---|---|---|---|
| 128 prompt + 128 gen | 11.1 s | 5.2 s | 2.13x |
| 128 prompt + 512 gen | 43.1 s | 19.5 s | 2.21x |
| 16K prompt + 128 gen | 72.3 s | 24.9 s | 2.91x |
| 16K prompt + 1K gen | 146.9 s | 58.4 s | 2.51x |
| 16K prompt + 2K gen | 232.1 s | 96.5 s | 2.40x |
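The table is just the measured TTFTs and decode rates composed additively. A sketch that reconstructs the 16K rows (the additive model is our simplification; small rounding drift against the table is expected):

```python
# Reconstruct the 16K rows of the table from measured numbers:
# wall clock ~= TTFT(prompt) + n_gen / decode_rate.
def wall_clock(ttft_s, decode_tok_s, n_gen):
    return ttft_s + n_gen / decode_tok_s

# 16K-prompt inputs: PFlash TTFT 20.2 s vs llama.cpp HIP 61.7 s;
# decode 26.85 tok/s (DFlash) vs 12.02 tok/s (AR).
for n_gen in (128, 1024, 2048):
    ours = wall_clock(20.2, 26.85, n_gen)
    base = wall_clock(61.7, 12.02, n_gen)
    print(f"16K + {n_gen:>4} gen: {base:6.1f} s -> {ours:5.1f} s ({base / ours:.2f}x)")
# 16K +  128 gen:   72.3 s ->  25.0 s (2.90x)
# 16K + 1024 gen:  146.9 s ->  58.3 s (2.52x)
# 16K + 2048 gen:  232.1 s ->  96.5 s (2.41x)
```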
Why budget=22 on Strix Halo
DDTree builds a speculative tree of N candidate tokens per step and verifies them in one batched target forward. A bigger tree means more acceptance per step, but each step costs more KV memory traffic. On bandwidth-bound silicon the cost wins. We swept budgets from 8 to 128; on gfx1151 the curve peaks at budget=22 and falls off from there.
Compare to gfx1100 (7900 XTX, GDDR6 at 936 GB/s): per PR #156, budget=8 wins by +53% on that silicon because tile waste matters more than launch amortization there. On Strix Halo the opposite holds. The shipped default is arch-aware.
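To make the shape of that tradeoff concrete, here is a toy model with made-up constants (nothing below is a measurement; it only shows why a curve like this peaks and then declines once per-candidate KV traffic dominates):

```python
# Toy model of the DDTree budget tradeoff on bandwidth-bound silicon.
# Acceptance length grows sublinearly with budget; step time grows roughly
# linearly with the KV traffic of verifying more candidates.
# All constants below are invented for illustration.
import math

def accepted_per_step(budget):
    return 1.0 + 1.6 * math.log2(budget)       # diminishing returns

def step_time_s(budget, t_fixed=0.050, t_per_candidate=0.0010):
    return t_fixed + t_per_candidate * budget  # bandwidth-dominated verify

for b in (8, 16, 22, 32, 64, 128):
    print(f"budget={b:<3} tok/s ~ {accepted_per_step(b) / step_time_s(b):.0f}")
# With these invented constants the curve peaks around budget=22 and then
# falls off; the real gfx1151 sweep shows the same shape.
```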
The 128 GiB headroom
For Qwen3.6-27B Q4_K_M (15.65 GiB target + 1.84 GiB drafter + KV cache) that leaves ~100 GiB free. The same box can also host Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B at 78 GiB, or a full BF16 27B at 50 GiB. PR #119's speedups apply to the 27B class today.
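The headroom arithmetic, spelled out (the ~10 GiB allowance for KV cache plus runtime overhead is our assumption; the model sizes are the ones quoted above):

```python
# Rough memory budget on the 128 GiB unified-memory box (all GiB).
total    = 128.0
target   = 15.65   # Qwen3.6-27B Q4_K_M
drafter  = 1.84    # Lucebox Q8_0 DFlash drafter
overhead = 10.0    # assumption: KV cache + runtime, context-dependent
print(f"~{total - target - drafter - overhead:.0f} GiB free")  # ~101 GiB,
# i.e. the ~100 GiB headroom quoted above.

# Fits comfortably in that envelope (sizes quoted in this post):
#   MiniMax-M2.7-REAP 139B-A10B  ~78 GiB
#   full BF16 27B                ~50 GiB
```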
Reproduce
```bash
# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22
```

The DFLASH27B_PREFILL_UBATCH=512 override applies the PR #159 fix on top of the PR #119 base. Once #159 merges, this will be the daemon default.
What is still missing
- BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap. After it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s. At 128K, projected 7-10x over llama.cpp AR.
- Multi-row q4_K decode GEMV. RDNA-native multi-row GEMV pattern (R=4-8 output rows sharing activation register state) for the drafter forward, which is 30% of compress time at long context; see the sketch after this list.
- Phase 2 tile shape tuning for gfx1151. The current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
- 70B+ MoE targets. The 128 GiB headroom is wasted on a 16 GiB 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; the big work is wiring the expert-routed forward into the spec verify loop.
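On the multi-row GEMV item: the point of the pattern is activation reuse. A minimal sketch of the idea in plain Python (the real kernel is HIP and operates on q4_K blocks in registers; none of that is modeled here):

```python
# Illustrative only: why computing R output rows per pass helps a
# bandwidth-bound GEMV. One sweep over the activation vector x serves all
# R rows, so x is loaded once per group instead of once per row.
def gemv_multirow(W, x, R=4):
    y = [0.0] * len(W)
    for r0 in range(0, len(W), R):
        rows = W[r0:r0 + R]
        for j, xj in enumerate(x):           # single sweep over x per group
            for i, row in enumerate(rows):
                y[r0 + i] += row[j] * xj     # R rows reuse the same xj
    return y

print(gemv_multirow([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]],
                    [1.0, 0.5, -1.0]))       # [-1.0, 0.5, 2.0, 0.0]
```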
Bottom line
PR #119 plus PR #159 make lucebox fast on Strix Halo for the canonical Qwen3.6-27B path. 26.85 tok/s decode and 20.2 s prefill at 16K, both end-to-end measured, 2.23x and 3.05x over llama.cpp HIP on the same iGPU. The architecture lift (CUDA to HIP, rocWMMA flashprefill, DDTree verifier) was a big piece; the remaining gains are kernel work.
Fast local inference on consumer AMD is no longer a myth. A Ryzen AI MAX+ 395 box has 128 GiB of unified memory, runs Qwen3.6-27B, hosts the DFlash spec decode and the PFlash long-context prefill, and the wall clock for a realistic 16K + 1K workload comes in at 58 s vs llama.cpp's 147 s. The same hardware is sized to host the 122B and 139B MoE class next.
Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, Ubuntu 24.04 HWE kernel 6.17, ROCm 7.2.2. Stack: lucebox PR #119 (rocWMMA Phase 2 flashprefill on HIP), llama.cpp mainline for AR baselines. All benches run on a single physical box. References: PR #119, PR #156 cross-arch perf plan.