May 2026
DFlash + PFlash on AMD Strix Halo
PR #119 lands DFlash and PFlash on the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). End-to-end on Qwen3.6-27B Q4_K_M with the Luce DFlash drafter: 26.85 tok/s decode and 20.2 s PFlash prefill at 16K context. That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. The same box can host checkpoints up to ~100 GiB, an entire class of models a 24 GiB consumer GPU cannot touch.
TL;DR
- PR #119 ports lucebox's Phase 2 rocWMMA flashprefill kernels to HIP. The DFlash drafter, the DDTree verifier, the speculative prefill compress, and the sparse target prefill all run on the gfx1151 iGPU directly. Companion PR #159 bumps the compressed-prefill ubatch default from 16 to 512.
- Decode (Qwen3.6-27B Q4_K_M): 26.85 tok/s with our Q8_0 GGUF DFlash drafter and SWA=2048. 2.23x over llama.cpp HIP AR (12.02), 2.16x over llama.cpp Vulkan AR (12.45).
- Prefill (Qwen3.6-27B, 16K): 20.2 s TTFT vs llama.cpp HIP's 61.69 s. NIAH retrieval passes. Speedup grows with context: PFlash compress is O(S), AR prefill is O(S^2).
- End-to-end: at a 16K prompt + 1K generation workload, total wall clock drops from 147 s to 58 s. 2.5x faster.
- Tuning: --ddtree-budget=22 is the gfx1151 optimum. Higher budgets accept more tokens per step (AL keeps climbing), but each step gets more expensive on LPDDR5X. Bandwidth caps the benefit before tile utilization pays off.
- What is next: the BSA scoring kernel needs a rocWMMA-native port (currently CUDA/CUTLASS only). Closing that gap projects another 2-3x on prefill at long context.
The numbers
Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2. Target: Qwen3.6-27B Q4_K_M (15.65 GiB). Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 + DFLASH27B_DRAFT_SWA=2048. Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback.
The 3.6 SWA path is the canonical Qwen3.6 setup. We published the matching Q8_0 GGUF drafter at Lucebox/Qwen3.6-27B-DFlash-GGUF. DFLASH27B_DRAFT_SWA=2048 activates the sliding-window correction for the 3.6 drafter's full-attention layers. Without SWA the same path drops to 24.29 tok/s.
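For intuition, the correction boils down to a sliding-window attention mask with window 2048. A minimal sketch of that masking rule, assuming the standard causal-plus-window formulation (illustrative only, not the drafter's kernel code):

```python
# Illustrative only: the masking rule a sliding window of 2048 implies.
# Token i may attend to token j iff j is not in the future (causal) and
# lies within the last `window` positions.
def swa_allowed(i: int, j: int, window: int = 2048) -> bool:
    return j <= i and (i - j) < window

# A full-attention layer is the same predicate with an unbounded window.
assert swa_allowed(3000, 2999)        # recent token: visible
assert not swa_allowed(3000, 500)     # outside the 2048-token window: masked
assert not swa_allowed(500, 3000)     # future token: always masked
```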
Prefill: PFlash vs raw AR
Long-context TTFT is the second axis. Vanilla llama.cpp on gfx1151 prefills 16K tokens of Qwen3.6-27B Q4_K_M at 265.6 tok/s, which is 61.7 s of staring at a blank screen. PFlash compresses the prompt with a Qwen3-0.6B BF16 drafter, scores per-token importance, keeps a 5% slice, and feeds only that slice to the target. NIAH retrieval still passes at 16K with the WMMA fallback (BSA on HIP is the remaining piece).
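The selection step itself is simple to picture. A minimal sketch, assuming per-token importance scores from the drafter are already in hand (the function and the keep-in-order choice are illustrative; the 5% keep ratio is the figure above, and the measured run kept 1205 tokens at 16K, so the real selector is not a fixed ratio):

```python
# Illustrative sketch of the PFlash selection step: given per-token importance
# scores from the drafter, keep the top slice of positions in original order
# and prefill the target only on that slice.
def compress_prompt(tokens, scores, keep_ratio=0.05):
    assert len(tokens) == len(scores)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])          # restore document order
    return [tokens[i] for i in keep], keep

# e.g. a 16K prompt at a 5% keep ratio yields ~819 positions for the target;
# the run measured in this post kept 1205.
```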
The PFlash compress phase (drafter scoring + selection) is constant for any source length S below the daemon's KV cap; the dominant cost is the target prefill on the compressed prompt. PR #159 bumps the daemon's compressed-prefill ubatch default from 16 to 512, which cuts target_prefill from 12.4 s to 5.2 s at 1205 kept tokens. Zero kernel work, byte-identical commit stream.
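The ubatch effect is just launch-count arithmetic. A back-of-the-envelope sketch (the ceil model is our simplification; the 1205-token and 12.4 s / 5.2 s figures are the measured ones quoted above):

```python
# Back-of-the-envelope: the compressed prompt is prefilled in microbatches of
# size ubatch, so the target runs ceil(kept / ubatch) forward passes.
import math

kept = 1205                                  # kept tokens in the 16K run
for ubatch in (16, 512):
    passes = math.ceil(kept / ubatch)
    print(f"ubatch={ubatch:<3} -> {passes} target forward passes")
# ubatch=16  -> 76 passes  (measured: 12.4 s target prefill)
# ubatch=512 -> 3 passes   (measured:  5.2 s target prefill)
```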
End-to-end wall clock
Decode speedup matters most for long generation. Prefill speedup matters most for big prompts. The full request is both. Numbers below: PR #119 PFlash TTFT (with PR #159 ubatch=512) + PR #119 DFlash decode at 26.85 tok/s, both on Qwen3.6-27B Q4_K_M with the Lucebox Q8_0 drafter.
| Workload (prompt + gen) | llama.cpp HIP | PR #119 + #159 (Qwen3.6) | Speedup |
|---|---|---|---|
| 128 prompt + 128 gen | 11.1 s | 5.2 s | 2.13x |
| 128 prompt + 512 gen | 43.1 s | 19.5 s | 2.21x |
| 16K prompt + 128 gen | 72.3 s | 24.9 s | 2.91x |
| 16K prompt + 1K gen | 146.9 s | 58.4 s | 2.51x |
| 16K prompt + 2K gen | 232.1 s | 96.5 s | 2.40x |
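The table is just the measured TTFTs and decode rates composed additively. A sketch that reconstructs the 16K rows (the additive model is our simplification; small rounding drift against the table is expected):

```python
# Reconstruct the 16K rows of the table from measured numbers:
# wall clock ~= TTFT(prompt) + n_gen / decode_rate.
def wall_clock(ttft_s, decode_tok_s, n_gen):
    return ttft_s + n_gen / decode_tok_s

# 16K-prompt inputs: PFlash TTFT 20.2 s vs llama.cpp HIP 61.7 s;
# decode 26.85 tok/s (DFlash) vs 12.02 tok/s (AR).
for n_gen in (128, 1024, 2048):
    ours = wall_clock(20.2, 26.85, n_gen)
    base = wall_clock(61.7, 12.02, n_gen)
    print(f"16K + {n_gen:>4} gen: {base:6.1f} s -> {ours:5.1f} s ({base / ours:.2f}x)")
# 16K +  128 gen:   72.3 s ->  25.0 s (2.90x)
# 16K + 1024 gen:  146.9 s ->  58.3 s (2.52x)
# 16K + 2048 gen:  232.1 s ->  96.5 s (2.41x)
```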
Why budget=22 on Strix Halo
DDTree builds a speculative tree of N candidate tokens per step and verifies them in one batched target forward. A bigger tree means more acceptance per step, but each step costs more KV memory traffic. On bandwidth-bound silicon the cost wins. We swept budgets from 8 to 128; on gfx1151 the curve peaks at budget=22 and falls off from there.
Compare to gfx1100 (7900 XTX, GDDR6 at 936 GB/s): per PR #156, budget=8 wins by +53% on that silicon because tile waste matters more than launch amortization there. On Strix Halo the opposite holds. The shipped default is arch-aware.
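To make the shape of that tradeoff concrete, here is a toy model with made-up constants (nothing below is a measurement; it only shows why a curve like this peaks and then declines once per-candidate KV traffic dominates):

```python
# Toy model of the DDTree budget tradeoff on bandwidth-bound silicon.
# Acceptance length grows sublinearly with budget; step time grows roughly
# linearly with the KV traffic of verifying more candidates.
# All constants below are invented for illustration.
import math

def accepted_per_step(budget):
    return 1.0 + 1.6 * math.log2(budget)       # diminishing returns

def step_time_s(budget, t_fixed=0.050, t_per_candidate=0.0010):
    return t_fixed + t_per_candidate * budget  # bandwidth-dominated verify

for b in (8, 16, 22, 32, 64, 128):
    print(f"budget={b:<3} tok/s ~ {accepted_per_step(b) / step_time_s(b):.0f}")
# With these invented constants the curve peaks around budget=22 and then
# falls off; the real gfx1151 sweep shows the same shape.
```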
The 128 GiB headroom
For Qwen3.6-27B Q4_K_M (15.65 GiB target + 1.84 GiB drafter + KV cache) that leaves ~100 GiB free. The same box can also host Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B at 78 GiB, or a full BF16 27B at 50 GiB. PR #119's speedups apply to the 27B class today.
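The headroom arithmetic, spelled out (the ~10 GiB allowance for KV cache plus runtime overhead is our assumption; the model sizes are the ones quoted above):

```python
# Rough memory budget on the 128 GiB unified-memory box (all GiB).
total    = 128.0
target   = 15.65   # Qwen3.6-27B Q4_K_M
drafter  = 1.84    # Lucebox Q8_0 DFlash drafter
overhead = 10.0    # assumption: KV cache + runtime, context-dependent
print(f"~{total - target - drafter - overhead:.0f} GiB free")  # ~101 GiB,
# i.e. the ~100 GiB headroom quoted above.

# Fits comfortably in that envelope (sizes quoted in this post):
#   MiniMax-M2.7-REAP 139B-A10B  ~78 GiB
#   full BF16 27B                ~50 GiB
```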
Reproduce
```bash
# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22
```

The DFLASH27B_PREFILL_UBATCH=512 override applies the PR #159 fix on top of the PR #119 base. Once #159 merges, this will be the daemon default.
What is still missing
- BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap. After it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s. At 128K, projected 7-10x over llama.cpp AR.
- Multi-row q4_K decode GEMV. RDNA-native multi-row GEMV pattern (R=4-8 output rows sharing activation register state) for the drafter forward, which is 30% of compress time at long context; see the sketch after this list.
- Phase 2 tile shape tuning for gfx1151. The current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
- 70B+ MoE targets. The 128 GiB headroom is wasted on a 16 GiB 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; the big work is wiring the expert-routed forward into the spec verify loop.
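On the multi-row GEMV item: the point of the pattern is activation reuse. A minimal sketch of the idea in plain Python (the real kernel is HIP and operates on q4_K blocks in registers; none of that is modeled here):

```python
# Illustrative only: why computing R output rows per pass helps a
# bandwidth-bound GEMV. One sweep over the activation vector x serves all
# R rows, so x is loaded once per group instead of once per row.
def gemv_multirow(W, x, R=4):
    y = [0.0] * len(W)
    for r0 in range(0, len(W), R):
        rows = W[r0:r0 + R]
        for j, xj in enumerate(x):           # single sweep over x per group
            for i, row in enumerate(rows):
                y[r0 + i] += row[j] * xj     # R rows reuse the same xj
    return y

print(gemv_multirow([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]],
                    [1.0, 0.5, -1.0]))       # [-1.0, 0.5, 2.0, 0.0]
```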
Bottom line
PR #119 plus PR #159 make lucebox fast on Strix Halo for the canonical Qwen3.6-27B path. 26.85 tok/s decode and 20.2 s prefill at 16K, both end-to-end measured, 2.23x and 3.05x over llama.cpp HIP on the same iGPU. The architecture lift (CUDA to HIP, rocWMMA flashprefill, DDTree verifier) was a big piece; the remaining gains are kernel work.
Fast local inference on consumer AMD is no longer a myth. A Ryzen AI MAX+ 395 box has 128 GiB of unified memory, runs Qwen3.6-27B, hosts the DFlash spec decode and the PFlash long-context prefill, and the wall clock for a realistic 16K + 1K workload comes in at 58 s vs llama.cpp's 147 s. The same hardware is sized to host the 122B and 139B MoE class next.
Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, Ubuntu 24.04 HWE kernel 6.17, ROCm 7.2.2. Stack: lucebox PR #119 (rocWMMA Phase 2 flashprefill on HIP), llama.cpp mainline for AR baselines. All benches run on a single physical box. References: PR #119, PR #156 cross-arch perf plan.