May 2026

By Davide Ciffa

DFlash + PFlash on AMD Strix Halo

PR #119 lands DFlash and PFlash on the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). End-to-end on Qwen3.6-27B Q4_K_M with the Luce DFlash drafter: 26.85 tok/s decode and 20.2 s PFlash prefill at 16K context. That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. The same box can host checkpoints up to ~100 GiB, an entire class of models a 24 GiB consumer GPU cannot touch.

[Video: llama.cpp HIP vs lucebox PFlash + DFlash on Qwen3.6-27B, 16K prompt + 1K generation, 10x real time, on the Ryzen AI MAX+ 395 Strix Halo box running DFlash spec decode]

TL;DR

The numbers

Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2. Target: Qwen3.6-27B Q4_K_M (15.65 GiB). Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 + DFLASH27B_DRAFT_SWA=2048. Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback.

Decode tok/s, Qwen3.6-27B Q4_K_M, gfx1151:

  llama.cpp HIP AR        12.02
  llama.cpp Vulkan AR     12.45
  PR #119 DFlash + SWA    26.85
Same hardware, 10-prompt HumanEval-style bench, n_gen=128, ROCm 7.2.2. Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 drafter, draft SWA window 2048. Mean AL 5.58, accept rate 34.9%.

The 3.6 SWA path is the canonical Qwen3.6 setup. We published the matching Q8_0 GGUF drafter at Lucebox/Qwen3.6-27B-DFlash-GGUF. DFLASH27B_DRAFT_SWA=2048 activates the sliding-window correction for the 3.6 drafter's full-attention layers. Without SWA the same path drops to 24.29 tok/s.
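For intuition, here is a minimal NumPy sketch of what a sliding-window mask means, assuming a plain boolean attention mask. This is illustrative, not PR #119's kernel; only the window size 2048 comes from the post.

import numpy as np

def causal_swa_mask(seq_len: int, window: int = 2048) -> np.ndarray:
    """Token i may attend to token j iff j <= i (causal) and
    i - j < window (sliding window). Full causal attention is the
    window = seq_len special case."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# With window=2048, a token at position 8000 reads at most 2048 keys
# instead of the whole prefix, so the drafter's per-token KV reads
# stop growing past the window at long contexts.
mask = causal_swa_mask(8192, window=2048)
print(mask.sum(axis=1)[[100, 4000, 8000]])  # [ 101 2048 2048]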

Prefill: PFlash vs raw AR

Long-context TTFT is the second axis. Vanilla llama.cpp on gfx1151 prefills 16K tokens of Qwen3.6-27B Q4_K_M at 265.6 tok/s, which is 61.7 s of staring at a blank screen. PFlash compresses the prompt with a Qwen3-0.6B BF16 drafter, scores per-token importance, keeps a 5% slice, and feeds only that slice to the target. NIAH retrieval still passes at 16K with the WMMA fallback (BSA on HIP is the remaining piece).
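A minimal sketch of the select-and-keep idea, assuming per-token importance scores already produced by the drafter. The function, the keep_tail heuristic, and the counts it prints are illustrative shape, not PR #119's selection logic (the post's measured run keeps 1205 tokens at 16K, so the real policy keeps more than a bare top-k).

import numpy as np

def compress_prompt(token_ids: np.ndarray, scores: np.ndarray,
                    keep_frac: float = 0.05, keep_tail: int = 64) -> np.ndarray:
    """Keep the highest-scoring ~keep_frac of tokens plus the most recent
    keep_tail tokens, in original prompt order, so the target prefills a
    short but coherent slice."""
    n = len(token_ids)
    budget = max(int(n * keep_frac), 1)
    keep = set(range(n - keep_tail, n))        # always keep the recent tail
    for idx in np.argsort(scores)[::-1]:       # highest importance first
        if len(keep) >= budget + keep_tail:
            break
        keep.add(int(idx))
    return token_ids[sorted(keep)]

rng = np.random.default_rng(0)
tokens = rng.integers(0, 50_000, size=16_384)
scores = rng.random(16_384)
print(len(compress_prompt(tokens, scores)))    # ~883: target prefills ~5% of 16K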

TTFT at 16K context, Qwen3.6-27B Q4_K_M, gfx1151 (lower is better):

  llama.cpp HIP AR          61.7 s
  PR #119 PFlash (UB=16)    27.6 s
  PR #119 + #159 (UB=512)   20.2 s
3.05x at 16K with the PR #159 ubatch bump. The gap grows with context: AR prefill is O(S^2), PFlash compress is O(S). Expect ~5-7x at 32K, ~7-10x at 128K once the BSA-HIP port lands.
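A back-of-envelope extrapolation of those ranges. The split of the 20.2 s TTFT into ~15.0 s compress + 5.2 s target prefill follows from the #159 numbers in the next paragraph; the scaling laws (compress linear in S, both prefills quadratic) are assumptions, not measurements.

# Crude scaling model seeded with the measured 16K numbers.
AR_16K, COMPRESS_16K, TARGET_16K = 61.7, 15.0, 5.2   # seconds

for mult, label in [(1, "16K"), (2, "32K"), (8, "128K")]:
    ar = AR_16K * mult ** 2                                # AR prefill ~ O(S^2)
    pflash = COMPRESS_16K * mult + TARGET_16K * mult ** 2  # O(S) + O((0.05*S)^2)
    print(f"{label:>4}: AR ~{ar:7.1f} s  PFlash ~{pflash:6.1f} s  ~{ar / pflash:.1f}x")
# 16K: 3.1x (measured 3.05x), 32K: ~4.9x, 128K: ~8.7x -- the same ballpark
# as the quoted ~5-7x and ~7-10x, before any BSA-HIP kernel gains.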

The PFlash compress phase (drafter scoring + selection) is constant at any source S below the daemon's KV cap; the dominant cost is the target prefill on the compressed prompt. PR #159 bumps the daemon's compressed-prefill ubatch default from 16 to 512, which lifts target_prefill from 12.4 s to 5.2 s at 1205 kept tokens. Zero kernel work, byte-identical commit stream.
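The mechanics are plain batching arithmetic; a sketch with the post's 1205 kept tokens. The pass counts are ceil division; the amortization reading in the comment is the (standard) assumption.

import math

kept = 1205                               # compressed prompt length at 16K
for ubatch in (16, 512):
    passes = math.ceil(kept / ubatch)
    print(f"ubatch={ubatch:>3}: {passes} target forward passes")
# ubatch= 16: 76 passes; ubatch=512: 3 passes. Fewer, fatter passes
# amortize launch and weight-read cost, hence 12.4 s -> 5.2 s.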

End-to-end wall clock

Decode speedup matters most for long generation. Prefill speedup matters most for big prompts. The full request is both. Numbers below: PR #119 PFlash TTFT (with PR #159 ubatch=512) + PR #119 DFlash decode at 26.85 tok/s, both on Qwen3.6-27B Q4_K_M with the Lucebox Q8_0 drafter.

Workload (prompt + gen)    llama.cpp HIP    PR #119 + #159 (Qwen3.6)    Speedup
128 prompt + 128 gen       11.1 s           5.2 s                       2.13x
128 prompt + 512 gen       43.1 s           19.5 s                      2.21x
16K prompt + 128 gen       72.3 s           24.9 s                      2.91x
16K prompt + 1K gen        146.9 s          58.4 s                      2.51x
16K prompt + 2K gen        232.1 s          96.5 s                      2.40x
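Each row composes from the measurements above: TTFT plus generated tokens over decode rate. A quick model that approximately reproduces the table, assuming short prompts prefill at the same 265.6 tok/s AR rate on both paths (a simplification; expect ~0.1 s drift against the measured rows).

def wall(prompt, gen, ttft_16k, prefill_tps, decode_tps):
    # Short prompts prefill directly; the 16K rows use the measured TTFT.
    ttft = ttft_16k if prompt >= 16_384 else prompt / prefill_tps
    return ttft + gen / decode_tps

for p, g in [(128, 128), (128, 512), (16_384, 128), (16_384, 1024), (16_384, 2048)]:
    base = wall(p, g, 61.7, 265.6, 12.02)   # llama.cpp HIP AR
    ours = wall(p, g, 20.2, 265.6, 26.85)   # PR #119 + #159 (PFlash TTFT at 16K)
    print(f"{p}+{g}: {base:6.1f} s vs {ours:5.1f} s = {base/ours:.2f}x")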

Why budget=22 on Strix Halo

DDTree builds a speculative tree of N candidate tokens per step and verifies them in one batched target forward. Bigger tree means more acceptance per step, but each step costs more KV memory traffic. On bandwidth-bound silicon the cost wins. We swept budgets from 8 to 128:

--ddtree-budget sweep, decode tok/s, gfx1151 (3.5/3.5 Q8_0 stand-in; trend holds for 3.6):

  budget=8    (AL 5.32)   36.45
  budget=22   (AL 7.17)   36.76
  budget=32   (AL 7.78)   33.29
  budget=48   (AL 7.82)   27.69
  budget=64   (AL 8.09)   23.88
  budget=96   (AL 8.22)   18.16
  budget=128  (AL 8.42)   13.45
Acceptance length keeps rising with budget (AL goes 5.32 to 8.42), but tok/s peaks at budget=22 and falls off a cliff above 32. LPDDR5X-8000 cannot pay for the larger verify tree.
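The bandwidth wall can be read straight out of the sweep: decode tok/s equals AL over per-step verify time, so the implied step time is AL divided by tok/s (all numbers from the sweep above).

sweep = [(8, 5.32, 36.45), (22, 7.17, 36.76), (32, 7.78, 33.29),
         (48, 7.82, 27.69), (64, 8.09, 23.88), (96, 8.22, 18.16),
         (128, 8.42, 13.45)]
for budget, al, tok_s in sweep:              # step_time = AL / tok/s
    print(f"budget={budget:>3}: {al / tok_s * 1000:5.1f} ms per verify step")
# 146 ms at budget=8 vs 626 ms at budget=128: the verify pass gets 4.3x
# slower while AL grows only 1.6x, so throughput collapses.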

Compare to gfx1100 (7900 XTX, GDDR6 at 936 GB/s): per PR #156, budget=8 wins by +53% on that silicon, because tile waste matters more than launch amortization there. On Strix Halo the opposite holds. The shipped default is arch-aware.

The 128 GiB headroom

For Qwen3.6-27B Q4_K_M (15.65 GiB target + 1.84 GiB drafter + KV cache), that leaves ~100 GiB free. The same box can also host Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B at 78 GiB, or a full BF16 27B at 50 GiB. PR #119's speedups apply to the 27B class today.
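The arithmetic behind the ~100 GiB figure, with the KV-plus-OS share as a loudly labeled placeholder (the post does not break it out):

TOTAL_GIB = 128.0     # unified memory on the MAX+ 395
target = 15.65        # Qwen3.6-27B Q4_K_M (from the post)
drafter = 1.84        # Lucebox Q8_0 DFlash drafter (from the post)
kv_plus_os = 10.5     # PLACEHOLDER: KV cache + OS/driver, workload-dependent
print(f"~{TOTAL_GIB - target - drafter - kv_plus_os:.0f} GiB free")  # ~100 GiB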

Reproduce

# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
  python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22

The DFLASH27B_PREFILL_UBATCH=512 override applies the PR #159 fix on top of the PR #119 base. Once #159 merges, this will be the daemon default.

What is still missing

Chiefly the BSA-HIP port: PFlash at 16K still runs on the rocWMMA fallback, and the ~7-10x prefill projection at 128K assumes BSA lands on HIP. The remaining gains are kernel work, not architecture.

Bottom line

PR #119 plus PR #159 make lucebox fast on Strix Halo for the canonical Qwen3.6-27B path. 26.85 tok/s decode and 20.2 s prefill at 16K, both end-to-end measured, 2.23x and 3.05x over llama.cpp HIP on the same iGPU. The architecture lift (CUDA to HIP, rocWMMA flashprefill, DDTree verifier) was a big piece; the remaining gains are kernel work.

The local-inference story on consumer AMD is no longer a myth. A Ryzen AI MAX+ 395 box has 128 GiB of unified memory, runs Qwen3.6-27B, hosts the DFlash spec decode and the PFlash long-context prefill, and the wall clock at a realistic 16K + 1K workload comes in at 58 s vs llama.cpp's 147 s. The same hardware is sized to host the 122B and 139B MoE class next.


Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, Ubuntu 24.04 HWE kernel 6.17, ROCm 7.2.2. Stack: lucebox PR #119 (rocWMMA Phase 2 flashprefill on HIP), llama.cpp mainline for AR baselines. All benches run on a single physical box. References: PR #119, PR #156 cross-arch perf plan.
