May 2026

By Davide Ciffa

Laguna XS.2 on a 3090: ~107 tok/s, 5.4x prefill, first MoE target for PFlash

Poolside released Laguna XS.2, a 33B-A3B MoE model. We ported it into Luce Dflash + PFlash, making Laguna the first MoE target supported by PFlash. Result: ~107 tok/s decode at short context and 15.91 s TTFT at 128K on a single RTX 3090, 5.4x faster prefill than llama.cpp. A fast local model on consumer hardware, with a 128K context window that no longer costs minutes to fill.

Laguna XS.2 running on a single RTX 3090 inside the dflash daemon

TL;DR

Where Laguna fits in the local-coder lineup

For most local-coding workloads on a 24 GB GPU, the choice is between Qwen3.6 35B-A3B (the current open MoE benchmark leader on hard agentic tasks) and a smaller / faster runner-up. Laguna XS.2 is a competitive pick: same MoE class (33B-A3B vs 35B-A3B, both 3B active), Apache 2.0, fully open training stack from Poolside. For the light end of the workload, Laguna is a viable alternative.

Where we would reach for Laguna over Qwen3.6 35B-A3B today:

The angle that matters for the easier workloads is throughput-per-cost. With PFlash, a 128K prompt reaches first token in about 16 s (roughly 11 s of drafter compression plus 5 s of target prefill at keep=0.10), then the target decodes at ~107 tok/s at short context on a single 24 GB GPU. Long-context prompts that used to be a coffee break are now a wait of under 20 seconds.

Other places it earns a slot in the local lineup:

What it took to run Laguna in Lucebox-hub

Laguna XS.2 is not architecturally vanilla. The loader and forward graph had to handle several non-standard features:

The result is a roughly 2.9K-node ggml graph, no libllama dependency, hand-rolled CUDA only. The loader places 678 tensors at 18.77 GiB on the GPU plus 110 MiB of token embeddings on the CPU. That fits comfortably on a 24 GB card alongside a Q4_0 KV cache for 128K context.
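To see why the KV cache type matters for that fit, here is a back-of-the-envelope sketch in Python. The block layouts are ggml's standard ones (Q4_0 packs 32 values into 18 bytes, Q8_0 into 34 bytes). The layer count, KV-head count, and head dimension are illustrative placeholders, not Laguna's real configuration, and the formula assumes every layer caches the full context, which may not hold for Laguna's non-standard layers.

```python
# (elements per block, bytes per block) for a few ggml tensor types.
GGML_BLOCKS = {
    "f16": (1, 2),
    "q8_0": (32, 34),  # 32 int8 values + one fp16 scale
    "q4_0": (32, 18),  # 32 4-bit values + one fp16 scale
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Size of the K plus V caches when every layer stores the full context."""
    block_elems, block_bytes = GGML_BLOCKS[cache_type]
    row_elems = n_kv_heads * head_dim
    if row_elems % block_elems:
        raise ValueError("row size must be a multiple of the block size")
    per_token = row_elems // block_elems * block_bytes
    return 2 * n_layers * n_ctx * per_token  # 2 = one K and one V tensor per layer


# Hypothetical dimensions, chosen only to make the arithmetic concrete.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=131072)
for cache_type in GGML_BLOCKS:
    gib = kv_cache_bytes(cache_type=cache_type, **dims) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

With these placeholder dimensions an f16 cache would need 16 GiB at 128K while q4_0 needs 4.5 GiB, which is the kind of gap that decides whether the cache fits beside 18.77 GiB of weights.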

Numbers on a single RTX 3090

Time to first token, Q4_K_M weights

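| Context | KV | dense dflash | PFlash dflash | llama.cpp pp | PFlash vs llama.cpp |
|---|---|---|---|---|---|
| 4 096 | Q8_0 | 0.82 s | 0.56 s | 1.73 s | 3.1x |
| 16 384 | Q4_0 | 3.73 s | 2.54 s | 8.81 s | 3.5x |
| 65 536 | Q4_0 | 23.50 s | 6.35 s | 32.85 s | 5.2x |
| 131 072 | Q4_0 | OOM | 15.91 s | 86.60 s | 5.4x |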

Dense dflash OOMs at 128K. PFlash compresses the prompt with a Qwen3-0.6B drafter using block-sparse attention scoring, then hands the compressed token stream to the Laguna target. Cross-tokenizer round-trip uses byte-level BPE plus a word-boundary recovery pass that pulls dropped sub-token fragments back into the kept set.
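The keep ratio used in the NIAH runs can be read as the fraction of the prompt that survives compression. A minimal Python sketch of that selection step, assuming the drafter has already produced one importance score per fixed-size chunk. The function names, chunk size, and scores are hypothetical, and the real scoring happens inside the drafter's block-sparse attention rather than in Python.

```python
import math


def select_chunks(scores, keep):
    """Return indices of the highest-scoring chunks, in original prompt order.

    scores: one importance score per chunk (assumed to come from the drafter).
    keep:   fraction of chunks to retain, e.g. 0.10.
    """
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * keep))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def compress(tokens, scores, keep, chunk_size):
    """Concatenate the tokens of the kept chunks, preserving prompt order."""
    out = []
    for c in select_chunks(scores, keep):
        out.extend(tokens[c * chunk_size:(c + 1) * chunk_size])
    return out
```

Sorting the kept indices back into prompt order matters: the target sees a shorter prompt, not a reshuffled one.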

Decode throughput

Measured on RTX 3090 with bench_laguna_generate (Q4_K_M target, default KV, n_gen=128, greedy):

| Prompt ctx | Decode tok/s |
|---|---|
| 128 | 107.4 |
| 1 024 | 97.1 |
| 4 096 | 59.0 |

Decode is autoregressive (single token per forward) until a Laguna spec-decode draft model is published. The dflash daemon's draft-loaded path is reserved for that drop-in; on Qwen3.5/3.6-27B the same machinery delivers a 3.4–3.75x speedup, so a future Laguna draft should land Laguna in the same range.

NIAH retrieval, depth 0.5, BLUEHORIZON-7421 needle

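| Context | KV | keep | drafter | target prefill | end-to-end TTFT | NIAH |
|---|---|---|---|---|---|---|
| 16 384 | Q8_0 | 0.10 | ~1.5 s | ~3 s | ~4.5 s | PASS |
| 65 536 | Q4_0 | 0.10 | ~5 s | ~6 s | ~11 s | PASS |
| 65 536 | Q4_0 | 0.30 | ~5 s | ~10 s | ~15 s | PASS |
| 131 072 | Q4_0 | 0.10 | 11.11 s | 4.79 s | 15.91 s | PASS |
| 131 072 | Q4_0 | 0.20 | 11.20 s | 13.55 s | 24.75 s | PASS |
| 131 072 | Q4_0 | 0.30 | 11.41 s | 26.43 s | 37.84 s | PASS |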
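For readers unfamiliar with the setup, a needle-in-a-haystack prompt at depth 0.5 places one fact in the middle of filler text and asks for it back. A hedged Python sketch follows. The filler sentence, needle wording, question, and function names are made up for illustration and are not the harness behind the numbers above.

```python
def build_niah_prompt(n_filler, depth, needle,
                      filler="The sky was clear and the road was long."):
    """Place `needle` after round(depth * n_filler) filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    cut = round(depth * n_filler)
    sentences = [filler] * cut + [needle] + [filler] * (n_filler - cut)
    question = "What is the secret code mentioned in the text above?"
    return " ".join(sentences) + "\n\n" + question


def niah_pass(answer, secret):
    """PASS when the model's answer contains the exact secret string."""
    return secret in answer


prompt = build_niah_prompt(10, 0.5, "The secret code is BLUEHORIZON-7421.")
```

Scoring on the exact string is deliberately strict: an answer that only reproduces a fragment such as BLUEH counts as a failure.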

Every (context, keep) point passes, including the 64K+ cases at keep=0.10 that previously failed. The earlier failure was not a drafter bug. The cross-tokenizer step was truncating multi-token needles at PFlash chunk boundaries: BLUEH survived but ORIZON-7421 got dropped when its tokens fell in low-importance chunks. The fix is a word-boundary expansion pass that pulls partial-word fragments back into the kept set before decoding.
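A minimal sketch of such a word-boundary expansion pass, written in Python for illustration. It assumes tokens are available as decoded text pieces and that a word is a maximal run of pieces with no whitespace between them. The function name and the example split of the needle are hypothetical, and the actual pass in the repository may draw boundaries differently.

```python
def expand_to_word_boundaries(pieces, kept):
    """Grow a set of kept token indices so that no word is left half-kept.

    pieces: decoded text of each token, in prompt order.
    kept:   set of token indices that survived compression.
    """
    # Assign a word id to every token. A token starts a new word when it is
    # the first token, begins with whitespace, or follows trailing whitespace.
    word_ids = []
    current = -1
    for i, piece in enumerate(pieces):
        prev = pieces[i - 1] if i > 0 else ""
        if i == 0 or piece[:1].isspace() or prev[-1:].isspace():
            current += 1
        word_ids.append(current)

    # Any word touched by a kept token is kept in full.
    touched = {word_ids[i] for i in kept}
    return {i for i, w in enumerate(word_ids) if w in touched}


pieces = ["The", " code", " is", " BLUEH", "ORIZON", "-", "7421", " today"]
kept = expand_to_word_boundaries(pieces, {3})  # only " BLUEH" survived
print("".join(pieces[i] for i in sorted(kept)))
```

In this toy split, keeping only the first fragment is enough for the pass to recover the whole needle, at the cost of a few extra tokens in the compressed prompt.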

OpenAI-compatible server, with sampling

Same server.py as qwen35. Point --target at the Laguna GGUF and the binary detects arch=laguna from the metadata, then routes to run_laguna_daemon. You get the existing FastAPI surface: /v1/chat/completions (stream and non-stream), /v1/models, /health, CORS, prefix cache, prefill cache. Sampling parameters from the request body forward through to a CPU sampler chain on the daemon side. No new server, no new code path.

Smoke test, prompt = "Tell me a one-line haiku about clouds." on a 3090:

| Sampler tail | Output (first 90 chars) |
|---|---|
| (greedy) | Fluffy white giants / Sail through the sky on gentle / Wings of summer breeze |
| 1.0,0.5,0,1.0,99 | Clouds drift like cotton dreams floating through the sky. |

Two distinct decodes from the same prompt confirm that the chain wires the HTTP body all the way to sample_logits.
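A small Python client makes the same check scriptable. This is a sketch under assumptions: the model name and port are taken from the curl example in the Reproduce section, temperature and top_p are the standard OpenAI request fields, and which sampling fields the daemon actually honors is not spelled out here. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", **sampling):
    """Build a non-streaming POST request for /v1/chat/completions."""
    body = {
        "model": "luce-dflash",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    body.update(sampling)  # e.g. temperature=1.0, top_p=0.5 (assumed to be honored)
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **sampling):
    """Send the request and return the assistant text (needs a running server)."""
    req = build_chat_request(prompt, **sampling)
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    # Standard OpenAI non-streaming response shape.
    return reply["choices"][0]["message"]["content"]
```

Calling chat("Tell me a one-line haiku about clouds.") once with no sampling fields and once with temperature=1.0, top_p=0.5 should produce two different replies if the sampler chain is wired through.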

Reproduce

Model GGUF: Lucebox/Laguna-XS.2-GGUF (Q4_K_M 20.3 GB, BF16 66.9 GB, imatrix included).

# clone
git clone https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# build with sm_86 (3090 / A6000)
cmake -B build -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build -j

# fetch the Q4_K_M GGUF + Poolside tokenizer
hf download Lucebox/Laguna-XS.2-GGUF laguna-xs2-Q4_K_M.gguf --local-dir models/
hf download poolside/Laguna-XS.2 chat_template.jinja tokenizer.json tokenizer_config.json \
   special_tokens_map.json config.json --local-dir models/Laguna-XS-2

# run the OpenAI server (same server.py as qwen35, arch auto-detected from GGUF).
# -ctk/-ctv q4_0 keeps the 131K KV cache under ~6 GB so weights + KV fit on 24 GB.
python3 scripts/server.py \
  --target models/laguna-xs2-Q4_K_M.gguf \
  --tokenizer models/Laguna-XS-2 \
  --port 8000 --max-ctx 131072 \
  -ctk q4_0 -ctv q4_0

# chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"luce-dflash","messages":[{"role":"user","content":"hello"}],"stream":true}'

Hardware note. The build defaults to 75;86; pass -DCMAKE_CUDA_ARCHITECTURES=86 to skip the extra arch and shave compile time on a 3090.

What is missing

Bottom line

Laguna XS.2 is the first MoE target on PFlash and a clean fit for consumer hardware. Apache 2.0, 3B active parameters, fits next to a Q4_0 KV cache in 24 GB, and with PFlash a 128K prompt reaches first token in about 16 seconds on a used RTX 3090, so the long-context loop is no longer a wait state. A solid second model to keep loaded next to your dense Qwen.


Source: PR #116 on github.com/Luce-Org/lucebox-hub. Model GGUF: Lucebox/Laguna-XS.2-GGUF. Benchmark numbers from dflash/RESULTS.md on the integration branch, measured on RTX 3090 24 GB. Upstream model: poolside/Laguna-XS.2; deeper dive at poolside.ai/blog/laguna-a-deeper-dive.
