May 2026
Launch and tune Lucebox with real agent harnesses
Seven client paths wired to Lucebox, with launch scripts, RTX 3090 profiles, and saved DDTree runs.
TL;DR
- Harness code. The repo now has one launcher per client: Codex, Claude Code, OpenCode, Hermes, Pi, OpenClaw, Open WebUI chat, and Open WebUI tools.
- Model path. The runs use Lucebox/Qwen3.6-27B-DFlash-GGUF, Q4_K_M target weights and the Lucebox HTTP server.
- Context range. The practical RTX 3090 settings landed between 32K and 256K. The limit is set by each client's default prompt and tool schema, not by a single server flag.
- Speed. Long real-client turns with TQ3/DDTree/lazy draft landed around 22.6-25.9 tok/s on the agent paths we measured.
- Open WebUI status. Chat passes at 256K. Default server-side tool execution and native tool-call emission pass at 64K.
Why this exists
An OpenAI request is not enough evidence for an agent harness speed. The clients in this test use different request shapes, different streaming expectations and very different default prompts.
Codex uses Responses and sends tools. Claude Code uses Anthropic Messages API. OpenCode needs a project-local config. Hermes, Pi and OpenClaw each have their own provider settings. Open WebUI adds another server between the user and Lucebox.
The harness exists to keep those paths reproducible. It starts the server, runs the actual client, records the output and logs, and makes client-specific regressions visible before they become user reports.
Context ceilings
Same hardware, same model family, same server. Different clients still need different profiles because the prompt and tool overhead are not the same.
Decode speed
For speed, we only chart runs where the server log captured the decode split and the client output was useful. Claude Code is included after stripping Claude-specific skill reminders before rendering the prompt for Qwen.
Lucebox vs llama.cpp
For backend-pair runs we keep the client path fixed and swap only the model backend. The chart shows Lucebox decode speed divided by llama.cpp decode speed. Claude Code is the TQ3/TQ3 pair and is shown at parity.
Harness Performance
There is no single best profile. A 256K chat/proxy request and a 32K tool-agent request are different workloads. The last column records the ceiling or the next check.
| Client | Working profile | Path exercised | Ceiling / next check |
|---|---|---|---|
| Claude Code | 48K, DDTree, budget 22, lazy draft | Anthropic Messages, default tools, long output, 23.4 tok/s server decode | Passes after stripping Claude-specific skill reminders from the Qwen prompt |
| Codex | 32K, DDTree, budget 22, lazy draft | /v1/responses, system prompt, tools, long output, 24.1 tok/s | 34K+ returned zero output |
| OpenCode | 84K, DDTree, budget 22, lazy draft | Project opencode.json, file tools, long output, 24.0 tok/s | 88K+ became wrong or empty |
| Hermes Agent | 96K, DDTree, budget 22, lazy draft | Chat Completions with 19 tools, long output, 22.6 tok/s | 112K+ skipped useful work |
| Pi | 64K, DDTree, budget 22, lazy draft | Responses streaming, function-call lifecycle, two read calls, 25.9-29.6 tok/s | 256K short-prompt smoke passed; 256K repo/tool run OOMed during prefill on RTX 3090 |
| OpenClaw | 200K, DDTree, budget 22, lazy draft | Real 29-tool prompt, long output, 25.6 tok/s | 224K+ OOM with full prompt |
| Open WebUI chat | 256K, DDTree, budget 22, lazy draft | OpenAI proxy chat route generated correctly, 38.5 tok/s on the long chat run | Tool routes are covered in the Open WebUI tools row |
| Open WebUI tools | 64K, DDTree, budget 22, lazy draft | Default server-side tool execution returns OPENWEBUI_TOOL_OK; native mode emits tool_calls at 88.0 tok/s | No failure in the saved 64K tool tests; keep both modes in regression coverage |
Harness Profiles
For a 24 GB RTX 3090, this is the current starting point. Move up only after the real client still reads, calls tools, and generates correctly.
Claude Code MAX_CTX=49152 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft
Codex MAX_CTX=32768 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft
OpenCode MAX_CTX=86016 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft
Hermes Agent MAX_CTX=98304 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft
Pi MAX_CTX=65536 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft
OpenClaw MAX_CTX=204800 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft
Open WebUI MAX_CTX=262144 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft # chat route
Open WebUI tools MAX_CTX=65536 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft # default execution + native emission What went into the test suite
The harness is small on purpose: shared server startup code, one launcher per client, and a few fixed prompts. After a server change, a contributor can rerun the same client path and compare logs.
harness/
client_test_runner.py
clients/
run_claude_code.sh
run_codex.sh
run_opencode.sh
run_hermes.sh
run_pi.sh
run_openclaw.sh
run_openwebui.sh The runner saves request shape, context size, server settings, token counts, client output, and server logs. That separates protocol failures from model-output failures.
What broke first
The failures were mostly integration details:
- First token correctness. The Qwen3.6 daemon fallback path now emits the prefill argmax instead of starting empty or wrong.
- DDTree dispatch. Single-GPU
--daemon --ddtreewas routing into an autoregressive fallback. The harness now catches that by requiring real decode speed, not only correct text. - OpenClaw tool turns. The stable long-output profile reached 200K with DDTree. The 224K run OOMed with the full prompt.
- Streaming shape. Anthropic Messages and OpenAI Responses needed the event details real clients expect.
- Tool calls. Function-call lifecycle was checked with real clients, not hand-written JSON. Open WebUI has both default server-side execution and native tool-call emission coverage.
- Project root. OpenCode needed its tools pointed at the same repo the user sees.
Takeaway
For agent clients, compatibility means more than returning text from one route. The client has to send its default prompt, stream or call tools in its own format, and still get useful output from the local server.
On a single RTX 3090, Lucebox ran Qwen3.6 through the tested chat and agent paths with DDTree as the default fast path. The profile is still client-specific: Codex is stable at 32K, Pi's real tool path is stable at 64K, OpenClaw reached 200K, Open WebUI chat reached 256K, and Open WebUI's 64K tool paths pass.
Model: Lucebox/Qwen3.6-27B-DFlash-GGUF. Project: github.com/Luce-Org/lucebox-hub.