May 2026

Launch and tune Lucebox with real agent harnesses

Seven client paths wired to Lucebox, with launch scripts, RTX 3090 profiles, and saved DDTree runs.

Lucebox client harness experiments on a RTX 3090

TL;DR

Harness code. The repo now has one launcher per client: Codex, Claude Code, OpenCode, Hermes, Pi, OpenClaw, Open WebUI chat, and Open WebUI tools.
Model path. The runs use Lucebox/Qwen3.6-27B-DFlash-GGUF, Q4_K_M target weights and the Lucebox HTTP server.
Context range. The practical RTX 3090 settings landed between 32K and 256K. The limit is set by each client's default prompt and tool schema, not by a single server flag.
Speed. Long real-client turns with TQ3/DDTree/lazy draft landed around 22.6-25.9 tok/s on the agent paths we measured.
Open WebUI status. Chat passes at 256K. Default server-side tool execution and native tool-call emission pass at 64K.

Why this exists

An OpenAI request is not enough evidence for an agent harness speed. The clients in this test use different request shapes, different streaming expectations and very different default prompts.

Codex uses Responses and sends tools. Claude Code uses Anthropic Messages API. OpenCode needs a project-local config. Hermes, Pi and OpenClaw each have their own provider settings. Open WebUI adds another server between the user and Lucebox.

The harness exists to keep those paths reproducible. It starts the server, runs the actual client, records the output and logs, and makes client-specific regressions visible before they become user reports.

Pass criterion: the real CLI starts, reaches Lucebox, uses the intended protocol or tool path, produces the expected answer, and leaves enough logs to debug a failure.

Context ceilings

Same hardware, same model family, same server. Different clients still need different profiles because the prompt and tool overhead are not the same.

Largest context that still produced a useful answer in the saved real-client run.

Decode speed

For speed, we only chart runs where the server log captured the decode split and the client output was useful. Claude Code is included after stripping Claude-specific skill reminders before rendering the prompt for Qwen.

DDTree/TQ3/lazy-draft profiles. Server decode tok/s, not wall-clock CLI time.

Lucebox vs llama.cpp

For backend-pair runs we keep the client path fixed and swap only the model backend. The chart shows Lucebox decode speed divided by llama.cpp decode speed. Claude Code is the TQ3/TQ3 pair and is shown at parity.

Same client request shape where available. Higher is better; the dashed line is parity with llama.cpp.

Harness Performance

There is no single best profile. A 256K chat/proxy request and a 32K tool-agent request are different workloads. The last column records the ceiling or the next check.

Client	Working profile	Path exercised	Ceiling / next check
Claude Code	`48K`, DDTree, budget 22, lazy draft	Anthropic Messages, default tools, long output, 23.4 tok/s server decode	Passes after stripping Claude-specific skill reminders from the Qwen prompt
Codex	`32K`, DDTree, budget 22, lazy draft	`/v1/responses`, system prompt, tools, long output, 24.1 tok/s	34K+ returned zero output
OpenCode	`84K`, DDTree, budget 22, lazy draft	Project `opencode.json`, file tools, long output, 24.0 tok/s	88K+ became wrong or empty
Hermes Agent	`96K`, DDTree, budget 22, lazy draft	Chat Completions with 19 tools, long output, 22.6 tok/s	112K+ skipped useful work
Pi	`64K`, DDTree, budget 22, lazy draft	Responses streaming, function-call lifecycle, two read calls, 25.9-29.6 tok/s	256K short-prompt smoke passed; 256K repo/tool run OOMed during prefill on RTX 3090
OpenClaw	`200K`, DDTree, budget 22, lazy draft	Real 29-tool prompt, long output, 25.6 tok/s	224K+ OOM with full prompt
Open WebUI chat	`256K`, DDTree, budget 22, lazy draft	OpenAI proxy chat route generated correctly, 38.5 tok/s on the long chat run	Tool routes are covered in the Open WebUI tools row
Open WebUI tools	`64K`, DDTree, budget 22, lazy draft	Default server-side tool execution returns `OPENWEBUI_TOOL_OK`; native mode emits `tool_calls` at 88.0 tok/s	No failure in the saved 64K tool tests; keep both modes in regression coverage

Harness Profiles

For a 24 GB RTX 3090, this is the current starting point. Move up only after the real client still reads, calls tools, and generates correctly.

Claude Code   MAX_CTX=49152  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Codex         MAX_CTX=32768  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
OpenCode      MAX_CTX=86016  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Hermes Agent  MAX_CTX=98304  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Pi            MAX_CTX=65536  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
OpenClaw      MAX_CTX=204800 BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Open WebUI    MAX_CTX=262144 BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft  # chat route
Open WebUI tools  MAX_CTX=65536 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft  # default execution + native emission

What went into the test suite

The harness is small on purpose: shared server startup code, one launcher per client, and a few fixed prompts. After a server change, a contributor can rerun the same client path and compare logs.

harness/
  client_test_runner.py
  clients/
    run_claude_code.sh
    run_codex.sh
    run_opencode.sh
    run_hermes.sh
    run_pi.sh
    run_openclaw.sh
    run_openwebui.sh

The runner saves request shape, context size, server settings, token counts, client output, and server logs. That separates protocol failures from model-output failures.

What broke first

The failures were mostly integration details:

First token correctness. The Qwen3.6 daemon fallback path now emits the prefill argmax instead of starting empty or wrong.
DDTree dispatch. Single-GPU --daemon --ddtree was routing into an autoregressive fallback. The harness now catches that by requiring real decode speed, not only correct text.
OpenClaw tool turns. The stable long-output profile reached 200K with DDTree. The 224K run OOMed with the full prompt.
Streaming shape. Anthropic Messages and OpenAI Responses needed the event details real clients expect.
Tool calls. Function-call lifecycle was checked with real clients, not hand-written JSON. Open WebUI has both default server-side execution and native tool-call emission coverage.
Project root. OpenCode needed its tools pointed at the same repo the user sees.

Takeaway

For agent clients, compatibility means more than returning text from one route. The client has to send its default prompt, stream or call tools in its own format, and still get useful output from the local server.

On a single RTX 3090, Lucebox ran Qwen3.6 through the tested chat and agent paths with DDTree as the default fast path. The profile is still client-specific: Codex is stable at 32K, Pi's real tool path is stable at 64K, OpenClaw reached 200K, Open WebUI chat reached 256K, and Open WebUI's 64K tool paths pass.

Model: Lucebox/Qwen3.6-27B-DFlash-GGUF. Project: github.com/Luce-Org/lucebox-hub.

Run Lucebox with your agent client

Use the launchers, inspect the logs, and tune the profile for your client.

GitHub Model Discord

Launch and tune Lucebox with real agent harnesses

TL;DR

Why this exists

Context ceilings

Decode speed

Lucebox vs llama.cpp

Harness Performance

Harness Profiles

What went into the test suite

What broke first

Takeaway

Related

Run Lucebox with your agent client