May 2026

By Davide Ciffa

Launch and tune Lucebox with real agent harnesses

Seven client paths wired to Lucebox, with launch scripts, RTX 3090 profiles, and saved DDTree runs.

Lucebox client harness experiments on a RTX 3090

TL;DR

Why this exists

An OpenAI request is not enough evidence for an agent harness speed. The clients in this test use different request shapes, different streaming expectations and very different default prompts.

Codex uses Responses and sends tools. Claude Code uses Anthropic Messages API. OpenCode needs a project-local config. Hermes, Pi and OpenClaw each have their own provider settings. Open WebUI adds another server between the user and Lucebox.

The harness exists to keep those paths reproducible. It starts the server, runs the actual client, records the output and logs, and makes client-specific regressions visible before they become user reports.

Pass criterion: the real CLI starts, reaches Lucebox, uses the intended protocol or tool path, produces the expected answer, and leaves enough logs to debug a failure.

Context ceilings

Same hardware, same model family, same server. Different clients still need different profiles because the prompt and tool overhead are not the same.

Largest context that still produced a useful answer in the saved real-client run.

Decode speed

For speed, we only chart runs where the server log captured the decode split and the client output was useful. Claude Code is included after stripping Claude-specific skill reminders before rendering the prompt for Qwen.

DDTree/TQ3/lazy-draft profiles. Server decode tok/s, not wall-clock CLI time.

Lucebox vs llama.cpp

For backend-pair runs we keep the client path fixed and swap only the model backend. The chart shows Lucebox decode speed divided by llama.cpp decode speed. Claude Code is the TQ3/TQ3 pair and is shown at parity.

Same client request shape where available. Higher is better; the dashed line is parity with llama.cpp.

Harness Performance

There is no single best profile. A 256K chat/proxy request and a 32K tool-agent request are different workloads. The last column records the ceiling or the next check.

Client Working profile Path exercised Ceiling / next check
Claude Code 48K, DDTree, budget 22, lazy draft Anthropic Messages, default tools, long output, 23.4 tok/s server decode Passes after stripping Claude-specific skill reminders from the Qwen prompt
Codex 32K, DDTree, budget 22, lazy draft /v1/responses, system prompt, tools, long output, 24.1 tok/s 34K+ returned zero output
OpenCode 84K, DDTree, budget 22, lazy draft Project opencode.json, file tools, long output, 24.0 tok/s 88K+ became wrong or empty
Hermes Agent 96K, DDTree, budget 22, lazy draft Chat Completions with 19 tools, long output, 22.6 tok/s 112K+ skipped useful work
Pi 64K, DDTree, budget 22, lazy draft Responses streaming, function-call lifecycle, two read calls, 25.9-29.6 tok/s 256K short-prompt smoke passed; 256K repo/tool run OOMed during prefill on RTX 3090
OpenClaw 200K, DDTree, budget 22, lazy draft Real 29-tool prompt, long output, 25.6 tok/s 224K+ OOM with full prompt
Open WebUI chat 256K, DDTree, budget 22, lazy draft OpenAI proxy chat route generated correctly, 38.5 tok/s on the long chat run Tool routes are covered in the Open WebUI tools row
Open WebUI tools 64K, DDTree, budget 22, lazy draft Default server-side tool execution returns OPENWEBUI_TOOL_OK; native mode emits tool_calls at 88.0 tok/s No failure in the saved 64K tool tests; keep both modes in regression coverage

Harness Profiles

For a 24 GB RTX 3090, this is the current starting point. Move up only after the real client still reads, calls tools, and generates correctly.

Claude Code   MAX_CTX=49152  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Codex         MAX_CTX=32768  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
OpenCode      MAX_CTX=86016  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Hermes Agent  MAX_CTX=98304  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Pi            MAX_CTX=65536  BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
OpenClaw      MAX_CTX=204800 BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft
Open WebUI    MAX_CTX=262144 BUDGET=22  VERIFY_MODE=ddtree  EXTRA_SERVER_ARGS=--lazy-draft  # chat route
Open WebUI tools  MAX_CTX=65536 BUDGET=22 VERIFY_MODE=ddtree EXTRA_SERVER_ARGS=--lazy-draft  # default execution + native emission

What went into the test suite

The harness is small on purpose: shared server startup code, one launcher per client, and a few fixed prompts. After a server change, a contributor can rerun the same client path and compare logs.

harness/
  client_test_runner.py
  clients/
    run_claude_code.sh
    run_codex.sh
    run_opencode.sh
    run_hermes.sh
    run_pi.sh
    run_openclaw.sh
    run_openwebui.sh

The runner saves request shape, context size, server settings, token counts, client output, and server logs. That separates protocol failures from model-output failures.

What broke first

The failures were mostly integration details:

Takeaway

For agent clients, compatibility means more than returning text from one route. The client has to send its default prompt, stream or call tools in its own format, and still get useful output from the local server.

On a single RTX 3090, Lucebox ran Qwen3.6 through the tested chat and agent paths with DDTree as the default fast path. The profile is still client-specific: Codex is stable at 32K, Pi's real tool path is stable at 64K, OpenClaw reached 200K, Open WebUI chat reached 256K, and Open WebUI's 64K tool paths pass.


Model: Lucebox/Qwen3.6-27B-DFlash-GGUF. Project: github.com/Luce-Org/lucebox-hub.

Related

Run Lucebox with your agent client

Use the launchers, inspect the logs, and tune the profile for your client.

GitHub Model Discord