Making the coding agent legible

2026-04-19 · 6 min read · devlog, emptyos, agents, testing

The coding agent had shipped a session earlier — a Claude-Code-like tool-use loop with a terminal frontend (eos chat) and a web frontend (/agent/) both driving the same run_turn core. The pitch was "same brain, two transports." The reality was rougher: eos chat printed a tidy banner with session id, provider, mode, and tool count; the web page just said "Agent" and had an empty sidebar. Asking the agent itself "which model are you?" produced an evasion — I don't have visibility into the underlying model — because nothing in the system told it what was running it. Two surfaces, one of them markedly less informative than the other, and a brain that didn't know its own face.

Before rewriting any of it we wrote tests, because neither app had any covering its web surface. The agent's existing test file was 600 lines of pure-unit work against a scripted fake provider: excellent coverage of the loop logic, zero coverage of the web surface. The run app (a small command-runner) had nothing at all. We added live-daemon API and UI test classes for both, plus Playwright page tests that opened the real pages in a real browser. That first pass paid off immediately: the run app's UI was calling EOS.api('/api/execute') inside a fetch(...) call, wrapping a fetch-wrapper in another fetch. The UI had been stuck on "Running…" forever because the inner call returned a promise, which the outer fetch then coerced to the string [object Promise] and used as a URL. Nobody had noticed because nobody had ever exercised the page in a test. The tests didn't have to work to catch this one; just loading the page was enough.

With tests in place the agent UI itself got the attention it needed. The colours were hardcoded dark (designed against a GitHub-dark palette that no EmptyOS theme actually provides), which on the default paper theme produced a bright white gap in the chat area. That became theme variables throughout. Assistant replies came through as literal markdown — asterisks, hash marks, code-fence syntax all rendered as plain text — so we wired them through the existing EOS_UI.renderMarkdown helper (the one the vault notes view already uses) on turn completion. A status header now sits above the transcript: the session name, a chip showing provider · model, a chip showing tool count, a chip showing the approval policy. Clicking the model chip opens a provider-switch modal; clicking the tools chip runs /tools inline. The page now answers three questions a user should never have to dig for: what's running this, what can it do, and how strict is the approval gate.

The command surface was a bigger thread. The agent had effectively no slash commands. The web had none; the terminal had exactly two handlers, /quit and /exit, and anything else got shipped to the LLM. So we defined a single SLASH_COMMANDS list in the agent app, served it at /agent/api/slash-commands, and taught both transports to execute the same set client-side (never round-tripping to the model): /help, /status, /tools, /clear, /new, /model <provider>, /settings, /quit. The web gets a floating palette when the input starts with /, with arrow-key navigation, tab-to-complete, and escape-to-dismiss: Claude-Code ergonomics for the keyboard path. The terminal gets the same commands at the REPL. /model openai switches the session's provider mid-conversation, persists the choice through a new PATCH /api/sessions/{sid} endpoint, and reconnects the websocket so the next turn goes to the new model. The two surfaces now share not just the brain but the grammar.
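The single-registry idea can be sketched roughly as below. This is a minimal illustration, not the EmptyOS code: only the SLASH_COMMANDS name and the command names come from the post, while the SlashCommand dataclass, the parse_slash and complete helpers, and the description strings are hypothetical. Treating an unknown /word as ordinary text is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlashCommand:
    name: str         # e.g. "/model"
    usage: str        # shown by /help and in the web palette
    description: str  # illustrative wording, not from the real app


# One list that both transports could load (hypothetical contents).
SLASH_COMMANDS = [
    SlashCommand("/help", "/help", "List available commands"),
    SlashCommand("/status", "/status", "Show provider, model, and tool count"),
    SlashCommand("/tools", "/tools", "List the tools the agent can call"),
    SlashCommand("/clear", "/clear", "Clear the transcript"),
    SlashCommand("/new", "/new", "Start a new session"),
    SlashCommand("/model", "/model <provider>", "Switch provider"),
    SlashCommand("/settings", "/settings", "Open settings"),
    SlashCommand("/quit", "/quit", "Leave the session"),
]

_BY_NAME = {cmd.name: cmd for cmd in SLASH_COMMANDS}


def parse_slash(line: str) -> Optional[tuple[SlashCommand, list[str]]]:
    """Return (command, args) for a known slash command, else None.

    None means "not a command", so the caller would send the text to
    the model. Unknown /words also return None (an assumption).
    """
    stripped = line.strip()
    if not stripped.startswith("/"):
        return None
    name, *args = stripped.split()
    cmd = _BY_NAME.get(name.lower())
    if cmd is None:
        return None
    return cmd, args


def complete(prefix: str) -> list[str]:
    """Command names matching a typed prefix, for a palette or tab completion."""
    return [c.name for c in SLASH_COMMANDS if c.name.startswith(prefix.lower())]
```

Because both the REPL and the web palette would read the same list, adding a command in one place makes it appear in /help, in completion, and in dispatch on both surfaces.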

For the self-awareness problem, we appended a small "Runtime (factual, for self-reference)" footer to the system prompt at turn time — provider name, model string, wire protocol, tool count. The footer tells the agent to answer truthfully when asked and not to recite unprompted. It's four lines, but it's the difference between I don't have visibility and I'm running ollama with qwen3.5:latest via the openai-compatible wire protocol, with seven tools available. The footer is built from the same /api/status endpoint the header chips read from, so everyone — the user looking at the header, the agent generating its reply, the CLI printing its banner — is looking at the same facts.
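A footer like this might be assembled along the following lines. The sketch assumes a status dictionary with provider, model, protocol, and tool_count keys; those key names, the function names, and the exact footer wording are hypothetical, since the post does not show the real /api/status payload or prompt text.

```python
def build_runtime_footer(status: dict) -> str:
    """Render a short factual footer from a status dict.

    The keys read here are assumed for illustration; the real
    status payload may be shaped differently.
    """
    lines = [
        "Runtime (factual, for self-reference)",
        f"provider: {status['provider']}",
        f"model: {status['model']}",
        f"wire protocol: {status['protocol']}",
        f"tools available: {status['tool_count']}",
        "Answer truthfully if asked about these; do not recite them unprompted.",
    ]
    return "\n".join(lines)


def with_runtime_footer(system_prompt: str, status: dict) -> str:
    """Append the footer to a system prompt at turn time."""
    return system_prompt.rstrip() + "\n\n" + build_runtime_footer(status)


if __name__ == "__main__":
    status = {
        "provider": "ollama",
        "model": "qwen3.5:latest",
        "protocol": "openai-compatible",
        "tool_count": 7,
    }
    print(with_runtime_footer("You are a coding agent.", status))
```

Building the footer per turn rather than once per session means a mid-conversation provider switch is reflected in the very next reply.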

Then somebody asked the agent to play a song, and everything fell over. Ollama returned a 400. Our client caught it and raised RuntimeError: 400, message='Bad Request', which is worse than useless, because it tells you a failure happened without telling you what Ollama objected to. The error path had been swallowing the response body for months, possibly forever. First fix: read the body, parse error.message, and raise a real exception: RuntimeError: ollama tool-call request failed (HTTP 400): <actual message>. That change is the kind of thing that feels like chore work until it pays. The next retry revealed the actual problem immediately: invalid message content type: <nil>. Ollama was rejecting the assistant message we'd just sent, because it had content: null alongside tool_calls. OpenAI's spec permits that shape; Ollama's implementation doesn't. Our loop had been setting content = text or None for tool-only assistant turns, which is perfectly fine against OpenAI and Anthropic and silently wrong against Ollama. A one-line fix: None → "". And a matching fix one layer down, because sessions persisted before today had null stored in the database: the normalization layer that rebuilds the OpenAI message shape now coerces any null assistant content to an empty string at wire time, so replay through Ollama heals historical rows without a migration.
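Both fixes are small enough to sketch. The code below is an illustration under assumptions, not the project's client: the function names are hypothetical, and the error helper accepts either a nested error.message or a plain string under error because the exact body shape can vary by provider and endpoint.

```python
import json


def provider_error_message(provider: str, status: int, body: str) -> str:
    """Build an error string that includes what the provider actually said."""
    detail = body.strip() or "<empty response body>"
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # not JSON: fall back to the raw body text
    if isinstance(parsed, dict):
        err = parsed.get("error")
        if isinstance(err, dict) and err.get("message"):
            detail = str(err["message"])
        elif isinstance(err, str) and err:
            detail = err
    return f"{provider} tool-call request failed (HTTP {status}): {detail}"


def normalize_messages(messages: list[dict]) -> list[dict]:
    """Return messages with null or missing assistant content coerced to "".

    Applied at wire time, so stored rows are left untouched and older
    sessions replay cleanly without a migration.
    """
    out = []
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("content") is None:
            msg = {**msg, "content": ""}  # copy; never mutate the stored row
        out.append(msg)
    return out
```

A caller would raise RuntimeError(provider_error_message("ollama", 400, body_text)) on a non-2xx response, and pass every outgoing history through normalize_messages just before serializing the request.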

What this makes possible: the agent is now a real Claude-Code-grade tool you can sit in front of and use, whether you prefer the terminal or the browser. Both know what model they're running; both know what commands exist; both route slash commands client-side; both show the tool list; both handle provider switches. And because the error-surfacing path now tells us what providers are actually complaining about, the next wire-format mismatch, whether from some future local model or from a provider that reads the spec slightly differently, will show up as a real error message rather than a bare 400 with its body swallowed. There's still one frayed edge worth naming: the assistant app has its own /api/slash-commands endpoint with different semantics (server-side routing to other EmptyOS apps, not client-side UI control), and the two shouldn't be unified yet. They have different contracts, different execution models, and different UX. If a third app ever wants client-side slash commands, that's the moment to extract the palette into a shared helper. Not before.
