When the agent stops flailing
Earlier in the session we watched eos chat try to fix a broken calculator app. The page was returning {} and the agent's plan was to add another route to work around it. Then another. Then another. Four shadow routes in, the calculator still didn't render and the agent was asking us what we could see in the browser.
The whole loop was a bandaid factory. The real bug was that the generated code imported aiohttp.web.Response in a FastAPI app — EmptyOS has always been FastAPI — and FastAPI can't serialize an aiohttp response, so it silently returned {}. But nothing in the agent's context told it "EmptyOS is FastAPI, the platform already auto-mounts pages/index.html at the prefix, and Python edits need a daemon restart before they take effect." It reasoned from generic Python training, decided a custom handler was missing, and piled patches on top of a wrong mental model.
The rebuild has four parts, each aimed at a specific failure mode from that debugging.
The first is context. We pull a curated slice of CLAUDE.md — the development gotchas and the handful of rules the agent violates most often — into the system prompt at session start, alongside a compact catalogue of every loaded app. The catalogue lists each app's id, short description, CLI surface, and web prefix, so the agent goes straight to CallApp(app_id, method, arguments) instead of burning two turns on discovery. More importantly it now knows the daemon-restart rule is a rule, not advice. When the same calculator scenario runs today, the agent reads emptyos/web/server.py early in the investigation, notices the _mount_loaded_app_routes pattern, and writes clean FastAPI code the first time.
The second is reflexes inside the tool-use loop. If the agent hits three tool errors in a row, a stop-and-replan nudge gets appended to the next tool_result content — the model reads it on the next iteration and backs off from whatever shape it was retrying. If it edits a .py file successfully, a daemon-hint reminder is appended automatically: the running daemon still has the old bytecode; tell the user to restart, then Fetch the affected endpoint to verify. If it edits the same file more than five times in one turn, the sixth edit is refused with a synthetic error asking it to Read the file fresh and reconsider. These aren't prompt engineering. They're bytes appended to the right place at the right moment so the model sees them in-band, without us restructuring the conversation.
The third is self-verification. A new Fetch tool makes HTTP requests so the agent can hit its own endpoints, auto-approved for localhost. A new RestartDaemon tool spawns restart.bat detached so the agent can close its own restart-then-verify loop without handing control back to the user. A new Screenshot tool opens a headless Chromium and returns a PNG plus body.innerText plus any JS/console errors — enough to distinguish "the HTML is empty" from "the HTML looks right but JS errored on load." No more "please open the URL in a browser and tell me what you see." The agent checks its own work.
The fourth is that everything the CLI has, the web UI at /agent/ now has too. Per-turn footer with elapsed time, tokens, cache-hit percentage, tools used, cost. A sticky TaskList panel above the transcript so multi-step plans are visible as checkboxes that tick through in real time. Inline notices when the server appends a [daemon-hint] or [loop-guard] marker, so a human watching sees the same bytes the model is reading. Plan mode — read-only investigation only, toggled by /plan — with a banner and a ⚑ ▸ prompt indicator so you can't forget you're in it. An undo chip in the header showing how many Write/Edit changes are revertable this session. /revert, /skills, /tasks, session CRUD, model persistence, input history — all surfaced in both places. Both surfaces share the same server-side code paths, so they can't drift by accident; the CLI's /revert and the web UI's ↶ chip literally call the same function.
One cleanup worth mentioning because the pattern comes up a lot: we had three parallel pricing tables — two fallback tables on the client and CLI plus the authoritative per-provider PRICING dicts on the server. The fallbacks existed because the Anthropic SDK provider never populated usage.cost. Once we taught it to (with cache-aware math — uncached input at 100%, cache_read at 10%, cache_create at 125%), the fallbacks became dead code that would drift silently whenever rates changed. Deleting them removed about fifty lines and fixed a latent bug where cached Claude turns were displaying roughly 10× their actual cost. Duplicated data tables feel like defensive redundancy; they're usually a latent failure mode waiting for one of the copies to fall behind.
What's left open. We didn't build subagents or cross-session memory — a long-running agent would still bloat its own context without the compaction pass we shipped, and it still forgets what it learned last time the next time it starts. The web UI's undo stack is in-memory on the daemon, so it clears on restart; fine for "just did that" recovery, wrong for anything longer. A few polish items are deferred: right-click rename on session items in the sidebar, a /review slash that runs a pre-commit audit on the agent's own diff. And the whole thing is still a tool-use loop around a single model — no native multi-agent orchestration. The ceiling is higher than it was this morning, but the room above it is large.
The model was always capable of this. It just needed to know where to look.