When retrieval becomes the bottleneck

2026-04-21 · 5 min read · devlog, emptyos, assistant, sdk

Earlier in the session we typed a question into the assistant and got back four paragraphs of generic wellness advice with no file references. We typed the same question into the agent (same underlying model, same vault, same day) and it opened the right note on its own and gave an answer grounded in what was actually written there. Two surfaces shared one brain, but only one of them was looking at anything.

The asymmetry was worth naming because it pointed at the real bottleneck. The assistant's retrieval is a keyword grep of the user's question text against every note in the vault — pick the top three, inject their bodies into the system prompt, ask the model to answer. That works when the question contains the keywords that actually live in the note. It falls apart the moment the question is semantic: "what chant should I say" doesn't appear in the note that would answer it, so the grep returns nothing, and the model — with no context and no instruction to admit ignorance — makes up plausible filler and cites note names that don't exist. The agent does something different: it has Read, Grep, and Glob as tools and decides per turn what to look at. Given the same question it ran three searches, opened two real files, and cited both. The difference wasn't reasoning. It was that one of them could look.
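
A minimal sketch of that failure mode, not the assistant's actual retrieval code. The tokenizer (lowercase words of four or more letters), the scoring, and the two notes are all made up for illustration; only the top-three cut comes from the description above.

```python
import re

def keyword_retrieve(question, notes, top_k=3):
    """Rank notes by how many question keywords appear in their body."""
    # Hypothetical tokenizer: lowercase words of 4+ letters stand in for keywords.
    keywords = set(re.findall(r"[a-z]{4,}", question.lower()))
    scored = []
    for name, body in notes.items():
        text = body.lower()
        score = sum(1 for kw in keywords if kw in text)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken by name so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

# Invented vault contents, for illustration only.
notes = {
    "evening-practice.md": "Recite the evening mantra slowly, three times.",
    "groceries.md": "Oats, rice, tea.",
}

# Shares literal words with the note, so the grep finds it.
literal_hits = keyword_retrieve("which mantra in the evening", notes)

# Same intent, different vocabulary: nothing overlaps, nothing is injected.
semantic_hits = keyword_retrieve("what chant should I say", notes)
```

With an empty result the model gets no context at all, which is the point where the filler starts.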

The fix was to give the assistant the same eyes. We wired an opt-in tool loop into the existing chat path: when use_tools is set on a message (or as an account default), the assistant runs the same turn driver the agent uses, with the tool set filtered to the read-only three — Read, Grep, Glob — and the consent gate set to auto-approve since nothing can mutate. The iteration cap is eight, the temperature is 0.3, the system prompt points at the vault's PARA folders and tells the model to cite the filenames it actually read and to say so plainly if nothing in the vault fits. The classic grep-then-inject path is still the default, because most chat turns are conversational and don't need the three-call tax. But when retrieval matters, the tool loop closes the gap.
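
The shape of that loop can be sketched roughly as follows. This is an illustration under assumptions, not the SDK's turn driver: the step dictionaries, the registry of plain callables, and the scripted stand-in for the model are all hypothetical. Only the read-only filter, the auto-approve, and the cap of eight come from the description above; temperature is left out because no real model is called.

```python
import fnmatch
import re

READ_ONLY = {"Read", "Grep", "Glob"}
MAX_ITERATIONS = 8

def run_tool_loop(model, registry, question, max_iterations=MAX_ITERATIONS):
    """Drive a model until it answers or the iteration cap is hit."""
    # Filter the registry down to the read-only tools.
    tools = {name: fn for name, fn in registry.items() if name in READ_ONLY}
    transcript = [("user", question)]
    for _ in range(max_iterations):
        step = model(transcript, sorted(tools))
        if step["type"] == "answer":
            return step["text"], transcript
        name = step["tool"]
        if name not in tools:
            result = f"error: tool {name!r} is not available"
        else:
            # No consent prompt: everything left in `tools` is read-only.
            result = tools[name](step["arg"])
        transcript.append(("tool", f"{name}({step['arg']!r}) -> {result}"))
    return "Stopped: iteration cap reached without an answer.", transcript

# Invented in-memory vault and tool registry, for illustration only.
vault = {"areas/practice.md": "Recite the evening mantra three times."}
registry = {
    "Read": lambda path: vault.get(path, "error: no such file"),
    "Grep": lambda pat: [p for p, body in vault.items() if re.search(pat, body, re.I)],
    "Glob": lambda pat: [p for p in vault if fnmatch.fnmatch(p, pat)],
    "Write": lambda path: "should never run",  # mutating tool, filtered out
}

def scripted_model(transcript, tool_names):
    """Stand-in for a real model: search, open the hit, then answer."""
    tool_turns = [text for role, text in transcript if role == "tool"]
    if len(tool_turns) == 0:
        return {"type": "tool", "tool": "Grep", "arg": "mantra|chant"}
    if len(tool_turns) == 1:
        return {"type": "tool", "tool": "Read", "arg": "areas/practice.md"}
    return {"type": "answer",
            "text": "See areas/practice.md: recite the evening mantra three times."}

answer, transcript = run_tool_loop(scripted_model, registry, "what chant should I say")
```

The cap is what keeps a model that never settles on an answer from searching indefinitely; the loop returns a plain stop message instead.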

We put three entry points on the toggle. A wrench button in the composer covers per-message experiments, the kind of thing you flip when a question is clearly vault-adjacent. A checkbox in settings serves people who want it on by default. And the request-level use_tools: true is there for external callers. All three resolve to the same backend branch; they differ only in who makes the choice and when. This matches the "with you, not for you" principle we keep trying to honor: tool use isn't silently on, but it isn't three menus deep either.
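
One way the three entry points could collapse into a single decision, as a sketch. The dictionary shapes and the use_tools_default key are hypothetical, and the precedence (an explicit per-message value beats the account default) is an assumption rather than something the post states.

```python
def resolve_use_tools(message: dict, account: dict) -> bool:
    """Decide whether this turn takes the tool-loop branch."""
    # The wrench button and external callers both set use_tools on the message.
    explicit = message.get("use_tools")
    if explicit is not None:
        return bool(explicit)
    # Otherwise fall back to the settings checkbox (hypothetical key name).
    return bool(account.get("use_tools_default", False))

wrench_on = resolve_use_tools({"text": "q", "use_tools": True}, {})
settings_on = resolve_use_tools({"text": "q"}, {"use_tools_default": True})
opted_out = resolve_use_tools({"text": "q", "use_tools": False}, {"use_tools_default": True})
plain_chat = resolve_use_tools({"text": "q"}, {})
```

Keeping the decision in one function means the classic grep-then-inject path stays the default whenever nobody has chosen otherwise.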

The more interesting consequence was structural. The assistant's _chat_with_tools method needed two imports from the agent app — the turn driver and the tool registry. Cross-app imports are a smell EmptyOS has rules about: apps are supposed to talk through the event bus or call_app, not through each other's modules. And there's a rule one level above that about when to move shared code into the SDK proper: build specific first, extract to sdk/ when a second app needs it. We had just become the second app. So the agent loop and its thirteen tool classes graduated — apps/agent/loop.py moved to emptyos/sdk/agent_loop.py, apps/agent/tools/ moved to emptyos/sdk/agent_tools/. About fifty import sites rewrote themselves across the consuming apps, the bench, and the tests. The smell is gone, and any future app that wants tool-use imports the same machinery from the same place.
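
For a sense of what rewriting an import site involves, here is a small sketch. It is not necessarily how the move was done. The dotted module names are inferred from the file paths above, and run_turn and ReadTool in the sample are hypothetical names.

```python
import re

# Old module prefix -> new home in the SDK (dotted names assumed from the paths).
MOVES = {
    "apps.agent.loop": "emptyos.sdk.agent_loop",
    "apps.agent.tools": "emptyos.sdk.agent_tools",
}

# Match "from X" or "import X" at the start of a line, where X is a moved module
# followed by a dot, whitespace, or end of line (so "apps.agent.toolsmith" is left alone).
_PATTERN = re.compile(
    r"^([ \t]*(?:from|import)[ \t]+)("
    + "|".join(re.escape(old) for old in MOVES)
    + r")(?=[.\s]|$)",
    re.MULTILINE,
)

def rewrite_imports(source: str) -> str:
    """Point imports of the moved modules at their new SDK location."""
    return _PATTERN.sub(lambda m: m.group(1) + MOVES[m.group(2)], source)

before = (
    "from apps.agent.loop import run_turn\n"
    "from apps.agent.tools.read import ReadTool\n"
    "import apps.agent.toolsmith\n"
)
after = rewrite_imports(before)
```

The first two lines of the sample end up importing from emptyos.sdk; the third is untouched because it only shares a prefix with a moved module.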

One app stayed deliberately untouched: staff. Staff runs about twenty personas on cron schedules, each with a small action whitelist and a JSON output contract that's auditable after the fact. Its strength is precisely the opposite of what we just gave the assistant — narrow scope, cheap shifts, predictable outputs. Handing that fleet a grep+read tool loop would multiply its per-day cost by an order of magnitude, make shift duration unpredictable, and turn its structured action trail into free-form tool transcripts. The three-layer shape — staff as narrow autonomy, assistant as conversation with optional retrieval, agent as full autonomous loop — only holds if the envelopes stay distinct. Putting tools on all three collapses them into three agents wearing different UIs, and at that point the question becomes why we have three.

What's still rough. There's a helper for resolving the right tool-capable provider that now exists in two slightly different versions — assistant's skips native-agentic providers (claude-cli runs its own loop and we don't want to double-drive it), agent's allows them. About eighty percent of the code is shared. The same extraction rule says we should consolidate when a third caller appears, so we're holding. More substantively, the assistant's tool loop is stateless across turns — if you ask three questions in a row about the same prep doc, the model re-opens it each time, burning three sets of tool calls on the same file. Session-scoped tool memory (or a cached-reads layer) is the obvious next shape, but it's bigger than this session earned. And retrieval quality still depends on the model writing good search patterns — Grep is only as good as the regex the model composes, and we've seen it miss the obvious lookup on questions that reference unusual vocabulary.
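
A cached-reads layer could look something like the sketch below. None of this exists yet; the class name, the injected read and version functions, and the in-memory files are all hypothetical.

```python
class SessionReadCache:
    """Remember file bodies for one chat session, keyed by path and version."""

    def __init__(self, read_fn, version_fn):
        self._read = read_fn        # path -> body (the expensive tool call)
        self._version = version_fn  # path -> anything that changes when the file does
        self._entries = {}
        self.misses = 0

    def read(self, path):
        version = self._version(path)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == version:
            return cached[1]        # unchanged since the last read: reuse it
        self.misses += 1
        body = self._read(path)
        self._entries[path] = (version, body)
        return body

# Invented in-memory vault: path -> (version, body).
files = {"projects/prep.md": (1, "Agenda: budget, hiring.")}
cache = SessionReadCache(
    read_fn=lambda p: files[p][1],
    version_fn=lambda p: files[p][0],
)

# Three questions in a row about the same prep doc cost one real read.
for _ in range(3):
    first_body = cache.read("projects/prep.md")
misses_after_three = cache.misses

# Editing the note bumps the version, so the next read goes back to the file.
files["projects/prep.md"] = (2, "Agenda: budget, hiring, launch.")
second_body = cache.read("projects/prep.md")
```

For files on disk, os.path.getmtime is one candidate for the version function, with the caveat that modification times can be coarse on some filesystems.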

The assistant was always capable of grounded answers. It just had to be looking at the right file when we asked.
