One provider was hiding three
It started as a ten-minute UX fix. The dictionary app showed the phonetic spelling of every word it looked up — /ˈherəld/, /ɪmˈbɑrɡoʊ/ — but there was no way to actually hear it. We opened the codebase expecting to find the button broken and the endpoint working; instead we found the endpoint working and no button. The /api/pronounce/{word} route called self.speak() and returned a file path, but nothing on the page ever asked for that route, and even if something had, there was no HTTP endpoint that could hand the resulting MP3 back to the browser. A half-built feature, quietly.
So we finished it. A speaker icon next to the word, a second icon in the vocab detail view, a third on the SRS flashcard. The JS caches the per-word audio URL, shows a playing state while the clip runs, and toasts a one-liner when the voice service is offline. A new /dict/api/audio/{filename} route serves the MP3 from the shared TTS temp directory with a path-traversal guard so nothing else can be exfiltrated through it. Tested end-to-end via curl before we trusted the browser — speak.execute("hello world") returned a valid 10-kilobyte MP3 on the first call. That part took the expected ten minutes.
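The path-traversal guard on the audio route can be sketched roughly as follows. This is a minimal illustration, not the route's actual code: the helper name `resolve_audio_file` and its signature are assumptions, and the real handler would wrap something like it in whatever web framework the app uses.

```python
from pathlib import Path
from typing import Optional


def resolve_audio_file(filename: str, base_dir: Path) -> Optional[Path]:
    """Return the requested file if it lives inside base_dir, else None.

    Hypothetical helper: the name and signature are illustrative only.
    """
    base = base_dir.resolve()
    # Resolving collapses "../" segments and follows symlinks, so the
    # containment check below sees the real target of the request.
    candidate = (base / filename).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        # The request escaped the audio directory (e.g. "../secret" or "/etc/passwd").
        return None
    if not candidate.is_file():
        # Missing file, or the directory itself.
        return None
    return candidate
```

A route handler would return a 404 when this yields `None` and stream the file otherwise. Resolving before comparing is what makes the check robust against encoded or nested `..` segments.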
The problem showed up when we asked what happens if the voice service is down. The answer was: everything breaks. Every self.speak() call across every app fails. That isn't how the capability pattern is supposed to work. The whole point of capabilities having a provider chain is that a single outage shouldn't starve the feature — the chain is meant to fall through to the next provider, and finally to a human prompt. Speak had exactly one provider registered: voice-api, which was really three engines in disguise (Kokoro for fast local English, XTTS for voice cloning, an Edge alias path as a fallback inside the service itself). If the service process died, the capability reported itself unavailable even though the idea of fallback was still perfectly valid — there were other ways to turn text into sound, they just weren't wired in.
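The fall-through behaviour described above might look something like this. It is a sketch under assumptions: `Provider`, `execute`, and `NoProviderAvailable` are hypothetical names, and the real capability layer presumably does more (logging, consent checks, the human prompt at the end).

```python
from dataclasses import dataclass
from typing import Callable, List


class NoProviderAvailable(RuntimeError):
    """Raised when every provider in the chain was skipped or failed."""


@dataclass
class Provider:
    # Hypothetical shape of a chain entry.
    name: str
    available: Callable[[], bool]
    run: Callable[[str], str]  # text in, path of the audio file out


def execute(chain: List[Provider], text: str) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in chain:
        if not provider.available():
            errors.append(f"{provider.name}: unavailable")
            continue
        try:
            return provider.run(text)
        except Exception as exc:
            # A failing provider should not starve the capability; move on.
            errors.append(f"{provider.name}: {exc}")
    # In the pattern described here, this is where a human prompt would take over.
    raise NoProviderAvailable("; ".join(errors))
```

With a single registered provider the loop has nothing to fall through to, which is exactly the failure mode described: the pattern was intact, the chain was just one entry long.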
The fix was to stop pretending. We split the voice-api plugin so it registers three separate providers against the capability chains: kokoro and xtts for speak, and whisper for listen. Each has its own available() check that reads the service's health endpoint and reports on just one engine. A five-second TTL cache sits in front so the capability layer can hammer available() without hammering the service. If Kokoro's model files are missing, the kokoro provider reports unavailable while XTTS and Whisper keep working, and vice versa. The providers no longer die together.
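One way to picture the per-engine check behind a shared TTL cache is the sketch below. The class and function names are hypothetical, and the shape of the health payload (a flat mapping of engine name to boolean) is an assumption; the real endpoint may report richer data.

```python
import time
from typing import Callable, Dict, Optional


class CachedHealth:
    """Caches one health fetch for `ttl` seconds (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], Dict[str, bool]],
                 ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[Dict[str, bool]] = None
        self._expires = 0.0

    def get(self) -> Dict[str, bool]:
        now = self._clock()
        if self._value is None or now >= self._expires:
            try:
                self._value = self._fetch()
            except Exception:
                # Service unreachable: every engine reads as down until the next refresh.
                self._value = {}
            self._expires = now + self._ttl
        return self._value


def make_available(health: CachedHealth, engine: str) -> Callable[[], bool]:
    """Build an available() check that reports on a single engine."""
    return lambda: bool(health.get().get(engine, False))
```

All three providers can share one `CachedHealth` instance, so a burst of `available()` calls costs at most one request to the service per five seconds while each provider still answers only for its own engine.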
With the internal split done, we added two more providers to the chain from outside the service. A new edge-tts plugin runs the edge-tts pip package in-process — no subprocess, no HTTP hop, no model files to load — and registers at priority zero so it's tried first. A new openai-tts baseline provider lives under emptyos/capabilities/providers/ and is picked up automatically when OPENAI_API_KEY is set, subject to the same cloud consent gate every other cloud provider passes through. The listen capability got the same treatment — openai-whisper as a cloud baseline, local whisper through the voice service as the fallback. The chain is now [edge-tts, openai-tts, kokoro, xtts] for speak and [openai-whisper, whisper] for listen, each entry independent. The 🔊 button in the dictionary works with the voice service completely offline; edge-tts picks up the call and returns a valid MP3 in about 400 milliseconds.
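The ordering described above could come from a priority-sorted registry along these lines. Only the "priority zero is tried first" rule comes from the text; the `Registry` class and the other priority values (10, 20, 30) are invented for illustration.

```python
from typing import Dict, List, Tuple


class Registry:
    """Keeps one provider chain per capability, lowest priority number first."""

    def __init__(self) -> None:
        self._chains: Dict[str, List[Tuple[int, str]]] = {}

    def register(self, capability: str, name: str, priority: int) -> None:
        chain = self._chains.setdefault(capability, [])
        chain.append((priority, name))
        # list.sort is stable, so equal priorities keep registration order.
        chain.sort(key=lambda entry: entry[0])

    def chain(self, capability: str) -> List[str]:
        return [name for _, name in self._chains.get(capability, [])]


registry = Registry()
# Registration order is deliberately scrambled; priorities other than 0 are assumed.
registry.register("speak", "kokoro", 20)
registry.register("speak", "openai-tts", 10)
registry.register("speak", "edge-tts", 0)
registry.register("speak", "xtts", 30)
registry.register("listen", "whisper", 20)
registry.register("listen", "openai-whisper", 10)
```

Because order depends only on the priority number, plugins can load in any sequence and the chain still comes out the same.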
We got a smaller win while we were in the dictionary code. Looking up a word that's already been saved used to fire the LLM every time — a noticeable pause on a phone, a pointless cost on anything else. The vault already has the full definition, example, synonyms, etymology, Chinese gloss. So we taught the /api/lookup endpoint to read the saved note first, reshape its frontmatter and body sections into the same schema a fresh lookup would return, and skip the model entirely. A ?fresh=1 query parameter forces a real re-lookup when that's what you want; otherwise the UI shows a "↻ Refresh with AI" button on the result card instead of a "Save to Vault" button, because what would saving even do. Saved words are instant again.
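The vault-first lookup might be sketched as below. The note layout (a `frontmatter` dict plus a `sections` dict keyed by heading), the result field names, and the `source` marker are all assumptions made for the example, not the app's real schema.

```python
from typing import Callable, Dict, Optional


def note_to_result(note: Dict) -> Dict:
    """Reshape a saved note into the schema a fresh lookup would return."""
    front = note.get("frontmatter", {})
    sections = note.get("sections", {})
    synonyms = [s.strip() for s in sections.get("Synonyms", "").split(",") if s.strip()]
    return {
        "word": front.get("word", ""),
        "chinese": front.get("chinese", ""),
        "definition": sections.get("Definition", ""),
        "example": sections.get("Example", ""),
        "synonyms": synonyms,
        "etymology": sections.get("Etymology", ""),
    }


def lookup(word: str,
           fresh: bool,
           load_note: Callable[[str], Optional[Dict]],
           ask_model: Callable[[str], Dict]) -> Dict:
    """Serve from the vault when possible; call the model only when needed."""
    # fresh=True corresponds to the ?fresh=1 query parameter.
    note = None if fresh else load_note(word)
    if note is not None:
        return {**note_to_result(note), "source": "vault"}
    return {**ask_model(word), "source": "llm"}
```

A marker like `source` is one way the UI could decide between showing "↻ Refresh with AI" and "Save to Vault" on the result card.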
One housekeeping extract fell out of the split: the TTS temp directory constant, {tempdir}/emptyos-voice, had been duplicated in four places (the service, two providers, and the dictionary app's audio route). We pulled it into emptyos/capabilities/audio.py alongside a small MIME map and rewired the three in-package call sites to import it. The standalone voice service keeps its own copy on purpose; it runs as a subprocess that shouldn't couple itself to the emptyos package just to share a string. A classic three-against-one extract: three code copies consolidate, one outlier stays the outlier for a reason worth writing down.
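A shared module of this kind could be as small as the following. The constant and function names and the exact set of extensions are guesses; only the directory name and the idea of a MIME map come from the text.

```python
import tempfile
from pathlib import Path

# Shared location for generated audio, following the {tempdir}/emptyos-voice convention.
TTS_TEMP_DIR = Path(tempfile.gettempdir()) / "emptyos-voice"

# Small extension-to-MIME map; the entries listed here are assumed.
AUDIO_MIME = {
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".ogg": "audio/ogg",
}


def mime_for(filename: str) -> str:
    """Look up the MIME type by extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    return AUDIO_MIME.get(suffix, "application/octet-stream")
```

Providers and the audio route would import these names instead of rebuilding the path themselves, while the standalone service keeps its own literal.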
What's left open: the provider chain order is a reasonable default, but it is also a strong opinion. Someone who wants privacy-first and never-cloud should put the voice-service providers (kokoro, xtts) first and cut the OpenAI baseline; someone who wants quality-first on English should probably move openai-tts ahead of edge-tts. That sort of preference belongs in the providers app we built last session, the one that already handles the think capability's chain, but the UI still only edits think. When a second capability grows a reason to be reorderable from the browser (and speak is probably the first of those), the row schema will need a capability field and the save logic will need to learn which capability to mutate. Noted and deferred.