Voice becomes native

2026-04-15 · 6 min read devlog emptyos voice sdk

Audio had been second-class in EmptyOS for weeks. The speak capability existed, the voice-api plugin wrapped it, but the defaults were clunky: cloud-only Edge-TTS for most calls, an XTTS voice-clone branch for the handful of cases that needed it, and a silent 404 on the dedicated port because the plugin was still pointing at a legacy service from a different project. Fixing that at the plugin level triggered a series of moves that, by nightfall, made voice a first-class default.

Kokoro-82M became the default speak engine. Kokoro is local, free, higher quality than Edge, and lazy-loads a 325 MB ONNX model the first time it's needed. A new dispatch layer in voice-api routes kokoro:<id> explicitly, auto-detects bare IDs (af_heart, zf_xiaoxiao) by prefix, picks a sensible language-detected default when no voice is specified, and preserves Edge for the aliased voices that already existed (sarah, en-US-*). Edge-TTS stays as an automatic fallback on Kokoro failure. XTTS remains for custom voice cloning. The /health endpoint surfaces the full picture (default_engine, kokoro_available, kokoro_loaded, kokoro_voices). The root-cause port bug that had been firing TTS 404s for days turned out to be a host-port default pointing at a retired service — flipped from 8601 to 8602 and everything on the consumer side came alive.

Alongside that, a publish cover pipeline with a human gate. New POST /api/generate-cover renders an image through ComfyUI via the draw capability, saves it to the source media/ folder, and opens a preview modal with three actions: approve & embed, regenerate, reject. Two durable frontmatter fields — summary (the article's thesis in 2–3 sentences, reused for RSS and OG) and image_prompt (the visual brief) — are generated on first run by staff consult-agents and then editable in the note itself. Regenerate reuses the prompt verbatim unless the author clicks "Rewrite brief." This replaces a brittle earlier heuristic that just used the first 1200 characters of body text. The writer UI gained an explicit image-prompt textarea and — more important — started preserving unknown frontmatter fields through save, closing a silent data-loss bug where the editor was rebuilding frontmatter from six hard-coded keys and quietly wiping cover, featured, image_prompt, and anything else the user had added.

Slideshow podcasts got one more move: a social-shareable MP4 render via ffmpeg. The builder writes a concat file from scene timings, renders 1080×1080 H.264/AAC timed to the podcast audio, and drops a "Download as video" link under the interactive player. The interactive player stays primary — it has scrubber, crossfade, synced subtitles, and follows the audio. The MP4 exists for platforms (LinkedIn, X) that won't render an embed.

Sites also got per-site favicons and a search-engines toggle — a simple boolean in the site profile that, when off, writes <meta name="robots" content="noindex, nofollow"> on every page plus a Disallow: / robots.txt. When on with a configured domain, robots.txt includes a sitemap link. Useful for the EmptyOS site (which is public) versus draft or in-review sites (which shouldn't be indexed yet).

A different thread ran underneath: the assistant finally learned to remember its own conversations. Tracing a user report that "it treats each question as new" turned up a three-layer bug. The WebSocket path was flattening history into a labelled transcript with a hard 300-character cap per message. The REST /api/chat path sent zero history. And beneath both, BaseApp.think() and think_stream() only accepted a single prompt: str — so even a well-intentioned caller couldn't pass proper turns. Fix: the SDK grew a messages=[{role, content}, ...] parameter alongside the legacy single-prompt kwarg, threading through both providers. The OpenAI-compat provider passes messages through natively. The claude-cli provider flattens them into a labelled transcript (CLI only takes one -p arg) with the last user turn as the live question. The assistant then got a _build_chat_messages(session) helper that maps DB rows to the messages shape, drops slash-command meta replies, keeps the last 40 turns with no per-message truncation, and injects vault context into the current user turn only (not polluting history). Smoke test: a fresh session, a nonsense token in turn 1 (impossible to vault-search or guess), ask for it back in turn 2 — the model returns the exact string. Before the fix, the REST path couldn't have passed that test.

SDK discipline tightened in the same stretch. A deliberate /eos-sdk-extract pass ran against the whole codebase and surfaced one clean migration (three apps duplicating the HistoryStore append-and-keep-last-N pattern) plus a batch of candidates that didn't pass the two-callers-minimum rule. More useful than the extraction itself was the structural-duplicate scanner that came out of the skill-weakness conversation. Name-grep has good precision but thin recall — it misses copy-paste where callers renamed their locals, and it can't see inline duplication against a methodified helper elsewhere. So scripts/sdk_duplicate_scan.py landed: AST-based, ~140 lines, stdlib only. It parses every function, strips docstrings, renames all Name/arg locals to positional _v0.. placeholders in first-seen order (keeping self/cls, attribute names, literals), hashes the normalised unparse, groups by digest. On first run against 118 files it surfaced a legitimate core+apps duplicate — _weekly_path inlined twice in core journal and duplicated across three downstream apps — that would never have been caught by name-grep because core journal had no method name to search for. The eos-sdk-extract skill got a new Phase 1d citing the scanner.

Two smaller things worth recording. The voice-api plugin now restarts cleanly from both restart.bat and the plugin's auto_start(), with a stricter available() check that catches the port-conflict case explicitly instead of hanging. And restart.bat's headless-launch pattern (pushd + start /b with >nul 2>nul) became the canonical way to launch external services — any plugin that wraps a python-embedded service should follow it. CLAUDE.md got the pattern documented.

One strategic decision also landed: release-repo strategy. Making this repo public would expose pre-d0fcf2e commits that still contain personal data from before the privacy scanning was in place. Rather than git-filter-repo across a large history, we decided to publish EmptyOS through a separate public repo seeded from scripts/package-release.py <tier> output. The current repo stays private as emptyos-dev; a fresh public emptyos gets bootstrapped from the packaged tier with a clean history. The packaging script already guarantees no personal data leaks into dist/, so the public repo starts clean by construction.

Left for later: Kokoro loads a 325 MB model on first use — fine for a laptop, possibly too much for a small VPS. A streaming/remote voice-api mode would help. The scanner's recall is still just "structural equivalence" — it won't see semantic duplicates that differ in trivial control flow. And the test_sys_publish.py that the simplify pass keeps flagging still doesn't exist; the publish app is now carrying enough behaviour that its absence is starting to bite.

The shape of the day: rich media moved from "supported" to "first-class default," the assistant stopped being amnesic, and the SDK grew both a new primitive (messages) and a new discipline (AST duplicate scanning). Compounding effects — next time a session needs voice or multi-turn memory or a duplicate check, none of it has to be thought about again.

Splitting the big apps before they get bigger
2026-04-26
Voice gets a feedback loop
2026-04-26
The cron got poisoned because writes weren't validated
2026-04-25

← Back to posts

Voice becomes native

Related Posts