Most developer tools that speak use cloud TTS: send text to an API, get audio back. It works, but it means every utterance is an HTTP round-trip with variable latency, accumulating costs, and your text leaving your network. For tools that generate speech frequently (monitoring, notifications, dashboards), the economics and privacy tradeoffs get real fast.
Kokoro is a different approach. It's an 82-million parameter TTS model that runs locally. On a GPU, it renders a typical sentence to audio in about 50 milliseconds. On CPU, it takes about a second. No API keys, no per-request billing, no internet connection required after the initial model download. This post covers how we use it in Radio Agent and what it takes to integrate local TTS into your own tools.
## Why local TTS for developer tools
Cloud TTS services (ElevenLabs, Google Cloud TTS, Amazon Polly, Azure Speech) are excellent. The voice quality is often better than local models. But they come with three costs that matter for developer tooling:
- Latency. A cloud TTS round-trip adds 200-800ms to every utterance. For a notification system that speaks 30+ times per hour, that latency compounds into a perceptible delay between an event and hearing about it.
- Cost. Cloud TTS is priced per character. At 40 words per announcement and 30 announcements per hour over an 8-hour session, you're generating roughly 10,000 words — around 50,000 characters — per day. That's about $0.50-$2/day depending on the service. Not ruinous, but nonzero for a tool that should just work after install.
- Privacy. The text you send to cloud TTS contains your project names, error messages, agent identifiers, and workflow details. For open-source dev tools, requiring users to create accounts and send their activity data to third parties is a hard sell.
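The cost bullet's arithmetic can be sanity-checked in a few lines. The characters-per-word figure and the per-million-characters rate below are rough assumptions for illustration, not quotes from any provider:

```python
words_per_announcement = 40
announcements_per_hour = 30
hours_per_session = 8
chars_per_word = 5          # rough English average, spaces included (assumption)
usd_per_million_chars = 16  # representative neural cloud-TTS rate (assumption)

chars_per_day = (words_per_announcement * chars_per_word
                 * announcements_per_hour * hours_per_session)
cost_per_day = chars_per_day * usd_per_million_chars / 1_000_000

print(chars_per_day)           # 48000
print(round(cost_per_day, 2))  # 0.77
```

Pricier services (or character-heavy announcements) push the daily figure toward the top of the $0.50-$2 range.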
Local TTS eliminates all three. The model runs on the same machine as your tool. Latency is compute time only (50ms on GPU). Cost is zero after install. Privacy is total because nothing leaves the machine.
## Kokoro in practice
Kokoro is an 82M parameter model released under the Apache 2.0 license. It outputs 24kHz audio, supports multiple voices (male and female, various accents), and ships as a Python package. Installation is one command:
```bash
pip install kokoro soundfile
```
The basic render path is about ten lines of Python:
```python
from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipeline = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")

# The pipeline yields one audio chunk per segment; collect and concatenate
chunks = []
for _, _, audio in pipeline("Config module shipped with tests.", voice="af_heart"):
    chunks.append(audio)

combined = np.concatenate(chunks)
sf.write("output.wav", combined, 24000)
```
That produces a WAV file with natural-sounding speech. The first call downloads the model (~170MB) and warms up the pipeline. Subsequent calls are fast: about 50ms per sentence on a modern GPU, 800ms-1.5s on CPU.
## How Radio Agent uses Kokoro
Radio Agent wraps Kokoro in a TTSEngine protocol that handles warmup, error recovery, and multi-voice support. The full render pipeline from webhook to audible speech looks like this:
- An agent POSTs JSON to the `/announce` webhook with event details
- The brain's script generator translates the event into a natural sentence (or the DJ skill rewrites it creatively)
- Kokoro renders the sentence to a WAV file in `/tmp/agent-radio/`
- The brain validates the WAV (duration bounds, silence check via RMS)
- The brain pushes the WAV path to Liquidsoap via a Unix socket command
- Liquidsoap ducks the music, plays the voice clip, and restores the music
- A timer deletes the WAV file after 60 seconds
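The validation step above can be sketched with the standard library alone. The function name and thresholds here are illustrative, not Radio Agent's actual values:

```python
import math
import struct
import wave

def validate_wav(path, min_s=0.5, max_s=30.0, min_rms=200.0):
    """Illustrative sanity gate: reject clips that are too short,
    too long, or effectively silent (hypothetical thresholds)."""
    with wave.open(path, "rb") as w:
        rate, frames = w.getframerate(), w.getnframes()
        raw = w.readframes(frames)
    duration = frames / rate
    if not min_s <= duration <= max_s:
        return False
    # RMS over 16-bit signed samples; near-zero means a silent render
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= min_rms

# Demo: a one-second 440 Hz tone at 24 kHz passes both checks
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * t / 24000)))
        for t in range(24000)))
print(validate_wav("tone.wav"))  # True
```

A silent render (all-zero samples) fails the RMS floor even when its duration looks plausible, which catches the most common TTS failure mode.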
End-to-end latency from webhook POST to audible speech is typically under 2 seconds on GPU, with Kokoro accounting for about 50ms of that. The rest is Liquidsoap buffer flushing and Icecast encoding.
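The socket push in step 5 can be sketched as follows. The source id (`voice`), socket path, and reply format are placeholders for whatever the actual Liquidsoap script defines; a stub server stands in for Liquidsoap so the sketch runs anywhere:

```python
import os
import socket
import socketserver
import tempfile
import threading

def push_wav(sock_path: str, wav_path: str) -> str:
    """Queue a clip over a Liquidsoap-style Unix control socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(f"voice.push {wav_path}\n".encode())
        return s.recv(1024).decode()

class Stub(socketserver.StreamRequestHandler):
    """Stand-in for Liquidsoap's control socket."""
    def handle(self):
        self.rfile.readline()          # consume the push command
        self.wfile.write(b"1\nEND\n")  # Liquidsoap-style request-id reply

sock_path = os.path.join(tempfile.mkdtemp(), "radio.sock")
server = socketserver.UnixStreamServer(sock_path, Stub)
threading.Thread(target=server.handle_request, daemon=True).start()

reply = push_wav(sock_path, "/tmp/agent-radio/clip.wav")
print(reply.strip())
server.server_close()
```

In the real pipeline the brain would parse the returned request id and log a failure if the push is rejected.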
## Multiple voices for context
One of Kokoro's practical strengths is multi-voice support. Each voice is a separate voicepack that can be loaded alongside the default. In Radio Agent, we use this for event differentiation:
- Normal announcements (completions, status updates) use `af_heart`, a female voice with warm intonation
- Failures and blockers use `am_michael`, a male voice with more neutral delivery
The voice switch happens automatically based on the event kind. The effect is subtle but powerful: after a few hours, your ear recognizes "male voice = something went wrong" before your brain processes the words. It's the audio equivalent of red vs. green status indicators.
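The dispatch itself is a one-liner in spirit. The event-kind names below are hypothetical stand-ins for Radio Agent's actual event schema:

```python
FAILURE_KINDS = {"failure", "blocker"}  # hypothetical event kinds

def pick_voice(event_kind: str) -> str:
    # Failures get the neutral male voice; everything else the warm default
    return "am_michael" if event_kind in FAILURE_KINDS else "af_heart"

print(pick_voice("failure"))     # am_michael
print(pick_voice("completion"))  # af_heart
```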
Both voices are warmed up on startup so the first failure announcement doesn't pay a cold-start penalty:
```python
pipeline = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")

# Warm up default voice
for _ in pipeline("warm up", voice="af_heart"):
    pass

# Warm up failure voice
for _ in pipeline("warm up", voice="am_michael"):
    pass
```
## Tradeoffs vs. cloud TTS
| | Cloud TTS | Kokoro (local) |
|---|---|---|
| Voice quality | Excellent. ElevenLabs leads. | Good. Natural enough for notifications. |
| Latency | 200-800ms (network) | 50ms GPU, ~1s CPU |
| Cost | $0.50-2/day at notification scale | Free after install |
| Privacy | Text sent to third party | Nothing leaves the machine |
| Offline | No | Yes |
| Voices | Hundreds, highly customizable | Dozens, less variety |
| Emotion/prosody | Advanced (ElevenLabs, Azure) | Basic. No emotion tags. |
| Model size | N/A (cloud) | ~170MB download |
| GPU required | No | No (but 15x faster with one) |
The honest assessment: cloud TTS sounds better, especially for long-form content or emotional delivery. Kokoro sounds good enough for the notification use case, where utterances are 5-15 seconds and clarity matters more than expressiveness. If you're building a podcast generator or an audiobook engine, cloud TTS is probably worth the cost. If you're building a status notifier that speaks 30 times an hour, local TTS saves you money and latency without a meaningful quality tradeoff.
## What's next for local TTS
The local TTS landscape is moving fast. Models like Fish Speech and Orpheus 3B offer better prosody and emotion control than Kokoro, but at higher compute cost (5-10 seconds per clip instead of 50ms). For Radio Agent's use case, that latency is too high: the TTS evaluation showed that sub-second render time is critical for the event-to-audio pipeline to feel responsive.
As local models improve, the quality gap with cloud services will narrow. Kokoro-82M was released in late 2024 and it's already good enough for production notification use. The next generation of lightweight TTS models will likely match cloud quality for short-form speech within a year or two.
For now, Kokoro hits the sweet spot: fast enough for real-time use, good enough that listeners don't cringe, and simple enough that `pip install kokoro` is the entire setup.