Self-hosted TTS for developer tools with Kokoro

Most developer tools that speak use cloud TTS: send text to an API, get audio back. It works, but it means every utterance is an HTTP round-trip with variable latency, accumulating costs, and your text leaving your network. For tools that generate speech frequently (monitoring, notifications, dashboards), the economics and privacy tradeoffs get real fast.

Kokoro is a different approach. It's an 82-million parameter TTS model that runs locally. On a GPU, it renders a typical sentence to audio in about 50 milliseconds. On CPU, it takes about a second. No API keys, no per-request billing, no internet connection required after the initial model download. This post covers how we use it in Radio Agent and what it takes to integrate local TTS into your own tools.

Why local TTS for developer tools

Cloud TTS services (ElevenLabs, Google Cloud TTS, Amazon Polly, Azure Speech) are excellent. The voice quality is often better than local models. But they come with three costs that matter for developer tooling: latency (every utterance is a network round-trip, typically 200-800ms), money (per-character billing adds up when a tool speaks dozens of times an hour), and privacy (every piece of text leaves your network).

Local TTS eliminates all three. The model runs on the same machine as your tool. Latency is compute time only (about 50ms on GPU). Cost is zero after install. Privacy is total because nothing leaves the machine.
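To put the cost point in rough numbers: a back-of-envelope sketch, assuming a notifier that speaks 30 times an hour around the clock, ~100-character utterances, and a representative neural-voice price of $16 per million characters (your provider's rate will differ):

```python
# Back-of-envelope cloud TTS cost for a chatty notifier.
# All three inputs are illustrative assumptions, not quotes:
#   - 30 utterances/hour, around the clock
#   - ~100 characters per utterance
#   - $16 per 1M characters (a common neural-voice rate)

utterances_per_day = 30 * 24
chars_per_utterance = 100
price_per_char = 16 / 1_000_000

daily_cost = utterances_per_day * chars_per_utterance * price_per_char
print(f"${daily_cost:.2f}/day")  # prints $1.15/day at these assumptions
```

At these assumptions the bill lands squarely in the $0.50-2/day range quoted in the comparison table; local rendering makes that line item zero.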

Kokoro in practice

Kokoro is an 82M parameter model released under the Apache 2.0 license. It outputs 24kHz audio, supports multiple voices (male and female, various accents), and ships as a Python package. Installation is one command:

pip install kokoro soundfile

The basic render path is about ten lines of Python:

from kokoro import KPipeline
import numpy as np
import soundfile as sf

# lang_code="a" selects American English; the model weights download on first use
pipeline = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")

# The pipeline is a generator yielding (graphemes, phonemes, audio) per chunk
chunks = []
for _, _, audio in pipeline("Config module shipped with tests.", voice="af_heart"):
    chunks.append(audio)

combined = np.concatenate(chunks)
sf.write("output.wav", combined, 24000)  # Kokoro outputs 24kHz audio

That produces a WAV file with natural-sounding speech. The first call downloads the model (~170MB) and warms up the pipeline. Subsequent calls are fast: about 50ms per sentence on a modern GPU, 800ms-1.5s on CPU.
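Because the pipeline renders chunk by chunk, longer announcements benefit from being split into sentences first, so the first chunk can start playing while the rest renders. A minimal splitter along these lines (a hypothetical helper, not part of the kokoro package, which also does its own internal chunking):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter for feeding a TTS pipeline chunk by chunk.

    Hypothetical helper: splits on sentence-ending punctuation
    followed by whitespace. Good enough for notification text.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Build passed. Deploying to staging now. ETA two minutes."))
```

Each returned sentence can then be passed to the pipeline independently, keeping per-chunk render time near that 50ms floor.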

How Radio Agent uses Kokoro

Radio Agent wraps Kokoro in a TTSEngine protocol that handles warmup, error recovery, and multi-voice support. The full render path runs webhook POST → Kokoro render → Liquidsoap buffer → Icecast stream.
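The protocol itself can be sketched with `typing.Protocol`. This is a hedged illustration of the interface shape, not Radio Agent's actual code; the method names and the `FakeEngine` stand-in are hypothetical:

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class TTSEngine(Protocol):
    """Sketch of a TTS engine interface (hypothetical names)."""

    def warmup(self) -> None:
        """Render a throwaway utterance so the first real call is fast."""
        ...

    def render(self, text: str, voice: str) -> np.ndarray:
        """Return 24kHz mono float audio for the given text."""
        ...


class FakeEngine:
    """Trivial stand-in used here only to show the protocol shape."""

    def warmup(self) -> None:
        pass

    def render(self, text: str, voice: str) -> np.ndarray:
        return np.zeros(24000, dtype=np.float32)  # one second of silence


engine: TTSEngine = FakeEngine()
audio = engine.render("Build passed.", voice="af_heart")
```

Keeping the engine behind a protocol means the Kokoro-backed implementation can be swapped for a fake in tests, or for a different model later, without touching the rest of the pipeline.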

End-to-end latency from webhook POST to audible speech is typically under 2 seconds on GPU, with Kokoro accounting for about 50ms of that. The rest is Liquidsoap buffer flushing and Icecast encoding.

Multiple voices for context

One of Kokoro's practical strengths is multi-voice support. Each voice is a small voice file (a style embedding) loaded alongside the shared model. In Radio Agent, we use this for event differentiation: routine events are announced in the default female voice, and failures switch to a male voice.

The voice switch happens automatically based on the event kind. The effect is subtle but powerful: after a few hours, your ear recognizes "male voice = something went wrong" before your brain processes the words. It's the audio equivalent of red vs. green status indicators.
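The switch can be as simple as a dict keyed by event kind. A hypothetical sketch: `af_heart` and `am_michael` are real Kokoro voice names, but the event kinds and the mapping are illustrative:

```python
# Hypothetical event-kind -> voice mapping. The voice names are real
# Kokoro voices; the event kinds are illustrative examples.
DEFAULT_VOICE = "af_heart"

VOICE_BY_KIND = {
    "failure": "am_michael",  # male voice: something went wrong
    "error": "am_michael",
    "success": "af_heart",    # default female voice for routine events
    "deploy": "af_heart",
}


def voice_for(kind: str) -> str:
    """Pick a voice for an event, falling back to the default."""
    return VOICE_BY_KIND.get(kind, DEFAULT_VOICE)


print(voice_for("failure"))    # am_michael
print(voice_for("heartbeat"))  # unknown kind falls back to af_heart
```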

Both voices are warmed up on startup so the first failure announcement doesn't pay a cold-start penalty:

pipeline = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")

# Warm up default voice
for _ in pipeline("warm up", voice="af_heart"):
    pass

# Warm up failure voice
for _ in pipeline("warm up", voice="am_michael"):
    pass

Tradeoffs vs. cloud TTS

                   Cloud TTS                           Kokoro (local)
Voice quality      Excellent; ElevenLabs leads         Good; natural enough for notifications
Latency            200-800ms (network)                 50ms GPU, ~1s CPU
Cost               $0.50-2/day at notification scale   Free after install
Privacy            Text sent to third party            Nothing leaves the machine
Offline            No                                  Yes
Voices             Hundreds, highly customizable       Dozens, less variety
Emotion/prosody    Advanced (ElevenLabs, Azure)        Basic; no emotion tags
Model size         N/A (cloud)                         ~170MB download
GPU required       No                                  No (but ~15x faster with one)

The honest assessment: cloud TTS sounds better, especially for long-form content or emotional delivery. Kokoro sounds good enough for the notification use case, where utterances are 5-15 seconds and clarity matters more than expressiveness. If you're building a podcast generator or an audiobook engine, cloud TTS is probably worth the cost. If you're building a status notifier that speaks 30 times an hour, local TTS saves you money and latency without a meaningful quality tradeoff.

What's next for local TTS

The local TTS landscape is moving fast. Models like Fish Speech and Orpheus 3B offer better prosody and emotion control than Kokoro, but at higher compute cost (5-10 seconds per clip instead of 50ms). For Radio Agent's use case, that latency is too high: the TTS evaluation showed that sub-second render time is critical for the event-to-audio pipeline to feel responsive.

As local models improve, the quality gap with cloud services will narrow. Kokoro-82M was released in late 2024 and it's already good enough for production notification use. The next generation of lightweight TTS models will likely match cloud quality for short-form speech within a year or two.

For now, Kokoro hits the sweet spot: fast enough for real-time use, good enough that listeners don't cringe, and simple enough that pip install kokoro is the entire setup.
