Self-hosted TTS for developer tools with Kokoro

Most developer tools that speak use cloud TTS: send text to an API, get audio back. It works, but it means every utterance is an HTTP round-trip with variable latency, accumulating costs, and your text leaving your network. For tools that generate speech frequently (monitoring, notifications, dashboards), the economics and privacy tradeoffs get real fast.

Kokoro is a different approach. It's an 82-million parameter TTS model that runs locally. On a GPU, it renders a typical sentence to audio in about 50 milliseconds. On CPU, it takes about a second. No API keys, no per-request billing, no internet connection required after the initial model download. This post covers how we use it in Radio Agent and what it takes to integrate local TTS into your own tools.

Why local TTS for developer tools

Cloud TTS services (ElevenLabs, Google Cloud TTS, Amazon Polly, Azure Speech) are excellent. The voice quality is often better than local models. But they come with three costs that matter for developer tooling: latency (every utterance is a network round-trip, typically 200-800ms), money (per-character billing adds up when a tool speaks dozens of times an hour), and privacy (every piece of text leaves your network).

Local TTS eliminates all three. The model runs on the same machine as your tool. Latency is compute time only (about 50ms on GPU). Cost is zero after install. Privacy is total because nothing leaves the machine.
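To put the cost point in rough numbers: a back-of-envelope sketch, assuming a notifier that speaks 30 times an hour around the clock, ~100-character utterances, and a representative neural-voice price of $16 per million characters (your provider's rate will differ):

```python
# Back-of-envelope cloud TTS cost for a chatty notifier.
# All three inputs are illustrative assumptions, not quotes:
#   - 30 utterances/hour, around the clock
#   - ~100 characters per utterance
#   - $16 per 1M characters (a common neural-voice rate)

utterances_per_day = 30 * 24
chars_per_utterance = 100
price_per_char = 16 / 1_000_000

daily_cost = utterances_per_day * chars_per_utterance * price_per_char
print(f"${daily_cost:.2f}/day")  # prints $1.15/day at these assumptions
```

At these assumptions the bill lands squarely in the $0.50-2/day range quoted in the comparison table; local rendering makes that line item zero.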

Kokoro in practice

Kokoro is an 82M parameter model released under the Apache 2.0 license. It outputs 24kHz audio, supports multiple voices (male and female, various accents), and ships as a Python package. Installation is one command:

pip install kokoro soundfile

The basic render path is about ten lines of Python:

from kokoro import KPipeline
import numpy as np
import soundfile as sf

# lang_code="a" selects American English; the model weights download on first use
pipeline = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")

# The pipeline is a generator yielding (graphemes, phonemes, audio) per chunk
chunks = []
for _, _, audio in pipeline("Config module shipped with tests.", voice="af_heart"):
    chunks.append(audio)

combined = np.concatenate(chunks)
sf.write("output.wav", combined, 24000)  # Kokoro outputs 24kHz audio

That produces a WAV file with natural-sounding speech. The first call downloads the model (~170MB) and warms up the pipeline. Subsequent calls are fast: about 50ms per sentence on a modern GPU, 800ms-1.5s on CPU.
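Because the pipeline renders chunk by chunk, longer announcements benefit from being split into sentences first, so the first chunk can start playing while the rest renders. A minimal splitter along these lines (a hypothetical helper, not part of the kokoro package, which also does its own internal chunking):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter for feeding a TTS pipeline chunk by chunk.

    Hypothetical helper: splits on sentence-ending punctuation
    followed by whitespace. Good enough for notification text.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Build passed. Deploying to staging now. ETA two minutes."))
```

Each returned sentence can then be passed to the pipeline independently, keeping per-chunk render time near that 50ms floor.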

How Radio Agent uses Kokoro

Radio Agent wraps Kokoro in a TTSEngine protocol that handles warmup, error recovery, and multi-voice support. The full render path runs webhook POST → Kokoro render → Liquidsoap buffer → Icecast stream.
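The protocol itself can be sketched with `typing.Protocol`. This is a hedged illustration of the interface shape, not Radio Agent's actual code; the method names and the `FakeEngine` stand-in are hypothetical:

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class TTSEngine(Protocol):
    """Sketch of a TTS engine interface (hypothetical names)."""

    def warmup(self) -> None:
        """Render a throwaway utterance so the first real call is fast."""
        ...

    def render(self, text: str, voice: str) -> np.ndarray:
        """Return 24kHz mono float audio for the given text."""
        ...


class FakeEngine:
    """Trivial stand-in used here only to show the protocol shape."""

    def warmup(self) -> None:
        pass

    def render(self, text: str, voice: str) -> np.ndarray:
        return np.zeros(24000, dtype=np.float32)  # one second of silence


engine: TTSEngine = FakeEngine()
audio = engine.render("Build passed.", voice="af_heart")
```

Keeping the engine behind a protocol means the Kokoro-backed implementation can be swapped for a fake in tests, or for a different model later, without touching the rest of the pipeline.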

End-to-end latency from webhook POST to audible speech is typically under 2 seconds on GPU, with Kokoro accounting for about 50ms of that. The rest is Liquidsoap buffer flushing and Icecast encoding.

Multiple voices for context

One of Kokoro's practical strengths is multi-voice support. Each voice is a small voice file (a style embedding) loaded alongside the shared model. In Radio Agent, we use this for event differentiation: routine events are announced in the default female voice, and failures switch to a male voice.

The voice switch happens automatically based on the event kind. The effect is subtle but powerful: after a few hours, your ear recognizes "male voice = something went wrong" before your brain processes the words. It's the audio equivalent of red vs. green status indicators.
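The switch can be as simple as a dict keyed by event kind. A hypothetical sketch: `af_heart` and `am_michael` are real Kokoro voice names, but the event kinds and the mapping are illustrative:

```python
# Hypothetical event-kind -> voice mapping. The voice names are real
# Kokoro voices; the event kinds are illustrative examples.
DEFAULT_VOICE = "af_heart"

VOICE_BY_KIND = {
    "failure": "am_michael",  # male voice: something went wrong
    "error": "am_michael",
    "success": "af_heart",    # default female voice for routine events
    "deploy": "af_heart",
}


def voice_for(kind: str) -> str:
    """Pick a voice for an event, falling back to the default."""
    return VOICE_BY_KIND.get(kind, DEFAULT_VOICE)


print(voice_for("failure"))    # am_michael
print(voice_for("heartbeat"))  # unknown kind falls back to af_heart
```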

Both voices are warmed up on startup so the first failure announcement doesn't pay a cold-start penalty:

pipeline = KPipeline(lang_code="a", repo_id="hexgrad/Kokoro-82M")

# Warm up default voice
for _ in pipeline("warm up", voice="af_heart"):
    pass

# Warm up failure voice
for _ in pipeline("warm up", voice="am_michael"):
    pass

Tradeoffs vs. cloud TTS

                   Cloud TTS                           Kokoro (local)
Voice quality      Excellent; ElevenLabs leads         Good; natural enough for notifications
Latency            200-800ms (network)                 50ms GPU, ~1s CPU
Cost               $0.50-2/day at notification scale   Free after install
Privacy            Text sent to third party            Nothing leaves the machine
Offline            No                                  Yes
Voices             Hundreds, highly customizable       Dozens, less variety
Emotion/prosody    Advanced (ElevenLabs, Azure)        Basic; no emotion tags
Model size         N/A (cloud)                         ~170MB download
GPU required       No                                  No (but ~15x faster with one)

The honest assessment: cloud TTS sounds better, especially for long-form content or emotional delivery. Kokoro sounds good enough for the notification use case, where utterances are 5-15 seconds and clarity matters more than expressiveness. If you're building a podcast generator or an audiobook engine, cloud TTS is probably worth the cost. If you're building a status notifier that speaks 30 times an hour, local TTS saves you money and latency without a meaningful quality tradeoff.

What's next for local TTS

The local TTS landscape is moving fast. Models like Fish Speech and Orpheus 3B offer better prosody and emotion control than Kokoro, but at higher compute cost (5-10 seconds per clip instead of 50ms). For Radio Agent's use case, that latency is too high: the TTS evaluation showed that sub-second render time is critical for the event-to-audio pipeline to feel responsive.

As local models improve, the quality gap with cloud services will narrow. Kokoro-82M was released in late 2024 and it's already good enough for production notification use. The next generation of lightweight TTS models will likely match cloud quality for short-form speech within a year or two.

For now, Kokoro hits the sweet spot: fast enough for real-time use, good enough that listeners don't cringe, and simple enough that pip install kokoro is the entire setup.
