Teardown · 11 min

ElevenLabs voice agents: what real latency actually costs

ElevenLabs makes the best-sounding AI voices most people have heard, and it has turned that into a full conversational-agent platform — turn-taking, telephony, RAG, 70+ languages, the works. The demos are spellbinding. The thing founders learn the hard way is that the naive integration everyone ships first feels laggy, costs more than expected, and is hard to trust in production — not because ElevenLabs is bad, but because a voice agent is a pipeline, and the pipeline is where the hard parts live.

We took a typical production ElevenLabs voice-agent build into the Lab. Here's what the platform nails, where the real-world experience settles, and how we'd architect a voice agent that's actually fast, predictable, and observable.

01 · The premise

ElevenLabs Conversational AI (ElevenAgents) is a developer platform for real-time voice agents: speech-to-text, an LLM, expressive text-to-speech, turn-taking models, telephony, and RAG over your knowledge base, deployable to phone or web. The TTS itself is genuinely fast — flagship low-latency models stream first audio in roughly 75–100ms. Pricing is credit-based, with conversational agents billed per minute on the higher tiers.

We picked it because voice agents are one of the most-requested AI features right now, and ElevenLabs is the default starting point. Understanding where a real deployment strains is useful to anyone putting a voice in front of customers.

02 · What they got right

The voice quality is the moat, and it's real. Expressive, natural, multilingual TTS — with recent models carrying a speaker's emotion across languages — is a meaningful experience advantage in a category where robotic voices kill trust instantly. Folding the full stack (STT, LLM, TTS, turn-taking, telephony, RAG) into one platform genuinely lowers the barrier to a working agent. The developer surface and API are clean. And the low-latency TTS models are a legitimate engineering achievement.

If the job is "get an impressive voice agent talking this week," ElevenLabs is the fastest path. The gap is everything between "talking" and "in production."

03 · Where they settled

"Sub-100ms" is the TTS number, not the latency a caller feels

This is the most common trap. End-to-end conversational latency is the sum of the pipeline: speech-to-text + turn-detection + your LLM's thinking time + time-to-first-audio + network. The TTS can be 75ms and the conversation can still feel sluggish because a slow LLM or naive turn-taking added a second somewhere else. Teams optimize the number on the box and ship something that still feels like talking to a hold line.
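To make the arithmetic concrete, here is a minimal sketch of that sum. The per-segment numbers are illustrative assumptions, not measurements from any ElevenLabs deployment, and the segment names are ours.

```python
# Illustrative per-segment latencies in seconds (assumed values, not measurements).
segments = {
    "stt": 0.30,
    "turn_detection": 0.40,
    "llm": 1.20,
    "tts_first_audio": 0.075,
    "network": 0.15,
}


def end_to_end_latency(segments: dict) -> float:
    """Perceived latency is the sum of every pipeline segment."""
    return sum(segments.values())


total = end_to_end_latency(segments)
tts_share = segments["tts_first_audio"] / total
print(f"total: {total:.3f}s, TTS share of total: {tts_share:.1%}")
```

With these assumed numbers the caller waits a little over two seconds, and TTS accounts for under 4% of it.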

The bill surprises people — usually upward

Conversational AI is billed per minute, and that line routinely dwarfs the TTS / "voiceover" line teams budget for. Worse, you pay for the whole conversation duration, including dead air and hold time, unless you've configured auto-hangup on silence. LLM usage is deducted from credits and varies by model; telephony is billed separately at cost. The result is a bill that's hard to forecast right when you're scaling traffic.
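A rough way to see how dead air moves the bill, sketched under loud assumptions: the per-minute rate below is a placeholder and not ElevenLabs' actual pricing, the CallProfile type and monthly_agent_cost function are hypothetical, and we pretend auto-hangup removes all billed silence, which a real configuration would only approximate. LLM and telephony charges are left out.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    talk_minutes: float      # minutes of actual conversation per call
    dead_air_minutes: float  # silence or hold time that is still billed


def monthly_agent_cost(calls: int, profile: CallProfile,
                       rate_per_minute: float, auto_hangup: bool) -> float:
    """Per-minute agent cost only; excludes LLM credits and telephony."""
    billed = profile.talk_minutes
    if not auto_hangup:
        # Simplification: without auto-hangup, all dead air is billed.
        billed += profile.dead_air_minutes
    return calls * billed * rate_per_minute


PLACEHOLDER_RATE = 0.10  # assumed USD per minute, NOT a published price
profile = CallProfile(talk_minutes=3.0, dead_air_minutes=1.0)

without_hangup = monthly_agent_cost(10_000, profile, PLACEHOLDER_RATE, auto_hangup=False)
with_hangup = monthly_agent_cost(10_000, profile, PLACEHOLDER_RATE, auto_hangup=True)
print(f"without auto-hangup: {without_hangup:.2f}, with: {with_hangup:.2f}")
```

In this made-up scenario, one idle minute per call inflates the agent line by a third before LLM and telephony are even counted.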

Thin production monitoring

Out of the box there's limited visibility into whether the agent is actually following instructions, calling tools correctly, hallucinating, or drifting on latency — which is why a cottage industry of third-party eval / monitoring layers exists to sit on top of it. For a customer-facing agent, "we can't see what it's doing in production" is not a small gap.

Cost and latency pull in opposite directions, by design

The most expressive voices cost the most and aren't always the fastest; the fast, cheap models are less expressive. Picking one model for the whole agent means over-paying, under-performing, or both.

04 · The rebuild

Keep ElevenLabs for what it's unmatched at — the voice. Rebuild the pipeline around it for latency, cost predictability, and observability. The rebuild here isn't replacing the platform; it's the production architecture nobody ships on day one.

1. Budget latency per segment, not in total

Set a target (say, sub-1.5s perceived) and allocate it across STT, turn-detection, LLM, and TTS — then optimize the segment that's actually blowing it (usually the LLM), not the one that's already fast (TTS).
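A per-segment budget can be as simple as a dictionary and a comparison. This is a minimal sketch: the split of the 1.5s target and the "measured" values are assumptions chosen for illustration, and over_budget is a hypothetical helper.

```python
TARGET_S = 1.5

# Assumed allocation of the target across segments (sums to TARGET_S).
budget = {"stt": 0.3, "turn_detection": 0.3, "llm": 0.7, "tts_first_audio": 0.2}

# Assumed measurements from an instrumented pipeline, in seconds.
measured = {"stt": 0.28, "turn_detection": 0.35, "llm": 1.40, "tts_first_audio": 0.09}


def over_budget(budget: dict, measured: dict) -> list:
    """Return (segment, overrun_seconds) pairs, worst offender first."""
    overruns = {
        name: measured[name] - limit
        for name, limit in budget.items()
        if measured[name] > limit
    }
    return sorted(overruns.items(), key=lambda item: item[1], reverse=True)


assert abs(sum(budget.values()) - TARGET_S) < 1e-9  # the budget must add up
for name, overrun in over_budget(budget, measured):
    print(f"{name}: over by {overrun:.2f}s")
```

Here the LLM overshoots its slice by 0.7s while TTS comes in well under budget, which is the usual shape of the problem.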

2. Route the LLM by turn difficulty

A fast model handles the routine majority of turns; a frontier model is reserved for the genuinely hard ones.

Pipeline segment            | Candidate                  | Est. latency contribution | Est. cost driver
----------------------------|----------------------------|---------------------------|-----------------------
Speech-to-text              | Fast streaming STT         | ~0.2–0.5s                 | per minute
LLM: routine turns          | Small / fast model         | ~0.3–0.8s                 | low per turn
LLM: hard turns (minority)  | Mid–top frontier           | ~1.5–3s                   | higher per turn
TTS first audio             | Low-latency expressive TTS | ~0.075–0.15s              | per character / minute

Planning-stage estimates, not a benchmark. The LLM is the biggest lever on both felt latency and cost — exactly the segment the "sub-100ms" framing distracts from.
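One way to sketch the routing decision is a cheap heuristic that runs before the LLM call. Everything here is an assumption for illustration: the model identifiers, the keyword list, the 40-word threshold, and the Turn fields are hypothetical, and a production router might use a small classifier instead.

```python
from dataclasses import dataclass

FAST_MODEL = "fast-small-model"    # hypothetical identifier
FRONTIER_MODEL = "frontier-model"  # hypothetical identifier

# Assumed signals that a turn needs more careful reasoning.
HARD_HINTS = ("refund", "dispute", "cancel", "legal", "complaint")


@dataclass
class Turn:
    text: str
    tool_calls_expected: int = 0  # how many tools this turn likely needs
    prior_failures: int = 0       # times the caller had to repeat or correct


def pick_model(turn: Turn) -> str:
    """Send routine turns to the fast model, escalate the hard minority."""
    text = turn.text.lower()
    is_hard = (
        len(text.split()) > 40
        or any(hint in text for hint in HARD_HINTS)
        or turn.tool_calls_expected > 1
        or turn.prior_failures > 0
    )
    return FRONTIER_MODEL if is_hard else FAST_MODEL


print(pick_model(Turn("What are your opening hours?")))
print(pick_model(Turn("I want to dispute a charge on my account")))
```

The check itself has to be far cheaper than the latency it saves, which is why this sketch sticks to string and counter tests.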

3. Put cost controls in from day one

Auto-hangup on silence, per-call duration caps, model-tier limits, and a real-time spend dashboard — so the bill scales with value, not with dead air.
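The first two controls reduce to a check that runs on every tick of a call. This sketch shows only the decision logic and calls no ElevenLabs API; CallGuard, its thresholds, and the reason strings are hypothetical, and where the platform offers an equivalent setting that is usually the better place to enforce it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallGuard:
    max_silence_s: float = 15.0  # assumed silence threshold
    max_call_s: float = 600.0    # assumed per-call duration cap

    def hangup_reason(self, now: float, call_started: float,
                      last_speech: float) -> Optional[str]:
        """Return why the call should end, or None to keep it open."""
        if now - last_speech >= self.max_silence_s:
            return "silence"
        if now - call_started >= self.max_call_s:
            return "duration_cap"
        return None


guard = CallGuard()
# 100s into the call, caller last spoke 20s ago: end the call.
print(guard.hangup_reason(now=100.0, call_started=0.0, last_speech=80.0))
```

Returning a reason rather than a bare boolean makes it possible to chart how many minutes each control is saving on the spend dashboard.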

4. Add an eval and monitoring layer

Log and score every conversation for instruction-following, tool-call correctness, latency, and hallucination — before launch and continuously after. You can't run a customer-facing agent you can't observe.
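As a sketch of what "score every conversation" might look like in data terms: the TurnLog fields and score_conversation are hypothetical, and the boolean labels are assumed to come from some upstream grader (human review or an automated check) that is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnLog:
    latency_s: float
    followed_instructions: bool
    tool_call_ok: Optional[bool]  # None when the turn made no tool call
    hallucinated: bool


def rate(flags: list) -> Optional[float]:
    """Fraction of True values, or None if there is nothing to score."""
    return sum(flags) / len(flags) if flags else None


def score_conversation(turns: list, latency_target_s: float = 1.5) -> dict:
    tool_results = [t.tool_call_ok for t in turns if t.tool_call_ok is not None]
    return {
        "instruction_following": rate([t.followed_instructions for t in turns]),
        "tool_call_correct": rate(tool_results),
        "hallucination_rate": rate([t.hallucinated for t in turns]),
        "within_latency_target": rate([t.latency_s <= latency_target_s for t in turns]),
    }


turns = [
    TurnLog(0.9, True, None, False),
    TurnLog(1.2, True, True, False),
    TurnLog(2.4, False, False, False),
    TurnLog(1.1, True, True, True),
]
print(score_conversation(turns))
```

Running the same scorer on a fixed test set before launch and on sampled live calls afterward gives one comparable set of numbers for both phases.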

05 · The 6-week plan

What we'd cut, and how we'd ship it.

Week 1

Latency budget & baseline

Instrument the full pipeline; measure where the seconds actually go. Set the perceived-latency target.
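Instrumentation can start as a small timer wrapped around each stage. This is a minimal sketch: PipelineTimer is a hypothetical helper, and the sleep calls stand in for real STT, LLM, and TTS requests.

```python
import time
from contextlib import contextmanager


class PipelineTimer:
    """Accumulates wall-clock time per named pipeline segment."""

    def __init__(self) -> None:
        self.durations = {}

    @contextmanager
    def segment(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped call raises.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self) -> str:
        return max(self.durations, key=self.durations.get)


timer = PipelineTimer()
with timer.segment("stt"):
    time.sleep(0.005)  # stand-in for a real STT call
with timer.segment("llm"):
    time.sleep(0.06)   # stand-in for a real LLM call
with timer.segment("tts_first_audio"):
    time.sleep(0.005)  # stand-in for time to first audio
print("slowest segment:", timer.slowest())
```

Collecting these per-turn durations over a day of traffic is what turns the latency target from a guess into a baseline.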

Weeks 2–3

LLM routing + turn-taking

Fast model default, frontier on hard turns; tune turn-detection so the agent stops interrupting and stops lagging.

Weeks 3–4

Cost controls

Auto-hangup, per-call caps, model limits, spend dashboard.

Week 5

Eval & monitoring

Instruction-following, tool-call, latency, and hallucination checks, pre- and post-launch.

Week 6

Load test & ship

Run it at expected concurrency, confirm latency and cost hold, release.
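"Confirm latency and cost hold" needs a pass/fail rule. One possible gate, sketched here, checks tail latency and mean cost per call against targets. Gating on the 95th percentile is our assumption rather than anything the platform prescribes, and all sample values and thresholds are invented.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def load_test_passes(latencies_s: list, cost_per_call: list,
                     p95_target_s: float, max_mean_cost: float) -> bool:
    """Pass only if tail latency and average cost are both within limits."""
    mean_cost = sum(cost_per_call) / len(cost_per_call)
    return percentile(latencies_s, 95) <= p95_target_s and mean_cost <= max_mean_cost


# Invented results from a run of 20 simulated calls.
latencies = [0.9] * 18 + [1.4, 2.6]
costs = [0.30] * 20

print(load_test_passes(latencies, costs, p95_target_s=1.5, max_mean_cost=0.35))
```

A tail percentile is used instead of the mean because a handful of multi-second turns is exactly what callers remember.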

06 · The verdict

Twelve months out, ElevenLabs keeps winning on voice — that lead is real and hard to copy. The platform layer gets more competitive as fast, cheap rivals close the latency gap, so the durable advantage stays the sound, not the orchestration. For founders, the lesson is steady: the platform is a great component, and a production voice agent is a systems problem — latency budgeting, model routing, cost control, and observability — that no single vendor hands you in a box.

A best-in-class voice on a platform that makes a demo easy and a production agent deceptively hard. The voice is the reason to start here. The architecture around it is the reason your agent will or won't survive contact with real callers.

FAQ

Is ElevenLabs a good choice for building a voice agent?

Yes. For voice quality and as a starting platform, it's excellent. The work is in the pipeline around it: latency budgeting, model choice, cost control, and monitoring.

Why does the agent feel laggy when the TTS is sub-100ms?

Because end-to-end latency is the sum of STT, turn-detection, your LLM, TTS, and network. The TTS can be fast while a slow LLM or naive turn-taking makes the whole conversation feel laggy.

Why is the bill higher than expected?

Conversational agents are billed per minute, including dead air unless auto-hangup is set, and LLM usage and telephony add on top, so cost scales faster than the TTS line alone suggests.

How do you make a voice agent feel faster?

Set a per-segment latency budget, optimize the LLM (usually the bottleneck) with model routing, tune turn-taking, and stream audio, rather than only chasing the TTS number.