Keep ElevenLabs for what it's unmatched at — the voice. Rebuild the pipeline around it for latency, cost predictability, and observability. The rebuild here isn't replacing the platform; it's the production architecture nobody ships on day one.
1. Budget latency per segment, not in total
Set a target (say, under 1.5s of perceived response time) and allocate it across STT, turn detection, LLM, and TTS. Then optimize the segment that's actually blowing its share (usually the LLM), not the one that's already fast (TTS).
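As a rough illustration of per-segment budgeting, the sketch below splits a 1.5s target across the four segments and reports which ones overran, worst first. The split in `BUDGET_S`, the function name `over_budget`, and the measured values in the usage note are all assumptions made for the example, not recommended numbers.

```python
# Hypothetical per-segment budget in seconds. The split is illustrative only;
# the four values add up to the 1.5s target used as an example above.
BUDGET_S = {"stt": 0.3, "turn_detection": 0.2, "llm": 0.8, "tts": 0.2}


def over_budget(measured_s, budget_s=BUDGET_S):
    """Return (segment, overrun_seconds) pairs, worst offender first."""
    overruns = [
        (segment, measured_s[segment] - limit)
        for segment, limit in budget_s.items()
        if measured_s.get(segment, 0.0) > limit
    ]
    # Sort by size of the overrun so the first entry is the segment to fix first.
    return sorted(overruns, key=lambda pair: pair[1], reverse=True)
```

With made-up measurements such as `{"stt": 0.35, "turn_detection": 0.1, "llm": 1.9, "tts": 0.12}`, the LLM comes back first with the largest overrun and TTS does not appear at all, which is the point of budgeting by segment rather than in total.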
2. Route the LLM by turn difficulty
A fast model handles the routine majority of turns; a frontier model is reserved for the genuinely hard ones.
| Pipeline segment | Candidate | Est. latency contribution | Est. cost driver |
|---|---|---|---|
| Speech-to-text | Fast streaming STT | ~0.2–0.5s | per minute |
| LLM — routine turns | Small / fast model | ~0.3–0.8s | low per turn |
| LLM — hard turns (minority) | Mid–top frontier | ~1.5–3s | higher per turn |
| TTS first audio | Low-latency expressive TTS | ~0.075–0.15s | per character / minute |
Planning-stage estimates, not a benchmark. The LLM is the biggest lever on both felt latency and cost — exactly the segment the "sub-100ms" framing distracts from.
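One way to picture the routing is a small heuristic that sends a turn to the frontier model only when it looks hard. Everything named here is hypothetical: the model identifiers, the keyword list, the word-count threshold, and the `tool_calls_expected` signal are placeholders that a real system would tune or replace with a learned classifier. The second function is plain weighted-average arithmetic for the blended LLM latency.

```python
FAST_MODEL = "fast-model"          # hypothetical identifier
FRONTIER_MODEL = "frontier-model"  # hypothetical identifier

# Hypothetical signals of a hard turn; not taken from any real deployment.
HARD_KEYWORDS = ("refund", "dispute", "cancel", "legal", "escalate")


def pick_model(user_text, tool_calls_expected=0, max_easy_words=40):
    """Route to the frontier model only when the turn looks hard."""
    text = user_text.lower()
    is_hard = (
        len(text.split()) > max_easy_words
        or tool_calls_expected > 1
        or any(keyword in text for keyword in HARD_KEYWORDS)
    )
    return FRONTIER_MODEL if is_hard else FAST_MODEL


def expected_llm_latency(hard_fraction, fast_s, frontier_s):
    """Average LLM latency per turn for a given share of hard turns."""
    return (1 - hard_fraction) * fast_s + hard_fraction * frontier_s
```

Taking 0.5s and 2.0s from inside the table's ranges and assuming 10% of turns are hard, `expected_llm_latency(0.1, 0.5, 2.0)` works out to about 0.65s per turn, versus 2.0s if every turn went to the frontier model.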
3. Put cost controls in from day one
Auto-hangup on silence, per-call duration caps, model-tier limits, and a real-time spend dashboard — so the bill scales with value, not with dead air.
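A minimal sketch of the first three controls, assuming the call loop can report when the call started and when speech was last heard. The class names, field names, and default limits are hypothetical and chosen only for illustration; the spend dashboard is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallLimits:
    max_silence_s: float = 15.0    # hypothetical default
    max_duration_s: float = 600.0  # hypothetical default
    max_frontier_turns: int = 5    # hypothetical default


@dataclass
class CallState:
    started_at: float        # seconds, e.g. from time.monotonic()
    last_speech_at: float    # last moment either party spoke
    frontier_turns: int = 0  # turns already sent to the expensive model


def hangup_reason(state: CallState, limits: CallLimits, now: float) -> Optional[str]:
    """Return why the call should end, or None if it may continue."""
    if now - state.last_speech_at > limits.max_silence_s:
        return "silence"
    if now - state.started_at > limits.max_duration_s:
        return "duration_cap"
    return None


def frontier_allowed(state: CallState, limits: CallLimits) -> bool:
    """Model-tier limit: stop escalating once the per-call quota is spent."""
    return state.frontier_turns < limits.max_frontier_turns
```

Calling `hangup_reason` on every loop tick and `frontier_allowed` before each escalation keeps the checks cheap and puts the limits in one place, where they can be logged alongside spend.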
4. Add an eval and monitoring layer
Log and score every conversation for instruction-following, tool-call correctness, latency, and hallucination — before launch and continuously after. You can't run a customer-facing agent you can't observe.
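To make the scoring concrete, here is a sketch that aggregates per-turn records into the four measures listed above. It assumes some upstream grader (human or model-based, not shown) has already produced the `followed_instructions` and `hallucinated` verdicts; the record fields, the function name, and the 1.5s latency target are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnRecord:
    call_id: str
    latency_s: float
    expected_tool: Optional[str]  # tool the turn should have called, if any
    called_tool: Optional[str]    # tool the agent actually called, if any
    followed_instructions: bool   # verdict from a hypothetical grader
    hallucinated: bool            # verdict from a hypothetical grader


def score_call(turns, latency_target_s=1.5):
    """Aggregate per-turn records into one rate per measure."""
    if not turns:
        raise ValueError("no turns to score")
    n = len(turns)
    return {
        "instruction_following": sum(t.followed_instructions for t in turns) / n,
        "tool_call_correct": sum(t.called_tool == t.expected_tool for t in turns) / n,
        "within_latency_target": sum(t.latency_s <= latency_target_s for t in turns) / n,
        "hallucination_rate": sum(t.hallucinated for t in turns) / n,
    }
```

Running the same function over pre-launch test conversations and over live traffic gives comparable numbers, so a regression after a prompt or model change shows up as a shift in these rates.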