Engineering voice agents: Latency, quality, and scale

Rishabh Bhargava, Together AI
Video: https://video.ut0pia.org/videos/watch/0d693833-d12f-449e-9f56-e7091ffa7b6c
Speaker info: https://www.linkedin.com/in/bhargavarishabh
Date: Mon, 01 Jun 2026 05:52:50 GMT

Users notice latency above 500 ms and hang up above one second. In an already optimized pipeline, 75 ms of network latency from models sitting in a different data center adds 30% overhead; colocating everything in the same building drops that to around 5 ms. Rishabh Bhargava from Together AI walks through the full speech-to-text, LLM, and text-to-speech pipeline at that level of specificity.

The LLM dominates the budget: the target is 200 to 300 ms time to first token, with a model in the 8B to 30B parameter range. Larger models blow the latency budget; smaller ones break tool calling. The speech-to-text target is P90 latency under 100 ms with a word error rate of around 6%.

One pattern for handling complex workflows without adding latency: a small "thinker" LLM handles conversation flow and issues a single tool call to a larger model when the request is complex, keeping the fast path fast.
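Two ideas from the summary can be sketched in a few lines of Python: the network-overhead arithmetic and the thinker-plus-larger-model routing pattern. This is a minimal sketch, not the speaker's implementation. The roughly 250 ms baseline is inferred from the two quoted figures (75 ms being 30% overhead) and is not stated in the summary. `ThinkerReply`, `handle_turn`, and the two callables standing in for the models are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

NOTICE_MS = 500.0    # from the summary: users notice latency above this
HANG_UP_MS = 1000.0  # from the summary: users hang up above this


def implied_baseline_ms(network_ms: float, overhead: float) -> float:
    """Baseline latency implied by 'network_ms adds `overhead` (a fraction)'."""
    return network_ms / overhead


def overhead_fraction(network_ms: float, baseline_ms: float) -> float:
    """Network latency expressed as a fraction of the baseline latency."""
    return network_ms / baseline_ms


@dataclass(frozen=True)
class ThinkerReply:
    """Hypothetical per-turn output of the small 'thinker' model."""
    text: str                              # direct answer for the fast path
    delegate_prompt: Optional[str] = None  # set only when the request is complex


def handle_turn(
    user_text: str,
    thinker: Callable[[str], ThinkerReply],  # stand-in for the small LLM
    expert: Callable[[str], str],            # stand-in for the larger LLM
) -> str:
    """Answer on the fast path unless the thinker asks to delegate."""
    reply = thinker(user_text)
    if reply.delegate_prompt is None:
        return reply.text                  # fast path: small model only
    return expert(reply.delegate_prompt)   # a single call to the larger model


if __name__ == "__main__":
    # Inferred, not stated in the summary: 75 ms at 30% implies this baseline.
    baseline = implied_baseline_ms(75.0, 0.30)
    print(f"implied baseline: {baseline:.0f} ms")
    print(f"cross-datacenter total: {baseline + 75.0:.0f} ms (notice at {NOTICE_MS:.0f} ms)")
    print(f"colocated overhead: {overhead_fraction(5.0, baseline):.1%}")
```

Under that inferred baseline, colocation shrinks the network share from 30% to about 2%. In the routing half, simple turns never touch the larger model, so they keep the small model's latency, and only turns flagged as complex pay for the extra call.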