Why AI Voice Agents Are Finally Learning to Listen: Inworld's New Model Understands Emotion in Real Time
Inworld AI has released Realtime TTS-2, a voice model that fundamentally changes how AI agents respond in conversations by understanding emotional context and adjusting their tone, pacing, and delivery in real time. Unlike traditional text-to-speech systems that generate audio in isolation from the conversation around them, TTS-2 listens to how users sound, analyzes the full conversation history, and responds with appropriate emotional awareness.
What Makes This Different From Today's Voice AI?
Current voice agents sound mechanical in conversations because they lack access to critical information. When a frustrated customer calls support, today's systems respond with the same bright, even tone regardless of the caller's emotional state. TTS-2 changes this by capturing the user's audio and extracting context, emotion, and tone before generating any response.
Consider a patient calling to discuss lab results. They start measured and calm, but when the agent shares unexpected findings, their voice tightens and questions come faster. Traditional voice agents would continue at the same pace and pitch, ignoring the gravity of the moment. TTS-2 registers the shift in real time, slows down, leaves space, and delivers information with steadiness and care, not because someone scripted a specific pathway, but because the model heard how the person was speaking and adapted naturally.
How Does Realtime TTS-2 Actually Work?
- Audio Context Capture: Before speech is generated, TTS-2 captures the user's audio and extracts emotional state, tone, and pacing information in real time, giving the system awareness of how the person is actually feeling.
- Conversational Reasoning: The model processes the full conversation history, identifying what was said in previous turns, which moments were most important, and what can be inferred from how the user sounds right now.
- Multi-Layered Input Processing: TTS-2 receives what to say, how to say it using natural-language voice direction, direct audio from the user to condition expression, and the complete conversation history, synthesizing all inputs into emotionally aware speech.
- Real-Time Adjustment: The system adjusts tone, pacing, and delivery based on the complete picture of the interaction, producing voice that sounds like a person in conversation rather than someone reading an audiobook.
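The four inputs described above can be pictured as one structured synthesis request. The sketch below is purely illustrative: the field names (`voice_direction`, `conversation_history`, and so on) are assumptions for clarity, not Inworld's documented schema.

```python
import base64

def build_tts_request(text, voice_direction, user_audio_bytes, history):
    """Assemble an illustrative multi-layered TTS request.

    Combines the four inputs TTS-2 is described as consuming:
    what to say, how to say it, the user's raw audio, and the
    prior turns. All field names here are hypothetical.
    """
    return {
        "text": text,                        # what to say
        "voice_direction": voice_direction,  # natural-language style prompt
        "user_audio": base64.b64encode(user_audio_bytes).decode("ascii"),
        "conversation_history": history,     # prior turns, oldest first
    }

request = build_tts_request(
    text="Your results came back, and I want to walk you through them.",
    voice_direction="calm and steady, leave space between sentences",
    user_audio_bytes=b"\x00\x01",  # placeholder for captured caller audio
    history=[{"role": "user", "text": "Can you tell me what the lab found?"}],
)
```

The point of the shape is that the user's own audio and the conversation history travel with the text, so expression is conditioned on the whole exchange rather than on the sentence alone.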
Developers control the model using natural-language prompts, much as they would prompt a large language model (LLM), an AI system trained on vast amounts of text to understand and generate language. Full descriptions like "act like you just got home from a long day, tired but warm" can be combined with inline controls for specific moments, such as "whispering," "sigh," or "excited."
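As a rough illustration of combining a scene-level description with inline cues, the helper below splices bracketed tags in front of specific phrases. The bracket syntax and the `scene`/`inline` parameters are assumptions made for this sketch, not Inworld's documented markup.

```python
def direct(text, scene=None, inline=None):
    """Compose a directed line: an overall scene description plus
    optional inline cues spliced before specific phrases.

    The bracketed-tag syntax here is hypothetical.
    """
    inline = inline or {}
    for phrase, cue in inline.items():
        # Prefix the target phrase with its inline cue, e.g. "[excited]".
        text = text.replace(phrase, f"[{cue}] {phrase}")
    return {"voice_direction": scene, "text": text}

line = direct(
    "I can't believe it worked. We actually did it.",
    scene="act like you just got home from a long day, tired but warm",
    inline={"We actually did it.": "excited"},
)
# line["text"] == "I can't believe it worked. [excited] We actually did it."
```

The design mirrors how the article describes the controls: one prompt sets the overall emotional register, while inline tags steer individual moments.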
How to Integrate Realtime TTS-2 Into Your Voice AI Application
- API Access: Realtime TTS-2 is available via the Inworld API and as part of the Inworld Realtime API for end-to-end speech-to-speech over a single persistent connection, allowing developers to build conversational AI without managing multiple systems.
- Integration Partners: The model works with established platforms including Layercode, LiveKit, NLX, Pipecat, Vapi, and Voximplant, making it easier to integrate into existing voice infrastructure.
- Live Demo and Documentation: Developers can try the live demo or access detailed documentation at inworld.ai/tts to understand how the model performs with their specific use cases before committing to implementation.
- Multilingual Support: TTS-2 supports over 100 languages with on-the-fly switching inside a single generation, preserving the speaker's voice identity across every language for global applications.
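Because the Realtime API runs speech-to-speech over a single persistent connection, a client interleaves different kinds of messages on one socket. The envelope format below is illustrative only (the `kind` values and JSON framing are assumptions, not Inworld's wire protocol); it shows how audio chunks and control messages, such as an on-the-fly language switch, might share the connection.

```python
import json

def frame_message(kind, payload):
    """Frame one message for an assumed persistent speech-to-speech
    connection: a JSON envelope carrying a message kind and payload.

    The envelope format is illustrative, not Inworld's wire protocol.
    """
    if kind not in {"audio_chunk", "text", "control"}:
        raise ValueError(f"unknown message kind: {kind}")
    return json.dumps({"kind": kind, "payload": payload})

# Over one connection the client interleaves user audio with control
# messages, and reads synthesized audio back on the same socket.
frames = [
    frame_message("control", {"language": "es"}),  # on-the-fly language switch
    frame_message("audio_chunk", {"data": "UklGRg=="}),  # base64 audio bytes
]
```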
How Does This Compare to Competitors?
Inworld's previous model, TTS-1.5, already ranked number one on the Artificial Analysis Speech Arena, a benchmark that evaluates voice AI quality, positioning it above Google and ElevenLabs. With voice quality already achieved, Inworld built TTS-2 with a fundamentally different architecture designed to process conversation the way a human listener would, before a single word is spoken.
"We are obsessed with how voice AI feels, not just how it sounds. Realtime voice is the most natural way for people to communicate with AI, because it is the most natural way people communicate with each other. Voice is how we actually connect. We built TTS-2 to make that connection feel real," said Kylan Gibbs, CEO and Co-Founder of Inworld AI.
What Problem Does This Solve for Businesses?
Voice AI is increasingly used in customer service, healthcare, and other industries where human connection matters. However, mechanical-sounding responses damage trust and user experience. When a frustrated customer hears an agent respond with the same cheerful tone it uses for every interaction, it signals that the system isn't actually listening. TTS-2 solves this by making voice agents sound genuinely responsive to what users are experiencing.
The technical breakthrough required solving problems that the AI research field had previously treated as future work. Igor Poletaev, Chief Science Officer at Inworld AI, explained the architectural challenge: "Most TTS models generate speech in isolation from the conversation around them. TTS-2 is trained to use audio context from the full multi-turn exchange, and take voice direction so how the model speaks adjusts to how it was spoken to. Building a system that does this in real-time, at production quality, with full controllability, required solving problems that the field had treated as future work for years."
What's Next for Voice AI?
Inworld is a research lab focused on solving realtime interaction, with a founding team from DeepMind and Google. The company has raised more than $125 million from leading investors including Lightspeed Venture Partners, Section 32, Bitkraft, Kleiner Perkins, and Founders Fund. Beyond TTS-2, Inworld offers Realtime Speech-to-Text (STT) for speech recognition that includes voice profiling to detect detailed user context, a Realtime Router that selects the optimal model and prompt for every context, and a Realtime API that unifies everything into a single persistent connection for full-duplex conversational AI.
The release of TTS-2 signals a shift in how the voice AI industry approaches the problem of human-like interaction. Rather than focusing solely on audio quality, the field is now prioritizing emotional intelligence and contextual awareness. As voice becomes the primary interface for AI in customer service, healthcare, and personal assistance, the ability to sound genuinely responsive to human emotion may become as important as the ability to understand what users are saying.