AI Still Can't Reliably Watch and Listen: New Benchmark Exposes Critical Gaps in Multimodal Understanding
Artificial intelligence systems that watch and listen to real-world interactions like job interviews, medical consultations, and customer service calls still struggle with fundamental understanding tasks, according to a new benchmark released by researchers at the Vector Institute. The benchmark, called SONIC-O1 (Social Natural Interaction Corpus, Omnimodal v1), tested leading AI models on their ability to jointly process audio and video content, revealing that even the most advanced systems fail at critical tasks like identifying exactly when important events occur in conversations.
The gap matters because AI systems increasingly influence high-stakes decisions about people's lives. When an AI system scores a job applicant's video interview or summarizes a patient-doctor consultation, it operates as a black box to hiring managers and healthcare providers. Without rigorous testing, there is no way to know whether the system understood the candidate fairly, interpreted tone correctly, or performed consistently across people of different ages, genders, and racial backgrounds.
What Makes SONIC-O1 Different From Other AI Benchmarks?
Most evaluations of multimodal AI (systems that process multiple types of information, like audio and video together) focus on static images, short video clips, or text transcripts. SONIC-O1 takes a different approach by testing AI models on approximately 60 hours of real-world audio-video content drawn from 231 human-reviewed videos across 13 conversational topics. The benchmark spans five broader domains that reflect how AI is actually being deployed in society:
- Professional interactions: Job interviews and workplace meetings where AI may score candidate responses or summarize discussions
- Educational conversations: Parent-teacher conferences where AI might help document and analyze communication
- Legal and civic settings: Courtroom proceedings and community town halls where accurate understanding is legally critical
- Service-oriented interactions: Customer service calls, restaurant encounters, and housing tours where AI assists with documentation
- Community and public-health settings: Patient-doctor consultations, emergency response, mental-health counseling, and sports coverage where AI supports decision-making
The videos in SONIC-O1 range from short clips to conversations lasting up to an hour, giving researchers a much broader view of how AI performs in real conditions compared to datasets focused only on brief, highly edited media. The benchmark includes 4,958 human-verified annotations and metadata that allows researchers to analyze performance across observable demographic categories.
How Do Researchers Measure Whether AI Truly Understands an Interaction?
SONIC-O1 evaluates three connected capabilities that together reveal whether an AI system genuinely comprehends what is happening in an audio-video interaction. The first task asks models to produce a coherent summary of a full audio-video interaction, testing whether they can synthesize information across the entire conversation. The second task tests fine-grained understanding through multiple-choice questions based on short audio-video segments, which measures whether models can identify specific details. The third task, temporal localization with reasoning, asks models to identify precisely when an event happens in a video, such as when a particular goal is scored in a sports clip or when a speaker makes a key statement, and to determine whether one event occurs before or after another.
This third task proved to be the most revealing. Models must predict the start and end time of a target event and explain the evidence supporting their answer. This requires not just understanding what happened, but understanding when it happened, a capability that separates systems that can describe video content from systems that truly comprehend temporal sequences.
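One common way to score a predicted time span against a human-annotated one is temporal intersection-over-union (IoU). The article does not say which metric SONIC-O1 uses to compute its success rate, so the Python sketch below is an illustration under assumptions: the function names `temporal_iou` and `is_hit`, the 0.5 threshold, and the example timestamps are all hypothetical.

```python
def temporal_iou(pred, truth):
    """Return the overlap ratio of two (start, end) spans given in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    if pred_end < pred_start or true_end < true_start:
        raise ValueError("each span must satisfy start <= end")
    # Length of the shared region; zero when the spans do not overlap.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    # Combined length covered by either span.
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.5):
    """Count a prediction as correct when IoU reaches the (assumed) threshold."""
    return temporal_iou(pred, truth) >= threshold


# Hypothetical event annotated at 2:00-2:30 in a video.
annotated = (120.0, 150.0)
close_guess = (125.0, 155.0)   # right event, slightly late
far_guess = (300.0, 330.0)     # wrong part of the video

print(round(temporal_iou(close_guess, annotated), 3), is_hit(close_guess, annotated))
print(round(temporal_iou(far_guess, annotated), 3), is_hit(far_guess, annotated))
```

Under a rule like this, a model that describes the event correctly but places it in the wrong part of the video scores zero, which is one way a system can look competent on summaries while failing localization.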
What Did the Benchmark Reveal About Current AI Capabilities?
The results showed meaningful progress in multimodal AI, but also clear limitations. Closed-source models (proprietary systems like Google's Gemini) performed best across the benchmark, particularly on open-ended summarization and temporal localization tasks. The gap was smaller for multiple-choice questions, suggesting that current systems are relatively stronger when they can select from a fixed set of answers rather than generating original reasoning.
Temporal localization emerged as the most difficult task. Gemini 3.0 Pro achieved a 25.4% success rate on this task, compared with 2.8% for the strongest open-source model, Qwen3-Omni. This represents a 22.6 percentage point gap, showing that models can often describe what happened or answer a question about a clip, but still struggle to reliably identify when the relevant evidence appears in the video.
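The gap can be expressed either as a difference in percentage points or as a ratio, and the two read very differently when one rate is small. The short Python sketch below reproduces the 22.6-point figure from the two reported rates and also computes the ratio, which is derived here rather than quoted from the benchmark; the variable names are illustrative.

```python
# Reported temporal localization success rates, in percent.
gemini_rate = 25.4
qwen_rate = 2.8

# Absolute difference in percentage points (rounded to avoid float noise).
point_gap = round(gemini_rate - qwen_rate, 1)

# Relative difference: how many times higher the stronger rate is.
ratio = round(gemini_rate / qwen_rate, 1)

print(f"Gap: {point_gap} percentage points")
print(f"Ratio: about {ratio}x")
```

Both views point the same way: even the best-performing model in these results succeeds on roughly one in four temporal localization items.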
Performance also varied significantly across real-world settings. No model performed equally well across all 13 conversational topics tested. High-stakes interactions such as emergency response and mental-health counseling remain especially demanding because they require models to connect spoken language, visual context, timing, and subtle social cues simultaneously.
Where Are the Biggest Fairness Concerns?
Perhaps most concerning, group-wise analysis revealed the largest disparities in temporal localization, including a 21.4 percentage point gap for Gemini 3.0 Pro between Indigenous and Black participants. This finding demonstrates that overall performance averages can mask uneven reliability across demographic groups, a critical issue when AI systems are used to make decisions affecting people's opportunities.
This disparity suggests that AI systems trained on datasets that may underrepresent certain populations can perform significantly worse for those groups, even when they perform well on average. For hiring platforms, medical consultation reviews, and other high-stakes applications, such demographic disparities could perpetuate existing inequities if not carefully monitored and addressed.
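A group-wise analysis of this kind can be sketched in a few lines of Python. The example below is a minimal illustration, not the benchmark's own evaluation code: the helper names `success_rate_by_group` and `largest_gap`, the group labels, and the records are all made up to show how an overall average can hide a gap between groups.

```python
from collections import defaultdict


def success_rate_by_group(records):
    """Map each group label to its success rate in percent.

    `records` is an iterable of (group_label, succeeded) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, succeeded in records:
        totals[group] += 1
        hits[group] += int(succeeded)
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


def largest_gap(rates):
    """Return (best_group, worst_group, gap in percentage points)."""
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return best, worst, rates[best] - rates[worst]


# Made-up per-item outcomes for two hypothetical groups.
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = success_rate_by_group(records)
overall = 100.0 * sum(succeeded for _, succeeded in records) / len(records)

print("Overall:", overall)
print("By group:", rates)
print("Largest gap:", largest_gap(rates))
```

In this toy data the overall rate is 37.5%, while the two groups sit at 50% and 25%, a 25-point gap that the single average does not show.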
Steps for Researchers and Developers to Improve Multimodal AI Systems
SONIC-O1 provides a shared framework that researchers and developers can use to investigate and improve multimodal AI understanding. The benchmark offers several practical pathways forward:
- Temporal reasoning focus: Developers should prioritize improving models' ability to identify precisely when events occur in videos, as this capability lags significantly behind summarization and multiple-choice question answering
- Domain-specific evaluation: Teams deploying AI in high-stakes settings like emergency response or mental-health counseling should test performance on domain-specific video content rather than relying on general benchmarks alone
- Demographic testing: Before deploying AI systems in hiring, education, or healthcare, organizations should conduct group-wise analysis to identify and address performance disparities across demographic groups
- Longer-form content testing: Developers should evaluate their systems on longer conversations and interactions, not just short clips, to ensure performance holds up in real-world conditions
The Vector Institute has made SONIC-O1 openly available through a project page, research paper, dataset, GitHub repository, and public leaderboard, allowing the broader research community to contribute to improving multimodal AI understanding. This collaborative approach reflects growing recognition that benchmarking and transparency are essential for building AI systems that can be trusted in socially important settings.
As AI systems increasingly influence decisions about hiring, education, healthcare, and legal proceedings, it becomes critical that they understand not just what is said and shown, but when it happens, and that they do so consistently across different populations. SONIC-O1 provides researchers and practitioners with the tools to measure progress toward that goal and identify where today's systems still fall short.