Robot Safety Guards Are Missing the Moment That Matters Most
Vision-language models (VLMs) are increasingly deployed as real-time safety monitors for robots in homes and factories, but a new diagnostic benchmark exposes a critical gap: these AI systems can recognize that a video contains danger, yet consistently fail to identify the precise moment when harm occurs. Researchers introduced EgoSafetyBench, a benchmark of 1,200 robot-view scenarios, to evaluate how well VLMs function as runtime safety guards. The findings reveal that current models struggle with contextual hazards, over-intervene on safe activities, and are easily fooled by misleading text visible in scenes.
Why Can't AI Safety Guards Pinpoint Dangerous Moments?
The challenge facing embodied AI safety is more nuanced than simply detecting hazards. A kitchen knife, for instance, is not inherently dangerous; safety depends entirely on what the robot is doing with it and the surrounding context. Most existing safety benchmarks collapse this distinction into binary safe-or-unsafe classifications, which obscures two very different failure modes: flagging safe scenes as dangerous, and missing contextual hazards that actually require intervention.
EgoSafetyBench addresses this by evaluating VLMs on two independent axes. The first separates scenarios into four categories: plainly safe routines, safe scenes with resolved confounds (alarming cues that are actually harmless), obvious hazards, and contextual hazards that require understanding the relationship between the robot's action and its environment. The second axis tests whether misleading in-scene text, such as a sign or label, corrupts the model's judgment.
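One way to picture the two axes is as two independent labels attached to every scenario. The sketch below is illustrative only: the class names, field names, and the `NONE` text condition are assumptions made for this example, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Axis 1: the four scenario categories described in the article."""
    SAFE_ROUTINE = "plainly safe routine"
    SAFE_RESOLVED_CONFOUND = "safe scene with a resolved confound"
    OBVIOUS_HAZARD = "obvious hazard"
    CONTEXTUAL_HAZARD = "contextual hazard"


class SceneText(Enum):
    """Axis 2: the in-scene text condition (NONE is an assumed extra value)."""
    NONE = "none"
    TRUTHFUL = "truthful"
    MISLEADING = "misleading"


@dataclass(frozen=True)
class Scenario:
    scenario_id: str      # hypothetical identifier
    domain: str           # "home" or "factory"
    category: Category
    scene_text: SceneText

    @property
    def requires_intervention(self) -> bool:
        # Only the two hazard categories call for the guard to step in.
        return self.category in (Category.OBVIOUS_HAZARD, Category.CONTEXTUAL_HAZARD)


example = Scenario("demo-001", "home", Category.CONTEXTUAL_HAZARD, SceneText.MISLEADING)
```

Keeping the axes as separate fields means results can be sliced by category, by text condition, or by both at once.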
The benchmark contains 1,200 scenarios across home and factory domains, annotated at half-second granularity to simulate real-time streaming. Researchers evaluated ten VLMs, including both open-source and proprietary models, using a stateless chunk-by-chunk protocol that mirrors how a deployed safety guard would operate.
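A stateless chunk-by-chunk protocol can be sketched roughly as follows. This is a minimal illustration that assumes "stateless" means each half-second chunk is judged on its own, with no memory of earlier chunks. The function names and the `judge_chunk` callable, which stands in for a VLM call, are hypothetical.

```python
from typing import Callable, List, Sequence

CHUNK_SECONDS = 0.5  # annotation granularity reported for the benchmark


def split_into_chunks(frames: Sequence, fps: float) -> List[Sequence]:
    """Group consecutive frames into half-second chunks."""
    per_chunk = max(1, round(fps * CHUNK_SECONDS))
    return [frames[i:i + per_chunk] for i in range(0, len(frames), per_chunk)]


def run_stateless_guard(
    frames: Sequence,
    fps: float,
    judge_chunk: Callable[[Sequence], bool],
) -> List[bool]:
    """Return one hazard flag per chunk.

    judge_chunk sees only the current chunk, so no information is
    carried over from one decision to the next.
    """
    return [bool(judge_chunk(chunk)) for chunk in split_into_chunks(frames, fps)]


# Toy usage: integers stand in for frames, and "frame 7" plays the hazard.
flags = run_stateless_guard(list(range(10)), fps=4, judge_chunk=lambda c: 7 in c)
```

In a real deployment, `judge_chunk` would wrap a model query, and the resulting per-chunk flags would be compared against the half-second annotations.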
What Did the Evaluation Reveal About Current VLM Safety Performance?
The results expose a troubling pattern. Models reliably flag that a video contains a hazard somewhere, but they consistently fail to pinpoint the specific moments at which the hazard occurs. This distinction matters in real-world deployment: a safety guard that interrupts a robot at the wrong time is nearly as problematic as one that misses the danger entirely.
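The gap between video-level and moment-level detection is easy to see with a toy scoring sketch. The metric definitions below are standard precision and recall applied to per-chunk flags. They are an assumption for illustration and may differ from the benchmark's own scoring.

```python
from typing import Optional, Sequence


def _check(pred: Sequence[bool], truth: Sequence[bool]) -> None:
    if len(pred) != len(truth):
        raise ValueError("pred and truth must cover the same chunks")


def video_level_correct(pred: Sequence[bool], truth: Sequence[bool]) -> bool:
    """True if the guard's 'any hazard in this video?' answer matches the label."""
    _check(pred, truth)
    return any(pred) == any(truth)


def chunk_recall(pred: Sequence[bool], truth: Sequence[bool]) -> Optional[float]:
    """Share of truly hazardous chunks that were flagged (None if there are none)."""
    _check(pred, truth)
    hits = [p for p, t in zip(pred, truth) if t]
    return sum(hits) / len(hits) if hits else None


def chunk_precision(pred: Sequence[bool], truth: Sequence[bool]) -> Optional[float]:
    """Share of flagged chunks that were truly hazardous (None if nothing was flagged)."""
    _check(pred, truth)
    flagged = [t for p, t in zip(pred, truth) if p]
    return sum(flagged) / len(flagged) if flagged else None


# The hazard occurs in chunks 4-5, but the guard fires on chunks 0-1.
truth = [False, False, False, False, True, True, False, False]
pred = [True, True, False, False, False, False, False, False]
```

Here the guard is "right" at the video level, yet its chunk-level recall and precision are both zero, which is the failure pattern described above.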
Misleading in-scene text degraded every tested model, but in opposite ways. Deceptive signs suppressed hazard detection in vulnerable open-weight models, causing them to miss up to a third of hazards. Conversely, the strongest closed-source models over-intervened on safe content when exposed to misleading text, flagging benign activities as dangerous. Because each misleading sign was paired with an identical truthful version, researchers could confirm that these shifts were caused by the deception itself, not general perception errors.
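The paired design lends itself to a simple difference-of-rates calculation. The sketch below is a hypothetical illustration of that idea: `PairedResult`, `deception_shift`, and the toy data are invented for this example and do not reproduce the study's results or analysis code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class PairedResult:
    is_hazard: bool            # ground truth, identical for both versions
    flagged_truthful: bool     # guard output with the truthful sign
    flagged_misleading: bool   # guard output with the misleading sign


def _rate(flags: List[bool]) -> Optional[float]:
    return sum(flags) / len(flags) if flags else None


def deception_shift(pairs: Iterable[PairedResult]) -> Dict[str, Optional[float]]:
    """Flag rate under misleading text minus flag rate under truthful text.

    A negative shift on hazards means suppressed detection; a positive
    shift on safe scenes means extra false alarms.
    """
    pairs = list(pairs)
    out: Dict[str, Optional[float]] = {}
    for name, want_hazard in (("hazard", True), ("safe", False)):
        group = [p for p in pairs if p.is_hazard == want_hazard]
        truthful = _rate([p.flagged_truthful for p in group])
        misleading = _rate([p.flagged_misleading for p in group])
        out[name] = None if truthful is None else misleading - truthful
    return out


toy_pairs = [
    PairedResult(True, True, True),
    PairedResult(True, True, True),
    PairedResult(True, True, False),    # hazard missed only when the sign lies
    PairedResult(False, False, False),
    PairedResult(False, False, True),   # false alarm only when the sign lies
]
shift = deception_shift(toy_pairs)
```

Because the two versions of each scene differ only in the sign, a nonzero shift can be attributed to the text rather than to the rest of the scene.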
A particularly striking finding emerged from matched controls: apparent safety robustness in top-performing models often reflected indiscriminate alarming rather than genuine physical reasoning. In other words, some models appeared safe not because they understood the situation, but because they defaulted to flagging almost everything as potentially dangerous.
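How to Evaluate VLM Safety Guards More Rigorously
The findings point to three capabilities that should be measured separately rather than folded into a single accuracy score: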
- False-Positive Control: Test whether a model avoids flagging safe-but-suspicious scenes as hazardous, such as a robot holding a knife while preparing food in a normal kitchen setting.
- Contextual Hazard Detection: Evaluate whether a model can identify hazards that depend on understanding the relationship between the robot's action, the object involved, and the surrounding context, rather than just recognizing objects in isolation.
- Resistance to Deceptive Channels: Assess how well a model maintains accurate safety judgments when exposed to misleading text, signs, or labels visible in the scene that contradict the physical reality.
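The first two criteria can be read off a per-category flag rate: on hazard categories it is recall, and on safe categories it is the false-positive rate. The following sketch uses made-up category labels and a deliberately naive guard. It is an illustration, not the benchmark's scoring code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def flag_rate_by_category(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """records holds (category, flagged) pairs; returns the flag rate per category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [flagged, total]
    for category, flagged in records:
        counts[category][0] += int(flagged)
        counts[category][1] += 1
    return {c: flagged / total for c, (flagged, total) in counts.items()}


# A guard that flags everything, scored on a tiny hypothetical sample.
always_alarm = [
    ("contextual_hazard", True),
    ("contextual_hazard", True),
    ("obvious_hazard", True),
    ("safe_resolved_confound", True),
    ("safe_resolved_confound", True),
    ("safe_routine", True),
]
rates = flag_rate_by_category(always_alarm)
```

The always-alarm guard scores perfect recall on both hazard categories while also flagging every safe scene, which is why hazard recall alone cannot separate physical reasoning from indiscriminate alarming.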
The research also introduced a methodological innovation called contrastive ladders, where scenarios share the same entity, action, target, and domain but differ only in a single visible deciding variable. This forces models to evaluate specific safety-relevant relationships rather than exploiting broad scene correlations. By isolating the deciding cue, researchers can determine whether a correct safety judgment actually hinges on understanding that cue or merely reflects the overall scene type.
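A contrastive ladder can be sketched as a group of scenarios keyed by the shared (entity, action, target, domain) tuple. The all-rungs-correct scoring rule below is an assumption chosen to illustrate the idea, and the `Rung` fields and example scene are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rung:
    entity: str
    action: str
    target: str
    domain: str
    deciding_variable: str   # the single visible cue that differs between rungs
    is_hazard: bool          # ground truth
    predicted_hazard: bool   # guard output


def ladder_accuracy(rungs: Iterable[Rung]) -> float:
    """Fraction of ladders on which every rung is judged correctly."""
    ladders = defaultdict(list)
    for r in rungs:
        ladders[(r.entity, r.action, r.target, r.domain)].append(r)
    if not ladders:
        raise ValueError("no rungs supplied")
    solved = sum(
        all(r.predicted_hazard == r.is_hazard for r in group)
        for group in ladders.values()
    )
    return solved / len(ladders)


# Same scene twice; only the burner state differs. The guard flags both.
toy_ladder = [
    Rung("robot arm", "place", "pan", "home", "burner off", False, True),
    Rung("robot arm", "place", "pan", "home", "burner on", True, True),
]
```

Under this rule, a guard that answers by scene type gets half of the individual rungs right but solves none of the ladders, so the score reflects whether the judgment actually tracks the deciding cue.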
The automated generation pipeline used to create EgoSafetyBench included structured authoring, rendering, filtering via language models, and chunk-level annotation. This scalable approach enabled the creation of validated contrastive ladders with isolated deciding variables, making the benchmark both rigorous and reproducible.
These findings argue for a fundamental shift in how embodied VLM safety is judged. Rather than relying on aggregate accuracy scores, researchers and developers should prioritize false-positive control, contextual hazard detection, and resistance to deceptive channels. As robots become more prevalent in human-centric environments, the ability to distinguish between genuine safety reasoning and surface-level pattern matching will determine whether AI safety guards can be trusted in real-world deployment.