FrontierNews.ai

Why AI Vision Models Are Failing to Spot Women's Safety Crimes in Real CCTV Footage

Vision language models (VLMs), the AI systems that can analyze both images and text, are failing to detect crimes against women in real-world surveillance footage because they were trained on datasets that don't represent these specific threats. Researchers have found that existing AI models misclassify women-centric crimes like chain snatching, stalking, and harassment in up to 65% of cases, a gap that could have serious consequences for public safety systems relying on these tools.

What's Wrong With Today's Video Anomaly Detection Models?

Video anomaly detection (VAD) is supposed to help security systems automatically flag unusual or dangerous behavior in surveillance footage. But the datasets used to train these models have a blind spot: they focus on high-resolution, well-lit videos of staged crimes, not the messy reality of actual CCTV recordings. Most existing datasets contain controlled footage shot in close-up under good lighting conditions, which bears little resemblance to the low-resolution, dimly lit, long-distance shots captured by real security cameras.

When researchers tested popular VLMs like Holmes-VAU on women-specific crimes, the results were alarming. The model incorrectly described a chain snatching incident as "a man in a white shirt and black pants riding a motorcycle and then stopping to talk to a woman in a blue dress, which is suspicious and unusual behavior." The actual crime involved two men on a motorcycle snatching a woman's chain and fleeing. Because the model lacked training data on this type of crime, it failed to recognize the threat.

How Are Researchers Fixing the Problem?

To address this gap, researchers have created a new benchmark dataset called ExtrAnom, specifically designed to teach AI models to recognize crimes against women. The dataset contains 1,001 videos with detailed text descriptions, split roughly evenly between normal and anomalous events. What makes ExtrAnom different is its focus on real-world conditions: 8% of videos are shot in low-light conditions, 13% are low-resolution, and 15% are long-shot views that mimic actual CCTV footage.

  • Crime Categories Included: The dataset covers stalking (3.9% of videos), chain snatching (17.6%), kidnapping (7.3%), assassination (2.3%), and harassment (18.9%), providing the diversity of women-centric threats that existing datasets lack
  • Multi-Modal Annotations: Each video includes four text descriptions: one written by humans and three generated by large language models (LLMs), enabling AI systems to learn from both human expertise and machine-generated insights
  • Real-World Conditions: Unlike controlled datasets, ExtrAnom includes videos captured in daylight (64%), low-light environments, varying resolutions, and different camera angles that reflect how actual surveillance systems operate
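The percentages above can be turned into rough video counts. The following is a minimal Python sketch, assuming the category shares are fractions of all 1,001 videos; the article gives only percentages, so the counts are estimates and the variable names are illustrative.

```python
# Figures quoted in the article; the counts derived below are estimates only.
TOTAL_VIDEOS = 1001
CATEGORY_SHARE_PCT = {
    "stalking": 3.9,
    "chain snatching": 17.6,
    "kidnapping": 7.3,
    "assassination": 2.3,
    "harassment": 18.9,
}

# Approximate number of videos per category, rounded to the nearest whole video.
estimated_counts = {
    name: round(TOTAL_VIDEOS * pct / 100)
    for name, pct in CATEGORY_SHARE_PCT.items()
}

# Combined share of the five crime categories across the whole dataset.
anomalous_share_pct = round(sum(CATEGORY_SHARE_PCT.values()), 1)
estimated_anomalous = sum(estimated_counts.values())

for name, count in estimated_counts.items():
    print(f"{name}: ~{count} videos")
print(f"anomalous share: {anomalous_share_pct}% (~{estimated_anomalous} videos)")
```

Under these assumptions the five crime categories together make up about half of the dataset, which is consistent with the near-even split between normal and anomalous events.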

The researchers benchmarked ExtrAnom against popular existing datasets like XD-Violence, UCF-Crime, and UCA, and the results confirmed the problem. Existing models trained on these older datasets performed poorly on women-centric anomalies. For example, when Holmes-VAU was tested on ExtrAnom videos, it misclassified approximately 62.45% of them, compared to only 37.80% misclassification on the UCF-Crime dataset it was trained on.
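To make the comparison concrete, here is a small Python sketch of how a misclassification rate can be computed and how the two reported figures relate. The helper `misclassification_rate` and its label strings are hypothetical illustrations, not the researchers' evaluation code; only the two percentages come from the article.

```python
from typing import Sequence


def misclassification_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Percentage of items whose predicted label differs from the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)


# Rates reported in the article for Holmes-VAU.
EXTRANOM_RATE = 62.45
UCF_CRIME_RATE = 37.80

# Absolute gap in percentage points, and how many times higher the ExtrAnom rate is.
gap_points = round(EXTRANOM_RATE - UCF_CRIME_RATE, 2)
relative_increase = round(EXTRANOM_RATE / UCF_CRIME_RATE, 2)

print(f"gap: {gap_points} percentage points")
print(f"ExtrAnom error rate is {relative_increase}x the UCF-Crime rate")
```

In other words, the reported error rate on ExtrAnom is roughly 25 percentage points higher than on UCF-Crime.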

Why Does This Matter for Public Safety?

The stakes are high. According to a United Nations report from November 2024, approximately 85,000 women and girls were intentionally killed in 2023. The World Health Organization reports that roughly one in three women experiences violence in some form during her lifetime. Early detection of crimes like stalking could prevent escalation to more serious violence, making accurate AI detection systems a potentially life-saving tool.

The problem is particularly acute for subtle crimes like stalking, which can be difficult to distinguish from normal interactions without proper training data. A model that fails to recognize stalking patterns in surveillance footage leaves potential victims without the early warning that automated systems could provide. By creating a dataset specifically focused on women's safety, researchers are pushing the field toward AI systems that can actually protect vulnerable populations.

What About Medical AI? Similar Trust Problems Exist There Too

The issue of AI vision models failing to ground their answers in actual visual evidence extends beyond surveillance. In the medical field, researchers have discovered that leading medical VLMs often fabricate clinical findings from text alone, without genuinely analyzing the image provided. This represents a parallel crisis of trust in high-stakes domains.

A new evaluation framework called MedVIGIL tested 16 vision-capable medical AI models and found that they frequently produce fluent-sounding clinical answers even when the visual evidence is broken, corrupted, or contradicted by the question. Claude Opus 4.7, the strongest model tested, achieved a composite trustworthiness score of only 69.2 out of 100, leaving a 14.1-point gap between the model and an independent radiologist baseline of 83.3. The problem is that medical VLMs rely too heavily on language patterns from medical literature rather than actually analyzing the image in front of them.

When researchers deliberately corrupted images or embedded false premises in questions, many models continued to generate plausible-sounding answers without recognizing that the visual evidence had failed. This "silent failure" mode is particularly dangerous in medicine, where a model that confidently provides an incorrect diagnosis could lead to patient harm. The MedVIGIL benchmark includes 2,556 test cases supervised by four board-certified radiologists, making it one of the most rigorous evaluations of medical AI trustworthiness to date.
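The idea of a "silent failure" check can be illustrated with a toy Python sketch. This is not MedVIGIL's actual method: the blanked-image corruption, the abstention phrases, and both stub models are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]                # toy grayscale image as rows of pixel values
Model = Callable[[Image, str], str]    # hypothetical interface: (image, question) -> answer

# Hypothetical abstention phrases; a real evaluation would need a more careful judge.
ABSTAIN_MARKERS = ("cannot determine", "image is unreadable", "unable to assess")


def blank_image(pixels: Image) -> Image:
    """Simulate a corrupted input by replacing every pixel with zero."""
    return [[0 for _ in row] for row in pixels]


def abstained(answer: str) -> bool:
    """True if the answer contains one of the assumed abstention phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)


def silent_failure_rate(model: Model, cases: Sequence[Tuple[Image, str]]) -> float:
    """Fraction of corrupted-image cases where the model answers anyway."""
    failures = 0
    for pixels, question in cases:
        answer = model(blank_image(pixels), question)
        if not abstained(answer):
            failures += 1
    return failures / len(cases)


def text_only_model(pixels: Image, question: str) -> str:
    """Stub that ignores the image and answers from the question alone."""
    return "No acute abnormality is seen."


def grounded_model(pixels: Image, question: str) -> str:
    """Stub that checks for usable pixel data before answering."""
    if all(value == 0 for row in pixels for value in row):
        return "Cannot determine: the image is unreadable."
    return "No acute abnormality is seen."


cases = [([[12, 40], [200, 7]], "Is there a fracture?")]
print(silent_failure_rate(text_only_model, cases))
print(silent_failure_rate(grounded_model, cases))
```

The stub that ignores its input answers confidently on a blank image, while the stub that inspects the pixels abstains, which is the distinction the benchmark is described as probing.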

Both the surveillance and medical domains reveal a fundamental challenge with current VLMs: they excel at pattern matching and language generation, but they struggle with genuine visual grounding. The systems can sound authoritative while missing critical details that a human expert would immediately notice. As these models move from research labs into real-world deployment, ensuring they can recognize when their visual evidence has failed becomes as important as ensuring they produce correct answers when the evidence is intact.