Vision-Language Models Struggle With Real-World Web Tasks, New Benchmark Reveals
Vision-language models (VLMs) like GPT-4V and Gemini Vision can describe images with impressive accuracy, but they struggle significantly when asked to generate working websites from visual specifications. A new benchmark called WebRISE reveals that even the strongest models reach only 65.6% success on interactive web generation tasks, meaning roughly one-third of required website features fail to work as intended.
Why Can't Vision-Language Models Build Functional Websites?
The challenge goes deeper than visual design. Researchers from Tsinghua University, Huawei Noah's Ark Lab, and other institutions found that VLMs can create pages that look correct but do not function properly. A filter might leave the item list unchanged, or a change to a shopping cart might leave the total price stale. These aren't visual problems; they're behavioral failures that only become apparent when users interact with the page.
The WebRISE benchmark tested 14 different multimodal large language models (MLLMs) across 442 web-building tasks. The researchers discovered that existing evaluation methods miss these functional failures because they focus on static visual appearance rather than requirement-induced state transitions. In other words, traditional benchmarks check whether a page looks right, not whether it actually works.
What Makes This Benchmark Different From Previous Tests?
WebRISE introduces an evaluation approach built on Interaction Contract Graphs (ICGs). Instead of taking a screenshot and checking whether it matches expectations, the benchmark specifies what observable states a page should reach, what transitions should occur when users interact with it, and what constraints must be satisfied across different parts of the interface. The benchmark then runs these tests in an actual browser to verify that the generated code behaves correctly.
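To make the idea concrete, here is a minimal sketch of a contract graph replayed against a toy in-memory page instead of a real browser. The names `Transition` and `ContractGraph`, the dictionary-based state, and the cart example are assumptions made for illustration; they are not WebRISE's actual data format or tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]  # observable page state, e.g. {"items": 2, "total": 20}
PRICE = 10              # hypothetical unit price for the toy cart


@dataclass
class Transition:
    name: str                                  # human-readable label
    action: Callable[[State], State]           # simulates a user interaction
    expected: Callable[[State, State], bool]   # predicate on (before, after)


@dataclass
class ContractGraph:
    transitions: List[Transition] = field(default_factory=list)
    constraints: List[Callable[[State], bool]] = field(default_factory=list)

    def run(self, initial: State) -> Dict[str, bool]:
        """Replay each transition in order and record pass/fail per transition."""
        results: Dict[str, bool] = {}
        state = dict(initial)
        for t in self.transitions:
            after = t.action(dict(state))  # act on a copy, keep `state` as "before"
            ok = t.expected(state, after) and all(c(after) for c in self.constraints)
            results[t.name] = ok
            state = after
        return results


def add_item_correct(s: State) -> State:
    s["items"] += 1
    s["total"] = s["items"] * PRICE
    return s


def add_item_buggy(s: State) -> State:
    s["items"] += 1  # looks fine on screen, but the total is never updated
    return s


def make_graph(add_item: Callable[[State], State]) -> ContractGraph:
    return ContractGraph(
        transitions=[
            Transition(
                "add_to_cart",
                add_item,
                lambda before, after: after["items"] == before["items"] + 1,
            )
        ],
        # Cross-component constraint: the total must always match the item count.
        constraints=[lambda s: s["total"] == s["items"] * PRICE],
    )


start = {"items": 0, "total": 0}
good = make_graph(add_item_correct).run(start)
bad = make_graph(add_item_buggy).run(start)
print(good, bad)
```

In this sketch both pages satisfy the explicit transition (the item count goes up), but only the correct one satisfies the cross-component constraint, which is the kind of failure a screenshot comparison would not necessarily reveal.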
The scale of WebRISE is substantial. It includes 5,495 transitions and 5,271 requirement checks across five different input modalities: text descriptions, Markdown specifications, sketches, images, and videos. This multimodal approach revealed an important finding: video input produces the strongest results, improving transition validity by 8.8 percentage points and requirement coverage by 8.3 percentage points compared to text-only specifications.
How to Improve Vision-Language Model Performance for Web Generation
- Use Multimodal Input: Provide specifications in multiple formats rather than text alone. Video input showed the strongest signal for helping models understand interactive requirements, improving requirement coverage by 8.3 percentage points compared to text-only specifications.
- Focus on Implicit Constraints: Spell out state-consistency requirements across components, such as filter-pagination synchronization and count updates after deletion. Models struggle most when such constraints are left unstated.
- Test Behavioral Conformance: Evaluate generated artifacts through actual browser execution and state transitions rather than visual inspection alone. This catches functional failures that visual assessment misses entirely.
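The last two recommendations can be illustrated with a small state-consistency check. The sketch below uses a hypothetical in-memory `ListView` in place of a real browser page; the class, its fields, and the `invariants_hold` function are illustrative assumptions and not part of WebRISE.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ListView:
    """Toy model of a filterable, paginated list (hypothetical)."""
    items: List[str]
    page_size: int = 2
    query: str = ""
    page: int = 1          # 1-based current page
    count_label: int = 0   # number shown in the UI, e.g. "3 results"

    def visible(self) -> List[str]:
        return [i for i in self.items if self.query in i]

    def page_count(self) -> int:
        return max(1, math.ceil(len(self.visible()) / self.page_size))

    def refresh(self) -> None:
        # Keep derived UI state in sync with the underlying data.
        self.count_label = len(self.visible())
        self.page = min(self.page, self.page_count())

    def set_filter(self, query: str) -> None:
        self.query = query
        self.refresh()

    def delete(self, item: str) -> None:
        self.items.remove(item)
        self.refresh()


def invariants_hold(view: ListView) -> bool:
    """Implicit constraints a user expects but a spec rarely states."""
    return (
        view.count_label == len(view.visible())
        and 1 <= view.page <= view.page_count()
    )


view = ListView(items=["apple", "apricot", "banana", "avocado", "blueberry"])
view.refresh()
view.page = 3                  # user navigates to the last page
view.set_filter("ap")          # filtering shrinks the list to a single page
after_filter = invariants_hold(view)
view.delete("apple")           # the count label must drop with the deletion
after_delete = invariants_hold(view)
```

A generated page that skipped the `refresh` step would still render a plausible list, yet it would leave the user on a page that no longer exists or show an outdated result count. Checking the invariant after each interaction is what exposes that.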
The research revealed a critical insight: visual quality is not a reliable proxy for functional correctness. One model, Qwen3.6-35B-A3B, achieved 80.8% visual accuracy on Markdown-based tasks but only 15.5% transition validity. This gap shows that a page can look largely correct while most of its interactions fail.
What Are the Biggest Obstacles for Vision-Language Models?
Implicit state constraints emerged as the most persistent bottleneck across all tested models. While explicit requirements stated in task descriptions are relatively easy for VLMs to satisfy, implicit constraints such as maintaining consistency across multiple interface elements remain challenging. These include edge cases, boundary conditions, error handling, and feedback mechanisms that users expect but that specifications do not always mention.
The benchmark's defect injection analysis provides additional evidence of the evaluation method's effectiveness. When researchers intentionally introduced errors into correctly generated pages, the ICG-based evaluation detected state-related defects at 16 to 25 times the rate of traditional checkpoint-style evaluation methods. This demonstrates that requirement-induced state testing catches real problems that simpler evaluation approaches miss.
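A toy comparison shows the intuition behind this result. The sketch below is not the researchers' procedure: the cart functions, the injected defect, and both evaluators are hypothetical, and the 16x to 25x figure comes from the study, not from this code. It only illustrates how a check on the final state can miss a defect that a per-transition check catches.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Step = Callable[[State], State]
PRICE = 10  # hypothetical unit price


def add(s: State) -> State:
    items = s["items"] + 1
    return {"items": items, "total": items * PRICE}


def remove(s: State) -> State:
    items = s["items"] - 1
    return {"items": items, "total": items * PRICE}


def remove_defective(s: State) -> State:
    # Injected defect: the item count drops but the total is left stale.
    return {"items": s["items"] - 1, "total": s["total"]}


def consistent(s: State) -> bool:
    return s["total"] == s["items"] * PRICE


def checkpoint_eval(script: List[Step]) -> bool:
    """Inspect only the final state after the whole script has run."""
    s = {"items": 0, "total": 0}
    for step in script:
        s = step(s)
    return consistent(s)


def transition_eval(script: List[Step]) -> bool:
    """Check the consistency constraint after every single transition."""
    s = {"items": 0, "total": 0}
    for step in script:
        s = step(s)
        if not consistent(s):
            return False
    return True


defective_script = [add, add, remove_defective, add]
checkpoint_passes = checkpoint_eval(defective_script)   # stale total is masked by the last add
transition_passes = transition_eval(defective_script)   # fails right after the defective remove
```

Here the final `add` recomputes the total, so the end state looks healthy and the checkpoint passes, while the transition-level check flags the stale total immediately after the removal.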
Even the strongest model tested, GPT-5.5, achieved only 66.3% requirement coverage under its best conditions. This suggests that web generation from visual specifications remains far from solved, despite significant advances in multimodal AI capabilities. The gap between visual understanding and functional implementation represents a fundamental challenge for current VLM architectures.
The WebRISE benchmark provides researchers and developers with a diagnostic tool to understand exactly where VLMs fail in web generation tasks. By separating visual quality from behavioral correctness and testing state transitions in actual browser environments, the benchmark offers a more realistic assessment of whether generated websites will actually work for real users. As VLMs continue to improve, this kind of rigorous, requirement-focused evaluation will be essential for building systems that don't just look good but function reliably.