Vision-Language Models Hit a Wall With Long Webpages: New Benchmark Reveals the Real Problem
Vision-language models can generate webpages that look correct but fail to function; LongWebBench exposes this critical gap across 490 real-world long...
111 articles
Multimodal AI hit websites, Android, and AR glasses this week, with one tool free to start and another open-source for XR developers.
Vision-language models now outperform traditional metrics at judging time series forecasts, aligning better with human preferences across finance.
AI is evolving from chatbot to digital colleague, using persistent workspaces and reusable skills to complete multi-step tasks autonomously rather than...
Neurovia AI compressed a 4K video by 96% while preserving machine learning quality, revealing data infrastructure as AI's real bottleneck.
AI video generation works best when filmmakers use custom-trained models and hybrid workflows, not generic prompts, Tribeca 2026 films reveal.
AI models miss one in three UX issues that experts catch instantly, but new research shows specialized training can boost design evaluation accuracy by...
AI-powered wearables now detect eating, smoking, and sun exposure in real-world conditions using cameras and sensors, moving beyond basic fitness tracking.
Perplexity's AI research system boosts accuracy from 40.7% to 83.8% by splitting complex questions across multiple specialized models instead of one.
New AI framework reads human intentions by processing speech, facial expressions, and body language simultaneously, achieving breakthrough social...
PixVerse Canvas launches as a visual workspace that organizes AI video projects on one canvas instead of scattered tabs and folders.
iPack uses vision-language models to teach robots human-like grocery packing skills, preventing damage by understanding product fragility without training.
AI video generation abandons consumer apps after Sora's $15M daily costs killed profitability, shifting to APIs and enterprise subscriptions instead.
Multimodal AI is moving beyond chatbots into supply chain operations, combining voice, images, and data to automate complex warehouse decisions.
Apple's new multimodal Siri AI can process voice, images, and text simultaneously, but arrives years after Google and Microsoft launched similar features.
Google's Gemini 3.5 Flash now powers search, Android, and shopping with 24/7 AI agents, making the assistant nearly inescapable across the web.
AI labs are adopting diffusion language models that generate text in parallel, delivering several-fold speedups over token-by-token systems like ChatGPT.
Google shrinks its Gemma 4 AI model to under 1GB using quantization-aware training, enabling powerful on-device AI that runs on phones and laptops.
xAI's Grok Imagine Video 1.5 claims the top AI video ranking with 52-point Elo jump, generating 15-second clips with synchronized native audio.
Vision-language models used for content moderation fail to detect 76% of unsafe images after just two simple photo edits, researchers found.
Cambridge researchers developed AI that detects when audio or video is missing from searches, boosting accuracy to 94.2% in broadcast archives.
Vision-language models combine text and image processing to answer questions about pictures, transforming how AI systems understand visual content.
AI video creators are switching from loose text prompts to structured JSON format, cutting failed generations and saving credits on Sora and Veo.
Microsoft's seven new multimodal AI models can be customized with company data, cutting costs by 10x while matching GPT performance.
Google's Veo 3 and Gemini Omni target different creative workflows, with one built for cinematic polish and the other for iterative development.
AI watermarking companies Resemble AI and Steg.ai take different embedding approaches as they compete to solve deepfake fraud, which surged 2,000%.
New AI video benchmark reveals current models fail at multi-shot storytelling, with even advanced systems like Sora 2 struggling with director-level...
Audio-visual AI models fail to reliably match speech to video, with most open-source systems performing no better than random guessing.
New AI system solves hospitals' biggest challenge: understanding messy medical reports that vary wildly between institutions and doctors.
Kling AI reaches 60 million users with browser-based video generation that creates 10-minute professional videos without expensive hardware or software.
AI detectors drop from 90% to below 80% accuracy on text-rich forgeries like fake screenshots, creating dangerous gaps in misinformation detection.
Vision-language models can now reconstruct editable 3D scenes from single photos using staged workflows that mirror professional artists.
New benchmark reveals AI speech models can make requested edits but struggle to preserve unchanged content, with even top models failing badly.
New AI detector FakeVLM-R1 uses detective-like reasoning to spot fake images, reducing false positives by understanding physics instead of memorizing.
YouTube now automatically detects and labels AI-generated video content, forcing creators to prepare for EU compliance by August 2026.
LongCat-Video-Avatar 1.5 generates realistic AI humans that maintain consistent identity across minutes of video, not just 10-second clips like earlier...
Runway beats Sora 2 and Veo 3 in video production by integrating AI generation with editing tools, motion capture, and workflow control in one platform.
New AI framework NAVA generates perfectly synced video and audio using just 6.3 billion parameters, solving multimodal AI's trickiest challenge.
Vision-language models are replacing traditional OCR by understanding document structure and context, boosting processing rates for messy inputs.
Researchers created an AI agent system that transforms one sentence into a complete short drama, solving major video generation consistency problems.
Researchers built unified AI that detects deepfakes across audio, video, and text simultaneously, closing gaps in current single-format detection tools.
AI vision models misclassify women-centric crimes like stalking and harassment in up to 65% of surveillance footage due to biased training data.
Multimodal AI catches 75% more brand crises than text-only tools by monitoring audio, video and images across 25+ platforms in real time.
Google's Omni Flash generates 10-second videos from text, images, and audio but deliberately caps duration for policy reasons, not technical limits.
UC Irvine researchers deployed multimodal AI systems in real dementia care facilities, using cameras, microphones, and sensors to detect agitation and prevent...
Researchers reveal how malicious prompts embedded in websites can hijack AI web agents.
AI can generate videos in seconds, but experts warn that faster production doesn't guarantee better storytelling.
Researchers built ReTool-Video, a system with 134 specialized tools that helps AI agents reason about videos by breaking down complex questions into executable...
New research reveals a critical gap in how vision-language models justify their answers.
Google's unannounced Gemini Omni video model leaked before I/O 2026, showing features like video remixing, chat-based editing, and improved text rendering.