Alibaba's HappyHorse-1.0 Brings Synced Audio and Video to AI Content Creation
Alibaba's new HappyHorse-1.0 video model generates 1080p video with synchronized audio, lip-sync across seven languages, and realistic detail.
AI video generation is shifting from single-model reliance to multi-model workflows.
NVIDIA's Nemotron 3 Nano Omni unifies vision, audio, and language in a single AI model, delivering 9x faster processing for agents.
Researchers solved a critical problem in robot AI: teaching vision-language models to control robots without erasing their reasoning abilities.
OpenAI discontinued its viral Sora video generator in March 2026 after deepfakes flooded the internet.
Audio-visual AI systems are breaking down barriers for people with disabilities. New research shows how multimodal AI, combining speech and vision, is enabling...
Vision language models are moving off cloud servers onto local devices, with Nota AI's edge-optimized system winning 2026 industry awards.
Developers building with GPT-4 Vision face a critical compatibility issue: fine-tuned models lose image processing abilities entirely, forcing teams to choose...
Chinese tech companies are cracking a fundamental AI problem: getting vision, speech, and text to work as one unified system instead of separate tools.
Open-weight vision language models like Qwen 3.6 and Gemma 4 are narrowing the gap with premium services like Claude Opus 4.7, excelling at document work and...
Researchers are combining audio and video AI to detect respiratory diseases at home while preserving privacy.
MLX-VLM brings vision-language model inference and fine-tuning to Apple Silicon Macs, while Narwal's Flow 2 robot vacuum demonstrates VLMs entering consumer...
Google Gemini reached 750 million monthly active users by early 2026, driven by deep integration into existing products rather than raw model performance.
Adobe redesigned color grading in Premiere Pro and integrated Kling 3.0 video generation, addressing how modern editors now handle tasks once requiring...
AI researchers are building foundation models that learn physics and physical interaction from real-world data, not internet text.
Google launches native Gemini app for Mac with instant access via keyboard shortcut, letting AI understand your screen context without switching windows.
Most enterprise video remains unsearchable despite being a goldmine of data. Here's why multimodal AI is finally making video as accessible as text.
YouTube is rolling out Google Veo-powered AI Avatar video generation in 2026, letting creators produce 5-10x more content by uploading a photo and script.
ShengShu launched Vidu Q3 Reference-to-Video, an AI model that generates high-quality videos with synchronized audio, cinematic effects, and consistent...
An international research team has formally defined what constitutes a true AI world model, explicitly excluding text-to-video generators like Sora and Veo.
Google launches Veo 3.1 Lite, a cost-effective video generation model that costs less than half as much as its counterpart while maintaining the same speed,...
Alibaba leads $290M investment in Shengshu's Vidu to build AI that understands the physical world, not just text.
Kling 3.0, Seedance 2.0, Sora 2 Pro, and Veo 3.1 launched within weeks of each other in early 2026, fundamentally reshaping video generation with native audio,...
Sand AI's Magi-1 model generates infinite-length videos using autoregressive technology, a capability closed-source rivals like Sora and Kling cannot match.
LG AI Research released EXAONE 4.5, a multimodal AI model that outperforms OpenAI's GPT-5-mini and Google's Gemma 4 on visual reasoning benchmarks.
Vision language models are capturing retiring workers' expertise before it's lost forever, transforming quality control in aerospace and automotive...
HappyHorse-1.0, a fully open-source AI video generator, has topped the world's most authoritative blind-test leaderboard, outperforming ByteDance's Seedance...
Despite 87% of enterprises using AI, a new benchmark study reveals only 19% are data-ready and 79% report no measurable financial impact.
OpenAI's Sora shutdown is reshaping the video generation market. New platforms like Cannon Studio are consolidating fragmented workflows, while competitors...
Vision language models are evolving beyond image recognition into multimodal systems that convert designs into code and enable real-time voice conversations.
Google's new Gemma 4 family adds audio, video, and image processing to open-source models small enough to run on consumer laptops and smartphones, shifting AI...
Vision language models like Claude, GPT-4V, and Gemini are moving beyond object detection to handle complex document analysis and real-world problem-solving.
Vision language models struggle with structured documents because they treat parsing as reasoning.
Researchers achieved 91.7% accuracy using multimodal AI to analyze text, images, and audio together, transforming how institutions manage historical records...
Google's Veo 3 Ultra can now generate 60-second videos in 4K resolution with advanced camera control and spatial audio.
A new safety evaluation reveals Kimi K2.5, a powerful open-weight AI model with 3.5 million downloads, was released without safety testing.
Researchers created a new AI system that uses psychology-inspired tasks to detect mental disorders with greater accuracy by capturing disorder-specific...
Vision language models are transforming how chemists interpret complex Markush structures in drug patents, combining image recognition with text analysis to...
Vision language models like GPT-4V and Claude now handle tasks that once required multiple specialized AI tools, but cost optimization and knowing when NOT to...
Vision language models fail to catch hateful memes because they miss the interplay between text and images.
Most AI video prompts fail because creators skip four critical layers beyond scene description.
Z.ai launched GLM-5V-Turbo, a specialized multimodal AI that reads design mockups, screenshots, and videos to generate code directly.
Microsoft AI released three foundational models for speech, voice, and video generation at significantly lower costs than competitors, signaling an aggressive...
Google's Gemma 4 and Anthropic's Claude Opus 4.6 bring multimodal AI and autonomous workflows to everyday devices.
Sony AI introduces SAVGBench, the first benchmark for spatially aligned audio-visual generation, addressing a critical gap in multimodal AI that could reshape...
YouTube creators are replacing expensive stock footage with AI-generated B-roll using tools like Google Veo 3, cutting production costs from hundreds to $10-30...
Multimodal AI fusion, a technique from autonomous vehicles, now detects emotional inconsistencies in mental health conversations by analyzing text, video, and...
Resemble AI launches free deepfake detection tools and reveals 1,567 verified incidents in 2025, with nearly $1.3 billion in confirmed fraud losses tied to...
Veo 3 outperforms five rival AI video tools at generating product demos, excelling with complex items like boots and cosmetics.
A new Google DeepMind Fellow is building AI systems for languages spoken by billions but largely ignored by tech.