How AI Is Learning to Generate Videos Tailored to Each User's Taste
Instead of showing users videos from an existing library, a new AI system generates personalized videos directly from what it learns about each person's interests, delivering measurable business results at massive scale. Researchers have developed a framework called Recommendation-as-Generation (RaG) that bridges two traditionally separate AI tasks: understanding what users want and creating videos that match those preferences. The system was tested on an industrial-scale platform with over 400 million daily active users and improved advertising revenue by up to 1.87% over the platform's existing production recommendation baseline.
What's the Problem With Today's Video Recommendation Systems?
Current short-video platforms rely on a content-first approach: videos are produced offline, and recommendation algorithms simply retrieve and rank the best matches from a fixed pool. This creates a fundamental limitation. If a user's interests fall outside what's already been produced, the system can't satisfy that preference. Modern users have increasingly dynamic, long-tailed, and diverse interests, meaning many people's tastes don't fit neatly into pre-made content. Traditional recommendation models can only show users the best available video from what exists, even when that video doesn't truly match what they want.
The challenge becomes even more severe on platforms with hundreds of millions of users. Each person has unique preferences, and creating custom content for everyone using traditional methods would be prohibitively expensive and time-consuming. This is where generative AI offers a new path forward.
How Does the New System Actually Work?
The RaG framework connects recommendation and video generation through an intermediate representation called Disentangled Semantic IDs (D-SIDs). Think of D-SIDs as a shared vocabulary between what users want and what the system creates. A multimodal large language model (LLM), which can process both text and images, analyzes each video and breaks it down into two separate components: content semantics (entities, topics, and subject matter) and creative semantics (style, rhythm, atmosphere, and artistic choices). Each component is then converted into discrete IDs that the system can work with.
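To make the idea of "converting semantics into discrete IDs" concrete, here is a minimal sketch. It assumes, purely for illustration, that each video already has two separate embedding vectors (one for content, one for creative style) and that IDs are assigned by nearest-neighbor lookup in a small codebook. The codebooks, vector values, and function names are hypothetical; the article does not describe how the actual system quantizes its embeddings.

```python
import math

# Hypothetical codebooks: each row is a prototype vector for one discrete ID.
CONTENT_CODEBOOK = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
CREATIVE_CODEBOOK = [[1.0, 1.0], [-1.0, 0.5]]


def nearest_id(vector, codebook):
    """Return the index of the codebook row closest to `vector` (Euclidean)."""
    distances = [math.dist(vector, row) for row in codebook]
    return distances.index(min(distances))


def encode_d_sid(content_vec, creative_vec):
    """Map two separate embeddings to a (content_id, creative_id) pair.

    Keeping the two lookups independent is what makes the ID "disentangled"
    in this toy version: content and style never share a codebook.
    """
    return (
        nearest_id(content_vec, CONTENT_CODEBOOK),
        nearest_id(creative_vec, CREATIVE_CODEBOOK),
    )


d_sid = encode_d_sid([0.1, 0.9], [-0.8, 0.4])
print(d_sid)
```

Because the two halves of the ID are assigned independently, the same content ID can in principle be paired with any creative ID, which is the property the rest of the pipeline relies on.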
On the recommendation side, a generative recommendation model predicts which D-SIDs match a user's interests. On the generation side, those predicted IDs are decoded back into actual videos. This unified approach lets the system understand fine-grained user preferences and translate them into controllable video generation.
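The recommendation side can be pictured with a deliberately simple stand-in. The sketch below is not the generative model the researchers describe; it is a frequency-based toy that assumes a user's history is a list of (content_id, creative_id) pairs and predicts each half independently. All names and IDs are hypothetical.

```python
from collections import Counter


def predict_d_sid(history):
    """Predict a (content_id, creative_id) pair from a user's watched D-SIDs.

    Toy stand-in: picks the most frequent content ID and the most frequent
    creative ID independently. A real generative recommender would be a
    learned sequence model, not a counter.
    """
    if not history:
        raise ValueError("history must contain at least one D-SID")
    content_counts = Counter(content_id for content_id, _ in history)
    creative_counts = Counter(creative_id for _, creative_id in history)
    return (
        content_counts.most_common(1)[0][0],
        creative_counts.most_common(1)[0][0],
    )


# Hypothetical viewing history as (content_id, creative_id) pairs.
history = [(3, 1), (3, 1), (5, 2), (7, 2), (8, 2)]
predicted = predict_d_sid(history)
print(predicted, predicted in history)
```

In this example the predicted pair combines the user's favorite topic with their favorite style even though no watched video had both. A retrieval system could only serve that combination if such a video already existed; a generation step can be asked to produce it.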
To actually create videos at scale, the system uses Video Generation Agents (VGAs). Rather than relying on expensive, monolithic video generation pipelines that require manual prompting and extensive post-processing, VGAs adopt a hierarchical planning approach. An instruction model first translates user-interest IDs into structured generation blueprints. Three specialized agents then work together, each handling different aspects: visual composition, audio alignment, and artistic effects. All three agents share a single LLM backbone and are trained together end-to-end, differentiated only through prompts and tool access.
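The "one backbone, three agents" arrangement can be sketched as follows. This is an illustrative outline only: the prompts, tool names, blueprint fields, and the `shared_backbone` placeholder are all assumptions, and the real system would call a trained LLM where this code merely echoes its inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """An agent is defined only by its prompt and the tools it may use."""
    name: str
    prompt: str
    tools: tuple


# Hypothetical prompts and tool names for the three roles named in the article.
AGENTS = (
    AgentSpec("visual", "Plan the visual composition.", ("storyboard",)),
    AgentSpec("audio", "Align audio with the visuals.", ("music_search",)),
    AgentSpec("effects", "Choose artistic effects.", ("filter_library",)),
)


def build_blueprint(d_sid):
    """Placeholder instruction model: turn a D-SID into a structured blueprint."""
    content_id, creative_id = d_sid
    return {"topic": f"content-{content_id}", "style": f"creative-{creative_id}"}


def shared_backbone(prompt, blueprint, tools):
    """Placeholder for the single shared LLM; here it only echoes its inputs."""
    return {"prompt": prompt, "topic": blueprint["topic"], "tools": list(tools)}


def run_agents(d_sid):
    """Run every agent through the same backbone, varying prompt and tools."""
    blueprint = build_blueprint(d_sid)
    return {
        agent.name: shared_backbone(agent.prompt, blueprint, agent.tools)
        for agent in AGENTS
    }


plan = run_agents((3, 2))
print(sorted(plan))
```

The point of the design is visible in `run_agents`: adding or changing an agent means editing a prompt and a tool list, not training or deploying another model.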
After the agents complete their work, a bounded reflection loop (capped at two iterations) refines cross-modal consistency, balancing output quality with generation efficiency. This design enables the system to serve recommendation requests for hundreds of millions of users without prohibitive computational costs.
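A bounded reflection loop is easy to express in code. The sketch below keeps the two-iteration cap from the article, but the consistency check and the revision step are hypothetical toys (an audio offset that is halved on each pass); they stand in for whatever cross-modal checks the real system performs.

```python
MAX_REFLECTIONS = 2  # cap stated in the article


def reflect(draft, is_consistent, revise, max_iterations=MAX_REFLECTIONS):
    """Revise `draft` until it passes the check or the cap is reached.

    Returns (final_draft, number_of_revisions). The draft is returned even
    if it still fails the check, which is the quality-for-speed trade-off.
    """
    revisions = 0
    while revisions < max_iterations and not is_consistent(draft):
        draft = revise(draft)
        revisions += 1
    return draft, revisions


# Toy check: audio counts as "in sync" when it is within 50 ms of the video.
def is_synced(draft):
    return abs(draft["audio_offset_ms"]) <= 50


# Toy revision: each pass halves the remaining offset.
def halve_offset(draft):
    return {**draft, "audio_offset_ms": draft["audio_offset_ms"] / 2}


final, used = reflect({"audio_offset_ms": 400}, is_synced, halve_offset)
print(final, used)
```

Starting from a 400 ms offset, the loop stops after two revisions at 100 ms even though the check still fails. That is the intended behavior of a bounded loop: the worst-case cost per video is fixed in advance.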
The Four Steps Behind Personalized Video Generation at Scale
- Semantic Encoding: A multimodal language model analyzes each video and extracts two types of information: what the video is about (content) and how it's presented (creative style), converting these into discrete IDs that machines can process efficiently.
- Interest Prediction: A generative recommendation model learns user preferences by predicting which semantic IDs align with each person's viewing history and behavior, enabling fine-grained interest modeling without retrieving from a fixed pool.
- Hierarchical Generation: Video Generation Agents use the predicted semantic IDs to create structured blueprints, then deploy specialized agents to handle visual composition, audio alignment, and artistic effects in parallel, reducing latency and computational overhead.
- Quality Optimization: A synergistic cross-domain reward learning mechanism balances user feedback, interest alignment, and video quality assessment, ensuring generated videos satisfy both users and business metrics like ad revenue.
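The last step above balances three signals. The article does not say how they are combined, so the sketch below assumes the simplest possible scheme: a weighted sum of scores that are each normalized to the range 0 to 1. The weights and the function name are hypothetical and chosen only for illustration.

```python
def combined_reward(user_feedback, interest_alignment, video_quality,
                    weights=(0.5, 0.3, 0.2)):
    """Blend three scores in [0, 1] into one reward using a weighted sum.

    Assumption: a linear combination with fixed weights. The actual reward
    learning mechanism is not specified in the article.
    """
    signals = (user_feedback, interest_alignment, video_quality)
    if any(not 0.0 <= score <= 1.0 for score in signals):
        raise ValueError("each signal must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * score for weight, score in zip(weights, signals))


# A polished video (quality 1.0) that only partly matches the user's interests.
reward = combined_reward(user_feedback=0.8, interest_alignment=0.5, video_quality=1.0)
print(round(reward, 2))
```

Even in this toy form, the trade-off is visible: a high quality score cannot fully compensate for weak interest alignment, because each signal contributes only its weighted share.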
What Results Did the Real-World Test Show?
The system was deployed on a production platform serving over 400 million daily active users in a revenue-critical advertising scenario. In online A/B tests, which randomly assign users to either the new system or the existing production system, RaG improved ad revenue by up to 1.87% over a strong production generative recommendation baseline. This may sound modest, but at the scale of hundreds of millions of users, even small percentage improvements translate to substantial business impact.
The improvement came from the system's ability to generate videos that more precisely matched user interests. Because the system wasn't limited to a fixed pool of pre-produced content, it could create videos that better satisfied long-tail user preferences. Users who saw personalized generated videos were more engaged, which translated to higher advertising performance.
Why Does This Matter Beyond Advertising?
The research demonstrates a broader principle: closed-loop generative systems, where user feedback continuously improves both recommendation and generation, represent a promising new paradigm for AI applications. Rather than treating recommendation and content creation as separate problems, integrating them into a unified framework enables more efficient and effective systems.
This approach has implications beyond short-video platforms. Any application that needs to match users with content, whether that's music streaming, news feeds, e-commerce product discovery, or educational content, could potentially benefit from similar closed-loop generative systems. The key insight is that when you can generate content on demand, you're no longer constrained by what was produced in the past.
The technical innovations also matter. Video Generation Agents represent a more efficient way to orchestrate complex AI tasks. By using a shared LLM backbone with specialized prompts and tool access rather than building separate monolithic models, the system reduces computational costs while maintaining quality. The system also uses a SID-indexed cache that amortizes generation cost across users, a further illustration of how industrial-scale AI systems must balance capability with efficiency.
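One way to picture a SID-indexed cache is as a lookup table keyed by the semantic ID, so that users whose predicted IDs coincide share a single generated video. The sketch below is an assumption about the general idea, not the production design; the class, the fake generator, and the request list are hypothetical.

```python
class SidCache:
    """Cache generated videos by D-SID so each ID is generated at most once."""

    def __init__(self, generate):
        self._generate = generate      # expensive generation function
        self._videos = {}              # maps D-SID tuple -> generated video
        self.generation_calls = 0      # how often we actually had to generate

    def get(self, d_sid):
        if d_sid not in self._videos:
            self._videos[d_sid] = self._generate(d_sid)
            self.generation_calls += 1
        return self._videos[d_sid]


def fake_generate(d_sid):
    """Stand-in for the video generation pipeline."""
    return f"video-for-{d_sid[0]}-{d_sid[1]}"


cache = SidCache(fake_generate)
requests = [(3, 2), (3, 2), (5, 1), (3, 2)]  # four users, two distinct D-SIDs
videos = [cache.get(d_sid) for d_sid in requests]
print(cache.generation_calls, len(videos))
```

Here four requests trigger only two generation calls. The more users share a D-SID, the further the cost of each generated video is spread, which is what "amortizing" means in this context.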
As AI-generated content becomes increasingly sophisticated, the question shifts from "Can we generate high-quality videos?" to "How do we generate the right videos for each person, at scale, efficiently?" This research provides one answer: by unifying recommendation and generation through a shared semantic interface and using hierarchical agent-based planning to manage complexity.