AI Learns Human Judgment From Just a Few Videos, Reshaping How Robots Will Work
Researchers at South Korea's KAIST have tackled a fundamental challenge in physical AI: teaching machines to understand human judgment criteria from only a handful of video examples instead of thousands of hours of human feedback. The method, called VOTP (Video-based Optimal TransPort Preference), was selected for an oral presentation at the International Conference on Machine Learning (ICML) 2026, an honor given to only the top 0.7% of submitted papers.
Why Does This Matter for Robots and Autonomous Systems?
Until now, building AI systems that understand what humans want has required an enormous investment of time and resources. When a surgical robot performs suturing or an autonomous vehicle navigates a complex intersection, the AI must choose the most appropriate action from numerous options. To make that possible, people had to manually rate thousands to tens of thousands of example actions, and those ratings were then used to build what's called a "reward function," a scoring rule that reflects human preferences.
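To make the labeling burden concrete, here is a minimal Python sketch of how a reward function is commonly fit from human comparisons in preference-based reinforcement learning, using the Bradley-Terry model. It illustrates the conventional approach only and is not code from the KAIST paper; the linear reward, the two toy features, and all function names are assumptions made for this example.

```python
import math

def segment_return(weights, segment):
    """Sum of a linear reward (weights . features) over every step of a segment."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)

def preference_probability(weights, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    return 1.0 / (1.0 + math.exp(-diff))

def fit_reward(pairs, n_features, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on the negative log-likelihood.

    Each element of `pairs` is (preferred_segment, rejected_segment),
    i.e. one human judgment.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            p = preference_probability(weights, preferred, rejected)
            for i in range(n_features):
                # Difference in summed feature i between the two segments.
                phi_diff = sum(s[i] for s in preferred) - sum(s[i] for s in rejected)
                # Gradient ascent step on log p.
                weights[i] += lr * (1.0 - p) * phi_diff
    return weights

# Toy data (hypothetical): feature 0 = smooth motion, feature 1 = near-collision.
pairs = [
    ([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]),
    ([[0.8, 0.0], [0.9, 0.1]], [[0.2, 0.7], [0.1, 0.9]]),
]
weights = fit_reward(pairs, n_features=2)
print(weights)
```

Every entry in `pairs` stands for one human judgment, which is why conventional pipelines need so many of them before the learned reward becomes reliable.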
The KAIST team focused on how humans themselves learn new tasks: by watching just a few demonstrations of good and bad examples. VOTP applies this same principle to artificial intelligence. The algorithm allows machines to understand human-preferred action patterns from only a small number of videos, without requiring humans to evaluate vast amounts of data individually.
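The paper's title indicates that VOTP builds on optimal transport, a mathematical tool for measuring how cheaply one set of points can be matched to another. The article does not spell out the algorithm, so the following Python sketch only illustrates the general idea under stated assumptions: each video and each candidate behavior is already encoded as a short sequence of feature vectors, a standard entropy-regularized (Sinkhorn) solver compares them, and the behavior closer to a good demonstration is treated as preferred. The function names, the toy features, and the labeling rule are hypothetical and are not the authors' implementation.

```python
import math

def cost_matrix(traj, demo):
    """Squared Euclidean distance between every trajectory step and demo frame."""
    return [[math.dist(x, y) ** 2 for y in demo] for x in traj]

def sinkhorn_cost(traj, demo, epsilon=0.5, iterations=200):
    """Entropy-regularized optimal transport cost between two feature sequences.

    Assumes features are scaled so costs stay small relative to epsilon;
    otherwise exp(-cost / epsilon) can underflow to zero.
    """
    C = cost_matrix(traj, demo)
    n, m = len(traj), len(demo)
    K = [[math.exp(-C[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns toward uniform marginals.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * K[i][j] * v[j]; return its total cost.
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))

def pseudo_label(traj_a, traj_b, good_demo):
    """Hypothetical rule: prefer the trajectory that is cheaper to match to the demo."""
    cost_a = sinkhorn_cost(traj_a, good_demo)
    cost_b = sinkhorn_cost(traj_b, good_demo)
    return "a" if cost_a <= cost_b else "b"

# Toy 2-D features (hypothetical stand-ins for video embeddings).
good_demo = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
close = [[0.0, 0.1], [0.5, 0.4], [1.0, 0.9]]   # follows the demonstration
far = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]     # stays stuck in one state
print(pseudo_label(close, far, good_demo))
```

Labels produced this way could stand in for some of the human comparisons that a reward model would otherwise need, which is the kind of saving the article describes.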
"The core of physical AI is making machines understand human intentions and choose the correct actions. Since VOTP can learn human judgment criteria with only a small number of videos, it is a core technology that will accelerate the era of robots making human-like judgments," said Professor Chang D. Yoo.
Yoo is a professor in the School of Electrical Engineering at KAIST.
Where Could This Technology Be Applied?
- Robot Arm Control: Manufacturing robots can learn precise movement patterns and handling techniques by observing a few expert demonstrations, potentially cutting training time from months to weeks.
- Autonomous Vehicles: Self-driving systems can learn safe navigation and decision-making in complex traffic scenarios by studying video examples of human drivers handling similar situations.
- Surgical Robots: Medical robots performing delicate procedures can understand surgeon preferences and optimal techniques from limited video training data rather than requiring extensive manual programming.
- Smart Factories and Drones: Industrial automation systems can adapt to facility-specific workflows and safety protocols by learning from minimal demonstration footage.
- AI Agents: Software agents that directly operate computers can learn user preferences and workflow patterns from observing just a few examples of desired behavior.
The research team validated VOTP's effectiveness across various environments and tasks, demonstrating that the algorithm generalizes well beyond its training examples. This means a robot trained on a few videos of a task in one setting can apply what it learned to similar tasks in different environments.
What Are the Real-World Implications?
The practical impact could be substantial. If robots, autonomous vehicles, and industrial machinery can learn actions that meet human expectations from only a small number of examples, development time and costs could drop sharply. Companies might no longer need to spend months collecting and labeling thousands of training examples, and could instead deploy systems faster and more affordably.
The research was conducted with support from South Korea's Institute for Information and Communication Technology Planning and Evaluation (IITP) and the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT. The paper, titled "Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning," will be presented at ICML 2026, which is being held in Seoul this July.
This advancement represents a significant step toward what researchers call "physical AI," which moves beyond text and image generation into systems that control actual machines and act in the real world. As these technologies mature, the ability to learn human intentions efficiently will become increasingly critical for deploying robots, autonomous vehicles, and medical devices safely and effectively in real-world settings.