The Forgetting Problem: How Robots Are Learning to Act Without Losing Their Minds
When artificial intelligence systems learn to control robots, they typically forget how to reason about the world. Researchers at Princeton University have discovered why this happens and developed a solution that preserves a model's foundational knowledge while teaching it to act in the physical world. The breakthrough, presented at ICLR 2026, addresses what scientists call "catastrophic forgetting" and could fundamentally change how robots learn from foundation models.
Why Do AI Models Forget When They Learn to Act?
Vision-language models (VLMs) like GPT-4V and Gemini Vision are trained on massive datasets of images and text from the internet. They develop a broad understanding of the world, recognizing objects, interpreting scenes, and following instructions in multiple languages. Naturally, researchers expected these models to make excellent foundations for teaching robots to perform physical tasks. But something unexpected happened during fine-tuning, the process of adapting a general model to a specific job.
When scientists fine-tuned VLMs to control robots, creating what's called a vision-language-action model (VLA), the models lost performance on their original skills. Visual reasoning abilities declined, multilingual instruction-following degraded, and the models struggled with open-world queries they could previously handle. This phenomenon, known as catastrophic forgetting, became a central obstacle to adapting foundation models for embodied use, meaning models that can interact with the physical world.
The root cause turned out to be a mismatch in how information is represented. VLMs are pretrained to reason and answer in natural language. Robot policies, however, must output continuous motor commands, which are numerical vectors describing movement and gripper state. Most existing approaches bridge this gap by assigning special tokens to represent actions or by adding separate action-generation modules. Both strategies introduce a distribution shift, meaning the model sees fundamentally different types of data during robot training than it encountered during its original internet-scale pretraining.
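To make the mismatch concrete, here is a minimal sketch of the "reserved action token" scheme the article alludes to; the bin count, token names, and 7-dimensional action layout are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative sketch (hypothetical values): a continuous 7-DoF command is
# binned and mapped onto reserved tokens that never appeared in the model's
# internet-scale pretraining data, which is where the distribution shift comes from.
def action_to_reserved_tokens(action, num_bins=256, low=-1.0, high=1.0):
    """Discretize each action dimension into a bin index, then name a reserved token."""
    action = np.clip(action, low, high)
    bins = np.floor((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return [f"<ACT_{dim}_{b}>" for dim, b in enumerate(bins)]

# Example: delta x/y/z, roll/pitch/yaw, gripper
print(action_to_reserved_tokens(np.array([0.12, -0.05, -0.4, 0.0, 0.0, 0.1, 1.0])))
# -> ['<ACT_0_142>', '<ACT_1_121>', ...]  tokens the pretrained VLM has never seen
```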
What If Robots Spoke the Same Language as AI Models?
Researchers at Princeton, led by Asher Hancock, proposed a simple but elegant solution: instead of changing the model to accommodate robot actions, change how actions are represented. Rather than predicting high-dimensional motor vectors, the model generates text describing what the robot should do. For example, instead of outputting numerical coordinates, the model might generate: "To complete the task, the robot must move forward and slightly left, then move significantly downward before closing the gripper to grasp the object," before producing the corresponding low-level commands to physically control the robot.
This approach, called VLM2VLA, keeps the action representation in language space, the format the model already understands from its pretraining. Because the model is working in familiar territory, fine-tuning can be done with LoRA, a parameter-efficient method that updates small low-rank weight matrices instead of modifying the entire network. The model learns to act without overwriting what it already knows.
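As a rough illustration of what parameter-efficient fine-tuning looks like in practice, the sketch below wraps a generic vision-language model with LoRA adapters using Hugging Face's peft library; the checkpoint name, rank, and target modules are placeholder choices, not the configuration used in the paper.

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

# Hypothetical base checkpoint; the paper's exact model and settings may differ.
base = AutoModelForVision2Seq.from_pretrained("some-org/some-vlm")
processor = AutoProcessor.from_pretrained("some-org/some-vlm")

# LoRA injects small low-rank update matrices into selected projection layers,
# leaving the original pretrained weights frozen.
lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which layers receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights
```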
Control is structured hierarchically: the model predicts a subtask, describes a spatial plan in language, and then generates the low-level action, all as text. Researchers automatically relabeled existing robot trajectories into this language-aligned format, converting demonstrations into training data that the model could understand.
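The relabeling step might look roughly like the sketch below, which maps a numeric end-effector delta onto a templated sentence; the thresholds, axis conventions, and wording are invented for illustration and are not the paper's actual relabeling pipeline.

```python
def describe_action(dx, dy, dz, gripper, eps=0.01, big=0.05):
    """Map a numeric end-effector delta onto a templated language description (illustrative)."""
    def phrase(value, pos_word, neg_word):
        if abs(value) <= eps:
            return None
        magnitude = "significantly" if abs(value) > big else "slightly"
        return f"{magnitude} {pos_word if value > 0 else neg_word}"

    parts = [p for p in (phrase(dx, "forward", "backward"),
                         phrase(dy, "left", "right"),
                         phrase(dz, "upward", "downward")) if p]
    move = ("move " + ", then move ".join(parts)) if parts else "hold its position"
    grip = "closing the gripper" if gripper < 0.5 else "keeping the gripper open"
    return f"To complete the task, the robot must {move} before {grip}."

print(describe_action(0.08, 0.02, -0.06, gripper=0.0))
# -> "To complete the task, the robot must move significantly forward, then move
#     slightly left, then move significantly downward before closing the gripper."
```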
How to Preserve AI Knowledge While Teaching Robot Skills
- Use Language-Based Action Representation: Express robot actions directly in natural language rather than special tokens or separate modules, keeping the model working in its native representation space.
- Apply Parameter-Efficient Fine-Tuning: Use LoRA or similar methods that update only small portions of the network, preventing the catastrophic forgetting that occurs when modifying the full model.
- Automatically Relabel Training Data: Convert existing robot demonstrations into language-aligned formats so the model learns from data that matches its pretraining distribution.
What Do the Real-World Results Show?
The results were striking. Across twelve multimodal understanding benchmarks, VLM2VLA retained over 85% of the base model's performance after fine-tuning. In contrast, conventional VLAs showed substantial drops after the same process. On more than 800 real-world trials with a 6-degree-of-freedom robotic arm, VLM2VLA matched baseline performance on standard manipulation tasks like picking up and placing objects.
The real payoff appeared in out-of-distribution settings, meaning tasks the robot was never trained on. When researchers tested multilingual instruction-following, they asked the robot to pick up a carrot using commands in Spanish, Mandarin, and Hindi. VLM2VLA significantly outperformed all baselines, correctly translating the instruction and identifying the target object among distractors.
In open-world semantic reasoning tests, researchers instructed the robot to "pick up the item above Ash Ketchum," requiring the system to recognize the famous Pokemon character, reason about spatial relationships, and manipulate the correct object. VLM2VLA achieved a 60% success rate; baselines performed near zero. This demonstrates that preserving pretrained knowledge directly translates to stronger embodied generalization, meaning the robot can handle situations it hasn't explicitly encountered before.
"By describing robot data in natural language, it becomes possible to add control capability without sacrificing multimodal understanding," explained Asher Hancock, lead researcher on the project.
Asher Hancock, Researcher at Princeton Laboratory for Artificial Intelligence Research
To isolate the role of language-based action representation, the team trained an otherwise identical model that encodes actions using low-likelihood reserved tokens instead of natural language. Both models used LoRA and trained on the same data. On simple tasks, performance was similar. But as reasoning demands increased, the language-based model pulled ahead. On the Ash Ketchum task, it achieved roughly twice the success rate of its token-based counterpart, suggesting that representation choice itself plays a key role in connecting world knowledge to physical action.
Why Does This Matter Beyond Robotics?
The implications extend far beyond robot manipulation. VLM2VLA shows that a representational shift in the training data can be enough to add new capabilities without sacrificing existing ones. Because robot interaction data is expressed in natural language, it can be mixed directly with standard VLM corpora, enabling models that reason, communicate, and act within a unified representation space.
This breakthrough addresses one of the fundamental challenges in AI development: how to build systems that are both capable and general-purpose. Rather than creating specialized models for each task, researchers can now adapt foundation models to new domains while preserving their broad knowledge. As embodied AI systems become more prevalent in manufacturing, healthcare, and research settings, the ability to maintain reasoning capabilities while adding physical control becomes increasingly valuable.