How AI Models Learn to See and Plan: A New Training Method Bridges the Vision Gap
A new training method called MGSD raises the success rate of AI vision models on spatial planning tasks by as much as 19.3 percentage points by teaching them to recover task-relevant structures from images before reasoning about actions. The approach, developed by researchers at Tsinghua University and The Hong Kong University of Science and Technology, tackles a fundamental challenge in AI: the gap between visual perception and logical reasoning.
Why Do Vision Models Struggle With Spatial Planning?
Vision-language models, the same AI systems that power image understanding in tools like ChatGPT, excel at general multimodal tasks. However, they consistently underperform when asked to plan movements through space, navigate grids, or interact with objects in visual scenes. The problem isn't a lack of intelligence; it's a structural mismatch in how the models process information.
When a model plans using symbolic data, it receives explicit information: "object A is at position X, object B is at position Y, and the goal is at position Z." The model can reason directly over these clear facts. But when planning from images, the model must first figure out what objects exist and where they are, then reason about how to move between them. This two-step process creates what researchers call a "perception-reasoning modality gap." Errors in the first step cascade into the second, making planning unreliable.
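To make the cascade concrete, here is a minimal sketch in Python. It is not taken from the MGSD paper: the 3x3 grid, the `plan` function, and the "missed obstacle" scenario are all hypothetical, chosen only to show how a planner that reasons correctly can still produce an unsafe route when the perceived state is wrong.

```python
from collections import deque

def plan(grid_size, start, goal, obstacles):
    """Breadth-first search over an explicit symbolic state.

    Returns a shortest path as a list of (row, col) cells, or None.
    """
    rows, cols = grid_size
    queue = deque([start])
    parent = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and nxt not in obstacles and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

# Ground-truth symbolic state (hypothetical toy grid).
true_obstacles = {(0, 1), (1, 1)}
# Hypothetical perception output that missed the obstacle at (1, 1).
perceived_obstacles = {(0, 1)}

path_true = plan((3, 3), (0, 0), (0, 2), true_obstacles)
path_perceived = plan((3, 3), (0, 0), (0, 2), perceived_obstacles)

# Cells of the "perceived" plan that collide with real obstacles.
hits = [cell for cell in path_perceived if cell in true_obstacles]
print(path_true)
print(path_perceived, "collides at", hits)
```

With the true state, the planner takes the long way around both obstacles. With one obstacle missed, the same planner returns a shorter route that runs straight through it. The reasoning step did nothing wrong in either case; the error came entirely from perception.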
How Does MGSD Train Models Differently?
MGSD, which stands for modality-gap-aware self-distillation, uses a two-stage training approach that separates the perception and reasoning problems. The framework first teaches the visual model to accurately recover task-relevant state structures from pixels, minimizing early perception errors. Then, it uses a symbolic teacher to guide the visual student model on planning decisions.
The key innovation is that symbolic data is used only during training. At inference time, the model operates purely on visual input, making it practical for real-world deployment. This design combines the reliability of symbolic planning with the flexibility of visual understanding.
The Two Training Stages, and What Happens at Inference
- Cold-Start Perception Alignment: The visual student model is trained with supervised fine-tuning to recover task-relevant state structures from images, producing initial predictions that align better with symbolic representations and reducing early perception noise.
- On-Policy Self-Distillation: The student generates its own rollout trajectories from images, while a frozen symbolic teacher provides dense token-level supervision based on the student's actual visual outputs, not fixed reference demonstrations.
- Inference-Time Efficiency: After training, symbolic data is discarded entirely, and the model operates on visual input alone, making the approach practical for deployment in real planning environments.
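The idea of "dense token-level supervision" in the second stage can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the article does not specify the loss, so the per-token KL divergence from student to teacher used here is an assumption borrowed from common on-policy distillation setups. The four-token action vocabulary and all logit values are invented for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_level_distill_loss(student_logits, teacher_logits):
    """Average per-token KL(student || teacher) along one rollout.

    Assumption: each teacher row is scored on the prefix the student
    actually generated, so every token position receives a signal.
    """
    per_token = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Hypothetical vocabulary: [UP, DOWN, LEFT, RIGHT]; three rollout positions.
student_logits = [
    [2.0, 0.0, 0.0, 0.0],   # student prefers UP
    [0.0, 0.0, 0.0, 2.0],   # student prefers RIGHT
    [0.0, 2.0, 0.0, 0.0],   # student prefers DOWN
]
teacher_logits = [
    [2.0, 0.0, 0.0, 0.0],   # teacher agrees
    [0.0, 0.0, 0.0, 2.0],   # teacher agrees
    [0.0, 0.0, 0.0, 2.0],   # teacher prefers RIGHT: disagreement
]

loss, per_token = token_level_distill_loss(student_logits, teacher_logits)
print(per_token, loss)
```

In this toy rollout the first two positions contribute zero loss and only the third, where the teacher disagrees, is penalized. That is the practical difference from a single pass/fail reward for the whole trajectory: the student learns which specific decision went wrong.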
What Do the Experimental Results Show?
The researchers tested MGSD on visual planning benchmarks covering three types of tasks: safe grid navigation, topology-aware path finding, and embodied object interaction. The improvements were substantial across different model sizes.
For a 4-billion-parameter model, MGSD raised the macro average success rate from 11.2% to 30.5%, a gain of 19.3 percentage points. For an 8-billion-parameter model, it rose from 17.2% to 35.6%, a gain of 18.4 points. These gains represent meaningful progress toward closing the gap between visual planning and symbolic planning, where models have access to explicit state information.
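A quick calculation shows how to read these figures. The snippet below uses only the four success rates reported above; the variable names are illustrative. It separates the absolute gain (percentage points) from the relative gain (how many times higher the new rate is), two measures that are easy to confuse.

```python
# Macro average success rates (before, after) as reported in the article.
results = {"4B": (11.2, 30.5), "8B": (17.2, 35.6)}

summary = {}
for name, (before, after) in results.items():
    point_gain = round(after - before, 1)   # absolute gain, percentage points
    ratio = round(after / before, 2)        # relative gain, as a multiple
    summary[name] = (point_gain, ratio)
    print(f"{name}: +{point_gain} points, {ratio}x the baseline")
```

This prints a gain of 19.3 points (2.72x the baseline) for the 4B model and 18.4 points (2.07x) for the 8B model. In other words, success rates roughly doubled or better, yet both models still fail on most tasks.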
Diagnostic analysis confirmed that improvements came from both better visual state recovery and stronger optimal-path reasoning. This suggests that MGSD successfully addresses both sides of the perception-reasoning modality gap, rather than fixing just one bottleneck.
Why Does This Matter for AI Development?
Visual spatial planning is increasingly important as AI systems move beyond text and into embodied tasks. Robots, autonomous vehicles, and AI agents that interact with physical environments all need to understand visual scenes and plan movements. Current vision-language models struggle with these tasks, limiting their practical applications.
The MGSD approach offers a practical solution that doesn't require new architectural innovations or massive increases in model size. Instead, it uses a smarter training strategy to extract better performance from existing models. The framework also demonstrates that leveraging symbolic supervision during training, while maintaining visual-only inference, can be a powerful way to improve AI reasoning.
The researchers have made their code publicly available, which means other teams can build on this work and apply the method to their own visual planning challenges. That openness could speed progress across the field, and if the results hold up elsewhere, modality-gap-aware training may become a common technique for improving multimodal AI systems.