DeepSeek R1 Gives AI a 'Cyber Finger' to Point at Objects, Solving a Problem Everyone Else Missed
DeepSeek has identified and solved a critical blind spot in how AI models understand images: they can see clearly, but they struggle to refer consistently to the same object while reasoning through complex visual tasks. Rather than follow competitors in making images sharper and higher-resolution, DeepSeek took a different path. The company developed a technique that gives AI models a "cyber finger" to point at specific locations in images, embedding spatial coordinates directly into the reasoning process.
Why Can't AI Models Just Use Words to Describe What They See?
When humans look at a crowded image, they can point and say, "That person right there." Language alone falls short for AI. If you tell a model "the dog on the left" in a photo with a dozen dogs, the model struggles to hold a stable reference to that specific dog as it reasons through the problem. This referential ambiguity compounds in complex spatial reasoning tasks like maze navigation or object counting, where the model can easily lose track of what it has already processed.
Most cutting-edge multimodal models from competitors such as OpenAI, Anthropic, and Google have tackled this challenge by increasing image resolution and introducing high-resolution cropping and multi-scale processing. The assumption has been straightforward: if the model can see more pixels and more detail, its visual reasoning will naturally improve. However, DeepSeek identified that this approach misses the root problem. Even with crystal-clear vision, logical breakdowns still occur when models must maintain spatial awareness across multiple reasoning steps.
How Does DeepSeek's "Visual Primitives" Approach Work?
DeepSeek's solution elevates two fundamental spatial markers from computer vision (bounding boxes and points) to the status of basic thinking units. The company calls this mechanism "point while it reasons." Instead of using boxes and coordinates only as a final output to show what the model found, DeepSeek embeds them directly into the reasoning chain itself. When the model thinks through a problem, it outputs not just language descriptions but also explicit spatial coordinates anchored to the image.
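To make this concrete, here is a minimal sketch of what a coordinate-anchored reasoning trace could look like and how the anchors can be recovered from it. The `<point .../>` tag format and the example sentence are illustrative assumptions, not DeepSeek's actual serialization:

```python
import re

# Hypothetical reasoning trace: prose interleaved with explicit spatial
# anchors. The <point x=... y=.../> tag is an assumed format for
# illustration; DeepSeek's real token scheme may differ.
trace = (
    "The leftmost dog sits at <point x=112 y=340/>. "
    "Comparing it with the dog at <point x=518 y=322/>, "
    "the first one is closer to the fence."
)

# Every visual reference in the chain resolves to exact pixel coordinates,
# so later reasoning steps can reuse the same anchors unambiguously.
POINT_RE = re.compile(r"<point x=(\d+) y=(\d+)/>")
anchors = [(int(x), int(y)) for x, y in POINT_RE.findall(trace)]

print(anchors)  # [(112, 340), (518, 322)]
```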
For example, when navigating a maze, the model starts from the starting point, explores different paths, backtracks, and tries again. As it reasons, it outputs a complete string of coordinate paths, with each coordinate corresponding to a specific point it has visited. This creates a traceable, verifiable reasoning process where every visual object has a clear spatial anchor point. The model cannot become confused about what it is referring to because each reference is tied to explicit coordinates in the image.
- Explicit Spatial Anchoring: Coordinates and bounding boxes become part of the reasoning text itself, not just auxiliary tools, making visual references unambiguous throughout the thinking process
- Transparent Reasoning: Unlike OpenAI's approach where visual processing happens internally, DeepSeek deliberately makes intermediate visual anchors explicit, allowing users to see and verify the complete reasoning chain
- Trainable Feedback Signals: The explicit coordinate outputs make it easier to design reward signals and provide detailed feedback on whether spatial reasoning is correct, such as whether a path covers all necessary points in a maze (a minimal reward sketch follows this list)
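Because the anchors are explicit, the third point above can be checked almost mechanically. Here is a minimal sketch of such a reward signal, assuming the maze is a simple grid and the model's coordinate trace has already been parsed into a list of (row, col) cells; both assumptions are illustrative, not DeepSeek's training setup:

```python
def maze_reward(path, walls, start, goal):
    """Score a coordinate path emitted in the reasoning chain.

    path:  list of (row, col) cells the model claims to visit, in order
    walls: set of (row, col) cells that are blocked
    Returns 1.0 for a valid start-to-goal path, otherwise 0.0.
    """
    if not path or path[0] != start or path[-1] != goal:
        return 0.0
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if (r2, c2) in walls:
            return 0.0  # stepped into a wall
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return 0.0  # jumped between non-adjacent cells
    return 1.0

# Toy 3x3 maze with a wall in the center; the path routes around it.
walls = {(1, 1)}
path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
print(maze_reward(path, walls, start=(0, 0), goal=(2, 2)))  # 1.0
```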
How Does This Compare to Competitors' Visual AI Strategies?
OpenAI's approach, highlighted in models like o3 and o4-mini, emphasizes "thinking with images." This means the model can incorporate images into its reasoning chain and manipulate them through cropping, zooming, and rotation. The focus is on making the image itself part of the thinking process, with the model generating and modifying images during reasoning. This direction prioritizes general capabilities across vision, code, search, and file handling, creating a powerful "visual workbench."
DeepSeek's route is more symbolic and explicit. By allowing coordinates to enter the thinking chain, the model writes bounding box and point coordinates directly into its reasoning text, turning visual objects into reusable anchors. This creates a fundamental difference in transparency: OpenAI's visual reasoning occurs internally with only the final answer visible to users, while DeepSeek deliberately exposes the intermediate visual anchors, making the entire reasoning process transparent and verifiable.
A particularly significant detail in DeepSeek's technical report reveals an efficiency advantage: when processing images, DeepSeek's model consumes far fewer tokens than competing cutting-edge models. That lets it handle visual reasoning tasks with lower computational overhead while maintaining reasoning quality.
What Does This Mean for Developers and AI Users?
For developers building applications that require precise visual understanding, DeepSeek's approach offers practical advantages. The explicit coordinate outputs make it easier to integrate visual reasoning into workflows where accuracy and traceability matter. Tasks like quality control, medical imaging analysis, or spatial planning benefit from reasoning that can be audited and verified step by step.
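A minimal sketch of such an audit step, assuming the model emits bounding boxes in a hypothetical `<box>x1,y1,x2,y2</box>` format; the parsing and overlay below are illustrative, not DeepSeek's API:

```python
import re
from PIL import Image, ImageDraw

# Assumed box tag for illustration; the real serialization may differ.
BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def audit_overlay(image_path: str, trace: str, out_path: str) -> None:
    """Draw every bounding box referenced in the reasoning chain so a
    human reviewer can verify each step against the actual image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, match in enumerate(BOX_RE.finditer(trace), start=1):
        x1, y1, x2, y2 = map(int, match.groups())
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), f"step {i}", fill="red")
    img.save(out_path)

trace = "The defect lies in <box>140,60,210,130</box>, near the weld seam."
# audit_overlay("inspection.jpg", trace, "inspection_audited.jpg")
```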
The efficiency gains also matter for cost-conscious deployments. Since the model uses fewer tokens to process images, inference costs drop compared to models that require higher token counts for the same visual reasoning tasks. This makes sophisticated visual AI more accessible to organizations with tighter budgets.
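As a rough illustration of the arithmetic (every number below is hypothetical, not a figure from DeepSeek's report), fewer tokens per image translate directly into lower input cost:

```python
# All figures are hypothetical, chosen only to show how the cost scales.
price_per_million_tokens = 0.50  # USD per million input tokens (assumed)
tokens_compact = 256             # tokens/image, coordinate-anchored model (assumed)
tokens_verbose = 1024            # tokens/image, higher-token competitor (assumed)
images = 100_000

def input_cost(tokens_per_image: int) -> float:
    return images * tokens_per_image / 1_000_000 * price_per_million_tokens

print(f"compact model: ${input_cost(tokens_compact):.2f}")   # $12.80
print(f"verbose model: ${input_cost(tokens_verbose):.2f}")   # $51.20
```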
Meanwhile, the broader AI community continues to explore different paths to visual reasoning. Some developers prefer local, open-source models for privacy and cost control. For instance, developers running open-weight models from the Qwen family locally report significant cost savings and the ability to maintain full control over sensitive documents and logs without relying on cloud-based AI services. DeepSeek's innovation in visual reasoning could eventually be adapted to these local models as well, extending the benefits beyond cloud-based systems.
The fundamental insight DeepSeek has highlighted is that the challenge of multimodal AI is not simply about perception, but about maintaining consistent, verifiable references during reasoning. By making spatial coordinates explicit and central to the thinking process, DeepSeek has addressed a problem that competitors have largely overlooked in their race to increase image resolution and processing power.