From Reasoning to Agents: Why Alibaba's Qwen Chief Left to Reshape AI's Next Era
Alibaba's Qwen project is undergoing a fundamental shift in philosophy: moving away from pure reasoning models toward AI agents that can plan, act, and learn from real-world feedback. Junyang Lin, who stepped down as technical lead of Qwen on March 3, 2026, now publishes as an independent researcher and has outlined why this transition matters and what it requires from AI infrastructure.
What's the Difference Between Reasoning Models and Agentic AI?
For the past year, the AI field has been obsessed with reasoning models like OpenAI's o1 and DeepSeek-R1, which spend extra computational effort thinking through problems step-by-step before answering. These models excel at math, code, and logic puzzles because their success is measured by the quality of their internal reasoning.
Agentic thinking, by contrast, is judged by whether an AI system makes sustained progress while acting in an environment. An agent formulates plans, decides when to stop thinking and take action, chooses which tools to use, reads feedback from the environment, and revises its approach based on what it learns. The goal is not a perfect internal monologue, but real-world results.
"The next era is agentic thinking: thinking in order to act. An agent formulates plans, decides when to act, uses tools, reads environment feedback, and revises," explained Junyang Lin, former technical lead of Alibaba's Qwen project.
Junyang Lin, Former Technical Lead, Alibaba Qwen
Lin frames this as a clean break from the reasoning era. Where reasoning models are rewarded for spending more tokens on hard problems, agentic systems must handle challenges that pure reasoning can sidestep entirely.
What Challenges Do Agents Face That Reasoning Models Don't?
Agentic thinking must contend with real-world complexity that reasoning models never encounter. Lin lists the core challenges agents must solve:
- Deciding when to stop thinking: An agent must know when deliberation ends and action begins, rather than reasoning until a perfect answer emerges.
- Tool orchestration: Agents must choose which tool to invoke, in what order, and handle the consequences of each choice.
- Noisy observations: Environment feedback is often incomplete or contradictory, unlike the clean verification signals used to train reasoning models.
- Plan revision: Agents must adapt when initial plans fail, incorporating new information from failed attempts.
- Multi-turn coherence: Agents must maintain consistent goals and reasoning across many tool calls and environment interactions.
These demands reshape what engineers must optimize for. In the reasoning era, teams focused on data diversity and verification systems. In the agent era, Lin argues that environment quality becomes paramount: stability, realism, coverage, and resistance to reward hacking.
How Does Qwen3 Implement Hybrid Thinking?
Qwen3, Alibaba's latest model family, introduces hybrid thinking modes that let users toggle between step-by-step reasoning and near-instant responses. The model ranges from 0.6 billion parameters for lightweight tasks up to 235 billion parameters for complex reasoning, with quantized formats available under the Apache 2.0 open-source license.
The hybrid approach was harder to build than it appears. Lin explained that thinking mode and instruction-following mode pull in opposite directions: a strong instruction model is rewarded for brevity and low latency, while a strong thinking model is rewarded for spending more tokens on hard problems. Merging them carelessly degrades both.
Qwen3 solved this through a four-stage post-training pipeline that included long-chain-of-thought initialization, reasoning reinforcement learning, and a "thinking mode fusion" step. Later in 2025, the team shifted strategy and shipped separate Instruct and Thinking variants instead, treating it as a data problem rather than a model architecture problem.
The implementation is straightforward for developers. Users can enable thinking mode via a simple flag in the chat template, or append "/think" or "/no_think" to individual messages to control reasoning per turn. Qwen3 also supports dynamic thinking budgets, allowing callers to cap how much the model reasons before responding.
What Infrastructure Changes Are Required for Agent Training?
The shift from reasoning to agents demands a fundamental rethinking of AI infrastructure. In reasoning reinforcement learning, rollouts are mostly self-contained trajectories with clean evaluators. In agentic reinforcement learning, the policy lives inside a harness of tool servers, browsers, terminals, and sandboxes.
This creates a critical bottleneck: training and inference must be cleanly decoupled. Without separation, rollout throughput collapses. A coding agent waiting on live test execution stalls inference and starves training, causing GPU utilization to drop well below what reasoning reinforcement learning achieves.
Lin highlights three practical applications where agentic thinking reshapes how systems work:
- Coding agents: Rather than emitting a single code patch from a stack trace, an agentic system runs the test harness, reads the real error, revises, and re-runs until the test suite passes. Thinking here should focus on codebase navigation, error recovery, and tool orchestration.
- Deep research: Instead of writing a long answer from memory, an agentic system breaks the question into sub-queries, calls search, filters weak sources, and returns grounded citations. Qwen's own Deep Research demo exemplifies this approach.
- Multi-agent orchestration: Lin expects "harness engineering" to become critical. An orchestrator plans and routes work while specialized sub-agents execute narrower tasks and help prevent context pollution.
How Are Chinese AI Labs Approaching Open-Weight Models?
Qwen3 is released in multiple quantized formats, including GGUF, GPTQ, AWQ, and MLX, all under the Apache 2.0 open-source license. This reflects a broader trend among Chinese AI labs to release capable open-weight models that researchers and developers can run locally or customize.
The model family expanded multilingual support from 29 to 119 languages and dialects in Qwen3, signaling ambition to serve global users. The presentation compares Qwen models against contemporaries including DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI's o-series, demonstrating competitive performance across reasoning and instruction-following benchmarks.
Lin's departure and independent research status underscore a broader pattern: senior researchers at major AI labs are increasingly publishing their thinking publicly, shaping the field's direction outside of corporate constraints. His framing of the reasoning-to-agents transition has already influenced how the industry thinks about the next phase of AI development.
What Should AI Teams Focus on Now?
Lin's core argument is that the optimization target must shift. In the reasoning era, teams obsessed over rollout throughput, verification systems, and stable policy updates. In the agent era, the bottleneck moves to environment quality and reward hacking prevention.
The hardest problem, Lin notes, is reward hacking. Tool access enlarges the attack surface for spurious optimization, where an agent learns to game the reward signal rather than solve the actual task. This requires rethinking how environments are built, sandboxed, and monitored.
For teams building or deploying AI systems, the message is clear: the next generation of AI progress will not come from larger models or longer reasoning traces, but from systems that can plan, act, learn from failure, and maintain coherence across many turns of interaction with real environments.