A New Training Method Lets Companies Build Smarter AI Models for a Fraction of the Cost
A new training approach called Reinforcement Learning with Self-Distillation (RLSD) allows enterprises to build custom reasoning AI models using significantly less computing power and training time than current methods. Researchers at JD.com and academic institutions found that RLSD outperforms traditional reinforcement learning and distillation techniques, achieving 56.18% average accuracy across multiple benchmarks while requiring roughly half the training steps of competing approaches.
Why Do Current AI Training Methods Fall Short for Most Companies?
Building AI models that can reason through complex problems has traditionally required either massive computational resources or compromises in performance. Most enterprise teams face a difficult choice: distill knowledge from expensive, large AI models, or rely on reinforcement learning techniques that provide only sparse feedback about whether the model got the final answer right or wrong.
The standard method, called Reinforcement Learning with Verifiable Rewards (RLVR), works by having an automated verifier check if a model's answer is correct, providing a simple binary reward of 0 or 1. However, this approach has a critical weakness. When a model generates thousands of tokens to solve a problem, it receives only a single reward signal for the entire reasoning chain. This means the model cannot learn which specific steps actually led to success or failure.
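To make that sparsity concrete, here is a minimal Python sketch of an RLVR-style reward, assuming a simple exact-match verifier; the function names and example values are illustrative and not taken from the paper.

```python
# Minimal RLVR-style reward: an automated verifier returns a single binary signal.
# The exact-match check below is an illustrative stand-in for a real verifier.
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# A multi-thousand-token reasoning trace still collapses to one scalar,
# and every token inherits that same credit.
trace_tokens = ["Let", "x", "be", "...", "so", "the", "answer", "is", "42"]
reward = verifiable_reward("42", "42")            # 1.0 or 0.0 for the whole trace
per_token_credit = [reward] * len(trace_tokens)   # identical credit for every token
```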
"Standard GRPO has a signal density problem. A multi-thousand-token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it's a pivotal logical step or a throwaway phrase," explained Chenxu Yang, co-author of the research paper.
An alternative approach called On-Policy Distillation (OPD) pairs a smaller student model with a larger teacher model, providing token-by-token feedback. But this method requires keeping a massive teacher model running throughout training, roughly doubling the computing power needed. Additionally, the teacher and student must share identical vocabulary structures, which limits flexibility for enterprises running different model architectures or multilingual systems.
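To illustrate why that dense feedback is costly and brittle, the sketch below shows a per-token distillation signal in the spirit of OPD. It is a hedged illustration written against an assumed logits interface, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative per-token distillation loss in the spirit of OPD:
# a KL divergence between teacher and student distributions at every token,
# which requires the large teacher to stay loaded throughout training and
# both models to share an identical vocabulary.
def opd_token_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    assert student_logits.shape[-1] == teacher_logits.shape[-1], \
        "teacher and student must share an identical vocabulary"
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # Dense, token-by-token feedback: KL(teacher || student) averaged over positions.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```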
How Does RLSD Solve the Training Efficiency Problem?
RLSD works by decoupling two distinct signals that guide how a model learns. The researchers realized that the signal determining the direction of learning (whether to reinforce or penalize a behavior) must be perfectly reliable, even if sparse. However, the signal determining how much credit each step deserves benefits from being extremely dense and detailed.
In RLSD, the verifiable environmental feedback from RLVR strictly determines the direction of learning. The model only receives overall reinforcement if the final answer is objectively correct. Meanwhile, a self-teacher version of the same model provides token-by-token assessment to determine the magnitude of the update, distributing credit or blame across individual reasoning steps. Crucially, the self-teacher cannot dictate what the model should generate; it only sharpens credit allocation.
This differs fundamentally from traditional self-distillation approaches, which force models to copy the exact wording and phrasing of a teacher, often causing hallucinations. RLSD instead tells the model which of its own tokens were actually doing the work on the path it chose.
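The paper's exact loss is not reproduced here, but the decoupling can be sketched in a few lines of Python. The baseline, weighting scheme, and function names below are assumptions chosen for clarity rather than the authors' formulation.

```python
import torch

def rlsd_token_weights(teacher_logprobs: torch.Tensor) -> torch.Tensor:
    # Tokens the frozen self-teacher also finds likely receive more credit or blame.
    # Normalizing to a mean of 1 means the weights redistribute the update across
    # tokens without changing its overall direction.
    probs = teacher_logprobs.exp()
    return probs / probs.mean()

def rlsd_style_loss(student_logprobs: torch.Tensor,
                    teacher_logprobs: torch.Tensor,
                    final_reward: float) -> torch.Tensor:
    # Direction comes strictly from the verifiable reward (a toy baseline of 0.5
    # is used here; GRPO-style training would use a group-relative advantage).
    advantage = final_reward - 0.5
    weights = rlsd_token_weights(teacher_logprobs).detach()
    # Magnitude per token is shaped by the self-teacher; it never dictates new tokens.
    return -(advantage * weights * student_logprobs).mean()
```

In this sketch, a wrong final answer flips the sign of the entire update, while the self-teacher only decides which tokens absorb most of it.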
How RLSD Improves Model Training
- Decoupled Learning Signals: RLSD separates the direction of learning (correct or incorrect) from the magnitude of learning (how much credit each step deserves), allowing sparse but reliable signals to guide overall direction while dense signals refine step-level understanding.
- Self-Teacher Architecture: The same model serves as both student and teacher, eliminating the need for a separate large teacher model and reducing computational overhead to just one extra forward pass per training example (see the sketch after this list).
- Credit Allocation Without Hallucination: Instead of forcing the model to imitate hidden solutions, RLSD distributes credit based on which tokens actually contributed to correct reasoning, preventing the model from inventing references to information it won't have during deployment.
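As a rough illustration of the self-teacher setup, the following sketch assumes a Hugging Face-style causal language model; the helper names are hypothetical, and the real implementation details belong to the paper.

```python
import copy
import torch

def make_self_teacher(student_model):
    # The self-teacher is simply a frozen snapshot of the student itself,
    # so no separate large teacher model has to be served during training.
    teacher = copy.deepcopy(student_model).eval()
    for param in teacher.parameters():
        param.requires_grad_(False)
    return teacher

def self_teacher_logprobs(teacher, input_ids: torch.Tensor) -> torch.Tensor:
    # One extra forward pass per response, with no gradient bookkeeping.
    with torch.no_grad():
        logits = teacher(input_ids=input_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Log-probability the frozen self-teacher assigns to each generated token,
    # which can feed the per-token credit weights described above.
    return logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```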
What Do the Benchmark Results Show About RLSD Performance?
The researchers tested RLSD by training the Qwen3-VL-8B vision-language model, an open-weight system with 8 billion parameters, and evaluating it on five visual reasoning benchmarks. These included MMMU for college-level multi-discipline questions, MathVista, MathVision, WeMath, and ZeroBench, a stress-test benchmark designed to be nearly impossible for current frontier models.
RLSD significantly outperformed every competing method. It achieved the highest average accuracy of 56.18% across all five benchmarks, beating the base model with no post-training by 4.69 percentage points and standard RLVR by 2.32 percentage points. The gains were most pronounced on complex mathematical reasoning tasks, where RLSD beat standard RLVR by 3.91 percentage points on the MathVision benchmark.
Beyond accuracy improvements, RLSD delivers massive efficiency gains. According to Yang, RLSD trained for 200 steps already beats GRPO trained for 400 steps, representing roughly a 2x convergence speedup. Cost-wise, the only computational overhead beyond a normal GRPO pipeline is one extra forward pass per response, making it far more efficient than maintaining a separate teacher model.
For enterprise teams building custom reasoning models tailored to specific business logic, RLSD lowers both the technical and financial barriers to entry. The approach eliminates the need to train complex auxiliary reward networks, manually annotate step-by-step training data, or maintain massive external teacher models. This makes advanced AI reasoning capabilities accessible to organizations without frontier-scale computing budgets.