Why AI Assistants Need a Second Training Stage: The Alignment Problem Nobody Talks About
AI alignment is the work of steering a capable language model toward the goals and values its makers intend, teaching it to be helpful, harmless, and honest. The reason this matters is surprisingly simple: the way AI models are first built doesn't naturally produce trustworthy behavior. A base model learns to predict the next word in text, which makes it fluent and knowledgeable, but never teaches it to follow instructions or refuse a dangerous request.
Think of it like hiring a brilliant new employee who has read almost everything but has never been told what the job actually is. They can talk endlessly and convincingly, yet without guidance they won't reliably do what you need. A fresh language model works the same way. Capability and alignment are two separate things, which is why a model can be enormously capable and still not aligned.
What's the Difference Between Capability and Alignment?
Capability is how much a model can do: how fluently it writes, how much it knows, how hard a task it can attempt. Alignment is whether it actually pursues the goals and values its developers intend while it does those things. These two ideas sound similar but they come apart in practice. Adding more capability does not, on its own, make a model more helpful, harmless, or honest.
A base model learns one thing during pretraining: predict the next token across a large body of writing. That single goal builds fluency and broad knowledge, but it never points the model at following instructions or refusing a harmful request. So the behavior we want from an assistant is not a free side effect of pretraining; it has to be added through later stages such as supervised fine-tuning and reinforcement learning from human feedback, or RLHF.
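To make the pretraining objective concrete, here is a minimal sketch of a next-token loss in plain Python. The three-token vocabulary, the probability rows, and the function name next_token_loss are all made up for illustration; real models compute the same kind of quantity over very large vocabularies and corpora.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood of the true next token at each position."""
    total = 0.0
    for probs, target in zip(probs_per_step, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Hypothetical 3-token vocabulary. Each row is a predicted distribution
# over the next token; targets holds the token that actually came next.
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
targets = [0, 1]

loss_confident = next_token_loss(confident, targets)  # lower: good predictions
loss_uniform = next_token_loss(uniform, targets)      # higher: pure guessing
```

Nothing in this loss asks whether the text being predicted is helpful, safe, or true. It only rewards matching what came next, which is why assistant behavior has to be trained in afterward.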
How Do Researchers Turn a Base Model Into a Trustworthy Assistant?
Turning a base model into an assistant usually happens in three distinct stages, each building on the last. Understanding these stages reveals why alignment requires deliberate engineering rather than emerging naturally from scale.
- Pretraining: The model learns to predict the next word over a vast corpus, soaking up most of its raw knowledge, but not yet how to behave as a helpful assistant.
- Supervised Fine-Tuning (SFT): The model is shown demonstrations of good answers and learns to imitate them, so it starts responding in the style and manner developers want.
- Reinforcement Learning from Human Feedback (RLHF): The model is refined using people's judgments about which responses are better, narrowing the gap between "can predict text" and "behaves like a helpful assistant."
The order matters because each stage depends on the one before it. You cannot fine-tune a model you have not pretrained, and RLHF refines the model that supervised fine-tuning produced. Pretraining supplies knowledge, SFT teaches the basic behavior by imitation, and RLHF then sharpens that behavior using human preferences.
How Does RLHF Actually Work in Practice?
RLHF operates as a loop with several moving parts. The model generates multiple candidate answers to a prompt. A person then ranks them or picks the better one. A reward model learns to predict those preferences, turning human judgments into a scoring system. Finally, the policy, which is the language model being trained, is tuned to produce answers the reward model scores more highly, commonly with a reinforcement-learning algorithm called Proximal Policy Optimization, or PPO.
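The reward-model step can be sketched with a Bradley-Terry style pairwise loss, a common choice for preference data. The article does not specify a loss, so treat this as an assumption. In this toy version each response gets a single learnable score rather than a neural network's output, and the comparisons and the name fit_reward_scores are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_reward_scores(comparisons, num_items, lr=0.1, epochs=200):
    """Learn one score per response from (preferred, rejected) index pairs."""
    scores = [0.0] * num_items
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Probability the current scores assign to the human's choice.
            p = sigmoid(scores[winner] - scores[loser])
            # Gradient step on -log(p): raise the winner, lower the loser.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores

# Hypothetical human labels: (preferred index, rejected index).
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1)]
scores = fit_reward_scores(comparisons, num_items=3)
```

After fitting, response 0 scores highest and response 2 lowest, matching the rankings. A real reward model does the same thing but scores arbitrary text, so it can judge responses no person ever compared.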
Why ask people for preferences instead of perfect answers? Comparing two responses is easier and more reliable than writing an ideal answer for every prompt. The reward model generalizes from those rankings, and that preference signal, repeated across many comparisons, is what the policy is then tuned to satisfy. The result is better-aligned responses, and the loop can run again on fresh comparisons.
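The policy-tuning step can be illustrated with a deliberately simplified stand-in. The sketch below is not PPO. It uses a plain policy-gradient update on a toy policy that chooses among three canned responses, with made-up numbers standing in for a reward model's scores. The names policy_gradient_step and reward_scores are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # d E[reward] / d logit_i = p_i * (r_i - E[reward])
    return [l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)]

candidates = ["refuses politely", "answers helpfully", "rambles off topic"]
reward_scores = [0.2, 1.0, -0.5]  # hypothetical reward-model outputs
logits = [0.0, 0.0, 0.0]          # policy starts with no preference

initial_probs = softmax(logits)
for _ in range(50):
    logits = policy_gradient_step(logits, reward_scores)
final_probs = softmax(logits)
```

Probability mass shifts toward the response the reward scores favor. PPO pursues the same goal on a full language model while limiting how far each update can move the policy.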
What Are the Three Goals of Alignment?
Alignment research centers on three core objectives that define trustworthy AI behavior. Together they form a short summary of what alignment aims for.
- Helpful: The model genuinely serves the person asking, providing useful and relevant responses to their needs.
- Harmless: The model avoids harmful or dangerous outputs, refusing requests that could cause injury or damage.
- Honest: The model stays truthful rather than deceptive, providing accurate information and acknowledging uncertainty.
A highly capable base model is not guaranteed to be any of the three, which is the gap alignment sets out to close. This is why researchers cannot simply scale up models and expect better behavior; they must deliberately steer capability toward these three values.
Are There Simpler Alternatives to Traditional RLHF?
Classic RLHF has two major moving parts: a separate reward model and a reinforcement-learning loop. Researchers have since found ways to simplify or supplement it. Direct Preference Optimization, or DPO, skips both the separate reward model and the reinforcement-learning loop, tuning the model directly on the same preference comparisons. Constitutional AI takes a different approach: instead of relying heavily on people to label harmful outputs, it gives the model a written set of principles, called a "constitution," and uses AI-generated feedback, an approach often called RLAIF, or reinforcement learning from AI feedback.
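For readers who want to see what "tuning directly on preference comparisons" looks like, here is a minimal sketch of the DPO loss for a single comparison, following the formulation in the DPO paper. It assumes you already have the total log-probability of each response under both the model being trained and a frozen reference copy. The reference model and the beta parameter come from that formulation, not from this article, and the numeric log-probabilities are invented.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response is than under the reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written as log(1 + exp(-logit)).
    return math.log1p(math.exp(-logit))

# Before any training the policy equals the reference, so the loss is log(2).
loss_at_start = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# After the policy shifts toward the chosen response, the loss drops.
loss_improved = dpo_loss(-10.0, -13.0, -12.0, -11.0)
```

The loss falls when the model raises the preferred response's likelihood relative to the rejected one, so ordinary gradient descent on this quantity does the work that a reward model plus a reinforcement-learning loop does in classic RLHF.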
Both DPO and Constitutional AI still learn from preferences. DPO changes how the preferences are applied, and Constitutional AI changes where they come from. This diversity of approaches shows that alignment research is not locked into a single method but continues to evolve as researchers discover more efficient ways to steer model behavior toward intended goals.