Build a Frontier Model
🎯
Step 5 of 11

Post-Training & Alignment

Where a base model becomes useful — or harmful.

Post-training in 2026 is a multi-stage pipeline: SFT on high-quality instructions → preference optimization (DPO and variants) → RLVR (Reinforcement Learning from Verifiable Rewards) on math/code/agentic tasks → final safety polish. Constitutional AI and deliberative alignment shape behavior; distillation from R1/o-class reasoners injects long-CoT capabilities into smaller models.

Why it matters

  • DeepSeek-R1-Zero showed pure RL on verifiable rewards elicits emergent long chain-of-thought — no SFT required.
  • DPO replaced PPO/RLHF as the dominant preference-optimization algorithm at most labs (cheaper, more stable).
  • S1 (Stanford, Jan 2025) and LIMO showed ~1,000 high-quality reasoning traces can match much larger SFT datasets.
  • Anthropic's Constitutional Classifiers (Jan 2025) brought principle-driven safety into deployed inference, not just training.

State of the art

2025-2026
  • GRPO (Group Relative Policy Optimization, from DeepSeekMath) replaced PPO as the dominant on-policy RL algo.
  • RLVR (Reinforcement Learning from Verifiable Rewards) — train on math/code where correctness is mechanically checkable.
  • Iterated DPO (Llama 3.x, Tülu 3) — multiple rounds of preference data collection + DPO refinement.
  • Deliberative alignment (OpenAI, Dec 2024) — model reasons over the safety spec at inference time before answering.
  • On-policy distillation: R1-Distill-Qwen-32B and similar pipelines transfer reasoning to smaller dense models.

The recipe

A frontier-grade implementation, in order.

1

SFT on diverse, high-quality data

10k-1M instruction-response pairs covering chat, code, math, tool use, refusal. Quality > quantity (LIMO showed 1k can suffice for reasoning).

2

Preference optimization

DPO is the safe default. SimPO if you want simpler hyperparameters. KTO if you only have binary feedback. Iterate 2-4 times.

3

RLVR for capabilities

Math: AIME-style problems with numeric answer checking. Code: unit tests. Agents: task completion via tool execution. GRPO or PPO.

4

Constitutional / safety pass

Self-critique against a written constitution (Anthropic-style). Or train a Constitutional Classifier on inputs/outputs.

5

Deliberative alignment

For high-stakes deployments, train the model to reason over the safety spec inside its CoT before responding.

6

Red-team & RLHF on failures

Recruit professional red-teamers. Convert successful jailbreaks into preference data. Loop until stable.

⚠️

Common pitfalls

Reward hacking: any verifier you write will be gamed. RLVR works best when correctness is mechanically checkable (compiler, unit tests, math evaluator).
DPO 'over-optimization' compresses output diversity — anneal beta upward late in training, or use SimPO/IPO.
Capability/safety tax: aggressive refusal training degrades helpfulness. Anthropic's HH-RLHF papers document this trade.
SFT data leakage: if your SFT set overlaps benchmarks, you're contaminating evaluations. Decontaminate against MMLU, GPQA, AIME, etc.