Post-Training & Alignment

Where a base model becomes useful — or harmful.

Post-training in 2026 is a multi-stage pipeline: SFT on high-quality instructions → preference optimization (DPO and variants) → RLVR (Reinforcement Learning from Verifiable Rewards) on math/code/agentic tasks → final safety polish. Constitutional AI and deliberative alignment shape behavior; distillation from R1/o-class reasoners injects long-CoT capabilities into smaller models.

Why it matters

DeepSeek-R1-Zero showed pure RL on verifiable rewards elicits emergent long chain-of-thought — no SFT required.
DPO replaced PPO/RLHF as the dominant preference-optimization algorithm at most labs (cheaper, more stable).
S1 (Stanford, Jan 2025) and LIMO showed ~1,000 high-quality reasoning traces can match much larger SFT datasets.
Anthropic's Constitutional Classifiers (Jan 2025) brought principle-driven safety into deployed inference, not just training.

State of the art

2025-2026

GRPO (Group Relative Policy Optimization, from DeepSeekMath) replaced PPO as the dominant on-policy RL algo.
RLVR (Reinforcement Learning from Verifiable Rewards) — train on math/code where correctness is mechanically checkable.
Iterated DPO (Llama 3.x, Tülu 3) — multiple rounds of preference data collection + DPO refinement.
Deliberative alignment (OpenAI, Dec 2024) — model reasons over the safety spec at inference time before answering.
On-policy distillation: R1-Distill-Qwen-32B and similar pipelines transfer reasoning to smaller dense models.

The recipe

A frontier-grade implementation, in order.

SFT on diverse, high-quality data

10k-1M instruction-response pairs covering chat, code, math, tool use, refusal. Quality > quantity (LIMO showed 1k can suffice for reasoning).

Preference optimization

DPO is the safe default. SimPO if you want simpler hyperparameters. KTO if you only have binary feedback. Iterate 2-4 times.

RLVR for capabilities

Math: AIME-style problems with numeric answer checking. Code: unit tests. Agents: task completion via tool execution. GRPO or PPO.

Constitutional / safety pass

Self-critique against a written constitution (Anthropic-style). Or train a Constitutional Classifier on inputs/outputs.

Deliberative alignment

For high-stakes deployments, train the model to reason over the safety spec inside its CoT before responding.

Red-team & RLHF on failures

Recruit professional red-teamers. Convert successful jailbreaks into preference data. Loop until stable.

⚠️

Common pitfalls

Reward hacking: any verifier you write will be gamed. RLVR works best when correctness is mechanically checkable (compiler, unit tests, math evaluator).

DPO 'over-optimization' compresses output diversity — anneal beta upward late in training, or use SimPO/IPO.

Capability/safety tax: aggressive refusal training degrades helpfulness. Anthropic's HH-RLHF papers document this trade.

SFT data leakage: if your SFT set overlaps benchmarks, you're contaminating evaluations. Decontaminate against MMLU, GPQA, AIME, etc.