Where a base model becomes useful — or harmful.
Post-training in 2026 is a multi-stage pipeline: SFT on high-quality instructions → preference optimization (DPO and variants) → RLVR (Reinforcement Learning from Verifiable Rewards) on math/code/agentic tasks → final safety polish. Constitutional AI and deliberative alignment shape behavior; distillation from R1/o-class reasoners injects long-CoT capabilities into smaller models.
A frontier-grade implementation, in order.
SFT: 10k–1M instruction-response pairs covering chat, code, math, tool use, and refusals. Quality > quantity (LIMO showed ~1k curated examples can suffice for reasoning).
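A minimal sketch of the core SFT mechanic, assuming you already have token IDs from your tokenizer (the helper name and the toy IDs below are illustrative): prompt tokens are masked out of the loss so the model is trained only on the response.

```python
import torch

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips positions with this label


def build_sft_example(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Pack one instruction-response pair, training only on the response tokens."""
    input_ids = prompt_ids + response_ids
    # Mask the prompt so gradient flows only through the assistant's reply.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }


# Example with made-up token IDs:
example = build_sft_example(prompt_ids=[1, 17, 42, 9], response_ids=[88, 23, 2])
```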
Preference optimization: DPO is the safe default. SimPO if you want to drop the reference model. KTO if you only have binary (good/bad) feedback rather than paired comparisons. Iterate 2–4 rounds on fresh preference data.
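The core of DPO fits in a few lines. A sketch of the loss, assuming sequence log-probabilities for the chosen and rejected responses have already been computed under the policy and a frozen reference model (all argument names here are placeholders):

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: log-ratio of policy to reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

SimPO swaps the reference log-ratios for length-normalized policy log-probabilities plus a target margin; KTO only needs a per-example desirable/undesirable label instead of paired preferences.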
RLVR: Math: AIME-style problems with automatic numeric-answer checking. Code: unit tests as the reward. Agents: task completion verified via tool execution. Optimize with GRPO or PPO.
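A sketch of the two pieces that make RLVR cheap, with hypothetical helper names: a programmatic reward (here, exact numeric-answer matching) and GRPO's group-relative advantage, which normalizes rewards across the samples drawn for the same prompt instead of learning a value network.

```python
import re
import statistics


def numeric_answer_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the last number in the completion equals the gold answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - float(gold_answer)) < 1e-6 else 0.0


def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Z-score rewards across the G completions sampled for one prompt."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in group_rewards]


# Eight samples for one prompt, half of which reached the right answer:
rewards = [numeric_answer_reward(c, "42")
           for c in ["so the result is 42", "41", "The answer is 42.", "7"] * 2]
advantages = grpo_advantages(rewards)
```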
Constitutional AI: self-critique and revision against a written constitution (Anthropic-style), or train Constitutional Classifiers to screen inputs and outputs.
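A sketch of the critique-and-revise loop, assuming a hypothetical generate(text) -> str wrapper around your current checkpoint and an illustrative two-principle constitution; the revised outputs become new fine-tuning or preference data.

```python
CONSTITUTION = [
    "Pick the response that least assists with harmful or illegal activity.",
    "Pick the response that is most honest about uncertainty and limitations.",
]


def critique_and_revise(generate, prompt: str) -> str:
    """One Constitutional-AI-style pass: draft, then critique and revise per principle."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response with respect to the principle."
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it satisfies the principle."
        )
    return response  # (prompt, response) becomes a new training example
```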
Deliberative alignment: for high-stakes deployments, train the model to reason over the safety spec inside its CoT before responding.
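One way to picture a deliberative-alignment training example, with an entirely illustrative spec excerpt and field names: the system turn carries the spec, and the target CoT cites it before the visible answer.

```python
SAFETY_SPEC_EXCERPT = (
    "Decline requests to produce working malware, but explaining how common "
    "vulnerability classes work is allowed."
)

# Hypothetical record layout for one deliberative-alignment training example.
training_example = {
    "system": f"Safety spec:\n{SAFETY_SPEC_EXCERPT}",
    "user": "What is a buffer overflow?",
    # Target CoT reasons over the spec before the answer is produced.
    "assistant_cot": (
        "The spec permits conceptual security education and forbids operational "
        "malware help. This question is conceptual, so answer normally."
    ),
    "assistant_answer": (
        "A buffer overflow happens when a program writes past the end of a "
        "fixed-size buffer, corrupting adjacent memory..."
    ),
}
```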
Red-teaming: recruit professional red-teamers, convert successful jailbreaks into preference data, and loop until attack success rates stabilize.
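A sketch of the conversion step, with assumed field names: each successful jailbreak becomes a preference pair whose rejected side is the jailbroken answer and whose chosen side is a reviewer-approved safe completion, ready to feed back into the DPO stage.

```python
def jailbreaks_to_preference_pairs(attack_log: list[dict]) -> list[dict]:
    """Turn red-team findings into DPO-style (prompt, chosen, rejected) triples."""
    pairs = []
    for attack in attack_log:
        if not attack["succeeded"]:
            continue  # only successful jailbreaks carry a useful training signal
        pairs.append({
            "prompt": attack["adversarial_prompt"],
            "chosen": attack["reviewed_safe_completion"],
            "rejected": attack["jailbroken_response"],
        })
    return pairs
```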
Key papers, in rough order of foundational → recent.
Training language models to follow instructions with human feedback · Ouyang et al. (OpenAI) · 2022
Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafailov et al. · 2023
SimPO: Simple Preference Optimization with a Reference-Free Reward · Meng et al. · 2024
KTO: Model Alignment as Prospect Theoretic Optimization · Ethayarajh et al. · 2024
ORPO: Monolithic Preference Optimization without Reference Model · Hong, Lee, Thorne · 2024
Constitutional AI: Harmlessness from AI Feedback · Bai et al. (Anthropic) · 2022
DeepSeek-AI · 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · DeepSeek-AI · 2025
Tülu 3: Pushing Frontiers in Open Language Model Post-Training · Allen AI · 2024
s1: Simple test-time scaling · Muennighoff et al. · 2025
LIMO: Less is More for Reasoning · Ye et al. · 2025
Deliberative Alignment: Reasoning Enables Safer Language Models · OpenAI · 2024
Constitutional Classifiers: Defending against Universal Jailbreaks · Anthropic · 2025