When you can't make the model bigger, make it think longer.
The 2025 breakthrough: train models with RL on verifiable rewards, give them an effectively unbounded scratchpad, and long chain-of-thought reasoning emerges on its own. OpenAI's o1/o3, DeepSeek-R1, Gemini 2.5 Pro (thinking), Claude Opus 4.x with extended thinking, and Kimi k1.5 are all trained this way. At inference, these models spend anywhere from ~1k to 100k+ thinking tokens before committing to a final answer.
A frontier-grade implementation proceeds in five stages, in order.
Stage 1: Cold-start SFT. Fine-tune on a few thousand high-quality reasoning traces (distilled from R1 or the o-series, or human-written). LIMO and s1 showed that roughly 1k carefully curated examples can be enough.
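As a concrete illustration, here is a minimal sketch of packing one distilled trace into a supervised training string. The `<think>...</think>` delimiters and the User/Assistant template are assumptions for illustration only; adapt to whatever chat template your base model actually uses.

```python
# Minimal sketch: format one distilled reasoning trace as an SFT training string.
# The <think>...</think> delimiters and the User/Assistant template are assumptions;
# swap in the chat template your base model expects.

def format_sft_example(question: str, reasoning_trace: str, final_answer: str) -> str:
    """Pack a (question, trace, answer) triple into a single supervised target."""
    return (
        f"User: {question}\n"
        f"Assistant: <think>\n{reasoning_trace}\n</think>\n"
        f"{final_answer}"
    )

print(format_sft_example(
    question="What is 17 * 24?",
    reasoning_trace="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    final_answer="408",
))
```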
Stage 2: RL on verifiable rewards. Train on tasks where correctness can be checked automatically: math problems with numeric answers, code with unit tests, formal proofs. GRPO with a purely rule-based reward works here; no learned reward model is needed.
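The core of GRPO is that each sample's advantage is computed relative to a group of samples for the same prompt rather than from a learned value or reward model. A minimal sketch, assuming a rule-based exact-match reward on a `####`-delimited final answer (a hypothetical convention); the clipped policy-gradient loss and KL penalty of full GRPO are omitted.

```python
# Minimal sketch of GRPO's group-relative advantages with a rule-based reward.
# Assumes the final answer follows a '####' marker (a hypothetical convention);
# the clipped policy-gradient loss and KL penalty of full GRPO are omitted.
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the text after the last '####' matches the gold answer, else 0.0."""
    predicted = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def group_relative_advantages(completions: list[str], gold_answer: str) -> list[float]:
    """Normalize each sample's reward against its own group (one prompt, G samples)."""
    rewards = [rule_based_reward(c, gold_answer) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # all-equal rewards give zero advantage anyway
    return [(r - mean) / std for r in rewards]

group = [
    "17 * 24 = 17 * 20 + 17 * 4 = 408 #### 408",
    "17 * 24 = 418 #### 418",
    "340 + 68 = 408 #### 408",
]
print(group_relative_advantages(group, "408"))  # correct samples get positive advantage
```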
Stage 3: Process reward model. Train a PRM on step-level labels from humans or an LLM judge (Math-Shepherd-style automatic labels are the cheap option). Use it to score partial solutions during tree search at inference.
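A minimal sketch of how a trained PRM might be used to rerank candidates. `prm_step_score` is a hypothetical stand-in for the step-level scorer; aggregating prefix scores with the minimum is one common choice (taking the product of step probabilities is another).

```python
# Minimal sketch of PRM-guided reranking. `prm_step_score` is a hypothetical
# stand-in for a trained step-level scorer that maps a prefix of solution steps
# to a probability that the latest step is correct.
from typing import Callable

StepScorer = Callable[[list[str]], float]

def score_solution(steps: list[str], prm_step_score: StepScorer) -> float:
    """Score every prefix and keep the minimum: the weakest step is the bottleneck."""
    scores = [prm_step_score(steps[: i + 1]) for i in range(len(steps))]
    return min(scores) if scores else 0.0

def rank_candidates(candidates: list[list[str]], prm_step_score: StepScorer) -> list[list[str]]:
    """Order candidate step sequences from most to least trusted by the PRM."""
    return sorted(candidates, key=lambda s: score_solution(s, prm_step_score), reverse=True)
```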
Stage 4: Test-time search. Escalate from self-consistency (majority vote over N samples) → Best-of-N reranked by the PRM → Tree-of-Thoughts → MCTS. Each rung buys quality with more compute.
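The two cheapest rungs are easy to sketch. Below, `sample_answer` and `verifier_score` are hypothetical stand-ins for a sampled model call and a PRM/verifier; Tree-of-Thoughts and MCTS add search orchestration on top of the same primitives.

```python
# Minimal sketch of the two cheapest search rungs: self-consistency (majority
# vote over N sampled answers) and Best-of-N reranking with a verifier score.
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int = 16) -> str:
    """Sample n final answers at nonzero temperature and return the most common one."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(samples: list[str], verifier_score: Callable[[str], float]) -> str:
    """Return whichever complete sample the verifier scores highest."""
    return max(samples, key=verifier_score)
```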
Stage 5: Budget control. Train on truncated and extended traces so quality degrades gracefully, expose a 'thinking_effort' or token-budget knob, and calibrate it against deployment latency targets.
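One way to enforce a budget at inference is s1-style truncation of the thinking stream. A minimal sketch, where `generate_tokens` is a hypothetical streaming decode call and `</think>` an assumed end-of-thinking marker; s1's full budget forcing also appends "Wait" to push past early stops, but only the truncation side is shown here.

```python
# Minimal sketch of an inference-time thinking budget, in the spirit of s1's
# budget forcing. `generate_tokens` is a hypothetical streaming decode call and
# "</think>" an assumed end-of-thinking marker; real serving stacks would
# enforce this inside the decoding loop instead.
from typing import Callable, Iterable

def think_with_budget(
    generate_tokens: Callable[[str], Iterable[str]],
    prompt: str,
    max_think_tokens: int,
    end_think_token: str = "</think>",
) -> str:
    """Collect thinking tokens until the model closes the trace or the budget runs out."""
    trace: list[str] = []
    for token in generate_tokens(prompt):
        if token == end_think_token or len(trace) >= max_think_tokens:
            break  # natural stop, or forced cutoff at the budget
        trace.append(token)
    return "".join(trace)
```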
The papers below run in rough order from foundational → recent; click any title to open the arXiv abstract.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · Wei et al. · 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models · Wang et al. · 2022
Tree of Thoughts: Deliberate Problem Solving with Large Language Models · Yao et al. · 2023
Let's Verify Step by Step · Lightman et al. (OpenAI) · 2023
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations · Wang et al. · 2023
Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters · Snell et al. · 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · DeepSeek-AI · 2025
Microsoft · 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs · Moonshot AI · 2025
s1: Simple Test-Time Scaling · Muennighoff et al. · 2025
STaR: Bootstrapping Reasoning With Reasoning · Zelikman et al. · 2022
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking · Zelikman et al. · 2024