Build a Frontier Model
🧠
Step 6 of 11

Reasoning & Test-Time Compute

When you can't make the model bigger, make it think longer.

The 2025 breakthrough: train models with RL on verifiable rewards, give them an effectively unbounded scratchpad, and long chain-of-thought reasoning emerges. OpenAI's o1/o3, DeepSeek-R1, Gemini 2.5 Pro thinking, Claude Opus 4.x with extended thinking, and Kimi k1.5 all train this way. Inference now spends 1k-100k+ thinking tokens before producing a final answer.

Why it matters

  • GPT-5 scores 94.6% on AIME 2025 (vs ~13% for GPT-4), a jump driven almost entirely by test-time compute scaling.
  • Snell et al. (2024) formalized the trade-off: a smaller model given more inference compute can match a much larger pretrained model at the same total cost.
  • rStar-Math showed that a 7B model plus MCTS and a Process Reward Model can match o1-preview on AIME — without frontier-scale pretraining.
  • Reasoning models become the substrate for agents, scientific assistants, and provable-correctness use cases.

State of the art

2025-2026
  • OpenAI o3 + GPT-5 (2025) — RL-trained long CoT with budget control. GPT-5 Pro scores 88.4% on GPQA without tools.
  • DeepSeek-R1 (Jan 2025) — open-weight reasoning model rivaling o1, with a reported ~$5-6M training cost.
  • rStar-Math (Microsoft, Jan 2025) — 7B + MCTS + Process Reward Model matches o1-preview on AIME.
  • Process Reward Models (PRMs) score every step of a CoT — a denser, more stable RL signal than outcome-only rewards, and a ranker for inference-time search.
  • Budget-aware decoding (s1's 'Wait' injection, OpenAI's reasoning_effort) lets users dial latency vs accuracy — sketched below.
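A minimal sketch of that budget forcing, assuming a Hugging Face causal LM that reasons inside <think>...</think> tags; the checkpoint name, tag convention, and budgets are placeholders, not any vendor's API.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "my-org/reasoning-7b"   # hypothetical RL-trained checkpoint
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.bfloat16, device_map="auto"
    )

    def generate_with_budget(prompt: str, min_think: int, max_think: int) -> str:
        """s1-style budget forcing: veto early stops by swapping </think>
        for 'Wait,' until min_think tokens are spent, and force-close the
        scratchpad once max_think is exhausted."""
        text = prompt + "<think>\n"   # prompt assumed already chat-templated
        spent = 0
        while spent < max_think:
            ids = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
            out = model.generate(**ids, max_new_tokens=max_think - spent,
                                 stop_strings=["</think>"], tokenizer=tok)
            spent += out.shape[1] - ids["input_ids"].shape[1]
            text = tok.decode(out[0], skip_special_tokens=False)
            if "</think>" not in text:
                break                                  # stopped on EOS or budget mid-thought
            if spent >= min_think:
                break                                  # thought long enough, let it answer
            text = text.replace("</think>", "Wait,")   # the 'Wait' injection
        if "</think>" not in text:
            text += "\n</think>\n"                     # force the answer phase
        ids = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
        out = model.generate(**ids, max_new_tokens=512)
        return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

Hosted APIs expose the same dial declaratively (e.g. reasoning_effort); either way, thinking length becomes a serving-time knob rather than a fixed property of the weights.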

The recipe

A frontier-grade implementation, in order.

1

Cold-start reasoning

SFT on a few thousand high-quality reasoning traces (distilled from R1 or o-series models, or human-written). LIMO and s1 showed that ~1k can be enough.
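A cold-start sketch using TRL's SFTTrainer; the dataset file, base checkpoint, and hyperparameters are illustrative, and the <think> convention matches the decoding sketch above.

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # One chat per row, reasoning kept inside the assistant turn:
    # {"messages": [{"role": "user", "content": "<problem>"},
    #               {"role": "assistant",
    #                "content": "<think>...long CoT...</think>\nAnswer: ..."}]}
    ds = load_dataset("json", data_files="cold_start_traces.jsonl", split="train")

    trainer = SFTTrainer(
        model="my-org/base-7b",         # hypothetical pretrained base
        train_dataset=ds,
        args=SFTConfig(
            output_dir="cold-start-sft",
            num_train_epochs=3,         # tiny dataset, a few epochs is typical
            learning_rate=1e-5,
            packing=False,              # keep each long trace as one sequence
        ),
    )
    trainer.train()

This stage teaches format more than capability: open a scratchpad, reason inside it, close it, answer. That gives the RL stage something coherent to reinforce.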

2

RLVR on verifiable tasks

Math problems with numeric answers, code with unit tests, formal proofs. GRPO (Group Relative Policy Optimization) with a rule-based reward — no learned reward model.
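A sketch of both ingredients, assuming the model is prompted to end with 'Answer: <number>'; the function names are made up, and production verifiers do more robust extraction (e.g. parsing \boxed{...}).

    import re
    import statistics

    def math_reward(completion: str, gold: str) -> float:
        """Rule-based verifier: 1.0 iff the extracted final answer matches."""
        m = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
        return 1.0 if m and m.group(1) == gold else 0.0

    def grpo_advantages(rewards: list[float]) -> list[float]:
        """GRPO's core move: normalize each sample's reward against its
        group's mean/std -- no value network, no learned reward model."""
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards) or 1.0   # guard all-equal groups
        return [(r - mu) / sigma for r in rewards]

    # One prompt, a group of G = 4 samples, gold answer "42":
    rewards = [1.0, 0.0, 1.0, 0.0]      # from math_reward over 4 completions
    print(grpo_advantages(rewards))     # [1.0, -1.0, 1.0, -1.0]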

3

Process Reward Model (optional)

Train a PRM on step-level human or LLM-judge labels (Math-Shepherd-style). Use it to score and prune branches during tree search at inference.
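A scoring sketch, assuming a Math-Shepherd-style PRM packaged as a one-logit sequence classifier; the checkpoint name is a placeholder, and min-aggregation is one common choice (product over steps is the other).

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    PRM = "my-org/math-prm"   # hypothetical step-level reward model
    prm_tok = AutoTokenizer.from_pretrained(PRM)
    prm = AutoModelForSequenceClassification.from_pretrained(PRM)  # num_labels=1

    def score_trace(question: str, steps: list[str]) -> float:
        """Estimate P(correct) for each growing prefix of the CoT;
        min() means a single bad step sinks the whole trace."""
        probs = []
        for i in range(len(steps)):
            prefix = question + "\n" + "\n".join(steps[: i + 1])
            ids = prm_tok(prefix, return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs.append(torch.sigmoid(prm(**ids).logits)[0, 0].item())
        return min(probs)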

4

Inference search

Self-consistency (majority vote over N samples) → best-of-N with a PRM → Tree-of-Thoughts → MCTS. Each rung buys quality with more compute.
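The first two rungs fit in a few lines; here `answers` come from N temperature-sampled decodes (elided), and score_trace is the PRM scorer sketched in step 3.

    from collections import Counter

    def self_consistency(answers: list[str]) -> str:
        """Majority vote over N final answers sampled at temperature > 0."""
        return Counter(answers).most_common(1)[0][0]

    def best_of_n(question: str, traces: list[list[str]], answers: list[str]) -> str:
        """Best-of-N: return the answer whose reasoning trace the PRM rates highest."""
        scores = [score_trace(question, steps) for steps in traces]
        return answers[max(range(len(scores)), key=scores.__getitem__)]

Tree-of-Thoughts and MCTS apply the same scoring to partial chains, expanding promising branches instead of only ranking finished ones.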

5

Budget control

Train on truncated/extended traces. Expose a 'thinking_effort' or token budget knob. Calibrate against deployment latency targets.
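A serving-side sketch of that knob, reusing generate_with_budget from the budget-forcing example above; the effort tiers and token numbers are placeholders to be calibrated against your latency targets.

    # (min_think, max_think) token budgets per effort tier -- tune until
    # each tier's p95 latency fits its deployment target.
    THINKING_BUDGETS = {
        "low":    (0,     1_024),
        "medium": (1_024, 8_192),
        "high":   (4_096, 32_768),
    }

    def answer(prompt: str, effort: str = "medium") -> str:
        lo, hi = THINKING_BUDGETS[effort]
        return generate_with_budget(prompt, min_think=lo, max_think=hi)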

⚠️

Common pitfalls

  • Reasoning models 'overthink' simple queries — wrap them with a router that picks short-CoT vs long-CoT mode (a sketch follows this list).
  • PRMs invite reward hacking: the model learns to write convincing-looking but wrong steps. Validate against held-out outcome accuracy.
  • Long CoT leaks intermediate hypotheses — strip <think> content before user-facing output if confidentiality matters.
  • Test-time compute scales sublinearly above ~10k thinking tokens. Budget accordingly; don't burn 100k tokens for a 1% gain.
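A toy router for the first pitfall, reusing the answer() knob from the budget-control sketch; a production router would be a small trained classifier or a logprob-based difficulty probe, and the keyword list here is purely illustrative.

    HARD_HINTS = ("prove", "integral", "optimal", "edge case", "why does")

    def looks_hard(query: str) -> bool:
        """Stand-in difficulty probe; swap in a real classifier in production."""
        return any(h in query.lower() for h in HARD_HINTS) or len(query) > 400

    def answer_routed(query: str) -> str:
        # Easy queries skip the long scratchpad entirely -- no 'overthinking'.
        return answer(query, effort="high" if looks_hard(query) else "low")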