When you can't make the model bigger, make it think longer.
The 2025 breakthrough: train models with RL on verifiable rewards, give them an effectively unbounded scratchpad, and long chain-of-thought reasoning emerges on its own. OpenAI's o1/o3, DeepSeek-R1, Gemini 2.5 Pro (thinking), Claude Opus 4.x with extended thinking, and Kimi k1.5 are all trained this way. At inference, these models spend anywhere from ~1k to 100k+ thinking tokens before committing to a final answer.
A frontier-grade implementation proceeds in five stages, in order.
Stage 1: Cold-start SFT. Fine-tune on a few thousand high-quality reasoning traces (distilled from R1 or the o-series, or human-written). LIMO and s1 showed that roughly 1k carefully curated examples can be enough.
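As a concrete illustration, here is a minimal sketch of packing one distilled trace into a supervised training string. The `<think>...</think>` delimiters and the User/Assistant template are assumptions for illustration only; adapt to whatever chat template your base model actually uses.

```python
# Minimal sketch: format one distilled reasoning trace as an SFT training string.
# The <think>...</think> delimiters and the User/Assistant template are assumptions;
# swap in the chat template your base model expects.

def format_sft_example(question: str, reasoning_trace: str, final_answer: str) -> str:
    """Pack a (question, trace, answer) triple into a single supervised target."""
    return (
        f"User: {question}\n"
        f"Assistant: <think>\n{reasoning_trace}\n</think>\n"
        f"{final_answer}"
    )

print(format_sft_example(
    question="What is 17 * 24?",
    reasoning_trace="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    final_answer="408",
))
```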
Stage 2: RL on verifiable rewards. Train on tasks where correctness can be checked automatically: math problems with numeric answers, code with unit tests, formal proofs. GRPO with a purely rule-based reward works here; no learned reward model is needed.
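The core of GRPO is that each sample's advantage is computed relative to a group of samples for the same prompt rather than from a learned value or reward model. A minimal sketch, assuming a rule-based exact-match reward on a `####`-delimited final answer (a hypothetical convention); the clipped policy-gradient loss and KL penalty of full GRPO are omitted.

```python
# Minimal sketch of GRPO's group-relative advantages with a rule-based reward.
# Assumes the final answer follows a '####' marker (a hypothetical convention);
# the clipped policy-gradient loss and KL penalty of full GRPO are omitted.
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the text after the last '####' matches the gold answer, else 0.0."""
    predicted = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def group_relative_advantages(completions: list[str], gold_answer: str) -> list[float]:
    """Normalize each sample's reward against its own group (one prompt, G samples)."""
    rewards = [rule_based_reward(c, gold_answer) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # all-equal rewards give zero advantage anyway
    return [(r - mean) / std for r in rewards]

group = [
    "17 * 24 = 17 * 20 + 17 * 4 = 408 #### 408",
    "17 * 24 = 418 #### 418",
    "340 + 68 = 408 #### 408",
]
print(group_relative_advantages(group, "408"))  # correct samples get positive advantage
```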
Stage 3: Process reward model. Train a PRM on step-level labels from humans or an LLM judge (Math-Shepherd-style automatic labels are the cheap option). Use it to score partial solutions during tree search at inference.
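A minimal sketch of how a trained PRM might be used to rerank candidates. `prm_step_score` is a hypothetical stand-in for the step-level scorer; aggregating prefix scores with the minimum is one common choice (taking the product of step probabilities is another).

```python
# Minimal sketch of PRM-guided reranking. `prm_step_score` is a hypothetical
# stand-in for a trained step-level scorer that maps a prefix of solution steps
# to a probability that the latest step is correct.
from typing import Callable

StepScorer = Callable[[list[str]], float]

def score_solution(steps: list[str], prm_step_score: StepScorer) -> float:
    """Score every prefix and keep the minimum: the weakest step is the bottleneck."""
    scores = [prm_step_score(steps[: i + 1]) for i in range(len(steps))]
    return min(scores) if scores else 0.0

def rank_candidates(candidates: list[list[str]], prm_step_score: StepScorer) -> list[list[str]]:
    """Order candidate step sequences from most to least trusted by the PRM."""
    return sorted(candidates, key=lambda s: score_solution(s, prm_step_score), reverse=True)
```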
Stage 4: Test-time search. Escalate from self-consistency (majority vote over N samples) → Best-of-N reranked by the PRM → Tree-of-Thoughts → MCTS. Each rung buys quality with more compute.
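The two cheapest rungs are easy to sketch. Below, `sample_answer` and `verifier_score` are hypothetical stand-ins for a sampled model call and a PRM/verifier; Tree-of-Thoughts and MCTS add search orchestration on top of the same primitives.

```python
# Minimal sketch of the two cheapest search rungs: self-consistency (majority
# vote over N sampled answers) and Best-of-N reranking with a verifier score.
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int = 16) -> str:
    """Sample n final answers at nonzero temperature and return the most common one."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(samples: list[str], verifier_score: Callable[[str], float]) -> str:
    """Return whichever complete sample the verifier scores highest."""
    return max(samples, key=verifier_score)
```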
Stage 5: Budget control. Train on truncated and extended traces so quality degrades gracefully, expose a 'thinking_effort' or token-budget knob, and calibrate it against deployment latency targets.
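One way to enforce a budget at inference is s1-style truncation of the thinking stream. A minimal sketch, where `generate_tokens` is a hypothetical streaming decode call and `</think>` an assumed end-of-thinking marker; s1's full budget forcing also appends "Wait" to push past early stops, but only the truncation side is shown here.

```python
# Minimal sketch of an inference-time thinking budget, in the spirit of s1's
# budget forcing. `generate_tokens` is a hypothetical streaming decode call and
# "</think>" an assumed end-of-thinking marker; real serving stacks would
# enforce this inside the decoding loop instead.
from typing import Callable, Iterable

def think_with_budget(
    generate_tokens: Callable[[str], Iterable[str]],
    prompt: str,
    max_think_tokens: int,
    end_think_token: str = "</think>",
) -> str:
    """Collect thinking tokens until the model closes the trace or the budget runs out."""
    trace: list[str] = []
    for token in generate_tokens(prompt):
        if token == end_think_token or len(trace) >= max_think_tokens:
            break  # natural stop, or forced cutoff at the budget
        trace.append(token)
    return "".join(trace)
```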
The papers below run in rough order from foundational → recent; click any title to open the arXiv abstract.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · Wei et al. · 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models · Wang et al. · 2022
Tree of Thoughts: Deliberate Problem Solving with Large Language Models · Yao et al. · 2023
Let's Verify Step by Step · Lightman et al. (OpenAI) · 2023
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations · Wang et al. · 2023
Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters · Snell et al. · 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · DeepSeek-AI · 2025
Microsoft · 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs · Moonshot AI · 2025
s1: Simple Test-Time Scaling · Muennighoff et al. · 2025
STaR: Bootstrapping Reasoning With Reasoning · Zelikman et al. · 2022
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking · Zelikman et al. · 2024