Build a Frontier Model
Step 7 of 11

Tool Use & Agents

From chatbot to autonomous worker.

Agents are LLMs in a loop: observe, plan, call a tool, observe the result, repeat. The 2025-2026 stack centers on function calling, the Model Context Protocol (MCP), browser/computer-use models (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), and verifiable evaluation harnesses (SWE-Bench, OSWorld, Tau-Bench). Single strong agents with good tools beat elaborate multi-agent orchestration for most tasks.

Why it matters

  • GPT-5.5 hits 78.7% on OSWorld-Verified and 82.7% on Terminal-Bench 2.0 — agents are crossing the threshold of practical usefulness.
  • Claude Opus 4.x leads MCP-Atlas at 77.3%, showing that tool use is now a primary axis of frontier competition.
  • MCP (Anthropic, Nov 2024) became the de facto integration protocol — adopted by OpenAI, Google, and most IDEs.
  • Coding agents now resolve 70-87% of real GitHub issues on SWE-Bench Verified, up from <5% just two years ago.

State of the art

2025-2026
  • Claude Opus 4.7 (Apr 2026) — 87.6% on SWE-Bench Verified, 70% on CursorBench; leads tool-use benchmarks.
  • OpenAI Operator (Jan 2025) and Computer Use models — agents that drive a browser/desktop directly.
  • Anthropic's 'Building effective agents' framework (Dec 2024) argued against premature multi-agent orchestration.
  • MCP servers as a standard ecosystem — file systems, databases, APIs, IDEs all expose tools via JSON-RPC.
  • Long-running agent loops with checkpointing/resumption — Devin-class systems run for hours autonomously.

The recipe

A frontier-grade implementation, in order.

1. Pick the loop

ReAct (Reason→Act→Observe) is the universal substrate. Reflexion adds self-correction. Plan-and-execute for long horizons.
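
A minimal sketch of the loop in Python, assuming a hypothetical call_model() that returns either a tool call or a final answer, and a tools dict mapping names to plain functions:

  def run_agent(task, tools, call_model, max_steps=20):
      # ReAct: the model alternates reasoning, tool calls, and observations until done.
      messages = [{"role": "user", "content": task}]
      for _ in range(max_steps):  # hard step cap (see the pitfalls below)
          action = call_model(messages, tools)  # hypothetical: {"type", "name", "arguments", "content"}
          if action["type"] == "final":
              return action["content"]  # the model decided it is finished
          observation = tools[action["name"]](**action["arguments"])  # act
          messages.append({"role": "assistant", "content": str(action)})
          messages.append({"role": "tool", "content": str(observation)})  # observe
      return "stopped: step budget exhausted"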

2. Tool design

Few well-designed tools beat many narrow ones. Each tool: clear name, JSON schema, error messages that teach the model.
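
A sketch of one tool in the common JSON-schema shape that current function-calling APIs accept (field names vary slightly by provider); the search_issues name and behavior are made up for illustration:

  search_issues_schema = {
      "name": "search_issues",
      "description": "Search open issues in the current repo. Returns at most 10 matches.",
      "parameters": {  # JSON Schema describing the arguments
          "type": "object",
          "properties": {
              "query": {"type": "string", "description": "Free-text search query"},
              "label": {"type": "string", "description": "Optional label filter, e.g. 'bug'"},
          },
          "required": ["query"],
      },
  }

  def search_issues(query, label=None):
      if not query.strip():
          # An error message that teaches: say what went wrong and what to send instead.
          return {"error": "query was empty; pass a short free-text query such as 'flaky timeout in CI'"}
      return {"matches": []}  # real lookup elided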

3. Adopt MCP

Wrap tools as MCP servers. Get IDE integrations, sharing across agents, and an inspector/debugger for free.
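
A minimal server sketch using FastMCP from the official Python SDK (the mcp package); treat the exact API as indicative and check the current SDK docs:

  from mcp.server.fastmcp import FastMCP

  mcp = FastMCP("notes")          # server name shown to clients
  NOTES = {"readme": "hello"}     # toy backing store for illustration

  @mcp.tool()
  def read_note(name: str) -> str:
      """Read a note by name."""
      return NOTES.get(name, f"no note named {name!r}; available: {list(NOTES)}")

  if __name__ == "__main__":
      mcp.run()                   # speaks MCP over stdio by default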

4. Train, don't just prompt

Frontier agents are RL-trained on tool use, not just prompted. Generate trajectories → score completions → GRPO/DPO.
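
A schematic of the data side only, with hypothetical rollout() and score() helpers; a GRPO/DPO trainer would consume these group-relative advantages in the policy update:

  import statistics

  def grpo_group(task, rollout, score, n=8):
      # Sample a group of trajectories for one task, then normalize rewards within the group.
      trajectories = [rollout(task) for _ in range(n)]
      rewards = [score(task, t) for t in trajectories]  # e.g. 1.0 if the tests pass
      mean = statistics.mean(rewards)
      std = statistics.pstdev(rewards) or 1.0           # avoid dividing by zero
      return [(t, (r - mean) / std) for t, r in zip(trajectories, rewards)]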

5. Test-time search

Best-of-N with a judge, or tree search over tool calls. Pays off on long-horizon tasks (SWE-Bench, OSWorld).
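
A best-of-N sketch with hypothetical rollout() and judge() helpers; tree search over individual tool calls follows the same pattern but branches inside the loop:

  def best_of_n(task, rollout, judge, n=5):
      # Sample N independent trajectories and keep the one the judge scores highest.
      candidates = [rollout(task) for _ in range(n)]
      return max(candidates, key=lambda c: judge(task, c))  # judge returns a float score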

6. Evaluate ruthlessly

SWE-Bench Verified for code, OSWorld for desktop, Tau-Bench for customer-service workflows, MCP-Atlas for tool use.
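
However the benchmark is packaged, the harness core is the same; a sketch assuming each task carries a prompt and a programmatic checker (in the spirit of SWE-Bench's test-based verification):

  def evaluate(tasks, agent):
      # Pass rate over a benchmark split; each task verifies its own outcome.
      passed = sum(1 for task in tasks if task["check"](agent(task["prompt"])))
      return passed / len(tasks)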

⚠️ Common pitfalls

  • Multi-agent orchestration adds variance faster than capability. Start single-agent.
  • Long context decay: agents forget early-loop instructions after 100k tokens. Use scratchpads + summarization.
  • Tool injection attacks (prompt injection via tool output) are unsolved — sandboxing > trusting model output.
  • Cost explodes super-linearly with loop depth. Set hard step/budget caps in production.
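
A tiny illustration of the last point; the cost numbers are hypothetical and would come from your provider's usage metadata:

  class Budget:
      # Hard caps on steps and spend; call charge() before every model call in the loop.
      def __init__(self, max_steps=30, max_usd=5.00):
          self.max_steps, self.max_usd = max_steps, max_usd
          self.steps, self.usd = 0, 0.0

      def charge(self, usd):
          self.steps += 1
          self.usd += usd
          if self.steps > self.max_steps or self.usd > self.max_usd:
              raise RuntimeError(f"budget exhausted after {self.steps} steps (${self.usd:.2f})")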