Build a Frontier Model
Step 7 of 11

Tool Use & Agents

From chatbot to autonomous worker.

Agents are LLMs in a loop: observe, plan, call a tool, observe the result, repeat. The 2025-2026 stack centers on function calling, the Model Context Protocol (MCP), browser/computer-use models (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), and verifiable evaluation harnesses (SWE-Bench, OSWorld, Tau-Bench). Single strong agents with good tools beat elaborate multi-agent orchestration for most tasks.

Why it matters

  • GPT-5.5 hits 78.7% on OSWorld-Verified and 82.7% on Terminal-Bench 2.0 — agents are crossing the threshold of practical usefulness.
  • Claude Opus 4.x leads MCP-Atlas at 77.3%, showing that tool use is now a primary axis of frontier competition.
  • MCP (Anthropic, Nov 2024) became the de facto integration protocol — adopted by OpenAI, Google, and most IDEs.
  • Coding agents now resolve 70-87% of real GitHub issues on SWE-Bench Verified, up from <5% just two years ago.

State of the art

2025-2026
  • Claude Opus 4.7 (Apr 2026) — 87.6% on SWE-Bench Verified, 70% on CursorBench; leads tool-use benchmarks.
  • OpenAI Operator (Jan 2025) and Computer Use models — agents that drive a browser/desktop directly.
  • Anthropic's 'Building effective agents' framework (Dec 2024) argued against premature multi-agent orchestration.
  • MCP servers as a standard ecosystem — file systems, databases, APIs, IDEs all expose tools via JSON-RPC.
  • Long-running agent loops with checkpointing/resumption — Devin-class systems run for hours autonomously.

The recipe

A frontier-grade implementation, in order.

1. Pick the loop

ReAct (Reason→Act→Observe) is the universal substrate. Reflexion adds self-correction. Plan-and-execute for long horizons.
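
A minimal sketch of the loop in Python, assuming a hypothetical call_model() that returns either a tool call or a final answer, and a tools dict mapping names to plain functions:

  def run_agent(task, tools, call_model, max_steps=20):
      # ReAct: the model alternates reasoning, tool calls, and observations until done.
      messages = [{"role": "user", "content": task}]
      for _ in range(max_steps):  # hard step cap (see the pitfalls below)
          action = call_model(messages, tools)  # hypothetical: {"type", "name", "arguments", "content"}
          if action["type"] == "final":
              return action["content"]  # the model decided it is finished
          observation = tools[action["name"]](**action["arguments"])  # act
          messages.append({"role": "assistant", "content": str(action)})
          messages.append({"role": "tool", "content": str(observation)})  # observe
      return "stopped: step budget exhausted"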

2. Tool design

Few well-designed tools beat many narrow ones. Each tool: clear name, JSON schema, error messages that teach the model.
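
A sketch of one tool in the common JSON-schema shape that current function-calling APIs accept (field names vary slightly by provider); the search_issues name and behavior are made up for illustration:

  search_issues_schema = {
      "name": "search_issues",
      "description": "Search open issues in the current repo. Returns at most 10 matches.",
      "parameters": {  # JSON Schema describing the arguments
          "type": "object",
          "properties": {
              "query": {"type": "string", "description": "Free-text search query"},
              "label": {"type": "string", "description": "Optional label filter, e.g. 'bug'"},
          },
          "required": ["query"],
      },
  }

  def search_issues(query, label=None):
      if not query.strip():
          # An error message that teaches: say what went wrong and what to send instead.
          return {"error": "query was empty; pass a short free-text query such as 'flaky timeout in CI'"}
      return {"matches": []}  # real lookup elided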

3. Adopt MCP

Wrap tools as MCP servers. Get IDE integrations, sharing across agents, and an inspector/debugger for free.
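
A minimal server sketch using FastMCP from the official Python SDK (the mcp package); treat the exact API as indicative and check the current SDK docs:

  from mcp.server.fastmcp import FastMCP

  mcp = FastMCP("notes")          # server name shown to clients
  NOTES = {"readme": "hello"}     # toy backing store for illustration

  @mcp.tool()
  def read_note(name: str) -> str:
      """Read a note by name."""
      return NOTES.get(name, f"no note named {name!r}; available: {list(NOTES)}")

  if __name__ == "__main__":
      mcp.run()                   # speaks MCP over stdio by default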

4. Train, don't just prompt

Frontier agents are RL-trained on tool use, not just prompted. Generate trajectories → score completions → GRPO/DPO.
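
A schematic of the data side only, with hypothetical rollout() and score() helpers; a GRPO/DPO trainer would consume these group-relative advantages in the policy update:

  import statistics

  def grpo_group(task, rollout, score, n=8):
      # Sample a group of trajectories for one task, then normalize rewards within the group.
      trajectories = [rollout(task) for _ in range(n)]
      rewards = [score(task, t) for t in trajectories]  # e.g. 1.0 if the tests pass
      mean = statistics.mean(rewards)
      std = statistics.pstdev(rewards) or 1.0           # avoid dividing by zero
      return [(t, (r - mean) / std) for t, r in zip(trajectories, rewards)]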

5. Test-time search

Best-of-N with a judge, or tree search over tool calls. Pays off on long-horizon tasks (SWE-Bench, OSWorld).
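
A best-of-N sketch with hypothetical rollout() and judge() helpers; tree search over individual tool calls follows the same pattern but branches inside the loop:

  def best_of_n(task, rollout, judge, n=5):
      # Sample N independent trajectories and keep the one the judge scores highest.
      candidates = [rollout(task) for _ in range(n)]
      return max(candidates, key=lambda c: judge(task, c))  # judge returns a float score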

6. Evaluate ruthlessly

SWE-Bench Verified for code, OSWorld for desktop, Tau-Bench for customer-service workflows, MCP-Atlas for tool use.
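
However the benchmark is packaged, the harness core is the same; a sketch assuming each task carries a prompt and a programmatic checker (in the spirit of SWE-Bench's test-based verification):

  def evaluate(tasks, agent):
      # Pass rate over a benchmark split; each task verifies its own outcome.
      passed = sum(1 for task in tasks if task["check"](agent(task["prompt"])))
      return passed / len(tasks)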

⚠️ Common pitfalls

  • Multi-agent orchestration adds variance faster than capability. Start single-agent.
  • Long context decay: agents forget early-loop instructions after 100k tokens. Use scratchpads + summarization.
  • Tool injection attacks (prompt injection via tool output) are unsolved — sandboxing > trusting model output.
  • Cost explodes super-linearly with loop depth. Set hard step/budget caps in production.
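
A tiny illustration of the last point; the cost numbers are hypothetical and would come from your provider's usage metadata:

  class Budget:
      # Hard caps on steps and spend; call charge() before every model call in the loop.
      def __init__(self, max_steps=30, max_usd=5.00):
          self.max_steps, self.max_usd = max_steps, max_usd
          self.steps, self.usd = 0, 0.0

      def charge(self, usd):
          self.steps += 1
          self.usd += usd
          if self.steps > self.max_steps or self.usd > self.max_usd:
              raise RuntimeError(f"budget exhausted after {self.steps} steps (${self.usd:.2f})")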