From chatbot to autonomous worker.
Agents are LLMs in a loop: observe, plan, call a tool, observe the result, repeat. The 2025-2026 stack centers on function calling, the Model Context Protocol (MCP), browser/computer-use models (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), and verifiable evaluation harnesses (SWE-Bench, OSWorld, Tau-Bench). Single strong agents with good tools beat elaborate multi-agent orchestration for most tasks.
The ingredients of a frontier-grade implementation, in order.
ReAct (Reason→Act→Observe) is the universal substrate. Reflexion adds self-correction. Plan-and-execute for long horizons.
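The ReAct substrate is a short loop: the model either emits a tool call or a final answer; tool results are fed back as observations. A minimal sketch, where `call_llm`, `TOOLS`, and the message format are hypothetical stand-ins for a real model API and tool registry:

```python
# Minimal ReAct loop sketch. `call_llm` and TOOLS are hypothetical stand-ins
# for a real model endpoint and tool registry.
TOOLS = {"add": lambda args: args["a"] + args["b"]}  # toy tool

def call_llm(messages):
    # Stand-in: first turn plans a tool call, second turn answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": str(messages[-1]["content"])}

def react_loop(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # observe → plan → act → observe
        action = call_llm(messages)
        if "final" in action:                  # model decides it is done
            return action["final"]
        result = TOOLS[action["tool"]](action["args"])        # act
        messages.append({"role": "tool", "content": result})  # observe
    return "max steps reached"
```

Reflexion and plan-and-execute are variations on this loop: Reflexion appends a self-critique message after a failed episode; plan-and-execute emits the full plan up front and runs each step through the same act/observe cycle.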
Few well-designed tools beat many narrow ones. Each tool: clear name, JSON schema, error messages that teach the model.
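What "a tool the model can use" looks like in practice, sketched as an OpenAI-style function schema (the tool name, fields, and enum values here are invented for illustration):

```json
{
  "name": "search_orders",
  "description": "Look up a customer's orders by email. Returns at most the 10 most recent orders.",
  "parameters": {
    "type": "object",
    "properties": {
      "email": {"type": "string", "description": "Customer email address"},
      "status": {"type": "string", "enum": ["open", "shipped", "cancelled"]}
    },
    "required": ["email"]
  }
}
```

An error message that teaches is actionable text the model can recover from, e.g. returning "no customer found for that email; call lookup_customer to resolve the email first" rather than a bare 404.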
Wrap tools as MCP servers to get IDE integrations, reuse across agents, and an inspector/debugger for free.
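Under the hood, MCP is JSON-RPC 2.0: clients discover tools with `tools/list` and invoke them with `tools/call`. A sketch of the wire exchange (the tool name and arguments are illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "tools/call",
 "params": {"name": "search_orders", "arguments": {"email": "jane@example.com"}}}
```

The server responds with a result object containing `content` blocks (text, images, or structured data), which the host feeds back to the model as the tool observation.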
Frontier agents are RL-trained on tool use, not just prompted. Generate trajectories → score completions → GRPO/DPO.
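The "score completions" step in GRPO normalizes rewards within a group of N completions sampled from the same prompt, so the advantage signal is relative to the group rather than an absolute baseline. A minimal sketch of that normalization:

```python
# GRPO-style group advantage sketch: rewards for N completions of the
# same prompt are normalized to zero mean, unit variance within the group.
import statistics

def group_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

Trajectories with above-group-average reward get positive advantage and are reinforced; for DPO, the same scored group instead yields (chosen, rejected) preference pairs.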
Best-of-N with a judge, or tree search over tool calls. Pays off on long-horizon tasks (SWE-Bench, OSWorld).
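Best-of-N is the simplest inference-time scaling knob: sample N candidate trajectories, score each with a judge, keep the best. A sketch with hypothetical `generate` and `judge` callables:

```python
# Best-of-N sketch: `generate` and `judge` are hypothetical stand-ins for
# a sampling call and a judge model (or verifier) returning a scalar score.
def best_of_n(task, generate, judge, n=4):
    candidates = [generate(task, seed=i) for i in range(n)]
    return max(candidates, key=judge)
```

Tree search generalizes this by judging and branching at each tool call rather than only at the end, which is why it pays off most on long-horizon benchmarks.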
SWE-Bench Verified for code, OSWorld for desktop, Tau-Bench for customer-service workflows, MCP-Atlas for tool use.
In rough order from foundational to recent; each title links to its arXiv abstract.
ReAct: Synergizing Reasoning and Acting in Language Models · Yao et al. · 2022
Reflexion: Language Agents with Verbal Reinforcement Learning · Shinn et al. · 2023
Toolformer: Language Models Can Teach Themselves to Use Tools · Schick et al. (Meta) · 2023
Gorilla: Large Language Model Connected with Massive APIs · Patil et al. · 2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs · Qin et al. · 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents · Zhou et al. · 2023
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · Xie et al. · 2024
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · Jimenez et al. · 2024
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation · Wu et al. · 2023