What an AI Agent Actually Is
Strip away the marketing and an agent is a surprisingly small idea: an LLM running in a loop with access to tools, working toward a goal until it decides it is done.
context = [system_prompt, user_goal]
while not done:
response = llm(context, tools)
if response.has_tool_calls:
results = execute(response.tool_calls) # search, run code, query DB...
context += results # feed observations back
else:
done = True # model chose to answer
return response
Every framework — however elaborate — is this loop plus opinions about state, safety, and orchestration. The useful spectrum to keep in mind runs from less to more autonomy:
- Workflow (not an agent): a fixed pipeline where an LLM fills predefined steps — classify, extract, draft. The developer decides the control flow. Predictable, cheap, testable.
- Router: the LLM picks which branch of a predefined flow to take.
- Tool-using agent: the LLM decides which tools to call, in what order, and when to stop — the loop above. Control flow is emergent.
- Multi-agent system: several specialized agents (researcher, coder, reviewer) coordinating, usually through a supervisor or a shared task queue.
The most common architecture mistake of 2025–2026 is starting at the bottom of that list. The teams that ship tend to follow one rule: use the least autonomous design that solves the problem, and add autonomy only where the task genuinely cannot be enumerated in advance.
Tools and MCP: How Agents Touch the World
An LLM by itself can only emit text. Tools are what turn text into action: each tool is a typed function the model can request — search_web(query), run_sql(query), send_email(to, body) — with the runtime executing the call and returning results into context.
The integration problem this created (every app × every data source = custom glue code) is what the Model Context Protocol (MCP) was built to solve. Introduced by Anthropic in late 2024 and since adopted across the major providers and IDEs, MCP standardizes how tool servers expose capabilities to any compliant client — write one MCP server for your database or ticketing system, and every MCP-capable agent can use it. It plays the role USB played for peripherals, and its ecosystem (thousands of public servers) is a major reason agent adoption accelerated through 2025–2026.
Hard-won lessons about tool design that show up in every serious postmortem:
- Fewer, better tools beat many overlapping ones. Ambiguous tool sets measurably degrade tool-selection accuracy.
- Tool descriptions are prompts. The model chooses tools based on their descriptions; vague descriptions produce wrong calls.
- Return errors the model can act on. "Permission denied: token expired, call refresh_auth first" lets the loop self-correct; a bare stack trace does not.
- Treat tool results as untrusted input. A web page or document fetched by a tool can contain adversarial instructions (prompt injection) — the loop happily obeys text it reads unless you sandbox, scope permissions, and require confirmation for consequential actions.
The 2026 Framework Landscape
The framework space consolidated substantially through 2025. The current major families and what they are actually good for:
| Framework | Model of the world | Strongest fit |
|---|---|---|
| LangGraph | Explicit state machine / graph | Complex workflows needing checkpoints, retries, human-in-the-loop gates |
| OpenAI Agents SDK | Lightweight agents + handoffs | Teams committed to the OpenAI stack wanting minimal abstraction |
| Claude Agent SDK | The agent loop with tools, MCP-native | Coding and computer-use agents, deep MCP integration |
| CrewAI | Role-based multi-agent crews | Fast prototyping of collaborative role-play patterns |
| Microsoft (AutoGen / Semantic Kernel → Agent Framework) | Conversation-driven multi-agent | .NET/Azure estates, enterprise governance requirements |
| No framework (direct API) | You own the loop | Production systems wanting full control and no abstraction tax |
Two honest observations. First, the framework matters less than the harness around it — evals, observability, and permissioning decide success far more often than the choice between LangGraph and CrewAI. Second, a growing minority of production teams use no framework at all: the loop is ~50 lines, and owning it outright removes a dependency whose abstractions can fight you at debugging time. Frameworks earn their keep primarily at the workflow-orchestration layer (durable state, retries, parallel branches), not at the loop itself.
The Production-Readiness Gap: Why 40% of Projects Will Die
The adoption statistics tell a story of enthusiasm outrunning engineering. Roughly 79% of enterprises report adopting agents in some form, but only about 11% run them in production; Gartner projects over 40% of agentic AI projects will be cancelled by end of 2027, citing cost, unclear value, and risk controls. Median time-to-value on deployments that do work is around 5 months, with reported average ROI near 170% — but roughly one in five deployments never reaches payback.
The recurring failure modes, from public postmortems and industry surveys:
- Agent-washing. Rebranding a chatbot or an RPA script as an "agent" to ride the budget wave. Gartner estimated only a small fraction of vendors marketing "agentic AI" offer genuine agentic capability. These projects fail at the definition stage.
- Demo-to-production chasm. A demo needs to work once; production needs 99%+ across thousands of messy real inputs. Compounding error is brutal arithmetic: a 10-step agent whose steps each succeed 95% of the time completes correctly only ~60% of runs. Shipping teams shorten chains, add checkpoints, and design for recovery rather than perfection.
- No evals. Teams iterate on vibes — change the prompt, try three inputs, ship. Without a graded test set of realistic tasks, every prompt change is a blind bet. Evals are to agents what unit tests are to code, and their absence is the most reliable predictor of cancellation.
- Unbounded costs. Loops that resend growing context every iteration produce bills that scale superlinearly with task length. Production agents need budgets: max iterations, max tokens, context compaction, and per-task cost tracking.
- Security as an afterthought. An agent with broad tool permissions plus exposure to untrusted content (email, web, documents) is a prompt-injection incident waiting to happen. Least-privilege tools, sandboxed execution, and human confirmation for irreversible actions are table stakes.
A Playbook for Agents That Actually Ship
Condensing what the successful ~11% do differently:
- 1. Pick a narrow, measurable job. "Resolve tier-1 password-reset tickets end-to-end" ships; "an AI employee for support" does not. The successful deployments of 2025–2026 are overwhelmingly task-specific.
- 2. Start as a workflow, graduate to an agent. Enumerate the steps first. Only the parts that genuinely resist enumeration deserve autonomy. Many "agent" projects discover the whole task was a workflow — that is a success, not a failure: workflows are cheaper, faster, and testable.
- 3. Build the eval set before the agent. 30–100 real task instances with graded expected outcomes. Run them on every change. Track task completion, not token-level metrics.
- 4. Instrument everything. Log every tool call, decision, and token count. When (not if) the agent does something strange, you need the trace. Observability tooling is the least glamorous and highest-ROI part of the stack.
- 5. Put a human gate on irreversible actions. Sending money, deleting data, emailing customers — the agent proposes, a human (or a strict policy engine) approves. Autonomy is earned per-action-type as reliability data accumulates.
- 6. Budget the loop. Max iterations, max cost per task, compaction thresholds, and a defined failure behavior (escalate to human with a summary — never silently give up).
- 7. Measure against the baseline. The agent competes with a human process that has a real cost per task. If the agent plus its review overhead does not beat that number at production reliability, it is a research project — budget it honestly as one.
Agents in 2026 are where web apps were around 2000: the technology is real, the winners are being built right now, and most of the losses come not from the technology but from skipping the boring engineering around it.