Why agents are a different class of risk
A chatbot that gives a bad answer wastes your time; an agent that takes a bad action sends the email, deletes the record, or pushes the commit. The defining danger is the fusion of a probabilistic, manipulable reasoning layer with a deterministic, real-world execution layer. Because LLMs have no reliable boundary between instructions and data, any text the agent reads — a web page, a support ticket, a code comment, a calendar invite — can carry hidden commands.
This is indirect prompt injection. NIST's 2025 update to its adversarial machine learning taxonomy (AI 100-2) was the first to explicitly name autonomous agents as an attack surface. Because the model that reads the untrusted text is also the planner that chooses actions, you cannot fully patch this at the model layer; you must assume the reasoning can be subverted and design the surrounding system so that subversion is contained.
Excessive agency and excessive permissions
OWASP ranks Excessive Agency as LLM06 in its 2025 Top 10, and breaks it into three root causes worth designing against separately. Excessive functionality means the agent can reach tools beyond its task scope. Excessive permissions means those tools run with broader privileges than the work requires — write access where read would do, an admin token where a scoped one would do. Excessive autonomy means high-impact actions proceed with no human checkpoint.
The canonical example is an email assistant granted both read and send: an injected instruction in an incoming message convinces it to forward the entire inbox to an attacker. With read-only access, that forwarding attack cannot happen, because the agent has no tool that sends mail. The mitigation is unglamorous and effective: apply least privilege to every tool, prefer narrow purpose-specific tools over open-ended ones, and never let an agent act through a privileged or shared service account. Use caller-scoped, short-lived identities instead, so the blast radius is bounded by who actually invoked the agent.
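A minimal sketch of that idea, assuming a hypothetical in-process tool registry: every tool declares the scope it needs, and every call must present a caller-scoped token that holds the scope and has not expired. The names CallerToken, ToolRegistry, and the scope strings are illustrative assumptions, not any real framework's API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class CallerToken:
    """Hypothetical short-lived credential tied to the invoking user."""
    caller: str
    scopes: FrozenSet[str]
    expires_at: float  # Unix timestamp after which the token is invalid

    def allows(self, scope: str, now: float) -> bool:
        return now < self.expires_at and scope in self.scopes


@dataclass
class ToolRegistry:
    """Maps each tool name to the scope it requires and the function to run."""
    tools: Dict[str, Tuple[str, Callable[..., object]]] = field(default_factory=dict)

    def register(self, name: str, scope: str, fn: Callable[..., object]) -> None:
        self.tools[name] = (scope, fn)

    def call(self, token: CallerToken, name: str, *args, now: Optional[float] = None):
        if name not in self.tools:
            raise PermissionError(f"unknown tool: {name}")
        scope, fn = self.tools[name]
        current = time.time() if now is None else now
        # Deny by default: the token must carry the exact scope and be unexpired.
        if not token.allows(scope, current):
            raise PermissionError(f"{token.caller} lacks valid scope {scope!r} for {name}")
        return fn(*args)


registry = ToolRegistry()
registry.register("read_inbox", "mail:read", lambda: ["msg-1", "msg-2"])
registry.register("send_mail", "mail:send", lambda to, body: f"sent to {to}")

# A read-only token: the email assistant can summarize mail but cannot send it.
read_only = CallerToken("alice", frozenset({"mail:read"}), expires_at=1000.0)
```

With this token, a call to read_inbox succeeds while a call to send_mail raises PermissionError, however persuasive the injected instruction was. The optional now argument exists only to make expiry testable.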
The lethal trifecta
Simon Willison's "lethal trifecta" is the clearest mental model for when an agent becomes acutely dangerous. The three ingredients are access to private data, exposure to untrusted content, and the ability to communicate externally. Any agent that holds all three at once can be talked into reading your secrets and shipping them to an attacker — and the agent cannot reliably tell it is being manipulated.
The practical discipline is to break the trifecta: an agent that processes untrusted input should not also hold sensitive data and an open egress path in the same context. If a workflow genuinely needs all three, split it across isolated steps with trust boundaries between them, strip or quarantine untrusted content before it reaches a privileged tool, and treat any data path that can leave your environment — an outbound request, a webhook, even a rendered image URL — as an exfiltration channel.
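One way to make the rule checkable is to declare each workflow step's capabilities and reject any step that holds all three. The sketch below assumes a hypothetical Capability flag set and a validate_step helper run at configuration time; it illustrates the check and does not replace real isolation between steps.

```python
from enum import Flag, auto


class Capability(Flag):
    PRIVATE_DATA = auto()       # can read secrets or sensitive records
    UNTRUSTED_CONTENT = auto()  # ingests text an attacker could have written
    EXTERNAL_COMMS = auto()     # can send data outside the environment


TRIFECTA = (
    Capability.PRIVATE_DATA
    | Capability.UNTRUSTED_CONTENT
    | Capability.EXTERNAL_COMMS
)


def holds_trifecta(caps: Capability) -> bool:
    """True only when all three trifecta capabilities are present together."""
    return (caps & TRIFECTA) == TRIFECTA


def validate_step(name: str, caps: Capability) -> None:
    """Refuse to configure a step that combines all three capabilities."""
    if holds_trifecta(caps):
        raise ValueError(f"step {name!r} combines all three trifecta capabilities")


# A web summarizer reads untrusted pages and can call out, but holds no secrets.
validate_step(
    "summarize_web_page",
    Capability.UNTRUSTED_CONTENT | Capability.EXTERNAL_COMMS,
)
```

A step declared with all three flags raises ValueError, which forces the author to split the workflow or remove a capability before it can run.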
Sandboxing the execution layer
Because you cannot trust the agent's intent under injection, constrain its capability. Anthropic's guidance for computer use and Claude Code is explicit: run agents inside sandboxes, virtual machines, or containers with no access to sensitive data, and enforce two boundaries at the OS level. Filesystem isolation limits the agent to specific directories so a compromised agent cannot read credentials or modify system files. Network isolation restricts outbound connections to an allowlist of approved hosts, which closes off most routes for data exfiltration and malware download.
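The two boundaries can be expressed as simple policy checks. The sketch below is an application-level illustration only: real enforcement belongs in the OS, container, or network layer as described above, and the workspace path and host names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlsplit

WORKSPACE = Path("/tmp/agent-workspace")  # hypothetical disposable workspace
ALLOWED_HOSTS = frozenset({"api.example.com", "pypi.org"})  # hypothetical allowlist


def path_is_confined(candidate: str, root: Path = WORKSPACE) -> bool:
    """True if candidate, after resolving '..' and symlinks, stays under root."""
    root_resolved = root.resolve()
    resolved = (root / candidate).resolve()
    return resolved == root_resolved or root_resolved in resolved.parents


def host_is_allowed(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """True only for https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed
```

Matching on the exact parsed hostname, not on a substring of the URL, is the important design choice: a lookalike such as api.example.com.evil.net, or a URL that hides the real host after an @, fails the check.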
Shell access deserves the same treatment — scoped to a disposable workspace, ideally non-root, with destructive commands gated. This containment approach scales better than per-action approval, because the agent is simply unable to reach what it should not touch.
Human-in-the-loop for irreversible actions
Human approval remains essential, but it is a precision instrument, not a blanket policy. Anthropic's telemetry is sobering: users approved roughly 93% of permission prompts, and the more prompts they saw, the less attention they paid — the rubber-stamp problem. So reserve confirmation for actions with meaningful, irreversible consequences: financial transactions, external communications, deleting data, deploying code.
Make those prompts specific and legible, showing exactly what will happen, so the human can actually exercise judgment. Default to dry-run or staged execution where possible, and redesign operations that would otherwise be irreversible so they can be undone (soft-deletes, drafts that require a second action to send), so a single bad approval is recoverable.
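A minimal sketch of such a gate, assuming a hypothetical Action record and an approve callback that shows the summary to a human and returns their decision. Dry-run is the default, and only the action kinds listed as irreversible trigger a prompt.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action kinds that warrant a human checkpoint.
IRREVERSIBLE_KINDS = frozenset({"payment", "external_message", "delete", "deploy"})


@dataclass(frozen=True)
class Action:
    kind: str
    summary: str             # exact, human-readable description of the effect
    run: Callable[[], str]   # performs the action when called


def execute(action: Action, approve: Callable[[str], bool], dry_run: bool = True) -> str:
    """Run an action, asking for approval only when it is irreversible."""
    if dry_run:
        # Describe the effect without performing it.
        return f"DRY RUN: {action.summary}"
    if action.kind in IRREVERSIBLE_KINDS and not approve(action.summary):
        return f"REJECTED: {action.summary}"
    return action.run()


delete_rows = Action("delete", "Delete 3 rows from table invoices", lambda: "deleted")
read_rows = Action("read", "Read table invoices", lambda: "rows")
```

Here execute(read_rows, approve, dry_run=False) runs without prompting, while execute(delete_rows, approve, dry_run=False) runs only if approve returns True. Keeping the prompt count low is the point: the human sees a request only when the consequence is hard to undo.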
Loop control, cost limits, and auditability
Agents fail in ways traditional software does not: they loop, retry, and burn resources without a human noticing. Enforce hard limits — maximum tool calls per task, wall-clock and token/cost budgets, and recursion or depth caps — so a stuck or hijacked agent halts instead of running unbounded.
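These limits can be sketched as a single budget object that the agent loop charges before every tool call. The class name, default values, and cost accounting below are assumptions chosen for illustration; real limits depend on the task.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when any hard limit is crossed; the agent loop should halt."""


@dataclass
class Budget:
    max_tool_calls: int = 25
    max_seconds: float = 120.0
    max_cost_usd: float = 1.00
    max_depth: int = 3
    tool_calls: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float = 0.0, depth: int = 0) -> None:
        """Record one tool call and raise if any limit is now exceeded."""
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost limit reached")
        if depth > self.max_depth:
            raise BudgetExceeded("depth limit reached")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("wall-clock limit reached")
```

Calling budget.charge(cost_usd=..., depth=...) at the top of each loop iteration means a stuck or hijacked agent stops with an exception instead of running until someone notices the bill.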
Just as important is observability. Log every tool invocation with its inputs, outputs, and the triggering context, tied to the caller's scoped identity, so you can reconstruct what an agent did and why. This audit trail is your detection and forensics capability; without it you cannot tell a successful injection from normal operation. OWASP's Agentic Security Initiative treats traceability and resource governance as first-class controls, not afterthoughts.
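A minimal sketch of such a record, assuming a hypothetical audited_call wrapper that appends one JSON line per invocation to a sink. The field names are illustrative; in practice the sink would be an append-only log store, not an in-memory list.

```python
import json
import time
import uuid
from typing import Any, Callable, Dict, List


def audited_call(
    sink: List[str],
    *,
    caller: str,
    context: str,
    tool: str,
    fn: Callable[..., Any],
    inputs: Dict[str, Any],
) -> Any:
    """Invoke fn(**inputs) and append a JSON audit record, even on failure."""
    record: Dict[str, Any] = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "caller": caller,    # scoped identity that invoked the agent
        "context": context,  # what triggered this call, e.g. a ticket ID
        "tool": tool,
        "inputs": inputs,
    }
    try:
        result = fn(**inputs)
        record["status"] = "ok"
        record["output"] = result
        return result
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        # default=str keeps logging from failing on non-JSON-serializable values.
        sink.append(json.dumps(record, default=str))
```

Writing the record in a finally block means failed and rejected calls are logged as well as successful ones, which matters because an injection attempt often shows up first as an unusual error.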
You cannot stop an agent's reasoning from being manipulated, so secure the system around it: least privilege, broken trifectas, real sandboxes, human gates on irreversible actions, hard limits on loops and cost, and an audit trail for everything.