Last week the MCP project published a blog post on tool annotations, recapping the current state: four boolean hints (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) that servers attach to their tools. The post is honest about what annotations can and can’t do. I want to build on that honesty and say something directly: tool annotations are not security. They were never designed to be. And stacking more annotations on top won’t change that.
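For reference, the four hints ride along on the tool definition itself. A minimal sketch in Python (the annotation field names come from the MCP spec; the delete_file tool is invented for illustration):

```python
# The four advisory hints as they appear on a tool definition.
# Field names are from the MCP spec; the tool itself is made up.
delete_file_tool = {
    "name": "delete_file",
    "description": "Delete a file at the given path.",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
    },
    "annotations": {
        "readOnlyHint": False,    # the tool modifies its environment
        "destructiveHint": True,  # its updates may be irreversible
        "idempotentHint": True,   # repeating a call with the same args adds nothing
        "openWorldHint": False,   # no interaction with external entities claimed
    },
}
```

Nothing in the protocol checks any of these values against what the tool actually does.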

The spec says it plainly

The MCP specification is explicit: annotations are not guaranteed to faithfully describe tool behavior. Clients must treat them as untrusted unless they come from a trusted server. During the original proposal review, MCP co-creator Justin Spahr-Summers asked the question that still hangs over every annotation proposal:

I think the information itself, if it could be trusted, would be very useful, but I wonder how a client makes use of this flag knowing that it’s not trustable.

Basil Hosmer pushed even harder:

Clients should ignore annotations from untrusted servers.

These are the people who built the protocol. They know annotations can’t be trusted, and they shipped them anyway as a UX convenience. That was a reasonable choice. The problem starts when people treat these hints as a security boundary.

Five SEPs, same problem

Right now there are five open Specification Enhancement Proposals (SEPs) trying to add more annotations: trust levels, sensitivity markers, secret handling hints, unsafe output flags, governance metadata. Some come from GitHub and OpenAI, based on real gaps they hit running MCP in production.

I get the impulse. You’re building an agent platform, you want to know what a tool does before you run it, and annotations feel like the obvious place to put that information. But adding more metadata to an untrusted channel doesn’t make the channel trusted. A tool that says secretHint: true is making a promise. Who verifies that promise? The tool itself. That’s circular.

More annotations give you a richer vocabulary for describing risk. That’s useful for building better confirmation dialogs and audit logs. But it’s not enforcement. It’s labeling.

The lethal trifecta needs a runtime answer

Simon Willison named the lethal trifecta: an agent with access to private data, exposure to untrusted content, and the ability to exfiltrate. When all three are present, you’re one prompt injection away from data theft. Researchers have demonstrated this with a malicious Google Calendar event, an MCP calendar server, and a local code execution tool.

The MCP blog post acknowledges this. But then it frames annotations as part of the solution, helping clients reason about which tools contribute to which legs of the trifecta. And sure, if openWorldHint is set correctly, a client could theoretically refuse to combine that tool with one that reads private data.

Here’s the problem: openWorldHint is self-reported. A malicious server won’t set it. A lazy server author won’t think about it. A compromised server might have set it correctly last week but not today. You can’t build a security model on “the tool describes itself honestly.” That’s asking a burglar to wear a name tag.
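To see the fragility concretely, here is a sketch of the annotation-driven check a client could run (the helper and the mapping of hints to trifecta legs are mine, not from any SDK; defaults match the spec: openWorldHint defaults to true, readOnlyHint to false):

```python
def trifecta_risk(tools):
    """Flag a tool set whose self-reported annotations combine a
    private-data reader with an open-world (network-reaching) tool.
    Advisory only: a server that lies about openWorldHint sails
    straight through this check."""
    def ann(t):
        return t.get("annotations", {})
    # Any tool claiming open-world access is a potential exfiltration leg.
    open_world = any(ann(t).get("openWorldHint", True) for t in tools)
    # A closed-world, read-only tool is a plausible private-data leg.
    private_reader = any(
        not ann(t).get("openWorldHint", True) and ann(t).get("readOnlyHint", False)
        for t in tools
    )
    return open_world and private_reader
```

The check is only as good as the hints. A malicious server flips one boolean and the "risky" combination looks safe.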

The lethal trifecta has to be broken at the runtime level: the container, the sandbox, the execution environment. If a tool can’t reach the network, it doesn’t matter what openWorldHint says. If a tool can only read files in /tmp/workspace, it can’t exfiltrate your emails regardless of how it describes itself. The enforcement has to be about what the tool can do, not what it says it does.

Annotations are UX, not policy

I’m not saying annotations are useless. They’re good for UX. If a tool says it’s destructive, showing a confirmation dialog before running it is smart. If a tool says it’s read-only, maybe you skip the confirmation. That’s a better user experience.
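That UX logic is a few lines. A sketch, assuming the spec's defaults (readOnlyHint false, destructiveHint true, so an unannotated tool gets the dialog):

```python
def needs_confirmation(tool):
    """UX gate, not policy: decide whether to show a confirmation
    dialog before running a tool, based solely on its own claims."""
    ann = tool.get("annotations", {})
    if ann.get("readOnlyHint", False):
        return False  # claims to be read-only: skip the dialog
    return ann.get("destructiveHint", True)
```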

But UX is not policy. A confirmation dialog doesn’t prevent exfiltration. It just makes the user click “OK” first. In an agentic workflow where the LLM makes tool calls autonomously, there might not even be a user in the loop to click anything.

The distinction matters because I see people conflating the two. “We have tool annotations, so we have a security story.” No. You have a hint system. Your security story is whatever runs underneath: process isolation, network policies, filesystem restrictions, capability-based permissions. Without those, annotations are a sign on the door that says “please don’t steal anything.”

What enforcement actually looks like

Real enforcement means the runtime environment constrains what a tool can do, independent of what the tool claims about itself:

  • Network isolation. A tool that processes documents doesn’t need outbound HTTP. Don’t give it outbound HTTP.
  • Filesystem scoping. Mount only what the tool needs. If it’s a code formatter, it gets the source directory. Not your home folder.
  • Capability dropping. No raw socket access. No process spawning unless explicitly needed.
  • Ephemeral environments. Spin up a container for the tool invocation. Tear it down after. State doesn’t persist unless you explicitly allow it.
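All four constraints fit in a single container invocation. A sketch that assembles a docker run command line (the image name and workspace path are placeholders; in practice you would hand the list to subprocess.run):

```python
def sandboxed_cmd(image, workspace):
    """Build a docker run invocation that enforces the constraints
    above at the runtime level, whatever the tool's annotations say."""
    return [
        "docker", "run",
        "--rm",                                 # ephemeral: torn down after the call
        "--network", "none",                    # no outbound HTTP, nothing to exfiltrate to
        "--cap-drop", "ALL",                    # no raw sockets, no privileged operations
        "--read-only",                          # root filesystem is immutable
        "--volume", f"{workspace}:/workspace",  # mount only what the tool needs
        image,
    ]
```

Note that none of these flags consult the tool. The host decides; the tool runs inside whatever it was given.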

None of this requires the tool’s cooperation. None of it depends on annotations being accurate. The tool doesn’t get to vote on its own permissions.

This is the direction the industry is converging on. The protocol layer is the wrong place to solve trust. The runtime layer is where you actually have leverage.

The meta-question

Every time a new annotation SEP shows up, I ask the same question: who enforces this?

If the answer is “the client reads the annotation and decides what to do,” that’s UX. If the answer is “the runtime environment constrains the tool regardless of annotations,” that’s security. The MCP community is pouring energy into the first category while the second category still has no standardized answer.

I’d love to see that energy redirected. Give me a standard way to declare what capabilities a tool needs. Give me a runtime spec that hosts can implement. Give me a container profile format that tool authors ship alongside their servers. That would be worth five SEPs.
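To make that concrete, here is one shape such a capability declaration could take. This format is entirely invented for illustration, not any proposed standard:

```python
# Hypothetical capability manifest: the tool declares what it NEEDS,
# and the host runtime (not the tool) enforces the boundary.
manifest = {
    "tool": "code_formatter",
    "needs": {
        "network": {"outbound": False},
        "filesystem": [{"path": "/workspace", "mode": "rw"}],
        "spawn_processes": False,
        "lifetime": "per-invocation",
    },
}

def host_allows_network(manifest):
    """The host reads the declaration; the tool never votes on it."""
    return manifest["needs"]["network"]["outbound"]
```

The key difference from annotations: a manifest describes what the tool is permitted to do, and the runtime makes anything outside it impossible rather than merely discouraged.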

Annotations tell you what a tool says about itself. Sandboxes tell you what a tool can actually do. I know which one I’m betting on.


I write about building with AI agents, the stuff that actually works and the stuff that breaks. Follow me on LinkedIn for more.
