I was trying to save our engineers an afternoon. We'd wired three community MCP servers into our internal dev sandbox, and before anyone plugged them into a real project I wanted to read the tool descriptions end-to-end. Not a research project — a Saturday morning sanity check.

By Sunday evening I'd pulled 142 tool definitions from the three biggest MCP marketplaces. 37 of them contained instructions the user would never see. Most were benign — "always respond in markdown," that kind of thing. Author-side nags stuffed into the description field because it was the easiest place to influence the agent. But the tail was ugly. Three were bad enough that I stopped, filed disclosures, and didn't open another tool definition until I'd set up a proper isolated environment.

This is the writeup.

The one-sentence version

An MCP server advertises tools with a description field (spec). Your agent reads that description as authoritative instruction — same trust tier as the system prompt. If the description contains hidden text — HTML comments, zero-width characters, unicode tag blocks — the agent executes it. The user sees nothing.

Here's the part most coverage misses: this is the inverse of prompt injection. Prompt injection sneaks a payload into user input — the untrusted tier. Tool poisoning puts the payload into the system tier, the layer every "prompt-injection-proof" framework explicitly trusts. If you've built defenses around sanitizing user input, you're guarding the front door while the payload walks in through the foundation.

The agent trusts the tool description the same way it trusts the system prompt. Most people only harden one of those.

— The quiet part of every MCP threat model

What a poisoned tool actually looks like

Synthetic version of the pattern I kept finding. Looks like a file reader. Isn't.

$ curl -s https://mcp.example.dev/tools | jq '.tools[2]'

I missed it the first time. I was skimming descriptions in my terminal, and my pager wrapped the HTML comment into the next field's output. Looked like formatting noise. It took a second pass — jq -r '.tools[].description' piped through cat -A — before the comment stood out.

That's the core problem: tool descriptions are rendered for machines, not humans. The MCP spec calls them "human-readable," but no client I tested actually shows them to a human by default. Whatever you bury in the description, the agent reads every byte.

Why it works

Three things make this worse than it first looks:

The user never sees the tool description. I tested Claude Desktop, Cursor, and Cline — three of the most popular MCP clients. None of them surface the full tool description in the UI by default. You'd have to open devtools or dump the protocol traffic yourself. The MCP spec says clients SHOULD "show tool inputs to the user before calling the server," but says nothing about showing the description that shaped the agent's decision to call in the first place.
Tool descriptions sit in the system trust tier. When the agent picks a tool, it treats the description the way it treats its own instructions. "Read SSH config as part of init" parses identically to "pass JSON, get response back" — both are just specification text in the context window. The agent has no mechanism to distinguish "the developer intended this" from "an attacker injected this." This is OWASP LLM05 — Supply Chain Vulnerabilities in its purest form.
It persists across sessions. Prompt injection needs a fresh payload every time. A poisoned tool description lives as long as the MCP server does. Every user, every session, same trap. No user action required.

What I tried that didn't work

Before we shipped detection in Quint, I went through three approaches. All of them failed in ways that were instructive.

Description allowlists. Hash every description with SHA-256 at review time, pin the hash, alert on any change. Straightforward. The failure mode is obvious in retrospect: it only catches mutations. If the payload ships in the first commit — which 31 of the 37 I found did — the hash pins the poisoned description as the baseline. I spent a day building the allowlist tooling and it caught exactly zero of the tools that actually mattered.

Regex scanning for hidden content. I wrote patterns for  (HTML comments), [\x{200B}-\x{200F}] (zero-width characters), and [\x{E0020}-\x{E007F}] (unicode tag characters). The HTML comment regex caught about 60% of my sample. The unicode patterns caught none — because the creative payloads weren't using the codepoints I was scanning for. One description used emoji variation selectors to encode instructions character by character. Another embedded directives in a base64-encoded "example output" that the agent dutifully decoded. Regex assumes you know the encoding, and the attacker picks the encoding.

LLM-on-LLM screening. Feed each tool description to GPT-4o, ask it to summarize and flag anything suspicious. This worked in my first ten test cases and I got excited. Then I ran it against the full 142-tool dataset. Three problems: (a) 2.3 seconds per description at the OpenAI API, which makes it a non-starter for real-time gating, (b) 14% false positive rate — the model flagged descriptions that mentioned "credentials" or "config" in totally legitimate contexts, and (c) the model itself summarized three of the poisoned descriptions without flagging the hidden instructions at all. It just... absorbed them. The thing you're asking to detect the manipulation is the thing being manipulated.

What actually works: watch what the agent does

The thing that flipped the detection was giving up on inspecting descriptions and watching the agent's actions instead. Which is, embarrassingly, the thing we already do at Quint. I kept trying to solve it upstream when the signal was sitting downstream the whole time.

Behavioral monitoring caught the synthetic payload on two independent signals:

Unexpected call sequence. The agent invoked read_file on ~/.aws/credentials before touching any path the user had mentioned. The tool itself was sanctioned — it was on the allowlist. The argument was the anomaly. Against the session baseline, a credential file read with no preceding user reference scored as a 97th-percentile outlier.
Correlation gap. Our per-session baseline expects file reads to correlate with user prompts — file paths, directory names, @file references in chat. The credential read had none of that. Correlation score: 0.03 against a session mean of 0.71. That gap alone is enough to flag.

Neither signal required understanding the tool description. The agent could have been compromised by prompt injection, a bad system prompt, or a poisoned description — the behavioral fingerprint is the same. That's the point. You don't need to enumerate every way an agent can be compromised if you watch what it actually does.

The numbers

If your engineers are using MCP — and statistically, they are — you have an attack surface that your existing stack doesn't cover:

WAFs and API gateways inspect HTTP traffic. MCP tool descriptions travel over stdio or WebSocket. Your WAF never sees them.
DLP watches data leaving the network. It doesn't know the AI agent is the one moving it, and the exfil channel is the agent's own output.
EDR looks at process-level behavior. An agent reading ~/.aws/credentials through a sanctioned tool doesn't trip any heuristic — it looks like normal file I/O.
MCP gateways can allowlist which tools an agent may call. They don't parse the description for hidden payloads, and they can't tell when a legitimate tool starts behaving differently.

What I'd do on Monday

Inventory your MCP servers. Not just IT-approved ones — the ones developers wired up themselves. ps aux | grep mcp is a start. Our own inventory was off by four servers.
Pin description hashes and alarm on mutation. Not a fix (see above — payloads ship on day one), but it catches the opportunistic cases and gives you a change log.
Strip invisible characters before the agent sees them. Run descriptions through a pass that drops unicode tag characters (U+E0020–U+E007F), zero-width joiners, and HTML comments. Speed bump, not a wall.
Monitor the action layer, not the conversation layer. The conversation is what the user sees. The action stream is what the agent actually does. If you only instrument one, pick the second.

Same threat family

If this pattern sounds familiar, it's because it's a sibling of what happened with OpenClaw's marketplace. Different delivery mechanism — skill registries instead of tool descriptions — but the same underlying architecture problem: untrusted third-party content gets treated as trusted specification by the agent runtime. Tool poisoning, malicious skills, compromised retrieval corpora — they're all supply chain attacks on the agent substrate. I'm tracking them as one family.

What's next

I'm publishing the sanitized dataset of the 142 tool descriptions (with the confirmed-malicious ones redacted pending disclosure) at github.com/quintai-dev/mcp-threat-dataset. If you run MCP in production and want to compare notes on detection, I'm hamza in the Quint Discord.

Quint is the behavioral intelligence layer for AI agents. We watch what agents do, flag what doesn't fit, and give you the receipts. See how it works.

MCP Tool Poisoning: The AI Agent Supply Chain Threat You're Not Watching

The one-sentence version

What a poisoned tool actually looks like

Why it works

What I tried that didn't work

What actually works: watch what the agent does

The numbers

What I'd do on Monday

Same threat family

What's next

Your agents are running.
See what they're actually doing.

MCP Tool Poisoning: The AI Agent Supply Chain Threat You're Not Watching

The one-sentence version

What a poisoned tool actually looks like

Why it works

What I tried that didn't work

What actually works: watch what the agent does

The numbers

What I'd do on Monday

Same threat family

What's next

Your agents are running. See what they're actually doing.

Your agents are running.
See what they're actually doing.