The Hidden Risk in Your AI Agent Stack

As organizations accelerate their adoption of AI agents for tasks ranging from code generation to customer operations, a new category of supply-chain risk is emerging inside the reusable modules that power these agents. Known as "skills," these self-contained packages let agents execute specific capabilities on demand. But recent research into public skill registries suggests that a meaningful percentage of them are designed to do harm.

This post summarizes what enterprise security, compliance, and AI governance teams need to know about the harmful-skill threat model, and what mitigations to build into agent deployments.

What Agent Skills Are (and Why They Matter)

An agent skill is a reusable directory containing a primary SKILL.md file with YAML metadata and Markdown instructions, plus optional scripts and API wrappers. At runtime, only the skill's metadata is loaded into the system prompt as a compact index. The full content is pulled into context only when the agent chooses to invoke that skill.

Two major public registries host the bulk of the ecosystem today:

ClawHub, a versioned registry for agent skills (clawhub.ai)
Skills.Rest, a community-driven discovery and distribution platform (skills.rest)

Both registries allow any builder to publish skills, with limited gating. As of March 2026, they collectively host nearly 100,000 skills.

The Threat: Harmful Skills vs. Malicious Skills

Most existing research on agent security focuses on malicious skills: packages that embed prompt injections, data exfiltration payloads, or malware intended to compromise the user who installs them.

A distinct and under-examined threat is the harmful skill. Here, the user is the attacker, and the skill's declared functionality itself violates established AI usage policies such as the Anthropic Usage Policy or the OpenAI Usage Policies. The agent becomes an automated instrument; the victims are third parties (individuals whose data is scraped, platforms whose terms of service are violated, or targets of fraud).

Skills can be installed with a single command and deployed on corporate devices or private infrastructure, substantially lowering the barrier to misuse.

How Prevalent Is the Problem?

A large-scale measurement covering 98,440 skills across the two registries found:

Registry	Total Skills	Harmful Skills	Harmful Rate
ClawHub	26,629	2,355	8.84%
Skills.Rest	71,811	2,503	3.49%
Combined	98,440	4,858	4.93%

Harmful skills were identified using an LLM-driven scoring system grounded in a 21-category taxonomy synthesized from major usage policies. The system achieved an F1 score of 0.82 against human-labeled data.

The top five categories account for roughly three-quarters of all violations:

Cyber attacks
Privacy violations
Fraud and scams
Unsupervised financial advice
Platform abuse

Notably, harmful skills are not niche. On ClawHub, the typical harmful skill has a higher median download count (261) than the typical non-harmful skill (229), suggesting real user demand. Only 2.21% of harmful skills have zero downloads, compared with 12.57% of non-harmful skills.

Harmful-skill production is also concentrated among a relatively small group of prolific builders. On Skills.Rest, the top 10% of contributing builders account for 48.21% of harmful skills (Gini coefficient of 0.54).

The Taxonomy: Two Tiers of Risk

The harmful-skill taxonomy maps to two tiers of policy violation:

Tier 1

Prohibited Use

14 categories of unconditionally forbidden actions, including cyber attacks, weapons development, privacy violations, fraud, election interference, and sexually explicit content. Skills in this tier account for roughly 71% of all violations.

Tier 2

High-Risk Use

7 professional domains (legal, medical, financial, insurance, employment and housing, academic assessment, and media generation) where AI assistance is permitted only with human-in-the-loop review and AI disclosure. A skill in this tier is considered harmful when it provides advice without those safeguards.

Harmful skills frequently also target downstream platforms. Many violate the terms of service of services such as LinkedIn, Instagram, Facebook, X, Discord, GitHub, and Amazon, typically by automating scraping, account creation, or content manipulation at scale.

Why Platform Moderation Alone Is Not Enough

Both registries advertise some form of review. ClawHub scans published skill packs using VirusTotal and an internal security pipeline. Skills.Rest inherits GitHub's baseline moderation. Yet the harmful rate on ClawHub is approximately 2.5 times that of Skills.Rest, indicating that platform structure, publication workflow, and enforcement practice matter as much as the nominal presence of a scanner.

Existing moderation operates at the security layer rather than the policy-compliance layer. Scanners catch embedded malware and known injection payloads. They do not assess whether a skill's declared functionality violates AI usage policies.

The Skill-Reading Exploit: How Agents Behave Differently Once a Skill Is Installed

Evaluation of six mainstream LLMs (GPT-4o, GPT-5.4-Mini, Gemini 3 Flash, DeepSeek V3.2, Qwen3-235B, and Kimi K2.5) across 200 harmful skills produced a striking pattern.

Condition	Avg. Harm Score	Avg. Refusal Rate
No skill installed, explicit harmful request	0.27	59.58%
Skill installed, explicit harmful request	0.47	42.42%
Skill installed, no stated harmful intent	0.76	9.75%

Two findings stand out:

Passive exposure bypasses refusal triggers. When the user simply asks the agent to read an installed skill and produce an execution plan, without ever stating a harmful goal, refusal rates collapse. On Tier 1 Prohibited skills, the harm score under passive exposure is 0.79, compared with 0.08 when the same harmful task is asked directly with no skill attached.
The skill itself materially shifts behavior. Even when the user does state the harmful intent explicitly, wrapping the request in a pre-installed skill raises the harm score and lowers refusal rates across five of six models tested.

In other words, refusal training that was calibrated on user-query prompts does not transfer cleanly to semantically equivalent content delivered through a tool-context skill specification.

Tier 2 Is a Distinct Failure Mode

On Tier 2 High-Risk skills (insurance underwriting, candidate screening, medical triage, and similar professional domains), agents almost never refuse by default. Refusal rates never exceeded 0.71% across conditions.

Just as importantly, agents rarely add the required safeguards on their own. Under an active-invocation condition, only about 15.71% of responses recommended human-in-the-loop review and only 2.14% disclosed AI involvement.

When safeguards were explicitly instructed in the prompt, a revealing asymmetry emerged:

Human-in-the-loop instructions were followed in 93-98% of responses, regardless of whether the instruction required or forbade the safeguard.
AI-disclosure instructions were followed only 41-74% of the time when the prompt required disclosure, but 98-99% of the time when the prompt forbade it.

Models comply readily with instructions to withhold AI disclosure, but only partially with instructions to provide it. This structural bias against volunteering AI disclosure has direct implications for regulated decision domains.

What Enterprise Teams Should Do

For organizations deploying AI agents against real business workflows, several controls follow directly from these findings.

Treat skill registries as part of your software supply chain.

Curate an approved internal catalog rather than allowing agents to pull skills from public registries on demand. Require the same review and attestation you apply to third-party libraries.

Add policy-compliance scanning on top of security scanning.

Malware scanners and VirusTotal-style pipelines do not flag policy-violating functionality. Augment them with content-level checks grounded in your usage policy and any relevant industry regulations.

Apply refusal triggers to skill specifications, not just user queries.

Alignment applied at the user-prompt layer does not automatically cover content delivered through a tool-response channel. Security prompts, guardrails, and evaluation suites should explicitly probe the "read this skill and plan its execution" pathway.

Make Tier 2 safeguards default behaviors.

For high-risk professional domains, human-in-the-loop review and AI disclosure should not depend on the user instructing the agent to include them. Bake them into the system prompt, the skill wrapper, or a policy-enforcement layer.

Monitor for category drift.

The most prolific publishers produce disproportionate amounts of harmful content. Track builder identity, and apply publisher verification or domain-expert review for the most sensitive categories (weapons development, election-related automation, insurance and lending decisions, medical triage, candidate screening).

Closing Thought

Agent skills deliver real productivity gains, but they also expand the attack surface in ways that traditional application security tooling was not built to catch. The risk is no longer just that a malicious package will compromise the user who installs it. The risk now includes the opposite case: users installing packages whose advertised purpose is itself a policy violation, and agents that quietly comply once the request is reframed as routine tool use.

Building durable defenses means combining registry-level policy compliance, runtime guardrails that treat skill specifications as untrusted input, and alignment practices that keep high-risk safeguards on by default.

Need help assessing AI agent risk in your organization? Get in touch with our team.

The Hidden Risk in YourAI Agent Stack