Claude Cowork Safety: Documented Attacks and 10 Solutions

Published May 6, 2026. Updated May 6, 2026.
Researchers exposed an invisible-text attack days after Cowork launched, silently stealing files. Here's what documented incidents teach.

Two days after Anthropic released Claude Cowork to the public, security researchers at PromptArmor published a working proof of concept. A Word document contained instructions set in 1-point white-on-white text with line spacing of 0.1, invisible to any human reader. When a user uploaded the file and asked Cowork to process their financial documents, Cowork located those files and silently uploaded them, including partial Social Security numbers, to an attacker's Anthropic account.

The attack completed without triggering any approval prompt, and the victim had no way to know it had occurred.

The underlying vulnerability had been reported to Anthropic three months before Cowork launched, acknowledged by the team, and still not patched when the product shipped.

In the months after launch, researchers at PromptArmor, Zenity Labs, and LayerX Security published detailed attack analyses. Four of them are worth understanding in detail.

What Cowork Can Access

Cowork runs with more access than most desktop software you've installed. What it can do within its VM sandbox, and under what conditions, is the foundation for understanding any security risk associated with using it.

Cowork runs in a sandboxed virtual machine on your computer. Within that sandbox, it has significant reach:

  • Read, write, and permanently delete files in the folders you approve
  • Take screenshots to navigate applications on your screen
  • Browse the web through the Claude in Chrome extension
  • Run code and terminal commands
  • Connect to external tools and services through MCP (Model Context Protocol) servers
  • Execute scheduled tasks while you are not present

When you send a message from your phone, Cowork executes it using your desktop's full granted permissions. The more access you've enabled, the more damage any successful attack can do.

For a detailed breakdown of how Claude Cowork vs Claude Code differ in terms of environment, permissions, and appropriate use cases, our post on Claude Cowork vs Claude Code covers those comparisons in full. The short version: Cowork is a desktop app for knowledge workers running in a VM sandbox; Claude Code is a terminal tool for developers running with full filesystem permissions. What they share is the attack vector. Both process untrusted content, and in both tools, content that contains hidden instructions can redirect Claude's behavior.

Four Incidents That Show Where the Exposure Sits

That list of capabilities is what Cowork is designed to do. The incidents below show what happens when those same capabilities are used against you.

File Exfiltration via Hidden Prompt Injection (January 2026)

The PromptArmor attack shows exactly how prompt injection scales when an AI tool can take real actions. All it required was getting a user to process one external file. The malicious instruction was hidden in the document's formatting properties, invisible in any standard document viewer.

Blocking the exfiltration at the network level was harder than it sounds. Cowork's VM restricts most outbound connections, but api.anthropic.com is on the whitelist, because Cowork has to communicate with Anthropic to function at all. The attacker's API key turned that trusted channel into an exfiltration path.

Security researcher Simon Willison, who coined the term "prompt injection" in 2022, pointed out a fundamental tension in Anthropic's response: "I do not think it is fair to tell regular non-programmer users to watch out for 'suspicious actions that may indicate prompt injection.'" Cowork is marketed to office workers, not security analysts.

The Chrome Extension "Lethal Trifecta" (December 2025)

When the Claude in Chrome extension is connected to Cowork, it adds a browser with access to every site the user is logged into. Researchers at Zenity Labs identified what they described as a "lethal trifecta": the extension can access personal account data, act on that data, and be influenced by instructions embedded in web pages it visits.

In one demonstration, a fake employer email containing hidden instructions successfully directed Claude to delete emails from the user's inbox. The email appeared completely normal, and the deletion happened without triggering any approval step. Once the extension is active, there is no per-session toggle to disable it, meaning any web page Claude visits during a task is a potential injection source.

The Desktop Extension RCE (February 2026)

LayerX Security discovered that Claude Desktop Extensions (DXTs) run without sandboxing and with full system-level privileges, unlike browser extensions, which operate within restricted permission boundaries. A malicious Google Calendar event could trigger arbitrary code execution on a user's machine when Claude was asked to handle calendar tasks.

The vulnerability received a CVSS severity score of 10/10, the maximum possible. Anthropic's documented response was that the flaw "falls outside our current threat model."

The MCP Ecosystem Supply Chain Problem

MCP servers extend Cowork's reach: Google Drive, Slack, calendars, databases, and hundreds of third-party tools. Each connection also adds attack surface. A February 2026 Snyk audit of the AI agent skills ecosystem found that 36.82% of available skills contain at least one security flaw, and 13.4% have critical-level issues, including malware distribution, embedded prompt injection payloads, and exposed secrets. Of the malicious skills identified, 91% combined prompt injection techniques with traditional malware, a combination designed to bypass both AI safety mechanisms and conventional security tooling.

This mirrors the early state of npm and PyPI package ecosystems. The difference is that AI skills have direct access to credentials, file systems, and APIs.

What Anthropic Has Built In, and Where the Gaps Are

Anthropic isn't ignoring the security problem. The company has invested more in this than most AI vendors have at a comparable stage, most meaningfully in model training: in adversarial testing, Claude Opus 4.5 cut the success rate of prompt injection attacks to roughly 1%, down from 30-40% for earlier models. The table below maps what's been addressed against what's still open.

Defense in place: Model training (Opus 4.5, ~1% injection success rate)
Known gap: No runtime behavioral monitoring for anomalous tool use mid-task

Defense in place: Content classifiers on untrusted input
Known gap: No transitive trust analysis when multiple MCP servers are chained

Defense in place: Deletion protection (explicit approval required)
Known gap: No sanitization layer when data crosses tool boundaries (Chrome to Cowork to MCP)

Defense in place: Computer use safeguards (per-application permission grants)
Known gap: Desktop Extension RCE (CVSS 10) classified as outside the threat model

Defense in place: VM sandbox limiting most outbound connections
Known gap: File exfiltration vulnerability shipped months after internal disclosure

An ex-NSA security expert who analyzed Anthropic's approach in detail summarized it as: "Anthropic is ahead of the pack. But 'ahead of the pack' in agentic AI security is a low bar."

10 Rules Drawn from Documented Incidents

Model-level defenses get you part of the way. The rest depends on how you configure the tool and what habits you build around it. For a tool with this level of access, those choices do a substantial portion of the security work.

1. Treat every external file as a potential attack vector

The PromptArmor attack used a Word document that looked completely normal. Other documented attacks have used PDFs, Markdown files, and integration guides. Any file from an untrusted source, whether a forum download, an email attachment, or a shared file from a contact, is a potential injection carrier. Create a dedicated working folder for Cowork and keep external files out of it until you have confirmed what they contain.
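
Confirming what a file contains doesn't require opening it in Word, where the hidden text stays hidden. A .docx is a zip archive whose body lives in word/document.xml, and the formatting tricks from the PromptArmor attack leave fingerprints in that XML. The sketch below is a stdlib-only triage heuristic, not a full OOXML parser; real documents can order attributes differently or use theme colors, so treat a clean result as a weak signal.

```python
import re
import zipfile

def suspicious_runs(document_xml: str) -> list:
    """Flag formatting tricks of the kind used in the PromptArmor attack.
    In WordprocessingML, w:sz is in half-points (val="2" means 1pt) and
    w:color val="FFFFFF" is white text."""
    findings = []
    for m in re.finditer(r'<w:sz w:val="(\d+)"', document_xml):
        if int(m.group(1)) <= 4:  # 2pt or smaller: unreadable to humans
            findings.append(f"tiny font: {int(m.group(1)) / 2}pt")
    if re.search(r'<w:color w:val="[Ff]{6}"', document_xml):
        findings.append("white text")
    return findings

def scan_docx(path: str) -> list:
    """A .docx is a zip; the document body is in word/document.xml."""
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    return suspicious_runs(xml)
```

Running scan_docx over anything that arrives from outside your organization, before it goes into Cowork's working folder, is a cheap way to enforce this rule.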

2. Understand what the VM sandbox does and does not protect

Cowork's virtual machine limits most outbound network traffic, but the Anthropic API is whitelisted. That is not a design flaw; it is a necessary function. It means that prompt injection attacks using api.anthropic.com as an exfiltration channel are structurally harder to catch at the network layer, which is exactly how the PromptArmor attack succeeded.
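
The sketch below shows why host-level filtering alone is not enough, and what an egress control would have to check instead. The `org_keys` allowlist is a hypothetical control you would run in your own proxy, not a built-in Cowork feature; the `x-api-key` header is how the Anthropic API authenticates requests.

```python
def allow_request(host: str, headers: dict, org_keys: set) -> bool:
    """Why host allowlisting alone failed: the attacker's exfiltration
    traffic also targets api.anthropic.com. An egress proxy would need
    to pin the credential as well as the destination."""
    if host != "api.anthropic.com":
        return False  # everything else stays blocked by the sandbox
    # A request signed with a key your organization does not own is
    # exfiltration to someone else's Anthropic account.
    return headers.get("x-api-key") in org_keys
```

A check like this would have stopped the PromptArmor exfiltration at the network layer, because the attack depended on the attacker's own API key riding the trusted channel.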

3. Treat Claude Cowork vs Claude Code MCP risks differently

Claude Code and Cowork share the same MCP attack surface, but the difference in user profile matters here. Developers using Claude Code are more likely to read code before running it. Cowork is marketed to non-technical users who may install MCP connectors the same way they install browser extensions: quickly, without reviewing what the code does. Before connecting any MCP server, apply the same judgment you would to a software dependency:

  • Check the source repository and the publisher
  • Read exactly which permissions the server requests
  • Verify the project has recent maintenance activity
  • Start with read-only access and expand only if needed
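
Part of that review can be automated. The sketch below triages an MCP client config of the claude_desktop_config.json shape (a top-level "mcpServers" map), flagging anything not on your approved list and anything that launches a local executable. The approved-list policy is an assumption about your organization, not something the config format enforces.

```python
import json

def risky_servers(config_json: str, approved: set) -> list:
    """Flag MCP servers that are unapproved or that run local commands."""
    flagged = []
    for name, server in json.loads(config_json).get("mcpServers", {}).items():
        if name not in approved:
            flagged.append((name, "not on the approved allowlist"))
        if "command" in server:
            # "command" servers run an arbitrary local process with your
            # user's privileges; review that binary like any dependency.
            flagged.append((name, f"runs local command: {server['command']}"))
    return flagged
```

Run it against each workstation's config before rollout and after any change, and treat every new flag as a review item rather than a blocker to be clicked through.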

4. Be specific about which folders Cowork can access

Broad folder access amplifies the consequence of any successful injection. Financial documents, credentials, legal files, and anything containing personal data should stay in separate directories that Cowork cannot reach. The PromptArmor victim had connected Cowork to a folder containing confidential real estate files with partial SSNs. That access level determined the damage.
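
A quick pre-grant sweep can catch the mistake the PromptArmor victim made. The sketch below lists plain-text files under a folder that look like they contain SSNs before you approve that folder for Cowork; a real sweep would cover more patterns (account numbers, API keys) and more file formats, so this is a floor, not a guarantee.

```python
import re
from pathlib import Path

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def flag_sensitive(folder: str) -> list:
    """Return text-like files under `folder` that appear to contain SSNs."""
    hits = []
    for p in Path(folder).rglob("*"):
        if p.suffix.lower() not in {".txt", ".csv", ".md"}:
            continue
        try:
            if SSN_PATTERN.search(p.read_text(errors="ignore")):
                hits.append(str(p))
        except OSError:
            continue  # unreadable file: skip rather than crash the sweep
    return hits
```

Anything this flags belongs in a directory Cowork never sees.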

5. Reserve "Act without asking" mode for work you have prepared carefully

This mode removes the per-step approval that would give you a chance to catch unexpected behavior mid-task. The Zenity Labs email deletion occurred in a context where no intermediate approval step interrupted the injection. Use "Act without asking" only when the source files are trusted, the task scope is fully defined, and you are actively present to intervene.

6. Think carefully before connecting the Chrome extension

The extension keeps every logged-in site within Claude's reach. Any web page Claude visits during a task can contain instructions that redirect its behavior. If you do use it, keep the set of permitted sites small and keep sensitive browser sessions (banking, healthcare, email) out of the same window while Cowork is active.

7. Do not install Desktop Extensions without reviewing the permission model

Desktop Extensions run without sandboxing and with full system privileges. The LayerX RCE demonstrated that a malicious calendar event could execute arbitrary code at the OS level. Unlike browser extensions, DXTs have no permission boundary between their code and your machine. A CVSS 10 vulnerability in this category was considered outside Anthropic's threat model. Your defense cannot depend on the vendor's.

8. Keep scheduled tasks narrowly scoped and actively reviewed

Scheduled tasks run without supervision. A task with access to email, calendar, and file storage can cause significant problems if an injection planted in an earlier session activates when the scheduled task runs. Practical constraints that reduce the blast radius:

  • Limit each scheduled task to one data source
  • Define a bounded, predictable output format
  • Avoid downstream actions that are hard to reverse (deletions, sends, transfers)
  • Review outputs after every run before allowing the next one to execute
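
Those constraints are checkable before a task is ever scheduled. The sketch below validates a task spec against them; the dict shape is illustrative (Cowork has no such spec format built in), but writing tasks down in a reviewable form like this is itself part of the discipline.

```python
IRREVERSIBLE = {"delete", "send", "transfer"}

def validate_task(task: dict) -> list:
    """Check a hypothetical scheduled-task spec against the rules above."""
    problems = []
    if len(task.get("sources", [])) > 1:
        problems.append("more than one data source")
    problems += [f"irreversible action: {a}"
                 for a in task.get("actions", []) if a in IRREVERSIBLE]
    if not task.get("review_before_next_run", False):
        problems.append("no review gate between runs")
    return problems
```

A task that fails any check gets rewritten before it is allowed to run unsupervised.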

9. Know the compliance gap before your organization deploys this

Cowork activity is not captured in Anthropic's audit logs, Compliance API, or data exports. As NineTwoThree's Cowork vs Claude Code guide notes, organizations subject to SOC 2 Type II, HIPAA, or PCI-DSS cannot demonstrate a complete record of what Cowork accessed without deploying additional infrastructure at the workstation level, typically an OpenTelemetry gateway feeding into a SIEM. Plan that cost before deployment.

In the r/cybersecurity community, IT practitioners have already raised this exact problem: fund managers demanding Cowork installations with no guidance on what security guardrails are actually available. The answer, right now, is that most of those controls must be built externally.
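
To make the gap concrete: the minimum viable external control is an append-only record of what was touched and when. The sketch below writes JSON Lines to a local file; a production deployment would emit OpenTelemetry spans to a collector feeding your SIEM instead, but the record shape is the part that matters for answering an auditor's question.

```python
import json
import time

def audit_event(action: str, path: str, log_file: str) -> dict:
    """Append one workstation-side audit record as a JSON line."""
    event = {"ts": time.time(), "action": action, "path": path}
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

With records like these, "what did Cowork access last Tuesday?" becomes a query instead of a shrug.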

10. Report unexpected behavior instead of rationalizing it

If Claude accesses files or services you did not ask it to, if the scope of a task expands without instruction, or if it requests sensitive information out of context: stop the task and report it using the in-app feedback button. PromptArmor published their research specifically because they wanted users to be able to recognize what an attack in progress actually looks like, and to create pressure on Anthropic to treat disclosure reports as patching priorities.

Why These Incidents Keep Happening

Prompt injection isn't a bug in Cowork specifically. It's a property of how language models process information, and it applies to any tool in this category. A trusted input channel, whether a file, a web page, or a calendar event, carries content that Claude processes as data but that secretly contains instructions. Claude reads them and follows them.

This is prompt injection, and there is no architectural fix for it yet. The closest parallel in traditional security is SQL injection, which plagued web applications for over a decade until parameterized queries created a structural separation between code and data. Prompt injection is still waiting for an equivalent. Current defenses are filters, and filters can be bypassed. The drop in attack success from 30-40% to roughly 1% in Opus 4.5 is real progress. It is not zero.

The ten rules above won't eliminate all risk. They make targeted attacks significantly harder to execute and limit the damage if one gets through.

For Organizations Deploying This Across a Team

Individual use is manageable. Rolling Cowork out across an organization is a different problem, and individual safety habits don't cover it.

The audit trail gap matters most in regulated industries. Without additional infrastructure, you can't answer "what did Cowork access last Tuesday?" after the fact. Build that logging before deployment. Not after something goes wrong.

MCP governance also gets harder as your team grows. Every MCP server your organization approves is a shared attack surface. An explicit allowlist of approved connectors, with a source code review for anything community-built, is the minimum before rollout.

For organizations working through the broader governance picture, our post on Shadow AI covers the infrastructure and policy decisions that come up when AI tools with significant access start running across a team.

If you want a practical framework for putting guardrails in place before deploying any GenAI tool across your team, download our guide Effective Guardrails for Your GenAI Apps.

For teams that need AI agents running inside security boundaries your organization controls, NineTwoThree builds custom architectures with data sovereignty, audit logging, and access controls built from the start. Talk to us about what that looks like for your workflows.

Alina Dolbenska
Content Marketing Manager