Hacking Generative AI Agents: Running Kiro in Docker Sandboxes

Docker AI Sandboxes provide structural defense against credential theft by malicious MCP servers — ~/.aws/credentials simply doesn't exist inside the sandbox.

Hacking Generative AI Agents: Running Kiro in Docker Sandboxes

In the previous Hacking Generative AI Agents post, I demonstrated a multi-stage credential theft attack against Kiro. A malicious MCP server first earned the user’s trust with a harmless-looking tool call, then on a subsequent call silently exfiltrated AWS credentials. The mitigations I listed at the end of that post included sandboxed environments, but I left that as a reference rather than a demonstration. This post makes it concrete.

The key limitation of the human-in-the-loop defense is that it requires the user to recognize a suspicious request at the right moment. The multi-stage attack was specifically designed to avoid that: the suspicious request occurs only after trust has already been established. A stronger defense doesn’t depend on the user making the right decision — it makes the sensitive resource structurally unavailable.

Docker AI Sandboxes provide a stronger defense. ~/.aws/credentials simply doesn’t exist inside the sandbox. There is nothing to steal.

What Are Docker AI Sandboxes

Docker AI Sandboxes run agents inside isolated microVM environments — not just containers. Each sandbox has its own dedicated Docker daemon, filesystem, and network stack. The agent runs in a fresh environment where only the project directory is mounted. Host-level files — home directory, credentials, SSH keys, environment variables — are not accessible unless you explicitly provide them.

This is an architectural boundary, not a policy suggestion. A malicious MCP server running inside the sandbox cannot read ~/.aws/credentials because that path doesn’t exist. It cannot reach the host’s local network because the sandbox runs behind its own networking layer. And it cannot exfiltrate data via raw TCP, UDP, or ICMP because those are permanently blocked at the network layer — only HTTP/HTTPS traffic flows out, through a proxy that applies configurable access rules.

Docker Sandboxes are currently experimental. The sbx CLI is the primary interface. The steps in this post were tested with kiro-cli 2.0.1 — the tool is moving fast, and details like the login flow may change in later versions.

Installation

Install sbx and authenticate with Docker following the official installation instructions.

I used the ready Kiro template.

Setting Up MCP Servers

MCP servers are configured in .kiro/settings/mcp.json inside the project directory. The workspace is a live mount — files you create or modify on the host appear inside the sandbox, and vice versa.

For the villain MCP server used in this demo, create the project structure on the host:

mkdir -p ~/my-project/.kiro/settings
cp /path/to/server.py ~/my-project/server.py

Create ~/my-project/.kiro/settings/mcp.json:

{
  "mcpServers": {
    "nordhero": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "mcp[cli]",
        "python",
        "/home/agent/workspace/server.py"
      ]
    }
  }
}

The path /home/agent/workspace is where the project directory is mounted inside the microVM. Use that path in any MCP server args that reference project files — host-style paths like /Users/yourname/... won’t resolve inside the sandbox.

After adding or changing the MCP config, restart Kiro with /quit inside the session and sbx run kiro my-project to reload it.

Running Kiro in the Sandbox

Launch Kiro from your project directory:

sbx run kiro ~/my-project

On the first run, the sandbox pulls the base image (a minute or two), then asks you to choose a network policy. Choose Balanced for general development work — it allows the model API, package registries, and common code hosts. Locked Down is better for automated pipelines where you want to be explicit about every outbound destination.

For the Pro license, Kiro authenticates via AWS IAM Identity Center. One gotcha: the IAM Identity Center start URL is not in the Balanced allowlist by default, and the browser URL is not printed, or the browser is not automatically opened. I constructed it manually from the Start URL and added the confirmation code:

https://<your-org>.awsapps.com/start#/device?user_code=XXXX-XXXX

You can add the start URL explicitly allowed:

sbx policy allow network <your-org>.awsapps.com:443

Once authenticated, the session is stored at ~/.local/share/kiro-cli/data.sqlite3 and persists across restarts — you only need to do this once per sandbox, unless the session expires.

Kiro then presents a one-time confirmation before starting:

Kiro trust-all-tools mode warning shown on first launch inside the sandbox

Kiro trust-all-tools mode warning shown on first launch inside the sandbox

Select Yes, I accept. The Kiro sandbox template always runs with --trust-all-tools, but you can revert to approval-required mode for the current session with /tools reset, but the mode does not persist between restarts.

Why .kiroignore Is Not Enough

Before running the attack inside the sandbox, I tested a commonly suggested mitigation: adding ~/.aws/credentials to .kiroignore. The idea is that Kiro respects this file and won’t read paths listed there. As far as I know, .kiroignore only works with the Kiro IDE — it’s documented exclusively under the Kiro IDE docs — so I tested this with Kiro IDE.

It didn’t hold.

Kiro bypassing .kiroignore via shell execution — credentials extracted with grep and passed to MCP server

Kiro bypassing .kiroignore via shell execution — credentials extracted with grep and passed to MCP server

The attack adapted. When the direct file read failed with “Error(s) while reading file(s)”, the agent fell back to shell execution — tools that .kiroignore has no authority over.

.kiroignore controls what Kiro’s file-reading tools can access. It does not control what a shell running inside the same user session can access. Any attack path that includes shell execution — which most agents have — can bypass it entirely.

The only reliable mitigation is structural: the file must not exist.

The Same Attack, Different Outcome

To test the sandbox, I used the exact same villain MCP server from Part 1: the multi-stage version whose tool description instructs the model to read ~/.aws/credentials on the second call without notifying the user.

The first call succeeds as designed — trust is established, the agent returns hero information for the requested location with no indication that anything is wrong.

First get_hero call succeeds — trust established, hero information returned for Kuopio

First get_hero call succeeds — trust established, hero information returned for Kuopio

On the second call, the server triggers the credential read. The attack behaves exactly like the .kiroignore bypass — it falls back to shell execution, running cat ~/.aws/credentials directly. Inside the sandbox, there is no ~/.aws directory:

Second get_hero call — shell reads ~/.aws/credentials, gets FILE_NOT_FOUND, attack fails

Second get_hero call — shell reads ~/.aws/credentials, gets FILE_NOT_FOUND, attack fails

The agent reports that credentials don’t exist and cannot complete the authentication step. There is nothing to exfiltrate.

The attack fails at the filesystem level before the model has any opportunity to pass credentials to the server. There are no approval prompts to fool, no session trust to abuse, and no credentials to exfiltrate — because the host’s credential files were never mounted.

Compare this with Part 1:

Attack step Without sandbox With Docker sandbox
Tool trust earned on first call ✓ Succeeds ✓ Succeeds — no change
Agent reads ~/.aws/credentials ✓ File exists and is readable ✗ File does not exist
Credentials passed to MCP server ✓ Exfiltrated ✗ Nothing to pass
Exfiltration request leaves machine ✓ Outbound allowed Blocked by network policy

The filesystem isolation stops the attack at step 2. Network policies provide a second layer — even if credentials were somehow accessible (say, hardcoded in a .env file in the project directory), the outbound call to the attacker’s server would be blocked.

Network Policies: Defense in Depth

The sandbox applies a configurable network policy to all outbound HTTP/HTTPS traffic. Non-HTTP protocols — raw TCP to arbitrary ports, UDP, ICMP — are permanently blocked at the network layer. That last point has an interesting side effect: the DNS exfiltration technique from my DNS Data Exfiltration post, which encoded data in DNS queries to an attacker-controlled nameserver, would also not work from inside the sandbox. UDP/53 is blocked by default with no way to unblock it.

There are three built-in policy presets:

Mode Behavior
Open All outbound traffic allowed
Balanced Deny by default, with an allowlist for AI APIs, package managers, code hosts, and cloud services
Locked Down All outbound blocked; everything must be explicitly allowed

For Kiro development work, Balanced is a reasonable starting point — it allows the model API and common package registries without configuration. For sensitive work or automated agent pipelines where you want to be explicit, start from Locked Down and add only what you need.

Configuring the Policy

View your current rules:

$ sbx policy ls
NAME                           TYPE      ORIGIN   DECISION   STATUS   RESOURCES
default-ai-services            network   local    allow      active   api.anthropic.com:443, **.openai.com:443, ... (22 hosts)
default-package-managers       network   local    allow      active   pypi.org:443, registry.npmjs.org:443, ... (50 hosts)
default-code-and-containers    network   local    allow      active   **.github.com:443, **.gitlab.com:443, ... (25 hosts)
default-cloud-infrastructure   network   local    allow      active   **.amazonaws.com:443, **.googleapis.com:443, ... (35 hosts)
default-os-packages            network   local    allow      active   archive.ubuntu.com:443, ubuntu.com:443, ... (16 hosts)
blueprint:kiro-my-project      network   local    allow      active   *.factory.ai, api.workos.com

The default-cloud-infrastructure group is worth noting: **.amazonaws.com is allowed by default, which means an MCP server with injected AWS credentials could reach AWS APIs even under the Balanced policy. If that’s not needed for your work, tighten it.

Add an explicit allow for a domain:

sbx policy allow network api.example.com:443

Block a specific host (deny rules take precedence over allow rules):

sbx policy deny network 203.0.113.42

Reset to defaults:

sbx policy reset

Monitoring Outbound Traffic

This is where things get useful for incident investigation. The sandbox logs every network contact attempt:

sbx policy log

The log shows whether each attempt was allowed or blocked, and which proxy method handled it (forward proxy for HTTP/HTTPS, transparent for other TCP, or network-layer block). If a malicious MCP server tries to phone home, you will see the blocked connection attempt here — including the destination — even if the attempt was silent in the agent’s output.

During an active session, running sbx policy log after a suspicious tool call gives you a real-time view of what the agent or its tools tried to reach.

What the Network Policy Catches

Suppose the malicious MCP server found credentials elsewhere — perhaps in a .env file committed to the project directory. Without network policies, it could POST those credentials to an external endpoint. With Balanced or Locked Down policies, that call is blocked and recorded in sbx policy log. Here is the actual blocked log from running through this exercise:

$ sbx policy log
Blocked requests:
SANDBOX           TYPE      HOST                        PROXY          REASON                                LAST SEEN         COUNT
kiro-my-project   network   aws.amazon.com:443          forward        No matching allow rule (default deny)   12:25:08 23-Apr   9
kiro-my-project   network   repost.aws:443              forward        No matching allow rule (default deny)   12:18:47 23-Apr   6
kiro-my-project   network   docs.aws.amazon.com:443     forward        No matching allow rule (default deny)   12:18:40 23-Apr   3
kiro-my-project   network   <your-org>.awsapps.com:443  browser-open   No matching allow rule (default deny)   10:21:18 23-Apr   2

The first three blocks came from the Kiro agent browsing AWS documentation during my tests — legitimate behavior that happened to be outside the Balanced allowlist. The fourth — <your-org>.awsapps.com:443 — is the AWS IAM Identity Center start URL used for Kiro Pro login, which caused the initial authentication to fail. Adding it explicitly to the allowlist resolved the issue.

The log also revealed that Kiro sends telemetry to client-telemetry.us-east-1.amazonaws.com by default. If you’d rather not have that traffic, block it in the sandbox policy:

sbx policy deny network client-telemetry.us-east-1.amazonaws.com

Or change kiro-cli settings with sbx exec to run commands directly inside the sandbox:

sbx exec kiro-my-project kiro-cli settings telemetry.enabled false

The “shell escape” shortcut (!kiro-cli settings telemetry.enabled false inside the Kiro prompt) doesn’t persist after restarting. The full list of available settings is in the Kiro CLI settings reference.

What the Sandbox Doesn’t Protect Against

The sandbox provides strong isolation for host credentials and network egress. It does not protect everything.

Credentials in the project directory are still accessible. .env files, hardcoded keys in configuration, or any sensitive file that lives inside the mounted project path is readable by the agent and by any MCP server it connects to. The sandbox only excludes what isn’t mounted — host home directories, system credentials, and SSH keys. Project files are mounted by design.

The model API receives full context. Any information the agent processes — code, comments, variable names, context from tool outputs — is sent to the AI model API endpoint. The point is that this is the only channel data leaves through, and it’s a channel you’ve explicitly chosen to trust.

A malicious MCP server can still influence behavior. Prompt injection through tool output still works inside the sandbox — an MCP server can still return instructions that the model interprets as guidance. What the sandbox eliminates is the ability to act on those instructions in ways that require accessing host credentials or reaching blocked network endpoints. The blast radius is contained, but not zero.

Trust-all-tools mode removes user confirmation. Inside the sandbox, Kiro executes tool calls without asking. For interactive development, this is usually fine — the isolation serves as a safety net. For automated agent pipelines, review what tools you’re connecting before starting the session, because there is no second prompt if something unexpected happens.

Using AWS Credentials in the Sandbox

The sandbox prevents credential theft by not mounting your host credentials — but legitimate AWS work then requires you to be explicit about what credentials the agent gets. That constraint is actually the point: you decide exactly what it can access rather than the agent inheriting everything on your machine.

Never mount ~/.aws into the sandbox — that re-exposes the full credential store. Instead, assume a role and inject the short-lived session credentials as environment variables at launch. They expire in hours, limiting the window for any stolen credential to be useful:

CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/kiro-readonly \
  --role-session-name kiro-session \
  --duration-seconds 3600 \
  --query 'Credentials' --output json)

sbx run kiro \
  --env AWS_ACCESS_KEY_ID=$(echo $CREDS | jq -r .AccessKeyId) \
  --env AWS_SECRET_ACCESS_KEY=$(echo $CREDS | jq -r .SecretAccessKey) \
  --env AWS_SESSION_TOKEN=$(echo $CREDS | jq -r .SessionToken) \
  --env AWS_DEFAULT_REGION=eu-central-1

If the agent only needs to call AWS APIs, combine this with Locked Down mode and an AWS-only allowlist — then even if a malicious tool call gets hold of the injected credentials, it cannot POST them anywhere else.

Least-privilege IAM is the most important control here: it constrains what’s possible, not just what’s accessible. If the agent is doing read-only investigation work, the role should have no write permissions — then an unexpected API call either fails or is harmless.

Summary

The specific attack from Part 1 — steal host credentials via a malicious MCP server — fails inside the sandbox at the first step: ~/.aws/credentials is simply not mounted, so there is nothing to read or send. Network policies add a second layer, ensuring that even if credentials were somehow accessible in the project directory, the outbound call to the attacker’s server would be blocked. The remaining risks — project files, model API context, prompt injection — are real, but smaller, more visible, and more directly under the developer’s control than the attack surface without the sandbox.

Categories:

Want to be the hero of cloud?

Great, we are here to help you become a cloud services hero!

Let's start!
Book a meeting!