Skip to main content
  1. Documentation/

Research Foundations

5 mins
Table of Contents

Carpenter’s security model draws on several lines of research in agentic AI security. This page summarizes the key ideas and how they informed the design.

The Dual LLM Pattern
#

Simon Willison, 2023The Dual LLM pattern for building AI assistants that can resist prompt injection

Willison proposed splitting an AI assistant into two separate models: a privileged LLM that handles user instructions and has access to tools, and a quarantined LLM that processes untrusted data (emails, web content) without any tool access. Crucially, raw content handled by the quarantined model is never exposed to the privileged model — instead, it populates opaque references like $email-summary-1 that the privileged model can display without seeing the underlying tokens.

This architectural separation prevents untrusted data from hijacking the planning loop. The insight is fundamental: if the model that makes decisions never sees attacker-controlled content, prompt injection has no attack surface.

In Carpenter: The review arc system implements this pattern. When an arc processes untrusted data (taint_level=tainted), a separate REVIEWER agent evaluates the output in isolation. The primary agent receives only structured metadata — status, byte count, exit code — never the raw tainted content. This is the dual-LLM pattern applied at the arc level, with the addition of a JUDGE for authoritative verdicts.

CaMeL: Defeating Prompt Injections by Design
#

Debenedetti, Shumailov, Fan, Hayes, Carlini, Fabian, Kern, Shi, Terzis, Tramèr (Google DeepMind, 2025)arXiv:2503.18813

CaMeL (CApability-Mediated Language) creates a protective system layer around the LLM. It explicitly extracts control flow and data flow from the trusted user query, then enforces that untrusted data retrieved during execution can never impact program flow. A capability system prevents data exfiltration by enforcing security policies at tool-call boundaries.

The key advance over the dual-LLM pattern: CaMeL tracks data provenance through the entire execution, tagging every value with its trust origin. This makes it possible to enforce policies like “untrusted data may be displayed but never used as a tool argument” — a guarantee that doesn’t depend on the model’s cooperation.

In Carpenter: Carpenter’s taint tracking system assigns trust levels to arcs and enforces access control at tool boundaries — clean arcs get HTTP 403 on untrusted data tools, and tainted arc output is never returned to the chat agent’s context. The code sanitization step in the review pipeline serves a similar function to CaMeL’s capability checks: it strips payload content so the reviewer judges structure and intent, not data.

FIDES: Securing AI Agents with Information-Flow Control
#

Costa, Köpf, Kolluri, Paverd, Russinovich, Salem, Tople, Wutschitz, Zanella-Béguelin (Microsoft Research, 2025)arXiv:2505.23643

FIDES applies classical information-flow control (IFC) to agent security. Every piece of data carries confidentiality and integrity labels that propagate through computation via lattice joins. The system deterministically enforces two policies: trusted actions require high-integrity inputs (P-T), and data flows must respect label constraints (P-F). A novel “variable hiding” primitive selectively removes low-integrity data from the agent’s context when integrity matters.

Where CaMeL focuses on preventing untrusted data from influencing control flow, FIDES provides a more general formal framework — it can also enforce confidentiality (preventing data exfiltration) and offers mathematical guarantees (noninterference for integrity, explicit secrecy for confidentiality).

In Carpenter: The verification system (carpenter/verify/) implements FIDES-style analysis directly on Python code generated by agents. A three-level integrity lattice (TRUSTED ⊑ CONSTRAINED ⊑ UNTRUSTED) mirrors FIDES’s label system, and static taint propagation walks the AST to assign integrity labels to every expression. Labels propagate through operations via lattice joins — the same formal mechanism FIDES uses.

Carpenter extends this analysis in one key direction: exhaustive path exploration. When code contains CONSTRAINED data in conditions, Carpenter enumerates representative test values for each constrained input (based on policy-typed literals like EmailPolicy, Domain, Url), computes the cross-product, and executes the code with tracked value wrappers for every combination. If the total execution tree is bounded (≤150 paths by default), every reachable path is exercised. Code that passes all paths is auto-approved without AI review.

This allows Carpenter to verify workflows that FIDES would conservatively reject. FIDES uses static variable hiding — it removes low-integrity data from context to prevent tainted control flow. Carpenter instead proves safety dynamically: if every possible execution path through the constrained branches is safe, the code is approved regardless of how complex the branching is. A whitelist restricts code to a verifiable Python subset (no while loops, no function definitions) to guarantee the path space is bounded.

At the arc level, the taint system provides a coarser-grained layer — arcs carry clean, tainted, or review labels. The tool partitioning (read/ vs act/) mirrors FIDES’s policy enforcement at tool boundaries. The trust audit log provides the traceable decision record that FIDES’s formal model requires.

The Common Thread
#

All three approaches share a conviction: you cannot solve prompt injection by hoping the model will be careful. Security must be architectural — enforced by the system layer, not requested of the model. The specific mechanisms differ (model separation, capability tracking, information-flow labels), but the principle is the same: deterministic enforcement at boundaries, not probabilistic cooperation from the AI.

Carpenter’s “measure twice, cut once” approach sits in this tradition. The review pipeline is a concrete implementation of the principle: every agent action passes through a deterministic inspection chain where code is sanitized, parsed, and structurally reviewed before it can execute. For high-stakes operations, the chain extends — multiple reviewers, a judge, separation-of-powers verification — “measure N times, cut once.”