Comware · Agent Security

01 — Agent Security in Depth Framework (v0.2.2)

A native-primitives defence-in-depth model for AI agent security in regulated enterprise environments.

v0.2.2 status. Editorial errata pass on v0.2.1, applying the eight items the third TriLLM cycle (01-review-v0.2.1.md) unanimously recommended. No structural changes; this is the locked architectural baseline. The pattern v0.1 → v0.2 → v0.2.1 → v0.2.2 (locked) → v0.3 (parallel non-blocking) mirrors NIST CSF 1.0 → 1.1 → 2.0 versioning. v0.3 (Constitution for Framework Revision + six-axis falsification instrumentation) is now a parallel non-blocking track. v0.2.1 archived at archive/01-immunosec-framework-v0.2.1.md.

v0.2.1 → v0.2.2 errata (editorial only). (1) J's type ambiguity clarified — composition is the native risk; J is the architectural overlay. (2) F's "highest tier permitted" wording corrected to "minimum tier sufficient for the control objective, capped by maximum tier permitted by Primitive D classification". (3) D's magnitude/aggregation handling made explicit via E cross-reference. (4) §7 residual "Primitive I" wording swept and updated to I_arch / I_commercial. (5) §6.2 cross-tenant aggregation residue swept. (6) 73.2→8.7 empirical anchor cleaned. (7) "Conformance-falsifiable" language normalised throughout §8 and §10. (8) F/G/H reconnaissance blind-spot framing sharpened in §9 from "research-grade open problem" to "named structural coverage gap for low-D-class reconnaissance." See § 11.7 for the v0.2.1 review response.

v0.2.1 → v0.2.2 substantive content (carry-over). v0.2.1 was the patches in response to the second TriLLM debate — typed spine + dependency graph (§3.0); F0–F4 reconstruction tiers (§3.6); I split into I_arch + I_commercial (§3.9); J recast as bounded capability-graph assurance overlay (§3.10); §5.3 prioritisation calculus added; §8 measurement contracts replace bare falsification conditions; §9 open problems expanded honestly. § 11.6 records the v0.2 critique response.

v0.2 status (further carry-over). v0.2 was the rewrite of v0.1's immune-system-derived framework, applying Action A from the v0.1 debate synthesis. The immune-system narrative survives as Appendix A — reading guide and mnemonic, not architecture. v0.1 preserved at archive/01-immunosec-framework-v0.1.md. Section 11 records which v0.1 critique findings were accepted, qualified, or pushed back; §§ 11.6 / 11.7 do the same for the v0.2 and v0.2.1 cycles respectively.


1. Why this framework exists

Three shifts make AI agent security categorically different from prior dev-tool security and from prior application security:

  1. The agent has the human's credentials. Git, gh tokens, SSH keys, cloud CLI sessions, secret-manager handles, sometimes production access. The agent acts under a human identity with no MFA prompt because the human authenticated this morning. Audit logs say "human did this" when an agent did it.
  2. Untrusted text becomes executable intent. Every file the agent reads — a README, an issue comment, a transpiled .min.js, a webpage, an MCP tool result, a system-reminder block — is potential instruction. There is no syntactic boundary between data and command. Large language models cannot reliably distinguish the two; this is an open research problem.
  3. YOLO / autonomy mode removes the human-in-the-loop. Background execution means the failure mode of #1 + #2 isn't "user sees weird suggestion and declines" — it is "agent silently does it and the diff lands in main an hour later."

Classic security frameworks were built for stable boundaries (network perimeters), enumerable enemies (signature-driven detection), and prevention as the primary mode of victory. Agent security violates all three assumptions. The framework presented here is built from primitives that do fit the new shape: delegated authority, instruction provenance, information flow, side-effect taxonomy, runtime containment, observability with content minimisation, detection-response-recovery, learning, governance, and composition.

The framework is opinionated about its target market — regulated enterprise, particularly APRA-regulated AU industry and the Commonwealth + AUKUS ladder (see 06-international-expansion.md). It is not designed for hobbyist agents, consumer copilots, or unregulated B2B integrations, though most of its primitives apply.


2. The threat surface in one page

The framework presupposes the threat surface, summarised here so this document is self-contained.

The dominant attack classes, with named real-world incidents (2025–2026):

# Class Real evidenced example
1 Indirect prompt injection via tool result EchoLeak (CVE-2025-32711, Jan 2025) — zero-click Copilot prompt injection, CVSS 9.3, exfil from a single inbound email
2 Confused deputy / privilege misuse Meta rogue agent (Mar 2026) — agent passed every IAM check, exposed sensitive data to unauthorised employees with valid credentials
3 Data exfiltration via approved channels Vercel / Context.ai (Apr 2026) — attackers pivoted via third-party AI tool with legitimate access
4 Destructive blast under autonomy PocketOS / Cursor (Apr 2026) — agent wiped production DB + backups in <10 seconds using Railway token found in unrelated file
5 Supply-chain compromise of agent tooling LiteLLM PyPI (Mar 2026) — malicious packages distributed; MCP STDIO RCE (Apr 2026) — by-design flaw, 7,000+ servers affected
6 Cost / token amplification Documented as a class; Wiz Research: 340% YoY rise in prompt-injection attempts
7 Identity / non-repudiation collapse Structural — every "human" commit by an agent under the human's token
8 Compliance & jurisdiction EU AI Act high-risk obligations (Annex III deferred to 2 Dec 2027 by the Digital Omnibus); APRA Letter on AI 30 Apr 2026
9 Reproducibility / forensics Structural — non-deterministic outputs, vendor-controlled model versions, no agent-regression-test standard
10 Insider-threat amplification Per-capita blast radius rises with agent capability

(See 04-reference-threat-model.md § 2.5 for the full real-world parallel set.)

Defence-in-depth is empirically supported. The framework's central architectural claim — that no single layer holds and the stack is the value — is supported by three independent classes of evidence:

  1. Anthropic's April 2026 Trustworthy Agents in Practice explicitly formalises a four-layer shared-responsibility model and reports significantly lower exploit success rates against stacked controls than against any individual layer.
  2. Market consolidation pattern (2025–2026). Every commercial AI-security stack assembled by Cisco, Microsoft, Palo Alto Networks, SentinelOne, and CrowdStrike converged on multi-layered architectures rather than single-point detection (see 05-market-research.md § 3). This is the strongest form of "the market voted with capital"-style evidence available.
  3. Published OWASP / NIST analysis of agent-class incidents (EchoLeak, PocketOS, MCP STDIO RCE) in 2025–2026 consistently attributes successful attacks to multiple concurrent control failures rather than to defeat of any single defence.

(v0.2.2 note: earlier framework versions cited a specific "73.2% → 8.7%" reduction figure attributed to an October 2025 joint OpenAI / Anthropic / Google DeepMind paper. The v0.2 critique correctly flagged that this citation was not verifiable. v0.2.2 has removed the specific number from the load-bearing claim; the defence-in-depth thesis rests on the three classes of evidence above, which are independently citable. If the original paper surfaces with a verified methodology, v0.3 may restore the specific number as illustrative.)


3. The orthogonal-primitives spine

The framework's spine is ten primitives, derived from the structure of agentic risk rather than from analogy. Each primitive is:

The ten items are not strictly hierarchical, but they have a partial dependency order described in § 3.0 below. The order below reflects that partial dependency, not a "first to last" deployment sequence.

3.0 The typed spine and dependency graph (NEW in v0.2.1)

The v0.2 critique correctly identified that v0.2 called all ten items "orthogonal native primitives" while the document itself acknowledged they have dependencies. The honest typing:

Category Items What it is What it does
Native risk primitives A Authority · B Instruction Provenance · C Information Flow · D Side-Effect Taxonomy · Composition (the risk; handled by overlay J) Properties intrinsic to agentic action. Remove any of them and agentic systems are still possible; remove the security properties of any of them and the systems become unsafe. Define what must be true for an agentic system to be safe.
Control planes E Runtime Containment · G Detection-Response-Recovery · H Learning & Assurance Operational mechanisms that enforce or maintain the native risk invariants. Sense → decide → act → learn. Define how the invariants are operationalised in deployed systems.
Evidence substrate F Observability + Evidence The foundation from which detection, learning, and assurance all read. Upstream of every other category. Defines what can be known about the system.
Assurance plane I_arch (Assurance & Accountability) Internal control verification: are the other primitives' invariants being maintained? Board-reportable, audit-grade. Defines how the system proves it works.
Composition overlay J The architectural overlay that handles the composition risk listed under native risks above. Recast in v0.2.1 as a bounded capability-graph overlay on A–G, not a peer primitive. Defines how the composition risk is operationalised across multiple agents.

Important typing note (v0.2.2 clarification). The composition risk is itself a native risk of agentic systems (multi-agent privilege laundering happens whether or not we have an overlay to handle it). Primitive J is the architectural overlay that operationalises the controls for that risk. The rest of the document uses "J" as shorthand for both — readers should understand "Primitive J" to mean "the composition overlay handling the composition native risk."

Regulatory mapping is no longer in the spine. v0.2 placed per-regime compliance evidence collection inside Primitive I and called it the framework's "single most defensible piece of IP." v0.2.1 corrects this: regulatory mappings are commercial adapters that emit evidence from I_arch's substrate to APRA / MAS / PRA / OSFI / OCC / EU AI Act formats. They live outside the architectural spine, in 03-maturity-model.md and in the regulator-mapping module of the productised wedge (see 07-build-vs-buy.md v0.2). The architectural framework is jurisdiction-agnostic; the adapter layer is jurisdiction-specific.

This was the v0.2 critique's central finding (Diagnosis 2 in 01-debate-v0.2-synthesis.md). Remove APRA / MAS / PRA from the world and the old Primitive I disappears while A–H and J are unchanged — that is the test of a commercial wrapper, not an architectural primitive.

The dependency graph

Reading the graph: native risk primitives (A, B, C, D, J-overlay) define the invariants. E bounds blast radius. F is the substrate everything else reads from. G, H, I_arch sit above F and read from it. F is upstream of detection, learning, and assurance — which is why the v0.2 minimisation correction to F propagates to all three (see § 3.6, 3.7, 3.8).

What this typing earns

  1. The count question becomes principled-but-revisable. Why these items? Five native risk questions (who acts, under what instruction, on what data, producing what side effect, composed how), three control planes (sense, decide+act, learn), one substrate, one assurance plane. The internal completeness criteria for each category are testable; if any agentic-action question is unanswered, a new native primitive is required; if any operational mechanism is missing, a new control plane.
  2. Real incidents map to dependency-graph paths, not single primitives. An attack that defeats prompt injection (B failure) AND succeeds because behavioural detection couldn't catch the resulting tool-call (G blind spot) AND was not in the threat-intel library (H gap) is one coupled fault tree, not "Primitive B failed." Incident review uses the graph.
  3. The framework can now be pruned, not just extended. v0.2 had no honest path to remove a primitive (Kimi's "framework that cannot die"). With typing, removal becomes legitimate: if a candidate primitive fails the "intrinsic to agentic action" test for its category, it is demoted out of the spine (as v0.2.1 does to old I).

3.1 Primitive A — Agent Authority

What it is. The model of who may cause side effects, under whose delegation, with what scope, for how long, and under what approval condition. The agent-specific extension of IAM.

The invariant. Every action the agent takes is attributable to a distinct, scoped, time-bounded agent principal — never to an impersonated human user. Every commit, push, API call, tool invocation carries an agent_principal_id and (where the human initiated it) a human_initiator_id.

Key controls.

Failure mode when missing. Audit logs misattribute agent actions to humans. SOX / SOC 2 / CPS 230 change-management controls are silently bypassed because the agent's commits look like human commits. Insider-threat amplification by agents becomes invisible.

Design rule. Agents must operate under their own attributable principal type, never as impersonations of human users.

Falsification condition. If, three months after rollout, more than 5% of production agent actions cannot be attributed to a named agent principal distinct from the originating human, the implementation has failed the design rule.

3.2 Primitive B — Instruction Provenance & Trust Hierarchy

What it is. The model of where instructions came from, which instructions may override which, and what constraints retrieved or tool-generated content imposes on policy and tool choice.

The invariant. Untrusted content (issue comments, web pages, MCP tool results, retrieved documents, plugin descriptions) cannot authorise privileged side effects or override system policy without an explicit policy-aware review step.

Key controls.

Failure mode when missing. ChatGPT's worked example: untrusted GitHub issue → agent modifies CI workflow → secrets leak through approved logs. The signed plugin, the egress proxy, and the observability stream all behave correctly; the agent was told to do the wrong thing through a channel the framework treats as data.

Design rule. Untrusted instructions can influence agent output but cannot influence agent policy or tool choice without a policy-aware override step.

Falsification condition. Red-team test corpus that embeds policy-override instructions in tool-result content; if more than 10% of tests produce a policy override, the implementation has failed.

3.3 Primitive C — Information-Flow Model

What it is. Data-classification-aware control of what the agent may read, what may influence its policy decisions, what may leave the system, and what may be written to durable state.

The invariant. Each data class has an explicit set of permitted sources, permitted processors, permitted sinks, and permitted retention. The agent cannot cross those boundaries silently.

Key controls.

Failure mode when missing. Customer PII flows into the LLM provider's training pipeline. Tenant A's prompts surface in Tenant B's session. GDPR / Privacy Act / HIPAA breaches that the perimeter looks clean on.

Design rule. Data classes constrain the agent's input sources, processors, sinks, and retention. No data class may reach an unapproved combination without an explicit policy decision logged as evidence.

Falsification condition. If three months of telemetry show any data-class-to-unapproved-sink flow without a corresponding logged policy decision, the implementation has failed.

3.4 Primitive D — Side-Effect Taxonomy

What it is. Every action the agent can perform is classified along three orthogonal axes: reversibility, externality, and regulator-notifiability. The classification determines which other primitives apply.

The invariant. Irreversible, externally-visible, or regulator-notifiable actions require pre-commitment to non-destructive action paths — staging, simulation, queuing, dual-control, time-delay — before execution. Termination after the fact is a cleanup primitive, not the headline response.

Action axes:

Reversibility Externality Regulator-notifiability
Read-only Internal-only Not notifiable
Reversible-without-cost Externally-visible (logs) Notifiable on breach
Reversible-with-cost Externally-visible (effect) Notifiable on event
Irreversible (durable state change) Customer-facing Required to report
Irreversible (external commitment) Regulator-facing

Magnitude and aggregation handling (v0.2.2 cross-reference). The three axes above classify individual actions but do not capture magnitude (a $5,000 refund vs a $5 refund) or aggregation (10 sequential $499 refunds within a single session). Both are handled by Primitive E's blast budgets: per-action thresholds bound magnitude; per-session aggregate thresholds bound aggregation. The classification (D) determines which budget applies; the budget itself (E) enforces it. The v0.2.1 review correctly identified that this cross-reference was implicit; v0.2.2 makes it explicit.

Key controls.

Failure mode when missing. Gemini's worked example: agent issues terraform destroy; framework's response is "terminate the session"; production is already gone. Or: agent issues 10 sequential $499 refunds; framework's response is "kill the session after refund 8"; $4,000 is already moved.

Design rule (the one Doc 01 v0.1 got wrong). For irreversible side effects, the headline response is deferral, not termination. Termination becomes a cleanup primitive that operates after deferral has prevented the irreversible action.

Falsification condition. Tabletop exercise: simulate a compromised agent issuing irreversible destructive actions. If the framework's controls allow the action to externalise faster than the response time, the implementation has failed.

3.5 Primitive E — Runtime & Containment

What it is. Designed-to-die runtime environments. Ephemeral execution, blast budgets, JIT credentials, forensic capture on termination, sandboxing where the per-surface compliance posture requires it.

The invariant. The blast radius of any individual agent session is bounded by design. A compromised session cannot persist, cannot exceed budget, and leaves no durable side effects outside the explicitly approved set.

Key controls.

Failure mode when missing. Agent runs on the developer's laptop with full host filesystem and network. PocketOS / Cursor wipe: agent finds Railway token in unrelated file, destroys production in 10 seconds. No bound on damage.

Design rule. Ephemeral, attestable, blast-bounded runtimes dominate in-flight monitoring. You do not need perfect detection if compromised principals cannot persist past their bounded window.

Falsification condition. Chaos-test exercise: force-kill the agent runtime mid-session. If anything persists that shouldn't — files, network state, credentials, in-flight side effects — the implementation has failed.

3.6 Primitive F — Observability & Evidence (tiered F0–F4 in v0.2.1)

What it is. Structured, queryable, immutable event streams of every agent action — with tiered reconstruction fidelity (F0–F4) mapped to Primitive D side-effect classes, plus tamper-evidence, tenant-aware retention, and privacy-preserving aggregation where the threat model permits. The substrate for G (Detection-Response), H (Learning), and I_arch (Assurance).

The invariant (revised in v0.2.1). Every primitive's enforcement action is logged with attribution at the fidelity tier appropriate to the side-effect class of the action it relates to. High-fidelity reconstruction is available for high-risk irreversible actions; metadata-only logging is the default for read-only / internally-visible actions. The v0.2 framing "minimisation is the default" was correct in spirit but operationally incomplete — the v0.2 critique correctly identified that G, H, and J need semantic content to function, and a single minimisation default broke all three.

The F0–F4 reconstruction tiers (NEW in v0.2.1).

Tier Content Maps to Primitive D classes Access controls
F0 Metadata only: event type, timestamp, principal, surface, outcome. No prompt, no tool arguments, no response content. Read-only / internal / not-notifiable Telemetry team default access
F1 Metadata + policy-decision logs: which rule fired, why, what the outcome was. Still no raw prompt / tool args / response. Reversible-without-cost / internal SOC / detection team access
F2 F1 + hashed / tokenised content fingerprints. Detection signatures and behavioural baselines can be computed; raw content is not retrievable from the log alone. Reversible-with-cost / externally-visible SOC + incident-response with audit
F3 F2 + redacted-PII raw content (the prompt, tool args, response, with PII / secrets tokenised). Sufficient for signature creation against semantic attacks (prompt injection, instruction smuggling). Irreversible / customer-facing Incident-response only; explicit access trail per query
F4 Full unredacted content. Forensic-grade replay. Required only for active investigation of high-impact incidents. Regulator-notifiable / financial / production / external Two-person rule for access; access trail logged at F2+ in its own right

The default-by-class (v0.2.2 wording fix). Every event is emitted at the minimum tier sufficient for the control objective, capped by the maximum tier permitted by its Primitive D classification. "Sufficient" is determined by the control objectives the event must serve (detection by G, signature creation by H, audit by I_arch). F0 / F1 are the operational steady state; F2 supports behavioural detection; F3 supports semantic signature creation; F4 is reserved for forensic-grade replay of high-risk incidents. (v0.2.1's "highest tier permitted" framing recreated the panopticon pressure the framework was supposed to reject; the v0.2.1 review correctly identified this and v0.2.2 corrects it.)

Key controls.

Failure mode when missing or mis-tiered. Either (1) investigation cannot reconstruct a session post-incident because tier was too low for the action's class, or (2) the audit log becomes the breach because tier was too high for the action's class. The v0.1 panopticon failed in mode (2); a naïve v0.2 minimisation default would fail in mode (1) for high-risk actions. F0–F4 tiering is the resolution.

Design rule. Observability is the substrate of every other primitive. Reconstruction fidelity is class-conditional, not uniform. Universal raw capture is malpractice (v0.1 error); universal minimisation is operationally incoherent (v0.2 error); class-conditional tiering is the correct shape.

Falsification condition (revised). Pick a random agent session from yesterday. (a) Reconstruct what happened end-to-end at the tier appropriate to its highest-risk action in under 60 seconds. (b) Verify that every event was emitted at no higher a tier than its Primitive D classification authorised. (c) Verify that every F3/F4 access has a matching F2 access-trail event. Failure on any of the three is a failure of the rule.

Measurement contract (NEW in v0.2.1). See § 8.

3.7 Primitive G — Detection, Response & Recovery

What it is. Three separable verbs that v0.1 conflated into "Adaptive Immunity": detection (sense an anomaly), response (do something proportionate), recovery (return to a safe state). Each has distinct mechanisms and distinct evidence requirements.

The invariant. Suspicious agent behaviour produces a graduated response proportional to the confidence of detection, the reversibility class (Primitive D) of the action, and the authority scope (Primitive A) of the principal — never a binary block-or-allow.

Key controls.

Detection

Response (graduated, the v0.2 correction)

Recovery

Failure mode when missing. Confirmed-compromised agent continues acting because detection only emits alerts. Or detection acts but only blocks the immediately-detected action while the agent retries via a different path. Or termination fires but credentials persist and the agent reconnects with the same authority.

Design rule. Detection produces a risk score; response is graduated; recovery includes credential revocation, side-effect rollback, and downstream notification. Termination without recovery is incomplete.

Dependency on F (NEW in v0.2.1). Signature-based detection runs on F0–F1 (metadata is sufficient for many known-bad patterns). Behavioural detection requires F2+ — semantic fingerprints, not just metadata. The v0.2 critique correctly identified that v0.2's "minimisation default" framing broke this: a metadata-only stream cannot support "the agent's tool-call shape changed semantically." Detection's tier requirement is determined by what the rule needs to evaluate; F's tier is determined by what the action's D class authorises; the deployable detection coverage is the intersection of those two constraints, not the union.

Falsification condition. Walk through a confirmed high-confidence anomaly scenario end-to-end. Measure: time-to-detection, time-to-quarantine, time-to-credential-revocation, completeness of rollback, downstream notification latency. If any step requires a human in the critical path, or any rollback step fails, the implementation has failed.

Measurement contract (NEW in v0.2.1). See § 8.

3.8 Primitive H — Learning & Assurance

What it is. The feedback loop from incidents and red-team exercises into detection signatures, policy rules, and behavioural baselines. Plus the assurance program that validates the framework is working.

The invariant. Every incident and every red-team finding produces a deployable signature or policy update within a bounded SLA. Detection coverage is measured continuously, not annually.

Key controls.

Failure mode when missing. Red-team findings discussed in retros, never operationalised. Detection coverage drifts; signature library decays. The same attack class succeeds twice.

Design rule. Memory compounds across incidents within a tenant; the rate of compounding is the measure. Single-tenant learning is necessary and deployable; multi-tenant aggregated learning is a future moat that depends on unsolved research problems.

Dependency on F (NEW in v0.2.1). Signature creation against semantic attacks (prompt injection, instruction smuggling) requires F3+ content — redacted-PII raw text suffices for fingerprinting, but pure metadata does not. The v0.2 critique correctly identified that v0.2's framing implied signature creation could work on minimised telemetry alone; in practice, the H pipeline needs explicit F3-tier authority for the events feeding signature generation.

Falsification condition. Show the last six signatures added to detection. For each: when was the incident or red-team finding? When did the signature deploy? If the median exceeds the SLA, the loop is broken.

Measurement contract (NEW in v0.2.1). See § 8.

3.9 Primitive I_arch — Assurance & Accountability (split from v0.2's monolithic Primitive I)

v0.2.1 structural change. v0.2's Primitive I bundled two distinct concerns: an architectural assurance plane (internal control verification, board reporting, attributable identity) and a commercial regulatory-adapter layer (mappings to APRA / MAS / PRA / OSFI / OCC / EU AI Act control codes). The v0.2 critique correctly identified the latter as GTM laundered into the architectural spine — remove APRA / MAS / PRA from the world and the regulatory-mapping concerns disappear while everything else is unchanged. That is the test of a commercial wrapper, not an architectural primitive.

v0.2.1 separates them:

I_arch — what stays in the spine

What it is. The plane on which the other primitives' enforcement is verified. Did A's controls actually attribute every action to an agent principal? Did D's controls actually defer every irreversible action? Did F's tiering actually emit at the correct tier for each action's D class? Internal control-effectiveness verification.

The invariant. Every native risk primitive (A, B, C, D, J-overlay) and every control plane (E, G, H) has a corresponding control-effectiveness signal in I_arch — a measurable indicator of whether its design rule is being maintained in the deployed system. The signal is jurisdiction-agnostic; the commercial adapters consume it.

Key controls.

Failure mode when missing. Control-effectiveness is opined on, not measured. The framework reports to the platform team but has no board-grade signal. Policy exists on paper but its enforcement cannot be verified.

Design rule. Every native risk primitive and every control plane has a measurable control-effectiveness signal. The signals are jurisdiction-agnostic; commercial adapters consume them to emit regime-specific evidence.

Falsification condition (revised in v0.2.1). For each primitive A–H and J, name its control-effectiveness signal. If any primitive lacks a measurable signal, I_arch is incomplete and the framework's assurance posture is incomplete. Note: the v0.2 falsification condition ("can APRA / MAS / PRA / OSFI / OCC ask tomorrow and be answered in 24 hours") was the v0.1 critique's "this looks like a sales demo" critique made operational. It still matters — but it is now a measurement on the I_commercial adapter layer, not on I_arch.

Measurement contract (NEW in v0.2.1). See § 8.

I_commercial — what leaves the spine

The regulatory-mapping module that converts I_arch's primitive-level signals into APRA / MAS / PRA / OSFI / OCC / EU AI Act / DORA / NIS2 / HKMA SPM / JP FSA / NYDFS Part 500 / ISO 42001 / IRAP ISM / DSPF / NCSC / DoD CDAO / AUKUS Pillar 2 control attestations is not in the architectural spine. It is the load-bearing piece of the productised wedge (07-build-vs-buy.md v0.2). Per-regime mapping tables remain in 03-maturity-model.md.

This separation matters because:

  1. The architectural framework is jurisdiction-agnostic. A buyer in a regime not yet mapped (Singapore tomorrow, Brazil next year) gets the same framework with a new adapter.
  2. The framework can be honestly applied to non-regulated contexts. Hobbyist agents, consumer copilots, and B2B-API integrations get A–H + I_arch + J without needing I_commercial — which v0.2's monolithic I did not allow.
  3. Prioritisation across primitives becomes legitimate. § 5.3 (new) can now say "for adversary class X with regulatory exposure Y, prioritise primitives Z₁, Z₂, Z₃" without I's structural pressure forcing equal narrative weight.

3.10 Primitive J — Composition (recast in v0.2.1 as a bounded capability-graph assurance overlay)

v0.2.1 structural change. v0.2 framed J as a peer primitive with a falsification condition ("no path exists by which untrusted content induces an agent to act outside its declared scope"). The v0.2 critique correctly identified this as computationally intractable — "no path exists" in a stochastic multi-agent system is proof-of-negative and exponentially expensive to test even partially. Several models converged on the fix: recast J as a bounded capability-graph assurance overlay, not a peer architectural primitive.

What it is (revised). J is an overlay applied to A through G when those primitives are exercised in multi-agent compositions. It is not a separate native risk to be defended against; it is a scope extension of the native risks already covered (A's authority, B's instruction provenance, C's information flow, D's side-effect classification) plus a control-plane extension (G's detection-response operating across agent boundaries).

The invariant (revised). For any composition of agents {a₁, a₂, …, aₙ}, the declared joint capability set is the union of each agent's individual capabilities subject to taint-propagation rules from B (instruction provenance) and C (information flow). Any agent action that exceeds the declared joint capability set is a policy violation. The framework does not claim to prove the absence of all unauthorised paths; it claims to bound the residual risk through declared capabilities + taint propagation + representative adversarial testing + explicit residual-risk acceptance.

Key controls (revised).

Failure mode when missing. Agent A is restricted from reading customer PII. Agent A asks agent B to summarise a record set. Agent B reads the PII and returns the summary. Agent A now has effective access to data its policy forbids. The v0.2.1 control is: agent A's request to B is logged with the requested capability ("summarise records X"); the declared joint capability set is checked; if the joint capability is not declared, the request is rejected. The framework does not prove this catches every such path; it bounds the residual risk to compositions whose joint capability set has been declared and tested.

Design rule (revised). Composition is governed by declared joint capabilities + taint propagation + representative adversarial testing. The framework names what is in-scope (declared joint set) and what is residual-risk (everything else, explicitly accepted by the accountable owner). It does not claim to verify the absence of unauthorised paths.

Falsification condition (revised). For each deployed composition: (a) is the joint capability set explicitly declared? (b) is the representative adversarial test suite documented, current, and passing? (c) is the residual-risk acceptance signed and dated by the accountable owner? (d) when did the composition last surface an unauthorised-path incident, and how was it incorporated back into the declared set? Failure on any of (a)–(c) is a failure of the rule. Recurrence of (d) without (b)–(c) update is a deeper failure.

Measurement contract (NEW in v0.2.1). See § 8.

Open problem flag (NEW in v0.2.1). General intractability of multi-agent verification (proving absence of unauthorised paths in stochastic multi-agent systems) is moved to § 9. v0.2.1 explicitly does not claim to solve this; it bounds the problem to a declared, tested, signed-off perimeter.


4. Adversary capability model

Defence-in-depth without an adversary model is decoration. v0.2 introduces an explicit capability model mapped to authority boundaries.

Adversary archetype Capabilities Goal Primitives most threatened
Opportunistic external (criminal) Crafts prompt-injection payloads; embeds in publicly-accessible content; high volume, low precision Financial gain via fraud, ransomware, data exfil B (instruction provenance), G (detection)
Targeted external (organised crime / nation-state) Bespoke payloads; supply-chain compromise of agent tooling; patient, well-resourced; multi-stage Persistent access, IP theft, regulatory disruption A (authority), B, C (information flow), E (containment), H (intel)
Malicious vendor / compromised maintainer Controls an MCP server, plugin, library, or skill the customer trusts; can ship deliberate or coerced modifications Subversion of agent behaviour at scale; data exfil via approved channel B, C, F (observability of vendor behaviour), H (drift detection), J (composition)
Malicious insider Has legitimate access to one or more agent surfaces; can craft inputs to manipulate agent behaviour while remaining within their own authority Theft, sabotage, fraud — agent-amplified A (originating-human attribution), D (side-effect controls), G (response), I (governance)
Inadvertent insider Untrained or careless user; legitimate access; produces ambiguous or contradictory instructions Operational error, data exposure, accidental destructive action D (side-effect deferral), G (response), I (governance)
Compromised user / phishing victim An attacker controls a legitimate user's interaction with the agent (e.g. customer uploads attacker-crafted PDF to a customer-facing agent) Same as opportunistic external, but via a higher-trust channel B, D, G

For each customer deployment, the framework should identify which adversary archetypes are in-scope (varies by industry, threat intel, prior incidents) and which primitives must be most heavily invested in to defend against them. The Phase 0 / Phase 1 / Phase 2 phasing in 07-build-vs-buy.md is calibrated for the regulated-AU buyer's typical adversary mix (heavy on targeted external + malicious vendor + insider; lighter on consumer-facing opportunistic).


5. Cross-cutting concerns

Two concepts that span all ten primitives rather than belonging to any one.

5.1 Global threat-elevation ("fever")

When the org detects elevated threat — an active phishing campaign, an upstream supply-chain compromise (e.g. a future xz-utils-equivalent in an MCP server), credential-leak indicators, or simply quarter-end change-freeze windows — agent capabilities globally degrade:

Triggers may be manual (security team raises threat level) or automated (threat-intel feed signals, internal incident indicators). Auto-reset after the trigger clears.

Design implication. Build the dimmer switch from day one. Customers will need to dial up defences during incidents; if every change requires a re-deploy, the dial doesn't exist when it matters.

Note on v0.1. This was "Fever" in the immune-system metaphor. The control posture is unchanged; the name and rhetoric are demoted to Appendix A.

5.3 Primitive prioritisation calculus (NEW in v0.2.1)

The v0.2 critique correctly identified that v0.2 enforces equal narrative weight across all ten primitives — the framework cannot tell a customer which 2–3 primitives buy 80% of value for their specific threat class. This was a downstream symptom of v0.2's monolithic Primitive I (the regulatory-tables-must-cover-every-primitive structural pressure). With I split into I_arch + I_commercial in v0.2.1, prioritisation becomes legitimate.

The calculus. For a given deployment, the prioritisation tier of each primitive is computed from six inputs:

Worked examples.

Deployment shape Top 3 primitives Why
Developer-side, internal IP only, no production write-access, opportunistic-external adversary B, F (F0-F1), E Instruction provenance is the dominant attack surface (untrusted dependency content). Lightweight observability and ephemeral runtime do the rest. A, C, D, G, H all matter less.
Production-runtime, customer-facing, money-moving, criminal + opportunistic adversaries, APRA-regulated D, G, F (F3-F4 for money-moving actions) Side-effect deferral is the dominant control (stops the loss). Detection-response handles in-flight. High-tier observability provides forensic and regulatory evidence. A is required-but-table-stakes.
Back-office RPA, internal financial actions, malicious insider adversary, APRA-regulated A (originating-human attribution), D (per-day aggregates), I_arch (control-effectiveness signal) Insider amplification by agents is the dominant threat. Authority chain integrity + side-effect aggregation + measurable control effectiveness are top three.
AUKUS / IRAP-PROTECTED workload, nation-state adversary E (Firecracker self-host), A, J (declared joint capabilities) Containment cannot be outsourced; identity must be cryptographically attestable; multi-agent compositions need explicit declared scopes.
Multi-agent system, mixed authority J overlay, B (taint propagation), G (cross-agent detection) Composition is the failure mode; instruction provenance must traverse agent boundaries; detection must operate at protocol layer.

The calculus is not a closed formula — it requires judgement informed by the deployment's specifics. But the framework now enables the conversation ("for your shape of deployment, here are the 2–3 primitives that matter most") rather than blocking it.

Falsification condition. A customer who has run the prioritisation calculus for their deployment can name their top three primitives and defend why. If a customer cannot, the calculus has not been operationally instantiated for them.


5.2 False-positive economics

Detection systems that fire too often get disabled in production. The classic security problem of autoimmunity — defences attacking the host — applies acutely to agent systems because agents operate at machine speed: a false positive doesn't just generate an alert, it can halt a business process or trigger an erroneous kill-switch on a critical workflow.

False-positive economics is a property of every detector, every policy, every workflow — not a separate primitive. v0.1 made this a "Tregs" layer (Layer 9); the critique correctly identified this as metaphor-capture (false-positive suppression is intrinsically cross-cutting).

Design rule. Every control in Primitives G (detection / response) and H (learning) must be designed with explicit precision targets, analyst-disposition feedback loops, override paths, and false-positive cost accounting from inception. Treat false-positive rate as a first-class KPI, not as downstream tuning.

The agent-specific calibration (the v0.2 sharpening). Biology's regulatory T-cells are tuned by evolutionary cost of autoimmunity vs cost of infection. The body accepts a high pathogen miss rate because autoimmune disease is catastrophic. Enterprise agent security has the opposite cost ratio for many threat classes: missed prompt injection that exfiltrates customer PII is a catastrophic, regulator-notifiable event; "autoimmune" false positives are a productivity tax measured in dollars per developer per week. For catastrophic-tail-risk problems, the v0.1 framing ("a sensitive detector without engineered suppression is shipping autoimmune disease") is backwards — it optimises away from sensitivity, which is the wrong direction. The v0.2 correction: calibrate false-positive tolerance to the cost ratio for the specific threat class. High-cost catastrophic classes accept more false positives; low-cost noise classes accept fewer.


6. Surface-specific instantiations

The spine is surface-agnostic. The implementations differ. Three primary surfaces matter for an enterprise platform.

6.1 Developer-side agents (Claude Code, Cursor, Copilot, custom IDE agents)

Threat shape: developer credentials; dev-machine attack surface; source-code IP egress; malicious skills / plugins / MCP servers; supply-chain injection via dependencies.

Highest-value primitives:

Primitive Specific to dev-side
A Authority Distinct dev-agent identity per engineer; tiered "what can the agent touch": this repo / this org's repos / cloud sandbox / staging / prod. JIT elevation with human gate.
B Instruction provenance Tool-result tagging for untrusted (READMEs, web pages, MCP responses, npm package metadata). The dominant injection vector is here.
C Information flow IP-class-aware egress proxy on the dev machine; source code cannot reach unapproved LLM endpoints; secrets-class scanning on every outbound prompt.
D Side-effect taxonomy Staged commits not direct pushes; PR review before merge; terraform plan gates before apply.
E Runtime Container or VM per session; reset on session end; dev machine itself is not the runtime.
F Observability Every tool call streamed to a central log with replayable transcripts; engineer + agent identity both recorded; content minimisation per data class.

Distinctive concerns. Shadow-AI (devs install personal API keys outside corporate egress). Plugin sprawl (the marketplace is per-developer and ungoverned). Dev-machine secret theft (long-lived gh tokens with admin:org).

6.2 Production runtime agents (customer-facing features)

Threat shape: end-user prompt injection (every user input is potentially adversarial); multi-tenant isolation; runtime cost containment; response-quality regression; regulatory exposure on every output.

Highest-value primitives:

Primitive Specific to production runtime
B Instruction provenance Aggressive untrusted-input tagging; prompt-injection detection on every user message AND every extracted-content channel (PDF text layers, image OCR, audio transcripts).
C Information flow Strict tenancy: this agent cannot read another tenant's data, full stop. Hard-coded, not policy-enforced.
D Side-effect taxonomy Per-session aggregate budgets (not per-message); transaction-level dual-control; staged externalisation for any customer-facing irreversible action.
E Runtime Per-request session boundaries; no persistent agent state across users. AWS AgentCore-with-AU-residency contractual for regulated workloads.
F Observability Per-tenant event streams; minimisation per data class (PII / financial / health); replayable for incident investigation.
H Learning Adversarial prompts seen against your customers, fed back into detectors. Single-tenant pipelines are deployable now; cross-tenant privacy-preserving aggregation remains a § 9 open problem (research-grade for adversarial NLP content), not a Phase 0/1 control.

Distinctive concerns. Regulatory output exposure (agent produces legal / medical / financial advice it shouldn't). Cost amplification under attack (a single user can drive token costs into the thousands). Prompt-injection from user-uploaded content.

6.3 Internal-process agents (back-office automation, RPA-style)

Threat shape: long-running agents with broad creds; often touching payments / payroll / customer records; often integrated with legacy systems that lack modern auth; often deployed in shadow-IT fashion by business units.

Highest-value primitives:

Primitive Specific to back-office
A Authority Service identities for each automation; no shared "automation user" accounts; originating-human-of-record captured for every agent decision.
D Side-effect taxonomy Workflow-scoped permissions — this agent can issue refunds up to $500 per transaction, $5,000 per day, $20,000 per week, with declining authority into higher-risk classes. Dual-control above thresholds.
G Detection-response-recovery Four-eyes / dual-control gates for any high-value action; per-originator pattern analysis (the "47 refunds from one employee" scenario); out-of-pattern recoveries.
F Observability Full transaction trails attributable to the agent AND the originating human; integration with existing AML / audit pipelines.
I Governance Tight mapping to existing change-management; agent-initiated changes treated as requiring the same approvals as human-initiated; CPS 230 critical-operations review for any agent touching money or PII.

Distinctive concerns. Legacy integration security (the agent talks to a 1998 mainframe via screen-scraping; what does identity attestation even mean there?). Regulatory reporting (APRA, AUSTRAC for AML). Insider-threat-as-business-process (the agent is doing exactly what it was configured to do, and the configuration was wrong).


7. Regulatory anchors by jurisdiction

The framework is jurisdiction-agnostic; the regulatory wrapper differs by market. The full per-jurisdiction mapping table is maintained in 03-maturity-model.md § 3. Summary table:

Jurisdiction Anchor regime(s) Primary primitives engaged
Australia APRA CPS 230, CPS 234, APRA Letter on AI (Apr 2026), Privacy Act, IRAP / ISM, DSPF, ASIC, DTA AI policy, OAIC APP A, F, G, I primarily; all primitives applicable to in-scope workloads
New Zealand RBNZ BS11, NZ Privacy Act Same as AU
Singapore MAS Notice 644, MAS FEAT + Veritas, IMDA AI Verify, PDPA A, C, F, G, I
Hong Kong HKMA SPM AI / GenAI guidance, SFC, PCPD A, F, I
United Kingdom PRA SS1/23, FCA AI Live Testing, UK AI Bill (draft 2026), NCSC AI guidance, UK GDPR, FSMA A, D, F, I
European Union EU AI Act (high-risk obligations: Annex III deferred to 2 Dec 2027), DORA (Jan 2025), GDPR, NIS2 C (information flow critical for GDPR), D, F, I
Canada OSFI E-23, AIDA (paused), PIPEDA A, F, I
United States — federal NIST AI RMF + AI 100-2, OCC / Fed / FDIC SR 11-7 (extended to GenAI), SEC cyber, CISA A, F, G, I
United States — state California AI Transparency Act, Colorado AI Act (Feb 2026), NYC Local Law 144, Texas TRAIGA, NYDFS Part 500 C, I
Japan FSA Discussion Paper on AI in Finance, METI AI Governance, APPI F, I
MENA UAE Cybersecurity Council, ADGM / DIFC, Saudi NCA + SAMA E (sovereign residency critical), F, I
AUKUS / Five Eyes AUKUS Pillar 2 (AI + cyber), ASD ISM, UK NCSC, US CISA AI, DoD CDAO RAI All primitives, with elevated E (Firecracker self-host per Doc 07 v0.2 § 10) and J (composition for multi-agency workflows)

Practical implication. Per-regime compliance-evidence collection is the load-bearing product feature (the I_commercial adapter layer that sits outside the architectural spine, consuming control-effectiveness signals from I_arch) — the engine that converts the framework's primitive-level evidence into regime-specific control attestations. See 07-build-vs-buy.md for the build-vs-buy decisions and 03-maturity-model.md for the full per-regime control mappings.


8. Measurement contracts (conformance axis), consolidated (revised in v0.2.1; language normalised in v0.2.2)

The v0.2 critique correctly identified that v0.2's "falsification conditions" instrumented only the conformance axis of a six-axis failure taxonomy (conformance / effectiveness / cost / theory / scope / obsolescence) and lacked operational scaffolding (measurement protocols, owners, thresholds, abandonment actions). v0.2.1 promoted the conformance-axis falsification conditions to full measurement contracts with named operational scaffolding. v0.2.2 normalises the language: the framework is conformance-falsifiable as of v0.2.2 (an implementation can be shown to fail the measurement contracts); it is not yet theory-falsifiable on the effectiveness / cost / scope / obsolescence axes. The remaining five axes are explicitly deferred to v0.3 and listed in § 9 (open problems). Wherever this document uses "falsifiable" or "falsification condition" without further qualification, the conformance axis is meant.

8.1 The measurement contract template (NEW in v0.2.1)

Each primitive's measurement contract has nine fields:

Field What it captures
Metric The specific quantity measured.
Numerator / Denominator The exact definition (what counts; what the base is).
Sampling All events / probabilistic sample / event-class-conditional.
Owner The named human or team accountable for the metric's value.
Cadence How often the metric is computed and reviewed.
Severity bands (per Primitive D class) Per-event invariants for high-risk D classes; fleet SLOs for low-risk D classes. Not a single percentage.
Exception process What can be granted (and by whom) when a band is breached temporarily.
Failure action What the deployer must do when an exception cannot be granted or expires.
Theory-revision trigger What evidence would cause the design rule itself to be questioned (not just the implementation). This is the link to the deferred theory-axis falsification.

8.2 Measurement contracts per primitive

Note: percentages below are defaults — operators should adjust per Primitive D severity-band rules. The 4.9% un-attributed actions Gemini's debate flagged is correctly read as "the deployer accepts a blind spot in that band, with explicit risk acceptance" rather than "5% is fine universally."

Primitive Metric Severity band example (high-risk D class) Severity band example (low-risk D class) Owner Cadence Theory-revision trigger
A Authority % of agent actions with agent_principal_id distinct from human_initiator_id 100% for money-moving / production-write / regulator-notifiable. Fail closed on first breach. ≥99% over 30 days for read-only / internal Platform IAM owner Monthly review If A enforcement breaks because credentials cannot be cryptographically distinguished from human credentials at the dependent service, A's design rule needs re-derivation
B Instruction provenance Red-team policy-override success rate against tool-result-embedded payloads <1% for any payload that would authorise an irreversible action <10% for payloads that influence advisory-only outputs Detection engineering Quarterly red-team + continuous CI corpus If novel injection classes consistently >50% pass-rate, B's trust-hierarchy model is incomplete
C Information flow Count of data-class-to-unapproved-sink flows without logged policy decision, per 30 days 0 for regulated / financial / health data classes <5 per quarter for internal-only data Data governance Monthly If approved sinks change faster than the classification taxonomy, C needs richer flow semantics
D Side-effect taxonomy Tabletop exercise outcome: simulated compromised agent issuing irreversible action Action externalises faster than response time = fail closed (Not applicable — D's invariant is binary for high-risk classes) Incident response Semi-annual tabletop If new agent capabilities don't fit the existing taxonomic axes (reversibility / externality / regulator-notifiability), the axes are incomplete
E Runtime Force-kill test: persistent state after termination 0 persistent files / network sessions / credentials / in-flight effects for any session (Not applicable — E's invariant is binary) Platform Monthly chaos test If forensic capture proves insufficient for incident reconstruction at F4, E's "session ends cleanly" model has a gap
F Observability (a) Session reconstruction time at appropriate tier; (b) Tier-emission compliance; (c) Access-trail completeness for F3/F4 (a) <60 sec; (b) 100% events emitted at ≤ their D-class authorised tier; (c) 100% F3+ accesses have F2 audit trail Same; the controls apply universally F substrate owner + SOC for access controls Monthly If detection demonstrably requires F4 content the D classification forbids, the substrate-tier model has a contradiction
G Detection-response-recovery End-to-end recovery completion time for confirmed high-confidence anomaly <5 min from detection to credential revocation; <30 min to side-effect rollback completion <1 hour from detection to acknowledgment SOC Per-incident + monthly aggregate If high-confidence anomalies routinely have no actionable recovery path, G's response-recovery model is incomplete
H Learning Median time from incident / red-team finding to deployed signature <14 days for any class <60 days Threat intel Monthly + per-incident If signatures consistently produce >50% false-positive rate on first deploy, H's signature-generation pipeline has a structural bug
I_arch Assurance % of primitives A–H + J with measurable, board-reportable control-effectiveness signal 100% (Not applicable) CISO office Quarterly board report If a primitive cannot produce a measurable signal even in principle, that primitive may not belong in the spine
J Composition overlay (a) % of multi-agent compositions with declared joint capability set; (b) % with current adversarial test suite passing; (c) % with signed residual-risk acceptance 100% for each of (a), (b), (c) on any composition that touches regulated data or makes irreversible actions (a) 100%, (b) 100%, (c) 100% but allow longer cadence for low-risk compositions Architecture review Per-composition + quarterly review If unauthorised paths regularly emerge outside the represented adversarial suite, the bounded-overlay model is fundamentally insufficient and J needs re-derivation
Cross-cutting Fever Time to propagate elevated-threat mode to all active sessions <5 min from operator command to fleet-wide enforcement Same; the control is universal Security operations Quarterly chaos test If elevation is regularly required but customer push-back exceeds tolerable threshold, the elevation model may be too coarse
§ 5.2 FP economics (a) Per-detector false-positive rate measured; (b) autoimmune-incident count tracked (a) measured for every detector; (b) >0 incidents tracked is acceptable IF correctly remediated Same; cross-cutting property Detection engineering Monthly If FP rate cannot be measured for a class of detectors (e.g. ML behavioural baselines without ground truth), the measurement model is incomplete

8.3 What § 8 does not yet do (deferred to v0.3 — listed honestly)

Per the v0.2 critique (ChatGPT's six-axis failure taxonomy), § 8 currently instruments only the conformance axis (does the implementation conform to the design rule?). The other five axes are deferred:

These axes are real. v0.2.1 does not instrument them; v0.3 must. The "theory-revision trigger" field in the measurement-contract template is a placeholder — it names what evidence would cause us to question the design rule, but it does not specify the abandonment protocol that would act on the evidence. That protocol is what Kimi's debate called the Constitution for Framework Revision and is the primary v0.3 deliverable (see § 11.6 for the editorial response).

A deployment that fails three or more of the conformance-axis measurement contracts is materially exposed; one that fails six or more should not be in production for regulated workloads. That criterion is itself conformance-axis only; a deployment that conforms to every contract but is being attacked successfully because the design rules are wrong is a v0.3 problem.


9. Open problems and acknowledged limits

The framework does not solve these. They are located architecturally — they live inside named primitives — but are unsolved as of May 2026. (This is the v0.2 reframing of the v0.1 § 9 that the critique correctly flagged as defensive.)

Open problem Primitive it lives in Why unsolved
Inherent prompt-injection vulnerability of LLMs B (instruction provenance) Open research problem; LLMs cannot reliably distinguish data from instruction. Mitigation only, no elimination.
Cross-vendor identity attestation A (authority) No standard for cryptographic agent identity that traverses agent → MCP server → downstream service. Industry collaboration needed.
Multi-tenant privacy-preserving threat-intel aggregation on adversarial NLP content H (learning) Moved from § 3.8 controls to here in v0.2.1. Federated learning and differential privacy on natural-language adversarial content are research-grade; deployable single-tenant pipelines exist but cross-tenant aggregation is not yet a deployable control. The v0.2 critique correctly identified that v0.2's framing listed this as a control when it should have been an open problem.
Self-modifying agents (agents that update their own skills / hooks / config) A + B + I_arch Breaks the stable-principal assumption every primitive rests on. v0.2.1 prohibits self-modification without out-of-band approval; longer-term solutions require trusted-execution-environment-style integrity guarantees.
Provider-side model drift H + I_arch Vendor model updates change agent behaviour silently; no "agent regression test" standard.
General intractability of multi-agent verification J (composition overlay) Added in v0.2.1. Proving the absence of unauthorised paths in stochastic multi-agent systems is exponentially expensive even to test partially. v0.2.1 bounds the problem (declared joint capabilities + representative adversarial testing + residual-risk acceptance), but the general problem of "no unauthorised path exists" remains research-grade and is not solved by the framework.
Framework's own theory-of-failure axis (Constitution for Framework Revision) All primitives — meta-level Added in v0.2.1. The framework's § 8 measurement contracts instrument the conformance axis only. The other five failure axes (effectiveness / cost / theory / scope / obsolescence — per ChatGPT's v0.2 critique taxonomy) require a separate framework-revision protocol stating when a design rule must be abandoned, decomposed, merged, or replaced. v0.2.1 acknowledges this gap; v0.3's primary deliverable will be the Constitution.
Named structural coverage gap: low-D-class reconnaissance detection F + G + H Added in v0.2.1; framing sharpened in v0.2.2. This is not a research-grade open problem — it is a structural coverage gap the framework now names explicitly: some detection signatures (particularly behavioural baselines for prompt-injection reconnaissance) require F2+ semantic content to compute, while low-D-class read-only actions are authorised only at F0–F1. The intersection is a blind spot for early-stage reconnaissance attacks against low-risk actions before they escalate to high-risk D classes. Deployers must either (a) accept the blind spot as documented residual risk, (b) selectively elevate F-tier for action classes where reconnaissance is materially probable (with explicit policy authority), or (c) supplement framework controls with surface-specific reconnaissance detection outside the F-tiered substrate. v0.3 will provide a deployment-time tooling pattern for (b); v0.2.2 acknowledges this as a named, owned gap rather than deferred research.

These problems are real, named, and unsolved. They are not "framework gaps" in the sense of "the framework should have answered them and didn't"; they are research and standards gaps that the framework explicitly flags as constraints on what any AI agent security product can guarantee. A defensible product is honest about the constraints; a marketing framework hides them. v0.2.1 takes the first path. The v0.2 critique correctly noted that v0.1's § 9 was defensively positioned ("research gaps external to the framework"); v0.2 relocated open problems inside named primitives; v0.2.1 expands the list with what the v0.2 critique surfaced and acknowledges them honestly.


10. Design principles distilled

  1. Authority is attributable, scoped, and revocable. Distinct agent principals; no human impersonation.
  2. Untrusted content cannot override system policy. Instruction provenance is enforced; tool-result content is data, not command.
  3. Data classes constrain sources, processors, sinks, and retention. Information-flow enforcement is hard-coded, not policy-soft.
  4. Side effects are taxonomically classified. Reversibility, externality, regulator-notifiability determine the controls that apply.
  5. For irreversible actions, deferral beats termination. Stage, simulate, queue, dual-control, time-delay. Termination is cleanup, not response.
  6. Ephemeral runtimes bound blast radius. Compromised principals cannot persist.
  7. Observability is the substrate; minimisation is the substrate of safe observability. Policy decisions logged; raw content minimised per class.
  8. Detection produces risk scores; response is graduated; recovery includes rollback. Termination without rollback is incomplete.
  9. Memory compounds across incidents; the compounding rate is the measure. Single-tenant intel is a feature; multi-tenant aggregated intel is the moat.
  10. Compliance evidence collection is a feature, not an afterthought. Per-regime evidence pipelines from inception.
  11. Composition preserves least privilege. Multi-agent systems require primitive enforcement at the protocol boundary.
  12. False-positive tolerance is calibrated to threat-class cost ratio. Catastrophic-tail-risk classes accept more false positives than productivity-tax classes.
  13. Every design rule has a conformance-falsification condition. If the framework cannot be shown to fail in any operationally measurable way, it cannot be shown to succeed. Theory-falsification on the other five failure axes (effectiveness / cost / scope / obsolescence + the meta-level abandonment protocol) is deferred to v0.3.

11. Editorial response to v0.1 critique

This section records how v0.2 responded to each finding from the TriLLM adversarial debate (see 01-debate-synthesis.md).

11.1 Accepted findings (applied)

v0.1 critique v0.2 response Where
Biological metaphor functions as architectural generator Demoted to Appendix A reading guide; ImmunoSec name dropped; framework renamed "Agent Security in Depth" Title, Appendix A
Nine layers conflate orthogonal concerns Replaced with ten orthogonal primitives, each engineerable, each falsifiable § 3
Observability mis-classified at Layer 6 Promoted to substrate role (Primitive F is dependency-substrate for G, H, I); ordering clarified § 3, § 3.6
Mandatory total observability is privacy-hostile Design rule rewritten: policy decisions logged, raw content minimised per data class; tamper-evidence, tenant-aware retention, privacy-preserving aggregation explicit § 3.6
"Kill the session" wrong for high-authority agents (Gemini's apoptosis critique) Side-effect taxonomy (Primitive D) added as a first-class primitive; design rule: deferral is headline, termination is cleanup § 3.4, § 3.7
Granuloma / "sustainable détente" language Removed entirely (deletion)
Missing native primitives (ChatGPT's CI-workflow example) Added: A authority + B instruction provenance + C information flow + D side-effect taxonomy. The CI-workflow example is now structurally addressable (Primitive B). § 3.1, 3.2, 3.3, 3.4
Unfalsifiability (Claude R1) Each primitive ships with an explicit falsification condition; consolidated in § 8 § 3.x, § 8
Weak threat-actor model Added § 4 adversary capability model with six archetypes mapped to primitives § 4
Missing composition layer Added Primitive J composition with explicit privilege-preservation invariant § 3.10
Tregs as separate layer is metaphor-capture Demoted to "false-positive economics" cross-cutting concern (§ 5.2); also corrected the direction of the design rule for catastrophic-tail-risk classes § 5.2
§ 9 "open problems" reframing Open problems located inside named primitives; honestly flagged as research and standards gaps without claiming the framework solves them § 9

11.2 Qualified findings (partial agreement)

v0.1 critique v0.2 qualification
Discard the biology entirely (Gemini's hardest line) Partially accepted. Discarded as architecture; preserved as Appendix A reading guide. The metaphor is genuinely good for memory and customer narrative; it is bad for design generation. v0.2 separates the two uses.
Build a 9-or-10-primitive replacement (ChatGPT) Accepted in substance with irony noted. v0.2 has 10 primitives. The number isn't the issue; the derivation is. v0.2's ten are derived from agentic risk (authority, information flow, side effects, composition), not from a 600-million-year-old immune system.
Conflated layers prevent engineering allocation (Kimi) Fully accepted. Each new primitive is engineerable as a discrete domain — A is IAM territory, C is data governance, D is transaction architecture, etc. Layer 1's "four control families in a trenchcoat" is now four (or more) distinct primitives.

11.3 Pushed-back findings (rejected with reasoning)

v0.1 critique v0.2 reason for pushing back
The framework is unsalvageable, discard everything (Gemini's strongest reading) Pushed back. The framework's product instincts — attributable agent identity, ephemeral runtimes, structured-but-minimised telemetry, false-positive budgets, threat-elevation modes, continuous red-teaming, regulator-mapped evidence — are correct and survive the rewrite intact. The scaffolding needed replacing; the controls did not.
The number nine is the problem Pushed back as a primary critique. Number was never the issue; derivation was. v0.2 has ten primitives derived from agentic risk. That's defensible regardless of the count.
Framework should not be regulated-enterprise-specific Pushed back. v0.2 narrows the claim explicitly (§ 1): the target market is regulated enterprise, particularly APRA-regulated AU industry and the Commonwealth + AUKUS ladder. Surface-agnostic generalisation is rejected. Most primitives apply to other contexts but the framework is opinionated about its anchor market.

11.4 What survives unchanged

The framework's core product instincts and target market:

The wedge product strategy in 07-build-vs-buy.md v0.2 maps cleanly onto the new spine: MCP supply-chain governance is the Primitive B + C + H wedge; replay-grade audit is Primitive F + I; sovereign residency is Primitive E + I (the regulator-mapping module engine).

11.5 What this transition tells us about the framework

Two debates in (Doc 07 v0.1 and Doc 01 v0.1), the pattern is clear: adversarial multi-model critique reliably surfaces structural issues a single author won't see. Doc 07's debate produced metaphor-captured prioritization (Claude R2's orthogonal critique). Doc 01's debate produced biological-priors-produce-wrong-design-rules (Gemini's apoptosis) and the framework-has-no-native-security-model (ChatGPT's spine attack). Both findings are load-bearing for the rewrite.

Standing practice for the framework: any load-bearing document gets a TriLLM debate before being considered v0.x stable. The cost is roughly 5 minutes of TriLLM time plus 30 minutes of synthesis. The value is the framework's resistance to silent structural error.

11.6 v0.2.1 editorial response — applying the v0.2 critique

The v0.2 → v0.2.1 transition responds to a second TriLLM debate on Doc 01 v0.2 (see 01-debate-v0.2-synthesis.md; full transcript at ../debates/01-immunosec-framework-v0.2-critique.md). The pattern: v0.1 → v0.2 closed the metaphor-capture / missing-primitives / wrong-design-rules debt; v0.2 → v0.2.1 closes the substrate-propagation / GTM-laundering / J-tractability / measurement-discipline debt. Each iteration narrows structural debt without changing strategy.

Accepted findings (applied in v0.2.1)

v0.2 critique v0.2.1 response Where
§8 falsification conditions test conformance, not theory Measurement contracts added with named owners + cadences + severity bands + theory-revision triggers; full six-axis instrumentation deferred to v0.3 § 8
Orthogonality asserted but false Typed spine added; explicit dependency graph; "native control families" replaces "orthogonal primitives" framing § 3.0
Primitive I is GTM laundered into architecture (Claude's central finding) I split into I_arch (architectural, stays in spine) + I_commercial (regulatory adapters, moves out of spine to Doc 03 / Doc 07) § 3.9
Primitive J's "no path exists" falsification condition is computationally intractable J recast as bounded capability-graph assurance overlay; falsification = declared joint capabilities + adversarial test suite + residual-risk acceptance § 3.10
F's minimisation correction silently broke G, H, J (Claude's most original v0.2-specific finding) F0–F4 reconstruction tiers introduced, mapped to Primitive D side-effect classes; G's and H's dependencies on F-tier explicitly documented § 3.6, 3.7, 3.8
No prioritisation across primitives § 5.3 prioritisation calculus added with worked examples § 5.3
73.2% → 8.7% empirical anchor uncited Acknowledged as uncited; downgraded from "load-bearing claim" to "illustrative reference" pending citation discovery § 2
H's cross-tenant aggregation listed as a deployable control when it is research-grade Moved from § 3.8 controls to § 9 open problems § 9
Equal narrative weight across all primitives With I split and prioritisation calculus added, primitives no longer carry uniform narrative weight; the framework can honestly say "for your shape of deployment, prioritise these" § 5.3, § 3.9

Qualified findings (partial response in v0.2.1; full response deferred)

v0.2 critique v0.2.1 partial response v0.3 deferred work
Framework has no theory of failure (Kimi — meta-epistemic) Measurement contracts add theory-revision triggers as a field. § 9 adds the "Constitution for Framework Revision" as an explicit open problem. The Constitution itself — sacrifice / decomposition / merger / fallback / count-bound protocols. v0.3's primary deliverable.
Six-axis failure taxonomy (conformance / effectiveness / cost / theory / scope / obsolescence) § 8.3 names all six axes; v0.2.1 instruments conformance + acknowledges the others are gaps. Instrument the remaining five axes.
"Primitive" type-overload (ChatGPT) § 3.0 typing introduces five categories (native risk / control plane / evidence substrate / assurance plane / overlay). Further refinement; possibly different category labels or finer sub-types per v0.3 critique.
Document split (Risk Model / Architecture / Measurement Contract / Regulatory Adapters / Commercial Narrative) I_commercial moved out of Doc 01; rest deferred. Possibly split Doc 01 v0.3 into co-versioned multiple documents.

Pushed-back findings (rejected with reasoning)

v0.2 critique v0.2.1 reason for pushing back
Gemini's binary-threshold absolutism ("one unattributed action fails closed; arbitrary percentages are confessions of architectural failure") Operationally undeployable. Severity- and class-conditional thresholds (binary for high-risk D classes; SLO-based for low-risk) are the correct shape. ChatGPT's and Claude's positions adopted instead. § 8 severity bands reflect this.
Kimi's full-rewrite recommendation ("v0.2.1 is dashboard improvement; v0.3 must include the Constitution") Partially accepted. v0.2.1 is a substantial patch tier — typed spine + F tiers + I split + J recast + measurement contracts + prioritisation calculus are not "dashboard improvements," they are structural fixes. The Constitution itself is deferred to v0.3 because it requires a separate design-and-review cycle.
Implicit suggestion that v0.2 was "not worth pursuing" v0.2 was a substantial improvement over v0.1. v0.2.1 is a further substantial improvement. The framework is converging, not diverging. The natural pattern is v0.1 → v0.2 → v0.2.1 → v0.3 (incremental + structural release); diminishing returns set in probably around v0.3 → v0.4. We are not there yet.

What v0.2.1 demonstrates about the framework's posture

Three TriLLM debate cycles in, the framework has accepted critique in three structural rounds. The strategy (target market, wedge product, international ladder, regulatory tailwind, commercial path) has remained stable throughout. The architecture has changed three times in ways that make it materially more defensible:

This is the right pattern for a load-bearing framework. The cost of each cycle is ~1 day of work; the benefit is each cycle catches structural errors that would have been silent. The pattern terminates when critique cycles produce diminishing returns — not before.

11.7 v0.2.2 editorial response — applying the v0.2.1 review

The v0.2.1 → v0.2.2 transition responds to the third TriLLM cycle, which was reframed from adversarial critique to balanced assessment (01-review-v0.2.1.md; full transcript at ../debates/01-immunosec-framework-v0.2.1-review.md). The review's verdict was unanimous: lock v0.2.1 after a one-day errata pass; proceed immediately to Doc 06 critique + Doc 08 MVP spec + Doc 09 consulting template; run v0.3 as a parallel non-blocking track.

v0.2.2 is the editorial errata pass applied. It is the locked architectural baseline.

Errata items applied (eight, all editorial)

# Item Where in v0.2.2
1 J's type ambiguity clarified (composition is the native risk; J is the architectural overlay handling it) § 3.0 native-risk row + composition-overlay row + clarification note
2 F's "highest tier permitted" wording corrected to "minimum tier sufficient for the control objective, capped by maximum permitted by Primitive D" § 3.6 The default-by-class paragraph
3 D's magnitude/aggregation handling made explicit via E cross-reference § 3.4 Magnitude and aggregation handling paragraph (new)
4 § 7 residual "Primitive I" wording swept and updated to I_arch / I_commercial § 7 Practical implication paragraph
5 § 6.2 cross-tenant aggregation residue clarified — research-grade per § 9, not Phase 0/1 control § 6.2 Production-runtime H Learning row
6 73.2 → 8.7 empirical anchor cleaned — removed specific number from load-bearing claim; defence-in-depth thesis now rests on three independent evidence classes § 2 Defence-in-depth is empirically supported paragraph
7 "Conformance-falsifiable" language normalised — distinguishes from theory-falsifiable (deferred to v0.3) § 8 header + § 10 Design Principle #13
8 F/G/H reconnaissance blind-spot framing sharpened — from "research-grade open problem" to "named structural coverage gap for low-D-class reconnaissance" § 9 final row

What v0.2.2 does NOT change

What v0.3 still contains (deferred per v0.2.1 review unanimous verdict)

  1. Constitution for Framework Revision (Kimi's persistent contribution across cycles). The meta-level abandonment protocol: sacrifice / decomposition / merger / fallback / count-bound conditions. v0.3's primary deliverable.
  2. Full six-axis falsification instrumentation (ChatGPT's persistent contribution). v0.2.2 instruments only the conformance axis (with full measurement contracts per § 8). v0.3 will instrument effectiveness / cost / theory / scope / obsolescence per primitive.
  3. Possible document split (multiple reviewers). Risk Model / Architecture / Measurement Contract / Regulatory Adapters / Commercial Narrative. May emerge naturally from Doc 08/09 work; not required for v0.3 success.

All three are parallelisable with Doc 08/09 implementation and may be informed by implementation feedback. The framework is locked; v0.3 is concurrent extension, not corrective rewrite.

Closing the v0.1 → v0.2 → v0.2.1 → v0.2.2 → v0.3 arc

Four versions in (counting v0.2.2 as a locked editorial pass on v0.2.1), the framework has reached a state where:

This is the right state for a load-bearing framework to be in before downstream artefacts (Doc 08 MVP spec, Doc 09 consulting template) are built on top of it. The framework has earned the lock.


Appendix A — Immune-system reading guide (the metaphor, demoted)

The framework was originally derived by analogy to the human immune system. v0.2 has rebuilt the architecture around native primitives, but the biological intuition remains useful as a mnemonic for non-technical stakeholders and as vocabulary for customer-facing communication. This appendix records the mapping for that purpose. It is a reading guide, not the architecture.

Biological mechanism v0.2 primitive(s) Intuition preserved
Skin / mucosa A (Authority) + C (Information flow) — the perimeter primitives "Block what can be blocked without thinking"
Microbiome (no direct mapping; was metaphor only) "Make the safe path the easy path" — a design principle (#5), not a primitive
Innate pattern recognition (macrophages, PAMPs) G (Detection — signature sub-component) "Fast, dumb, generic detection of known-bad"
Inflammation G (Response — graduated state) "Cheap response to suspicion buys time for expensive response"
Adaptive immunity (B cells, T cells) G (Detection — behavioural + Response — termination) "Specific, learned, surgical"
MHC presentation F (Observability) "Cells that can't be inspected get killed" — but the v0.1 design rule was wrong: real MHC is local and ephemeral, not centrally aggregated. Use the intuition without the architectural mistake.
Apoptosis / granuloma E (Runtime) + G (Response — termination) "Designed to die" — but termination is a cleanup primitive, not the headline response (the v0.1 error). See § 3.4 and § 3.7.
Memory cells / vaccination H (Learning) "Past exposure compounds future defence"
Regulatory T-cells (Tregs) § 5.2 (False-positive economics) "Prevent autoimmunity" — but calibrate the tolerance to the threat-class cost ratio. See § 5.2.
Fever § 5.1 (Global threat-elevation) "Whole-system state shift on suspicion"
Self-tolerance § 5.2 again "Don't attack the host"

Use this appendix for customer narrative, marketing copy, and stakeholder communication. Use the body of the document for architecture, engineering, and compliance evidence. The two are designed to be consistent but the architecture is the source of truth.

The v0.1 name ImmunoSec is retired. The framework is Agent Security in Depth.


Next: see 02-controls-landscape.md for a per-primitive inventory of what exists today and where the genuine wedge gaps sit. See 03-maturity-model.md for a customer-facing maturity scorecard. See 07-build-vs-buy.md v0.2 for the build-vs-buy investment plan.