Semantic Auditing in Practice

Signed, regulator-mapped evidence for AI agents — and what we found when we pointed the toolchain at ourselves.

Companion paper. This is the governance-and-compliance companion to the Agent Security in Depth framework — the evidence layer (Primitive F) and assurance plane (I_arch / I_commercial) of that framework, built, run on real agent sessions, and described here as working practice rather than architecture. June 2026.

1. The auditor's question

Every serious AI-agent deployment now has observability. Spans, traces, token counts, latency, cost-per-call — the engineering telemetry is a solved, commoditised problem, and a crowded one: the observability vendors converge on the same span-capture primitive.

But observability answers an engineer's question: what happened, how fast, at what cost?

It does not answer an auditor's question: was what the agent did justified, in-policy, and properly overseen — and can you prove it to someone who doesn't trust you?

That second question is its own discipline. It has a buyer (compliance, risk, internal audit, the regulator), an output (evidence, not dashboards), and now a live regulatory mandate for a large class of deployments. What it does not yet have is a name. We propose one.

Semantic auditing (n.): evaluating recorded AI-agent activity — transcripts, traces, tool calls, decisions — against governance policy, acceptance criteria, and oversight requirements, and producing evidence-grade findings about whether controls functioned as designed.

The one-liner that separates it from everything adjacent:

Observability tells you the agent called the database. Semantic auditing tells you whether it should have — and proves it.

2. Three categories, three buyers — only one signs the attestation

Category	Question answered	Output	Buyer
Observability	What happened, how fast, at what cost?	Traces, dashboards	Engineering
Agent security	Was the action permitted and safe?	Blocks, alerts, posture	Security
Semantic auditing	Was the action justified, in-policy, overseen — provably?	Evidence packs, audit findings, attestations	Compliance, risk, audit, the regulator

The industry already named the gap without naming the category: standard logging captures the what; the why and how are lost in the black box of model inference. Spans record events; auditors sign off on judgments. Nobody currently sells the judgment.

One adjacency worth naming: the word semantic is starting to appear in the agent-control aisle, where written policy is converted into machine-enforceable logic for real-time agent control. That is the middle row of the table — enforcement, a security buyer, blocks and alerts. Valuable, and categorically different from rendering an evidence-grade verdict after the fact that an independent party signs. Enforcement is what the platform does to the agent; semantic auditing is what an auditor does to the record. The separation matters because the second one is the thing a regulator accepts.

3. Why now — and why not the deadline you were told about

The honest version of the timing story is that the single most-quoted forcing function — the EU AI Act high-risk deadline — is moving. The Digital Omnibus (provisional political agreement, May 2026) defers stand-alone Annex III obligations to 2 December 2027. If your urgency rested entirely on "August 2026," it just evaporated.

That is exactly why the category is real and not a deadline gimmick. The obligations did not change — only the clock did — and the pressures that are live this year are domestic and prudential:

APRA CPS 230 (Australia — commenced 1 July 2025). Operational-risk and service-provider obligations a regulated entity must be able to demonstrate, not assert; pre-existing service contracts must comply by their next renewal or 1 July 2026, whichever comes first. Independent assurance is the load-bearing phrase: a regulated bank cannot accept the platform vendor's self-attestation as independent.
COSO's Achieving Effective Internal Control Over Generative AI (23 February 2026), plus the SEC's dedicated SOX enforcement group (announced 31 March 2026). A US-driven, already-in-force push for an audit trail that captures prompts, inputs, outputs, model and configuration versions, and evidence of human review — sufficient to reconstruct what the AI acted on and show the control functioned as designed.
The evals gap. Per LangChain's State of Agent Engineering survey, 89% of agent teams have implemented observability but only 52.4% run offline evaluations. The market built the recording layer and skipped the judgment layer. That ~37-point delta is the addressable gap.

4. What an evidence pack actually is

The entry product audits existing transcripts and traces — retrospectively — and emits a regulator-ready evidence pack. Anatomy:

Chain of custody. Every ingested artefact is hash-anchored (SHA-256, append-only) so the pack can attest its own provenance — the tamper-evident bar that a screenshot and a declaration no longer clear.
A versioned policy set. Governance policy and acceptance criteria, machine-readable, stamped with which version applied at the time of the activity.
The judgment layer. Deterministic checks first; calibrated model-as-judge only where code cannot reach. Per episode: was the action in scope? Was the reasoning consistent with the action? Was required human oversight actually evidenced — a real review, not a rubber stamp? Were hard and soft gates handled correctly? Did data access match stated purpose?
An OSCAL spine. The canonical format is OSCAL Assessment Results plus POA&M — NIST's machine-readable compliance standard — so the pack feeds the buyer's existing GRC tooling rather than fighting it. The human-readable pack renders from that spine.
A gap report. Explicitly what this evidence cannot prove.
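The chain-of-custody item above can be illustrated with a minimal hash-chained, append-only log. This is a sketch under stated assumptions, not signum's implementation: the function names, the entry layout, and the all-zero genesis value are hypothetical, and only the use of SHA-256 and append-only anchoring comes from the text.

```python
import hashlib

# Hypothetical anchor value for the first entry in the chain.
GENESIS = "0" * 64


def entry_digest(prev_digest: str, artefact: bytes) -> str:
    """Digest binding an artefact's SHA-256 to the previous entry's digest."""
    artefact_hash = hashlib.sha256(artefact).hexdigest()
    return hashlib.sha256((prev_digest + artefact_hash).encode("ascii")).hexdigest()


def append_artefact(log: list, name: str, artefact: bytes) -> None:
    """Append one ingested artefact; earlier entries are never rewritten."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({
        "name": name,
        "artefact_sha256": hashlib.sha256(artefact).hexdigest(),
        "prev": prev,
        "digest": entry_digest(prev, artefact),
    })


def verify_chain(log: list, artefacts: dict) -> bool:
    """Recompute every link; any altered artefact or reordered entry fails."""
    prev = GENESIS
    for entry in log:
        data = artefacts.get(entry["name"])
        if data is None:
            return False
        if entry["prev"] != prev or entry["digest"] != entry_digest(prev, data):
            return False
        prev = entry["digest"]
    return True
```

Because each digest covers the previous one, changing any ingested artefact invalidates every later link. Note that the chain only attests to integrity from ingestion onward, which is consistent with the limits the gap report is meant to state.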

5. The toolchain: six components, one engine, many regimes

We built this. The toolchain is called signum (Latin: sign, mark, seal), and it runs end-to-end on a laptop: point it at a real agent session and it emits a signed, byte-reproducible, independently verifiable evidence pack in seconds.

The signum pipeline: session in, signed evidence pack out

Input: agent session. Claude Code + MCP servers: transcripts, tool calls, identity events.
① capture. Structured events from the agent, MCP servers, and identity layer.
② store. Immutable, hash-anchored event log, queryable and auditor-inspectable.
③ classify. Apply the framework's primitive taxonomies: authority, provenance, information flow, side-effect class.
④ map (the load-bearing IP). Classified events → OSCAL Assessment Results per regulator control code. One engine, a YAML bundle per regime.
⑤ sign. Keyed Ed25519 + Sigstore transparency log; independence statement.
⑥ deliver (to the auditor). Auditor-portable bundle + verify.sh. Verification runs on the auditor's laptop, with no vendor software.
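To make the map step concrete, here is a minimal sketch of a data-driven mapping from classified events to per-control verdicts. Everything in it is an assumption for illustration: the bundle is a Python dict standing in for a YAML file, the class labels (`unapproved_egress`, `unreported_incident`) are hypothetical, the rules attached to each control code are invented and do not paraphrase the regulation, and the output is a simplified record rather than real OSCAL Assessment Results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    side_effect_class: str  # hypothetical label assigned by the classify step


# Hypothetical bundle contents; a real bundle would be loaded from YAML.
# The control codes appear in this paper; the rules attached to them do not.
BUNDLE = {
    "regime": "APRA CPS 234",
    "controls": [
        {"code": "§20", "violating_classes": ["unapproved_egress"]},
        {"code": "§23", "violating_classes": ["unreported_incident"]},
    ],
}


def map_to_findings(events: list, bundle: dict) -> list:
    """Jurisdiction-agnostic engine: all regime knowledge lives in the bundle."""
    findings = []
    for control in bundle["controls"]:
        bad = set(control["violating_classes"])
        evidence = sorted(e.event_id for e in events if e.side_effect_class in bad)
        findings.append({
            "regime": bundle["regime"],
            "control": control["code"],
            "verdict": "NOT_SATISFIED" if evidence else "SATISFIED",
            "evidence": evidence,
        })
    return findings
```

Adding a regime in this sketch means adding another dict of the same shape; `map_to_findings` does not change.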

Five design decisions carry the weight:

A new regulator regime is a YAML bundle, not code. Three regimes — APRA CPS 234, OSFI E-23, MAS FEAT — run on the same engine today. The mapping bundles are where regulator-specific knowledge lives; the engine is jurisdiction-agnostic. This is the framework's I_commercial adapter made concrete.
The gate has teeth. A deliberately seeded breach session trips NOT_SATISFIED end-to-end. An assurance tool that cannot fail is theatre; falsifiability is a feature you have to engineer and then prove.
Byte-reproducibility. Re-running the pipeline over the same session produces a byte-identical pack. Evidence that cannot be reproduced cannot be independently verified.
The auditor needs no trust in us. The bundle ships with a verifier the auditor runs locally — signature, hashes, independence statement, OSCAL structure — without installing signum or taking our word for anything.
Two layers of judgment. Regulator regimes answer "does this satisfy the control code?" A client-policy layer answers the harder, more useful question: "does this satisfy your governance policy?" The same session can pass one and fail the other — and that distinction is precisely the finding an internal-audit team needs.
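The byte-reproducibility decision can be sketched with canonical serialization. This assumes, hypothetically, that a pack can be represented as JSON-serializable data; the source does not describe signum's actual serialization, and the function names below are illustrative.

```python
import hashlib
import json


def canonical_bytes(pack: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so that equal content
    always yields identical bytes, regardless of key insertion order."""
    text = json.dumps(pack, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def pack_digest(pack: dict) -> str:
    """SHA-256 of the canonical form; this is the value a signature would cover."""
    return hashlib.sha256(canonical_bytes(pack)).hexdigest()


# Two runs that assemble the same findings in a different order.
run_a = {"session": "s-001", "verdict": "NOT_SATISFIED", "findings": ["egress"]}
run_b = {"findings": ["egress"], "verdict": "NOT_SATISFIED", "session": "s-001"}
same = pack_digest(run_a) == pack_digest(run_b)
```

The design consequence is that anything non-deterministic, such as a wall-clock timestamp of the run itself, has to stay out of the signed content or be derived from the session, otherwise a re-run cannot match.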

6. Engagement Zero: we audited ourselves first

Before asking any client to sit for this, we ran the engagement on ourselves — real Claude Code sessions from our own work, with a real one-page Comware governance policy (egress allow-list, no auto-approve, instrumentation required) as the client-policy layer.

The verdict on a real ~4,900-event session:

Control	Verdict
APRA CPS 234 §20 (information assets)	SATISFIED
APRA CPS 234 §23 (incident management)	SATISFIED
Comware's own governance policy	NOT_SATISFIED

Both findings against our own policy are true. The session fetched a documentation PDF from a domain outside our github-only allow-list (critical — the rule judges allow-list membership, and the fetch was outside it). And the session ran with auto-approve enabled and no human oversight posture — exactly the YOLO-mode failure the framework paper opens with, in our own work, recorded in signed OSCAL.
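The two client-policy findings above are the kind of rule that can be checked deterministically. The following is a minimal sketch, assuming a hypothetical session summary with `fetched_urls` and `auto_approve` fields and a hypothetical policy dict; the rule names and URLs are illustrative, and only the github-only allow-list and the no-auto-approve rule come from the text.

```python
from urllib.parse import urlparse

# Hypothetical machine-readable form of the one-page policy.
POLICY = {"egress_allow_list": ["github.com"], "allow_auto_approve": False}


def host_allowed(url: str, allow_list: list) -> bool:
    """True if the URL's host is an allow-listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allow_list)


def audit_session(session: dict, policy: dict) -> dict:
    """Judge allow-list membership and oversight posture; nothing else."""
    findings = []
    for url in session["fetched_urls"]:
        if not host_allowed(url, policy["egress_allow_list"]):
            findings.append({"rule": "egress_allow_list", "detail": url})
    if session["auto_approve"] and not policy["allow_auto_approve"]:
        findings.append({"rule": "no_auto_approve", "detail": "auto-approve enabled"})
    return {
        "verdict": "NOT_SATISFIED" if findings else "SATISFIED",
        "findings": findings,
    }
```

The egress rule deliberately judges membership only: a harmless documentation PDF from an unlisted domain is still a finding, which matches how the result above is described.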

The hook-gate surface captured the texture under the headline: 1,217 gate evaluations (1,216 allowed, 1 blocked — the one block real and identifiable), 624 agent-executed self-gated actions, reasoning-action consistency checks, 25 surfaced-uncertainty signals.

The dogfooding also hardened the classifier. The first pack reported a critical egress to destination unknown — and distrusting that finding was the correct response: it was a false positive, traced to a command that mentioned curl in echoed text rather than executing it. The fix (command-position anchoring; stripping heredocs and quoted strings before matching) was the fifth false-positive class eliminated by running the toolchain over our own corpus of real sessions. A finding that an auditor later discredits is worse than no finding at all — the false-positive hunt is the product work.
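The fix described above (command-position anchoring, with heredocs and quoted strings stripped before matching) can be sketched with regular expressions. This is an illustrative approximation, not the toolchain's classifier: the patterns, the function name, and the choice of `curl` and `wget` as the matched tools are assumptions, and a regex is not a full shell parser.

```python
import re

# Heredoc: from "<<DELIM" through the line that contains only DELIM.
HEREDOC = re.compile(
    r"<<-?\s*['\"]?(\w+)['\"]?[^\n]*\n.*?^\s*\1[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
# Single- or double-quoted string, scanned left to right.
QUOTED = re.compile(r"'[^']*'|\"(?:\\.|[^\"\\])*\"")
# A fetch tool counts only at command position: line start or after ; & | (
COMMAND_POSITION = re.compile(r"(?:^|[;&|(])\s*(?:curl|wget)\b", re.MULTILINE)


def executes_network_fetch(command: str) -> bool:
    """True only if curl/wget appears where the shell would execute it."""
    stripped = HEREDOC.sub("", command)
    stripped = QUOTED.sub('""', stripped)
    # Known limitation of this sketch: a command substitution inside double
    # quotes, such as "$(curl ...)", is stripped and therefore missed.
    return COMMAND_POSITION.search(stripped) is not None
```

A command that only mentions the tool, such as `echo "use curl to fetch"`, no longer matches, while `cd /tmp && curl -O https://example.org/x` still does. The limitation noted in the comment trades a false-negative class for the false-positive class being removed, so it would need its own handling in a production classifier.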

Two things made this run worth publishing. First, the two-layer judgment demonstrated on real data: passing the regulator while failing your own policy is a result no dashboard renders. Second, the engagement-honesty position it buys: the strongest possible opening with a skeptical auditor is "here is the product finding real problems in our own work — signed."

7. The part that earns an auditor's trust: what it cannot prove

The most credible thing a semantic audit says is where it stops. Transcript-level evidence can establish what was said, decided, and invoked; whether the reasoning was consistent with the action; and whether oversight artefacts were present or absent. It cannot retroactively manufacture runtime policy context, tamper-evidence prior to ingestion, or attribution the source system never recorded.

A pack that claims otherwise is not assurance — it is theatre. So the gap report names the limits in the same document as the findings. That honesty is also the path forward: the gaps are precisely what an instrumented-by-design deployment closes.

We hold ourselves to one discipline, stated as a phrase: we evidence, we don't certify. The findings are technical and operational; whether they are sufficient for a given regulation remains the client's counsel's call. Clause-level sufficiency is engagement work done with qualified legal review, not marketing copy.

8. Why independence is the product, not a feature

The observability vendors can — and will — ship "compliance reports." But a reformatted trace is still telemetry. What a regulated buyer ultimately needs is separation of duties: independent assurance that the platform vendor structurally cannot provide for its own product. The prudential frameworks we map to — APRA CPS 230/234, OSFI E-23, MAS FEAT — all encode an independence requirement that self-attestation does not satisfy.

Accountability, meaning a named party that signs the findings and is independent of the thing being audited, is not a feature you can add in a release. It is the category. Observability records. Semantic auditing judges. Only one of them produces an account that an auditor, a board, and a regulator will all accept.

9. Where this goes

Three threads, in order of consequence:

External review. The pack that audited our own sessions is the artefact we put in front of friendly external reviewers — an internal-audit lead, an assurance practitioner, an ex-supervisor — with one question: would you accept this as evidence for the nominated control code? That bar, not feature count, is what "done" means for this phase.
Clause depth. The regime bundles map to control codes today; full audit-fidelity clause decomposition is legal work, in progress, and we will not claim it before it is done.
Standards. The pack's OSCAL spine uses an agent-runtime property vocabulary that does not exist in any standard yet. We intend to propose it upstream — evidence formats only compound in value when more than one party emits them.

Comware · June 2026 · Companion to the Agent Security in Depth framework. Contact: jima@comware.com.au.