Semantic Auditing in Practice
Signed, regulator-mapped evidence for AI agents — and what we found when we pointed the toolchain at ourselves.
Companion paper. This is the governance-and-compliance companion to the Agent Security in Depth framework — the evidence layer (Primitive F) and assurance plane (I_arch / I_commercial) of that framework, built, run on real agent sessions, and described here as working practice rather than architecture. June 2026.
1. The auditor's question
Every serious AI-agent deployment now has observability. Spans, traces, token counts, latency, cost-per-call — the engineering telemetry is a solved, commoditised problem, and a crowded one: the observability vendors converge on the same span-capture primitive.
But observability answers an engineer's question: what happened, how fast, at what cost?
It does not answer an auditor's question: was what the agent did justified, in-policy, and properly overseen — and can you prove it to someone who doesn't trust you?
That second question is its own discipline. It has a buyer (compliance, risk, internal audit, the regulator), an output (evidence, not dashboards), and now a live regulatory mandate for a large class of deployments. What it does not yet have is a name. We propose one.
Semantic auditing (n.): evaluating recorded AI-agent activity — transcripts, traces, tool calls, decisions — against governance policy, acceptance criteria, and oversight requirements, and producing evidence-grade findings about whether controls functioned as designed.
The one-liner that separates it from everything adjacent:
Observability tells you the agent called the database. Semantic auditing tells you whether it should have — and proves it.
2. Three categories, three buyers — only one signs the attestation
| | Question answered | Output | Buyer |
|---|---|---|---|
| Observability | What happened, how fast, at what cost? | Traces, dashboards | Engineering |
| Agent security | Was the action permitted and safe? | Blocks, alerts, posture | Security |
| Semantic auditing | Was the action justified, in-policy, overseen — provably? | Evidence packs, audit findings, attestations | Compliance, risk, audit, the regulator |
The industry already named the gap without naming the category: standard logging captures the what; the why and how are lost in the black box of model inference. Spans record events; auditors sign off on judgments. Nobody currently sells the judgment.
One adjacency worth naming: the word semantic is starting to appear in the agent-control aisle, where written policy is converted into machine-enforceable logic for real-time agent control. That is the middle row of the table — enforcement, a security buyer, blocks and alerts. Valuable, and categorically different from rendering an evidence-grade verdict after the fact that an independent party signs. Enforcement is what the platform does to the agent; semantic auditing is what an auditor does to the record. The separation matters because the second one is the thing a regulator accepts.
3. Why now — and why not the deadline you were told about
The honest version of the timing story is that the single most-quoted forcing function — the EU AI Act high-risk deadline — is moving. The Digital Omnibus (provisional political agreement, May 2026) defers stand-alone Annex III obligations to 2 December 2027. If your urgency rested entirely on "August 2026," it just evaporated.
That is exactly why the category is real and not a deadline gimmick. The obligations did not change — only the clock did — and the pressures that are live this year are domestic and prudential:
- APRA CPS 230 (Australia — commenced 1 July 2025). Operational-risk and service-provider obligations a regulated entity must be able to demonstrate, not assert; pre-existing service contracts must comply by their next renewal or 1 July 2026, whichever comes first. Independent assurance is the load-bearing phrase: a regulated bank cannot accept the platform vendor's self-attestation as independent.
- COSO's Achieving Effective Internal Control Over Generative AI (23 February 2026), plus the SEC's dedicated SOX enforcement group (announced 31 March 2026). A US-driven, already-in-force push for an audit trail that captures prompts, inputs, outputs, model and configuration versions, and evidence of human review — sufficient to reconstruct what the AI acted on and show the control functioned as designed.
- The evals gap. Per LangChain's State of Agent Engineering survey, 89% of agent teams have implemented observability but only 52.4% run offline evaluations. The market built the recording layer and skipped the judgment layer. That ~37-point delta is the addressable gap.
4. What an evidence pack actually is
The entry product audits existing transcripts and traces — retrospectively — and emits a regulator-ready evidence pack. Anatomy:
- Chain of custody. Every ingested artefact is hash-anchored (SHA-256, append-only) so the pack can attest its own provenance — the tamper-evident bar that a screenshot and a declaration no longer clear.
- A versioned policy set. Governance policy and acceptance criteria, machine-readable, stamped with which version applied at the time of the activity.
- The judgment layer. Deterministic checks first; calibrated model-as-judge only where code cannot reach. Per episode: was the action in scope? Was the reasoning consistent with the action? Was required human oversight actually evidenced — a real review, not a rubber stamp? Were hard and soft gates handled correctly? Did data access match stated purpose?
- An OSCAL spine. The canonical format is OSCAL Assessment Results plus POA&M — NIST's machine-readable compliance standard — so the pack feeds the buyer's existing GRC tooling rather than fighting it. The human-readable pack renders from that spine.
- A gap report. Explicitly what this evidence cannot prove.
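The chain-of-custody item above can be illustrated with a small hash chain. This is a minimal sketch of the general technique, not signum's implementation: the class name `CustodyLedger`, the entry fields, and the all-zero genesis value are assumptions made for illustration. Each entry records the SHA-256 of an ingested artefact plus the hash of the previous entry, so altering any earlier entry invalidates everything after it.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _entry_hash(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))


class CustodyLedger:
    """Hypothetical append-only ledger: each entry is chained to the previous one."""

    GENESIS = "0" * 64  # assumed placeholder for "no previous entry"

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, name: str, content: bytes) -> dict:
        prev = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        body = {"name": name, "content_sha256": sha256_hex(content), "prev_hash": prev}
        entry = {**body, "entry_hash": _entry_hash(body)}
        self._entries.append(entry)
        return dict(entry)

    def entries(self) -> list[dict]:
        # Return copies so callers cannot mutate the ledger in place.
        return [dict(e) for e in self._entries]


def verify_chain(entries: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = CustodyLedger.GENESIS
    for e in entries:
        body = {
            "name": e["name"],
            "content_sha256": e["content_sha256"],
            "prev_hash": e["prev_hash"],
        }
        if e["prev_hash"] != prev or _entry_hash(body) != e["entry_hash"]:
            return False
        prev = e["entry_hash"]
    return True
```

A verifier only needs the entries and the artefacts themselves to repeat this check. Note that a chain of this kind says nothing about what happened to an artefact before it was ingested.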
5. The toolchain: six components, one engine, many regimes
We built this. The toolchain is called signum (Latin: sign, mark, seal), and it runs end-to-end on a laptop: point it at a real agent session and it emits a signed, byte-reproducible, independently verifiable evidence pack in seconds.
verify.sh: verification runs on the auditor's laptop, no vendor software.
Five design decisions carry the weight:
- A new regulator regime is a YAML bundle, not code. Three regimes — APRA CPS 234, OSFI E-23, MAS FEAT — run on the same engine today. The mapping bundles are where regulator-specific knowledge lives; the engine is jurisdiction-agnostic. This is the framework's I_commercial adapter made concrete.
- The gate has teeth. A deliberately seeded breach session trips NOT_SATISFIED end-to-end. An assurance tool that cannot fail is theatre; falsifiability is a feature you have to engineer and then prove.
- Byte-reproducibility. Re-running the pipeline over the same session produces a byte-identical pack. Evidence that cannot be reproduced cannot be independently verified.
- The auditor needs no trust in us. The bundle ships with a verifier the auditor runs locally — signature, hashes, independence statement, OSCAL structure — without installing our software or taking our word for anything.
- Two layers of judgment. Regulator regimes answer "does this satisfy the control code?" A client-policy layer answers the harder, more useful question: "does this satisfy your governance policy?" The same session can pass one and fail the other — and that distinction is precisely the finding an internal-audit team needs.
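Three of the decisions above (regimes as data, two layers of judgment, byte-reproducibility) can be shown together in a few lines. This is an illustrative sketch under stated assumptions, not the signum engine: the bundle contents, control codes such as `EX-1` and `POL-1`, and the fact names are all hypothetical, and plain dictionaries stand in for the YAML bundles.

```python
import json

# Hypothetical bundles. In the toolchain described above these are YAML files;
# plain dicts stand in here so the sketch needs only the standard library.
REGIME_BUNDLE = {
    "id": "example-regime",
    "controls": [{"code": "EX-1", "requires": ["events_logged"]}],
}
CLIENT_POLICY_BUNDLE = {
    "id": "example-client-policy",
    "controls": [
        {"code": "POL-1", "requires": ["egress_in_allow_list"]},
        {"code": "POL-2", "requires": ["human_approval_evidenced"]},
    ],
}


def evaluate(bundle: dict, facts: dict) -> dict:
    """Engine with no regime-specific logic: it only reads the bundle.

    A control is SATISFIED only if every required fact is explicitly True,
    so a missing fact fails closed.
    """
    results = {}
    for control in bundle["controls"]:
        ok = all(facts.get(name) is True for name in control["requires"])
        results[control["code"]] = "SATISFIED" if ok else "NOT_SATISFIED"
    all_ok = all(v == "SATISFIED" for v in results.values())
    return {
        "bundle": bundle["id"],
        "controls": results,
        "overall": "SATISFIED" if all_ok else "NOT_SATISFIED",
    }


def build_pack(facts: dict) -> bytes:
    """Serialise both layers canonically so identical input gives identical bytes."""
    pack = {
        "regime": evaluate(REGIME_BUNDLE, facts),
        "client_policy": evaluate(CLIENT_POLICY_BUNDLE, facts),
    }
    return json.dumps(pack, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Facts that a session analysis might produce (hypothetical values).
session_facts = {
    "events_logged": True,
    "egress_in_allow_list": False,
    "human_approval_evidenced": False,
}
pack_bytes = build_pack(session_facts)
```

With these facts the regime layer comes out SATISFIED and the client-policy layer NOT_SATISFIED, and calling `build_pack` again returns the same bytes. Adding a regime in this sketch means adding another dictionary, with no change to `evaluate`.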
6. Engagement Zero: we audited ourselves first
Before asking any client to sit for this, we ran the engagement on ourselves — real Claude Code sessions from our own work, with a real one-page Comware governance policy (egress allow-list, no auto-approve, instrumentation required) as the client-policy layer.
The verdict on a real ~4,900-event session:
| Control | Verdict |
|---|---|
| APRA CPS 234 §20 (information assets) | SATISFIED |
| APRA CPS 234 §23 (incident management) | SATISFIED |
| Comware's own governance policy | NOT_SATISFIED |
Both findings against our own policy are true. The session fetched a documentation PDF from a domain outside our github-only allow-list (critical — the rule judges allow-list membership, and the fetch was outside it). And the session ran with auto-approve enabled and no human oversight posture — exactly the YOLO-mode failure the framework paper opens with, in our own work, recorded in signed OSCAL.
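The egress finding turns on a membership test, which can be sketched as follows. The allow-list contents, the rule identifier, and the decision to accept subdomains are assumptions for illustration; the source says only that the policy is github-only and that the finding was rated critical.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the policy is described only as "github-only".
ALLOWED_DOMAINS = ("github.com",)


def host_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one (assumed)."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Match on a label boundary so "github.com.evil.example" is not accepted.
    return any(host == d or host.endswith("." + d) for d in allowed)


def egress_finding(url: str):
    """Return None when the fetch is in policy, else a finding record."""
    if host_allowed(url):
        return None
    # "egress-allow-list" is a made-up rule identifier for this sketch.
    return {"rule": "egress-allow-list", "severity": "critical", "url": url}
```

The rule judges membership only. Whether the fetched content was harmless does not enter into it, which is why a documentation PDF still produces a critical finding.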
The hook-gate surface captured the texture under the headline: 1,217 gate evaluations (1,216 allowed, 1 blocked — the one block real and identifiable), 624 agent-executed self-gated actions, reasoning-action consistency checks, 25 surfaced-uncertainty signals.
The dogfooding also hardened the classifier. The first pack reported a critical egress to destination unknown — and distrusting that finding was the correct response: it was a false positive, traced to a command that mentioned curl in echoed text rather than executing it. The fix (command-position anchoring; stripping heredocs and quoted strings before matching) was the fifth false-positive class eliminated by running the toolchain over our own corpus of real sessions. A finding that an auditor later discredits is worse than no finding at all — the false-positive hunt is the product work.
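The shape of that fix can be sketched with regular expressions. This is a simplified illustration of command-position anchoring, not the classifier's actual code: the function name, the patterns, and the choice of `curl` and `wget` as the fetch tools are assumptions.

```python
import re

# Heredoc: keep the rest of the opening line (it may hold a real command),
# drop the body up to and including the terminator line.
HEREDOC = re.compile(
    r"<<-?[ \t]*(['\"]?)(\w+)\1([^\n]*)\n.*?^[ \t]*\2[ \t]*$",
    re.DOTALL | re.MULTILINE,
)

# Single- or double-quoted strings, scanned left to right in one pass.
QUOTED = re.compile(r"'[^']*'" + "|" + r'"(?:\\.|[^"\\])*"', re.DOTALL)

# A fetch tool counts only at command position: at the start of a line or
# right after a shell separator (; & | ( or a backtick).
COMMAND_POSITION = re.compile(
    r"(?:^|[;&|(`])[ \t]*(?:curl|wget)(?=\s|$)",
    re.MULTILINE,
)


def executes_fetch(command: str) -> bool:
    """True if curl or wget appears where the shell would execute it."""
    text = HEREDOC.sub(lambda m: m.group(3), command)
    text = QUOTED.sub('""', text)  # blank out quoted text before matching
    return COMMAND_POSITION.search(text) is not None
```

Under this sketch, `echo "use curl to fetch the file"` and `echo curl` are not flagged, while `cd /tmp && curl -O https://example.org/a.pdf` is. Blanking quoted strings trades recall for precision: a fetch hidden inside `bash -c "..."` or a quoted `$(...)` would be missed, so a production classifier would need further handling for those cases.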
Two things made this run worth publishing. First, the two-layer judgment demonstrated on real data: passing the regulator while failing your own policy is a result no dashboard renders. Second, the engagement-honesty position it buys: the strongest possible opening with a skeptical auditor is "here is the product finding real problems in our own work — signed."
7. The part that earns an auditor's trust: what it cannot prove
The most credible thing a semantic audit says is where it stops. Transcript-level evidence can establish what was said, decided, and invoked; whether the reasoning was consistent with the action; and whether oversight artefacts were present or absent. It cannot retroactively manufacture runtime policy context, tamper-evidence prior to ingestion, or attribution the source system never recorded.
A pack that claims otherwise is not assurance — it is theatre. So the gap report names the limits in the same document as the findings. That honesty is also the path forward: the gaps are precisely what an instrumented-by-design deployment closes.
We hold ourselves to this discipline in a single phrase: we evidence, we don't certify. The findings are technical and operational; whether they are sufficient for a given regulation remains the client's counsel's call. Clause-level sufficiency is engagement work done with qualified legal review, not marketing copy.
8. Why independence is the product, not a feature
The observability vendors can — and will — ship "compliance reports." But a reformatted trace is still telemetry. What a regulated buyer ultimately needs is separation of duties: independent assurance that the platform vendor structurally cannot provide for its own product. The prudential frameworks we map to — APRA CPS 230/234, OSFI E-23, MAS FEAT — all encode an independence requirement that self-attestation does not satisfy.
Accountability, meaning a named party that signs the findings and is independent of the thing being audited, is not a feature you can add in a release. It is the category. Observability records. Semantic auditing judges. Only one of them gives an auditor, a board, and a regulator a single story they will all accept.
9. Where this goes
Three threads, in order of consequence:
- External review. The pack that audited our own sessions is the artefact we put in front of friendly external reviewers — an internal-audit lead, an assurance practitioner, an ex-supervisor — with one question: would you accept this as evidence for the nominated control code? That bar, not feature count, is what "done" means for this phase.
- Clause depth. The regime bundles map to control codes today; full audit-fidelity clause decomposition is legal work, in progress, and we will not claim it before it is done.
- Standards. The pack's OSCAL spine uses an agent-runtime property vocabulary that does not exist in any standard yet. We intend to propose it upstream — evidence formats only compound in value when more than one party emits them.
Comware · June 2026 · Companion to the Agent Security in Depth framework. Contact: jima@comware.com.au.