Chapter 12. Testing, evaluation, and trace discipline

Testing an agentic system is a different problem from testing a deterministic system. The system under test is partially probabilistic; the failure modes are statistical; the assertions are over distributions of behavior rather than over single outputs. The standard testing techniques (unit tests, integration tests, end-to-end tests) still apply to the deterministic substrate. They are necessary; they are not sufficient.

This chapter develops what is added on top: trace discipline as the system of record, evaluation as a continuous process, and a set of test patterns that are tractable to apply against probabilistic components. The argument is that the architecture from Chapters 5–10 makes these patterns possible: bounded, governed, observable systems can be tested in ways that unconstrained agents cannot. The broader evaluation literature (benchmarks across planning, tool use, memory, and domain tasks, and the gaps in cost-efficiency, safety, and stability that motivate the architecture here) is surveyed elsewhere (Survey on Evaluation of LLM-based Agents, 2025); see Chapter 21.

The chapter takes a position: in 2026, testing has shifted from a phase to a property. The relevant question is not “did the system pass the test suite before release”, though it should, but “is the system continuously evaluating itself against the envelopes its architecture defines?” The shift is forced by the nature of probabilistic components: their behavior can change between releases without code changes (model upgrades), within releases under load (sampling variability), and across time even with the same model and prompt (drift in input distributions). A pre-release test pass certifies the system at one instant; a continuous evaluation certifies it across instants.

The three layers of testing

An agentic system has three layers of testing concerns. Each layer needs its own techniques.

Layer 1: Deterministic substrate

The bounding layer, governance layer, memory gateway, tool adapters, and orchestration code are deterministic software. Standard practice applies: unit tests, contract tests, integration tests, fuzz tests where appropriate. This layer should have the same testing rigor as any production system; the agentic system’s reliability rests on it.

The architectural commitments from Chapters 5–6 make this layer testable: bounds, validators, and policy gates are deterministic and can be asserted directly. Tests should cover:

Treat these tests as table stakes; coverage gaps here cause incidents. A useful coverage metric is the governance event matrix: a list of every governance event class crossed with a list of every condition under which the event class fires, asserted to be covered by tests. Gaps in the matrix are gaps in the test suite.
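The governance event matrix lends itself to a mechanical check. The sketch below is one minimal way to compute its gaps; the event-class names, the condition names, and the idea of collecting covered cells from test metadata are all assumptions made for illustration, not names from the architecture.

```python
# Hypothetical cells (event class, firing condition) that the governance
# layer is expected to produce; real names come from the system itself.
EXPECTED_CELLS = {
    ("validator_fail", "malformed_args"),
    ("policy_deny", "protected_path"),
    ("approval_required", "protected_path"),
    ("approval_required", "over_budget"),
}

# Cells that at least one test asserts on (assumed to be collected from
# test metadata or markers; hard-coded here for the sketch).
COVERED_CELLS = {
    ("validator_fail", "malformed_args"),
    ("policy_deny", "protected_path"),
    ("approval_required", "over_budget"),
}


def matrix_gaps(expected, covered):
    """Return the expected cells that no test covers, in sorted order."""
    return sorted(expected - covered)


gaps = matrix_gaps(EXPECTED_CELLS, COVERED_CELLS)
for event_class, condition in gaps:
    print(f"uncovered: {event_class} under {condition}")
```

Run in CI, a non-empty gap list can fail the build, which turns the matrix from a document into an enforced property.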

Layer 2: Agent behavior within the envelope

The agent itself is the probabilistic component. Tests assert properties (boundedness, safety, policy compliance, output validity) rather than exact outputs. The standard techniques:

These tests are deterministic at the bound boundary even when the agent’s behavior inside the bound is not. That is the architectural payoff of Chapter 5 and Chapter 6 for testing.

A key insight: the agent’s behavior is probabilistic, but the envelope is not. The test suite asserts on the envelope. If the envelope holds, the agent’s variable behavior inside it is acceptable; if the envelope is breached, the agent’s behavior is unacceptable regardless of how reasonable it looks. This framing makes a previously-hard testing problem (testing a probabilistic component) into a tractable one (testing a deterministic envelope around a probabilistic component).
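Asserting on the envelope rather than on the output can be sketched as a scan over one session's trace events. The event names follow the trace excerpt later in this chapter; the three rules checked here are illustrative assumptions, not the full envelope.

```python
def envelope_violations(events, budget_usd):
    """Scan one session's trace events and list envelope breaches.

    Assumed rules (illustrative): every tool invocation must follow a passed
    bounds check for that tool, must not follow a policy deny for the same
    proposal, and cumulative spend must stay within the budget.
    """
    violations = []
    cleared = set()  # tools whose current proposal passed the bounds check
    for ev in events:
        kind = ev["event"]
        if kind == "agent.action_proposed":
            cleared.discard(ev["tool"])  # a new proposal needs a fresh check
        elif kind == "bounds.check_passed":
            cleared.add(ev["tool"])
        elif kind == "governance.policy_gate" and ev["decision"] == "deny":
            cleared.clear()  # a deny revokes clearance for the proposal
        elif kind == "tool.invocation":
            if ev["tool"] not in cleared:
                violations.append(f"unchecked tool call: {ev['tool']}")
            cleared.discard(ev["tool"])
        elif kind == "cost.tick" and ev["spent_usd"] > budget_usd:
            violations.append(f"budget exceeded: {ev['spent_usd']}")
    return violations


good_run = [
    {"event": "agent.action_proposed", "tool": "write_file"},
    {"event": "bounds.check_passed", "tool": "write_file"},
    {"event": "governance.policy_gate", "gate": "protected_paths", "decision": "allow"},
    {"event": "tool.invocation", "tool": "write_file"},
    {"event": "cost.tick", "spent_usd": 0.34},
]
bad_run = [
    {"event": "agent.action_proposed", "tool": "write_file"},
    {"event": "bounds.check_passed", "tool": "write_file"},
    {"event": "governance.policy_gate", "gate": "protected_paths", "decision": "deny"},
    {"event": "tool.invocation", "tool": "write_file"},
]
```

The test passes or fails on the trace structure alone; whatever text the agent produced inside the envelope never enters the assertion.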

Layer 3: Quality of agent output

The hardest layer. Did the agent produce a good answer? Good is domain-specific and probabilistic. Three approaches, each with limits:

The model-based evaluator in the first two approaches has an industry name: LLM-as-a-judge. It is the most common way to test LLM quality at scale, and it is inherently compromised: the judge has the same probabilistic failure modes as the agent it grades. A judge can be sycophantic, can be prompt-injected by the content it evaluates, and drifts with its own model version. Use it where nothing cheaper exists, but treat its verdicts as the softest signal in the suite. Deterministic structural compliance (Layers 1 and 2) is always more trustworthy than an LLM-as-a-judge score (Layer 3); when the two disagree, believe the deterministic check.

Quality testing is the layer where the practice is weakest in industry. The book’s position is that the team should commit to a small, stable set of quality signals tied to business outcomes and resist the urge to overfit to model benchmarks. Benchmarks measure benchmarks; the system’s quality is its production outcome.

A neglected dimension of quality testing is stability under input perturbation. The same task, phrased slightly differently, should produce similar-quality outputs. A test suite that includes perturbation (semantically equivalent but syntactically different inputs) catches the failure mode where the system is brittle to phrasing. Production-derived perturbations (real user phrasings of the same intent) are more valuable than synthetically generated ones, because they reflect the distribution the system actually faces.

Calibrating the governance layer

The risk scorer and the policy thresholds it feeds are themselves components that can drift, and calibrating them is a recurring operational task rather than a one-time setup. Chapter 6 states the rule for a mature system: calibrate on incident data, so that actions that turned out to be problematic would have had risk scores above the escalation threshold. For a new system that rule is circular, because there is no incident data to calibrate against. The team that ships the scorer on day one with thresholds set by intuition ships a scorer that is, in effect, uncalibrated, and the first weeks of traffic will either flood the approval queue with false positives or let genuinely risky actions through unflagged. That greenfield case is worth treating as its own bootstrap procedure.

Bootstrap without incident data

The procedure begins from conservative defaults and tightens or loosens as evidence arrives, rather than waiting for an incident to supply the evidence. Three steps.

First, set initial thresholds by action class, not as a single number. A refund, a code commit, and a configuration read are not the same kind of action, and a single risk threshold blurs the cliff between them. Class actions by reversibility and blast radius: read-only operations start below the elevated-logging threshold; reversible mutations start at the elevated threshold; irreversible or external-facing mutations start above the approval threshold. The defaults are deliberately conservative, because the cost of a false positive at startup (an approval that was not strictly needed) is a small delay, while the cost of a false negative is the incident the bootstrap exists to prevent.
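The first step can be sketched as a small lookup from action class to starting risk level. The 0 to 100 scale, the two threshold values, and the use of an external-facing flag as a proxy for blast radius are assumptions made for this sketch.

```python
from dataclasses import dataclass

# Hypothetical thresholds on an assumed 0-100 risk scale.
ELEVATED_LOGGING_THRESHOLD = 30
APPROVAL_THRESHOLD = 70


@dataclass(frozen=True)
class ActionClass:
    name: str
    mutates: bool
    reversible: bool
    external_facing: bool  # crude stand-in for blast radius


def initial_risk_floor(action):
    """Conservative starting risk level for an action class."""
    if not action.mutates:
        return ELEVATED_LOGGING_THRESHOLD - 10  # read-only: below elevated logging
    if action.reversible and not action.external_facing:
        return ELEVATED_LOGGING_THRESHOLD  # reversible mutation: at elevated
    return APPROVAL_THRESHOLD + 10  # irreversible or external: above approval


config_read = ActionClass("config_read", mutates=False, reversible=True, external_facing=False)
code_commit = ActionClass("code_commit", mutates=True, reversible=True, external_facing=False)
refund = ActionClass("refund", mutates=True, reversible=False, external_facing=True)
```

Treating a code commit as reversible is itself a judgment call; a team that deploys on commit may want to class it with the external-facing mutations.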

Second, seed the calibration with red-team cases. Deliberately craft borderline actions, the kind the team expects the system to see in production: a refund just under a dollar threshold, a diff that touches a sensitive path but is small, a query that is plausibly analytical and plausibly exfiltration. Score each seeded case, and set the class’s threshold at the decile that catches the seeded risky cases while passing the seeded benign ones. The seeded set need not be large: a few dozen cases per action class is enough to find a threshold that is not obviously wrong, and it gives the team a concrete artifact (“the scorer routes these N cases correctly”) to review rather than an intuition to defend.
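One way to turn the seeded set into a threshold is sketched below. It assumes scores on a 0 to 100 scale, reads "decile" as the candidate cut points 10, 20, ..., 90, and assumes an action is flagged when its score is at or above the threshold; the seeded scores are invented for illustration.

```python
def pick_threshold(seeded, candidates=range(10, 100, 10)):
    """Choose the decile threshold that best routes the seeded cases.

    seeded: list of (score, is_risky) pairs. Preference order: fewest missed
    risky cases, then fewest benign cases flagged, then the lowest threshold
    (the conservative choice when several candidates tie).
    """
    def errors(threshold):
        missed = sum(1 for score, risky in seeded if risky and score < threshold)
        false_alarms = sum(1 for score, risky in seeded if not risky and score >= threshold)
        return (missed, false_alarms)

    return min(candidates, key=lambda t: (errors(t), t))


# Invented seeded cases for one action class: (risk score, annotated risky?)
seeded_cases = [
    (15, False), (25, False), (35, False), (45, False),
    (55, True), (65, True), (80, True),
]
threshold = pick_threshold(seeded_cases)
```

Ranking missed risky cases ahead of false alarms mirrors the asymmetry in the text: at startup a needless approval is cheap and a missed risky action is not.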

Third, run a tightening loop over the first N sessions. Sample real sessions in production, score every action, and have a human annotate a rotating fraction (the riskiest decile the scorer flagged, plus a random sample of the rest) as genuinely risky or benign. After each batch of annotations, adjust the thresholds: if the flagged set contains mostly benign actions, the threshold is too low; if the random sample contains risky actions the scorer let through, it is too high. The loop converges when the scorer’s flagged set is mostly genuinely risky and the unflagged set is mostly genuinely benign, which is the operational definition of “well-calibrated enough to operate.”
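A single iteration of the tightening loop can be sketched as follows. The step size, the 0.5 cutoff used for "mostly benign", and the convention that actions are flagged at or above the threshold are assumptions, not values from the text.

```python
def adjust_threshold(threshold, flagged_labels, random_unflagged_labels,
                     step=5, min_flag_precision=0.5):
    """One tightening-loop step over a batch of human annotations.

    Labels are booleans (True = annotated genuinely risky). Actions are
    assumed to be flagged when their score is at or above the threshold.
    """
    if any(random_unflagged_labels):
        # Risky actions passed unflagged: the threshold is too high.
        return threshold - step
    if flagged_labels:
        precision = sum(flagged_labels) / len(flagged_labels)
        if precision < min_flag_precision:
            # The flagged set is mostly benign: the threshold is too low.
            return threshold + step
    return threshold  # flagged mostly risky, unflagged benign: converged


too_low = adjust_threshold(50, [True, False, False, False], [False] * 10)
too_high = adjust_threshold(50, [True, True], [False, True])
converged = adjust_threshold(50, [True, True, False], [False] * 10)
```

Missed risky actions are checked first, so a batch showing both problems moves the threshold in the safer direction.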

When to loosen

A scorer is well-calibrated enough to loosen when two conditions hold over a sustained window: the approval queue is dominated by actions a human reviewer consistently approves (the threshold is catching work it should not), and the random-sample annotation shows no risky actions passing unflagged. Loosening is then a measured change, replayed against the golden trace set (below) to confirm it does not let a known-incident shape through. Tightening is the same procedure in reverse, triggered by an incident, a near-miss, or a drift in the deny rate (below). Both are controlled changes to a versioned component, not prompt tweaks.

Testing the probabilistic governance components

A subtlety the bootstrap surfaces is that the risk scorer itself is often a model, and a model-based governance component is not covered by the Layer 1/2 testing that asserts on the deterministic envelope. The scorer’s routing is deterministic (the rule that acts on its score is fixed), but the quality of the score is a property of the scorer, and it drifts with the scorer’s own model version. The testing discipline for a model-based evaluator is therefore distinct from testing the envelope: track scorer calibration as a metric over time (the agreement between the scorer’s flagging and human annotation on the rotating sample), assert monotonicity where the scorer claims it (a larger refund scores higher than a smaller one), and test for drift by re-running the seeded red-team set against every scorer model upgrade before promotion. Where multiple model-based evaluators run in concert, the agreement between them is itself the signal, developed below under quality testing.
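Two of these checks, calibration agreement and monotonicity, are simple enough to sketch. The refund scorer below is a toy stand-in for a model-based scorer and exists only so the example runs.

```python
def agreement_rate(scorer_flags, human_labels):
    """Fraction of sampled actions where the scorer's flag matches the human label."""
    if len(scorer_flags) != len(human_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty samples")
    matches = sum(f == h for f, h in zip(scorer_flags, human_labels))
    return matches / len(human_labels)


def is_monotonic(scorer, inputs):
    """True if scores never decrease as the input (e.g. refund size) grows."""
    scores = [scorer(x) for x in sorted(inputs)]
    return all(a <= b for a, b in zip(scores, scores[1:]))


def toy_refund_scorer(amount_usd):
    """Hypothetical stand-in for a model-based scorer; purely illustrative."""
    return min(100, int(amount_usd // 10))


calibration = agreement_rate([True, False, True, True], [True, False, False, True])
monotonic = is_monotonic(toy_refund_scorer, [5, 50, 500, 5000])
```

Tracked per scorer version, the agreement rate becomes the calibration metric the paragraph describes, and the monotonicity check can gate promotion of a scorer upgrade.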

Trace discipline

The trace is the system of record. It is not logging in the operational sense; it is the artifact that makes every reasoning step, tool call, memory access, policy decision, and approval event observable, attributable, and replayable.

Trace discipline rests on four commitments.

1. Structured traces

Every event in the trace is a typed record with explicit fields: timestamp, agent identity, session identity, event class, parent event (where applicable), payload (input, output, metadata). Unstructured logs are not traces.

In 2026 there is no reason to invent a bespoke schema for this. The industry standard is OpenTelemetry (OTel), and its Semantic Conventions for GenAI define spans and attributes for model and agent telemetry, such as gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens. (These GenAI conventions are still stabilizing as of 2026; treat the specific attribute names as illustrative and pin to a convention version.) Emit agentic traces as OTel spans, mapping the agentic event classes below onto span names and attributes. Building on OTel means the trace plugs into existing collectors, backends, and dashboards rather than a one-off pipeline, and it makes the agentic system observable with the same tooling as the rest of the stack.

Common event classes:

These names are illustrative; the principle is that every event is typed and the type is queryable.

A short trace excerpt makes the format concrete. The session below is a coding-assistant agent making a small change; the trace captures the bounding check, the governance pipeline, and the tool invocation:

{"ts": "2026-06-17T09:15:42.103Z", "session": "s_8f23", "event": "agent.action_proposed",
 "tool": "write_file", "args": {"path": "src/api/signup.ts", "diff_size_lines": 12}}
{"ts": "2026-06-17T09:15:42.105Z", "session": "s_8f23", "event": "bounds.check_passed",
 "tool": "write_file", "checks": ["iteration", "cost", "time", "action_surface",
                                   "data_access_scope", "reversibility"]}
{"ts": "2026-06-17T09:15:42.106Z", "session": "s_8f23", "event": "governance.validator",
 "tool": "write_file", "result": "pass"}
{"ts": "2026-06-17T09:15:42.109Z", "session": "s_8f23", "event": "governance.policy_gate",
 "gate": "no_secrets_in_diff", "decision": "allow"}
{"ts": "2026-06-17T09:15:42.110Z", "session": "s_8f23", "event": "governance.policy_gate",
 "gate": "protected_paths", "decision": "allow"}
{"ts": "2026-06-17T09:15:42.111Z", "session": "s_8f23", "event": "governance.risk_score",
 "action": "write_file", "score": 10}
{"ts": "2026-06-17T09:15:42.243Z", "session": "s_8f23", "event": "tool.invocation",
 "tool": "write_file", "latency_ms": 132, "cost_usd": 0.0008, "result": "ok"}
{"ts": "2026-06-17T09:15:42.244Z", "session": "s_8f23", "event": "cost.tick",
 "spent_usd": 0.34, "remaining_usd": 1.66}

Two architectural observations from the format:

The schema for each event class is itself versioned. Trace consumers (the dashboards, the replay infrastructure, the alerting) depend on the schema, and trace schema changes go through the same review as code changes. Backward-incompatible trace schema changes are rare and explicit.
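The difference between a typed record and an unstructured log line can be made concrete with a small parser. This is a minimal sketch: the required-field set matches the fields visible in the excerpt above (ts, session, event), the remaining fields are gathered into a payload, and the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_FIELDS = ("ts", "session", "event")


@dataclass(frozen=True)
class TraceEvent:
    ts: str
    session: str
    event: str
    parent: Optional[str] = None
    payload: dict = field(default_factory=dict)


def parse_event(line):
    """Parse one JSON trace line into a typed record, rejecting untyped ones."""
    raw = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"untyped record, missing fields: {missing}")
    payload = {k: v for k, v in raw.items()
               if k not in REQUIRED_FIELDS and k != "parent"}
    return TraceEvent(raw["ts"], raw["session"], raw["event"],
                      raw.get("parent"), payload)


line = ('{"ts": "2026-06-17T09:15:42.111Z", "session": "s_8f23", '
        '"event": "governance.risk_score", "action": "write_file", "score": 10}')
risk_event = parse_event(line)
```

Rejecting records at parse time is what keeps the event type queryable: a consumer can filter on the event field without guarding against lines that lack it.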

2. Per-session correlation

All events for a single agentic session share a session identifier. Events from related sessions (orchestrator delegating to a worker, agent invoking a sub-agent through handoff) are linked via parent-child relationships. The trace forms a tree (or, in multi-agent shapes, a DAG) per logical task.

This correlation makes incident response feasible. Without it, traces are an unsorted pile.

Correlation extends across system boundaries. When an agentic system interacts with an upstream or downstream service, correlation identifiers should propagate so that the full request lifecycle can be reconstructed. A customer-support agent’s session correlates to the CRM ticket; an operations agent’s session correlates to the incident response record. The trace is most valuable when it connects to the rest of the engineering record.
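Reconstructing the per-task tree from parent-child links can be sketched in a few lines. The session identifiers and the (session, parent) pair representation are assumptions for illustration; a DAG would need a visited set in addition.

```python
from collections import defaultdict


def build_session_tree(links):
    """links: iterable of (session_id, parent_session_id or None).

    Returns (roots, children), where children maps a session to the sessions
    it delegated to, in the order they were recorded.
    """
    children = defaultdict(list)
    roots = []
    for session_id, parent_id in links:
        if parent_id is None:
            roots.append(session_id)
        else:
            children[parent_id].append(session_id)
    return roots, dict(children)


def task_sessions(children, root):
    """Depth-first list of every session belonging to the task rooted at root."""
    ordered, stack = [], [root]
    while stack:
        node = stack.pop()
        ordered.append(node)
        stack.extend(reversed(children.get(node, [])))
    return ordered


links = [
    ("s_orch", None),         # orchestrator session
    ("s_worker1", "s_orch"),  # delegated worker
    ("s_worker2", "s_orch"),
    ("s_sub", "s_worker1"),   # sub-agent reached through handoff
]
roots, children = build_session_tree(links)
```

During incident response, the list returned for a root is the set of sessions whose events need to be pulled to see the whole task.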

3. Full trace on flagged sessions

In high-volume systems, full tracing every session is impractical. The right architectural pattern is:

The cost of full tracing is real; the cost of not having traces when an incident lands is much larger. Design the trace store with cost-vs-retention tradeoffs in mind, not as an afterthought.

Retention by event class is a useful refinement. Governance events (validator pass/fail, policy decisions, approvals) often have compliance retention requirements measured in years. Routine operational events (cost ticks, tool invocations) may have retention measured in weeks. A trace store that supports per-class retention reduces storage cost while preserving the data that matters for audit.

4. Replay capability

The trace must be replayable. Replay in this book means: given a trace, the agentic system’s deterministic substrate can be re-executed against it (or against a simulation of it) to:

Replay is not free. It requires that the trace captures enough information (inputs to model calls, tool responses, retrieval results) to re-run the deterministic surrounding code. Build replay as a first-class capability and exercise it routinely; do not discover at incident time that the trace is insufficient to reconstruct the failure.

Replay also requires mocking. A trace shows the agent calling DELETE /user/123; the replay must not actually hit production again. The trace store therefore doubles as an API mock server during replay: when the deterministic substrate re-issues a tool call, the mock returns the exact observation captured in the original trace rather than executing the call for real. This is what makes replay safe to run continuously and in CI: every external effect is served from the recorded trace, and the only thing executed live is the deterministic glue (and, for counterfactual replay, the single component being varied).

One subtlety follows from this, because adversarial and counterfactual replay (below) deliberately push the agent off the recorded path: the mock will eventually be asked for a call the original trace never captured. On such a miss it must never fall through to the live system. The mock returns a typed cache-miss observation, and the assertion moves one step earlier, to the bounds check or governance gate that precedes the unrecorded call, or the diverged call is routed to a sandbox that can answer it safely. Replay’s safety guarantee is precisely that a divergence reaches a cache miss or a sandbox, never production.
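The mock-with-cache-miss behavior can be sketched as a lookup keyed on the tool name and its arguments. The class name, the tool names, and the shape of the cache-miss observation are hypothetical; a real mock would also need to handle the same call recorded more than once with different responses, which this sketch does not.

```python
import json


def _call_key(tool, args):
    """Stable key for a tool call: the name plus canonically serialized args."""
    return (tool, json.dumps(args, sort_keys=True))


class ReplayMock:
    """Serves tool observations from a recorded trace; never calls out live."""

    def __init__(self, recorded_calls):
        # recorded_calls: list of (tool, args, observation) from the trace.
        self._observations = {
            _call_key(tool, args): obs for tool, args, obs in recorded_calls
        }
        self.misses = []  # calls the original trace never captured

    def call(self, tool, args):
        key = _call_key(tool, args)
        if key in self._observations:
            return self._observations[key]
        # Divergence from the recorded path: return a typed miss, do not
        # fall through to the real system.
        self.misses.append((tool, args))
        return {"type": "cache_miss", "tool": tool}


mock = ReplayMock([
    ("http_get", {"path": "/user/123"}, {"status": 200, "name": "Ada"}),
])
hit = mock.call("http_get", {"path": "/user/123"})
miss = mock.call("http_delete", {"path": "/user/123"})
```

The recorded misses list gives the replay harness the place to move its assertion: each entry names a call whose preceding bounds check or governance gate should be inspected.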

Figure 9. Replay loop

Replay also has security implications. A trace that captures sensitive content captures it for as long as the trace is retained. The trace store has access-control requirements at least as strong as the original data sources, with explicit purpose-of-access logging when traces are read by humans. This is part of the architectural design, not an afterthought.

Replay-driven test patterns

With trace discipline in place, several test patterns become possible.

Regression testing from traces

A historical trace plus a new system version: re-run the deterministic substrate against the trace; observe whether the agent’s behavior stays within the envelope. Within the envelope means: bounds still hold, governance still allows what it allowed, outputs still satisfy the same validators, end-task outcome (if known) is comparable.

This is the most valuable testing pattern enabled by the architecture. It catches drift between model versions, between policy versions, between memory states. It produces a continuous validation signal at the cost of trace storage and replay infrastructure.

A practical implementation: maintain a golden trace set, a curated collection of traces that represent the system’s behavioral envelope. The set includes routine successful sessions, near-failures the architecture caught, escalations, refusals, and a few representative incidents. New releases must reproduce the envelope-respecting behavior on every trace in the set. Add to the set when incidents reveal under-covered scenarios; retire traces when they no longer represent realistic input.

Adversarial replay

Take a trace; modify the inputs (a different prompt, an injected tool response, a tampered retrieval result); re-run. Assert that the system’s defenses hold under the modified inputs. This is fuzz testing applied to agentic systems, with traces as the seed.

Adversarial replay is especially useful for security testing. Take a trace of a normal session; inject prompt-injection content into a retrieved document; re-run. Assert that the agent’s actions are still within the governance envelope. The test is reproducible (it is a trace replay), deterministic (the substrate is deterministic), and asserts on a real attack surface (the retrieval content).

Counterfactual replay

Take an incident trace; modify a single architectural parameter (a tighter bound, a stricter validator, a different policy); re-run. Assert whether the incident would have been prevented. This is the right way to make a case for policy or bound changes after an incident: data, not opinion.

Counterfactual replay also supports proactive architecture work. Before adopting a new pattern or tightening a bound, run the change against the trace archive; see what would have happened differently. If the change blocks five incidents and twenty legitimate operations, the team has concrete data about whether the change is worth making.
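The tally at the end of a counterfactual run can be sketched as below. The predicate stands in for an actual replay of each trace under the changed parameter; the archive records, the was_incident annotation, and the 200-line bound are invented for illustration.

```python
def counterfactual_report(archive, would_block):
    """Count how many incident and legitimate traces a proposed change blocks.

    would_block stands in for replaying a trace under the changed parameter
    and observing whether the substrate stops it.
    """
    report = {"blocked_incidents": 0, "blocked_legitimate": 0}
    for trace in archive:
        if would_block(trace):
            if trace["was_incident"]:
                report["blocked_incidents"] += 1
            else:
                report["blocked_legitimate"] += 1
    return report


def tighter_diff_bound(trace, max_lines=200):
    """Hypothetical proposed bound: reject diffs larger than max_lines."""
    return trace["diff_size_lines"] > max_lines


archive = [
    {"id": "t1", "diff_size_lines": 12, "was_incident": False},
    {"id": "t2", "diff_size_lines": 950, "was_incident": True},
    {"id": "t3", "diff_size_lines": 400, "was_incident": False},
    {"id": "t4", "diff_size_lines": 80, "was_incident": True},
]
report = counterfactual_report(archive, tighter_diff_bound)
```

The two counts are the data the text asks for: incidents prevented on one side, legitimate work blocked on the other, with the decision left to the team.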

Trace-driven test generation

The trace archive is itself a source of tests. Routine processes can derive tests from traces:

The test suite grows organically from production. The team does not have to imagine every test case at design time; the suite reflects the system’s actual operating envelope.

This loop has a name worth giving engineering leadership: the data flywheel. Every production failure is captured as a trace, annotated by a human with the correct outcome, and added to the golden trace set; the enlarged set makes the next release safer, which surfaces subtler failures, which further enriches the set. The asset that compounds is not the model, which the team may not even own, but the curated trace corpus, which encodes the organization’s hard-won knowledge of how its system fails and what correct behavior looks like. Framing the trace archive as a flywheel is how the infrastructure gets funded: it is a compounding, proprietary asset, not a cost center.

Quality signals that survive model changes

A common failure mode in agentic testing is to rely on quality signals that are sensitive to model version. The model upgrades; the test suite fails not because the system degraded but because the model phrases things differently. The team loses trust in the suite.

Quality signals that survive model changes:

Quality signals that do not survive model changes well:

Use the reliable signals as the testing backbone; use the fragile ones cautiously, with the understanding that they fail often for non-substantive reasons.

For pairwise preference and reference-comparison testing, use multiple judges and report the spread. A single judge whose verdict is sometimes “A is better” and sometimes “A is worse” on the same comparison is not reliable; a panel of three judges, with agreement of two or more required, is more reliable. Where the judges disagree, the disagreement itself is the signal: the change being tested is in a borderline region and probably needs human review.

A starting default, to deviate from rather than to admire: three judges, two-of-three agreement to accept a verdict, and disagreement escalates the comparison to a human review queue at the same priority as an approval-bound action. Measure the panel’s agreement rate over time and widen it for action classes where agreement stays low.
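The starting default can be sketched as a quorum rule over the judges' votes. The vote labels ("A", "B", "tie") and the returned dictionary shape are assumptions for illustration.

```python
from collections import Counter


def panel_verdict(votes, quorum=2):
    """Accept the majority verdict if it reaches quorum, else escalate."""
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count >= quorum:
        return {"verdict": top_label, "escalate": False}
    return {"verdict": None, "escalate": True}


def agreement_rate(panels, quorum=2):
    """Share of comparisons on which the panel reached quorum."""
    decided = sum(1 for votes in panels if not panel_verdict(votes, quorum)["escalate"])
    return decided / len(panels)


accepted = panel_verdict(["A", "A", "B"])
escalated = panel_verdict(["A", "B", "tie"])
rate = agreement_rate([["A", "A", "B"], ["A", "B", "tie"], ["B", "B", "B"], ["A", "A", "A"]])
```

Computed per action class over time, the agreement rate is the number to watch when deciding where the panel needs widening.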

Evaluation as continuous process

Evaluation is not a phase; it is continuous behavior of the production system. Three commitments:

  1. In-production evaluation. A continuously running evaluator (an evaluator-optimizer pattern, Chapter 4) samples real sessions, scores outputs, and surfaces failures for review. The evaluator is itself bounded and observable.

  2. Behavioral metrics over time. Per-week dashboards for bound-trigger rate, governance-deny rate, approval-request rate, cost per session, latency per session, refusal rate. Drift in any of these is a signal.

  3. Periodic adversarial review. Scheduled adversarial testing (prompt injection, jailbreaks, tool-response injection, policy-evasion attempts) against the production system. Not as compliance theater; as actual probing.

These are operational practices, not testing tasks. They sit at the intersection of testing and observability and are owned by the team that operates the system.

The choice of evaluator in continuous evaluation matters. An evaluator that is itself a model call has its own failure modes; an evaluator that is a deterministic rule check is more reliable but covers less. The mature approach is a mix: deterministic checks on the structural properties (schema, policy, governance), model-based checks on the quality dimensions (relevance, coherence, helpfulness). Each is monitored separately; drift in either is investigated.

Cost regression testing

Cost is a first-class quality dimension and the most likely to drift silently. The architectural commitment is:

A test that asserts “average cost remains under X” misses long-tail incidents. A test that asserts “the 99th percentile cost remains under Y” catches them. Operate on percentiles.
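The gap between the two assertions is easy to show with invented numbers. In the sketch below, 97 sessions cost $0.10 and 3 runaway sessions cost $5.00; the budgets of $0.30 (mean) and $1.00 (99th percentile) are assumptions.

```python
import math
from statistics import fmean


def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Invented per-session costs: 97 routine sessions and 3 runaway ones.
session_costs = [0.10] * 97 + [5.00] * 3

mean_cost = fmean(session_costs)
p99_cost = percentile(session_costs, 99)

mean_within_budget = mean_cost < 0.30  # the average-based test passes
p99_within_budget = p99_cost < 1.00    # the percentile-based test fails
```

The mean absorbs the three runaway sessions and stays under budget; the 99th percentile lands on one of them and fails the test, which is the behavior the text recommends.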

A specific cross-version failure deserves a named test: token inflation. A new model can be chattier than its predecessor, emitting 20% more tokens for the identical prompt and task, which is a silent cost regression even when quality and behavior are unchanged. Track token variance per task across model upgrades, not just dollar cost in aggregate, because a per-token price drop can mask a token-count rise (or the reverse) until the bill arrives. A model migration is not cost-neutral until the per-task token distribution has been measured on the new model.
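A per-task token inflation check can be sketched as a ratio of medians between the two model versions. The task name, token counts, and per-million-token prices below are invented to show how a price drop can hide a token-count rise.

```python
from statistics import median


def token_inflation(old_tokens, new_tokens):
    """Per-task ratio of median tokens: new model over old model."""
    return {
        task: median(new_tokens[task]) / median(old_tokens[task])
        for task in old_tokens
        if task in new_tokens
    }


# Invented token counts for the same task replayed on two model versions.
old_tokens = {"summarize_ticket": [1000, 980, 1020]}
new_tokens = {"summarize_ticket": [1200, 1180, 1220]}
ratios = token_inflation(old_tokens, new_tokens)

# Hypothetical prices per million tokens: the new model is cheaper per token.
OLD_PRICE_PER_MTOK = 10.0
NEW_PRICE_PER_MTOK = 8.0
old_cost = median(old_tokens["summarize_ticket"]) * OLD_PRICE_PER_MTOK / 1_000_000
new_cost = median(new_tokens["summarize_ticket"]) * NEW_PRICE_PER_MTOK / 1_000_000

inflated = [task for task, ratio in ratios.items() if ratio > 1.10]
```

Here dollar cost per task falls slightly while tokens rise by a fifth, so a dashboard that tracks only dollars would report an improvement; the token ratio is what exposes the regression.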

Cost regression testing should also be tied to the same trace archive as quality regression testing. A new release’s expected cost can be projected by replaying the archive against the release; significant differences (whole percentiles shifting upward) are caught before release. Without this projection, cost regressions appear after deployment, often invisibly.

Drift detection

Drift is the agentic-system equivalent of slow leaks: behavior changes incrementally, no individual change is alarming, the cumulative effect is significant. The patterns:

The architectural commitment is to make drift observable. Drift that is observed is investigable; drift that is hidden becomes incident.

Reasoning trajectory observability

Layer 3 testing asks whether the final output is good. Layers 1 and 2 ask whether the envelope held. Neither asks whether the agent made progress on the way there — whether each iteration moved the task forward or merely consumed budget retracing the same ground. That reasoning trajectory is observable from the trace alone, without grading the model’s internal chain-of-thought, and it is the earliest warning for several failure modes Chapter 11 names that output-quality metrics detect only after the damage is done.

Per-iteration progress metrics treat each harness turn as a step whose value can be scored from typed trace events, not from prose the model emitted about its progress. Useful signals include: new information introduced (a tool returned data not present in prior turns); state delta (a file changed, a record written, a test status flipped); plan stability (the declared next action aligns with the prior turn’s plan rather than restarting); and monotonic convergence toward a stated goal (remaining subtasks decrease). A session whose iterations show no state delta for several turns while cost and iteration counters climb is stalled even if each individual output looks plausible. The defense Chapter 11 names for infinite and thrashing loops — trace alerting on per-iteration progress — reduces to concrete thresholds: alert when the progress score stays below a floor for N consecutive turns, or when cost-per-progress-unit exceeds a fleet baseline.
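The consecutive-turn alert can be sketched directly. How a turn's progress score is computed is left out; the floor of 0.1 and the window of three turns are assumptions.

```python
def is_stalled(progress_scores, floor=0.1, consecutive_turns=3):
    """True if the progress score stays below the floor for N turns in a row."""
    run = 0
    for score in progress_scores:
        run = run + 1 if score < floor else 0
        if run >= consecutive_turns:
            return True
    return False


# Invented per-turn progress scores derived from typed trace events.
steady = [0.6, 0.4, 0.0, 0.5, 0.3]
stuck = [0.6, 0.05, 0.0, 0.02, 0.0]
```

Because the scores come from state deltas and tool results rather than from the model's own account of its progress, the alert cannot be talked out of firing.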

Thrash signatures (Glossary) are recognizable shapes in the trace that precede bound aborts. Repeated near-identical tool calls with cosmetically different arguments (the same search query rephrased, the same file read several times without intervening writes) indicate the agent is re-deriving rather than building on observations. Oscillating plans — propose migration A, revert to B, return to A — show up as alternating tool sequences without cumulative effect. A high thought-to-action ratio (many model turns per tool call with no forward state change) often correlates with false termination: the model declares done without having executed the tools that would make it true (Chapter 11). These signatures are cheap to compute from the trace structurally; they do not require reading reasoning text.
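One of these signatures, repeated near-identical calls without intervening writes, can be sketched as a structural check. The normalization (lowercasing and sorting the distinct words of the argument), the write-prefix convention for tool names, and the repeat limit of three are all assumptions.

```python
def normalize(argument):
    """Collapse cosmetic differences: case, word order, repeated words."""
    return " ".join(sorted(set(argument.lower().split())))


def repeated_call_signature(calls, limit=3):
    """Return the first (tool, normalized arg) repeated `limit` times with no
    intervening write, or None if no such repetition occurs.

    calls: list of (tool, argument) pairs in trace order. Tools whose name
    starts with "write" are assumed to change state and reset the counts.
    """
    counts = {}
    for tool, argument in calls:
        if tool.startswith("write"):
            counts.clear()
            continue
        key = (tool, normalize(argument))
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= limit:
            return key
    return None


thrashing = [
    ("search", "retry policy docs"),
    ("search", "docs retry policy"),
    ("search", "Retry Policy Docs"),
]
building = [
    ("search", "retry policy docs"),
    ("search", "docs retry policy"),
    ("write_file", "src/retry.ts"),
    ("search", "Retry Policy Docs"),
]
```

The check reads only tool names and arguments from the trace, so it stays cheap and never depends on the model's reasoning text.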

Reasoning-token distribution adds a cost-shaped quality signal. Reasoning models expose, directly or via billing metadata, tokens spent on internal deliberation as distinct from output and tool traffic. Track the ratio of reasoning tokens to tool effects per task class: a refactor session that spends ten thousand reasoning tokens to produce a fifty-line diff may be acceptable once and alarming as a fleet median. Sudden shifts in that ratio after a model upgrade (more reasoning, same outcomes) are a form of behavioral drift distinct from output-quality drift, and they belong on the same dashboards Chapter 18 already uses for cost and latency percentiles. Pair reasoning-token percentiles with progress metrics: rising reasoning cost with flat progress is thrashing before the iteration bound fires.

The architectural commitment is that trajectory observability stays outside the model. The harness records what happened; deterministic code scores progress and fires alerts; bounds still abort runaway sessions. This is Layer 3 observability — outcome and trajectory — not a fourth layer, and it complements rather than replaces golden-trace replay and LLM-as-a-judge evaluation.

The architectural payoff for testing

Each architectural commitment from earlier chapters pays a testing dividend:

A team that has not made these commitments cannot test its agentic system meaningfully. The output of testing such a system is a vague sense that “it works most of the time.” That is not enough for production. The architectural commitments are what turn that vague sense into a testable property.

Building a test program

For a team starting on agentic-system testing, the recommended order of investment is:

  1. Trace discipline. Without traces, nothing else is possible. Build structured trace capture, correlation, and storage first.

  2. Bound and governance tests. Layer 1 testing. Standard practice; deterministic; the substrate everything else rests on.

  3. Property-based and adversarial tests. Layer 2 testing. Begin with the highest-risk scenarios; expand as the system stabilizes.

  4. Replay infrastructure. Once traces exist, build the capability to replay them against the substrate. This enables the most powerful test patterns.

  5. Continuous evaluation. Wire up sampling, scoring, and dashboarding. Run continuously.

  6. Drift monitoring. With the substrate above in place, monitoring becomes a question of which metrics to track and what thresholds to alert on.

  7. Reasoning trajectory metrics. Progress scores, thrash detection, and reasoning-token ratios on production traces (above).

Skipping steps is tempting and unwise. A team that builds continuous evaluation without trace discipline cannot investigate the failures the evaluation finds. A team that builds replay without governance tests cannot trust the replay’s results. The order is sequential because each layer rests on the previous.

Summary

Testing an agentic system requires three layers: deterministic substrate testing (standard practice), agent-behavior-within-the-envelope testing (property-based assertions over bounds and governance), and quality testing (outcome-driven and replay-driven). Trace discipline is the substrate that makes the higher layers possible: structured, correlated, replayable traces. Reasoning trajectory observability — progress per iteration, thrash signatures, reasoning-token ratios — connects envelope compliance to output quality. Quality signals are chosen for their stability across model changes. Evaluation is continuous, not periodic.

The architecture from Chapters 5–10 is what makes all of this practical. Chapter 13 opens the production-facing chapters with the interface through which humans supervise these systems, the first of the chapters that lead through enterprise integration and model routing to the system vignettes of Chapter 16 and the operational discipline of Chapter 18.