Chapter 19. The harness: architecting the agent loop

Part IV closes with the harness design — the loop every prior chapter has attached to but none has yet specified.

Every chapter so far has described something that attaches to the agent: bounded autonomy (Chapter 5) bounds it, governance (Chapter 6) gates it, memory (Chapter 7) feeds it, skills (Chapter 10) extend it, the trace (Chapter 12) records it. None of them has designed the thing they all attach to. The concept of the harness, the deterministic envelope around the model, was introduced in Chapter 4; this chapter designs it.

The loop and the code that runs it are the harness: the deterministic program that turns a model into an agent. The book’s thesis, that reliability lives in the shell and not the model, names this directly, because the harness is the shell. It is where the bounds are checked, the governance pipeline is called, the context is built, the tool calls are dispatched, and the trace is written. A team can adopt every pattern in this book and still ship an unreliable system if the harness is an afterthought, because the harness is the one component that touches all the others on every turn. It is the last thing the book designs because it is the thing that holds the rest together.

Most teams adopt an agent framework rather than writing the loop by hand; the section on retrofitting, later in this chapter, develops what production discipline requires around that default.

This is an architecture chapter, not an agent-building tutorial. It does not teach prompt-craft, model selection, or how to make a model reason better; those move too fast and the book stays above them (Preface). It teaches how to design the deterministic envelope the model runs inside and, as its central practical question, how to decide where a new capability belongs: in harness code, in a tool, in a skill, or in a new agent.

The harness is the envelope, not the inner loop

A common industry usage treats the harness as everything that is not the model: the system prompt, the tools, the filesystem, the sandbox, the loop, all in one bag. The broad usage is widespread enough to appear not only in blog posts but in internal architecture documents, where a request interceptor with a model call in the middle gets labeled a “harness engine” despite having no loop and no agent. This book uses a narrower definition, and the distinction is decisive: the harness is the deterministic loop and gateway that assembles context, calls the model, parses intent, dispatches each proposed action through the bounds and governance, observes the result, and decides whether to loop again. The system prompt is content the harness carries, not the harness itself; the filesystem, the sandbox, and the browser are tools the harness governs, not parts of it. Keeping the harness narrow is what keeps “the model proposes; the harness disposes” sharp: if the prompt and the tools are “the harness,” then editing a prompt is “harness engineering,” and the line between content and architecture dissolves. The narrower definition is one this book can enforce and test; the broader one is not.

A reasonable objection arises immediately. Chapter 1 argued that the reason–act–observe loop has been absorbed into the models: a modern reasoning model plans, acts, and self-critiques without external orchestration code, and Chapter 4 marked ReAct, Plan–Execute, and Reflection as largely model-internal by 2026. If the loop is inside the model, what is left to design?

The answer is the distinction between the inner loop and the envelope. What migrated into the model is the inner cognitive loop, the back-and-forth of reasoning, deciding which tool to reach for, and reasoning again about what to do next. What did not migrate is everything around it: assembling the context the model reasons over, executing the actions it proposes, enforcing the bounds, calling the governance pipeline, persisting state, and recording the trace. Chapter 1 named these in a single sentence, “bounding it, governing it, persisting its state, observing it, and recovering from its failures.” This chapter is the depth treatment of that sentence.

One distinction inside this is easy to blur and essential to keep sharp: tool selection can move into the model, but tool execution should not. The model deciding, mid-reasoning, that it wants to call a tool is part of the inner loop; the call itself, and the observation it returns, are the harness’s work, because the call is where a bound can refuse, a gate can escalate, and the trace can record. The discriminating question for any tool is therefore: can the harness refuse or modify this call before its effect occurs? Where execution returns to your code (an explicit tool loop in a graph or agent runtime, or a “computer-use” action your process carries out), the answer is yes even when selection happened inside the model. The boundary between deciding and doing is logical, not a matter of which call the decision happens in, and it holds.

The case that breaks this is provider-hosted execution: a server-side code interpreter, a hosted search, or a provider-resident tool that the model both selects and runs inside a single API call, handing your code a finished result it never had the chance to gate. There the seam is not merely moved; it is lost. You cannot refuse, re-scope, or meter pre-flight an effect that has already happened on someone else’s infrastructure. So the seam is a constraint the team imposes, not a fact about every platform. Effectful and irreversible capability must run on the harness side of it (as client-executed tools, or as a provider tool sandboxed so its effects cannot escape), while provider-hosted execution is reserved for the read-only or idempotent (a web search, a lookup), where a completed result is acceptable. This is the line Chapter 4 drew: for the patterns that became “mostly model-internal,” what stayed architectural was the bounding, the observability, and the tool surface. It is also what Chapter 5 means in calling the bounding layer the first surface the agent’s outputs encounter. A provider-hosted effector is that first surface conceded.
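The placement rule in the preceding paragraph lends itself to a mechanical check at startup. The following is a minimal sketch, assuming a hypothetical `ToolSpec` registry record; the field names (`effectful`, `executes_on`, `sandboxed`) are illustrative and do not come from any particular SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    # Hypothetical registry record; field names are illustrative assumptions.
    name: str
    effectful: bool          # True if the tool changes state or is irreversible
    executes_on: str         # "client" (harness side) or "provider" (hosted)
    sandboxed: bool = False  # provider tool whose effects cannot escape


def placement_violations(tools):
    """Return the names of tools that concede the execution seam."""
    return [
        t.name
        for t in tools
        if t.effectful and t.executes_on == "provider" and not t.sandboxed
    ]


registry = [
    ToolSpec("web_search", effectful=False, executes_on="provider"),
    ToolSpec("issue_refund", effectful=True, executes_on="client"),
    ToolSpec("hosted_code_exec", effectful=True, executes_on="provider"),
    ToolSpec("sandboxed_exec", effectful=True, executes_on="provider", sandboxed=True),
]
violations = placement_violations(registry)
```

A harness might refuse to start while `violations` is non-empty, which turns the seam from a convention into a deploy-time check.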

So the harness is not the inner loop; it is the deterministic envelope around it. The practical consequence is that a single harness turn may wrap several model-internal steps: the harness assembles a context and makes one model call, and inside that call the model may reason and decide which tools it wants several times before it returns its proposed actions. The harness sees the boundary (what went in, what actions came out) and executes those actions itself, through the gateway. It is deliberately blind to the deliberation in between, the same way the bounding layer is blind to the model’s reasoning. As models absorb more of the inner cognitive loop, the envelope does not shrink; if anything it carries more weight, because more of the system’s behavior is decided inside an opaque call that only the envelope can constrain. The exception is the one just named: where a provider absorbs execution as well as selection, the envelope governs less, which is precisely why the execution seam is the one to defend.

The loop, structurally

Concretely, the harness is the code that runs one turn and decides whether to run another. Its responsibilities, in order:

1. Assemble context. Build the input the model will reason over this turn: the system instruction, the relevant memory (Chapter 7), any loaded skills (Chapter 10), the tool declarations, and the working state of the task. This is a construction step, not an accumulation step, a point developed below.
2. Call the model. The single probabilistic step. Everything before and after is deterministic code the team owns.
3. Parse intent. Interpret the response as structured intent (either a final answer or one or more proposed tool calls with arguments) rather than treating free text as if it were a command.
4. Dispatch through the gateway. Route every proposed action through the bounding layer (Chapter 5) and the governance pipeline (Chapter 6) before it touches anything. The model proposes; the harness disposes.
5. Observe and update. Feed each result (success, refusal, or escalation) back as an observation, update memory and the bounds ledgers, and write the trace (Chapter 12).
6. Decide whether to continue. Check the iteration, cost, and deadline bounds, and loop again or terminate.

The shape of one turn:

The per-action gateway internals (the action-surface check, the per-call cost and data-scope checks, and reversibility routing) are developed as pseudocode in the Concord worked example (Chapter 17). What matters here is the loop around the gateway and the loop-level bounds the harness itself enforces: the iteration counter, the session deadline, and the running cost budget, checked before the expensive call rather than after (Chapter 5). In skeletal form:

def run_turn(session, task_state):
    # The deterministic envelope. call_model() is the only DIRECTLY probabilistic
    # line; a tool or sub-agent reached via the gateway may wrap its own model call.
    if session.now() >= session.deadline or session.over_budget():
        return Aborted("time or cost budget")         # Ch 5: check before the costly call
    context = assemble_context(session, task_state)   # memory, skills, tools, state
    response = call_model(context,
                          max_tokens=session.remaining_token_budget())
    session.trace("model.responded", response.summary)

    intent = parse_intent(response)                   # final answer or proposed actions
    if intent.is_final:
        return Done(intent.answer)

    for action in intent.actions:
        # the gateway enforces every bound axis AND the governance pipeline (Ch 5, 6, 17)
        result = gateway(session, action)             # may allow, refuse, or escalate
        task_state.observe(action, result)            # refusals are observations too
        session.memory.record(action, result)         # Ch 7
        session.trace("action.result", action, result)

    session.iter_count += 1
    if session.exceeds_any_bound():                   # iteration, cost, deadline (Ch 5)
        return Aborted("bound exceeded")
    return Continue(task_state)

Two things are worth stating plainly about this skeleton. First, the only directly probabilistic line is call_model; the tools and sub-agents reached through the gateway may each wrap a model call of their own, but every one of those is itself a bounded step, and the harness code between them is ordinary software, testable, reviewable, and version-controlled (Chapter 12). Second, the harness never lets a model output reach the world directly. The model returns intent; the harness decides what becomes an effect. That separation is not an implementation convenience. It is the architectural seam on which the entire discipline depends, and it is the subject of the next section.
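Because the code around the model call is ordinary software, the loop can be exercised end to end with the model stubbed out. The sketch below is a toy under stated assumptions, not the skeleton's real collaborators: `fake_model`, the dictionary-shaped responses, and the one-tool `gateway` are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    max_iters: int
    iter_count: int = 0
    trace: list = field(default_factory=list)


def fake_model(context):
    # Stand-in for the model call: proposes one action, then a final answer.
    if not context["observations"]:
        return {"actions": [{"tool": "lookup", "args": {"key": "order-1"}}]}
    return {"final": "done after %d observation(s)" % len(context["observations"])}


def gateway(session, action):
    # Stand-in gateway: only "lookup" is on the action surface.
    if action["tool"] != "lookup":
        return {"status": "refused"}
    return {"status": "ok", "value": "shipped"}


def run(session, model):
    observations = []
    while session.iter_count < session.max_iters:    # bound checked before the call
        response = model({"observations": observations})
        session.trace.append(("model.responded", response))
        if "final" in response:
            return ("done", response["final"])
        for action in response["actions"]:
            result = gateway(session, action)        # refusals are observations too
            observations.append((action, result))
            session.trace.append(("action.result", action, result))
        session.iter_count += 1
    return ("aborted", "iteration bound")


outcome = run(Session(max_iters=5), fake_model)
# A model that never finishes is stopped by the loop-level bound, not by the model.
runaway = run(Session(max_iters=3),
              lambda ctx: {"actions": [{"tool": "delete", "args": {}}]})
```

The second call is the useful one: the stub keeps proposing a refused action, and it is the iteration bound in the loop that ends the session.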

Reasoning versus hands and eyes

The most useful principle for designing the seam is a division of labor that runs through the whole book without ever being named: the model is the system’s reasoning organ, and it has no hands or eyes of its own. It cannot read a file, call an API, write a record, or perceive the state of the world. It can only emit text proposing that one of those things happen. Every hand (a tool that acts) and every eye (a tool that perceives or retrieves) is deterministic code the harness owns. Chapter 4 drew the line precisely: the cognitive act of deciding to reach for a tool is distinct from the action of executing the call; the model decides which tool, the architecture decides which tools exist, with what authorization, and with what logging.

The architectural rule that follows is simple to state and governing in practice: the model decides; the harness acts. No effector is wired directly to a model output, and where a platform would run one for you (the provider-hosted execution discussed above), effectful tools are kept on the harness side of the seam. The value of routing every hand and eye through harness code is that each such routing is a place where a bound can be checked, a policy gate can fire, an argument can be validated against a schema, and a side effect can be recorded in the trace. A model given a direct effector is a model that has escaped governance; a model whose every effect passes through the harness is a model whose every effect is bounded, observed, and recoverable. “Hands and eyes” is just an informal gloss on this rule, a way to remember that perception and action are the harness’s job, and reasoning is the model’s.
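A compact way to see why the routing matters is a gateway in which each check has a single, visible place to attach. This is a minimal sketch under stated assumptions: the `Gateway` class, its `surface` mapping, and the type-based schema check are hypothetical simplifications, far thinner than the bounding layer and governance pipeline of Chapters 5 and 6.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    # Maps tool name -> (argument types, implementation). Hypothetical shape.
    surface: dict
    trace: list = field(default_factory=list)

    def dispatch(self, tool, args):
        if tool not in self.surface:                   # action-surface check
            return self._record(tool, args, "refused", "unknown tool")
        schema, impl = self.surface[tool]
        bad = [k for k, typ in schema.items() if not isinstance(args.get(k), typ)]
        if bad or set(args) - set(schema):             # schema check at the boundary
            return self._record(tool, args, "refused", "schema violation")
        return self._record(tool, args, "ok", impl(**args))

    def _record(self, tool, args, status, detail):
        entry = {"tool": tool, "args": args, "status": status, "detail": detail}
        self.trace.append(entry)                       # every outcome is traced
        return entry


sent = []
gw = Gateway(surface={
    "send_email": ({"to": str, "body": str},
                   lambda to, body: sent.append(to) or "queued"),
})
ok = gw.dispatch("send_email", {"to": "a@example.com", "body": "hi"})
bad = gw.dispatch("send_email", {"to": 42, "body": "hi"})
unknown = gw.dispatch("drop_table", {"name": "users"})
```

Only the first call reaches the effector; the other two become refusals that are still recorded, so the model can observe them and the trace can account for them.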

The principle cuts in two directions, and both are design errors when ignored. Pushing into the model what should be deterministic (asking it to compute a total, enforce a policy, or format an identifier in prose) takes a guarantee the harness could have made for free and replaces it with a probability. Pushing into code what genuinely requires judgment (hard-coding a branch the task actually needs the model to reason about) throws away the one thing the model is for. Designing the harness well is largely the discipline of drawing this line in the right place, capability by capability, which is the decision the rest of the chapter is about.

Context is assembled, not accumulated

Step one of the loop, assemble context, is where several prior commitments — a bounded window (Chapter 3), memory gateway scoping (Chapter 7), skills loading (Chapter 10) — converge into a single harness responsibility. The context window is the model’s entire field of view for a turn, and it is a bounded, costly resource. The naive harness treats it as an accumulator: it appends every turn, every tool result, and every document to a growing transcript and passes the whole thing back to the model. That harness exhausts its window, pays to re-read stale material on every call, and buries the relevant signal in noise — the prompt stuffing failure mode (Glossary, Chapter 11).

The disciplined harness assembles the context for each turn rather than accumulating it, using moves that earlier chapters introduced: a large, stable prefix for cache reuse with the volatile material last (Chapter 18); retrieval-mediated context rather than stuffed context (Chapter 7); progressive disclosure of skills with eviction when their task ends (Chapter 10); and a deliberate per-turn window budget. The harness is where these are exercised because it is the component that builds the input; no other layer can.

The framing worth carrying away is that context is constructed on every turn. The harness is the constructor. Treating the context as an append-only log is the most common way an otherwise sound architecture becomes slow, expensive, and unreliable in production.
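One way to picture construction rather than accumulation is a function that rebuilds the input from its parts under an explicit budget. The sketch below is illustrative only: the signature is hypothetical, the whitespace word count is a crude stand-in for a real tokenizer, and the relevance scores are assumed to come from whatever retrieval layer sits behind the memory gateway.

```python
def assemble_context(system, tool_decls, retrieved, working_state, budget_tokens,
                     count_tokens=lambda s: len(s.split())):
    """Build one turn's input: stable prefix first, volatile material last.

    count_tokens defaults to a whitespace word count (an assumption, not a tokenizer).
    """
    prefix = [system, *tool_decls]                  # stable across turns, cache-friendly
    used = sum(count_tokens(p) for p in prefix) + count_tokens(working_state)
    if used > budget_tokens:
        raise ValueError("prefix and working state exceed the window budget")

    selected = []
    for score, snippet in sorted(retrieved, reverse=True):   # most relevant first
        cost = count_tokens(snippet)
        if used + cost > budget_tokens:
            continue                                 # skip what does not fit
        selected.append(snippet)
        used += cost
    return prefix + selected + [working_state], used


context, used = assemble_context(
    system="You are a support agent.",
    tool_decls=["lookup_order(id)"],
    retrieved=[(0.9, "refund policy: 30 days"),
               (0.2, "old unrelated thread " * 50),
               (0.7, "order 17 shipped Monday")],
    working_state="task: answer refund question",
    budget_tokens=25,
)
```

Nothing from a previous turn is carried forward implicitly: the long, low-relevance thread is simply not selected, and the working state lands last so the stable prefix stays byte-identical between turns.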

Where capability lives

The recurring question a team faces once the harness exists is practical: a new capability is needed; where does it go? The book’s earlier chapters answer fragments of this. This is the place to answer it whole, because the choice is the single most consequential design decision the harness forces, and it is what the “code way versus the emerging skills way” of building agents is about.

First separate capability from knowledge. A new fact, corpus, or retrieval source is not a capability; its home is memory and the ingestion pipeline (Chapter 7, Chapter 8). A change to the agent’s standing behavior is a change to the system instruction the harness assembles, not a new component. For executable or procedural capability, there are four homes, and they differ in what they cost to build, to change, and to govern.

Harness code. When the behavior must be guaranteed and does not itself touch the world (internal logic that is identical every time and not subject to the model’s judgment), it belongs in deterministic code: control flow, enforcement, parsing, retries, formatting, the bounds themselves. Code is the cheapest home at runtime and the most expensive to change, because changing it means a deployment. Anything essential for correctness or safety lives here. The bounds are code; the governance pipeline is code; the loop is code.

A tool. When the agent needs to act on or perceive the world (a new verb or a new sense), the capability is a tool: an effector or sensor admitted to the governed action surface (Chapter 5). A tool is the unit the bounding layer governs and the trace records. Its implementation may be perfectly deterministic code; “tool” names the governed surface and the fact of an external effect, not the presence of judgment. Add a tool when the agent needs to do something it structurally cannot do today, and accept that every tool widens the action surface and so must be justified against it.

A skill. When what is changing is not the tools but the know-how for using existing tools (a procedure, a project convention, a workflow over verbs the agent already has), the capability is a skill (Chapter 10): procedural knowledge, loaded at runtime, editable without a deployment. A skill is the right home for the large and fast-moving body of “how we do things here.” Reach for a skill instead of writing a new prompt or a new playbook, and instead of hard-coding a procedure that will change next month.

A new agent. When the work needs its own bounded context, action surface, or reasoning budget, and only when coordination genuinely benefits from the separation, the capability is a new agent: a sub-loop with its own envelope (Chapter 9). This is the most expensive home in every dimension, and the book’s standing caution holds: most production systems are a single agent with tools, and a new agent should be the last resort, not the first reach.

A short decision sequence resolves most cases:

The sequence is a heuristic, not a strict partition: a sub-agent acts on the world too, and a tool’s implementation is itself code. But it resolves the common case by asking the cheapest distinguishing questions first: is this knowledge rather than capability, and if capability, does it need a new effector or sensor the agent lacks, or only new know-how over the tools it already has? The trade-offs behind the sequence are worth making explicit, because they are what the choice is trading. Iteration speed rises as the capability moves from code toward skills: a skill changes without a deployment; code does not. Deployment risk falls in the same direction for the same reason. But governance surface and testability move the opposite way: code is the most testable and the most tightly governed home, a skill the least, because a skill is content that arrives at runtime and cannot be unit-tested the way a function can. The pull toward skills is therefore real and often correct, since they are fast and cheap to change, but it is also a pull away from the parts of the system that are easiest to verify.
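Read as code, the questions form a short chain of early returns. What follows is one possible reading, reconstructed from the four homes described above; the `Need` record, its field names, and the exact ordering of the checks are assumptions made for illustration, not a definitive statement of the sequence.

```python
from dataclasses import dataclass


@dataclass
class Need:
    # Hypothetical description of a requested change; all fields are illustrative.
    is_knowledge: bool = False              # new fact, corpus, or retrieval source
    changes_standing_behavior: bool = False
    must_be_guaranteed: bool = False        # identical every time, no model judgment
    touches_world: bool = False             # acts on or perceives something external
    has_existing_tool: bool = True          # the verbs it needs are already admitted
    needs_own_envelope: bool = False        # own context, action surface, or budget


def capability_home(n: Need) -> str:
    if n.is_knowledge:
        return "memory"                     # not a capability at all
    if n.changes_standing_behavior:
        return "system instruction"
    if n.must_be_guaranteed and not n.touches_world:
        return "harness code"
    if n.touches_world and not n.has_existing_tool:
        return "tool"
    if n.needs_own_envelope:
        return "new agent"                  # the last resort, asked late
    return "skill"                          # know-how over existing tools


homes = [
    capability_home(Need(is_knowledge=True)),
    capability_home(Need(must_be_guaranteed=True)),
    capability_home(Need(touches_world=True, has_existing_tool=False)),
    capability_home(Need(touches_world=True)),
    capability_home(Need(needs_own_envelope=True)),
]
```

Being a heuristic, the function will misplace edge cases; its value is in forcing the cheap questions to be answered explicitly before the expensive homes are considered.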

This is exactly the place where the chapter must not undercut Chapter 10, so the constraint is stated as a rule rather than a caution. Know-how may move into skills freely; power may not. The action surface, the bounds, and the enforcement never move into a skill, because a skill is the agent reading a document, not the document granting authority. A capability that needs a new tool needs that tool admitted to the action surface in code, no matter how convenient it would be to let a skill conjure it. The emerging “skills way” of building agents is powerful precisely because it lets teams stabilize the hardened core (the loop, the action surface, the governance) in code while iterating on procedure as content. It becomes dangerous the moment a team uses it to smuggle capability past the core it was supposed to sit on top of.
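The rule can be enforced at the point where a skill is loaded. A minimal sketch, assuming a hypothetical skill format with `name`, `body`, and `uses_tools` keys and an action surface held in a code-level constant; none of these names is prescribed by Chapter 10.

```python
ADMITTED_TOOLS = frozenset({"read_ticket", "draft_reply"})   # changed only by deployment


def load_skill(skill: dict, admitted=ADMITTED_TOOLS):
    """Load a skill's procedure; refuse any tool it names that code has not admitted."""
    requested = set(skill.get("uses_tools", []))
    unknown = requested - admitted
    if unknown:
        raise PermissionError(
            f"skill {skill['name']!r} names unadmitted tools: {sorted(unknown)}"
        )
    # The skill contributes text only; loading it leaves the tool surface unchanged.
    return {"instructions": skill["body"], "tools": sorted(requested)}


loaded = load_skill({
    "name": "refund-triage",
    "body": "Read the ticket, then draft a reply citing the refund policy.",
    "uses_tools": ["read_ticket", "draft_reply"],
})
```

A skill that lists a tool outside `ADMITTED_TOOLS` fails to load instead of quietly widening the surface, which keeps the admission decision in code review where it belongs.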

Tool design as harness design

When an agent seems to reason badly, audit the tool surface before the model. Poor tool design is often the root cause of apparent reasoning failures, because tools are the harness’s hands and eyes and their design is part of the harness’s design. A few principles recur. Prefer coarse, intention-revealing tools to thin wrappers over raw power: a tool that exposes governed, meaningful operations is safer and more legible than a raw query interface that hands the agent the whole database and hopes. This is the semantic-layer argument of Chapter 14. Keep the surface small: a focused agent is well served by a handful of tools, not a registry of hundreds, both because a large surface is an attack surface and because it dilutes the model’s choice (Chapter 5). Enforce the schema at the call boundary, in the harness, rather than trusting the model to honor it. And treat each tool’s name and description as part of its contract: they are what the model reasons over when it decides whether to reach for the tool, so a drifted description silently degrades behavior with no error raised (Chapter 11). When a tool tries to do too many unrelated things, split it; when several tools are always called in lockstep, consider whether one coarser tool would govern more cleanly. The heuristic pairs with trace replay (Chapter 12): when debugging a cascade, start with what the tools exposed, not only what the model said.
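Treating the name and description as contract suggests pinning them the way a dependency is pinned. The following sketch shows one way that might look; the fingerprinting scheme and the dictionary shape of a tool declaration are assumptions for illustration, not a mechanism the book prescribes.

```python
import hashlib
import json


def contract_fingerprint(name, description, schema):
    """Hash the parts of a tool declaration that the model reasons over."""
    payload = json.dumps(
        {"name": name, "description": description, "schema": schema},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drifted(tools, pinned):
    """Return names of tools whose current contract differs from the pinned one."""
    return sorted(
        t["name"]
        for t in tools
        if pinned.get(t["name"])
        != contract_fingerprint(t["name"], t["description"], t["schema"])
    )


tools = [{"name": "lookup_order",
          "description": "Fetch one order by id.",
          "schema": {"id": "string"}}]
pinned = {t["name"]: contract_fingerprint(t["name"], t["description"], t["schema"])
          for t in tools}

unchanged = drifted(tools, pinned)
tools[0]["description"] = "Fetch orders."     # a silent edit to the description
changed = drifted(tools, pinned)
```

Run in continuous integration, a check like this makes a description edit visible as a failed build, which is the error the text notes would otherwise never be raised.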

Retrofitting bounds and governance into a framework

Most teams will not write the loop above by hand. They adopt an agent framework (LangGraph, a provider Agent SDK, or an in-house orchestrator) that ships a default harness: the loop, the state plumbing, the tool dispatch. The default is real but partial. It gives you the skeleton of steps 1 through 6, but it does not give you the bounds, the governance pipeline, the context discipline, or the trace, and it is built around its own assumptions about where tools execute. The production harness this chapter describes is mostly what you build around that default, and the practical question is not whether to adopt a framework but where, once it runs an agent loop, your code can still refuse a tool call before its effect occurs. That question has a small number of answers, and they map onto the seams a framework either exposes or does not.

Four interception points are what matter, in the order a request passes through them. The context seam is where the framework assembles the prompt before the model call; a framework that lets you hook context construction lets you enforce the assembled-not-accumulated discipline, scope retrieval through the memory gateway, and inject the loaded skills’ declarations (Chapter 7, Chapter 10). The model-call seam is the single probabilistic step; a framework that routes inference through a pluggable client lets you insert the model gateway for egress filtering, capability tiering, and cost attribution (Chapter 15), and set the per-call max_tokens that bounds a single generation against the remaining budget (Chapter 5). The tool-execution seam is the load-bearing one: it is the point at which the framework, having received the model’s proposed tool call, executes it. A framework that exposes this seam as a dispatch hook lets you route every proposed action through the bounding gateway and the governance pipeline before it touches anything, which is exactly the interception the loop above depends on. And the observation seam is where the tool’s result returns to the model; a framework that lets you transform the observation lets you sanitize tool responses, enforce output schemas, and write the structured trace (Chapter 6, Chapter 12).

What separates a framework you can govern from one you cannot is whether the tool-execution seam sits on your side of the call or the provider’s. A graph engine or in-house orchestrator that runs tools in your process exposes the seam directly: you wrap the dispatcher, and every effect passes through your gateway. A provider Agent SDK that runs provider-hosted tools inside a single API call does not expose the seam at all, because the effect has already happened on the provider’s infrastructure before your code sees the result. That is the provider-hosted execution case this chapter warned against, and it is the one a retrofit cannot fix in code: the framework has conceded the execution seam at design time, and no wrapper you add around the SDK restores the refusal point the effect has already passed. The retrofit discipline for a framework that does this is to read its tool list as a contract, keep every effectful or irreversible tool on your side of the seam (client-executed tools, or a provider tool sandboxed so its effects cannot escape), and restrict the provider-hosted tools to the read-only or idempotent (a web search, a lookup), where a completed result is acceptable. Where a framework offers both hosted and client-executed tools, the seam makes the choice for you: anything effectful runs client-side.
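Where the seam is on your side, the retrofit is often no more than wrapping the dispatcher. The sketch below uses `ToyFramework`, a deliberately hypothetical stand-in with an `execute_tool` method; real frameworks expose this seam under different names and shapes, so treat the hook mechanics as an assumption.

```python
class ToyFramework:
    """Stand-in for a framework that runs tools in your process (hypothetical)."""

    def __init__(self, tools):
        self.tools = tools

    def execute_tool(self, name, args):
        return self.tools[name](**args)

    def step(self, proposed):
        # The framework's own loop calls execute_tool; this code is not modified.
        return [self.execute_tool(n, a) for n, a in proposed]


def install_gateway(framework, allow):
    """Wrap the dispatcher so every proposed call passes `allow` first."""
    inner = framework.execute_tool

    def governed(name, args):
        if not allow(name, args):
            return {"refused": name}       # the refusal becomes an observation
        return inner(name, args)

    framework.execute_tool = governed      # instance attribute shadows the method
    return framework


deleted = []
fw = install_gateway(
    ToyFramework({
        "read": lambda path: f"contents of {path}",
        "delete": lambda path: deleted.append(path),
    }),
    allow=lambda name, args: name == "read",
)
results = fw.step([("read", {"path": "a.txt"}), ("delete", {"path": "a.txt"})])
```

The framework's `step` is untouched, yet the delete never runs. No equivalent wrapper exists for a provider-hosted tool, because there is no in-process dispatcher to wrap.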

The retrofit proceeds in the same three stages as the brownfield case (Chapter 18), because a framework’s default harness is, from the architecture’s standpoint, an ungoverned agent that happens to run. Trace first: wrap the framework’s model-call and tool-execution hooks to emit a structured trace, which ends the blindness and surfaces the real distributions of cost, iteration, and action before anything is bounded. Bounds second: interpose the bounding gateway at the tool-execution seam, with ceilings set at the observed limits plus headroom, so the runaway cases the trace surfaced now abort. Governance third: route the consequential actions through schema validation, policy gates, and approval, beginning with the single highest-risk action and broadening as confidence grows. The framework’s loop is untouched throughout; what changes is what your code does at the seams the framework exposes, and the discipline is the same one the brownfield retrofit follows, because the problem is the same: a running loop that must be hardened without a rewrite.
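The step from trace to bounds can be made concrete as a small calculation over the observed distribution. This is a sketch under explicit assumptions: the nearest-rank percentile, the default percentile of 0.95, and the 25 percent headroom are illustrative choices, not values the text specifies.

```python
import math


def ceiling_from_trace(observed, percentile=0.95, headroom=0.25):
    """Nearest-rank percentile of observed values, plus fractional headroom."""
    if not observed:
        raise ValueError("no observations; trace first, bound second")
    ranked = sorted(observed)
    rank = max(1, math.ceil(percentile * len(ranked)))
    return math.ceil(ranked[rank - 1] * (1 + headroom))


# Iterations per session as surfaced by the trace; 40 is the runaway case.
iterations = [3, 4, 4, 5, 6, 7, 9, 40]
ceiling = ceiling_from_trace(iterations, percentile=0.75)
would_abort = [n for n in iterations if n > ceiling]
```

Choosing the percentile is the judgment call: set it at the maximum and the runaway session defines the ceiling; set it lower and the ceiling reflects normal work, so the outlier is what aborts.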

Two practical tests tell you whether the retrofit has landed. The first is the refusal test: can you write a test in which a bound or a policy gate refuses a proposed action, and assert that the action never reached its downstream effect? If the framework swallows your refusal and runs the tool anyway, the tool-execution seam is not really on your side, and the framework cannot carry this book’s discipline. The second is the trace test: does the framework emit the events the governance-event matrix of Chapter 12 requires, or does it emit only its own logs? If the latter, the framework’s observability is for debugging its loop, not for governing yours, and you must wrap the seams to emit the typed events the trace store expects. A framework that passes both tests is governable; one that fails either is a component to build around, with your own loop, rather than to build on.
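Both tests are small enough to write as ordinary functions. In the sketch below the dispatcher signature, the `status` field, and the two event names (borrowed from the skeleton earlier in the chapter) are assumptions; the actual required events are whatever the Chapter 12 matrix lists.

```python
def refusal_test(dispatch, effect_log):
    """Pass only if a refused action leaves no downstream effect."""
    before = len(effect_log)
    result = dispatch("wire_transfer", {"amount": 10_000})
    return result.get("status") == "refused" and len(effect_log) == before


def trace_test(emitted_events, required):
    """Return the required event types that were never emitted."""
    return sorted(set(required) - {e["type"] for e in emitted_events})


effects = []


def governed(tool, args):
    return {"status": "refused"}            # the effect never runs


def swallowing(tool, args):
    effects.append((tool, args))            # the tool ran despite the refusal
    return {"status": "refused"}


governed_ok = refusal_test(governed, effects)
swallowing_ok = refusal_test(swallowing, effects)
missing = trace_test(
    [{"type": "model.responded"}],
    required={"model.responded", "action.result"},
)
```

The second dispatcher reports a refusal while performing the effect, which is exactly the failure the refusal test exists to catch: the assertion is on the downstream state, not on the status the framework returns.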

Harness anti-patterns

The failure-mode catalog (Chapter 11) treats most of these at length; they are named here only as the harness-level shapes to watch for, with pointers to where each is developed. Reasoning in prose for what should be code: asking the model to enforce a limit or compute a value the harness could guarantee (Chapter 5, Chapter 6). The kitchen-sink tool surface: exposing every tool to every task instead of scoping the surface to what the task needs (Chapter 5). Prompt stuffing: accumulating context instead of assembling it (Chapter 11). Anthropomorphizing the loop: treating the model as the system and the harness as plumbing, when the architecture is the reverse, with the harness as the system and the model as a bounded component within it. Skill sprawl without eviction: loading procedural content that never leaves the window (Chapter 10). Each is a way of forgetting that the harness, not the model, is where the team’s guarantees live.

Summary

The harness is the deterministic envelope that turns a model into an agent: it assembles context, calls the model, parses intent, dispatches every action through the bounds and governance, observes the result, and decides whether to continue. The inner cognitive loop has largely moved into the model; the envelope has not. As the model absorbs more reasoning the envelope carries more, so long as execution stays on the harness side of the seam, which is why provider-hosted execution of effectful tools is the case to avoid. The principle that organizes the design is that the model reasons and the harness acts: no effector runs outside the harness’s reach, so that every hand and eye passes through a place where a bound, a gate, a schema check, and a trace entry can attach. Context is assembled on every turn, not accumulated. And the recurring question of where a new capability belongs has, for executable capability, four answers: code for guaranteed internal logic, a tool for a new hand or eye, a skill for know-how over existing tools, and a new agent only when the work needs its own envelope. New knowledge goes to memory instead. The firm constraint is that know-how may move into skills but power may not. Design the harness as the system, with the model as the one bounded component inside it, and the rest of the book’s discipline has something solid to attach to.