Chapter 8. The ingestion pipeline: Architecting semantic memory

Chapter 7 developed memory from the agent’s side: a tiered system reached through a gateway that enforces scoping, retrieval policy, and write-time governance. That treatment solved the read path. For semantic memory (the organization’s curated facts, rules, codebases, and domain knowledge), it left the harder enterprise problem open: how does the corporation’s chaotic, dynamic, access-controlled data enter the semantic store in the first place, and how does it stay correct as the source changes?

Much of the agentic literature treats retrieval-augmented generation as a solved problem: point a crawler at a wiki, split the text into fixed-size chunks, embed them, and load a vector index. That is a prototype, not an architecture. In production it yields stale answers, contradictory retrievals, and cross-tenant data leaks. This chapter treats semantic-memory ingestion as what it actually is: an extract, transform, load (ETL) pipeline subject to the same engineering discipline as any other data product, plus concerns specific to a probabilistic consumer.

The chapter’s central claim is that the reliability of an agentic system is upper-bounded by the deterministic rigor of its ingestion pipeline. Governance at read time (Chapter 6) can refuse to surface a poisoned record only if it can tell the record is poisoned. It cannot un-poison a store whose contents are stale, mislabeled, or stripped of the access controls that governed their source. The write path is where most of memory’s correctness is won or lost.

The blind-ingestion anti-pattern

The dominant failure in enterprise memory is blind ingestion: a job crawls a repository, splits documents by character count, embeds the fragments, and writes them to an index with no further structure. The approach demonstrates well and fails in four specific ways.

  1. Access-control collapse. A confidential document (a layoff plan, an unreleased financial result) is embedded into an index that does not carry the source system’s role-based access control. A similarity search run for a user who could never open the original surfaces its contents anyway. The vector store has quietly become a channel that bypasses the access controls of every system it ingested. This is the confused-deputy failure (Chapter 11) applied to data rather than tools.

  2. Context poisoning. Documents carrying personally identifiable information (PII) or secrets are embedded verbatim. The agent later retrieves them and leaks the sensitive content into an outbound message or a tool call. No prompt instruction reliably prevents this, because the model is asked to ignore data that is already in its context.

  3. Staleness and contradiction. A policy is revised. The new version is ingested; the old version is never removed. The store now holds two contradictory chunks, and similarity search has no notion of which is current. The agent confidently acts on superseded policy.

  4. Loss of lineage. An answer is produced from a retrieved chunk, but the chunk cannot be traced to its source document, author, or version. When the answer is wrong, no one can find or fix the origin, and the error recurs.

Each failure is a property of the ingestion pipeline, not of the model. The architectural response is to treat the write path as a governed supply chain.

The ingestion pipeline

The architectural commitment is that data does not enter semantic memory directly. It passes through a deterministic pipeline that redacts, tags with identity and lineage, enriches, and, where the data warrants it, extracts structure, all before anything is stored.

Figure 5. The ingestion pipeline

The pipeline rests on six commitments. The first three protect the store from the data; the last three make the data useful and current.

Pre-embedding redaction

A reasoning model cannot be relied on to ignore sensitive data already present in its context, so data-loss prevention must run before embedding, not after retrieval. The pipeline routes all extracted text through a deterministic scanner (regular expressions, exact-match dictionaries, and fast local classifiers) that removes or masks secrets, identifiers, and protected health information. Where the structure of a sentence matters to the embedding, the scanner substitutes a typed placeholder, such as a redacted-identifier token, rather than deleting the span; the semantics survive and the raw liability does not. Redaction at ingestion is cheaper and more auditable than redaction at every read, and it closes the window in which an unredacted vector exists at all.
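The typed-placeholder idea can be sketched in a few lines of Python. This is a minimal illustration, not a complete scanner: the three patterns, the placeholder format `[REDACTED_…]`, and the function name `redact` are assumptions made for the example, and a production scanner would add exact-match dictionaries and local classifiers alongside the regular expressions.

```python
from __future__ import annotations

import re

# Hypothetical patterns chosen for illustration only; real deployments need
# a much broader, reviewed rule set.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("AWS_ACCESS_KEY", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
]


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each sensitive span with a typed placeholder and count hits.

    The placeholder keeps the sentence structure intact for the embedding
    while the raw value never reaches the store.
    """
    hits: dict[str, int] = {}
    for label, pattern in PATTERNS:
        text, count = pattern.subn(f"[REDACTED_{label}]", text)
        if count:
            hits[label] = count
    return text, hits
```

The returned hit counts are the kind of signal that could feed a redaction hit-rate metric; the text passed on to chunking and embedding is only the redacted form.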

Identity and access-control synchronization

This is the commitment that enterprise deployments most often miss, and the one whose absence is most damaging. An agent is a synthetic user acting for a human; the store must respect that human’s authorization. Because vector and graph stores do not natively understand corporate directory groups or token scopes, access control has to be flattened into metadata at ingestion and re-evaluated at query time, a pattern best called late-binding authorization.

The mechanism has three parts. Each chunk and graph node is tagged at enrichment with the access-control list of its source document. The memory gateway (Chapter 7) resolves the querying user’s group memberships and appends a deterministic pre-filter to every query, so a user sees only records whose tags admit them. And because access rights change, an asynchronous sync worker subscribes to the identity provider’s change events and updates the tags on existing records as memberships change. Without the sync worker, the store enforces yesterday’s permissions, a slow leak that audits miss because every individual query looks correctly filtered.
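A minimal sketch of the first two parts, tagging and the deterministic pre-filter, might look like the following. The `Chunk` type, the `resolve_groups` lookup, and the in-memory `directory` are stand-ins invented for the example; a real gateway would call the identity provider and push the filter into the store's query. The sync worker is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    # Flattened at enrichment from the source document's access-control list.
    allowed_groups: frozenset[str]


def resolve_groups(user: str, directory: dict[str, set[str]]) -> set[str]:
    """Stand-in for an identity-provider lookup performed at query time."""
    return directory.get(user, set())


def prefilter(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks whose tags admit at least one of the user's groups.

    This runs before any similarity ranking, so an inadmissible chunk is
    never a candidate, however close its vector is to the query.
    """
    return [c for c in chunks if c.allowed_groups & user_groups]
```

Because group membership is resolved when the query arrives rather than when the chunk was written, the same stored tags yield different visible sets for different users, which is the late-binding property the text describes.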

Lineage and traceability

When trace replay (Chapter 12) shows that an agent erred because of a poisoned retrieval, the on-call engineer must be able to follow that chunk back to its origin. Every embedded chunk and extracted entity therefore carries immutable lineage: the source system, the document identifier, the document version or commit hash, and the ingestion timestamp. Lineage turns an untraceable swamp into an auditable store. A wrong answer leads to the exact document and version that produced it, the source is corrected, and the fix propagates on the next ingestion rather than being patched in the prompt.
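One way to make lineage immutable in practice is to model it as a frozen record attached to every stored chunk. The sketch below assumes hypothetical type names (`Lineage`, `StoredChunk`) and a dictionary standing in for the store; the four fields are the ones listed above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Lineage:
    source_system: str
    document_id: str
    document_version: str  # version label or commit hash
    ingested_at: datetime


@dataclass(frozen=True)
class StoredChunk:
    chunk_id: str
    text: str
    lineage: Lineage


def trace(chunk_id: str, store: dict[str, StoredChunk]) -> Lineage:
    """Follow a retrieved chunk back to the document version that produced it."""
    return store[chunk_id].lineage
```

Freezing the dataclass means any attempt to rewrite a lineage field after ingestion raises an error instead of silently altering the audit trail.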

Cache invalidation and tombstoning

Semantic memory is a cache of enterprise state, and cache invalidation is its hardest operational problem. When a source document is updated or deleted, its old chunks do not disappear on their own. The pipeline needs an explicit tombstone worker: on a source change, it generates chunks for the new version, issues a delete against every record matching the old document identifier and version, and only then inserts the new chunks. Treating updates as content-keyed upserts is the common shortcut and a reliable source of contradiction: it leaves orphaned fragments of prior versions behind, and similarity search will eventually surface them next to their replacements.
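The ordering of the tombstone worker can be sketched against a toy in-memory index. Everything here is illustrative: `InMemoryIndex`, `apply_source_change`, and the blank-line chunker are hypothetical, and a real store would need the delete and insert to be atomic or otherwise hidden from readers mid-update.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    document_id: str
    version: int
    text: str


class InMemoryIndex:
    """Toy stand-in for a vector store, keyed on document identity."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def insert(self, records: list[Record]) -> None:
        self.records.extend(records)

    def delete_other_versions(self, document_id: str, keep_version: int) -> int:
        """Tombstone every record of the document that is not keep_version."""
        before = len(self.records)
        self.records = [
            r for r in self.records
            if r.document_id != document_id or r.version == keep_version
        ]
        return before - len(self.records)


def chunk(text: str) -> list[str]:
    # Placeholder chunker: one chunk per blank-line-separated paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def apply_source_change(index: InMemoryIndex, document_id: str,
                        version: int, new_text: str) -> int:
    """Generate new chunks, delete the old version, then insert the new."""
    new_records = [Record(document_id, version, t) for t in chunk(new_text)]
    removed = index.delete_other_versions(document_id, keep_version=version)
    index.insert(new_records)
    return removed
```

Because the delete is keyed on document identifier and version rather than on chunk content, a revision that shifts every paragraph boundary still leaves no fragment of the prior version behind.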

Semantic and multimodal parsing

Splitting text by a fixed token count severs paragraphs, divides a function from the comment that explains it, and strips tables of their headers. The pipeline parses for meaning instead: headings set chunk boundaries, structured records stay intact, and code is split along syntax-tree boundaries so that a function and its signature travel together. Enterprise documents also carry diagrams, screenshots, and tables; these are passed through a vision-capable model at extraction time to produce dense textual descriptions, which are then chunked and embedded alongside the prose. A retrieval system blind to the architecture diagram in a design document is blind to the part of the document that carried the most meaning.
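For Python source, syntax-tree chunking can be approximated with the standard library's `ast` module. The function name `chunk_python_source` is an assumption for this sketch, which handles only top-level functions and classes. It keeps signatures and docstrings together with their bodies but, as written, does not capture decorators or comments that sit above a definition; a fuller parser would need to.

```python
from __future__ import annotations

import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition.

    Each chunk is the exact source text of the definition, so a function
    always travels with its signature and docstring.
    """
    tree = ast.parse(source)
    chunks: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks
```

Other languages would need their own parsers; the point of the sketch is only that chunk boundaries come from the syntax tree rather than from a token count.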

Shifting reasoning to ingestion

Expensive reasoning is best spent asynchronously at ingestion, where no user is waiting, rather than synchronously while the agent runs. Two precomputations earn their cost. The pipeline can generate, for each chunk, the questions that chunk answers and embed those alongside it, so that retrieval matches a user’s question against questions rather than against raw exposition, a markedly more accurate match. And it can extract typed entities and relationships into a knowledge graph, so that at run time the agent can traverse explicit structure (this service depends on that database) instead of inferring relationships by fuzzy similarity. The graph is also what makes the targeted deletion of Chapter 7 tractable: erasing an entity is removing a node and its edges, not re-summarizing every document that mentioned it.
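The graph half of this argument can be made concrete with a small triple store. The class and relation names below (`KnowledgeGraph`, `depends_on`) are hypothetical, and the extraction step that would populate the graph from documents is out of scope; the sketch shows only traversal over explicit edges and deletion of an entity as removal of its node and edges.

```python
from __future__ import annotations


class KnowledgeGraph:
    """Toy store of (subject, relation, object) triples."""

    def __init__(self) -> None:
        self.edges: set[tuple[str, str, str]] = set()

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges.add((subject, relation, obj))

    def neighbors(self, subject: str, relation: str) -> set[str]:
        return {o for s, r, o in self.edges if s == subject and r == relation}

    def transitive(self, subject: str, relation: str) -> set[str]:
        """Everything reachable from subject by following one relation type."""
        seen: set[str] = set()
        stack = [subject]
        while stack:
            node = stack.pop()
            for nxt in self.neighbors(node, relation):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def erase(self, entity: str) -> int:
        """Remove an entity by dropping every edge that touches it."""
        before = len(self.edges)
        self.edges = {e for e in self.edges if entity not in (e[0], e[2])}
        return before - len(self.edges)
```

A question such as "what does the checkout service ultimately depend on" becomes a deterministic traversal, and erasing an entity is a bounded set operation rather than a re-summarization job.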

Curation as a workflow

Not all data deserves ingestion. A wiki holds authoritative policy next to abandoned drafts, brainstorming notes, and personal scratchpads. Ingesting the latter degrades the signal-to-noise ratio of every subsequent retrieval. Semantic memory must be curated, and curation is a workflow with a human in it, not a crawler left running.

The mechanism by which a fleet learns is the promotion of knowledge from episodic to semantic memory, and that promotion is governed. An agent resolves a novel incident and proposes a runbook drawn from its episodic trace. The proposal enters a human approval queue (Chapter 6). An engineer reviews, edits, and approves it. Only then does the runbook enter the ingestion pipeline to be redacted, enriched, embedded, and stored. The gate separates the agent’s day-to-day activity from the foundational knowledge the fleet depends on, which keeps an agent from poisoning the well it later drinks from.
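The gate can be expressed as a small state check in front of the ingestion queue. The names here (`RunbookProposal`, `approve`, `submit_to_ingestion`) are invented for the sketch, and the approval queue itself is reduced to a status field; the only point illustrated is that nothing unreviewed can reach the pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"


@dataclass
class RunbookProposal:
    title: str
    body: str
    proposed_by: str  # the agent that drafted it from its episodic trace
    status: Status = Status.PROPOSED
    approved_by: Optional[str] = None


def approve(proposal: RunbookProposal, reviewer: str,
            edited_body: Optional[str] = None) -> None:
    """Record a human review, optionally replacing the agent's draft text."""
    if edited_body is not None:
        proposal.body = edited_body
    proposal.status = Status.APPROVED
    proposal.approved_by = reviewer


def submit_to_ingestion(proposal: RunbookProposal, ingestion_queue: list) -> None:
    """Admit a runbook to the ingestion pipeline only after approval."""
    if proposal.status is not Status.APPROVED:
        raise PermissionError("runbook has not passed human review")
    ingestion_queue.append(proposal)
```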

The skills supply chain

The skills layer (Chapter 10) is a runtime-loaded form of semantic memory, and an organization’s internal skills deserve the same ingestion rigor as its documents. A skill file passes through the same pipeline: enrichment tags a deployment procedure with an access-control list restricting it to the operations group, and lineage records the commit that introduced it, so that a skill implicated in an incident leads back to the change that introduced the flaw. Treating skills as a privileged, ungoverned shortcut into the agent’s capabilities reopens every failure this chapter is built to prevent.

Operationalizing the pipeline

The ingestion pipeline runs on a different clock from the agent. It is asynchronous, batch or streaming, and scaled independently of the reasoning loop, and it needs its own observability: ingestion latency from a source change to retrievability, redaction hit rates, and counts of orphaned chunks all belong on a dashboard.

Its characteristic failure is silent. When ingestion stalls, nothing breaks: the agent keeps answering, confidently, from yesterday’s data. The architecture must make staleness loud. The pipeline emits freshness heartbeats per source, and the memory gateway raises an alert when it is serving queries against an index that has fallen behind its freshness budget. An agent that cannot tell its knowledge is stale will defend it as readily as if it were current.
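A freshness check of this kind reduces to comparing each source's last heartbeat against its budget. In the sketch below, the function name `stale_sources` and the dictionary shapes are assumptions; how heartbeats are transported and how alerts are raised is left to the deployment.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def stale_sources(heartbeats: dict[str, datetime],
                  budgets: dict[str, timedelta],
                  now: datetime) -> list[str]:
    """Return the sources that have fallen behind their freshness budget.

    A source with a budget but no heartbeat at all is treated as stale,
    so a pipeline that never started is as loud as one that stalled.
    """
    stale = []
    for source, budget in budgets.items():
        last = heartbeats.get(source)
        if last is None or now - last > budget:
            stale.append(source)
    return sorted(stale)
```

The gateway could run such a check on each query path or on a timer, and attach the result to responses so that downstream consumers know which sources were behind when the answer was produced.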

Reprocessing and the embedding-model lifecycle

The ingestion commitments above treat the embedding model as fixed. It is not. The model that turns text into vectors is a versioned dependency, and the index is bound to it in a way that has no parallel in ordinary data engineering: a vector is only comparable to other vectors produced by the same model. Cosine similarity between an embedding from one model and an embedding from another is not a worse measurement; it is a meaningless one. The vector space itself has changed.

This makes an embedding-model change unlike a code deployment or a schema migration. There is no incremental upgrade and no in-place upsert that mixes old and new: a single index holding vectors from two models returns ranked nonsense. When the embedding model changes, whether because a better one appears, the current one is deprecated, or a domain fine-tune is adopted, the entire corpus must be re-embedded. The disciplined form is a governed backfill: build a parallel index under the new model, replay the source corpus through the pipeline to populate it, validate retrieval quality against a golden query set, and cut over only when the new index meets the bar, then retire the old one. Running both during the transition costs double the storage and the full re-embedding compute, which for a large corpus is a planned, throttled batch job on the ingestion pipeline’s own resources, never an online operation on the serving path.
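The validation step of the backfill can be sketched as a gate over a golden query set. The metric used here, the fraction of golden queries with at least one relevant document in the top k, is one simple choice among many; the names `hit_rate_at_k` and `choose_serving_index` and the default bar of 0.9 are assumptions for the example, not recommended values.

```python
from __future__ import annotations

from typing import Callable

# A retriever takes (query, k) and returns up to k ranked document ids.
Retriever = Callable[[str, int], list[str]]


def hit_rate_at_k(retrieve: Retriever, golden: dict[str, set[str]], k: int) -> float:
    """Fraction of golden queries whose top-k results contain a relevant doc."""
    if not golden:
        raise ValueError("golden query set is empty")
    hits = sum(
        1 for query, relevant in golden.items()
        if relevant & set(retrieve(query, k))
    )
    return hits / len(golden)


def choose_serving_index(candidate: Retriever, golden: dict[str, set[str]],
                         k: int = 5, bar: float = 0.9) -> str:
    """Cut over to the new index only if it meets the quality bar."""
    return "new" if hit_rate_at_k(candidate, golden, k) >= bar else "old"
```

The old index keeps serving until the gate returns "new", which is what makes the backfill safe to run as a throttled batch job off the serving path.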

The embedding-model version therefore joins the source identifier, document version, and timestamp as a first-class lineage dimension (see Lineage and traceability above): a stored vector is meaningful only when paired with the model that produced it. This asymmetry also marks the boundary with the model gateway of Chapter 15. A generation request can fail over to an equivalent model mid-flight, because each call stands alone; an embedding request cannot, because every vector already in the index commits the system to one model until the next backfill. Embeddings are a stateful dependency where generation is a stateless one, and the ingestion pipeline is where that state is owned.
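One way to enforce that pairing is to carry the model version with every vector and refuse any comparison across versions. The `Embedding` type and the version strings in this sketch are hypothetical; the cosine computation is the standard one.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    vector: tuple[float, ...]
    model_version: str  # lineage dimension: which model produced this vector


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity, defined only within a single embedding model."""
    if a.model_version != b.model_version:
        raise ValueError(
            f"cannot compare vectors from {a.model_version!r} and {b.model_version!r}"
        )
    if len(a.vector) != len(b.vector):
        raise ValueError("vectors have different dimensions")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm = math.sqrt(sum(x * x for x in a.vector)) * math.sqrt(sum(y * y for y in b.vector))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Failing loudly on a version mismatch turns a silent ranking error into an immediate, diagnosable one, for example when a query is embedded with a newer model than the index it is sent to.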

Anti-patterns

The blind-ingestion failures above describe what happens with no pipeline at all. The more insidious failures appear in pipelines that exist but cut a corner.

Redaction after retrieval. Scanning for sensitive data at read time rather than before embedding. The unredacted vector still exists in the store, and any retrieval path that skips the read-time scrubber (an export, a debugging query, a second consumer) leaks it. Redaction belongs on the write path, where it runs once and leaves nothing behind.

Content-keyed upsert. Updating the store by re-embedding changed text and trusting the index to overwrite the old version. Because chunk boundaries shift when text changes, fragments of the prior version are orphaned rather than replaced, and similarity search surfaces them beside their successors. Invalidation must be keyed on document identity and version, not on content.

Prompt-enforced access control. Tagging nothing and instructing the model to use only documents the user is allowed to see. This is prompt-based governance (Chapter 6) in a data-layer costume; it fails the moment a chunk’s text does not announce its own sensitivity. Access control must be a deterministic pre-filter on ingested tags.

Fixed-size chunking. Splitting on a token count regardless of document structure. It severs functions from their signatures and rows from their headers, and it is the most common reason a technically correct retrieval returns a fragment the agent cannot use.

Ungoverned skill ingestion. Loading internal skills into the agent without the redaction, access-control, and lineage steps applied to documents. It reopens every failure the pipeline exists to prevent, on the most privileged content in the system.

Testing implications

Ingestion failures are severe, and, unusually for an agentic system, deterministic, which makes them testable by assertion rather than by judgment. Three classes are worth building first.

Chapter 12 develops the testing framework. Ingestion tests belong in it because their failure modes are both among the most severe in the system and among the cheapest to assert.
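As an illustration of what testing by assertion can look like, the sketch below checks a few store-wide invariants drawn from the commitments earlier in the chapter. It is not the chapter's own list of test classes; the record shape, the field names, and the function `find_violations` are assumptions made for the example.

```python
from __future__ import annotations


def find_violations(records: list[dict]) -> list[str]:
    """Check invariants that should hold for every record in the store."""
    violations: list[str] = []
    versions_by_doc: dict[str, set[str]] = {}
    for r in records:
        # Every record must carry access-control tags from its source.
        if not r.get("allowed_groups"):
            violations.append(f"{r['chunk_id']}: missing access-control tags")
        # Every record must carry the version it was ingested from.
        if not r.get("document_version"):
            violations.append(f"{r['chunk_id']}: missing document version")
        versions_by_doc.setdefault(r["document_id"], set()).add(
            r.get("document_version") or ""
        )
    # A document with more than one live version means tombstoning failed.
    for doc_id, seen in versions_by_doc.items():
        if len(seen) > 1:
            violations.append(f"{doc_id}: multiple versions coexist")
    return violations
```

A test suite could run a fixture corpus through the pipeline and assert that `find_violations` returns an empty list, which gives a pass or fail result with no model judgment involved.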

Summary

Semantic memory is not a model trick; it is an enterprise data product. It requires a deterministic ETL pipeline that redacts before embedding, binds identity and access control to every record and keeps them synchronized, records lineage, invalidates superseded content through tombstones, parses for meaning, and shifts expensive reasoning to ingestion time. Curation keeps the store authoritative; independent observability keeps it honest about its own freshness. With the write path governed as strictly as the read path, the agent’s foundation is auditable and current rather than a swamp the model navigates blind. Chapter 9 turns from what the agent knows to how it acts, taking up the control and coordination patterns that govern an agent’s advance through a task.