The Flatland Fallacy: Engineering the Topology of Knowledge

Article I — Geometry of Fallen Vectors

The Flatland Fallacy — A four-article series · April 2026

Raj Sakthi — Founder, LatentGeneration.ai · Co-Founder, DurgAi · Head of AI, Freshriver.ai


A luminous multi-dimensional knowledge sphere casts projections onto a flat two-dimensional plane. Engineers below build elaborate scaffolding between two nearly identical shadows, unaware that the three-dimensional objects above are radically different — one life-saving, one toxic. The chasm between dimensional reality and flat perception.
The Flatland Fallacy. Above the plane: knowledge as it actually exists — typed, relational, temporal, hierarchical. Below the plane: the cosine projection RAG queries, where two compounds with opposite clinical outcomes cast nearly identical shadows. The engineers build ever more elaborate scaffolding between the shadows (agentic loops, rerankers, longer contexts) without looking up.

Frederic Bartlett (1932) showed that human memory is not playback. It is reconstruction, actively shaped by context, schema, and the specific question being asked. Nearly a century later, we are busy building RAG systems that completely ignore this principle.

The gap between lexical proximity and semantic compatibility is where most silent failures in a RAG pipeline live.

Don't get me wrong, naïve RAG works inside a narrow envelope — clean corpora, self-contained documents, retrieval-shaped questions. The trouble starts when we take it outside that envelope and wrap it in orchestration that feels intelligent while guaranteeing the same brittle failures under a higher token budget. The pattern is not specific to pharmaceuticals. A financial compliance system that retrieves a superseded ISDA master agreement alongside the current version because they share 90% of their vocabulary. A legal research tool that conflates two circuit court opinions reaching opposite conclusions because both cite the same precedents. The domain changes. The geometry does not.

This article names four substrate fallacies: specific dimensions of knowledge that dense embeddings discard, and architectural failure modes that survive every optimization we throw at them.

[Figure: The Flatland Fallacy · three states of knowledge. From retrieval over shadows, to reasoning over objects, to conversation between cortices. Left panel (highlighted, "you are here"): the shadow, vanilla RAG. No ingestion stage, cosine top-k, flat ranked chunks, plausible answer; no structure, no time, no provenance; the four fallacies live here. Middle panel (dimmed): the object, ingestion as engineering. Typed entities (Trial, Drug, AE, Cohort, Target, Lab) with metadata such as valid_from: 2019-03-15 and source: NCT0271…; SHACL, ontology, bitemporal; reasoned answer plus chain; typed, time-aware, auditable. Right panel (dimmed): the extended mind, federated substrate. Pharma A, Pharma B, Academic, Regulator, Literature; queries travel, data stays sovereign, privacy-preserving. Footer: Article IV, Wiring the Cortex, an open-source reference implementation traversing all three states. Article I diagnoses, Article II engineers, Article III connects.]
Figure 0 (recurring throughout the series). The whole argument in one frame. Article I diagnoses what lives in the left panel: the shadow that vanilla RAG retrieves over, where the four fallacies live. The middle and right panels are dimmed because they have not been earned yet — Article II builds the typed object, Article III bridges islands of those objects across institutional boundaries, and Article IV provides the working code that traverses all three states. The same diagram opens all four pieces; the highlighted panel marks where each article does its work.

The Four Substrate Fallacies

The four substrate fallacies are not independent. They are one architectural omission expressed in four directions. The matrix below is the navigational summary — each section that follows expands one cell.

The four substrate fallacies as a 2×2 matrix (one omission, four directions)

Structure lost, within doc: II · Structural amputation
Chunking severs cross-references authors assumed intact.
Symptom: right info, split across chunks. Tell: benchmarking chunk size.

Structure lost, across docs: III · Composition failure
Flat concatenation across docs destroys interpretive scaffolding.
Symptom: more chunks, worse answers. Tell: accuracy peaks then drops.

Precision lost, within doc: I · Representation failure
Geometric proximity replaces relational topology.
Symptom: high score, wrong answer. Tell: lookalikes with opposite outcomes.

Precision lost, across docs: IV · Corpus failure
Wrong, missing, superseded documents in the substrate.
Symptom: perfect retrieval, wrong answer. Tell: weeks of tuning, no movement.

Figure 3. The four substrate fallacies as a 2×2 system. Two axes — what is lost (vertical: relational structure above, semantic precision below) and where it is lost (horizontal: within a document, across documents). Read the figure as the table of contents for what follows.


Fallacy I — Representation Failure: When Geometry Replaces Topology

The premise of semantic search is that meaning can be compressed into a fixed-dimensional vector such that geometric proximity equates to semantic similarity. This works for surface-level retrieval. It fails catastrophically when the knowledge has structural depth that geometry cannot capture.

The midnight email

Consider a scenario that recurs across regulated-domain RAG deployments. The email arrives four weeks after a clinical decision support system goes live. Subject line, three words: We need to talk. Attached is a twelve-page printout. Eight model responses are highlighted in yellow. Seven are factually correct — well-sourced, coherent, properly cited. The eighth has synthesized two kinase inhibitor studies into a single recommendation, blending them into fluent, authoritative prose with nearly identical confidence scores.

The two studies describe opposite clinical pictures. One reports a compound that extends progression-free survival in EGFR-mutant non-small-cell lung cancer. The other reports a structurally unrelated compound that failed in trial for the same indication and produced Grade 4 hepatotoxicity in eleven percent of patients. They share vocabulary — kinase, inhibitor, mutation, tumor response. The pathways are different. The clinical implications are incompatible. The retrieval system returned them as near-equivalents.

The system had been tested on eight hundred queries before deployment. It had passed every benchmark the team knew how to construct. And it had told a physician, in fluent prose with proper citations, that a hepatotoxic compound extended survival. Nobody is hurt — the physician caught it on review, the kind of catch a system cannot count on. The embedding fix takes months. Then comes the chunking problem. Then the composition problem. Then the corpus problem. Each fallacy escaped reveals another underneath it.

The midnight email was pure Representation Failure. Two papers, overlapping vocabulary, opposite clinical implications, cosine similarity 0.91, well inside the agreement zone of this embedder's distribution. The embedding model did exactly what it was asked to do. The documents were similar in the shallow sense vector geometry captures and dangerously incompatible in the deep sense clinical decision-making requires. Teams in this position will spend months trying different embedding models (domain-tuned, multi-vector, late-interaction approaches like ColBERT) before they understand the problem is not how they were embedding. The problem is that embedding was the wrong tool for the job.

Why the collision is architectural, not tunable

The neural machinery is incriminating. The hippocampus performs two operations in constant tension: pattern separation (keeping similar-but-distinct memories distinguishable) and pattern completion (reconstructing a full memory from a partial cue) (O'Reilly & McClelland, 1994; Yassa & Stark, 2011). Separation prevents catastrophic interference — exactly the disaster that happens when your vector store returns a hepatotoxic study alongside a survival study because they share words. In the dentate gyrus — the input stage of the hippocampus — overlapping cortical inputs are re-encoded into sparse activation patterns, with active fractions consistently in a few-percent band. The mechanism relies on expansion into a much larger neural population with aggressive competitive inhibition, ensuring that even near-identical inputs activate barely-overlapping subsets.

Cosine similarity over dense embeddings does the opposite. Most dimensions are non-zero for any input, and similar inputs produce similar activations by construction. The collision is not a tuning failure. It is what the architecture is doing.
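The contrast can be made concrete with a toy model. What follows is a minimal sketch, not a model of the hippocampus and not any embedder's actual behavior: it assumes two synthetic "papers" built as a random vector and a noisy copy of it, expands both through random weights into a much larger population, and keeps only the top 2% of units as a crude stand-in for competitive inhibition. The dimensions, the noise level, and the names `sparse_code`, `paper_a`, and `paper_b` are all illustrative assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sparse_code(x, weights, k):
    """Expand x through random weights, then keep only the k most active units."""
    activations = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    ranked = sorted(range(len(activations)), key=activations.__getitem__, reverse=True)
    return set(ranked[:k])


rng = random.Random(0)
dim, units, k = 128, 2000, 40  # 40 of 2000 units active = 2% (illustrative sizes)

# Two near-identical inputs: paper_b is paper_a plus moderate noise.
paper_a = [rng.gauss(0, 1) for _ in range(dim)]
paper_b = [a + 0.45 * rng.gauss(0, 1) for a in paper_a]

# Random expansion into a much larger population.
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(units)]

dense_similarity = cosine(paper_a, paper_b)
code_a = sparse_code(paper_a, weights, k)
code_b = sparse_code(paper_b, weights, k)

# For two binary codes with k active units each, cosine equals shared / k.
sparse_overlap = len(code_a & code_b) / k

print(f"dense cosine:   {dense_similarity:.2f}")
print(f"sparse overlap: {sparse_overlap:.2f}")
```

In this toy the sparse codes overlap less than the dense vectors do, but the separation is partial. Random expansion plus winner-take-all is only the crudest piece of the mechanism, and it says nothing about how a retrieval system would learn which units to use.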

The analogy goes only so far. The dentate gyrus assumes its dictionary; the engineering problem starts with constructing one. Sparse-coding attempts in the retrieval literature (SPLADE, sparse autoencoders over dense embeddings) show why the operation is harder than the principle: sparsity over the vocabulary dimension is not the same operation as sparsity over the concept dimension. The brain solves a problem we are not yet equipped to pose cleanly. The biological claim grounding this section is therefore narrower than it sounds: sparse, high-dimensional codes have provably lower interference than dense ones (Babadi & Sompolinsky, 2014), and the contrastive-cosine substrate inverts that property. The rest is engineering, not analogy.

[Interactive figure: two activation grids encoding the same pair of near-identical inputs. Left panel, cosine similarity (naïve RAG): dense activation; two near-identical inputs collide by construction (patterns collide: the midnight email). Right panel, sparse coding (dentate gyrus): few cells active, with a slider controlling sparsity, shown at 2.0% (patterns separate). Each panel reports active cells per pattern, overlap (collision), and cosine similarity. Legend: fires for paper A only; fires for paper B only; fires for both (collision).]

Figure 1 (interactive). Two near-identical inputs encoded two ways. The left panel is fixed: cosine similarity over dense embeddings produces nearly identical activation across both inputs, and the purple cells are the collisions the midnight email surfaced. The right panel is the dentate-gyrus solution — drag the slider to vary sparsity and watch the overlap collapse. At a few percent activation, two highly similar inputs end up represented by patterns that barely overlap. Pattern separation is a property of the encoding architecture, not a heuristic on top of it. The brain solves this problem. Naïve RAG inverts the solution. (The grid uses binary activation — on/off per cell — for visual clarity; real embedding vectors are continuous-valued. The geometric consequence — dense activation producing high overlap by construction while sparse activation drives overlap toward zero — is preserved under continuous-valued representations.)

[Interactive figure: The collision geometry: cosine similarity distributions. Why a retriever using cosine similarity cannot distinguish agreement from contradiction when vocabulary overlap is high. Curves: agreeing pairs, contradicting pairs, overlap zone. Readout shown: agree μ 0.91, contradict μ 0.87, overlap 73%, risk HIGH. At 1536 dimensions with 78% vocabulary overlap, a retriever using cosine similarity has no reliable threshold to separate agreeing from contradicting document pairs. The midnight email lives in the overlap zone.]

Figure 1b (interactive). Cosine similarity distributions for agreeing and contradicting document pairs. The teal curve shows pairs that reach the same clinical conclusion; the coral curve shows pairs with opposite conclusions. The amber overlap zone is the collision region — the geometric space where the retriever cannot distinguish agreement from contradiction. Adjust the vocabulary overlap slider: at the overlap levels typical of specialized corpora (70–90%), no cosine threshold can reliably separate the two classes. The midnight email lives in the overlap zone. (Distributions modeled from observed cosine similarity patterns across regulated-domain document pairs. The qualitative shape — heavy overlap for high vocabulary similarity — is the invariant finding.)
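The "no reliable threshold" claim can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the two means (0.91 and 0.87) come from the figure readout, but the figure does not give a spread, so both distributions are assumed to be normal with the same standard deviation, and `SIGMA` is chosen here so the modeled overlap lands near the figure's 73%. Real cosine distributions need not be normal.

```python
from statistics import NormalDist

# Assumption: equal-variance normal curves; SIGMA is not taken from the article.
SIGMA = 0.058

agree = NormalDist(mu=0.91, sigma=SIGMA)       # pairs reaching the same conclusion
contradict = NormalDist(mu=0.87, sigma=SIGMA)  # pairs reaching opposite conclusions

# Overlapping coefficient: shared area under the two density curves.
overlap = agree.overlap(contradict)

# With equal spreads, the best single cut sits midway between the means.
threshold = (agree.mean + contradict.mean) / 2
kept_agreeing = 1 - agree.cdf(threshold)        # agreeing pairs scored above the cut
rejected_contradicting = contradict.cdf(threshold)  # contradicting pairs below it
balanced_accuracy = (kept_agreeing + rejected_contradicting) / 2

print(f"overlap={overlap:.2f} threshold={threshold:.2f} accuracy={balanced_accuracy:.2f}")
```

Under these assumptions even the best single threshold classifies a pair correctly only a little more often than a coin flip would, which is the practical meaning of living in the overlap zone.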

The Flatland figure

The geometric collapse from above the line to below is what makes the substrate failures inevitable. Above: a typed knowledge object — two compounds connected by shared vocabulary but separated by mechanism, outcome, and several edges of typed relation. Below: the cosine projection naïve RAG queries — those compounds collapsed onto a single similarity axis. The projection keeps the vocabulary and discards the outcome edges.

[Figure: a typed knowledge object and its cosine shadow. Above the line (what knowledge actually is): Compound A (Phase III, EGFR NSCLC) has an outcome edge to "extends survival, PFS +4.2 months"; Compound B (failed Phase II, EGFR NSCLC) has an outcome edge to "grade 4 hepatotoxicity, 11% of patients, halted"; both connect to a shared vocabulary node (kinase, inhibitor, mutation). Below the line (what cosine similarity sees): A and B are projected onto a one-dimensional cosine similarity axis, the 1D shadow of the embedding sphere, running from distant to identical, and land at near-identical positions (cosine 0.94). Above: outcomes are first-class, and opposite clinical pictures separate by topology. Below: the projection keeps the vocabulary and discards the outcome edges.]

Figure 2. The Flatland figure. Above the line, the typed knowledge object: two compounds connected by shared vocabulary but separated by opposing outcome edges. Below the line, the cosine projection naïve RAG queries: the same two compounds collapsed onto a one-dimensional similarity axis with the outcome dimensions discarded. It is the operation every fallacy in this article shares.

Figure 2 shows the projection as a fait accompli — the shadow already cast. But the loss is not instantaneous. It is sequential, and the sequence maps to the four fallacies. Drag the slider below to watch each dimension of knowledge collapse in turn.

[Interactive figure: The dimensional collapse. A slider moves the representation from typed knowledge graph to cosine shadow and shows what cosine projection discards. The knowledge doesn't disappear; the representation does. Readouts at the starting position: edge types visible, 5 of 5; A–B separation, topologically distinct; clinical risk, contained. Starting state: full knowledge representation. Compound A and Compound B share vocabulary but are separated by typed edges: contradiction, outcome, mechanism, trial phase, and temporal validity. The graph topology makes them structurally distinguishable. Legend: outcome (survival), outcome (toxicity), mechanism, temporal / phase, shared vocabulary, contradiction.]

The architectural reason: trained for similarity, not stance

The biological analogy points to a deeper architectural fact about how these embedders were built. Modern embedding models are trained with contrastive objectives — InfoNCE, multiple-negatives ranking, triplet losses — whose supervision signal is "these two passages co-occur in similar contexts." The signal is never "these two passages reach the same conclusion." A paper reporting that Compound A extends survival and a paper reporting that Compound B caused hepatotoxicity in the same indication share co-occurrence statistics so densely — same vocabulary, same patient population, same molecular target class, often the same journal venue — that any contrastive objective treats them as positives of one another. The training data has no column for agreement. The model cannot learn what it was never asked to represent. The midnight email is what that absence looks like at deployment.
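A stripped-down version of the objective makes the missing column visible. The sketch below is a plain-Python InfoNCE loss for a single anchor, assuming dot-product similarity and an illustrative temperature of 0.05; it is not the training code of any particular embedder. The thing to notice is the signature: the loss sees vectors and which passage was labeled the positive, and nothing else.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE for one anchor: negative log-softmax of the positive's similarity.

    The only supervision is which passage counts as the positive. Nothing in
    the inputs says whether two passages agree or contradict each other.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, negative) / temperature for negative in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]


# Toy unit vectors (illustrative only).
anchor = (1.0, 0.0)
close_positive = (0.96, 0.28)  # nearly aligned with the anchor
far_positive = (0.6, 0.8)      # less aligned with the anchor
negative = (0.0, 1.0)

loss_close = info_nce(anchor, close_positive, [negative])
loss_far = info_nce(anchor, far_positive, [negative])
print(loss_close < loss_far)
```

If a survival paper and a hepatotoxicity paper are sampled as a positive pair because they co-occur, minimizing this loss pulls them together. There is no argument through which a contradiction label could push them apart.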

The geometry inherits this and compounds it: spheres cannot hold trees. Cosine similarity is computed on the surface of a unit hypersphere, a bounded manifold. Ontologies, taxonomies, and pharmacological hierarchies are not spherical objects. A drug class that branches into twenty subclasses, each branching further into variants and formulations, is a tree. Trees embed naturally in hyperbolic spaces, where the area available at distance r from the origin grows exponentially with r, unlike Euclidean space (polynomial growth) or spherical space (bounded). A pharmacological taxonomy with branching factor b and depth d produces b^d leaf concepts; hyperbolic space has room for all of them at the boundary while keeping the root at the origin and preserving hierarchical distances. Spherical and Euclidean spaces cannot: they are geometrically too small for exponential branching, and everything crowds toward the center (Nickel & Kiela, 2017).

The framing oversimplifies in one direction: real knowledge is not a strict tree but a general graph, often a DAG with multiple parents and sometimes with cycles, and the geometric prescription for such graphs is mixed-curvature embeddings rather than pure hyperbolic space (Gu et al., 2019). The architectural point is unchanged (spherical embedding spaces cannot hold the structures we need them to hold), but the fix is more careful than "switch to hyperbolic." We are taking knowledge structures whose natural geometry is hyperbolic and projecting them onto a unit sphere, an operation that must produce collisions, and does. Even if the training objective were stance-aware, the bounded geometry would still crowd hierarchically distant items together. The training objective is the architectural cause. The spherical projection is the geometric symptom that compounds it.
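The "room at the boundary" argument is easy to see numerically. The sketch below uses the standard distance formula for the Poincaré disk, the hyperbolic model used by Nickel & Kiela (2017), and compares two pairs of points with the same Euclidean gap: one pair near the origin (the root of a hierarchy) and one pair near the boundary (the leaves). The specific coordinates are illustrative assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit disk."""
    squared_gap = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    return math.acosh(1 + 2 * squared_gap / ((1 - norm_u) * (1 - norm_v)))


# Two pairs separated by the same Euclidean gap of 0.05.
root_pair = ((0.0, 0.0), (0.05, 0.0))    # near the origin
leaf_pair = ((0.95, 0.0), (0.95, 0.05))  # near the boundary

root_euclid = math.dist(*root_pair)
leaf_euclid = math.dist(*leaf_pair)
root_hyper = poincare_distance(*root_pair)
leaf_hyper = poincare_distance(*leaf_pair)

print(f"Euclidean gaps:  {root_euclid:.3f} vs {leaf_euclid:.3f}")
print(f"Hyperbolic gaps: {root_hyper:.3f} vs {leaf_hyper:.3f}")
```

The same Euclidean gap is worth far more hyperbolic distance near the boundary, which is where the extra room for leaves comes from. On a unit sphere no such stretching is available.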

What would have to change — and why none of it ships today

Stance-aware training. NLI-distilled embedders (Sentence-T5, the GTR family, NLI-supervised E5 variants) inject entailment and contradiction supervision into the contrastive loss. The contradiction window narrows. It does not close: NLI supervision is largely sentence-level, while the contradictions in regulated corpora span paragraphs, documents, and years.

Non-spherical geometry. Production-ready hyperbolic or product-manifold embedding at million-document scale does not exist. The Nickel & Kiela (2017) foundations operate on graphs of tens of thousands of nodes, not millions of typed pharmaceutical relations; the optimization pathologies at scale have not been solved.

The ingestion prerequisite. Even if a stance-aware hyperbolic embedder shipped tomorrow, it would need a typed taxonomy to embed against, which is exactly the ingestion problem Article II addresses. The geometry prescription routes upward, through the model. The ingestion prescription routes downward, through the data. Both are needed; only one is currently funded.

In a customer support corpus the blurring is tolerable. In a clinical corpus it is a compliance incident waiting for a court date.

The Modern Embedding Landscape: What Has Changed, and What Has Not

The embedding landscape has not stood still, and the critique above must account for recent developments that partially address the collision geometry.

Multi-vector representations — ColBERT, ColBERTv2, and XTR (Khattab & Zaharia, 2020; Santhanam et al., 2022; Lee et al., 2024) — represent documents as collections of token-level vectors rather than a single pooled embedding. Late interaction computes similarity at the token level, which can distinguish "extends survival" from "causes hepatotoxicity" if these phrases activate different query tokens. The effect is real but bounded: if both documents contain the shared vocabulary ("kinase," "inhibitor," "EGFR") and the distinguishing phrases appear late in the ranking calculation, the retrieval stage may still return both documents. Multi-vector representations narrow the collision window. They do not close it.
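The late-interaction score itself is small enough to write out. This is a sketch of the MaxSim operator from the ColBERT family with hand-made three-dimensional "token vectors"; real systems use learned token embeddings, and the axis meanings below are an illustrative assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token takes its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token vectors: one axis per concept (assumption for illustration).
KINASE = (1, 0, 0)    # shared vocabulary
SURVIVAL = (0, 1, 0)  # "extends survival"
TOXICITY = (0, 0, 1)  # "causes hepatotoxicity"

paper_a = [KINASE, SURVIVAL]
paper_b = [KINASE, TOXICITY]

vague_query = [KINASE]               # only shared vocabulary
specific_query = [KINASE, SURVIVAL]  # names the distinguishing outcome

vague_scores = (maxsim_score(vague_query, paper_a), maxsim_score(vague_query, paper_b))
specific_scores = (maxsim_score(specific_query, paper_a), maxsim_score(specific_query, paper_b))
print(vague_scores, specific_scores)
```

When the query names the distinguishing outcome, the two papers separate. When it carries only the shared vocabulary, they tie, and even in the specific case paper B keeps a non-trivial score from the shared token.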

Matryoshka Representations (Kusupati et al., 2022) — available in OpenAI's embedding models and Jina's suite — provide embeddings at multiple dimensionalities (3072D, 1024D, 256D). The coarse-to-fine retrieval pipeline this enables partially addresses the capacity problem. This is a retrieval optimization, not a representation fix. The geometry is still spherical. The capacity is still bounded.
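As a rough sketch of what coarse-to-fine retrieval looks like: shortlist with a truncated prefix of each vector, then rerank the shortlist with the full vector. This assumes an embedder trained with a Matryoshka objective, since truncating an ordinary embedding is not guaranteed to preserve anything useful. The function names, the four-dimensional toy vectors, and the document ids are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def truncate_embedding(vector, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated prefix has zero norm")
    return [x / norm for x in head]


def coarse_to_fine(query, corpus, coarse_dim, shortlist, top_k):
    """Shortlist by the truncated prefix, then rerank with the full vectors."""
    small_query = truncate_embedding(query, coarse_dim)
    coarse = sorted(
        corpus,
        key=lambda item: dot(small_query, truncate_embedding(item[1], coarse_dim)),
        reverse=True,
    )[:shortlist]
    return sorted(coarse, key=lambda item: dot(query, item[1]), reverse=True)[:top_k]


# Toy corpus of (doc_id, unit vector) pairs.
corpus = [
    ("a", [0.6, 0.0, 0.8, 0.0]),
    ("b", [0.6, 0.0, -0.8, 0.0]),  # same prefix as "a", different tail
    ("c", [0.0, 1.0, 0.0, 0.0]),
]
query = [0.6, 0.0, 0.8, 0.0]

results = coarse_to_fine(query, corpus, coarse_dim=2, shortlist=2, top_k=1)
print([doc_id for doc_id, _ in results])
```

The coarse stage cannot tell "a" from "b" because they share a prefix; only the full-length rerank separates them. Both stages still rank by inner product on normalized vectors, which is why this is an optimization of the retrieval pipeline and not a change of geometry.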

BGE-M3 (Chen et al., 2024) — the most capable production embedding system as of this writing — provides dense, sparse (SPLADE-style), and multi-vector representations simultaneously. It is the closest current system to the desideratum this article describes. But BGE-M3's sparse representation is sparse over the vocabulary dimension, not the concept dimension. It can separate "kinase inhibitor" from "protease inhibitor" because the vocabularies diverge. It cannot reliably separate two EGFR kinase inhibitor papers with opposite clinical outcomes and overlapping terminology. Vocabulary-level sparsity improves lexical retrieval. Concept-level sparsity — orthogonalizing similar-but-distinct knowledge objects — requires the architectural changes this article argues for.

None of these approaches resolve the fundamental problem. They narrow it. The collision geometry persists because the underlying space is spherical, and hierarchical, typed, relational knowledge does not embed naturally on a sphere. The improvements are worth deploying — they measurably improve recall on standard benchmarks — but they do not change the geometry that produces the midnight email.

The same limitation applies to standalone SPLADE systems: sparsity over vocabulary tokens is not the same operation as sparsity over concepts. The dentate gyrus achieves the latter: each memory activates a unique subset of a large neural population, driven apart by competitive inhibition regardless of vocabulary overlap. The retrieval literature has not yet achieved this.

I still use embeddings for first-pass candidate retrieval. But I no longer expect them to understand what they retrieve, and I build verification layers that assume they don't. If your retrieval architecture treats proximity as truth — whether the proximity is computed by an OpenAI embedding model, a Pinecone hybrid index, or a Cohere reranker on top — you are not building a knowledge system. You are building a similarity ranker and hoping it substitutes for understanding.
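One possible shape for such a verification layer, as a minimal sketch: treat the embedding scores as a candidate list only, and check typed metadata before anything is synthesized. Everything here is hypothetical, including the `Candidate` fields, the outcome labels, and the document ids; it also assumes an ingestion stage has already attached that metadata, which naïve RAG does not do.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    compound: str
    indication: str
    outcome: str  # hypothetical typed label, e.g. "benefit" or "harm"
    score: float  # similarity score from first-pass retrieval


def outcome_conflicts(candidates):
    """Return pairs that share an indication but carry different outcome labels."""
    return [
        (first.doc_id, second.doc_id)
        for first, second in combinations(candidates, 2)
        if first.indication == second.indication and first.outcome != second.outcome
    ]


candidates = [
    Candidate("study-A", "Compound A", "EGFR NSCLC", "benefit", 0.91),
    Candidate("study-B", "Compound B", "EGFR NSCLC", "harm", 0.90),
]

conflicts = outcome_conflicts(candidates)
if conflicts:
    # Near-identical scores, opposite outcomes: do not blend these into one answer.
    print("conflicting candidates:", conflicts)
```

The check is trivial once the labels exist. The hard part is producing the labels, which is an ingestion problem and not a retrieval one.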


Fallacy II — Structural Amputation: The Chunking Scalpel

I once spent three weeks tuning a chunking pipeline for a clinical trial corpus — different sizes, overlaps, splitting strategies, even a learned boundary detector. The problem: our system kept returning dosing information without the safety thresholds that governed it. No matter how I tuned the chunker, the system missed the connection between the dosing schedule on page 12 and the stopping criteria on page 47.

The breakthrough came when I stopped asking how to split the document and started asking what knowledge structures the document contained. A clinical trial protocol is not a narrative that happens to be long. It is a structured artifact with forward references, conditional logic, cross-references to regulatory guidance, and tabular data in which the relationships between cells carry more meaning than the contents of individual cells. The dosing escalation rule references the safety threshold. The biomarker stratification modifies the inclusion criteria for a specific patient subgroup. None of this structure is visible to a splitting algorithm that treats the document as a sequence of paragraphs.

When you split this document into chunks, you don't just partition text. You sever connections its authors took for granted. You amputate the relational skeleton while pretending the flat bag of tokens still carries the same anatomy.

Clinical Trial Protocol XR-7042: EGFR NSCLC Phase III

pp. 1–8 · §1 Study design & objectives · Primary endpoint: PFS. References stopping criteria §5.
pp. 9–14 · §2 Inclusion / exclusion criteria · EGFR+ confirmed. Modified by biomarker stratification §4.
pp. 15–22 · §3 Dosing schedule & escalation · 200mg → 400mg. Escalation governed by safety threshold §5.
pp. 23–31 · §4 Biomarker stratification · T790M subgroup. Modifies inclusion criteria §2, alters dosing §3.
pp. 32–47 · §5 Safety thresholds & stopping criteria · Grade 3+ hepatotoxicity → halt escalation §3. Governs design §1.
pp. 48–55 · §6 Statistical analysis plan · Interim futility analysis. References primary endpoint §1.

Cross-references (intact): §1 references §5 stopping criteria; §3 dosing governed by §5 safety. All cross-references intact. The document's relational skeleton is preserved.

Figure 6 (interactive). A clinical trial protocol with internal cross-references. Toggle "Apply chunking" to see what a 512-token splitter does: it severs the connection between the dosing schedule (§3, page 15) and the safety threshold that governs it (§5, page 32). The chunk contains the dose. It does not contain the reason to stop escalating. That is structural amputation.

Cognitive science has a name for this. Jeffrey Zacks's event segmentation theory (Zacks et al., 2007) shows that human comprehension partitions continuous experience at perceptually meaningful boundaries — topic shifts, state changes, references closing — and that disrupting these boundaries measurably degrades later recall. A 512-token splitter does not know any of those boundaries are there. It does not merely fail to respect document structure; it actively destroys the event-level segmentation the document's authors used to organize meaning.

I now think of chunking as an amputation problem, not a layout problem. Every split is a potential severing of a relationship that matters. The question is not what is the right chunk size. The question is what knowledge structures must be preserved intact, and how do we identify them before we start cutting. That requires understanding each document at a depth no general-purpose splitter — LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter, or a hand-rolled regex — can achieve.
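A first approximation of "identify them before we start cutting" is to extract the cross-references and ask which ones a proposed chunking would sever. The sketch below uses the section summaries of the protocol shown in Figure 6 and a deliberately crude detector, a regular expression over "§N" markers. The one-chunk-per-section assignment is an assumption for illustration, and real protocols reference each other in far less regular ways.

```python
import re

# Section summaries from the illustrative protocol (section number -> text).
SECTIONS = {
    1: "Primary endpoint: PFS. References stopping criteria §5.",
    2: "EGFR+ confirmed. Modified by biomarker stratification §4.",
    3: "200mg → 400mg. Escalation governed by safety threshold §5.",
    4: "T790M subgroup. Modifies inclusion criteria §2, alters dosing §3.",
    5: "Grade 3+ hepatotoxicity → halt escalation §3. Governs design §1.",
    6: "Interim futility analysis. References primary endpoint §1.",
}


def cross_references(sections):
    """Map each section number to the set of section numbers it cites."""
    return {
        number: {int(match) for match in re.findall(r"§(\d+)", text)}
        for number, text in sections.items()
    }


def severed(references, chunk_of):
    """List references whose source and target land in different chunks."""
    return sorted(
        (source, target)
        for source, targets in references.items()
        for target in targets
        if chunk_of[source] != chunk_of[target]
    )


references = cross_references(SECTIONS)

# Assumption: a splitter that puts every section in its own chunk.
one_chunk_per_section = {number: number for number in SECTIONS}
cut = severed(references, one_chunk_per_section)
print(f"{len(cut)} cross-references severed, including dosing -> safety: {(3, 5) in cut}")
```

Under that assignment every cross-reference in the protocol is severed, including the one from the dosing schedule to the safety threshold that governs it. No choice of chunk size changes that unless the sections that reference each other are kept or linked together.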

The tell: when you find yourself benchmarking chunk sizes, you have almost certainly already lost. The relationships that matter in your domain were not a function of chunk size. They were a function of document structure, and they were severed before the first embedding was computed.


Fallacy III — Composition Failure: Attention Is Not Understanding

Early in our RAG work, we noticed a puzzling pattern. Increasing the number of retrieved chunks improved our recall metrics but degraded our answer quality. With five chunks, decent answers. With fifteen, broader coverage and less accuracy. The model was seeing more of the right information and doing less with it.

Two things are happening, and they are usually discussed as one.

The first is attention sag — Liu et al. (2023) named it lost in the middle. Information in the middle of a long context window receives less reliable attention than information at the boundaries. When you stuff a context window with twenty retrieved chunks, you are not providing twenty units of relevant context. You are providing a few high-attention edges and a bulk of low-attention noise. Long-context models from Anthropic, OpenAI, and Google have improved this, but improvement is not solution: the failure mode shifts from missing the middle to lower confidence everywhere, which is not the same as having the right answer.

The second, deeper problem is composition failure. Retrieved chunks are not independent units of meaning. A chunk about drug contraindications in elderly patients means something different when it follows a chunk about pediatric dosing than when it follows a chunk about geriatric pharmacokinetics. Naïve RAG concatenates chunks into a flat sequence, destroying the interpretive scaffolding that determines how each chunk should be read. Recall is reconstruction under context. Flat concatenation is not composition. It is stacking.

The cognitive mechanism is more specific than "attention sag." When a language model receives three chunks in sequence — geriatric pharmacokinetics, followed by a dosing schedule, followed by pediatric dosing — it does not treat them as independent paragraphs. It treats them as a composition and reconstructs a narrative that connects them. The model does not know that juxtaposition is accidental. It cannot distinguish "these chunks were concatenated by a retrieval score" from "these chunks belong together because a domain expert composed them."

Endel Tulving called the underlying principle encoding specificity (Tulving & Thomson, 1973): a memory is not a stored file but a record of the interaction between the item and its encoding context. Change the context, and the same item is recalled differently, or not at all. Change the neighboring chunks, and the same dosing paragraph is interpreted differently. Composition failure is not just losing information. It is manufacturing false context that the model treats as authored.

[Interactive figure: Composition failure: context determines meaning. Select a domain to see how the same chunk changes meaning when its neighbors change. The chunk text is identical; the interpretation is not. Panels: Context A with its LLM interpretation, Context B with its LLM interpretation. Same chunk. Same embedding. Same retrieval score. Different meaning. Flat concatenation cannot distinguish these.]

Figure 6b (interactive). The composition failure made concrete. In each pair, the highlighted chunk is identical — same text, same embedding vector, same retrieval score. But the neighboring chunks change its clinical, regulatory, or operational meaning. Naïve RAG concatenates chunks without relational scaffolding. The LLM receives the same text and has no mechanism for knowing which interpretation the domain requires.

[Interactive figure: The retrieval-depth paradox. More chunks improved recall and degraded answer accuracy. A slider varies the number of retrieved chunks. Curves: document recall, answer accuracy, accuracy peak. Readout at 5 chunks: recall 68%, accuracy 84%, gap from peak 0%, noise level low. At 5 chunks: the accuracy peak. Enough relevant context to answer correctly. Low enough noise that the model attends to what matters.]

Figure 7 (interactive). The retrieval-depth paradox. Drag the slider to vary the number of retrieved chunks and watch recall rise while accuracy peaks early and then declines. At 5 chunks: peak accuracy. At 15: recall is high but accuracy has dropped 14 points. The model is seeing more of the right information and doing less with it. Pattern observed across multiple production deployments; the qualitative shape, not the quantitative scale, is what survives the hedge.

The DRM paradigm in memory research (Roediger & McDermott, 1995) quantifies the exact failure: when retrieval is driven by semantic similarity, plausible-but-wrong items intrude with the same confidence as correct ones. Contextual retrieval narrows this for intra-document references. It does roughly nothing for the inter-document case, where the hard problems live.

The fix is not to retrieve fewer chunks. That just hides the problem in narrow queries. The fix is to recover the structural relationships between retrieved content — which is to say, to stop pretending the substrate is flat.
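The smallest possible step away from flat stacking is to keep provenance and in-document order when assembling the context. The sketch below is only that step, not the typed substrate this series argues for; the `Chunk` fields, document ids, and bracketed header format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    section: str
    position: int  # order of the chunk inside its source document
    text: str
    score: float   # retrieval score, used only for the initial ranking


def compose(chunks):
    """Group chunks by source document and restore their in-document order."""
    by_doc = {}
    for chunk in chunks:
        by_doc.setdefault(chunk.doc_id, []).append(chunk)
    blocks = []
    for doc_id, group in by_doc.items():
        ordered = sorted(group, key=lambda c: c.position)
        lines = [f"[source: {doc_id}]"]
        lines += [f"({c.section}) {c.text}" for c in ordered]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Retrieved chunks in score order, interleaving two unrelated documents.
retrieved = [
    Chunk("geriatric-pk", "dosing", 2, "Reduce dose when clearance is impaired.", 0.92),
    Chunk("pediatric-guide", "dosing", 1, "Dose by body weight.", 0.91),
    Chunk("geriatric-pk", "clearance", 1, "Renal clearance declines with age.", 0.90),
]

context = compose(retrieved)
print(context)
```

The model now sees which passages were authored together and in what order, so an accidental juxtaposition across documents is at least marked as such. The relationships between documents are still missing.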


Fallacy IV — Corpus Failure: Perfect Retrieval, Wrong Document

When something goes wrong in a RAG system, the first instinct is to blame retrieval. The embedding model wasn't good enough. The reranker missed something. The query needed better expansion. Following that instinct has cost me weeks of optimization on the wrong component.

Consider what actually happens when you retrieve. You take a query — already a compressed, lossy representation of what someone needs to know — and match it against documents that are themselves compressed, lossy representations of knowledge. The retrieval score measures similarity between two lossy compressions using a metric that captures something about semantic overlap and nothing about factual correctness, temporal validity, or completeness.

I once debugged a system where the correct document was retrieved with high confidence. The document was wrong. It was a legacy protocol superseded by a regulatory update eighteen months earlier, but the update lived in a different system and nobody had told the retrieval corpus about it. The system did exactly what we designed it to do. What it did would have led to a compliance violation.

In another case, the retrieved document was authoritative and correct — and dangerously incomplete. A manufacturing SOP described the standard process beautifully but omitted the exception handling for a specific raw material supplier. The exception lived in a supplier qualification file in a different database with no cross-reference. Perfect retrieval. Perfect document. Wrong answer.

Corpus audit: pharma manufacturing knowledge base
What your retriever searches vs. what your question requires

SOP-2024-MFG · Manufacturing process v3.1 · Current
SOP-2022-MFG · Manufacturing process v2.0 · Superseded
SPEC-RAW-042 · Raw material spec · Current
SQR-SUPPLIER-K · Exception handling · In supplier QMS
VAL-PROC-2024 · Process validation · Current
CAPA-2023-117 · Corrective action · Closed, never digitized
BR-LOT-8847 · Batch record · Current lot
DEV-REPORT-19 · Deviation report · In ERP system
STAB-STUDY-Q4 · Stability data · Current
LAB-NOTE-DR-K · Handwritten · Retired PI, 2005
REG-GUID-FDA · FDA guidance doc · Current
REG-AMEND-2024 · Amendment to guidance · Not yet ingested

Totals: 6 current and present · 1 superseded (still retrieved) · 3 cross-system (invisible) · 2 never digitized (unknowable)

Five of the twelve documents needed to answer the question correctly are invisible to the retriever, and a sixth is superseded. No embedding model, reranker, or query expansion can retrieve a document that is not in the corpus.

Figure 8. A corpus audit for a pharma manufacturing knowledge base. Of twelve documents relevant to a compliance question, six are current and present, one is superseded but still retrievable (the midnight email waiting to happen), three exist in cross-system databases never connected to the retrieval corpus, and two were never digitized. Perfect retrieval over this corpus returns the wrong answer because half the required knowledge is missing or stale. This is Corpus Failure: the most expensive fallacy to leave undiagnosed and the cheapest to confirm.

One category of corpus failure deserves particular mention because it foreshadows the deepest problem: documents that are technically present but informationally incomplete because ingestion could not handle their modality. A table in a scanned PDF whose structure was flattened by OCR into a sequence of numbers without column headers. A figure whose caption carries the conclusion — "Compound B shows dose-dependent hepatotoxicity at concentrations above 200nM" — but whose visual content (the actual dose-response curve) was discarded because the ingestion pipeline treats images as decorations. A handwritten annotation in the margin of a protocol — the principal investigator's note "DO NOT EXCEED 300mg — see adverse event log 2019-Q3" — that was never digitized because handwriting recognition was not in scope. These are not missing documents. They are present documents with missing knowledge, amputated by ingestion pipelines that treat text extraction as the entire job. The multi-modal ingestion problem is the subject of Article II.

Memory science has a name for this too. Marcia Johnson's source monitoring framework (1993) documents how false memories arise not from missing content but from missing provenance — the brain tracks not just what was experienced but where it came from, when, and with what level of authority. Confident misremembering happens when source bindings degrade. Naïve RAG retrieves documents without provenance topology. The 1998 superseded protocol and the 2024 amendment arrive in the prompt as epistemic equals, citation strings notwithstanding. The model has no mechanism for source monitoring because the substrate has no source bindings to monitor. Source monitoring in a knowledge system requires explicit provenance topology: what supersedes what, what amends what, what was retracted and why. These are ingestion problems. They are Article II problems.

No embedding, reranker, or query expansion can retrieve a document that is not in the corpus. The diagnostic shifts once you see this fallacy. The question is no longer "did retrieval return the best match?" but: Is the right document in the corpus at all? Is it current? Is it complete for this question? Are there other documents that modify, contradict, or supersede it? These are knowledge engineering questions, not retrieval questions. They are the questions the field has been most reluctant to fund.
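The diagnostic questions above can be turned into a tally before a single query is run. What follows is a minimal sketch, assuming the twelve per-document statuses listed in the Figure 8 audit. The status labels (`current`, `superseded`, `cross_system`, `never_digitized`) and the function name `audit_summary` are this sketch's own inventions, not part of any tool the article describes.

```python
from collections import Counter

# Statuses transcribed from the Figure 8 audit list.
# The four label strings are hypothetical names chosen for this sketch.
AUDIT = {
    "SOP-2024-MFG": "current",
    "SOP-2022-MFG": "superseded",
    "SPEC-RAW-042": "current",
    "SQR-SUPPLIER-K": "cross_system",
    "VAL-PROC-2024": "current",
    "CAPA-2023-117": "never_digitized",
    "BR-LOT-8847": "current",
    "DEV-REPORT-19": "cross_system",
    "STAB-STUDY-Q4": "current",
    "LAB-NOTE-DR-K": "never_digitized",
    "REG-GUID-FDA": "current",
    "REG-AMEND-2024": "cross_system",
}

def audit_summary(audit):
    """Tally required documents by status and report what retrieval can never see."""
    counts = Counter(audit.values())
    total = len(audit)
    invisible = counts["cross_system"] + counts["never_digitized"]
    return {
        "total": total,
        "counts": dict(counts),
        "usable_fraction": counts["current"] / total,
        "invisible_fraction": invisible / total,
        # Superseded documents are the dangerous ones: present, retrievable, wrong.
        "stale_but_retrievable": sorted(
            doc for doc, status in audit.items() if status == "superseded"
        ),
    }

summary = audit_summary(AUDIT)
print(summary)
```

A real audit would start from the list of documents a domain expert says the question requires, not from whatever the corpus already contains. That list is the part no retrieval metric can supply.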


The Diagnosis: Asking a Search Engine to Reason

The four fallacies are symptoms. The diagnosis is what they share — what I will call the substrate problem for the rest of this series. RAG was designed for question-answering over document collections, and we are asking it to do knowledge engineering. This is not a failure of implementation. It is a failure of scope.

Semantic search over chunked documents can find relevant passages. It cannot represent relational topology, preserve document structure, maintain interpretive context, validate temporal correctness, or reason over hierarchical knowledge — because it was never designed to. Pointing a search engine at a reasoning problem is not a performance issue you optimize away. It is a category error.

The fundamental claim is that naïve RAG treats knowledge as a flat collection of documents that can be ranked by similarity and fed into a language model. Real knowledge is not flat. A pharmacovigilance database contains adverse event reports linked to specific lots, manufactured at specific facilities, using specific processes, with specific raw materials, from specific suppliers, administered to specific patient populations with specific comorbidities. That is a graph traversal problem with temporal, probabilistic, and regulatory constraints. Calling it a document retrieval problem is like calling a metabolic pathway a grocery list.
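To make "graph traversal" concrete, here is a minimal sketch of the chain described above, from an adverse event through its lot to a facility, a process, a raw material, and a supplier. Every identifier and relation name is hypothetical, and a production system would add the temporal and regulatory constraints this toy omits.

```python
from collections import deque

# Hypothetical typed edges: (source, relation, target). All identifiers are invented.
EDGES = [
    ("AE-1042", "linked_to_lot", "LOT-8847"),
    ("LOT-8847", "manufactured_at", "FACILITY-02"),
    ("LOT-8847", "made_with_process", "PROCESS-V3.1"),
    ("LOT-8847", "used_raw_material", "RAW-042"),
    ("RAW-042", "supplied_by", "SUPPLIER-K"),
]

def build_index(edges):
    """Group outgoing (relation, target) pairs by source node."""
    index = {}
    for src, rel, dst in edges:
        index.setdefault(src, []).append((rel, dst))
    return index

def trace(start, index):
    """Breadth-first walk: return (relation path, node) for every node reachable from start."""
    seen = {start}
    queue = deque([(start, [])])
    reached = []
    while queue:
        node, path = queue.popleft()
        for rel, dst in index.get(node, []):
            if dst not in seen:
                seen.add(dst)
                new_path = path + [rel]
                reached.append((new_path, dst))
                queue.append((dst, new_path))
    return reached

reached = trace("AE-1042", build_index(EDGES))
for path, node in reached:
    print(" -> ".join(path), "=>", node)
```

The supplier sits three typed hops from the adverse event. A similarity search over chunks has no representation of those hops at all; it can only hope the right passages happen to share vocabulary.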

Every advanced approach improves the retrieval layer while leaving the ingestion layer untouched. The work that actually fixes the problem routes downward, not upward. It routes into ingestion.


Auditing the Advanced Stack and Where It Stops

A fair objection, and one this article would be dishonest to dodge: the field has not been idle. Microsoft's GraphRAG (Edge et al., 2024) introduced entity-level knowledge graphs with community detection. HippoRAG made the hippocampal indexing analogy explicit and built graph-based retrieval with personalized PageRank. Anthropic's Contextual Retrieval prepends to each chunk, before embedding, a short LLM-generated description of where that chunk sits in its parent document. RAPTOR builds recursive summary trees for multi-granularity retrieval. Agentic RAG wraps the entire pipeline in multi-step reasoning with tool use and self-correction.

Each represents real engineering against real limitations. The question is not whether they improve on naïve RAG — they do, measurably, on the benchmarks they target. The question is which of the four substrate fallacies each one actually resolves, and which survive the upgrade.

The substrate audit
Approaches compared: Naïve RAG · GraphRAG · HippoRAG · Contextual Retrieval · RAPTOR · Agentic RAG
Fallacies audited:
I · Representation failure: geometry replaces topology
II · Structural amputation: chunking severs cross-references
III · Composition failure: flat concatenation, no scaffolding
IV · Corpus failure: wrong, missing, superseded documents
Ratings: Resolved · Narrowed · Untouched · New risk

Figure 4 (interactive). The substrate audit. Each advanced approach is measured against the four fallacies. No approach achieves "Resolved" on any fallacy. GraphRAG and HippoRAG narrow three of four by introducing relational structure — real progress, honestly earned. But Fallacy IV (corpus failure) survives every upgrade because it is not a retrieval problem. Agentic RAG introduces a new risk: confident, fluent, expensive wrong answers when the substrate is incomplete.

The pattern is stark once the matrix is laid out. Every advanced approach improves the retrieval layer while leaving the ingestion layer untouched. GraphRAG comes closest to substrate work because it builds explicit structure — but its structure is only as good as its extraction pipeline, and its extraction pipeline is a general-purpose LLM running over chunks. The extraction has no domain ontology telling it what matters: that mechanism-of-action is different from target-of-action, that a Phase II failure carries different epistemic weight than a Phase III success, that "inhibitor" in the context of a kinase and "inhibitor" in the context of a protease are not the same relationship. Those distinctions are ingestion problems. They are Article II problems.

The Agentic Trap: Confidence Amplification Over a Broken Substrate

One fallacy deserves particular attention. Fallacy IV, Corpus Failure, is the only fallacy that worsens under the most sophisticated orchestration. An agent that reasons over an incomplete corpus for three steps produces a more confident, more fluent, and more wrong answer than naïve RAG over the same corpus. The reasoning feels thorough. The citations look authoritative. The answer is built on a foundation that was never audited. This is not a theoretical risk. It is the same failure mode under a higher token budget.

The mechanism is confidence amplification. Each reasoning step in an agentic loop draws on the same broken substrate, and each step launders the corpus's gaps through another layer of plausible synthesis. The model has no mechanism for questioning the substrate: nothing in the loop asks "is the right document even present?" By step three, the agent has built a fluent, internally consistent, well-cited argument on top of evidence that was incomplete before the first hop. The citations are real. The reasoning is sound. The conclusion is wrong with a confidence the substrate could never have justified. Of the four fallacies, this is the one where the upgrade path makes things worse, not better.

When teams sense the substrate problem but cannot name it, the typical response is to reach upward — agentic retrieval loops, contextual chunking, longer context windows, more rerankers. That escalation has its own shape.

The Escalator of Fixes
Complexity rises layer by layer while the substrate stays broken.

5. Agentic orchestration: multi-step reasoning loops, tool use, self-reflection, retry logic
   +$2.40/query · +8s latency · 12× token budget · substrate unchanged
4. Contextual chunking + reranking: parent-document summaries, cross-encoder rerankers, query expansion
   +$0.60/query · +3s latency · 4× token budget · substrate unchanged
3. Longer context windows: 128k→1M tokens, stuff more chunks, hope attention covers it
   +$0.30/query · attention sag at scale · substrate unchanged
2. Better embeddings + hybrid search: domain-tuned models, BM25+dense fusion, ColBERT late interaction
   +$0.08/query · marginal recall gains · substrate unchanged
1. Naïve RAG: chunk → embed → cosine retrieve → concatenate → generate
   Baseline · the midnight email lives here

↓ THE SUBSTRATE ↓
Ingestion pipeline · corpus quality · knowledge structure: untouched at every layer
Meters (default view): Complexity Low · Cost per query $0.02 · Substrate health Broken

Figure 5 (interactive). The Escalator of Fixes. Each layer adds latency, cost, and complexity while the substrate — the ingestion pipeline, corpus quality, and knowledge structure — remains untouched. The tell is the bottom meter: substrate health is identical at every layer. Teams that sense the problem reach upward. The fix routes downward.

The escalator shows the cost of reaching upward. The inversion below shows why the field keeps reaching in the wrong direction.

The percentages that follow are derived from post-mortem analyses across regulated-domain deployments, not from an industry survey. The qualitative pattern is what survives the hedge; the specific numbers are directional. The inversion is the finding, not the decimals.

The investment inversion
Left: where the field spends effort, attention, and funding. Right: where production failures originate.

Where effort goes
Agent orchestration · 35% · Conference talks, VC funding, papers
Retrieval optimization · 28% · Embeddings, rerankers, hybrid search
Chunking strategy · 18% · Splitting heuristics, overlap tuning
Prompt engineering · 12% · System prompts, few-shot, CoT
Ingestion & corpus · 7% · "Plumbing", treated as solved

Where failures originate
Orchestration bugs · 5%
Retrieval misses · 12%
Structural severing · 18%
Composition errors · 23%
Corpus & ingestion failures · 42% · Wrong, missing, superseded, never digitized

The field invests most where failures are rarest, and least where failures are most common.

Figure 5b (interactive). The investment inversion. Left pyramid: where the RAG ecosystem allocates engineering effort, conference attention, and venture funding — widest at the orchestration layer. Right pyramid: where production failures actually originate — widest at the ingestion and corpus layer. The pyramids are inverted. The field spends roughly 35% of its effort on agent orchestration, which accounts for approximately 5% of production failures. It spends roughly 7% of its effort on ingestion and corpus quality, which accounts for approximately 42% of production failures. The label in the center is the diagnosis: inverted.
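The inversion can be expressed as a single number per layer: failure share divided by effort share. The sketch below uses the directional percentages from the figure. It assumes each effort row pairs with the failure row in the same position; the caption states that pairing explicitly only for orchestration and ingestion, so the three middle pairings and the layer names are this sketch's reading.

```python
# (layer, effort share %, failure share %) from the investment-inversion figure.
# Row-wise pairing of the middle three layers is an assumption, not stated in the source.
LAYERS = [
    ("orchestration", 35, 5),
    ("retrieval", 28, 12),
    ("chunking / structure", 18, 18),
    ("prompting / composition", 12, 23),
    ("ingestion & corpus", 7, 42),
]

def failures_per_effort(layers):
    """Failure share divided by effort share; above 1.0 means the layer is underfunded."""
    return {name: failure / effort for name, effort, failure in layers}

ratios = failures_per_effort(LAYERS)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {ratio:5.2f} failure points per effort point")

# How much more failure each unit of effort has to cover at the bottom of the stack.
imbalance = ratios["ingestion & corpus"] / ratios["orchestration"]
print(f"ingestion vs orchestration: {imbalance:.0f}x")
```

Because the input percentages are directional, the ratios are too. The ordering of the layers is the part worth trusting.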


The Long-Context Objection

A fair structural challenge, and one this article would be intellectually dishonest to ignore: context windows are growing at roughly 10x per year. GPT-4 launched at 8K. Claude 3 offers 200K. Gemini 1.5 Pro claims 1M. At this trajectory, a 10M-token context window is a matter of engineering timing, not physics.

The objection runs like this: if the entire corpus fits in the context window, the chunking problem disappears. No structural amputation if you never cut. No composition failure if the model sees every document in full, with cross-references intact. Retrieval becomes unnecessary. Why retrieve when you can simply load everything?

The answer is the Jevons Paradox of Context Expansion, and my first draft of this section got it wrong. I framed it as teams stuffing more chunks into the same corpus, making over-retrieval economical. That is not the true paradox. The true paradox is deeper, and it undermines the long-context objection at its root.

The true Jevons Paradox is this: context expansion does not satisfy demand — it transforms it.

The Demand Treadmill: Context Expansion Cannot Eliminate Retrieval
Figure 4 (analytical). Top left: the demand treadmill. Question ambition (superlinear) outpaces window capacity (linear); the gap requires structured retrieval. Top right: question types evolve. As windows grow, teams shift from single-document lookup to corpus-spanning multi-hop reasoning. Bottom left: the multi-hop wall. A question like "cluster trials into phenotypic signatures, estimate pooled effects" requires 8+ hops across a 10M-page corpus, exceeding even 1M-token contexts. Bottom right: the paradox mechanism. Window growth transforms ambition, which transforms question complexity, which hits the wall.

William Stanley Jevons, writing in 1865 about coal consumption, observed that making a resource more efficient does not reduce its total use. It changes what people use it for. More efficient steam engines did not mean less coal burned. They meant coal became economical for applications that had previously been uneconomical, which increased total demand beyond what the efficiency saved. The resource expands; the ambition expands faster.

The same dynamic governs context windows. A team with a 4K window asks one class of question: "Find the dosing section in this protocol." "What adverse events were reported in cohort B?" These are single-document, single-hop questions. The window constrains the ambition.

A team with a 1M window does not ask the same questions over more documents. They ask an entirely different class of question: "Cluster all heart-failure trials by phenotypic signature — HFrEF with high comorbidity burden, HFrEF with low burden, HFpEF with metabolic syndrome — then estimate class-specific pooled treatment effects using Bayesian meta-analysis, accounting for trial quality, temporal drift in protocols, geographic variation in patient populations, and publication bias."

This question does not require "more chunks." It requires the entire corpus, organized, in a structured traversal that no context window can provide — not because the window is too small to hold the text, but because the reasoning requires operations the window cannot perform: cross-trial comparison, temporal stratification, subgroup decomposition, hierarchical pooling. Each operation is a hop across the knowledge graph, and each hop requires targeted retrieval against a specific subpopulation of the evidence base.

The Scale Wall: Real Corpus Sizes vs Context Window Capacity
Figure 5 (analytical). Left: real corpus sizes on a log scale against context window capacity markers. Only toy corpora fit in context windows. Every regulated-domain deployment exceeds capacity by orders of magnitude. Right: the long-context counterargument and three rebuttals: scale, the Jevons demand treadmill, and cost.

A 1M-token window holds roughly one thousand pages of text. The question above requires reasoning across a ten-million-page hospital corpus or a two-million-page FDA guidance library. The gap is three to four orders of magnitude. But even if the window could hold all ten million pages (suppose a hypothetical 10B-token context), the problem would not be solved, because multi-hop reasoning over unstructured text does not become tractable simply by loading more text. The model must: identify which trials are heart-failure trials; stratify them by phenotype; verify that the phenotypic assignment is consistent across decades of changing diagnostic criteria; identify the relevant endpoints for each stratum; compute pooled effects with appropriate weighting; account for temporal drift in how endpoints were defined and measured. Each of these steps requires structured access to a specific subset of the corpus. Loading the entire corpus into context and hoping attention sorts it out is not a strategy. It is a prayer.
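The scale argument is simple arithmetic, and it is worth running with your own corpus size. This sketch takes the paragraph's working figure of about one thousand tokens per page as given; real pages vary widely, so treat `TOKENS_PER_PAGE` as an assumption to replace. The function name `window_gap` is hypothetical.

```python
import math

TOKENS_PER_PAGE = 1_000  # implied by "1M tokens holds roughly one thousand pages"; an assumption

def window_gap(corpus_pages, window_tokens, tokens_per_page=TOKENS_PER_PAGE):
    """Return corpus size in tokens, how many windows it fills, and the gap in orders of magnitude."""
    corpus_tokens = corpus_pages * tokens_per_page
    factor = corpus_tokens / window_tokens
    return corpus_tokens, factor, math.log10(factor)

for label, pages in [("hospital corpus", 10_000_000), ("FDA guidance library", 2_000_000)]:
    tokens, factor, orders = window_gap(pages, window_tokens=1_000_000)
    print(f"{label}: {tokens:,} tokens, {factor:,.0f}x the window, {orders:.1f} orders of magnitude")
```

Swapping in a denser or sparser page estimate moves the result by a constant factor, not by an order of magnitude.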

The demand treadmill operates independently of corpus size. Even if the corpus were small enough to fit — say, a fifty-thousand-page pharmaceutical manufacturing library — the reasoning required for a question like the phenotypic clustering example would still need structured retrieval. The question is not "find me all the pages." The question is "find me the specific subset of trials that match this phenotypic signature, then the specific subset that match this other signature, then the relationships between the signatures, then the temporal evolution of treatment effects within each signature." This is graph traversal, not document loading.
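The shape of that hop-by-hop access can be sketched in a few lines: first select a stratum, then compute over only that stratum. The example question calls for Bayesian meta-analysis; this sketch deliberately swaps in fixed-effect inverse-variance pooling, which is far simpler, purely to show where structured access enters. The trial records, field names, and effect sizes are all invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical trial records; every value here is invented.
TRIALS = [
    {"id": "T1", "phenotype": "HFrEF-high-burden", "effect": -0.20, "se": 0.10},
    {"id": "T2", "phenotype": "HFrEF-high-burden", "effect": -0.10, "se": 0.10},
    {"id": "T3", "phenotype": "HFpEF-metabolic", "effect": 0.05, "se": 0.20},
]

def stratify(trials, key="phenotype"):
    """Hop 1: route each trial to its stratum. This needs typed metadata, not raw text."""
    strata = defaultdict(list)
    for trial in trials:
        strata[trial[key]].append(trial)
    return dict(strata)

def pool_fixed_effect(trials):
    """Hop 2: inverse-variance weighted mean and its standard error within one stratum."""
    weights = [1.0 / trial["se"] ** 2 for trial in trials]
    total = sum(weights)
    pooled = sum(w * trial["effect"] for w, trial in zip(weights, trials)) / total
    return pooled, math.sqrt(1.0 / total)

pooled_by_phenotype = {
    name: pool_fixed_effect(group) for name, group in stratify(TRIALS).items()
}
for name, (effect, se) in pooled_by_phenotype.items():
    print(f"{name}: pooled effect {effect:+.3f} (SE {se:.3f})")
```

The arithmetic is the easy part. The `phenotype` field is the hard part: it only exists if ingestion extracted and typed it before any question was asked.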

Long-context models are a genuine engineering achievement. They excel at tasks that require single-document comprehension across tens of thousands of tokens — legal contracts, academic papers, financial reports, codebases. They are not a substitute for structured retrieval over million-page corpora, and they are not a substitute for multi-hop reasoning that requires targeted access to specific subgraphs of the knowledge base at each reasoning step.

The chunking problem is not solved by bigger windows. It is solved by understanding what knowledge the chunks contain before you cut them, so that the retrieval system can route each hop of a multi-hop question to the precise subset of the corpus that hop requires. That requires treating ingestion as first-class engineering, which is the subject of Article II.

Three Tests You Can Run Tomorrow

Before Article II ships, three concrete tests a reader can run on their own corpus. Each takes a half-day or less. Each tells you something the dashboard does not. None require new infrastructure.

Test 1 · The calibrated contradiction probe. Pick three pairs of documents you can label by hand: one pair you know contradicts on a substantive issue (different versions of a protocol, two papers with opposite findings, a current SOP and its superseded predecessor), one pair you know agrees (the anchor and a paraphrase of it), and one pair that is unrelated (the anchor and a clearly off-topic document). Compute cosine similarity for all three pairs in your production embedding model. The diagnostic is not the absolute number — cosine baselines vary dramatically across models (text-embedding-3-small centers around 0.3 for unrelated text; some BGE and E5 variants center above 0.6 due to anisotropy) — it is the position of your contradicting pair between the unrelated floor and the paraphrase ceiling. If the contradicting pair sits near the ceiling, your retriever cannot tell contradiction from paraphrase. The fix routes through ingestion, not retrieval.

What to look for: Compute the position ratio (contradict − floor) ÷ (ceiling − floor). If it exceeds 0.85, your contradicting pair is geometrically indistinguishable from a paraphrase in your model's own distribution — active Fallacy I exposure that no retrieval optimization can fix. Between 0.65 and 0.85 is the gray zone where rerankers may help and the substrate is fragile. Below 0.65, the embedding model has at least partial separation capacity for this kind of contradiction. In our regulated-domain audits using text-embedding-3-large, contradicting pairs sat at ratio 0.88–0.96; the absolute cosine values were 0.88–0.94 and the paraphrase ceilings were 0.90–0.97 — the ratios are what survive the change of model.

# Test 1: The Calibrated Contradiction Probe
# Cosine baselines vary across embedding models. A bare 0.85 threshold
# misfires on models with high anisotropy. Calibrate against your own
# model's distribution: where does your contradicting pair sit between
# unrelated text (floor) and a known paraphrase (ceiling)?

from openai import OpenAI  # or your embedding provider
import numpy as np

client = OpenAI()
MODEL = "text-embedding-3-small"  # ← your production model

def embed_batch(texts):
    resp = client.embeddings.create(model=MODEL, input=texts)
    return [np.array(d.embedding) for d in resp.data]

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# ── Four documents: anchor, the contradicting one, a paraphrase, an unrelated ──
anchor      = "Compound A extends progression-free survival in EGFR+ NSCLC (PFS +4.2mo, Phase III)."
contradict  = "Compound B caused Grade 4 hepatotoxicity in 11% of EGFR+ NSCLC patients (Phase II, terminated)."
agreeing    = "Treatment with Compound A demonstrated significant PFS extension in EGFR-mutant NSCLC patients."
unrelated   = "Q3 marketing dashboard shows mobile engagement up 12% quarter-over-quarter."

v_anchor, v_contradict, v_agreeing, v_unrelated = embed_batch(
    [anchor, contradict, agreeing, unrelated]
)

floor       = cosine_sim(v_anchor, v_unrelated)    # floor: how low does your model go?
test        = cosine_sim(v_anchor, v_contradict)   # the diagnostic measurement
ceiling     = cosine_sim(v_anchor, v_agreeing)     # ceiling: paraphrase = "definitely similar"

ratio = (test - floor) / max(ceiling - floor, 1e-9)

print(f"  Floor    (unrelated)  : {floor:.4f}")
print(f"  Test     (contradict) : {test:.4f}")
print(f"  Ceiling  (paraphrase) : {ceiling:.4f}")
print(f"  Position ratio        : {ratio:.2f}   ((test - floor) / (ceiling - floor))")
print()

if ratio > 0.85:
    print("⚠ FALLACY I EXPOSURE")
    print("  Your contradicting pair is geometrically indistinguishable")
    print("  from a paraphrase. No retrieval optimization will fix this.")
elif ratio > 0.65:
    print("⚡ GRAY ZONE")
    print("  Contradicting pair sits in the upper half of your similarity")
    print("  distribution. Rerankers may help; the substrate is fragile.")
else:
    print("✓ Partial separation")
    print("  Your embedder distinguishes contradiction below the agreement")
    print("  ceiling. This narrows the cone — it does not close it.")

Your model's contradiction geometry
Run the calibrated probe above and note the three cosine values: floor cos(anchor, unrelated), test cos(anchor, contradicting), and ceiling cos(anchor, paraphrase). The position ratio (test − floor) ÷ (ceiling − floor) places the contradicting pair between the unrelated floor and the paraphrase ceiling in your own model's distribution, not against a borrowed threshold: below 0.65 is partial separation, 0.65 to 0.85 is the gray zone, above 0.85 is Fallacy I.
Example readout: floor 0.317 · test 0.912 · ceiling 0.938 · position ratio 0.96
⚠ FALLACY I EXPOSURE. Your contradicting pair sits at 96% of the distance from unrelated to paraphrase. It is geometrically indistinguishable from agreement in your model's own distribution. No retrieval optimization will fix this; the fix routes through ingestion.
Figure 8b (interactive). The calibrated contradiction-probe diagnostic. Cosine baselines vary across embedding models — text-embedding-3-small centers around 0.30 for unrelated text; some BGE and E5 variants center above 0.60 due to anisotropy — so a bare 0.85 threshold misfires. The diagnostic that survives the change of model is the position of the contradicting pair between the unrelated floor and the paraphrase ceiling. Type your three numbers from the probe; the marker moves to your ratio. The presets show what the same corpus looks like through different embedders — the absolute cosine values shift dramatically, the ratio does not.
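For readers without the interactive figure, the same readout can be computed offline from three numbers. This is a minimal sketch using the thresholds stated in Test 1 and the example values from the figure; the function names `position_ratio` and `classify` are hypothetical.

```python
def position_ratio(floor, test, ceiling):
    """Where the contradicting pair sits between the unrelated floor (0) and paraphrase ceiling (1)."""
    span = ceiling - floor
    if span <= 0:
        raise ValueError("ceiling must exceed floor; re-check the paraphrase and unrelated pairs")
    return (test - floor) / span

def classify(ratio):
    """Map a position ratio onto the three zones from Test 1."""
    if ratio > 0.85:
        return "fallacy I exposure"
    if ratio > 0.65:
        return "gray zone"
    return "partial separation"

# Example values from the figure above.
ratio = position_ratio(floor=0.317, test=0.912, ceiling=0.938)
print(f"{ratio:.2f} -> {classify(ratio)}")  # prints "0.96 -> fallacy I exposure"
```

The ratio can fall outside the 0 to 1 range if the contradicting pair scores above the paraphrase or below the unrelated document. Either case is a signal to pick cleaner calibration pairs before drawing conclusions.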

Test 2 · The retrieval-depth sweep. Run your eval set at retrieval depths of 1, 3, 5, 7, 10, and 15 chunks. Plot accuracy against depth. Where does accuracy peak? How big is the gap between peak and 15? If accuracy peaks early and then degrades — the pattern in Figure 7 — you are in Composition Failure, and adding rerankers or longer context windows will not fix it. The fix is to recover the structural relationships between chunks, not to score them more sharply.

What to look for: If accuracy peaks at k ≤ 5 and drops more than 8 percentage points by k = 15, you are in Composition Failure. If accuracy peaks at k ≤ 3, your retrieval is producing more noise than signal at typical production depths. If accuracy is flat or rising through k = 15, your corpus may be clean enough for naïve RAG — run Test 3 to confirm.
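Once the sweep has been run, reading it is mechanical. The sketch below applies the thresholds stated above to a dictionary of depth-to-accuracy results. The accuracy numbers are invented to mimic the shape in Figure 7, and the `flat_tolerance` parameter is this sketch's own assumption about how much wobble still counts as "flat".

```python
def diagnose_depth_sweep(accuracy_by_k, drop_threshold=8.0, flat_tolerance=1.0):
    """Classify a retrieval-depth sweep. Accuracy values are percentages keyed by depth k."""
    ks = sorted(accuracy_by_k)
    peak_k = max(ks, key=lambda k: accuracy_by_k[k])  # earliest depth wins ties
    deepest_k = ks[-1]
    drop = accuracy_by_k[peak_k] - accuracy_by_k[deepest_k]
    if peak_k <= 5 and drop > drop_threshold:
        verdict = "composition failure"
    elif drop <= flat_tolerance:
        verdict = "flat or rising: run Test 3"
    else:
        verdict = "inconclusive"
    return {
        "peak_k": peak_k,
        "drop_points": drop,
        "verdict": verdict,
        # A peak at k <= 3 followed by a real decline suggests more noise than signal.
        "noisy_retrieval": peak_k <= 3 and drop > flat_tolerance,
    }

# Hypothetical sweep results (percent correct at each depth); not measured data.
sweep = {1: 61.0, 3: 70.0, 5: 78.0, 7: 74.0, 10: 69.0, 15: 64.0}
result = diagnose_depth_sweep(sweep)
print(result)
```

Run the sweep on a fixed eval set with everything else held constant, otherwise a prompt or model change will masquerade as a depth effect.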

Test 3 · The wrong-answer audit. Take your last hundred incorrect answers. For each, classify: was the right document missing from the corpus (absent, superseded, or stranded in another system), was it in the corpus but not retrieved, or was it retrieved and the answer still came out wrong? The split is diagnostic. Most teams discover that more than half their failures are corpus problems they had been treating as retrieval problems. That is Fallacy IV staring back from the audit, and it is the cheapest fallacy to confirm and the most expensive one to leave undiagnosed.

What to look for: If more than 30% of your failures are corpus problems (right document not in corpus, wrong version retrieved, missing cross-system document), you have Fallacy IV exposure that no retrieval optimization can address. In our audits, the median split was 55% corpus problems, 30% retrieval problems, 15% generation problems. Most teams discover they have been spending 80% of their effort on the 15% slice.
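The bookkeeping for this audit fits in a spreadsheet, but a few lines of code keep the categories honest. This is a minimal sketch: the label names and their grouping into corpus, retrieval, and generation problems are this sketch's own taxonomy, and the sample labels are synthetic, shaped to match the median split quoted above.

```python
from collections import Counter

# Hypothetical label taxonomy; adapt the names to your own post-mortem notes.
CORPUS = {"not_in_corpus", "wrong_version", "cross_system"}
RETRIEVAL = {"not_retrieved"}
GENERATION = {"retrieved_but_wrong"}

def audit_split(labels, corpus_threshold=0.30):
    """Return the share of failures per category and flag Fallacy IV exposure."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labelled failures to audit")

    def share(group):
        return sum(counts[label] for label in group) / total

    split = {
        "corpus": share(CORPUS),
        "retrieval": share(RETRIEVAL),
        "generation": share(GENERATION),
    }
    split["fallacy_iv_exposure"] = split["corpus"] > corpus_threshold
    return split

# Synthetic hand labels for one hundred failures; not real audit data.
labels = (
    ["not_in_corpus"] * 30 + ["wrong_version"] * 15 + ["cross_system"] * 10
    + ["not_retrieved"] * 30 + ["retrieved_but_wrong"] * 15
)
split = audit_split(labels)
print(split)
```

The labelling itself has to be done by someone who knows what the right document would have been. That is usually a domain expert, not the engineer who built the pipeline.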

If two or more of these tests return uncomfortable answers, you are in good company, and Article II is for you.

Quick-start: the calibrated probe in under twenty lines

Run this against your production embedding model. Replace the four documents with your own. The diagnostic is the position of your contradicting pair between the unrelated floor and the paraphrase ceiling — not the absolute cosine number.

from openai import OpenAI
import numpy as np

client = OpenAI()
cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

anchor     = "Compound A extends progression-free survival in EGFR+ NSCLC (PFS +4.2mo)."
contradict = "Compound B caused Grade 4 hepatotoxicity in 11% of EGFR+ NSCLC patients."
agreeing   = "Compound A showed significant PFS extension in EGFR-mutant NSCLC."
unrelated  = "Q3 mobile engagement metrics rose 12% quarter-over-quarter."

emb = client.embeddings.create(
    model="text-embedding-3-large",
    input=[anchor, contradict, agreeing, unrelated]
).data
v = [np.array(e.embedding) for e in emb]

floor, test, ceiling = cos(v[0], v[3]), cos(v[0], v[1]), cos(v[0], v[2])
ratio = (test - floor) / (ceiling - floor)
print(f"floor {floor:.3f}  test {test:.3f}  ceiling {ceiling:.3f}  ratio {ratio:.2f}")
# ratio > 0.85 → Fallacy I exposure (contradict ≈ paraphrase in your geometry)

Diagnostic Compass: Symptom-to-Fallacy Mapping
Figure 2 (analytical). Left: symptom-to-fallacy heatmap mapping ten common production symptoms to four substrate fallacies. Right: four-question diagnostic flowchart for rapid triage. Most production RAGs have 2-3 fallacies active simultaneously. The fix routes downward (ingestion), not upward (orchestration).

A Preliminary Sketch of the Fix

This article has diagnosed without prescribing. Before Article II ships, a high-level sketch of the ingestion-first architecture — enough that a senior engineer can evaluate feasibility and cost.

The substrate thesis, restated: naïve RAG fails because the ingestion pipeline discards the relational, hierarchical, and temporal structure that makes knowledge useful. The fix is to build an ingestion pipeline that preserves that structure — or, where the structure does not exist in the source documents, constructs it through domain-aware extraction.

Layer 1: Multi-modal extraction

The ingestion pipeline must handle text (structured and unstructured), tables (with their relational semantics), figures (with their captions and visual content), and handwriting (the annotations, margin notes, and corrections that carry critical safety information). Current pipelines typically extract text and discard everything else. A clinical trial protocol whose dosing table was flattened into a sequence of numbers by OCR has lost the structural information that determines safe administration.

Layer 2: Ontology-aware extraction

Generic entity extraction is necessary but insufficient. What matters is typed relations: Compound A inhibits EGFR through competitive binding in the context of non-small-cell lung cancer with progression-free survival as the primary endpoint. This requires a domain ontology — not a generic biomedical ontology like UMLS, but a task-specific ontology that encodes what matters for the questions the system will actually receive.
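A typed relation can be as small as a frozen record. The sketch below encodes the example from the paragraph above and contrasts it with a second relation that differs only in mechanism, to show what an untyped subject-predicate-object triple throws away. The class name, field names, and the second relation are hypothetical; Article II's actual schema is not shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedRelation:
    """One extracted relation with the qualifiers that a bare triple discards."""
    subject: str
    predicate: str
    obj: str
    mechanism: str
    context: str
    endpoint: str

def flat_triple(relation):
    """What generic entity extraction typically keeps."""
    return (relation.subject, relation.predicate, relation.obj)

# The relation described in the text.
r1 = TypedRelation(
    "Compound A", "inhibits", "EGFR",
    mechanism="competitive binding",
    context="non-small-cell lung cancer",
    endpoint="progression-free survival",
)
# A hypothetical second relation, invented here, differing only in mechanism.
r2 = TypedRelation(
    "Compound A", "inhibits", "EGFR",
    mechanism="allosteric binding",
    context="non-small-cell lung cancer",
    endpoint="progression-free survival",
)

print(flat_triple(r1) == flat_triple(r2))  # the flat view cannot tell them apart
print(r1 == r2)                            # the typed view can
```

Which qualifiers belong on the record is exactly the task-specific ontology decision: the fields should be chosen from the questions the system will receive, not from what is easy to extract.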

Layer 3: Provenance topology

Every document enters the system with temporal metadata (when it was authored, when it was updated, when it was superseded) and epistemic metadata (what evidence supports it, what contradicts it, what its confidence bounds are). This is source monitoring — the explicit representation of provenance that allows the system to recognize when a 1998 protocol has been superseded by a 2024 amendment.
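The simplest piece of provenance topology is the supersession chain: given any document, find the version that is currently in force. Here is a minimal sketch, assuming each record carries an explicit `superseded_by` pointer. The record fields, document identifiers, and placeholder dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class DocRecord:
    doc_id: str
    authored: date
    superseded_by: Optional[str] = None  # id of the document that replaces this one

def resolve_current(doc_id, records):
    """Follow superseded_by pointers until reaching a document nothing supersedes."""
    seen = set()
    while records[doc_id].superseded_by is not None:
        if doc_id in seen:
            raise ValueError(f"supersession cycle involving {doc_id}")
        seen.add(doc_id)
        successor = records[doc_id].superseded_by
        if successor not in records:
            # The replacement exists somewhere else: a corpus gap, not a retrieval miss.
            raise LookupError(f"{doc_id} is superseded by {successor}, which was never ingested")
        doc_id = successor
    return doc_id

# Hypothetical records echoing the 1998 protocol and 2024 amendment; dates are placeholders.
records = {
    "PROTOCOL-1998": DocRecord("PROTOCOL-1998", date(1998, 1, 1), superseded_by="AMEND-2024"),
    "AMEND-2024": DocRecord("AMEND-2024", date(2024, 1, 1)),
}
print(resolve_current("PROTOCOL-1998", records))  # prints "AMEND-2024"
```

The `LookupError` branch is the interesting one. A pointer to a successor that is not in the corpus is Fallacy IV made visible at ingestion time instead of at query time.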

Layer 4: Quality gates and corpus auditing

Before any document enters the retrieval corpus, it passes through quality gates: is the extraction complete? Is the typing accurate? Is the provenance intact? And after ingestion, a continuous audit monitors: what percentage of relevant documents are missing? What percentage are superseded? What percentage contain extraction failures? These metrics are the dashboard for Fallacy IV — the only fallacy you can measure without running queries.
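The gates and the audit metrics can both be sketched as plain functions over document records. This is an illustration under stated assumptions: the record fields (`extraction_complete`, `types_validated`, `provenance`, `superseded_by`), the required provenance keys, and the document identifiers are all hypothetical, and a real gate would inspect the extraction output, not trust a boolean.

```python
REQUIRED_PROVENANCE = ("authored", "source_system")  # hypothetical field names

def quality_gate(doc):
    """Return the list of gate failures for one record; an empty list means admit."""
    failures = []
    if not doc.get("extraction_complete", False):
        failures.append("extraction incomplete")
    if not doc.get("types_validated", False):
        failures.append("typing unverified")
    provenance = doc.get("provenance", {})
    if any(field not in provenance for field in REQUIRED_PROVENANCE):
        failures.append("provenance incomplete")
    return failures

def audit_metrics(relevant_ids, corpus):
    """Percentages of relevant documents that are missing, superseded, or badly extracted."""
    relevant = list(relevant_ids)
    present = [d for d in relevant if d in corpus]

    def pct(items):
        return 100.0 * len(items) / len(relevant)

    return {
        "pct_missing": pct([d for d in relevant if d not in corpus]),
        "pct_superseded": pct([d for d in present if corpus[d].get("superseded_by")]),
        "pct_extraction_failures": pct(
            [d for d in present if not corpus[d].get("extraction_complete", False)]
        ),
    }

# Hypothetical corpus of three records; DOC-D is relevant but was never ingested.
good_provenance = {"authored": "2024-01-15", "source_system": "QMS"}
corpus = {
    "DOC-A": {"extraction_complete": True, "types_validated": True, "provenance": good_provenance},
    "DOC-B": {"extraction_complete": False, "types_validated": True, "provenance": good_provenance},
    "DOC-C": {"extraction_complete": True, "types_validated": True,
              "provenance": good_provenance, "superseded_by": "DOC-A"},
}
metrics = audit_metrics(["DOC-A", "DOC-B", "DOC-C", "DOC-D"], corpus)
print(quality_gate(corpus["DOC-B"]), metrics)
```

None of these numbers requires a query log or an eval set, only a list of the documents that ought to be there.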

This is not a small engineering effort. It is a different engineering effort from the one the field has been optimizing. This article argues not that it is easy but that it is necessary, and that the resources currently allocated to agentic orchestration would produce more reliable systems if redirected downward to the substrate.

What Comes Next

Naming these fallacies is not the same as escaping them. The substrate is not solved. The substrate is the problem.

The Flatland Fallacy: series architecture
The fix routes downward, from naming the failures to building the substrate.

Article I · You are here
The Fallacies
Name them, shame them, show how they propagate. Four substrate failures the RAG ecosystem has been optimizing around instead of through.
Delivers: diagnostic framework + 3 self-tests
The four fallacies are symptoms. They share a root cause: ingestion treated as plumbing. Article II starts there.

Article II · The Substrate Silence
From Shadows to Organized Objects
Multi-modal, ontologically aware ingestion as first-class engineering. Bitemporal provenance. The consolidation loop that turns a flat document collection into a typed knowledge object. The work the conference circuit ignores because it does not photograph well.
Delivers: reference ingestion architecture + pharma case study
Good ingestion produces a typed knowledge substrate. That substrate enables a different retrieval architecture, one that routes through graph topology, not cosine geometry. Article III builds on it.

Article III · The Promised Land
Bridging Islands of Knowledge
Privacy-preserving query across institutional and jurisdictional boundaries. Federated knowledge graphs that let organizations reason together without merging. Democratization without centralization. The shape of the cure is hiding in the connections we haven't yet been able to make.
Delivers: federation architecture + privacy model + simulated cross-institutional query
Architecture without implementation is a position paper. Article IV provides the artifact.

Article IV · Wiring the Cortex
An Open-Source Reference Implementation
A working system: ingest a realistic pharmaceutical corpus, build the typed knowledge graph, run hybrid retrieval over it, and demonstrate simulated federation between two providers. The article is the README.
Delivers: OSS repository + working demo + benchmark against naïve RAG
The code is the argument. If the substrate thesis is correct, the implementation should outperform naïve RAG on the exact failure modes Article I diagnosed, and do so at lower total cost.

The fix routes downward, not upward
Article I names the symptoms. Article II treats the cause. Article III shows the architecture. Article IV proves it works.

Figure 9 (interactive). The series architecture. Each article builds on the previous one’s deliverable. The dependency chain is the argument: you cannot build III without II, and you cannot know you need II without I.

The distinction between II and III matters. The architecture of Article III is the easy part. The ingestion work of Article II is the hard part. Most engineers who land in the substrate problem reach first for a graph database when they should be reaching for an ingestion engineer.

If you have read this far and recognized your system in the diagnostic figures, you are in good company. Bring the worst document in your corpus and the question your stakeholders keep asking that the system gets wrong. Article II will start there.

References

Anthropic. (2024). Introducing Contextual Retrieval. Anthropic Engineering Blog. https://www.anthropic.com/news/contextual-retrieval

Babadi, B., & Sompolinsky, H. (2014). Sparseness and expansion in sensory representations. Neuron, 83(5), 1213--1226.

Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024). Seven failure points when engineering a retrieval augmented generation system. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering (CAIN).

Bartlett, F. C. (1932). Remembering: A study in experimental and social psychology. Cambridge University Press.

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv:2402.03216.

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130.

Formal, T., Lassance, C., Piwowarski, B., & Clinchant, S. (2022). From distillation to hard negative sampling: Making sparse neural IR models more effective. Proceedings of the 45th International ACM SIGIR Conference, 2353--2359.

Gu, A., Sala, F., Gunel, B., & Ré, C. (2019). Learning mixed-curvature representations in product spaces. International Conference on Learning Representations.

Gutiérrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. Advances in Neural Information Processing Systems.

Jevons, W. S. (1865). The coal question: An inquiry concerning the progress of the nation, and the probable exhaustion of our coal mines. Macmillan.

Johnson, M. K., Hashtroudi, S., & Lindsay, D. S. (1993). Source monitoring. Psychological Bulletin, 114(1), 3--28.

Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference, 39--48.

Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems.

Lee, J., Dai, Z., Duddu, S. M. K., Lei, T., Naim, I., Chang, M.-W., & Zhao, V. (2024). Rethinking the role of token retrieval in multi-vector retrieval. International Conference on Machine Learning.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157--173.

Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems.

O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4(6), 661--682.

Roediger, H. L., & McDermott, K. B. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(4), 803--814.

Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). ColBERTv2: Effective and efficient retrieval via lightweight late interaction. Proceedings of NAACL, 3715--3734.

Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., & Manning, C. D. (2024). RAPTOR: Recursive abstractive processing for tree-organized retrieval. International Conference on Learning Representations.

Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80(5), 352--373.

Yassa, M. A., & Stark, C. E. L. (2011). Pattern separation in the hippocampus. Trends in Neurosciences, 34(10), 515--525.

Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S., & Reynolds, J. R. (2007). Event perception: A mind-brain perspective. Psychological Bulletin, 133(2), 273--293.

Raj Sakthi — Founder and Managing Partner, LatentGeneration.ai. Co-Founder (DurgAi) and Head of AI, Freshriver.ai. Three decades in knowledge engineering, applied AI/ML, and regulated-domain deployments, with a particular weakness for the foundations everyone else thinks are someone else's problem.

Next in the series — Article II: The Substrate Silence (From Shadows to Organized Objects). The work the conference circuit has been ignoring because it does not photograph well. Coming June 2026.