Diagnosing RAG Failures

  • Every RAG failure is either a topic gap (the content doesn't exist in your corpus) or a capability gap (the system can't process what's there) -- Jason Liu's framework
  • Take your last 20 failed queries, sort them into those two buckets -- you'll find patterns that point to specific fixes
  • Silent failures (corpus shrinkage from encoding bugs, stale embeddings, ingestion errors) are the deadliest because nothing looks broken
  • Evaluate what you didn't retrieve, not just what you did -- false negatives are harder to spot than bad answers

Part 1 showed a one-line tool call that did hybrid retrieval with condition-aware filtering, and it looked clean -- a tool fires, documents come back, the model generates an answer. This post is about what's inside that call, and more importantly, how to know when it's broken. Because the retrieval layer fails quietly in ways that the generation layer never does: a bad generation is obviously wrong, but a bad retrieval just looks like a plausible answer built on the wrong evidence.

I want to start with diagnosis rather than architecture for a specific reason. Most RAG tutorials open with "here's how to build the pipeline" and add evaluation as an afterthought. But if you can't tell why your system is failing, you can't tell which architectural improvement will actually help. A better embedding model won't fix a corpus that's missing the content users are asking about. A fancier reranker won't help if your chunks were garbled during ingestion. The guiding principle for this entire post is that retrieval quality is the ceiling on everything downstream -- you can have the best prompt engineering in the world and the most capable model on the market, and none of it matters if the right documents never make it into the context window.

Topics vs. Capabilities

Jason Liu's diagnostic framework gives you a clean way to sort failures into two buckets. A topic gap means the content your users need simply doesn't exist in the corpus. No amount of retrieval engineering can find a document that was never indexed. A capability gap means the content exists but the system can't surface it correctly -- maybe the chunking splits a critical table across two fragments, maybe the embedding model doesn't understand domain-specific abbreviations, maybe the query was too vague for semantic search to latch onto.

The distinction matters because the fixes are completely different. Topic gaps are content problems; you solve them by expanding and curating the corpus. Capability gaps are engineering problems; you solve them by improving chunking, retrieval strategies, query understanding, or the model itself. I've watched teams spend weeks fine-tuning embedding models when the real issue was that nobody had ever ingested the product documentation from the last three releases. Knowing which bucket a failure lives in saves you from optimizing the wrong thing.

Case Study: Construction Blueprint QA

Liu describes a construction company whose RAG-based assistant handled questions about building blueprints. When they sorted their failure logs, a clear pattern emerged -- roughly 20% of all queries were some variant of "how many windows are on the north elevation?" or "count the fire exits on floor 3." These object-counting questions were a topic gap disguised as a capability gap -- the team initially assumed their retrieval was failing, but the real problem was that their corpus contained blueprint PDFs with no extracted count data. The blueprints had the visual information, but the text extraction pipeline didn't produce anything a text-based retrieval system could match against.

Once they identified the pattern, the fix was targeted: they added a vision model to the ingestion pipeline that extracted object counts and spatial relationships from blueprint images, then indexed those as structured annotations alongside the original documents. User satisfaction on those query types went from ~50% to 87% in one week -- not because the retrieval got smarter, but because the content finally existed in a retrievable form.

That case also illustrates something I find consistently true about RAG debugging: the failure taxonomy tends to be lumpy, not evenly distributed. You don't usually have 50 different failure modes each accounting for 2% of errors. You have two or three dominant patterns that together explain the majority of bad retrievals. Find those and you get outsized returns from fixing them.

The Silent Failure Problem

The scariest RAG failures are the ones where everything appears to work fine. Liu describes a medical chatbot where the ingestion pipeline was processing documents encoded as UTF-8, but a subset of the corpus -- older clinical reference PDFs -- used Latin-1 encoding. The system didn't throw errors; it just silently garbled special characters, diacritics, and certain punctuation marks during text extraction, which degraded the embedding quality for those chunks enough that they effectively vanished from retrieval results. The team discovered the issue during a routine audit and found that 21% of the corpus had been functionally invisible since launch. For weeks, users asking questions that required those documents got confidently wrong answers built from the remaining 79% -- and nobody noticed because the system never returned empty results, it just returned the wrong ones.

This is the false negative blind spot that makes RAG evaluation fundamentally harder than generation evaluation. When a model hallucinates, someone eventually notices -- the output is wrong in a visible way. But when retrieval misses the right document and substitutes a plausible alternative, the answer can look perfectly reasonable while being subtly incorrect. You have to evaluate what didn't show up in the retrieval set, which requires knowing what should have shown up, and most teams don't have that ground truth readily available. (I'll cover how to build it systematically in Section 6.)

Applying the Framework

When I looked at the early failure patterns in my own diabetes coaching agent, this two-bucket framing made the problems legible almost immediately. The most common failure class was topic gaps. I had users asking about insulin pump troubleshooting -- how to recalibrate after a site change, what to do when the CGM readings diverge from finger sticks by more than 20% -- and the corpus simply didn't have pump-specific articles. It had plenty of general Type 2 diabetes management content, so the retriever dutifully returned those documents, and the model generated answers that were technically about diabetes but useless for someone wearing an Omnipod. The condition filter from Part 1 helped route Type 1 vs. Type 2 content correctly, but it couldn't conjure documents that didn't exist.

The capability gap showed up in drug interaction queries. A user would ask something like "can I take metformin with lisinopril?" and the system needed to cross-reference medication data with condition-specific contraindication information. Both halves of that information existed in the corpus -- we had medication fact sheets and we had clinical guideline documents -- but they lived in separate document collections with different schemas, and the retrieval layer had no mechanism for joining across them. The content was there; the system just couldn't compose it into a coherent answer. That's a capability gap, and the fix wasn't more content but better retrieval architecture (which I'll dig into in Section 3).

If you take one practical step from this section, make it Liu's exercise: pull your last 20 failed queries (or your last 20 thumbs-down ratings, or your last 20 cases where the model hedged with "I don't have enough information"), and sort them into topic gaps vs. capability gaps. You'll probably find that one bucket is much larger than the other, and that within each bucket, two or three specific patterns account for most of the failures. That's your roadmap -- not a general-purpose "improve retrieval quality" initiative, but a ranked list of concrete problems with concrete fixes. And once you've shipped those fixes, do it again in two weeks, because the distribution shifts as your corpus and user base evolve.
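The triage exercise itself needs nothing fancier than a tally. A minimal sketch, assuming a hand-labeled failure log -- the example queries mirror the ones above, and the bucket and pattern labels are illustrative, not a fixed taxonomy:

```python
from collections import Counter

# Hand-labeled failure log: (query, bucket, pattern) triples from reviewing
# the last N failed queries. Labels here are illustrative.
failures = [
    ("how do I recalibrate my pump after a site change", "topic_gap", "pump_troubleshooting"),
    ("can I take metformin with lisinopril", "capability_gap", "cross_collection_join"),
    ("what to do when CGM diverges from finger sticks", "topic_gap", "pump_troubleshooting"),
    ("latest ADA guidance on GLP-1 agonists", "topic_gap", "missing_recent_content"),
    ("metformin dosing with stage 4 CKD", "capability_gap", "cross_collection_join"),
]

bucket_counts = Counter(bucket for _, bucket, _ in failures)
pattern_counts = Counter(pattern for _, _, pattern in failures)

# The two or three dominant patterns are the roadmap.
for pattern, count in pattern_counts.most_common(3):
    print(f"{pattern}: {count}")
```

The lumpy distribution shows up immediately: a handful of patterns dominate, and each one maps to a concrete fix rather than a vague "improve retrieval" initiative.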

Query Understanding

  • Structured extraction -- parse user questions into typed fields (date ranges, filters, keywords) before hitting any search backend
  • Query transformation techniques (HyDE, decomposition, RAG-Fusion) can boost retrieval accuracy 8-10% and comprehensiveness 30-40%
  • Better queries beat better embeddings almost every time -- a well-structured query with a mediocre embedding model will outperform a raw query with a state-of-the-art one
  • Adds 500-700ms latency per query, but enables temporal reasoning, filter routing, and multi-backend dispatch that you can't get any other way

The single highest-ROI improvement I've made to any RAG system wasn't a better embedding model, a fancier reranker, or a smarter chunking strategy. It was spending an LLM call to actually understand the query before sending it to retrieval. Jason Liu makes this argument forcefully, and I think he's right -- structured query extraction matters more than moving from one embedding model to another, because it addresses the fundamental mismatch between how users ask questions and how documents are stored.

What Liu calls "dumb RAG" -- the pattern where you embed the user's raw question and do a nearest-neighbor lookup -- breaks in at least four ways. First, there's a query-document mismatch: the user asks "what should I eat before a morning run?" but the corpus stores content titled "pre-exercise nutrition guidelines for Type 2 diabetes," and the semantic gap between those two phrasings is wider than most embedding models can bridge reliably. Second, you're stuck with a monolithic search backend that treats every query the same way, when some questions need keyword matching ("metformin side effects") and others need conceptual search ("how to manage blood sugar spikes after meals"). Third, text search alone can't handle queries with implicit temporal or structural constraints -- "recent guidelines" means the user cares about publication date, but raw embedding search has no concept of recency. And fourth, there's the missing user context problem: "what should I eat?" means something completely different for a patient with Type 1 diabetes on an insulin pump versus someone managing prediabetes through diet alone, and that context needs to inform the query before retrieval happens, not after.

from datetime import date
from typing import Optional

from pydantic import BaseModel

class SearchQuery(BaseModel):
    """Structured query extracted from natural language."""
    rewritten_query: str
    date_range: Optional[tuple[date, date]] = None
    source_filters: list[str] = []
    keywords: list[str] = []
    document_type: Optional[str] = None

That SearchQuery model is the interface between the LLM's understanding of the user's intent and the retrieval layer's filtering capabilities. An LLM call parses the raw user question into those typed fields, and now you can route a query like "any new ADA guidelines on insulin dosing from the last year?" to a date-filtered search over clinical guideline documents specifically, instead of embedding the entire sentence and hoping cosine similarity does the right thing. The extraction cost is one additional LLM call per user query -- typically 200-400ms with a fast model -- and it pays for itself immediately.

In Part 1, I showed the search_knowledge_base tool with its condition_type, tags, and document_type parameters. What I didn't show was the query enrichment logic that runs before that tool fires. The LLM doesn't just pass through the user's raw question; it builds a structured query that takes advantage of what we know about the patient's profile and the shape of our corpus.

async def enrich_query(raw_query: str, patient: PatientProfile) -> SearchQuery:
    """Transform raw user question into a structured retrieval query."""
    query_lower = raw_query.lower()

    # Auto-detect condition type from patient profile
    condition_type = patient.diagnosis  # e.g., "type_2", "type_1", "prediabetes"

    # Route document_type based on query intent
    doc_type = None
    if any(kw in query_lower for kw in ["dose", "medication", "drug", "side effect"]):
        doc_type = "medication_reference"
    elif any(kw in query_lower for kw in ["guideline", "recommendation", "standard"]):
        doc_type = "clinical_guideline"
    elif any(kw in query_lower for kw in ["recipe", "meal", "eat", "food", "diet"]):
        doc_type = "nutrition"

    # Tag routing: map query topics to corpus tag taxonomy
    tag_map = {
        "insulin": ["insulin", "injection", "basal", "bolus"],
        "exercise": ["physical_activity", "exercise", "fitness"],
        "monitoring": ["cgm", "blood_glucose", "a1c", "monitoring"],
    }
    matched_tags = []
    for topic, tags in tag_map.items():
        if topic in query_lower:
            matched_tags.extend(tags)

    # Build the structured query; matched tags ride along as keywords
    # so they feed the sparse (BM25) side of retrieval
    return SearchQuery(
        rewritten_query=await rewrite_with_llm(raw_query, patient),
        source_filters=[condition_type] if condition_type else [],
        keywords=extract_medical_terms(raw_query) + matched_tags,
        document_type=doc_type,
    )

The keyword-based routing here is deliberately simple -- pattern matching on known terms rather than a classifier. I tried a fine-tuned intent classifier early on and found that it added 150ms of latency for a marginal accuracy improvement over keyword matching, because the diabetes domain has a compact enough vocabulary that simple string matching catches 90%+ of cases correctly. The rewrite_with_llm call handles the harder part, rephrasing the query to match corpus terminology and injecting patient-specific context that the user left implicit. When a patient with Type 2 diabetes asks "what should I eat before exercise?", the rewriter produces something like "pre-exercise nutrition recommendations for Type 2 diabetes, oral medication management" -- because the retriever needs that specificity to pull the right documents, even though the user would never phrase it that way.

Query Transformation Techniques

Structured extraction handles the "what do you mean?" problem, but there's a second class of techniques that handle the "how do I find it?" problem -- transforming the query itself to improve retrieval recall. I've tested four of these in production, and each solves a different failure mode.

HyDE (Hypothetical Document Embeddings) is the one that surprised me most. The idea is counterintuitive -- you ask the LLM to generate a hypothetical answer to the user's question, then embed that answer instead of the question. Even if the generated answer is factually wrong (and it often is), it captures the vocabulary, phrasing patterns, and topical neighborhood of a real answer far better than the original question does. A question lives in "question space" and a document lives in "answer space," and those two regions of the embedding manifold don't always overlap well. HyDE bridges that gap by converting the question into something that looks like a document. I found it helps most when the query is short or abstract ("managing fatigue") and the corpus contains detailed clinical prose -- the hypothetical answer expands the sparse query into something rich enough for meaningful similarity matching. It doesn't help much when the query is already specific and technical, because the original embedding is already close to the target region.
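The mechanics of HyDE fit in a few lines. A sketch, with a toy bag-of-words "embedding" and a stand-in LLM so the example runs end to end -- in a real system you'd swap in your LLM client, embedding model, and vector store:

```python
def toy_embed(text: str) -> dict[str, int]:
    """Stand-in embedding: a sparse bag-of-words vector."""
    counts: dict[str, int] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def toy_similarity(a: dict[str, int], b: dict[str, int]) -> float:
    """Dot product over the shared vocabulary."""
    return float(sum(a[w] * b[w] for w in a.keys() & b.keys()))

def hyde_search(question: str, generate, corpus: dict[str, str]) -> str:
    """Embed a hypothetical answer instead of the question itself."""
    hypothetical_answer = generate(question)  # may be factually wrong; that's fine
    query_vec = toy_embed(hypothetical_answer)
    # Rank corpus chunks by similarity to the hypothetical answer.
    return max(corpus, key=lambda doc_id: toy_similarity(query_vec, toy_embed(corpus[doc_id])))

# Stand-in "LLM" that expands a sparse question into answer-like prose.
fake_llm = lambda q: ("pre-exercise nutrition guidelines recommend a small "
                      "carbohydrate snack before physical activity")
corpus = {
    "doc_a": "pre-exercise nutrition guidelines for type 2 diabetes recommend carbohydrate intake",
    "doc_b": "insulin pump site rotation and recalibration procedures",
}
best = hyde_search("what should I eat before a morning run?", fake_llm, corpus)
```

The raw question shares almost no vocabulary with doc_a, but the hypothetical answer does -- which is exactly the question-space-to-answer-space bridge described above.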

Query decomposition handles multi-hop questions that require information from multiple documents. When a user asks "how does metformin interact with the ketogenic diet for someone with kidney concerns?", that's actually three sub-questions packed into one: metformin's mechanism and dietary considerations, the metabolic effects of a ketogenic diet on blood glucose, and renal function monitoring with metformin. No single document in the corpus is likely to answer all three. By decomposing the query into focused sub-questions and retrieving for each, you pull from three different regions of the corpus and assemble a more complete answer. The trick is knowing when to decompose -- I use a simple heuristic where queries containing two or more distinct medical concepts (detected via the keyword extraction from the structured query step) get automatically decomposed.
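The decompose-when-multi-concept heuristic can be sketched like this. The concept set is illustrative -- in the real system the concepts come out of the keyword extraction step of the structured query, and the sub-question generation is an LLM call:

```python
# Illustrative concept vocabulary; the real list comes from keyword extraction.
MEDICAL_CONCEPTS = {"metformin", "ketogenic", "kidney", "insulin", "cgm", "a1c", "glp-1"}

def should_decompose(query: str) -> bool:
    """Decompose when two or more distinct medical concepts appear."""
    found = {c for c in MEDICAL_CONCEPTS if c in query.lower()}
    return len(found) >= 2

def decompose(query: str, generate) -> list[str]:
    """Ask an LLM (stand-in callable) for sub-questions; else pass through."""
    if not should_decompose(query):
        return [query]
    return generate(query)

q = "how does metformin interact with the ketogenic diet for someone with kidney concerns?"
# Three concepts match (metformin, ketogenic, kidney), so this query gets decomposed.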

Step-back prompting works in the opposite direction. Instead of making the query more specific, you make it more conceptual. A user asking "why did my blood sugar spike to 280 after eating white rice?" is really asking about glycemic index, postprandial glucose response, and carbohydrate metabolism. The step-back reframes the question to its conceptual core, which retrieves broader, more foundational documents that provide the explanatory framework rather than just a narrow factual answer. I use this selectively -- it's most useful for "why" questions where the user needs understanding, not just facts.

RAG-Fusion is the brute-force option, and honestly the most consistently effective. You generate 3-5 variations of the original query via an LLM call, retrieve separately for each variation, then fuse the results using Reciprocal Rank Fusion (RRF). The variations capture different phrasings and angles that a single query would miss. The research numbers are solid -- Rackauckas (2024) reports +8-10% accuracy improvement and +30-40% on comprehensiveness metrics. But there's a practical ceiling: I cap at 5 sub-queries because beyond that, the retrieved sets start drifting off-topic, and the fusion step surfaces documents that are only tangentially related. The latency cost scales linearly with the number of variations (since you're doing N separate retrieval calls), so 3 variations is my default and 5 is reserved for complex queries that trigger decomposition.

Shifting Work from Runtime to Ingestion

One of Liu's observations that I think deserves more attention is that you can shift much of the query understanding cost to ingestion time. Instead of embedding raw text chunks, you run an LLM over each chunk at ingestion to generate the questions that chunk would answer, then embed the question-answer pairs. When a user's question comes in at runtime, it's now being compared against other questions in embedding space rather than against document prose -- a much tighter semantic match. This is essentially HyDE in reverse. Instead of transforming the query to look like a document at query time, you transform the documents to look like queries at ingestion time. The computation is the same, but you pay it once during indexing rather than on every request. For a corpus that changes weekly but gets queried thousands of times a day, the math is obvious.
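The ingestion-time variant can be sketched as an index from synthetic questions back to chunks. The generate_questions callable stands in for the per-chunk LLM call, and the toy generator here exists only so the example runs:

```python
def index_chunk_questions(chunks: dict[str, str], generate_questions) -> dict[str, str]:
    """Map each synthetic question back to the chunk that answers it.

    At query time you embed the user's question and search over these
    question keys -- questions compared to questions in embedding space.
    """
    question_index: dict[str, str] = {}
    for chunk_id, text in chunks.items():
        for question in generate_questions(text):
            question_index[question] = chunk_id
    return question_index

# Stand-in "LLM": emits one canned question per chunk.
fake_generate = lambda text: [f"what does this cover: {text.split('.')[0].lower()}?"]
chunks = {"c1": "Metformin dosage adjustments in CKD. Reduce by 25% below eGFR 30."}
index = index_chunk_questions(chunks, fake_generate)
```

The per-chunk LLM cost is paid once at indexing; every runtime query benefits from the tighter question-to-question match.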

All of these techniques share a tradeoff. Query understanding adds 500-700ms of latency to the retrieval path -- the structured extraction call, the optional query transformation, and potentially multiple retrieval passes for fusion or decomposition. In my diabetes agent, the median query path with enrichment runs about 600ms before the first retrieval call even fires. But the alternative is sending raw queries to a dumb similarity search and hoping for the best, which is faster but wrong more often in ways that are expensive to debug later. And the structured query unlocks capabilities you simply can't get without it: temporal reasoning (filtering by publication date), filter routing (sending medication queries to the drug reference collection and nutrition queries to the recipe corpus), and multi-backend dispatch (keyword search for drug names, semantic search for conceptual questions). For every production RAG system I've worked on, the 600ms has been worth it.

Building the Retrieval Pipeline

  • Chunking -- start with 400-512 tokens, 10-20% overlap; Anthropic's contextual chunking (prepend LLM-generated context per chunk) cuts retrieval failures by 49%, 67% with reranking on top; embedding models silently truncate oversized inputs, so chunking matters even with 200K context windows
  • Hybrid retrieval -- combine dense embeddings + sparse BM25 via Reciprocal Rank Fusion; 18% improvement over dense-only, and BM25 is 10x faster on keyword-heavy queries
  • Reranking -- cross-encoder rerankers add +33-40% accuracy for ~120ms latency; over-fetch 20-50 candidates, rerank down to 5-10
  • Keep everything in one database (full-text search + vector embeddings + SQL filters) rather than maintaining separate indices per collection
  • Put your most relevant chunks first and last in the context window -- the lost-in-the-middle effect shapes ordering strategy

With query understanding handling the "what does the user actually want?" question, we can turn to the retrieval pipeline itself -- the machinery that takes a structured query and returns ranked, relevant chunks. I'll walk through the four stages in order -- chunking the corpus, retrieving candidates via hybrid search, reranking the results with cross-encoders, and ordering the final set for the context window. Each stage has a surprisingly large impact on end-to-end answer quality, and most of the gains come from getting the fundamentals right rather than chasing exotic techniques.

Chunking

Start boring. I mean this seriously -- the temptation to build a sophisticated semantic chunker on day one is strong, but a RecursiveCharacterTextSplitter at 400-512 tokens with 10-20% overlap will get you surprisingly far. That's roughly a paragraph and a half of prose, which tends to be a natural unit of coherent thought in most technical documents. The overlap ensures you don't lose context at chunk boundaries, where a key sentence might straddle two pieces. I spent a week building a custom markdown-aware splitter for the diabetes agent before discovering that the recursive character splitter with slightly tuned parameters performed within 2% on retrieval recall. Sometimes the boring solution is the right one.
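The boring baseline is small enough to write from scratch. A minimal fixed-size chunker with overlap, splitting on whitespace tokens -- a production splitter would respect sentence and section boundaries, and you'd count tokens with your embedding model's tokenizer rather than a whitespace split:

```python
def chunk_tokens(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    """Slide a chunk_size window over the token stream, sharing `overlap` tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_tokens(doc, chunk_size=400, overlap=60)
# 1000 tokens -> 3 chunks; each adjacent pair shares 60 boundary tokens.
```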

But there's a silent failure mode that catches people off guard, and it's worth flagging before we go further. Embedding models truncate inputs that exceed their maximum sequence length, and they do this without raising an error or warning. If your model supports 512 tokens and you hand it a 900-token chunk, it embeds the first 512 tokens and ignores the rest. The embedded representation doesn't reflect the full content of the chunk, which means similarity search is working with a degraded signal for every oversized input. This is why chunking still matters even in a world of 200K context windows -- the context window limit affects the generation model, but the embedding model's sequence length limit is what constrains your retrieval quality, and those limits are typically much smaller (512-8192 tokens depending on the model). Always check your embedding model's max sequence length and ensure your chunks fit within it.
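A cheap guard against this failure mode is to check every chunk against the model's max sequence length before indexing. A sketch -- count_tokens is a stand-in, and the whitespace approximation below should be replaced with your embedding model's actual tokenizer:

```python
def find_oversized(chunks: list[str], max_seq_len: int, count_tokens) -> list[int]:
    """Return indices of chunks the embedding model would silently truncate."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_seq_len]

approx_count = lambda text: len(text.split())  # crude stand-in tokenizer
chunks = ["short chunk", " ".join(["word"] * 900)]
oversized = find_oversized(chunks, max_seq_len=512, count_tokens=approx_count)
# The 900-token chunk would be truncated at index time; flag it for re-chunking.
```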

Once the basics are solid and you've identified chunking as a genuine bottleneck through the diagnostic framework from Section 1, there are three techniques worth investing in.

Anthropic's contextual chunking was the biggest single improvement I found when I tested it on the diabetes corpus. The idea is elegant. For each chunk, you give the full source document to an LLM along with that specific chunk, and ask it to generate 50-100 tokens of context that situates the chunk within the larger document. That context gets prepended to the chunk before embedding. So a chunk that originally read "Reduce dosage by 25% if eGFR falls below 30 mL/min" becomes "This excerpt is from the ADA 2024 clinical guidelines on metformin use in patients with chronic kidney disease. It discusses dosage adjustments based on renal function. Reduce dosage by 25% if eGFR falls below 30 mL/min." The embedding now captures what the chunk is about, not just what it literally says. Anthropic's benchmarks show contextual embeddings alone reduce retrieval failure rates by 35%. Combine them with contextual BM25 (where the same prepended context improves keyword matching) and failures drop by 49%. Stack reranking on top and you're at 67% reduction. The cost is $1.02 per million document tokens with prompt caching, which for most corpora is a rounding error compared to the retrieval quality gains.

Late chunking, from Jina AI, takes a different approach. Instead of adding context text to each chunk, you process the entire document through the transformer model first -- letting self-attention propagate information across the full document -- and then split the final-layer embeddings into chunks. Each chunk's embedding inherits document-wide context through the attention mechanism rather than through prepended text. The reported improvement is a 3.5% relative gain on nDCG, with gains scaling as document length increases (because longer documents benefit more from full-document attention). It's a cleaner solution architecturally than contextual chunking, but it requires embedding models that support long sequences and a different ingestion pipeline.

Proposition-based chunking decomposes documents into self-contained atomic propositions -- each chunk is a single factual claim with enough context to stand on its own. The EntityQuestions benchmark shows 17-25% improvement on Recall@5, which sounds impressive. But I want to add a strong caveat here, because the benchmarks don't tell the whole story. When researchers tested proposition-based chunking on clinical decision support tasks -- closer to what I'm building with the diabetes agent -- it was among the worst performers. The problem is that fragmenting clinical guidelines into atomic propositions strips away the surrounding qualifications, dosage ranges, and conditional logic that make the information actually useful. A proposition like "Metformin should be discontinued" is technically self-contained, but without "when eGFR drops below 30 mL/min in patients with Stage 4 CKD," it's dangerous. Smaller fragments diluted accuracy rather than improving it. This technique is highly domain-dependent, working well for factoid QA over encyclopedic content and poorly for anything where context and qualification carry clinical or legal weight.

For the diabetes agent, I settled on contextual chunking with a 400-token base size. The contextual prefix adds 50-80 tokens per chunk (well within the embedding model's 512-token budget), and the retrieval recall improvement over bare chunks was immediately visible in evaluation -- we went from roughly 72% Recall@10 to 84% on our test set, primarily because medication chunks that previously lacked disease-context now matched properly against condition-specific queries. The ADA guideline chunks were the biggest winners, since they're dense with abbreviations and implicit context that the contextual prefix made explicit.

Hybrid Retrieval

Running dense vector search alone leaves gaps that are embarrassing once you notice them. Try searching for "HbA1c" using only embeddings trained on general text, and you'll find the cosine similarity scores are mediocre because the embedding model hasn't seen enough endocrinology literature to place that abbreviation close to "glycated hemoglobin" or "blood sugar control" in its vector space. Keyword search handles this effortlessly -- if the document contains "HbA1c" and the query contains "HbA1c," BM25 gives you a strong match regardless of semantic understanding. The reverse failure is equally common -- a user asking about "blood sugar management strategies" won't keyword-match a document titled "glycemic control interventions," but dense search bridges that semantic gap easily.

The standard production pattern is hybrid retrieval -- run both dense (embedding) search and sparse (BM25) search in parallel, then fuse the results. The combination consistently outperforms either method alone; I've measured an 18% improvement in Recall@20 over dense-only on the diabetes corpus, and that number aligns with what others report across different domains. BM25 has a practical advantage beyond accuracy, too -- it's roughly 10x faster than embedding search for keyword-heavy queries, because it's just doing inverted index lookups rather than nearest-neighbor search across high-dimensional vectors. On queries where the user types exact medical terms, BM25 returns results in under 5ms while the vector search is still computing distances.

Reciprocal Rank Fusion (RRF)

Given retrieval results from multiple rankers r, the fused score for document d is:

$$\text{Score}(d) = \sum_{r} \frac{1}{k + \text{rank}(r, d)}$$

where k = 60 (the standard constant from the original Cormack et al. paper).

Why this works: RRF ignores the raw scores from each retriever entirely -- it only cares about rank positions. This is the key property, because you don't need to normalize or calibrate scoring systems that may operate on completely different scales (BM25 scores in the range 0-25, cosine similarity in 0-1). A document ranked #1 by one system and #3 by another gets a higher fused score than a document ranked #2 by both. The formula is parameter-free beyond the constant k, which controls how much weight top-ranked results get relative to lower-ranked ones. In practice, k = 60 works well across domains without tuning.
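The formula translates directly into code. Each ranking is an ordered list of document ids, and the raw retriever scores are deliberately ignored:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse rankings by summing 1 / (k + rank) per document (rank is 1-based)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["ada_guideline", "forum_post", "recipe"]      # embedding-search order
sparse = ["fact_sheet", "forum_post", "ada_guideline"]  # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Note the tie-break behavior the text describes: ada_guideline (ranked #1 and #3) edges out forum_post (ranked #2 by both), because 1/61 + 1/63 is slightly larger than 2/62.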

One architectural decision that saved me a surprising amount of operational pain: keep everything in a single database. Liu argues for this strongly, and after running separate Pinecone and Elasticsearch instances for three months before consolidating, I agree completely. A database like PostgreSQL with pgvector gives you full-text search (via tsvector), vector similarity search, and SQL filtering in one system. You can run a single query that finds chunks semantically similar to X, filtered by condition_type and publication date, and ranked by lexical relevance -- all without orchestrating across multiple services. Fewer moving parts means fewer failure modes, simpler deployment, and the ability to do hybrid retrieval in a single round-trip rather than fanning out to multiple backends and merging the results. The separate-index-per-collection pattern that many tutorials recommend creates operational overhead that isn't justified unless you're operating at a scale where a single database genuinely can't handle the load.

Reranking and Authority

Retrieval gets you candidates; reranking sorts them by actual relevance. The pattern is straightforward -- over-fetch 20-50 candidates from your hybrid retrieval stage, then pass each (query, chunk) pair through a cross-encoder model that scores how well the chunk answers the query. Cross-encoders are more accurate than bi-encoders (the embedding models used in retrieval) because they process the query and document jointly through the transformer, allowing full attention between query tokens and document tokens. The tradeoff is latency, since cross-encoders can't pre-compute document embeddings, so they scale linearly with the number of candidates. Across different benchmarks, reranking adds +33-40% to retrieval accuracy for roughly 120ms of additional latency. That's the best accuracy-per-millisecond trade in the entire pipeline, and the reason I rerank in every production system I build. After reranking, take the top 5-10 and discard the rest.
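The over-fetch-then-rerank pattern is a sort-and-truncate around a scoring call. A sketch -- score_pair stands in for a cross-encoder (typically a reranker API or a locally hosted model), and the term-overlap scorer below exists only so the example runs:

```python
def rerank(query: str, candidates: list[str], score_pair, top_k: int = 8) -> list[str]:
    """Sort candidates by joint (query, chunk) relevance score and truncate."""
    scored = sorted(candidates, key=lambda chunk: score_pair(query, chunk), reverse=True)
    return scored[:top_k]

# Stand-in scorer: counts query terms present in the chunk.
overlap_score = lambda q, c: sum(1 for w in q.lower().split() if w in c.lower())
candidates = [
    "general diabetes overview",
    "metformin dosing guidance for type 2 diabetes",
    "exercise tips",
]
top = rerank("metformin dosing", candidates, overlap_score, top_k=2)
```

The shape is what matters: retrieve wide (20-50 candidates), score every pair jointly, keep a short head for the context window.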

Reranker              ELO      Notes
Zerank-1              Highest  Current leaderboard leader; proprietary API
Voyage Rerank 2.5     High     Best quality-to-latency ratio; good all-rounder
Cohere Rerank 4 Pro   1627     32K context window; strong on long documents
BGE Reranker v2 M3    1467.6   Open-source; self-hostable; multilingual

Reranking fixes the ordering problem, but it doesn't fix a deeper issue with embedding-based retrieval: semantic similarity and authority are not the same thing. Liu gives an example that stuck with me -- if you search for "how to fix a Python import error" using embedding search, a casual Reddit comment phrased in natural conversational language will often score higher than the official Python documentation, because the Reddit comment is linguistically closer to how the query is phrased. The official docs use precise technical vocabulary; the Reddit comment uses the same colloquial framing as the user's question. Cosine similarity rewards linguistic resemblance, not expertise or reliability. In a medical context, this failure mode is genuinely dangerous -- you don't want a patient's question about metformin dosing answered with a chunk from a diabetes forum when the ADA clinical guideline exists in your corpus and says something more precise.

The production solution is learning-to-rank (LTR), where you train a model (typically XGBoost or a small neural net) to combine multiple signals into a final relevance score. Those signals include BM25 score and vector cosine similarity, but also domain authority (is this from a peer-reviewed guideline or a Reddit post?), citation count (how many other documents reference this one?), freshness (when was it last updated?), and user engagement (click-through rates, upvote counts from previous retrievals). Perplexity.ai uses exactly this approach -- their retrieval isn't just semantic search, it's a trained ranker that weighs authority and freshness alongside similarity. For the diabetes agent, I implemented a simpler version with tiered authority scores where ADA guidelines get a 1.5x boost, peer-reviewed journal articles get 1.2x, curated educational content gets 1.0x, and community-sourced content gets 0.7x. The boosts are applied as multiplicative weights on the RRF scores before final ranking. It's crude compared to a full LTR model, but it solved the most pressing failure mode -- ensuring that an ADA guideline about insulin titration always outranks a patient forum post covering the same topic in casual language.

In regulated domains, you can go further by maintaining specialized indices. A "high authority" index containing only vetted clinical guidelines, a "recency" index with documents from the past 12 months, and a primary full-text index for everything else. At query time, you retrieve from the appropriate index based on the structured query's metadata (a question about "current ADA recommendations" hits the high-authority index first, a question about "recent research on GLP-1 agonists" hits the recency index). This adds routing complexity but guarantees that critical queries are answered from authoritative sources.
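The routing itself can be as simple as a lookup on the structured query's metadata. A minimal sketch -- the index names and routing rules here are illustrative, not a prescribed schema:

```python
# Illustrative index names; the real mapping comes from your
# structured query's metadata fields.
HIGH_AUTHORITY = "clinical_guidelines"   # vetted guidelines only
RECENCY = "last_12_months"               # docs from the past year
PRIMARY = "full_text"                    # everything else


def route_indices(document_type=None, wants_recent=False):
    """Return the indices to query, in priority order.

    The primary full-text index is always present as the fallback, so a
    miss in a specialized index still gets answered.
    """
    if document_type == "clinical_guideline":
        return [HIGH_AUTHORITY, PRIMARY]
    if wants_recent:
        return [RECENCY, PRIMARY]
    return [PRIMARY]
```

The fallback ordering matters: specialized indices narrow the candidate pool, so you want the primary index behind them rather than instead of them.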

Context Window Ordering

Once you've chunked, retrieved, and reranked, there's one more decision that affects answer quality more than you'd expect -- the order in which you place chunks in the context window. The lost-in-the-middle effect -- which I covered in Part 1 -- can mean up to 73% performance degradation when critical information sits in the middle of a long context. The practical implication for chunk ordering is straightforward: put your most relevant chunks first and last, and put supporting or supplementary chunks in the middle. This "bookend" pattern ensures the model's attention mechanism gives maximal weight to your highest-confidence retrievals.

Anthropic's contextual retrieval experiments found another useful data point here: including the top 20 chunks was more effective than the top 5 or top 10, even though chunks 6-20 are individually less relevant. The additional chunks provide surrounding context that helps the model reason about the answer, and the contextual chunking prefix ensures each chunk carries enough metadata to be useful even when it's not a direct hit. The lesson is that more context is better than less, up to a point -- but only if you order it so that the most relevant pieces don't get lost in the middle of the pack.

async def retrieve(query: SearchQuery, top_k: int = 10) -> list[Chunk]:
    """Full retrieval pipeline: hybrid search, rerank, order for context."""

    # Stage 1: Hybrid retrieval -- over-fetch candidates
    fetch_k = top_k * 5  # e.g., fetch 50 for final 10

    dense_results = await vector_search(
        query.rewritten_query,
        filters={"condition_type": query.source_filters, "doc_type": query.document_type},
        top_k=fetch_k,
    )
    sparse_results = await bm25_search(
        query.keywords + [query.rewritten_query],
        filters={"condition_type": query.source_filters, "doc_type": query.document_type},
        top_k=fetch_k,
    )

    # Stage 2: Reciprocal Rank Fusion
    fused = reciprocal_rank_fusion(
        [dense_results, sparse_results],
        k=60,
    )

    # Stage 3: Apply authority boost before reranking
    for chunk in fused:
        chunk.score *= AUTHORITY_WEIGHTS.get(chunk.source_tier, 1.0)
        # e.g. {"clinical_guideline": 1.5, "journal": 1.2, "educational": 1.0, "community": 0.7}

    # Stage 4: Cross-encoder reranking on top candidates
    candidates = sorted(fused, key=lambda c: c.score, reverse=True)[:fetch_k]
    reranked = await cross_encoder_rerank(
        query=query.rewritten_query,
        chunks=candidates,
        top_k=top_k,
    )

    # Stage 5: Order for context window (bookend pattern)
    # Most relevant first and last; supporting in the middle
    ordered = bookend_order(reranked)
    return ordered


def reciprocal_rank_fusion(
    ranked_lists: list[list[Chunk]], k: int = 60
) -> list[Chunk]:
    """Merge multiple ranked lists using RRF. Ignores raw scores, uses rank only."""
    scores: dict[str, float] = {}
    chunk_map: dict[str, Chunk] = {}

    for ranked_list in ranked_lists:
        for rank, chunk in enumerate(ranked_list, start=1):
            scores[chunk.id] = scores.get(chunk.id, 0) + 1.0 / (k + rank)
            chunk_map[chunk.id] = chunk

    for chunk_id, score in scores.items():
        chunk_map[chunk_id].score = score

    return sorted(chunk_map.values(), key=lambda c: c.score, reverse=True)


def bookend_order(chunks: list[Chunk]) -> list[Chunk]:
    """Place most relevant chunks at start and end of context (lost-in-the-middle)."""
    if len(chunks) <= 2:
        return chunks
    # Chunks arrive sorted by relevance (best first).
    # Put #1 first, #2 last, fill the middle with the rest in descending order.
    first = [chunks[0]]
    last = [chunks[1]]
    middle = chunks[2:]
    return first + middle + last

When Graphs Beat Vectors

  • Most teams don't need GraphRAG -- vector RAG with hybrid search handles simple factual lookups well enough, and graph indexing costs roughly 10x more than vector indexing
  • Scale compounds the pain: entity resolution alone hits ~15% error rates, and one team (Particula, 12M nodes) spent 3 months reaching 99.3% accuracy on entity deduplication
  • Where graphs genuinely win: multi-hop reasoning (87% accuracy vs 23% for traditional RAG), corpus-wide thematic queries, and entity-relationship traversal across documents
  • Budget-friendly alternatives exist -- LazyGraphRAG cuts query cost by 700x, KET-RAG reduces indexing cost by 82%, and PathRAG beats full GraphRAG in 60% of head-to-head comparisons

I'm going to take a position that might be unpopular with the knowledge graph crowd. Most teams building RAG systems today don't need GraphRAG. If your retrieval pipeline from Section 3 is working -- hybrid search, reranking, contextual chunking -- you're handling the vast majority of production queries well enough. Don't reach for a graph until you've hit a specific wall that vectors can't solve, because the engineering cost of getting graphs right is significantly higher than most blog posts and conference talks would have you believe.

That said, there are situations where graphs genuinely earn their keep, and I want to be precise about which ones. The rest of this section is an honest accounting of the costs, the failure modes, and the narrow-but-real cases where graph-based retrieval outperforms everything else.

The Scale Problem

Let's start with money, because that's what kills most GraphRAG projects before they ship. Vectorizing the full text of The Wizard of Oz costs about $0.006 using a standard embedding API. Building a knowledge graph from the same text -- entity extraction, relationship classification, community detection -- costs about $0.06 with GPT-4o-mini. That's a 10x markup for a children's book. Scale this to a real corpus and the numbers get uncomfortable fast. A 1,000-page PDF runs roughly $120 at GPT-4-Turbo rates for full graph construction, and that's before you've spent a dollar on query-time inference. For most teams, the indexing budget alone is a hard stop.

But the dollar cost isn't even the biggest problem. Entity resolution -- the process of deciding that "Dr. Smith," "J. Smith, MD," "Smith (2024)," and "the lead researcher" all refer to the same person -- is where the real engineering pain lives. LLM-based entity extraction hits roughly a 15% error rate on entity boundaries and coreference, which sounds manageable until you realize those errors compound across every downstream operation. Every duplicated entity spawns incorrect relationships, pollutes community detection, and degrades the traversal results that the whole system depends on. Particula, working with a 12-million-node graph, found that entity resolution consumed the majority of their engineering effort. The same customer appeared under four or more different names across their document corpus. It took their team three months of iterative development -- including 2,400 manually verified training matches -- to reach 99.3% accuracy on entity deduplication. And that's a team with Neo4j expertise building on a mature graph platform, not a general ML team bolting GraphRAG onto an existing pipeline.

Query execution introduces its own class of problems. If you're using Cypher (Neo4j's query language), the natural approach is to have your LLM generate queries from natural language. In practice, LLM-generated Cypher is only about 77% correct -- roughly one in four queries either returns wrong results or fails outright. One production team I'm aware of abandoned LLM-generated Cypher entirely and fell back to 30 manually maintained query templates with 96%+ accuracy, adding 2-3 new templates per month as usage patterns emerged. That's a workable approach, but it's fundamentally a hand-crafted system with a graph backend, not the flexible natural-language-to-graph-query dream that the demos show.

And then there's graph explosion. Queries that seem innocuous -- "find all connections between Company X and the pharmaceutical industry" -- can touch hundreds of thousands of nodes in a densely connected graph. Production systems need traversal budgets (maximum depth, maximum nodes explored) and result caps (Particula used a 500-node limit) to prevent queries from consuming unbounded compute. These constraints are necessary but they mean your graph queries are seeing a sample of the full graph neighborhood, not the complete picture, which somewhat undermines the point of having a graph in the first place.
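A traversal budget is straightforward to enforce with a bounded breadth-first search. A sketch over a plain adjacency-dict graph -- the depth and node caps are tunable assumptions (the 500-node default echoes Particula's limit):

```python
from collections import deque


def bounded_traverse(graph, start, max_depth=3, max_nodes=500):
    """BFS with a depth budget and a node cap.

    graph: dict mapping node -> list of neighbor nodes.
    Returns visited nodes in BFS order, stopping at either budget --
    meaning the result is a sample of the neighborhood, not all of it.
    """
    visited = {start}
    frontier = deque([(start, 0)])
    order = []
    while frontier and len(order) < max_nodes:
        node, depth = frontier.popleft()
        order.append(node)
        if depth == max_depth:
            continue  # depth budget exhausted; don't expand further
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return order
```

The same caps apply whether you traverse in application code or push limits into a Cypher query; either way, a densely connected hub node will hit them.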

When Graphs Genuinely Win

So after all those caveats, where does graph-based retrieval actually outperform vector search by a margin that justifies the cost? I see four cases, and they're more specific than the GraphRAG marketing suggests.

Multi-hop reasoning is the clearest win. When answering a question requires connecting information across three or more documents -- "Which drugs interact with the medications prescribed to patients who also have condition X?" -- vector search falls apart because no single chunk contains the full answer. You'd need to retrieve the patient's medication list, cross-reference each medication's interaction profile, and filter by the co-occurring condition. Benchmark results show graph-based approaches achieving 87% accuracy on multi-hop questions where traditional RAG manages only 23%. That's not a marginal improvement; it's a qualitative difference in capability.

Corpus-wide thematic queries are the second case. Questions like "What are the main themes across all customer complaints from Q4?" or "Summarize the key research directions in our patent portfolio" don't have a good answer in vector space, because they require aggregating patterns across the entire corpus rather than finding specific matching chunks. GraphRAG's community detection algorithm (based on Leiden clustering) groups related entities into hierarchical communities and pre-computes summaries at each level, which gives you the ability to answer these synthetic, thematic questions. No amount of clever retrieval over flat chunks will produce a coherent thematic summary -- you need structure.

Entity-relationship questions that require traversal are the third case. "Which suppliers share warehouses with our top 3 customers?" needs you to walk a specific path through the graph: customer nodes to warehouse nodes to supplier nodes. Vector similarity between the query and any individual document chunk won't get you there, because the answer lives in the connections between entities, not in any single chunk's content.

The fourth case is temporal reasoning across documents, where you need to track how something evolved over time. "How has the treatment protocol for Type 2 diabetes changed over the past five years?" requires connecting time-stamped entities across multiple guideline versions, identifying what changed between each pair, and synthesizing a coherent narrative. A graph with temporal edges makes this a structured traversal problem; without it, you're hoping the LLM can piece together the evolution from a handful of retrieved chunks that may or may not span the right time period.

If you're trying to decide whether your use case justifies the investment, this is the framework I'd use:

| Your situation | Use this | Why |
|---|---|---|
| Simple factual lookups, keyword-rich data | Vector RAG + BM25 | Cheaper, faster, sufficient |
| Need entity relationships across documents | GraphRAG or LightRAG | Traversal required |
| Multi-hop reasoning, connecting 3+ docs | HippoRAG or PathRAG | Cheaper graph alternatives |
| Corpus-wide thematic summarization | GraphRAG Global Search or LazyGraphRAG | Community detection needed |
| Budget-constrained, <1,000 docs | Don't use graph | ROI doesn't justify complexity |
| >10K docs, frequent updates | LightRAG or KET-RAG | Full GraphRAG too expensive to re-index |

Cheaper Alternatives

Full GraphRAG isn't your only option even when you do need graph structure, and the alternatives are evolving fast enough that the cost calculus changes every few months. LazyGraphRAG is the most interesting recent development -- it indexes at vector-RAG parity (roughly 0.1% of full GraphRAG's indexing cost) by deferring the expensive entity extraction and community summarization to query time, using a blend of NLP and LLM-based extraction only when a query actually needs it. Query costs come in at 4% of GraphRAG's Global Search, which works out to roughly 700x cheaper per query. The tradeoff is higher query latency on first access, but for corpora that are updated frequently, avoiding the full re-index is worth it.

KET-RAG takes a different approach to the indexing cost problem by being selective about which entities matter. Instead of extracting every entity and relationship from the corpus, it uses PageRank (with a selectivity parameter β=0.8) to identify the most structurally important nodes and focuses extraction effort there. The result is 18.3% of competitor indexing cost with superior retrieval coverage -- 81.6% versus 74.6-79.6% for full-extraction approaches. The insight is that most entities in a knowledge graph are low-degree leaf nodes that rarely participate in useful query paths, so you can skip them without losing much.

PathRAG achieved a 60% win rate against full GraphRAG and 57% against LightRAG in head-to-head evaluations, focusing on optimizing the graph traversal itself rather than the construction. And HippoRAG, which models its retrieval after the hippocampal memory indexing theory, reports 20% improvement on multi-hop tasks while being 10-30x cheaper than iterative retrieval approaches that achieve similar accuracy by running multiple sequential search passes.

For the diabetes agent, I don't use graph retrieval. The queries are predominantly single-hop factual lookups -- "what's the recommended HbA1c target for elderly patients?" or "metformin dosage adjustment for renal impairment" -- and hybrid vector search with reranking handles these well. But I can see exactly where I'd introduce it: if I were building a broader medical knowledge base that needed to connect treatment protocols across conditions, link drug interactions to clinical trial outcomes, and traverse patient-condition-medication-contraindication paths, that's a graph problem. The structure of the data itself demands traversal, not similarity search. Until I hit that wall, vectors are cheaper, simpler, and good enough.

I'll close this section with a warning that I think is worth repeating. Jason Liu lists "jump to knowledge graphs before trying simpler solutions" as anti-pattern #12 in his guide on how to build a terrible RAG system. I've seen this play out in practice -- teams that adopted GraphRAG because it sounded sophisticated, before they'd even measured whether their vector retrieval was working or diagnosed why it was failing. The graph didn't fix their actual problem (usually bad chunking or missing query understanding), and they ended up maintaining two complex systems instead of one. Diagnose first, reach for graphs only when the diagnosis points specifically at a multi-hop or relationship-traversal failure mode that simpler retrieval genuinely can't solve.

Self-Correcting Retrieval

  • Traditional RAG is fire-and-forget -- retrieve once, generate, hope for the best; next-gen systems evaluate retrieval quality during generation and correct course when it's poor
  • CRAG routes through three corrective paths (correct / incorrect / ambiguous) that refine, replace, or supplement retrievals; +19 points on PopQA when integrated with Self-RAG
  • Self-RAG trains four reflection tokens (Retrieve, ISREL, ISSUP, ISUSE) that let the model decide when to retrieve, evaluate relevance, and check whether generation is actually supported by evidence
  • RAG paradoxically reduces a model's ability to abstain -- more context means more confidence, which means more hallucination when that context is insufficient; sufficiency checking before generation improves correct answers by 2-10%

Everything we've built so far -- query understanding, hybrid retrieval, reranking, even graph traversal -- shares a common assumption. You retrieve once, stuff the results into the context window, and let the model generate. If the retrieval was bad, the generation will be bad, and you won't know until a user complains or an eval catches it downstream. I've been calling this "fire-and-forget retrieval," and it's how most production RAG systems work today. The systems in this section break that assumption. They treat retrieval as something that can be evaluated, corrected, and repeated mid-generation -- turning what was a single pass through a search index into something closer to a reasoning loop.

The shift matters because retrieval failures aren't always obvious at query time. Sometimes the retrieved documents look relevant but contain outdated information. Sometimes they cover the topic but miss the specific angle the query needs. Sometimes the top-ranked chunk is genuinely relevant but the second and third chunks are noise that dilutes the signal. A system that can evaluate "did I get good enough context to answer this?" and take corrective action when the answer is no -- that's a qualitatively different kind of RAG pipeline than one that blindly trusts whatever the retriever returns.

CRAG: Corrective Retrieval-Augmented Generation

Corrective RAG (Yan et al., 2024) introduces a lightweight evaluator that sits between the retriever and the generator and makes a three-way routing decision. Given a query and the retrieved documents, a T5-large classifier scores overall retrieval quality and routes to one of three corrective paths. If the evaluator judges the retrieval as Correct (confidence above an upper threshold), it refines the documents -- decomposing each retrieved chunk into smaller knowledge strips, scoring each strip for relevance, and recomposing only the relevant pieces. This sounds simple, but in practice it strips a surprising amount of noise. A chunk that was retrieved because one sentence matched the query might contain five paragraphs of irrelevant context; the decompose-recompose step surgically removes that padding.

If the evaluator judges the retrieval as Incorrect (confidence below a lower threshold), CRAG does something more aggressive -- it discards the retrieved documents entirely and falls back to web search, using the original query to fetch fresh results from the open web. This is the path that handles the failure mode we discussed in Section 1 -- topic gaps where the corpus simply doesn't contain what the user needs. Rather than generating a confident-sounding answer from irrelevant context, CRAG recognizes the gap and goes looking elsewhere.

Between those two extremes sits the Ambiguous case -- confidence between the two thresholds -- where CRAG combines both strategies. It refines the retrieved documents (keeping the good parts) and supplements them with web search results, giving the generator a richer context set that hedges against both noise in the retrieval and gaps in the corpus. In my experience, this ambiguous middle ground is where most real queries actually land; users rarely ask questions that perfectly match a single chunk, and they rarely ask questions with zero relevant content in the corpus.
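The routing decision itself reduces to a pair of thresholds on the evaluator's confidence score. A minimal sketch -- the threshold values here are assumptions to tune against your own evaluator, not values from the paper:

```python
def crag_route(confidence: float, upper: float = 0.7, lower: float = 0.3) -> str:
    """Three-way CRAG routing on the evaluator's confidence score."""
    if confidence >= upper:
        return "correct"      # refine: decompose chunks, keep relevant strips
    if confidence <= lower:
        return "incorrect"    # discard retrieval, fall back to web search
    return "ambiguous"        # refine retrieved docs AND add web results
```

Because the evaluator only needs a scalar score, you can start with a prompted LLM returning a 0-1 judgment and swap in a trained classifier later without touching the routing.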

Performance-wise, CRAG improves short-form question answering by +3.6 points on PopQA and +4.3 on PubHealth over vanilla RAG baselines. But the real gains show up when you stack it with other techniques: CRAG integrated with Self-RAG (which I'll cover next) achieves +19 points on PopQA, +14.9 on Biography generation, and +36.6 on PubHealth. The paper positions CRAG as plug-and-play -- you can couple it with any existing RAG pipeline as a post-retrieval correction layer, which makes it one of the more immediately adoptable ideas in this section.

Self-RAG: Teaching Models When to Retrieve

Self-RAG (Asai et al., 2023) takes a fundamentally different approach. Instead of adding an external evaluator on top of the pipeline, it trains the language model itself to emit special reflection tokens that control retrieval behavior and evaluate generation quality. There are four token types, and each one answers a specific question during generation:

Retrieve answers "should I search right now?" At each generation step, the model can decide whether the current segment needs external evidence or whether it can continue generating from its own parameters. This is the key capability that separates Self-RAG from standard RAG -- retrieval isn't a one-time event at the beginning of generation but a decision the model makes adaptively as it writes. For factual claims, it retrieves. For connective prose or reasoning it's confident about, it skips retrieval and keeps generating.

ISREL answers "is this retrieved passage actually relevant to what I'm trying to say?" After retrieving, the model evaluates each candidate passage before incorporating it. Irrelevant passages get filtered out rather than polluting the generation with off-topic context. ISSUP answers the harder question: "is my generation supported by the evidence I retrieved?" This is a faithfulness check -- the model evaluates whether what it just wrote is grounded in the retrieved content or whether it drifted into hallucination. And ISUSE provides an overall quality assessment of the complete response.

The training process works by fine-tuning the model on a corpus where these reflection tokens have been inserted by a critic model (GPT-4 in the paper). At inference time, the model generates reflection tokens naturally as part of its output, and a tree-beam search over multiple retrieval-generation paths selects the final output based on the reflection scores. A 7B Self-RAG model shows roughly 40% relative improvement on PopQA over standard RAG baselines, and on some benchmarks it outperforms ChatGPT despite being dramatically smaller. The catch, of course, is that you need to train (or fine-tune) the model with the reflection tokens -- you can't bolt this onto an existing API-based model the way you can with CRAG.

Search-R1 (Jin et al., 2025) pushes self-correcting retrieval further by training search behavior directly through reinforcement learning. The model learns when and what to search during its chain-of-thought reasoning, issuing search queries as actions within the reasoning trace and incorporating results before continuing. The key technical innovation is retrieved token masking -- when computing the RL loss, the gradient is applied only over tokens the LLM itself generated, excluding all retrieved content from gradient updates. This prevents the model from being rewarded or penalized for the quality of search results it can't control, focusing the learning signal purely on the model's search decisions and reasoning. With Qwen2.5-7B as the base model, Search-R1 achieves 41% improvement over RAG baselines on multi-hop question answering benchmarks. The approach requires RL training infrastructure (they use GRPO, a variant of PPO), but it points toward a future where retrieval isn't a pipeline stage bolted onto generation but a learned capability integrated into the reasoning process itself.

FLARE: Uncertainty-Driven Retrieval

Forward-Looking Active REtrieval (Jiang et al., 2023) takes a more pragmatic approach to the "when should I retrieve?" question. Instead of training special tokens, FLARE monitors the model's own confidence during generation. It works in two steps: the model generates a tentative next sentence, and if any token in that sentence falls below a confidence threshold (measured by token probability), FLARE triggers a retrieval using the tentative sentence as the query, then regenerates with the retrieved context. If all tokens are high-confidence, it keeps the generation as-is and moves forward. The approach is particularly valuable for long-form generation -- articles, reports, multi-paragraph answers -- where the model might be confident about introductory context but uncertain about specific facts or figures deeper in the response. And because it uses the model's own probability distribution as the retrieval trigger, it requires no additional training, just access to token-level logprobs (which most inference APIs now provide).
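The trigger condition is a few lines once you have token-level logprobs for the tentative sentence. A sketch, where `min_prob` is a tunable assumption rather than a value from the paper:

```python
import math


def needs_retrieval(token_logprobs: list[float], min_prob: float = 0.4) -> bool:
    """FLARE-style trigger: the model drafts a tentative sentence; if any
    token's probability falls below min_prob, retrieve (using the tentative
    sentence as the query) and regenerate. Otherwise keep the draft.

    token_logprobs: per-token log-probabilities from the inference API.
    """
    return any(math.exp(lp) < min_prob for lp in token_logprobs)
```

In practice you'd also mask logprobs for punctuation and stopwords before checking, since low confidence on a comma shouldn't trigger a search.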

The Context Sufficiency Problem

There's a counterintuitive failure mode in all of these systems that I think deserves more attention than it gets. A 2024 Google Research study found that RAG actually makes models worse at knowing when they don't know something. Without retrieval, a model that encounters an unfamiliar question will often hedge or abstain -- "I'm not sure about that" or "I don't have enough information." But add a retrieval step, even one that returns marginally relevant or outright irrelevant context, and the model's confidence spikes. It sees documents in its context window and interprets their presence as evidence, even when those documents don't actually contain the answer. More context means more confidence means more hallucination on questions where the retrieved context is insufficient.

The numbers are striking. Proprietary LLMs in their experiments chose to output incorrect answers rather than abstaining, even when the retrieved documents were clearly off-topic, at rates that would be unacceptable in any production system handling medical or financial queries. Their proposed fix is a sufficiency check before generation: a lightweight classifier that evaluates whether the retrieved context actually contains enough information to answer the query, and routes to an "I don't have enough information to answer that" response when it doesn't. Adding this single gate improved the fraction of correctly answered queries by 2-10% across their benchmarks -- not by making the retrieval better, but by preventing the model from generating when retrieval had failed. I think this is one of the most underrated ideas in the RAG literature, because it directly addresses the failure mode that users trust least, a system that sounds confident while being wrong.

What's Actually Production-Ready

I want to be honest about where these techniques sit on the research-to-production spectrum, because the gap is wider than most papers acknowledge. CRAG is the most immediately implementable -- it's an evaluator-plus-routing layer that you can add on top of any existing retrieval pipeline without retraining anything. If you have a reasonably capable classifier (even a prompted LLM works as the evaluator) and a web search fallback, you can build a basic CRAG loop in a weekend. The gains on noisy or incomplete corpora are real, and the plug-and-play architecture means you don't have to rearchitect your stack to try it.

FLARE is my pick for the most practical "next step" for teams that want to move beyond fire-and-forget retrieval but aren't ready to retrain models. You need token-level logprobs (available from OpenAI and most open-source inference servers, though notably not from Anthropic's API), a confidence threshold that you'll need to tune per-domain, and a mechanism to re-query and regenerate. The implementation is straightforward, and the uncertainty-based trigger means you only pay the retrieval latency cost when the model actually needs it -- which for many queries is never, since the initial retrieval was sufficient.

Self-RAG produces the most impressive benchmark results, but requires fine-tuning with reflection tokens, which puts it out of reach for most teams that are calling model APIs rather than running their own training infrastructure. If you're already fine-tuning models for your domain (and some teams absolutely should be), Self-RAG is worth investigating. If you're building on top of Claude or GPT-4, it's not an option today.

And Search-R1 is, to me, the most intellectually exciting direction -- RL-trained retrieval as a learned reasoning capability rather than a pipeline bolt-on -- but it requires both RL infrastructure and the willingness to run your own models. It's a research preview of where production systems might land in a year or two, not something I'd recommend building on today. (Though I keep checking the repo for updates, because the retrieved token masking trick is genuinely clever and I suspect it'll show up in other systems soon.)

The RAG Flywheel

  • Q-C-A framework: every RAG system has a Question, Context, and Answer -- 6 conditional relationships between these three components cover all the evaluation you need
  • Liu's 9-step flywheel takes you from synthetic data generation through production monitoring to continuous iteration -- the loop matters more than any individual step
  • About 90% of teams adding sophisticated retrieval or reranking techniques performed worse when properly evaluated against baselines
  • Build your own eval with ~300 labeled chunks, iterated 4-5 times; MTEB leaderboard rankings don't predict performance on your domain
  • Focus on leading metrics (weekly experiments conducted, eval suite execution time) over lagging metrics (overall quality scores that move too slowly to be actionable)

I've spent the last five sections showing you techniques that improve retrieval quality -- query transformations, hybrid search, reranking, graph traversal, self-correcting loops. But I've been holding back a dirty secret of RAG engineering, which is that most teams that add these techniques without proper evaluation make their systems worse, not better. The missing piece isn't a fancier retriever or a cleverer chunking strategy. It's knowing whether what you built actually works, and having a systematic process to find out. Jason Liu's RAG flywheel gives you that process, and it starts with a deceptively simple observation about evaluation structure.

The Q-C-A Evaluation Framework

Every RAG system has exactly three components that matter for evaluation: a Question (what the user asked), a Context (what the retriever returned), and an Answer (what the model generated). That's it. And between those three components, there are six conditional relationships -- each one measuring something distinct about system behavior. I learned this framing from Liu's work on RAG complexity levels, and it's become the backbone of how I think about evaluation. Once you see it, you can't unsee it, because every RAG metric you've ever encountered maps onto one of these six relationships.


  • Foundation: Precision & Recall @K -- retrieval accuracy -- run daily, no LLM needed
  • Primary: Context Relevance (C|Q) -- do retrieved chunks address the question? -- every eval run
  • Primary: Faithfulness (A|C) -- does the answer stay within the context (hallucination check)? -- every eval run
  • Primary: Answer Relevance (A|Q) -- does the answer address what was asked? -- every eval run
  • Advanced: Context Support (C|A) -- does the context contain everything the answer needs? -- deep analysis
  • Advanced: Answerability (Q|C) -- can the question be answered from this context? -- deep analysis

In practice, the three Primary metrics are the ones you should be running on every evaluation cycle. Context Relevance (C|Q) tells you whether retrieval is working -- did you pull documents that actually address the user's question? This is the most fundamental metric, because if retrieval fails, everything downstream fails with it. Faithfulness (A|C) catches hallucination -- the model generating claims that aren't supported by the retrieved context. And Answer Relevance (A|Q) checks the end-to-end result, confirming that the final output actually addresses what was asked. If you only have time to measure three things, measure those three.

The Foundation tier -- Precision and Recall @K -- is where I recommend starting, because you can compute it without any LLM calls. You need a labeled set of queries paired with their "correct" chunks (I'll explain how to build this cheaply in a moment), and then you just check whether those chunks show up in your retrieval results. It's fast, it's deterministic, and it gives you a tight feedback loop for experimenting with retrieval changes. The Advanced metrics (Context Support and Answerability) are for deeper analysis sessions where you're trying to understand why specific queries fail, not for routine monitoring.
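The Foundation tier is simple enough to sketch in a few lines. This is a minimal illustration, not a prescribed implementation: it assumes you have a labeled set of (query, relevant chunk IDs) pairs and a `retrieve` callable that returns ranked chunk IDs -- both names are placeholders for whatever your pipeline exposes.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for a single query.

    retrieved_ids: ranked list of chunk IDs from the retriever.
    relevant_ids:  set of labeled "correct" chunk IDs for this query.
    """
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def evaluate(labeled_set, retrieve, k=5):
    """Mean precision/recall@k over a labeled eval set.

    labeled_set: list of (query, set_of_relevant_chunk_ids).
    retrieve:    callable query -> ranked list of chunk IDs (your retriever).
    """
    p_sum = r_sum = 0.0
    for query, relevant in labeled_set:
        p, r = precision_recall_at_k(retrieve(query), relevant, k)
        p_sum += p
        r_sum += r
    n = len(labeled_set)
    return p_sum / n, r_sum / n
```

No LLM calls anywhere in that loop, which is exactly why it can run in seconds after every change.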

Choosing the Right @K

One thing that tripped me up early was treating @K as a single number. Different values of K tell you fundamentally different things about your pipeline, and conflating them will mislead you. Recall@5 is what your users actually experience -- it measures whether the right content shows up in the handful of chunks that make it into the generation context. Recall@25 tests your reranker, because if the right chunk exists in the top 25 but not the top 5, your reranker is the bottleneck. Recall@50 is a system quality indicator that tells you whether your embedding model and indexing strategy are in the right ballpark. And Recall@100 is essentially an existence check -- if the right chunk doesn't appear in the top 100 results, it's either not in the corpus or so badly chunked that no embedding model will ever surface it.

There's a subtle trap here that I've seen catch multiple teams, and it's a form of Simpson's Paradox applied to retrieval metrics. Your overall Recall@5 might look great at 85%, but if you break it down by query category, you might find that simple factoid questions are at 98% and multi-hop reasoning questions are at 40%. The aggregate hides the fact that you're completely failing on the query types that matter most. Always segment your metrics by query type, topic area, and complexity; an unsegmented number is almost always misleading.
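Both ideas -- multiple K values and per-category breakdowns -- fit in one small report function. A sketch, assuming each labeled example carries a category tag (the tag names and the `retrieve` callable are my own placeholders):

```python
from collections import defaultdict


def recall_at_k(retrieved_ids, relevant_ids, k):
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0


def segmented_recall(labeled_set, retrieve, ks=(5, 25, 50, 100)):
    """Recall at several K values, broken down by query category.

    labeled_set: list of (query, relevant_chunk_ids, category).
    retrieve:    callable query -> ranked list of chunk IDs.
    Returns {category: {k: mean recall@k}}, plus an "overall" row.
    """
    buckets = defaultdict(list)
    for query, relevant, category in labeled_set:
        ranked = retrieve(query)
        scores = {k: recall_at_k(ranked, relevant, k) for k in ks}
        buckets[category].append(scores)
        buckets["overall"].append(scores)
    return {
        cat: {k: sum(s[k] for s in rows) / len(rows) for k in ks}
        for cat, rows in buckets.items()
    }
```

Reading the output follows the ladder above: a category where recall@50 is high but recall@5 is low points at the reranker; a category where even recall@100 is low points at the corpus or the chunking.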

The operational process that ties evaluation to continuous improvement is what Liu calls the RAG flywheel -- nine steps that cycle from initial baseline through synthetic data, real-world feedback, and targeted fixes:

  1. Initial implementation -- build a baseline system, however simple. You need something running to measure against.
  2. Synthetic data generation -- for each chunk in your corpus, generate plausible questions that chunk should answer. This gives you a labeled evaluation set without manual annotation, and Liu reports it typically achieves ~97% baseline accuracy (since the questions are derived directly from the chunks).
  3. Fast evaluations -- run precision and recall against your synthetic dataset. No LLM calls needed, so this executes in seconds. Speed is the point; if your eval takes 30 minutes, you won't run it after every change.
  4. Real-world data collection -- log production queries. The gap between synthetic questions and what users actually ask is where most of the real improvement opportunity lives.
  5. Classification and analysis -- cluster your real queries by topic, intent type, and capability requirement. This is where the topic-gap vs. capability-gap framework from Section 1 becomes operational.
  6. Targeted improvements -- fix failures by category, not globally. If medical dosage queries are failing, improve the medical content and chunking strategy, don't retune your entire embedding model.
  7. Production monitoring -- track your Q-C-A metrics continuously, with alerts on regressions. A change that improves one query category can silently degrade another.
  8. User feedback integration -- collect structured feedback (more on what "structured" means in a moment).
  9. Iterate -- go back to step 2. The synthetic dataset grows, the real-world dataset grows, and each cycle through the loop improves your understanding of where the system fails.
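Step 2 is mechanical enough to sketch. In the version below, `generate_question` stands in for an LLM call (a prompt along the lines of "write a question this passage answers"); I inject it as a callable so the loop itself stays cheap and testable. The function names and tuple shapes are illustrative, not Liu's.

```python
def build_synthetic_eval_set(chunks, generate_question, n_per_chunk=2):
    """Flywheel step 2: derive labeled (question, relevant_ids) pairs
    from the corpus itself.

    chunks: list of (chunk_id, chunk_text).
    generate_question: callable (chunk_text, i) -> question string.
        In practice this wraps an LLM prompt; it is injected here so
        the assembly loop has no model dependency.
    """
    eval_set = []
    for chunk_id, text in chunks:
        for i in range(n_per_chunk):
            question = generate_question(text, i)
            # The chunk the question was derived from is, by construction,
            # the relevant chunk -- no manual labeling required. This is
            # also why baseline accuracy on synthetic sets runs so high.
            eval_set.append((question, {chunk_id}))
    return eval_set
```

The output plugs directly into the fast precision/recall evaluation of step 3, which is the point: one corpus pass buys you a labeled eval set.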

Step 3 deserves emphasis because it's the one most teams get wrong. If your evaluation suite takes 20 minutes to run, you'll run it once a day at best. If it takes 15 seconds, you'll run it after every change -- and that tighter feedback loop is worth more than any individual technique improvement. Build your fast eval path first (precision/recall on synthetic data, no LLM calls), and only layer on the slower LLM-judge metrics for weekly deep dives.

The 90% Statistic

This is the number that changed how I approach RAG engineering. Liu and Skylar Payne both report that roughly 90% of implementations adding sophisticated retrieval or reranking techniques performed worse when properly evaluated against baselines. Not "didn't help" -- actively performed worse. The teams that added hybrid search, or swapped in a fancier reranker, or implemented query expansion, and then measured with a real eval suite, found that more often than not, the added complexity introduced regressions that outweighed the gains.

I think the explanation is straightforward, even if it's uncomfortable. Most teams add techniques because they sound good in a blog post or paper (guilty as charged -- I just spent four sections describing them), not because their evaluation told them that specific technique would help. And without measurement, you can't distinguish between "this helped on the queries where I eyeballed it" and "this helped on net across all query types." That Simpson's Paradox issue bites hard: a reranker that improves factoid queries by 10% but degrades multi-hop queries by 25% looks great when you test it on three cherry-picked examples, and terrible when you run it across a real evaluation set. The takeaway isn't "don't add sophisticated techniques" -- it's "measure before and after, across all query categories, or you're probably making things worse."

Leading vs. Lagging Metrics

Most teams I've talked to obsess over lagging metrics -- overall answer quality, user satisfaction scores, aggregate Recall@5. These numbers move slowly, they're influenced by dozens of variables at once, and by the time they shift meaningfully, you've already shipped multiple changes and can't attribute the improvement to any one of them. I've found it more useful to track leading metrics that predict quality improvement rather than measure it after the fact. The ones I pay attention to: how many retrieval experiments did we run this week? Is our eval suite execution time under 30 seconds? How many new labeled query-chunk pairs did we add to the evaluation set? Are we segmenting metrics by query category or just looking at aggregates?

The weekly experiment count is the one I care about most. A team that runs 10 retrieval experiments a week with fast evaluation will outperform a team that runs one experiment a month with a beautiful dashboard of lagging metrics. The speed of iteration dominates the sophistication of any individual technique, and that speed depends entirely on having an eval loop that's fast enough to not be a bottleneck.

The Feedback Problem

Most RAG systems collect user feedback through thumbs up/down buttons or 5-star ratings, and most of that data is nearly useless for improving retrieval. A thumbs-down on an answer could mean the retrieval was bad, or the generation hallucinated despite good retrieval, or the answer was correct but the tone was wrong, or the response was too slow, or the user just didn't like the formatting. You can't diagnose what to fix from a one-dimensional signal that conflates five different failure modes.

Liu's recommendation (and I've adopted this after trying it both ways) is to reframe the feedback question entirely. Instead of "rate this response," ask: "Did we answer your question correctly? Yes or no." This single binary question isolates correctness from tone, latency, and presentation, which means a "no" response actually tells you something actionable about retrieval or generation quality. You can supplement it with an optional freeform "what was wrong?" field for users who want to elaborate, but the binary question alone gives you a cleaner signal than a 5-star rating ever will. I've seen teams cut their feedback analysis time by 60-70% after making this switch, because they stopped trying to interpret what a 3-star rating meant and started counting binary correct/incorrect outcomes segmented by query category.
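What you do with that binary signal is just counting. A sketch of the aggregation, segmented by query category as argued above -- the event field names (`category`, `answered_correctly`) are my own, not a standard schema:

```python
from collections import Counter


def feedback_report(events):
    """Aggregate binary correctness feedback per query category.

    events: iterable of dicts like
        {"category": "dosage", "answered_correctly": False}.
    Returns {category: (correct, total, accuracy)}.
    """
    correct = Counter()
    total = Counter()
    for e in events:
        total[e["category"]] += 1
        if e["answered_correctly"]:
            correct[e["category"]] += 1
    return {
        cat: (correct[cat], total[cat], correct[cat] / total[cat])
        for cat in total
    }
```

A category whose accuracy lags the rest is a direct pointer back to steps 5 and 6 of the flywheel: classify the failures there, then fix that category specifically.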

The MTEB Trap

When I first started building RAG systems, I'd go to the MTEB leaderboard and pick whatever embedding model sat at the top, assuming that the best benchmark performer would be the best performer for my use case. I was wrong, and I wasn't alone. Weights & Biases discovered that their top-ranked MTEB model actually underperformed alternatives on their domain-specific benchmark -- the model that was "best" on averaged academic benchmarks wasn't best on their actual retrieval tasks. MTEB evaluates models across a diverse set of tasks and domains, and the aggregate score washes out domain-specific strengths and weaknesses in exactly the same way that aggregate Recall@5 hides per-category performance.

The fix is to build your own evaluation benchmark, and it's less work than you'd think. Start with ~300 labeled query-chunk pairs from your actual domain -- queries that real users have asked (or that your synthetic generation step produced), paired with the chunks that correctly answer them. Run your candidate embedding models against this set, measure Recall@5 and Recall@25, and pick the winner. Then iterate on the benchmark itself 4-5 times, adding failure cases you discover in production. After those iterations, you'll have a benchmark that actually predicts how models will perform on your data, which is something MTEB by design cannot do. I'd rather have 300 well-curated domain-specific pairs than 10,000 academic benchmark results when making an embedding model decision.
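The bake-off itself is a small harness. A sketch under stated assumptions: each candidate model is represented as an `embed` callable returning a vector (a placeholder for whatever client library you actually use), and the benchmark labels one relevant chunk per query for simplicity.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def recall_at_k_for_model(embed, benchmark, chunks, k=5):
    """embed: callable text -> vector (stand-in for a real model client).
    benchmark: list of (query, relevant_chunk_id).
    chunks: dict chunk_id -> chunk_text."""
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for query, relevant_id in benchmark:
        qv = embed(query)
        ranked = sorted(chunk_vecs,
                        key=lambda cid: cosine(qv, chunk_vecs[cid]),
                        reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(benchmark)


def pick_embedding_model(models, benchmark, chunks, k=5):
    """models: dict name -> embed callable. Returns (best_name, scores)."""
    scores = {name: recall_at_k_for_model(fn, benchmark, chunks, k)
              for name, fn in models.items()}
    return max(scores, key=scores.get), scores
```

Swap in your real candidates for the `models` dict, run it against the ~300 pairs, and the leaderboard question answers itself for your domain.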

Automated evaluation frameworks: RAGAS and BenchmarkQED both provide implementations of the Q-C-A metrics described above, with LLM-as-judge scoring and synthetic test generation. They're useful starting points, especially RAGAS's faithfulness and answer relevance scorers. But I'd still prioritize building your own domain-specific eval suite over relying on a generic framework -- the framework tells you how to measure, but only your labeled data tells you what to measure against. Use the frameworks for the LLM-judge plumbing; supply your own ground truth.

When Not to Use RAG

  • Grep beats embeddings for code: Augment reached the top of SWE-Bench Verified using grep and find, not embedding-based retrieval -- agent persistence compensates for simple tools
  • In agent systems, the model becomes the orchestrator choosing which retrieval tools to invoke, not a passive consumer of a fixed pipeline
  • RAG is shifting from QA (save time) to report generation (inform decisions) -- a $20K system informing a $5M decision is cheap; a free system saving 10 minutes is nearly worthless
  • Embeddings remain essential for unstructured corpora, fuzzy conceptual queries, and non-text media where keyword search genuinely can't reach

The finding that surprised me most while researching this post had nothing to do with chunking strategies or reranking models. It came from Augment Code, which reached the top of SWE-Bench Verified -- the hardest benchmark for autonomous coding agents -- using grep and find as their primary retrieval tools. Not a vector database. Not a hybrid search pipeline. Not any of the embedding-based approaches I've spent the last six sections describing. Plain text search over a codebase, driven by an agent that knew how to formulate good queries and try again when results were bad.

Once you think about it, the explanation makes sense for code specifically. Source code is highly structured, with distinctive identifiers, function names, class hierarchies, and import paths that act as natural keywords. When you're looking for how a particular function handles edge cases, grep -rn "handleTimeout" gets you there faster than an embedding similarity search that might surface semantically related but textually different code. Nik Pash at Cline went further and abandoned RAG entirely for coding agents after finding that embedding-based retrieval added latency without improving the results that a persistent search agent could already achieve. Colin Flaherty at Augment described the test they apply to any new retrieval tool: "If I was a human working on this use case, a really persistent human that never got tired, would having this other search tool help me? If yes, it's probably useful for the agent." For structured code, the answer kept coming back no.

But the deeper lesson isn't about grep versus embeddings -- it's about a control inversion in how retrieval works inside agent systems. In a traditional RAG pipeline, retrieval is a fixed stage: query comes in, retriever fetches documents, generator produces an answer. The pipeline is designed once by an engineer, and every query flows through the same path. In an agent system, the model itself decides which tools to invoke, in what order, with what parameters. It might grep for a function name, read the file that contains it, then grep for everywhere that file is imported, then read one of those files to understand the calling context. The model is the orchestrator, not the consumer. And that inversion means simple, composable tools -- grep, find, file read, SQL queries -- become surprisingly effective when the model can chain them intelligently and retry when a search comes back empty.
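To make the "simple, composable tools" point concrete, here's a toy version of that chain -- a grep-like primitive over an in-memory file map, plus one composed step that finds a definition and then greps for importers. Everything here (the function names, the file-map representation, the import heuristic) is illustrative, not any agent framework's actual API.

```python
import re


def grep(files, pattern):
    """Minimal grep-like tool. files: {path: file_text}.
    Returns (path, line_number, line) for every matching line."""
    rx = re.compile(pattern)
    hits = []
    for path, text in files.items():
        for n, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((path, n, line))
    return hits


def find_usages(files, symbol):
    """One agent-style chain: locate a function definition, then grep
    for files that import the defining module. A real agent would run
    steps like these iteratively, retrying with new patterns on misses."""
    defs = grep(files, rf"def {symbol}\b")
    if not defs:
        return []
    defining_path = defs[0][0]
    module = defining_path.rsplit("/", 1)[-1].removesuffix(".py")
    return grep(files, rf"import {module}\b|from {module}\b")
```

Each tool is trivial on its own; the leverage comes from the model deciding, mid-task, which one to call next and with what pattern.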

This doesn't mean embeddings are dead. They remain essential for the cases where keyword search genuinely can't reach: large unstructured corpora like Slack threads, internal wikis, and email archives where the vocabulary is inconsistent and the same concept gets expressed dozens of different ways; third-party content that falls outside the model's training data, where you can't rely on parametric knowledge to fill gaps; non-text media like images, audio, and diagrams that need to be embedded into a shared vector space for cross-modal retrieval; and fuzzy conceptual queries ("what's our company's approach to incident response?") where the user doesn't know the right keywords and the answer is distributed across multiple documents. If your corpus is messy, your queries are vague, and your content isn't in the training data, you still need the full retrieval pipeline I've been describing. The question is whether you should default to that pipeline, or start simpler and add complexity only when measurement tells you it's needed.

The shift I find most interesting is one that Liu articulated toward the end of his RAG complexity framework: RAG is moving from question-answering to report generation, and the economics of the two are completely different. RAG-as-QA derives its value from time saved -- instead of searching through documents yourself, the system finds the answer for you. That's useful, but it's a one-dimensional value proposition. You save 5 minutes, maybe 10. The ceiling on that value is low. RAG-as-report-generation derives its value from decision quality. A system that synthesizes information from hundreds of documents into a coherent analysis -- pulling together financial filings, market research, competitive intelligence, internal performance data -- enables decisions that a human analyst couldn't make in the same timeframe, or at the same level of coverage. A $20K/month RAG system that informs a $5M investment decision is absurdly cheap. A free RAG chatbot that saves someone 10 minutes answering HR interview questions is nearly worthless. The value isn't in the retrieval mechanism; it's in what decisions the retrieved information enables.

I'm still not sure where exactly the boundary settles between "just use grep and an agent loop" and "you actually need the full embedding-plus-reranking pipeline." My best guess is that it depends on the structure of your corpus and the predictability of your queries more than on the volume of data. Highly structured data with consistent naming conventions (code, SQL databases, API documentation) rewards simple tools and agent persistence. Unstructured data with high vocabulary variance (support tickets, meeting transcripts, research papers across subfields) rewards the semantic generalization that embeddings provide. But I wouldn't bet too heavily on that distinction surviving contact with the next generation of models, which keep getting better at formulating their own search strategies.

Retrieval handles external knowledge -- documents, databases, APIs, everything outside the model's weights. But what about knowledge the system has learned from this specific user over weeks of interaction? Preferences, past decisions, patterns that worked, things that didn't. That's memory, and memory engineering is where context persistence gets genuinely hard. Part 3 will go there.

If you spot errors, have war stories about RAG systems that defied the patterns I've described, or found approaches that work better than what I've covered here -- I'd genuinely like to hear about it. Reach out on X or LinkedIn. I treat these posts as living documents, and reader corrections have made every previous piece better.

References

  1. Liu, J. "RAG is More Than Embeddings." jxnl.co, September 2023.
  2. Liu, J. "How to Build a Terrible RAG System." jxnl.co, January 2024.
  3. Liu, J. "When to LGTM@K." jxnl.co, February 2024.
  4. Liu, J. "Systematically Improving Your RAG." jxnl.co, May 2024.
  5. Liu, J. "Low-Hanging Fruit for RAG Search." jxnl.co, May 2024.
  6. Liu, J. "Predictions for the Future of RAG." jxnl.co, June 2024.
  7. Liu, J. "The RAG Flywheel." jxnl.co, August 2024.
  8. Liu, J. "Decomposing RAG Systems to Identify Bottlenecks." jxnl.co, November 2024.
  9. Liu, J. "Authority in RAG Systems." jxnl.co, March 2025.
  10. Liu, J. "Fine-Tuning Embedding Models for Enterprise RAG." jxnl.co, March 2025.
  11. Liu, J. "There Are Only 6 RAG Evals." jxnl.co, May 2025.
  12. Liu, J. "Systematically Improving RAG with Raindrop and Oleve." jxnl.co, May 2025.
  13. Liu, J. "RAG Anti-Patterns with Skylar Payne." jxnl.co, June 2025.
  14. Liu, J. "Facets: Context Engineering." jxnl.co, August 2025.
  15. Liu, J. "Why Grep Beat Embeddings." jxnl.co, September 2025.
  16. Liu, J. "Why I Stopped Using RAG for Coding Agents." jxnl.co, September 2025.
  17. Liu, J. "Text Chunking Strategies" (with ChromaDB). jxnl.co, September 2025.
  18. Liu, J. "Stop Trusting MTEB Rankings" (with Chroma). jxnl.co, September 2025.
  19. Anthropic. "Introducing Contextual Retrieval." anthropic.com, 2024.
  20. Jina AI. "Late Chunking in Long-Context Embedding Models." arXiv:2409.04701, 2024.
  21. Chen et al. "Dense X Retrieval: What Retrieval Granularity Should We Use?" Weaviate.
  22. Sarthi et al. "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval." ICLR 2024, arXiv:2401.18059.
  23. Rackauckas et al. "RAG-Fusion." arXiv:2402.03367, 2024.
  24. Gao et al. "HyDE: Hypothetical Document Embeddings." arXiv:2212.10496, 2022.
  25. Microsoft Research. "GraphRAG: A Modular Graph-Based RAG System." GitHub, 2024.
  26. Microsoft Research. "LazyGraphRAG: Setting a New Standard for Quality and Cost." Microsoft Research Blog, 2025.
  27. Li et al. "KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG." arXiv:2502.09304, 2025.
  28. Yang et al. "HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs." NeurIPS 2024, arXiv:2405.14831.
  29. Zhang et al. "PathRAG." arXiv:2502.14902, 2025.
  30. Particula. "GraphRAG Implementation: What 12 Million Nodes Taught Us." particula.tech, 2025.
  31. Microsoft Tech Community. "GraphRAG Costs Explained: What You Need to Know." techcommunity.microsoft.com, August 2024.
  32. Yan et al. "Corrective Retrieval Augmented Generation." arXiv:2401.15884, 2024.
  33. Asai et al. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR 2024, arXiv:2310.11511.
  34. Jin et al. "Search-R1: Training LLMs to Reason and Leverage Search Engines." COLM 2025, arXiv:2503.09516.
  35. Jiang et al. "FLARE: Active Retrieval Augmented Generation." EMNLP 2023, arXiv:2305.06983.
  36. Google Research. "Sufficient Context: Improving Retrieval Augmented Generation." ICLR 2025, arXiv:2411.06037.
  37. Cohere. "Reranking Best Practices." docs.cohere.com.
  38. Agentset. "Reranker Leaderboard." agentset.ai.
  39. RAGAS Documentation. docs.ragas.io.
  40. Microsoft. "BenchmarkQED." GitHub, 2025.
  41. Weaviate. "Hybrid Search 2.0." weaviate.io.
  42. Assembled. "Better RAG Results with RRF and Hybrid Search." assembled.com.