Your RAG pipeline is probably useless. Here’s a better alternative

# Introduction

Retrieval-augmented generation (RAG) has emerged as a standard approach for connecting documents to large language models (LLMs).

The recipe is simple: embed the corpus, retrieve the most relevant chunks by vector similarity, and insert them into the prompt. It works well in demos and in many production systems. It also fails in predictable, documented ways that only become apparent at scale.

Here is what these failure modes look like, and the alternatives engineers turn to instead.

# When RAG fails in production

The most common failure pattern is retrieval irrelevance. A user asks about the parental leave policy. The retriever returns the 2022 version, the 2024 version, and a culture blog post. Each chunk scores high on embedding similarity because it shares vocabulary with the query. None of them answers the question the user asked.

The model does not know that the retrieved content is outdated or off-topic. It combines the chunks into a confident, detailed answer that is factually incorrect. This is topical similarity without factual relevance, and it is the dominant failure mode in production RAG systems.
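The effect can be reproduced with a toy example. The sketch below uses bag-of-words cosine similarity as a crude stand-in for an embedding model (real embeddings are denser, but they show the same tendency to reward shared vocabulary). The documents and the query are made up for illustration.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical corpus: two versions of one policy, a blog post, an unrelated doc.
docs = {
    "policy_2022": "Parental leave policy 2022: employees get 12 weeks of parental leave.",
    "policy_2024": "Parental leave policy 2024: employees get 16 weeks of parental leave.",
    "culture_blog": "Why parental leave matters: how many weeks of leave do I get to bond?",
    "expense_policy": "Submit travel expenses within 30 days of the trip.",
}
query = "How many weeks of parental leave do I get?"

scores = {name: cosine(bow(query), bow(text)) for name, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup the blog post ranks first because it echoes the wording of the question, and the two policy versions receive identical scores, so similarity alone gives the retriever no way to prefer the current one.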

A more subtle version is context poisoning. Enterprise knowledge bases often keep the same policy document in multiple versions. When the retriever returns chunks from both, the model does not flag the contradiction. It picks one, blends both, or presents some synthesis. The reader receives an answer. The answer may be wrong. Neither the user nor the model knows it.

The root cause is a structural conflict in the chunk-embed-retrieve pipeline. Good recall requires small chunks, roughly 100 to 256 tokens, for targeted retrieval. Good contextual understanding requires large chunks, 1,024 tokens or more, to preserve coherence. Every RAG designer picks one and accepts the trade-off.
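The trade-off is easy to see with a minimal fixed-size chunker. This is a sketch under simple assumptions: tokens are a pre-split list of strings, there is no overlap, and `chunk_tokens` is an illustrative helper rather than any library's API.

```python
def chunk_tokens(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# A hypothetical 2,048-token document.
tokens = [f"tok{i}" for i in range(2048)]

small_chunks = chunk_tokens(tokens, 128)    # many narrow chunks: precise matching
large_chunks = chunk_tokens(tokens, 1024)   # few wide chunks: more surrounding context
```

The same document becomes 16 narrow retrieval targets or 2 wide ones. The index stores one granularity, so a single chunk size has to serve both goals.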

# The common (wrong) fix: over-engineering

When standard RAG doesn't work, the common response is to make it more complicated: higher-dimensional embeddings, more sophisticated re-ranking, multi-step retrieval. This makes the problem worse.

A global manufacturing company budgeted $400,000 for its RAG system. The first-year cost was $1.2 million. Final accuracy on technical documentation queries: 23%. The project was shut down. By its sixth month, a healthcare company was paying $75,000 per month in vector database costs. These results reflect a broader pattern: in 2025, enterprise RAG deployments had a first-year failure rate of 72%.

Higher embedding dimensions and more sophisticated vector models do not automatically improve performance. They increase compute costs and postpone the more useful question of whether a retrieval architecture was the right choice in the first place.

# Alternatives when RAG fails

// Long-context prompting

The most direct alternative to over-engineering a struggling RAG pipeline is to skip retrieval altogether.

If the corpus fits in the model's context window, load it and let the model read it. A comparative study found that long-context LLMs consistently outperformed RAG on QA tasks when compute was available, with chunk-based retrieval lagging furthest behind.

The cost trade-off is significant. At 1 million tokens, latency is 30 to 60 times slower than a RAG pipeline, and the per-query cost is roughly 1,250 times higher. With prompt caching for high-traffic applications, long context can become cost-competitive.

A common decision rule: if the corpus fits within the context window and query volume is moderate, long-context prompting is a cleaner starting point. Add retrieval only when the corpus exceeds the window, latency violates service level objectives (SLOs), or query volume passes the economic break-even point.
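That rule can be written down as a small predicate. This is a minimal sketch: the `Workload` fields and the `needs_retrieval` name are hypothetical, and the thresholds (window size, SLO, break-even volume) are values each team would have to measure for itself.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    corpus_tokens: int               # total size of the corpus
    context_window_tokens: int       # model's usable context window
    long_context_latency_s: float    # measured latency when loading the full corpus
    latency_slo_s: float             # latency target from the SLO
    queries_per_day: int             # observed query volume
    break_even_queries_per_day: int  # volume above which retrieval is cheaper

def needs_retrieval(w: Workload) -> bool:
    """Return True if any of the three conditions for adding retrieval holds."""
    return (
        w.corpus_tokens > w.context_window_tokens
        or w.long_context_latency_s > w.latency_slo_s
        or w.queries_per_day > w.break_even_queries_per_day
    )

# Example with made-up numbers: a small corpus with modest traffic.
small_team = Workload(
    corpus_tokens=300_000,
    context_window_tokens=1_000_000,
    long_context_latency_s=8.0,
    latency_slo_s=15.0,
    queries_per_day=200,
    break_even_queries_per_day=5_000,
)
start_with_long_context = not needs_retrieval(small_team)
```

Keeping the three conditions in one place makes the decision auditable: when the corpus grows or traffic changes, the inputs change and the architecture choice can be revisited.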

// Memory compression

If the corpus is too large to fit in the context window, summarize it before retrieval. Summary-based retrieval compresses documents before injecting them, rather than extracting raw chunks. Benchmarks show that this approach performs comparably to full long-context methods, while chunk-based retrieval consistently lags behind both.

One specific result: an order-preserving RAG approach using 48,000 well-selected tokens outperformed full-context prompting with 117,000 tokens by 13 F1 points, on less than half the token budget. A well-compressed relevant document is better than a raw dump of tangentially related chunks.
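The selection step behind such a result can be sketched briefly, assuming "order-preserving" means the chosen chunks are placed in the prompt in their original document order rather than sorted by score. The function name and the relevance scores below are hypothetical.

```python
def order_preserving_top_k(scores: list[float], k: int) -> list[int]:
    """Pick the k highest-scoring chunk indices, then restore document order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)

# Made-up relevance scores for five consecutive chunks of one document.
chunk_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
selected = order_preserving_top_k(chunk_scores, k=3)

chunks = ["intro", "eligibility", "duration", "pay", "appendix"]
context = "\n".join(chunks[i] for i in selected)
```

Here chunks 1, 3, and 2 have the highest scores, but they are emitted as 1, 2, 3, so the model reads them in the order the author wrote them.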

// Structured retrieval

When retrieval is the right architecture, the fix is to route by query type rather than to apply better embeddings uniformly.

Research presented at EMNLP 2024 introduced Self-Route, which lets the model classify whether a query needs full context or targeted retrieval before it runs. Simple factual lookups go to targeted RAG. Complex questions that require multiple hops or global understanding go to long context.

The result: better overall accuracy at lower compute cost. Adaptive systems using this hybrid approach have shown 15 to 30% improvements in retrieval precision with hybrid search and re-ranking.

The key change is making routing explicit. Each query is classified before retrieval is triggered, and the system stops treating all queries as identical embedding problems.
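A minimal sketch of explicit routing follows. It is not Self-Route's implementation: the keyword-based `classify_query` is a hypothetical placeholder for whatever classifier is used (in practice, often an LLM call), and the two handlers are stubs passed in by the caller.

```python
import re
from typing import Callable

# Hypothetical cue words suggesting a multi-hop or global question.
GLOBAL_CUES = {"compare", "summarize", "why", "across", "trend", "changed"}

def classify_query(query: str) -> str:
    """Placeholder classifier: route on cue words. Swap in a model-based one."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "long_context" if words & GLOBAL_CUES else "targeted_rag"

def answer(
    query: str,
    rag: Callable[[str], str],
    long_context: Callable[[str], str],
    classify: Callable[[str], str] = classify_query,
) -> tuple[str, str]:
    """Classify first, then dispatch to exactly one pipeline."""
    route = classify(query)
    handler = long_context if route == "long_context" else rag
    return route, handler(query)

# Stub pipelines for illustration only.
rag_stub = lambda q: f"[rag] {q}"
long_context_stub = lambda q: f"[long-context] {q}"

lookup = answer("What is the refund window?", rag_stub, long_context_stub)
synthesis = answer(
    "Compare the 2022 and 2024 leave policies and summarize what changed",
    rag_stub,
    long_context_stub,
)
```

The classifier is injected as a parameter so the routing logic stays the same when the heuristic is replaced by something stronger.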

// Graph-based reasoning

For queries that require understanding relationships across a dataset rather than retrieving a specific passage, vector retrieval fails by design.

Consider a multi-hop question: which decisions did management change in the third quarter, and what reason was given each time? No single passage answers this. The answer lies in the connections between documents.

Microsoft Research introduced GraphRAG in 2024. The system builds a knowledge graph from the corpus and then traverses relationships between entities rather than matching vectors.

It directly addresses a failure case that standard RAG cannot handle: multi-document synthesis requiring relational reasoning.
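The idea can be illustrated with a toy graph. This sketch is not GraphRAG's implementation: the triples are hand-written rather than extracted by a model, and every entity, relation, and document name is made up. It only shows how an answer can be assembled by following edges that originate in different documents.

```python
from collections import defaultdict

# (subject, relation, object, source document) - all hypothetical.
triples = [
    ("management", "changed", "pricing decision", "q3_board_minutes"),
    ("pricing decision", "changed_in", "Q3", "q3_board_minutes"),
    ("pricing decision", "reason", "competitor discounting", "strategy_memo"),
    ("management", "changed", "hiring freeze", "hr_update"),
    ("hiring freeze", "changed_in", "Q3", "hr_update"),
    ("hiring freeze", "reason", "revenue beat forecast", "finance_report"),
    ("management", "changed", "office lease", "facilities_note"),
    ("office lease", "changed_in", "Q2", "facilities_note"),
]

def build_graph(rows):
    """Index triples as graph[subject][relation] -> [(object, source), ...]."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj, source in rows:
        graph[subject][relation].append((obj, source))
    return graph

def changed_with_reasons(graph, actor: str, period: str) -> list[dict]:
    """Hop 1: actor -> decisions. Hop 2: filter by period. Hop 3: collect reasons."""
    findings = []
    for decision, src in graph[actor]["changed"]:
        periods = [(o, s) for o, s in graph[decision]["changed_in"] if o == period]
        if not periods:
            continue
        reasons = graph[decision]["reason"]
        sources = {src} | {s for _, s in periods} | {s for _, s in reasons}
        findings.append({
            "decision": decision,
            "reasons": [o for o, _ in reasons],
            "sources": sorted(sources),
        })
    return findings

graph = build_graph(triples)
q3_changes = changed_with_reasons(graph, "management", "Q3")
```

Each finding draws on more than one source document, which is exactly the case where retrieving a single best-matching passage comes up short.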

The trade-off is cost. Knowledge graph extraction is 3 to 5 times more expensive than baseline RAG and requires domain-specific tuning. GraphRAG is worth the overhead for thematic analysis and multi-hop reasoning. It is not worth it for looking up facts in a single passage.

# Conclusion

RAG is a reasonable default in many use cases.

It also breaks down in predictable ways: retrieval irrelevance when the vocabulary matches but the meaning diverges, context poisoning when conflicting versions exist in the corpus, and structural limits when no chunk size can deliver both recall and coherence. Adding complexity to a broken retrieval design makes these problems more expensive.

There are four better paths, depending on the situation:

  1. If the corpus fits in the context window, long-context prompting avoids the retrieval problem entirely.
  2. If context compression is necessary, summarizing before retrieval is more effective than retrieving raw chunks.
  3. If queries vary by type, explicit structured retrieval routing improves both accuracy and cost.
  4. If queries require relational synthesis across documents, graph-based reasoning is the appropriate architecture.

Match the architecture to the query type.

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor of analytics and the founder of StrataScratch, a platform that helps data scientists prepare for job interviews using real interview questions from top companies. Nate writes about the latest career trends, gives interview advice, shares data science projects, and discusses everything related to SQL.
