RAG Is Not a Search Engine
Every tutorial makes RAG look like three steps. After building retrieval pipelines that actually work in production, I can tell you it's more like thirty.
I've built search interfaces for twenty years. Site search, faceted product search, internal knowledge bases, marketing analytics dashboards with complex filtering. I thought that experience would translate directly to building Retrieval-Augmented Generation systems.
It didn't. Not even close.
The tutorial version of RAG is seductive in its simplicity: chunk your documents, embed them into vectors, retrieve the top-k most similar chunks when a user asks a question, stuff them into a prompt, and let the LLM synthesize an answer. Three steps. Maybe four if you count the embedding part separately. I had a working prototype in an afternoon.
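The tutorial pipeline really is this small. Here's a minimal sketch of it — with a bag-of-words counter standing in for a real embedding model, and every document and name purely illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_prompt(docs: list[str], question: str, k: int = 2) -> str:
    # 1. "Chunk": here each doc is already one chunk.
    # 2. Embed every chunk and the question.
    chunk_vecs = [(d, embed(d)) for d in docs]
    q_vec = embed(question)
    # 3. Retrieve the top-k most similar chunks.
    top = sorted(chunk_vecs, key=lambda cv: cosine(q_vec, cv[1]), reverse=True)[:k]
    # 4. Stuff them into a prompt and hand it to the LLM.
    context = "\n".join(d for d, _ in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Deployments run through the CI pipeline after tests pass.",
    "The payments service stores transactions in Postgres.",
    "Holiday schedules are published every December.",
]
print(naive_rag_prompt(docs, "How do deployments work?"))
```

That's the whole prototype. Everything that follows in this post is about why this version falls apart.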
The prototype worked beautifully on demo questions. It fell apart within a week of real usage. And the months I spent fixing it taught me that RAG is an entirely different discipline from search — one where most of the hard problems are invisible until you hit them in production.
Chunking Is the Ceiling
Here's a truth that took me too long to internalise: the quality of your RAG system can never exceed the quality of your chunks. If the right information isn't in the retrieved chunk — or if it's split across two chunks that don't get retrieved together — no amount of prompt engineering or model capability will save you.
Every tutorial tells you to chunk by token count. Pick a size — 512 tokens, maybe 1024 — add some overlap, and go. This works for demo purposes. It's catastrophic for real documents.
Warning
I was building a RAG system over internal documentation — product specs, architecture decision records, runbooks. Token-count chunking would split a numbered list across two chunks, separate a code example from the paragraph explaining it, or cut a procedure in half right at the critical step. The LLM would get chunk fragments that were technically relevant but practically useless.
The fix was semantic chunking — splitting documents along meaningful boundaries. Section headers, paragraph breaks, logical topic shifts. A chunk should be a complete thought, not a fixed-width slice of text. This meant writing custom chunking logic for different document types: Markdown got chunked by headers, API docs by endpoint, runbooks by procedure. It was tedious work, and it tripled the quality of our answers overnight.
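For Markdown, header-based chunking can be as simple as splitting on ATX headers so each section travels as one chunk. A minimal sketch — the helper name and splitting rule are illustrative, not our production logic:

```python
import re

def chunk_markdown_by_headers(text: str, max_level: int = 3) -> list[str]:
    # Split on ATX headers (#, ##, ###) so each chunk is a complete
    # section -- a full thought, not a fixed-width token slice.
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    chunks = []
    # Keep any preamble before the first header as its own chunk.
    if text[: starts[0]].strip():
        chunks.append(text[: starts[0]].strip())
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end].strip())
    return chunks

doc = """# Deploys
Push to main triggers the pipeline.

## Rollback
Run the rollback job with the previous tag.
"""
for c in chunk_markdown_by_headers(doc):
    print(repr(c))
```

The same pattern generalises: swap the header regex for an endpoint delimiter in API docs or a procedure boundary in runbooks, and the rest of the loop stays identical.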
I also learned the hard way that chunk size is a retrieval trade-off, not just a context window consideration. Small chunks are more precise — they match queries tightly. But they lack context. Large chunks carry more context but match more loosely and eat up your prompt budget. There's no universal right answer. I ended up running different chunk sizes for different document collections and tuning based on actual retrieval quality metrics, not vibes.
Retrieval Is the Bottleneck, Not Generation
This was the biggest mental shift. I spent my first few months focused on the generation side — prompt templates, system messages, output formatting. The model wasn't giving good answers, so I assumed the model was the problem.
It wasn't. The model was fine. The model was getting garbage context and doing its best with it.
When I started logging what was actually being retrieved — printing the chunks that went into each prompt — the picture became painfully clear. For maybe 40% of real user questions, the retrieved chunks were irrelevant, only partially relevant, or missing the one piece of information that would have made the answer correct.
That 40% number changed everything about where I spent my time. I stopped tweaking prompts and started obsessing over retrieval quality. Better chunking. Better embeddings. Better retrieval strategies. The generation side got modest improvements from prompt work. The retrieval side got transformational improvements from architectural changes.
If you're building RAG and your answers aren't good enough, I'd bet money the problem is retrieval. Check what's actually going into your prompts before you touch anything else.
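Logging retrieval is the cheapest diagnostic there is. Here's a sketch of the kind of structured per-query record I mean — the field names and IDs are illustrative:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("rag.retrieval")

def log_retrieval(question: str, chunks: list[dict]) -> dict:
    # One structured record per query, so you can audit later whether
    # the retrieved context actually contained the answer.
    record = {
        "ts": time.time(),
        "question": question,
        "chunks": [
            {"id": c["id"], "score": round(c["score"], 3), "preview": c["text"][:80]}
            for c in chunks
        ],
    }
    log.info(json.dumps(record))
    return record

rec = log_retrieval(
    "What's the deploy process for the payments service?",
    [{"id": "runbook-12", "score": 0.8123,
      "text": "payments-service deploys go through the release pipeline."}],
)
```

Grepping a day of these records for questions whose top chunks scored low — or whose previews obviously don't contain the answer — is how I arrived at that 40% number.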
Semantic Similarity Is Not Relevance
This one is subtle and it burned me repeatedly. Vector similarity search finds chunks that are semantically similar to the query. Similar is not the same as relevant.
A user asks: "What's the deployment process for the payments service?" The embedding search returns chunks about deployment processes for other services, because those chunks are semantically similar — they're about deployment, they use similar vocabulary, they exist in the same conceptual neighbourhood. But the user wanted a specific service, and the most similar chunks aren't the most relevant ones.
This is the fundamental limitation of dense retrieval. Embeddings capture semantic meaning at a high level, but they're lossy. They compress a chunk of text into a fixed-dimensional vector, and that compression throws away specifics. Entity names, exact version numbers, specific service identifiers — these often don't survive the embedding process with enough fidelity to discriminate between similar-but-wrong and actually-right.
I started thinking of vector search as a recall-optimised first pass, not an answer. It casts a wide net. You need something else to sort the catch.
Embedding Models Have Blind Spots
This one will keep you up at night once you see it. Embedding models — even the good ones — have systematic blind spots.
Negation. "Services that do NOT require authentication" and "services that require authentication" produce nearly identical embeddings. The word "not" barely registers in the vector space. I discovered this when a RAG system confidently told a user that a public endpoint required auth, because it retrieved a chunk saying "this endpoint does not require authentication" and the embedding model had essentially ignored the negation.
Temporal references. "The current deployment process" and "the old deployment process" embed similarly. If your documents contain both current and deprecated procedures, vector search will happily retrieve the deprecated version because it's semantically similar.
Quantitative comparisons. "Teams with more than 50 members" and "teams with fewer than 50 members" — same embedding neighbourhood. The directional relationship gets flattened.
These aren't edge cases. They show up constantly in real-world usage, and they erode trust in the system quietly. Users get wrong answers, lose confidence, and stop using the tool. You don't even know it's happening unless you're actively monitoring retrieval quality.
Hybrid Retrieval Catches the Edge Cases
The single biggest improvement to my RAG systems came from hybrid retrieval: combining semantic vector search with traditional keyword search (BM25).
Vector search is great at understanding intent and finding conceptually related content. BM25 is great at exact matching — specific terms, names, identifiers, code references. Together, they cover each other's blind spots.
When a user asks about "the payments-service deploy process," vector search finds chunks about deployment processes. BM25 finds chunks containing the literal string "payments-service." The intersection of those two result sets is dramatically more relevant than either one alone.
I use Reciprocal Rank Fusion to combine the results — it's simple, effective, and doesn't require training a separate model. The improvement was immediate and substantial. Questions that previously returned vaguely related content started returning precisely the right content. The kinds of queries where embedding blind spots caused problems — negation, specific entities, exact terminology — suddenly worked because BM25 was catching what vectors missed.
The engineering cost of adding BM25 alongside vector search was maybe two days of work. The quality improvement was weeks of prompt engineering I no longer had to do.
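RRF itself is only a few lines. A sketch, assuming each retriever hands back an ordered list of chunk IDs, best first — the IDs and the k=60 constant are the usual convention, but everything here is illustrative:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is an ordered list of chunk IDs from one retriever
    # (e.g. vector search, BM25). RRF scores a chunk by summing
    # 1 / (k + rank) across rankings; k=60 is the common default.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["deploy-general", "payments-spec", "ci-overview"]
bm25_hits = ["payments-spec", "payments-runbook"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# "payments-spec" wins: it appears near the top of both rankings.
```

The appeal is that it works purely on ranks, so you never have to normalise a cosine similarity against a BM25 score — two numbers that live on completely different scales.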
Figure: Vector-only retrieval — finds semantically similar chunks but misses exact entities, negation, and specific terminology.
Figure: Hybrid retrieval (vector + BM25) — semantic search captures intent while keyword search catches exact terms; Reciprocal Rank Fusion combines both, dramatically improving relevance.
Reranking Changes Everything
Even with hybrid retrieval, you're still working with a first-pass ranking. The initial retrieval is optimised for recall — get the right chunk into the candidate set. But it's not optimised for precision — putting the best chunk first.
Enter cross-encoder reranking. Instead of independently embedding the query and each chunk (bi-encoder), a cross-encoder takes the query and a candidate chunk together and scores their relevance as a pair. It's dramatically more accurate because it can attend to the fine-grained relationship between the question and the content.
It's also dramatically more expensive. You can't run a cross-encoder against your entire document collection — it's too slow. But you can retrieve 50 candidates with fast bi-encoder search and then rerank those 50 with a cross-encoder to pick the top 5. This two-stage approach gives you the recall of vector search with the precision of cross-attention.
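The two-stage shape looks like this, with the cross-encoder abstracted as a pluggable scoring function. The toy overlap scorer below is a stand-in for a real model (e.g. a sentence-transformers CrossEncoder scoring query–chunk pairs); the names and example chunks are mine:

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    candidates: list[str],  # e.g. 50 chunks from fast hybrid retrieval
    cross_score: Callable[[str, str], float],  # scores (query, chunk) pairs
    top_n: int = 5,
) -> list[str]:
    # Stage 2 only: rerank the small candidate set with the expensive
    # scorer, then keep the best few for the prompt.
    ranked = sorted(candidates, key=lambda c: cross_score(query, c), reverse=True)
    return ranked[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query tokens in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

chunks = [
    "General deployment guidelines for all services.",
    "payments-service deploy: run the release pipeline with the payments config.",
]
print(retrieve_then_rerank("payments-service deploy steps", chunks, overlap_score, top_n=1))
```

Because the scorer only ever sees the candidate set, the latency cost scales with 50 pairs per query, not with the size of your document collection.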
When I added reranking to our pipeline, the answer quality improvement was startling. Not incremental — startling. The right chunk moved from positions 3–8 in the initial retrieval to positions 1–2 after reranking, consistently. The LLM was getting better context, and the answers reflected it immediately.
The Retrieval-to-Generation Compute Ratio
Here's my rule of thumb after building half a dozen RAG systems: if you're spending more compute on generation than retrieval, your architecture is backwards.
The best RAG systems I've built spend significant effort on retrieval — semantic chunking, hybrid search, reranking, metadata filtering — and then hand clean, precise, highly-relevant context to the LLM for a relatively straightforward generation step. The generation prompt doesn't need to be clever because the context is good.
The worst RAG systems do the opposite: fast, sloppy retrieval followed by elaborate prompts that try to compensate for noisy context. You end up with long system messages full of instructions like "ignore irrelevant context" and "only answer based on the provided information" — and the model still hallucinates because the relevant information simply isn't in the context window.
I think of it like cooking. You can be a brilliant chef, but if your ingredients are bad, the dish suffers. Retrieval is your ingredient sourcing. Generation is your cooking. Invest accordingly.
What I Wish I'd Known at the Start
RAG is not a search engine with an LLM bolted on. It's a retrieval engineering discipline that happens to use generation as its interface. The skills that matter are information retrieval, document processing, relevance scoring, and evaluation — not prompt engineering. The hard problems are chunking, embedding quality, retrieval strategy, and relevance measurement — not model selection or temperature tuning.
Every hour I spent on retrieval quality paid back tenfold in answer quality. Every hour I spent on prompt tricks for the generation step was mostly wasted until the retrieval was solid.
If you're starting a RAG project, build your evaluation pipeline before you build your retrieval pipeline. Know how you'll measure relevance. Log your retrieved chunks from day one. And expect to spend most of your time on the thirty steps between "embed your documents" and "get good answers" — because that's where the actual engineering lives.