Context Windows Are a Lie (Sort Of)
128k tokens is a marketing number. The usable window is a fraction of that — and I learned the hard way that how you fill it matters more than how much you have.
I built a RAG pipeline last autumn that I was genuinely proud of. It retrieved documents from a vector store, ranked them by relevance, and stuffed them into the context window alongside the user's query. The model had 128k tokens to work with. I was using maybe 40k. Plenty of headroom. Retrieval scores looked great. The architecture was clean.
One problem: the answers were wrong about 30% of the time.
Not hallucinated-out-of-thin-air wrong. Worse. The model would reference real information from real documents — just not the right documents. It would pull facts from the third or fourth most relevant source and ignore the most relevant one entirely. Users would get a confident, well-cited, completely misleading answer.
It took me two weeks to figure out what was happening. The answer was the lost-in-the-middle problem, and once I understood it, I felt stupid for not knowing sooner.
The middle is a dead zone
There's research on this that I wish I'd read before building anything. When you load a bunch of documents into a context window, the model pays disproportionate attention to what's at the beginning and what's at the end. The stuff in the middle? It might as well not be there.
My pipeline was retrieving twenty documents, sorting them by relevance score (highest first, which seemed logical), and packing them all in. The most relevant document was at the top — great. The second most relevant was right after it — fine. But by document seven or eight, we were deep in the middle of the context, and the model was effectively ignoring everything there. Then at the end of the context, just before the user's query, sat documents fifteen through twenty — the least relevant ones — and the model was paying close attention to those.
So the model was anchoring on the best document and the worst documents, and glazing over everything in between. No wonder the answers were inconsistent.
The fix was embarrassingly simple
I restructured the context layout. Most relevant document first. Second most relevant document last, right before the query. Third most relevant second. Fourth most relevant second-to-last. Zigzag pattern, placing important content at the positions where the model actually pays attention.
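The zigzag placement is mechanical enough to sketch as a small helper. This is a minimal illustration, assuming the documents arrive sorted most-relevant-first:

```python
def zigzag_order(docs):
    """Reorder docs (sorted most-relevant-first) so the strongest items
    sit at the start and end of the context, where attention is highest.

    Rank 1 goes first, rank 2 last, rank 3 second, rank 4 second-to-last,
    and so on, leaving the weakest material in the middle dead zone.
    """
    front, back = [], []
    for i, doc in enumerate(docs):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5... fill from the front
        else:
            back.append(doc)    # ranks 2, 4, 6... fill toward the end
    return front + back[::-1]

# Ranks 1..5 end up as [1, 3, 5, 4, 2]: best first, second-best last.
```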
Accuracy went from roughly 70% to 85% overnight. Same documents. Same model. Same prompt. Different ordering.
70%: before (naive ordering)
85%: after (zigzag ordering)
93%: after (pruned to top 3)
Then I made a more radical change: I stopped stuffing twenty documents in at all.
Three paragraphs beat thirty pages
This was the harder lesson, and it goes against every instinct. When you've built a retrieval pipeline that surfaces relevant documents, the temptation is to give the model as much context as possible. More information means better answers, right?
Wrong. More information means more noise. More noise means more opportunities for the model to latch onto something irrelevant. More opportunities means less predictable outputs.
I started aggressively pruning. Instead of twenty full documents, I now retrieve ten, re-rank them, take the top three, and extract only the most relevant sections from each. The total context went from 40k tokens to about 4k tokens. A 90% reduction.
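The pruning step can be sketched roughly like this. It's a toy version: the document dict shape and the keyword-overlap section scorer are assumptions for illustration, where a real pipeline would score sections with a cross-encoder or similar:

```python
def prune_context(ranked_docs, query_terms, top_k=3, max_sections=2):
    """Keep only the top_k documents and, from each, the sections most
    relevant to the query. Sections here are paragraphs split on blank
    lines; the scorer is a crude term-overlap count used as a stand-in."""
    kept = []
    for doc in ranked_docs[:top_k]:
        sections = doc["text"].split("\n\n")
        # Score each section by how many query terms it mentions.
        scored = sorted(
            sections,
            key=lambda s: sum(term.lower() in s.lower() for term in query_terms),
            reverse=True,
        )
        kept.append({
            "title": doc["title"],
            "text": "\n\n".join(scored[:max_sections]),
        })
    return kept
```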
Old approach: 20 full documents, 40k tokens, relevance-sorted top to bottom. Accuracy: 70%.
New approach: top 3 documents, 4k tokens, zigzag ordering with structured delimiters. Accuracy: 93%.
Accuracy went from 85% to 93%.
I want to repeat that because it still surprises me. I cut the context by 90% and accuracy improved by eight percentage points. Less was dramatically more.
Key Insight
The reason is straightforward once you think about it: with less context, there's less for the model to get confused by. The signal-to-noise ratio is everything. Three highly relevant paragraphs give the model a clear, focused basis for answering. Thirty pages of loosely related content give it a haystack to wander through.
Context window size is a ceiling, not a target
The marketing around context windows drives me slightly crazy. Every model announcement trumpets a bigger number. 128k. 200k. A million tokens. And developers — myself included, initially — treat that as an invitation. Oh, we have 200k tokens? Let's use them.
But a context window is a ceiling, not a target. The fact that you can fit an entire codebase into the context doesn't mean you should. The fact that you can include every customer support ticket from the last quarter doesn't mean the model will make good use of them.
I think about this the same way I think about screen real estate. When widescreen monitors became standard, a certain kind of designer responded by making everything wider. Content stretched edge to edge. Text lines ran to 200 characters. It was technically using the available space. It was also unreadable. Good designers understood that the extra space was for breathing room, for layout, for letting content sit comfortably — not for cramming in more stuff.
Context windows are the same. The extra capacity is there so you don't have to make hard tradeoffs on short contexts. It's not an invitation to skip curation.
Structured delimiters matter more than you think
The other change that made a measurable difference was how I formatted the context. Early on, I was concatenating documents with simple newlines between them. The model had to figure out where one document ended and another began, what the source was, how to attribute information.
Now every document chunk goes in with explicit structure:
<source id="1" relevance="high" title="Document Title">
Content here
</source>
This sounds like a small thing. It's not. Structured delimiters do two things: they help the model understand the boundaries between pieces of information, and they give it a framework for attribution. When the model can say "according to source 1" it's doing something different cognitively than when it's trying to synthesise a blob of undifferentiated text.
I also add a brief instruction at the top of the context: "The following sources are ordered by relevance. Prefer information from earlier sources when sources conflict." Obvious to a human reader. Not obvious to a model unless you say it explicitly.
What I do now
My current RAG architecture looks nothing like what I started with. Here's the pipeline:
- Retrieve a broad candidate set (top 20 documents by vector similarity)
- Re-rank with a cross-encoder model
- Take the top 3-5 documents only
- Extract relevant sections from each (not full documents)
- Format with structured delimiters and metadata
- Order strategically — most important first and last, supporting material in between
- Include explicit instructions about source priority
- Keep total context under 6k tokens for most queries
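The whole pipeline above can be sketched end to end. Everything here is a toy stand-in so the sketch is self-contained: similarity and re-ranking collapse into one keyword-overlap scorer (a real system would use vector search plus a cross-encoder), and the token budget is approximated with characters:

```python
def build_context(query, corpus, keep=3, budget_chars=24000):
    """End-to-end sketch: retrieve, re-rank, prune, zigzag-order,
    then format with delimiters under a context budget."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    # 1-2. Broad candidate set, then re-rank (same toy scorer for both here).
    candidates = sorted(corpus, key=lambda d: overlap(query, d["text"]),
                        reverse=True)[:20]
    top = candidates[:keep]

    # 3. Extract only the paragraphs that share terms with the query.
    chunks = []
    for doc in top:
        paras = [p for p in doc["text"].split("\n\n") if overlap(query, p)]
        chunks.append({"title": doc["title"],
                       "text": "\n\n".join(paras) or doc["text"]})

    # 4. Zigzag: best first, second-best last, weakest in the middle.
    ordered = chunks[0::2] + chunks[1::2][::-1]

    # 5. Structured delimiters plus the explicit priority instruction,
    #    trimmed to the budget.
    parts = ["The following sources are ordered by relevance. "
             "Prefer information from earlier sources when sources conflict."]
    for i, c in enumerate(ordered, start=1):
        parts.append(f'<source id="{i}" title="{c["title"]}">\n'
                     f'{c["text"]}\n</source>')
    return "\n\n".join(parts)[:budget_chars]
```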
That last point still gets pushback when I talk about it. Six thousand tokens? You have 128k available! Yes. And 6k of sharp, curated, well-structured context outperforms 60k of loosely relevant material every single time in my testing.
The bigger lesson
Twenty-five years of building things has taught me that the constraint is rarely what the vendor tells you it is. The database can handle a million rows but your queries fall apart at fifty thousand because of missing indexes. The CDN can serve terabytes but your site is slow because you're loading twelve render-blocking scripts. The context window can hold 128k tokens but your answers degrade at 10k because you're not curating what goes in.
The marketing number is never the engineering number. The theoretical capacity is never the practical capacity. And the gap between them is where all the actual work lives.
128k tokens is a lie in the same way that "up to 1Gbps" on your internet plan is a lie. Technically true. Practically misleading. The real number — the number that determines whether your system works — is much smaller, and it's entirely in your control.
Fill the window thoughtfully. Less context, better context, structured context. That's the whole lesson. I just wish it hadn't taken me a month and a 30% error rate to learn it.