The Billion-Token Tender: Why RAG Isn't Fading, It's Gearing Up


Every few weeks, a new headline seems to toll the bell for Retrieval-Augmented Generation (RAG). With language models now boasting context windows of a million tokens or more, the argument goes, why bother with the complexity of retrieving information? Why not just put the entire library in the prompt?

It’s a seductive idea. A world of effortless, boundless context where you can ask an AI to reason over an entire corporate archive in a single go. But from where we stand, in the digital trenches of the construction industry, this vision isn't just distant—it's a mirage.

The truth is, RAG isn't a temporary crutch for models with poor memory. It’s a foundational strategy for anyone serious about applying AI to real-world, industrial-scale problems. And two colossal roadblocks stand in the way of the "just stuff it in the context" dream: performance and price.

The Needle in a Haystack Factory

First, let's talk performance. Even the most advanced models suffer from what's been called "context rot" or the "lost in the middle" problem. When you provide a model with a massive, undifferentiated block of text, its ability to pinpoint and reason about specific details degrades significantly. It's like asking a CEO to recall a specific clause from page 782 of a 1,000-page due diligence report they skimmed once. The information is technically there, but it’s buried.

Effective AI doesn't just need access to data; it needs focused, relevant data. And when your "context" is measured in gigabytes, you need more than a bigger prompt. You need a better strategy.

This is where the term Context Engineering becomes more fitting than simple prompt engineering. The art isn't just in asking the right question, but in surgically delivering the right information to the model at the right time.

“That’s Great, But We Don’t Deal With Projects That Small”

Let's move from the theoretical to the tangible. One of our early landmark projects was for a hospital tender in France. The data package was over 12 GB, which our engine at the time processed into roughly 100 million tokens of text. Even two years ago, this was an astronomical figure, orders of magnitude beyond any model's capacity.

We were proud of our ability to handle this complexity. So, you can imagine our surprise when we presented this case study to a prospective client, a major player in international infrastructure, and their response was:

“Well that’s great, but we generally don’t deal with projects that small.”

That conversation was a profound lesson in scale. Our largest project to date came in at just over 100 GB, spread across nearly 19,000 files. When tokenized, that's a veritable mountain of context: approximately 1.2 billion tokens.

This is the reality of the construction world. Tenders aren't novels; they are sprawling ecosystems of documents.

The $26,000 Question

Now, let's talk about the price. Setting aside the fact that no model can handle 1.2 billion tokens today, let's indulge in a thought experiment. What if one could?

We can extrapolate the cost from current API pricing. The table below lists what each provider charges per million input and output tokens, and what a single query with a "long" context of 200,000 input tokens and 5,000 output tokens would cost on various popular models.

Model                      Context Input  Max Output  Input Price    Output Price   Cost for a
                           (tokens)       (tokens)    ($/1M tokens)  ($/1M tokens)  200k/5k query
o4-mini                    200,000        100,000     $1.10          $4.40          24.2 ct
o3                         200,000        100,000     $2.00          $8.00          44 ct
gpt-4.1                    1,000,000      32,768      $2.00          $8.00          44 ct
gemini-2.5-pro             1,048,576      64,000      $2.50          $15.00         57.5 ct
claude-sonnet-4-20250514   200,000        128,000     $3.00          $15.00         67.5 ct
grok-4-0709                256,000        64,000      $3.00          $30.00         75 ct
o1                         200,000        100,000     $15.00         $60.00         330 ct
claude-opus-4-20250514     200,000        32,000      $15.00         $75.00         337.5 ct
o3-pro                     200,000        100,000     $20.00         $80.00         440 ct

Now, let's take our 1.2-billion-token project. That's 6,000 times larger than the 200,000-token context in our example. If we naively extrapolate the cost, a single question about this tender would be financially ruinous:

  • On the high-performance o3-pro, it would cost around $26,400.
  • Using the more balanced Claude Sonnet, it would still be $4,050.
  • Even with the most economical model, o4-mini, you’d burn over $1,450.

For one question.
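If you want to sanity-check the arithmetic, the back-of-the-envelope sketch below reproduces the table's cost column and the naive 6,000x extrapolation. The prices are the per-million-token figures listed above; the `query_cost` helper and the trimmed-down model list are purely illustrative, not any vendor's SDK.

```python
# Back-of-the-envelope cost check. Prices are USD per 1M tokens, copied
# from the table above; the model keys and helper are illustrative only.

PRICES = {
    # model: (input $/1M tokens, output $/1M tokens)
    "o4-mini": (1.1, 4.4),
    "claude-sonnet-4-20250514": (3.0, 15.0),
    "o3-pro": (20.0, 80.0),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at per-million-token pricing."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# The 200k-in / 5k-out reference query from the table.
for model in PRICES:
    print(f"{model}: {query_cost(model, 200_000, 5_000) * 100:.1f} ct")

# Naive extrapolation to a 1.2-billion-token tender: scale the whole
# reference query by 6,000 (1.2B / 200k), as in the bullets above.
scale = 1_200_000_000 / 200_000
for model in PRICES:
    print(f"{model}: ~${query_cost(model, 200_000, 5_000) * scale:,.0f} per question")
```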

And that's before we consider speed. A prompt of this size would likely take days to process. On a real-world project that runs for years and requires hundreds of queries, this approach isn't just expensive; it's non-functional.

The Future is Intelligent, Not Just Big

This is why RAG, or more broadly Context Engineering, will remain the cornerstone of professional-grade AI systems. The goal isn't to make the model drink from a firehose. The goal is to give it a glass of precisely the water it needs.

By intelligently searching, ranking, summarizing, and structuring the vast sea of data before it ever reaches the language model, we accomplish three things:

  1. We improve accuracy by reducing noise and focusing the model's attention.
  2. We control costs by sending only kilobytes of relevant data, not gigabytes.
  3. We ensure speed, delivering answers in seconds, not days.
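
In code, that pipeline can be as compact as the minimal sketch below. It is an illustration, not our production system: the retriever, reranker, and LLM client are stand-ins for whatever vector index, cross-encoder, and model API you happen to use.

```python
# Minimal RAG sketch: search -> rank -> structure -> generate.
# `retriever`, `reranker`, and `llm` are placeholders for whatever vector
# index, cross-encoder, and LLM client you actually run; `chunk` objects
# are assumed to carry .text, .source, and .token_count attributes.

def answer(question: str, retriever, reranker, llm,
           top_k: int = 50, keep: int = 8, budget_tokens: int = 20_000) -> str:
    # 1. Search: pull a broad set of candidate chunks from the index.
    candidates = retriever.search(question, limit=top_k)

    # 2. Rank: keep only the passages most relevant to this question.
    best = reranker.rank(question, candidates)[:keep]

    # 3. Structure: assemble a compact, cited context within a token budget.
    context, used = [], 0
    for chunk in best:
        if used + chunk.token_count > budget_tokens:
            break
        context.append(f"[{chunk.source}] {chunk.text}")
        used += chunk.token_count

    # 4. Generate: one focused call on kilobytes of context, not gigabytes.
    prompt = (
        "Answer the question using only the excerpts below. Cite sources.\n\n"
        + "\n\n".join(context)
        + f"\n\nQuestion: {question}"
    )
    return llm.complete(prompt)
```

The interesting engineering lives inside those placeholders; the point is simply that the model only ever sees the few passages that actually matter.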

So, is RAG dead? Far from it. As the world's data continues to explode, the need for intelligent, efficient, and surgical retrieval has never been more critical. The context windows can grow all they want; we'll be here, engineering the context that matters.