Beyond RAG - methods for context disclosure

Context is everything. Frontier LLMs have remarkable breadth and depth of knowledge baked in, but that knowledge is inherently flaky. The fix is surprisingly simple: give the LLM the right resources for the job, including a way for it to validate its own work. Suddenly, these unpredictable models can be orchestrated into dependable agents.
By the end of 2025 we started to see such agents being used at scale for software engineering, spurred on by two main developments:
- The release of the latest generation of frontier models (Opus 4.5, GPT 5.2, Gemini 3 Pro)
- The rapid evolution of command-line based agent harnesses (Claude Code, OpenCode, Codex, Gemini CLI)
The challenge
These agents are only as good as the context they're given. Suppose we've collected a bunch of resources (documents and tools) which could be useful for someone working in a particular domain. And suppose we employ an agent to work in that domain. Until the agent starts interacting with a user or an environment, we don't know exactly what its first task is going to be or what resources it might need.
Our fundamental challenge is this: how do we disclose the relevant resources to the LLM's context without overwhelming it? Let's pick a specific use-case to make things concrete.
Example: web development agent
A developer is using a coding agent to develop a web application using the web framework Next.js. The agent is using a frontier LLM which already knows about Next.js, since the framework has been written about extensively online for years.
But Next.js is constantly evolving, so the LLM doesn't know exactly which features correspond to exactly which version of the framework. The LLM trained on data that was curated before the most recent Next.js versions even existed, so it doesn't know about those at all.
The developer gives the coding agent a specific task related to Next.js. You don't need to understand what it means:
Create a placeholder edge API route handler without using the app directory
Depending on what version of Next.js the developer is using, the coding agent may or may not work well out of the box. However, if the agent had access to the relevant parts of the official Next.js documentation, we'd expect it to complete the task perfectly.
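For the curious, a correct answer would look something like the sketch below. The filename is illustrative, and the exact config shape depends on the Next.js version in play, which is exactly the trap:

```ts
// pages/api/placeholder.ts (illustrative filename) - a Pages Router API
// route, i.e. no app directory involved.

// Opt this route into the Edge runtime via its config export.
export const config = {
  runtime: 'edge',
};

// Edge API routes use the Web Request/Response APIs rather than the
// Node-style req/res objects of default API routes.
export default async function handler(req: Request): Promise<Response> {
  return new Response(JSON.stringify({ status: 'placeholder' }), {
    status: 200,
    headers: { 'content-type': 'application/json' },
  });
}
```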
So how do we feed the LLM the context it needs to avoid making mistakes?
1. Pre-loaded context
Handpicked
Before hitting enter on the query above, let's say the developer visits the Next.js docs site and reads enough to understand which parts are relevant to the task. They copy-paste those parts into the LLM's system instruction, or directly into the user message alongside the original query:
Here are some snippets from the Next.js docs:
<docs>...</docs>
Create a placeholder edge API route handler without using the app directory
This would lead to good results, but the burden is on the developer to search for the relevant docs. Depending on the task, that could take significant mental bandwidth and time. Putting the grunt work on the developer like this undermines the power of the coding agent.
In bulk
What if the developer had access to a bulk download of the entire Next.js docs? Instead of handpicking the relevant parts for every task, they could include the whole thing in the system instruction for every task. The LLM would always have the necessary context upfront, whatever the Next.js-related task. What could go wrong?
Context window limit
The entire Next.js documentation amounts to millions of tokens. The context window of most frontier models from 2025 (that is, the maximum number of tokens they can handle in one session) is between 200k and 1m tokens. So the first problem with pre-loading everything is that it simply wouldn't fit.
Performance, cost and time
If you really want a friend to remember to bring bread to a picnic, just mention the bread, don't list what everyone else is bringing. That's to say, the more noise we cram into context, the less reliable the LLM becomes. Even if the entire docs did fit, we'd find that the bits relevant to our task would be massively diluted, and the agent wouldn't perform so well.
Finally, at this scale, both the cost of inference and the waiting time before the LLM starts to respond are roughly proportional to the context size. So by pre-loading the entire docs we've hiked our bill by many orders of magnitude, and made the agent feel sluggish. Context-caching can reduce these effects, but it's not a silver bullet.
2. Retrieval augmented generation (RAG)
So we want to feed the LLM handpicked context, but we don't want the user to have to handpick it themselves. Enter RAG, the much-hyped buzzword of 2023.
RAG who?
RAG is a fancy acronym for this simple concept: before giving the developer's query to the LLM, let's search for the most relevant docs, then give those docs to the LLM together with the original query.
It's just like the "handpicked" example above, but this time a computer searches for the relevant docs.
There are many different approaches to the search step (aka information retrieval). It could be a Google search, or a BM25 bag-of-words algorithm running over text documents, or a customer lookup SQL query, or an embedding-similarity vector search. In any case, it's a programmatic search over a bunch of data without an LLM.
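A minimal sketch of what that pipeline might look like, assuming an embedding-similarity search. The `embed` helper is a placeholder for whatever embedding model you use, and the document embeddings are assumed to be pre-computed offline:

```ts
// A minimal embedding-similarity retrieval sketch.
type Doc = { url: string; text: string; embedding: number[] };

async function embed(text: string): Promise<number[]> {
  // Call your embedding model of choice here.
  throw new Error('plug in an embedding model');
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every doc against the query and keep the top k.
async function retrieve(query: string, docs: Doc[], k = 5): Promise<Doc[]> {
  const q = await embed(query);
  return [...docs]
    .sort((a, b) => cosine(q, b.embedding) - cosine(q, a.embedding))
    .slice(0, k);
}

// The retrieved snippets get prepended to the query, exactly like the
// handpicked example above - only now a program did the picking.
async function buildPrompt(query: string, docs: Doc[]): Promise<string> {
  const top = await retrieve(query, docs);
  const snippets = top.map((d) => `<doc url="${d.url}">${d.text}</doc>`).join('\n');
  return `Here are some snippets from the Next.js docs:\n<docs>\n${snippets}\n</docs>\n\n${query}`;
}
```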
The problem with RAG
RAG can be super effective, and it's the only feasible approach if you have billions of data points. But here's the thing: information retrieval is hard. Just ask Google. Finding the most relevant information for any given query, and filtering out the stuff that seems relevant but actually isn't, is a complex science.
Take our example query above which included the instruction "without using the app directory". Most search algorithms will really want to return docs about solutions with the app directory because we literally said the words "app directory". Agents can do clever things with LLM-based subagents to deal with this - they could run multiple searches with different wordings of the developer's query, and filter through the top results. But it doesn't change the fact that information retrieval is hard.
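A sketch of that multi-query workaround, with hypothetical `rewriteQuery` and `search` stubs standing in for an LLM call and whatever retrieval step you already have:

```ts
// Multi-query retrieval: reword the query with an LLM, search for each
// variant, then merge and dedupe the results.
async function rewriteQuery(query: string): Promise<string[]> {
  // In practice: ask a small, fast LLM for a handful of paraphrases.
  throw new Error('plug in an LLM call');
}

async function search(query: string, k: number): Promise<{ url: string }[]> {
  throw new Error('plug in your retrieval step');
}

async function multiQueryRetrieve(query: string, k = 5): Promise<{ url: string }[]> {
  const variants = [query, ...(await rewriteQuery(query))];
  const resultLists = await Promise.all(variants.map((v) => search(v, k)));
  // Dedupe by URL, keeping the first occurrence of each doc.
  const seen = new Map<string, { url: string }>();
  for (const doc of resultLists.flat()) {
    if (!seen.has(doc.url)) seen.set(doc.url, doc);
  }
  return [...seen.values()].slice(0, k);
}
```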
RAG is useful but hard to implement and maintain, and in practice the results can be underwhelming.
3. In-context agentic retrieval
For many use-cases, we can do better. The solution? LLMs of course. LLMs are very effective semantic search engines.
Huh? How can we use an LLM to search over the very same documents that we don't want to fully disclose in the first place?
Compression
It turns out that LLMs are so good at semantic search that you don't need to give much away. Suppose we compressed each Next.js doc to a title and summary:
**Document 1**
URL: `/docs/15/app/getting-started/route-handlers-and-middleware`
Title: `Route Handlers and Middleware`
Summary: `Explains Route Handlers (custom request handlers using Web APIs defined in route.js files within the app directory) and Middleware (code that runs before requests are completed, defined in a middleware.ts file at the project root for tasks like redirects and header modifications).`
**Document 2**
URL: `/docs/15/pages/building-your-application/routing/api-routes`
Title: `API Routes`
Summary: `How to create server-side API endpoints by placing files in the pages/api folder, which are mapped to /api/* routes and provide built-in helpers for handling HTTP requests, responses, and various configurations.`
**Document 3**
URL: `/docs/15/pages/building-your-application/routing/dynamic-routes`
Title: `Dynamic Routes`
Summary: `How to create dynamic routes using square bracket notation in file/folder names (like [slug] or [...slug]), which allows you to handle URL paths with variable segments that can be accessed via the router's query parameters.`
**Document 4**
...
If we give the LLM these one-sentence summaries alongside our original query, it could tell us that document 2 is worth reading. To be honest, it could probably tell us that just from looking at the URLs, which contain hints. By contrast, an equivalent BM25 or vector search algorithm would need to search over the entire document contents and even then would struggle to rank document 2 as more relevant than document 1, which contains similar vocabulary.
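Here is a sketch of that in-context selection step, assuming a hypothetical `callLLM` helper and an index of the compressed entries above:

```ts
// In-context "search": hand the LLM the compressed index and let it pick.
type DocEntry = { url: string; title: string; summary: string };

async function callLLM(prompt: string): Promise<string> {
  throw new Error('plug in your LLM provider');
}

async function pickRelevantDocs(query: string, index: DocEntry[]): Promise<string[]> {
  const listing = index
    .map((d) => `- ${d.title} (${d.url}): ${d.summary}`)
    .join('\n');
  const prompt =
    `Here is an index of the Next.js docs:\n${listing}\n\n` +
    `Task: ${query}\n` +
    `Reply with the URLs of the documents worth reading in full, one per line.`;
  const reply = await callLLM(prompt);
  return reply
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.startsWith('/docs/'));
}
```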
Tool use
What if we pre-loaded the ID and summary of every doc into the system instruction, and also included a `readDocument` tool? Now, anytime the agent thinks one of the documents seems useful, it can go ahead and read it. Our agent has access to all the relevant context and we're using orders of magnitude fewer tokens than if we'd disclosed the full 1m+ tokens upfront.
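A sketch of what that tool might look like. The exact wire format differs between providers, but most accept a name, a description and JSON-schema parameters; `docStore` is a placeholder for however you map URLs to full document text:

```ts
// A readDocument tool definition (provider-agnostic shape).
const readDocumentTool = {
  name: 'readDocument',
  description:
    'Read the full contents of a documentation page. Use this whenever one of ' +
    'the pre-loaded document summaries looks relevant to the current task.',
  parameters: {
    type: 'object',
    properties: {
      url: {
        type: 'string',
        description: 'URL of the document, e.g. /docs/15/pages/building-your-application/routing/api-routes',
      },
    },
    required: ['url'],
  },
};

// The handler the agent harness runs when the model calls the tool.
async function handleReadDocument(
  args: { url: string },
  docStore: Map<string, string>,
): Promise<string> {
  return docStore.get(args.url) ?? `No document found at ${args.url}`;
}
```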
But pre-loading the summaries of every document still consumes 10k+ tokens, which seems wasteful. How can we do better?
Progressive disclosure through hierarchy
Before learning about snow leopards you should probably have a grasp on cats in general, and before that it might be useful to know roughly how vertebrate mammals work. That's to say, domain knowledge is normally structured hierarchically.
Take a look at these pages from the Next.js docs:
graph TD
%% Define Classes for Styling
classDef default fill:#f5f2ee,stroke:#856e5e,stroke-width:1px,color:#856e5e;
classDef highlighted fill:#fff9ed,stroke:#e6a00c,stroke-width:2px,color:#333;
linkStyle default stroke:#856e5e,stroke-width:1px;
%% Define Tree Structure (Nodes and Links)
Root(Next.js) --- A(v13)
Root(Next.js) --- B(v14)
Root(Next.js) --- C(v15)
Root(Next.js) --- D(v16)
A(v13) --- A1(...)
C(v15) --- C1(Pages)
C(v15) --- C2(App)
D(v16) --- D1(...)
C1(Pages) --- C1A(API References)
C1(Pages) --- C1B(Building)
C1(Pages) --- C1C(Architecture)
C1(Pages) --- C1D(Community)
C1A(API References) --- C1A1(...)
C1B(Building) --- C1B1(Rendering)
C1B(Building) --- C1B2(Data Fetching)
C1B(Building) --- C1B3(Routing)
C1B3(Routing) --- C1B3A(API Routes)
C1B3(Routing) --- C1B3B(Authenticating)
C1B3(Routing) --- C1B3C(Dynamic Routes)
%% Apply Highlighted Class to specific paths (nodes)
class Root,C,C1,C1B,C1B3,C1B3A highlighted;
%% Style the links for highlighted paths
linkStyle 2 stroke:#e6a00c,stroke-width:2px;
linkStyle 5 stroke:#e6a00c,stroke-width:2px;
linkStyle 9 stroke:#e6a00c,stroke-width:2px;
linkStyle 15 stroke:#e6a00c,stroke-width:2px;
linkStyle 16 stroke:#e6a00c,stroke-width:2px;
The top level pages give a high-level overview of each version of the web framework. By reading a bit we learn that there's a distinction between the Pages Router and the App Router. As we drill down further into the practical details of building a website we find more information on routing and API route handlers in particular.
Instead of disclosing every single page's summary upfront, what if we only disclosed the top-level summaries? For each task, as before, the LLM can "search" over these summaries in-context. Once it decides it wants more information about v15, for example, we disclose more context from that page, as well as summaries of its sub-pages like Building and Architecture. The LLM can pick the most relevant sub-page and so on, until it finds what it needs for the task. After just a few tool calls it has navigated the tree down to exactly the information it needs.
Structuring your resources hierarchically and only disclosing the top level upfront can reduce the number of tokens by another couple of orders of magnitude. And the flexibility of the LLM means it doesn't have to get the search right on the first go. It can make multiple traversals of the tree, reading fragments into context each step of the way, gradually building up a fuller picture of the overall domain.
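One way to implement this, sketched with a hypothetical in-memory tree (in practice the nodes might live in a database or on disk):

```ts
// Progressive disclosure over a resource tree. Each tool call returns one
// node's content plus only the summaries of its children, so the agent
// drills down level by level instead of seeing the whole tree upfront.
type ResourceNode = {
  id: string;            // e.g. 'nextjs/v15/pages/building/routing/api-routes'
  title: string;
  summary: string;       // one-sentence compression, always cheap to disclose
  content?: string;      // full text, only disclosed when this node is opened
  children: ResourceNode[];
};

function findNode(root: ResourceNode, id: string): ResourceNode | undefined {
  if (root.id === id) return root;
  for (const child of root.children) {
    const hit = findNode(child, id);
    if (hit) return hit;
  }
  return undefined;
}

// The openResource tool: the only thing pre-loaded into the system
// instruction is the root node's summary and its children's summaries.
function openResource(root: ResourceNode, id: string): string {
  const node = findNode(root, id);
  if (!node) return `No resource with id ${id}`;
  const childIndex = node.children
    .map((c) => `- ${c.id}: ${c.title}. ${c.summary}`)
    .join('\n');
  return [
    `# ${node.title}`,
    node.content ?? node.summary,
    node.children.length ? `\nSub-pages:\n${childIndex}` : '',
  ].join('\n');
}
```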
How does it scale?
Better than you might think. An information tree with 5 top-level resources, each of which has 5 child resources and so on, down 5 levels deep, has nearly 4,000 resources in total. If instead there were 10 top-level resources, each containing 10 more, down 6 levels deep? That totals more than a million.
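The arithmetic, for the curious (counting every node below the root):

```ts
// Total resources in a tree with `branching` children per node, `depth` levels deep:
// branching + branching^2 + ... + branching^depth
function treeSize(branching: number, depth: number): number {
  let total = 0;
  for (let level = 1; level <= depth; level++) {
    total += branching ** level;
  }
  return total;
}

console.log(treeSize(5, 5));  // 3905 -> "nearly 4,000"
console.log(treeSize(10, 6)); // 1111110 -> "more than a million"
```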
The number of resources you can expose really depends on how logically the tree is structured and how distinct each branch is. For many practical use cases, using agents to do hierarchical retrieval in-context is very powerful indeed, as well as being token-efficient.
In practice
The three methods for context disclosure above each have their merits. We can pick and choose between them, or combine them, according to the particular use case. Here are some questions whose answers will inform which approach to take:
- How big is the problem space the agent is working in? Is it being tasked with specific problems or open problems?
- Does your agent's system instruction have any dynamic variables, or is it consistent across all tasks?
- How much resource data do you want to make accessible to the agent?
- What deterministic methods do you have for searching through your resources?
- What are your limits for latency and cost?
- Do your resources have names or summaries? Could you generate those?
- Could your resources be structured hierarchically?