When Search Is Not Enough

Local retrieval made BENEDICT.EXE much faster.

But faster search does not automatically mean better answers.

Search can find candidate sources. It can tell you that a piece of writing, a self-knowledge note, or a photo description looks relevant to the user’s query. That is necessary, but it is not always enough.

Some questions need more than one source to answer well.

One source might contain the direct answer. Another might provide background context. Another might be a concrete example or artefact that makes the answer feel grounded.

The problem becomes:

Which small set of sources should travel together into the model prompt?

I think of this as budgeted context packing.

Not “retrieve everything that might be relevant.”

Not “stuff the prompt until the model figures it out.”

But also not “only pass the top search result and hope it has enough context.”

The goal is to include the smallest useful set of sources that makes the answer coherent.

In plain English, the prompt budget is the limited space I can afford to give the model before the answer gets slower or more expensive. A source graph is a small reviewable map of which sources are worth packing together, not a magic memory system.

Figure: pixel-art engineering notebook illustration showing Search, Source Graph, and Packed Context flowing into a terminal answer. Context packing is the difference between passing isolated search hits and passing a small connected evidence set.

A concrete example

Imagine a user asks:

what shaped your thinking about balancing ambition and life?

A direct self-knowledge note can contain the stance: I tend to be ambitious and careful about decisions, but I do not want to lose sight of family, emotional care, and life outside work.

That source is useful.

But a reflective essay about feeling lost in life can provide lived context behind that stance. It shows the actual tension: ambition, freedom, work, travel, meaning, and the fear of waking up one day having lived someone else’s definition of a good life.

Either source alone can answer something.

Together, they answer better.

Here is the toy version of the pattern:

Query shape	Direct source	Context source	Why both help
Ambition and balance	A personal stance note	A reflective life essay	One gives the view, one gives lived context
Food and family	A family food story	A food-preference note	One gives the scene, one gives the pattern
Travel memory	A travel habit note	A photo or trip artefact	One states the pattern, one makes it concrete
Meaningful work	A work-values note	A project experience story	One gives the value, one shows it in practice

This is not about making the model smarter by dumping more text into it.

It is about making the prompt more coherent.

Sources, not claims

My first instinct was to model relationships through claims.

Something like:

source -> extracted claims -> related claims -> grouped concepts

That sounds neat, but the more I thought about it, the more brittle it felt.

Claims are subjective. The moment you decide what the “claim” of a source is, you are already interpreting it. Two people might summarise the same essay differently. A model might extract a claim that is technically plausible but framed in a weird way. If the claim is too broad, too narrow, or just not the way I would describe it, every relationship downstream inherits that brittleness.

So instead of making claims the foundation, I moved the unit of judgement down to source pairs.

Given source A and source B:
1. Are they related?
2. Would including both help answer a user query?

That is still not perfect. But it removes one layer of subjective interpretation.

I do not need to first decide the one “correct” claim of a source. I can compare the two actual sources and ask whether they should be packed together.

Relatedness is not usefulness

This was an important distinction.

Two sources can be related without being useful together in the same answer.

Source A:
A reflection about life balance.
 
Source B:
A family meal story.
 
Related?
Slightly. Both touch personal meaning.
 
Useful to pack together?
Usually no. It probably distracts from the answer.

Now compare that with:

Source A:
A story about a family food tradition.
 
Source B:
A self-knowledge note about food as memory.
 
Related?
Strongly.
 
Useful to pack together?
Yes, especially for questions about food, family, or traditions.

That is why the graph separates relatedness from co-pack usefulness.

Relatedness asks:

Are these two sources meaningfully connected?

Co-pack usefulness asks:

If one source is retrieved, does including the other improve the answer enough to justify the prompt budget?

Those are not the same question.

And because latency and cost matter, the second question is the one that really controls runtime behaviour.
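A minimal sketch of that separation, assuming each judged pair is stored with two independent ratings. The names (`PairJudgement`, `should_copack`), the four-level scale, and the source ids are illustrative assumptions, not taken from BENEDICT.EXE.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale shared by both ratings.
LEVELS = {"none": 0, "low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class PairJudgement:
    source_a: str
    source_b: str
    relatedness: str        # are the two sources meaningfully connected?
    copack_usefulness: str  # is the pair worth the prompt budget?


def should_copack(judgement: PairJudgement, min_usefulness: str = "high") -> bool:
    """Runtime decision: looks at co-pack usefulness only, never relatedness."""
    return LEVELS[judgement.copack_usefulness] >= LEVELS[min_usefulness]


# The two examples above, plus a pair that is related but not worth packing.
balance_and_meal = PairJudgement("life-balance-reflection", "family-meal-story", "low", "none")
food_pair = PairJudgement("family-food-tradition", "food-as-memory-note", "high", "high")
related_only = PairJudgement("source-x", "source-y", "high", "low")
```

Keeping the two ratings as separate fields means a strongly related pair can still be left out of the prompt, which is the whole point of the distinction.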

Keep the graph reviewable

The graph should not become “AI said these are related, so ship it.”

It needs to be reviewable, small, and bounded.

A relationship record should read more like a judgement note than a magic embedding edge:

source A:
a reflection about ambition and life balance
 
source B:
a self-knowledge note about personal priorities
 
relatedness:
high
 
co-pack usefulness:
high
 
decision:
include together when one appears and prompt budget allows
 
reason:
one gives the stance, the other gives lived context
 
review status:
accepted

That makes it auditable. If an edge exists, I can inspect why. If it is wrong, I can remove it. If a new batch of knowledge is added, an LLM can propose relationships, but the graph should still be validated and reviewed.

This was the part I cared about: relationships should be methodical enough that future knowledge can be added without the whole thing becoming vibes.
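One way to keep that discipline in code is sketched below, assuming relationship records are plain data and that only validated, accepted records reach the runtime graph. The field names mirror the record above; the `Edge` class, the status values, and the validation rules are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

ALLOWED_LEVELS = {"low", "medium", "high"}
ALLOWED_STATUS = {"proposed", "accepted", "rejected"}


@dataclass(frozen=True)
class Edge:
    source_a: str
    source_b: str
    relatedness: str
    copack_usefulness: str
    reason: str
    review_status: str = "proposed"  # LLM proposals start unreviewed


def validate(edge: Edge, known_sources: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = []
    if edge.source_a == edge.source_b:
        problems.append("self-edge")
    for source_id in (edge.source_a, edge.source_b):
        if source_id not in known_sources:
            problems.append(f"unknown source: {source_id}")
    if edge.relatedness not in ALLOWED_LEVELS:
        problems.append("bad relatedness level")
    if edge.copack_usefulness not in ALLOWED_LEVELS:
        problems.append("bad co-pack usefulness level")
    if edge.review_status not in ALLOWED_STATUS:
        problems.append("bad review status")
    if not edge.reason.strip():
        problems.append("missing reason")
    return problems


def runtime_neighbours(edges: list[Edge], known_sources: set[str]) -> dict[str, list[str]]:
    """Build the adjacency map from valid, accepted edges only."""
    graph: dict[str, list[str]] = {}
    for edge in edges:
        if edge.review_status != "accepted" or validate(edge, known_sources):
            continue
        graph.setdefault(edge.source_a, []).append(edge.source_b)
        graph.setdefault(edge.source_b, []).append(edge.source_a)
    return graph
```

Because every edge carries a reason and a review status, removing a bad relationship is a one-record change rather than a retraining or re-embedding job.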

The graph has to stay cheap

Prompt budget still matters.

Local retrieval being cheap only means the search step is cheap. It does not mean the model can read unlimited context for free.

Every extra evidence chunk increases prompt size. Prompt size affects prefill. Prefill affects time to first token (TTFT). TTFT affects whether the chat feels alive or dead.

So the graph has to stay disciplined.

The rule of thumb is:

Strong direct source
  -> maybe add one or two high-value neighbours
  -> never expand into a giant context blob
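The rule of thumb can be sketched as a small packing function. This is an illustration under stated assumptions: hits carry a retrieval score and an estimated token cost, direct hits are packed before any neighbour, and the `strong_score` threshold and the cap of two neighbours are made-up defaults, not measured values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source_id: str
    score: float  # retrieval score, higher is better
    tokens: int   # estimated prompt cost of this source's snippet


def pack(
    hits: list[Hit],
    neighbours: dict[str, list[str]],
    token_cost: dict[str, int],
    budget_tokens: int,
    strong_score: float = 0.75,
    max_neighbours: int = 2,
) -> list[str]:
    """Pack direct hits first, then at most a few neighbours of strong hits."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    packed: list[str] = []
    used = 0

    # Direct evidence always gets first claim on the budget.
    for hit in ranked:
        if hit.source_id in packed or used + hit.tokens > budget_tokens:
            continue
        packed.append(hit.source_id)
        used += hit.tokens

    # Neighbours are added only for strong hits that were actually packed.
    added = 0
    for hit in ranked:
        if hit.score < strong_score or hit.source_id not in packed:
            continue
        for neighbour in neighbours.get(hit.source_id, []):
            if added >= max_neighbours:
                break
            cost = token_cost.get(neighbour)
            if neighbour in packed or cost is None or used + cost > budget_tokens:
                continue
            packed.append(neighbour)
            used += cost
            added += 1
    return packed
```

The cap on neighbours is global rather than per hit, so the packed set cannot grow into a large blob even when several strong hits each have many reviewed neighbours.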

The graph is useful only if it stays cheap.

That is also why I compare search-only behaviour against graph-assisted packing instead of assuming the graph should always help. If the graph adds context but makes answers slower, noisier, or less precise, that is not an improvement.
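A comparison of that kind can be sketched as follows, assuming each test query has a hand-labelled set of expected sources. The fixtures are made up purely to show the shape of the check; they are not results from this project.

```python
from __future__ import annotations

from typing import Callable

Packer = Callable[[str], list[str]]  # query -> packed source ids


def recall(packed: list[str], expected: set[str]) -> float:
    """Share of the expected sources that made it into the prompt."""
    return len(set(packed) & expected) / len(expected) if expected else 1.0


def precision(packed: list[str], expected: set[str]) -> float:
    """Share of the packed sources that were actually expected."""
    unique = set(packed)
    return len(unique & expected) / len(unique) if unique else 0.0


def compare(cases: list[tuple[str, set[str]]], search_only: Packer, graph_assisted: Packer) -> list[dict]:
    """Score both packers on the same labelled queries, one row per query."""
    rows = []
    for query, expected in cases:
        base, graph = search_only(query), graph_assisted(query)
        rows.append({
            "query": query,
            "recall_search": recall(base, expected),
            "recall_graph": recall(graph, expected),
            "precision_search": precision(base, expected),
            "precision_graph": precision(graph, expected),
            "extra_sources": len(set(graph)) - len(set(base)),
        })
    return rows


# Made-up fixtures: one query with a good direct hit, one where search misses.
CASES = [
    ("ambition and balance", {"stance-note", "life-essay"}),
    ("what matters at work", {"work-values-note"}),
]
SEARCH = {
    "ambition and balance": ["stance-note"],
    "what matters at work": ["meal-story", "trip-photo"],
}
GRAPH = {
    "ambition and balance": ["stance-note", "life-essay"],
    "what matters at work": ["meal-story", "trip-photo", "food-note"],
}

rows = compare(CASES, lambda q: SEARCH[q], lambda q: GRAPH[q])
```

Tracking `extra_sources` next to recall and precision keeps the cost side visible: a graph that adds sources without moving either score is just a bigger prompt.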

In the tests I ran, graph-assisted packing helped recall a little when a strong source was already found and a nearby source could make the answer more coherent. It did not solve broad-query precision by itself.

That distinction matters.

This is not a rescue system for weak retrieval. Graph-added evidence should support good direct hits, not hide a bad search layer.

Search only versus context packing

The difference is small but important.

Search only:
query local corpus
take top local hits
pack top snippets
answer from whatever won search
 
Context packing:
query local corpus
map hits to source-level items
add a few reviewed high-value neighbours
pack concise connected evidence
answer with a more coherent context set
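The "map hits to source-level items" step is the one that differs most from plain search, so here is a minimal sketch of it. It assumes search returns chunk-level hits as `(chunk_id, score)` pairs and that a lookup table maps each chunk to its parent source; both the shapes and the ids are hypothetical.

```python
from __future__ import annotations


def to_source_level(
    chunk_hits: list[tuple[str, float]],
    chunk_to_source: dict[str, str],
) -> list[tuple[str, float]]:
    """Collapse chunk hits into one entry per source, keeping the best score."""
    best: dict[str, float] = {}
    for chunk_id, score in chunk_hits:
        source_id = chunk_to_source.get(chunk_id)
        if source_id is None:
            continue  # chunk with no known parent source: drop it
        if score > best.get(source_id, float("-inf")):
            best[source_id] = score
    # Strongest source first, so later steps can reason about "strong direct hits".
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


chunk_hits = [("essay#2", 0.6), ("note#0", 0.8), ("essay#5", 0.7), ("orphan#1", 0.9)]
chunk_to_source = {"essay#2": "life-essay", "essay#5": "life-essay", "note#0": "stance-note"}
sources = to_source_level(chunk_hits, chunk_to_source)
```

Working at the source level is what lets reviewed neighbours be looked up at all, since the relationships are judged between whole sources rather than between chunks.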

Search asks: “what looks relevant?”

Context packing asks: “what evidence set would help the answer make sense?”

That distinction matters for a digital-me style product.

If someone asks a factual question, direct evidence may be enough. But if someone asks how I think about work, why certain things matter to me, or what kind of memories I value, the best answer might need a stance, an example, and a personal artefact.

The model can synthesise those nicely if the harness gives it the right pieces.

One sentence is basically the whole project:

For the portfolio-chat tasks I tested, the model was usually capable enough. The hard part was building the harness around it.

The graph is part of that harness.

It does not make the model more intelligent.

It makes the context more intentional.

And as I found out later, many of the hardest failures in this project were not model intelligence problems at all. They were harness problems: state, streaming, automated checks, fallbacks, and all the boring software engineering around the model.

Benedict.
