Benedict.

Correct Tool, Wrong Scale

At one point, BENEDICT.EXE used Qdrant for retrieval.

That sounds reasonable. If you are building a RAG system, a vector database is one of the default tools people reach for. Embed the documents, embed the query, compare vectors, retrieve the most similar chunks, pass them to the model.

It is a good pattern.

It was also not the right pattern for my scale.

Qdrant did nothing wrong. I was the one using it like I had a much bigger problem.

I did not have millions of documents.

I had a small curated portfolio corpus: roughly a hundred indexed pieces, ranging from project notes and writing to photo descriptions and self-knowledge cards.

That is tiny compared to the scale where vector infrastructure really starts to earn its keep.

The old retrieval shape

The vector search flow looked roughly like this:

user query
  -> embed query
  -> call remote vector database
  -> retrieve similar chunks
  -> rank / filter
  -> pack evidence into prompt
  -> model answers

This makes sense when your corpus is large enough that direct search is expensive, or when fuzzy semantic matching is much more important than exact wording.

But for my use case, every remote step mattered.

Embedding the query took time. Calling a remote service took time. Ranking and filtering took time. More importantly, the system became harder to reason about. When a result was bad, debugging meant looking at embeddings, scores, chunks, and remote index state.

That might be acceptable for a large knowledge base.

It felt silly for a personal portfolio with a small, curated set of information.

[Figure: pixel-art engineering notebook illustration comparing a large Vector DB to a small Local Corpus and the Right Scale choice.]
The mistake was not using a vector database. The mistake was paying remote-search complexity for a corpus small enough to inspect locally.

A small corpus changes the rules

The important shift was not only moving data from remote infrastructure to local files.

The important shift was admitting what the data actually was.

This was not a giant public internet index. It was a curated set of portfolio material: experiences, projects, writing, photos, and newer self-knowledge notes about my preferences, views, memories, and ways of thinking.

The self-knowledge layer is the interesting part. It is not just raw diary text dumped into a prompt. The process is closer to an interview that gets distilled into usable retrieval notes.

Here is a simplified example:

prompt:
What food do you keep coming back to in Singapore, and why?
 
raw answer:
I talk about hawker food, specific stalls, family meals,
and why those places feel tied to memory.
 
owner note:
Food preferences are not just about taste. They are tied to
Singapore, family, routine, and concrete food memories.
 
retrieval card:
When someone asks about food in Singapore, surface the specific
preferences and the personal reason behind them.

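One way to picture the distilled card as data is a small record that keeps all four stages together. This is only a minimal sketch, not the actual BENEDICT.EXE schema: the class name `SelfKnowledgeCard`, its field names, and the `searchable_text` helper are assumptions made up for illustration, as is the choice to leave the raw answer out of the indexed text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfKnowledgeCard:
    """Hypothetical record for one distilled interview answer."""

    prompt: str          # the interview question
    raw_answer: str      # what was said, before distillation
    owner_note: str      # the curator's interpretation of the answer
    retrieval_card: str  # the guidance surfaced at retrieval time

    def searchable_text(self) -> str:
        # Assumption: only the curated fields are indexed, not the raw answer.
        return " ".join([self.prompt, self.owner_note, self.retrieval_card])


card = SelfKnowledgeCard(
    prompt="What food do you keep coming back to in Singapore, and why?",
    raw_answer=(
        "I talk about hawker food, specific stalls, family meals, "
        "and why those places feel tied to memory."
    ),
    owner_note=(
        "Food preferences are not just about taste. They are tied to "
        "Singapore, family, routine, and concrete food memories."
    ),
    retrieval_card=(
        "When someone asks about food in Singapore, surface the specific "
        "preferences and the personal reason behind them."
    ),
)

print(card.searchable_text())
```

Keeping the raw answer next to the distilled fields means a bad retrieval result can be traced back to what was originally said.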
That matters because the local corpus is not searching random strings. It is searching a curated representation of the portfolio and the “digital me” substrate.

At this size, direct local search becomes attractive. The corpus is small enough to warm in memory, inspect by hand, and search broadly without paying a network round trip every time a user says something.
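"Warm in memory" can be as simple as building an inverted index once at startup and reusing it for every query. The sketch below assumes a toy three-document corpus. The document ids, texts, and function names are invented for illustration and are not taken from the real system.

```python
import re
from collections import defaultdict

# Toy stand-in corpus; ids and texts are invented for illustration.
CORPUS = {
    "food-memory": "Singapore hawker food and the laksa I keep coming back to",
    "travel-note": "Notes from a trip, trains, and street photography",
    "camera-gear": "Camera gear I use for street photography",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


# Built once, then reused for every query with no network round trip.
INDEX = build_index(CORPUS)


def candidates(query: str) -> set[str]:
    """Return ids of documents sharing at least one token with the query."""
    hits: set[str] = set()
    for token in tokenize(query):
        hits |= INDEX.get(token, set())
    return hits


print(sorted(candidates("street photography gear")))
```

At around a hundred documents the whole index is a small dictionary, so it can also be printed and read by hand when a result looks wrong.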

How local retrieval works

When people hear “local retrieval,” it can sound like a toy version of proper RAG.

That is not how I see it.

For a small curated corpus, local retrieval can be the more serious option because it matches the shape of the problem better.

The local retriever does not need to call an embedding model on the hot path. It builds warm lexical indexes over the local corpus and scores documents using several cheap signals.

[Figure: pixel-art educational diagram showing a query split into words, phrases, and n-grams before scoring food memory, travel note, and camera gear documents.]
A toy version of the local retriever: split the query into simple signals, then score nearby documents directly.

The idea is simple:

| Signal | What it catches | Why it helps |
| --- | --- | --- |
| BM25-style token relevance | Important query words appearing in a document | Strong direct text match |
| Alias matching | Casual wording like “u”, “ur”, “fav”, or “sg” | Handles terminal-style language |
| Phrase matching | Nearby words such as “food singapore” | Rewards exact local phrasing |
| Character n-grams | Small overlapping text slices | Gives typo and shorthand tolerance |
| Type hints | Photo, work, writing, or self-knowledge cues | Nudges clearly scoped questions toward the source |

In plain English, BM25 rewards documents that contain important query words without letting repeated words dominate. Aliases catch casual phrasing. N-grams are tiny overlapping character slices that give the search a bit of typo and shorthand tolerance.
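For readers who want to see the BM25 part concretely, here is a minimal sketch of a standard Okapi BM25 scorer. It is a textbook formulation, not the retriever's actual code: the `k1` and `b` defaults are common conventions, the IDF uses the non-negative variant popularised by Lucene, and the toy documents are invented.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(
    query: str, docs: dict[str, str], k1: float = 1.2, b: float = 0.75
) -> dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return {}
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Guard against a corpus made only of empty documents.
    avg_len = max(sum(len(t) for t in tokenized.values()) / n_docs, 1.0)

    # Document frequency: in how many documents each term appears.
    df: Counter[str] = Counter()
    for toks in tokenized.values():
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores: dict[str, float] = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # tf sits in numerator and denominator, so repeats saturate.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


DOCS = {
    "food-memory": "hawker food in singapore laksa and chicken rice food memories",
    "travel-note": "a travel note about trains in japan",
    "camera-gear": "camera gear for street photography in singapore",
}
scores = bm25_scores("food singapore", DOCS)
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{doc_id}: {score:.3f}")
```

Because term frequency appears on both sides of the fraction, a word repeated ten times scores more than a word used once, but nowhere near ten times more.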

The scoring weights were not mathematically sacred. They were empirical.

I started with the obvious priority: direct word relevance should matter most. Aliases should matter because people type casually in a terminal. Phrase matches should help when the wording is very local. N-grams should help fuzzy matching, but they should not dominate, because almost everything shares some tiny character overlap.
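The three cheaper signals are easy to sketch in a few lines each. The code below is an illustration under assumptions, not the real implementation: the short forms in `ALIASES` come from the table above, but the expansions they map to, the three-token phrase window, and the trigram Jaccard overlap are all choices made for this example.

```python
import re
from collections import defaultdict

# Hypothetical alias table. The expansions are assumptions.
ALIASES = {"u": "you", "ur": "your", "fav": "favourite", "sg": "singapore"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_aliases(tokens: list[str]) -> list[str]:
    """Rewrite casual tokens to the form assumed to appear in the corpus."""
    return [ALIASES.get(tok, tok) for tok in tokens]


def phrase_hits(query_tokens: list[str], doc_tokens: list[str], window: int = 3) -> int:
    """Count adjacent query-word pairs found within `window` tokens in the doc."""
    positions: dict[str, list[int]] = defaultdict(list)
    for i, tok in enumerate(doc_tokens):
        positions[tok].append(i)
    hits = 0
    for a, b in zip(query_tokens, query_tokens[1:]):
        if any(
            i != j and abs(i - j) <= window
            for i in positions.get(a, [])
            for j in positions.get(b, [])
        ):
            hits += 1
    return hits


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of overlapping character slices of length n."""
    compact = " ".join(tokenize(text))
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def ngram_overlap(query: str, doc: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams, between 0.0 and 1.0."""
    q, d = char_ngrams(query, n), char_ngrams(doc, n)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


query = expand_aliases(tokenize("fav food sg"))
doc = tokenize("Singapore hawker food is my favourite")
print(query)
print(phrase_hits(query, doc))
print(ngram_overlap("singapor", "singapore"))
```

The n-gram overlap is what lets a typo like "singapor" still land near "singapore", which is also why it needs a low weight: short slices overlap with many unrelated documents.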

The rough shape became:

score =
  0.46 * token relevance
  + 0.22 * alias match
  + 0.20 * phrase match
  + 0.10 * n-gram match
  + small domain boosts

Those values came from trial, failure cases, and test snapshots. The goal was not to discover a universal retrieval formula. The goal was to make the local ranking behave sensibly for my corpus: direct matches win, casual phrasing still works, fuzzy signals help at the edges, and debugging stays inspectable.
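The weighted sum above translates almost directly into code. This sketch uses the weights from the post and assumes each signal has already been normalised to the range 0 to 1. The `Signals` class, the `explain` helper, and the example values are hypothetical, and the size of the domain boost is not given in the post, so it is left as a plain additive term.

```python
from dataclasses import dataclass

# Weights taken from the rough shape above.
WEIGHTS = {"token": 0.46, "alias": 0.22, "phrase": 0.20, "ngram": 0.10}


@dataclass
class Signals:
    """Per-document signal values, each assumed to be normalised to [0, 1]."""

    token: float
    alias: float
    phrase: float
    ngram: float
    domain_boost: float = 0.0  # "small domain boosts"; magnitude is an assumption


def explain(signals: Signals) -> dict[str, float]:
    """Return each signal's weighted contribution, for debugging a ranking."""
    parts = {name: WEIGHTS[name] * getattr(signals, name) for name in WEIGHTS}
    parts["domain_boost"] = signals.domain_boost
    return parts


def combined_score(signals: Signals) -> float:
    return sum(explain(signals).values())


direct_match = Signals(token=0.9, alias=0.0, phrase=0.5, ngram=0.4)
fuzzy_only = Signals(token=0.0, alias=0.0, phrase=0.0, ngram=1.0)

print(explain(direct_match))
print(combined_score(direct_match), combined_score(fuzzy_only))
```

Even a perfect n-gram overlap contributes at most 0.10, so a fuzzy-only hit cannot outrank a document with a solid direct match. The per-signal breakdown is what keeps the ranking inspectable.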

The food question that exposed the blind spot

One of the clearest failures was a simple question:

what food do u love in singapore

The system had information about food I liked in Singapore.

So the answer should have used that.

Instead, the old architecture could treat the question as smalltalk or route it into the wrong knowledge lane. Once that happened, the relevant self-knowledge never reached the model. The model then did what models do when they do not have the right evidence: it gave a generic chatbot answer.

That was exactly what I did not want.

Old route-first shape:
classify as smalltalk
skip the food/self-knowledge notes
model answers without the evidence
 
New broad local shape:
search the local corpus broadly
find food-related self-knowledge
pack a few concise evidence notes
model answers once with the right context

The model did not fail to use the evidence.

The evidence never reached it.

That is a system bug.

After broad local retrieval, the packed evidence looked like the kind of thing I actually wanted the model to see:

query:
what food do u love in singapore
 
packed evidence:
- a note about Singapore hawker food as a repeated preference
- a note about a laksa memory and why it stands out
- a note about chicken rice and why it feels personal

Once those sources reached the prompt, the model no longer had to improvise a generic answer. It had the actual food evidence and the reasons those memories mattered.

Broad retrieval beats brittle routing

The early architecture had this instinct:

First decide what kind of question this is. Then retrieve from the matching section.

That seems reasonable. If someone asks about work, search work. If someone asks about photography, search photos. If someone says hi, do not retrieve anything.

The problem is that real questions do not respect my buckets.

A question about Singapore food might be smalltalk, self-knowledge, travel, family, or blog memory. A question about robotics might involve education, work experience, project taste, worldview, and career direction. A question like “what kind of work do you enjoy?” might need both project evidence and personal values.

If the system chooses one bin too early, it can hide useful evidence from the model.

Old:
query -> choose one door -> search inside that room
 
New:
query -> cheap broad local search -> pass useful evidence forward
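The difference between the two shapes shows up clearly on a toy corpus. In the sketch below, `classify` is a deliberately naive keyword router standing in for the old classifier, whose real logic is not described here. The corpus entries, lane names, and token-overlap scoring are likewise invented for illustration.

```python
import re

# Toy corpus; ids, lanes, and texts are invented for illustration.
CORPUS = [
    {"id": "hawker-food", "lane": "self_knowledge",
     "text": "I love Singapore hawker food, especially laksa and chicken rice"},
    {"id": "robotics-project", "lane": "work",
     "text": "Robotics project notes and work experience"},
    {"id": "street-photo", "lane": "photos",
     "text": "Street photo taken at a Singapore hawker centre"},
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, doc: dict) -> int:
    """Number of distinct tokens shared by the query and the document."""
    return len(tokens(query) & tokens(doc["text"]))


def classify(query: str) -> str:
    """Naive keyword router (an assumption, not the original classifier)."""
    q = tokens(query)
    if q & {"work", "project", "job"}:
        return "work"
    if q & {"photo", "photos", "camera"}:
        return "photos"
    return "smalltalk"


def route_first(query: str) -> list[str]:
    """Old shape: choose one lane, then search only inside it."""
    lane = classify(query)
    if lane == "smalltalk":
        return []  # nothing is retrieved at all
    return [d["id"] for d in CORPUS if d["lane"] == lane and overlap(query, d) > 0]


def broad(query: str) -> list[str]:
    """New shape: score every document, keep anything that matches."""
    scored = [(overlap(query, d), d["id"]) for d in CORPUS]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


question = "what food do u love in singapore"
print("route-first:", route_first(question))
print("broad:", broad(question))
```

The router never sees a keyword it recognises, so it files the food question under smalltalk and returns nothing, while the broad search finds the food note without needing to know which lane it lives in.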

Broad retrieval was only possible because local retrieval became cheap.

With remote vector search, retrieving broadly from multiple sections can add latency and complexity. With local retrieval over a small corpus, searching across the whole shelf is cheap enough to do by default.

That changed the architecture.

Instead of spending effort deciding whether to retrieve, the system can run a cheap evidence sensor first. If nothing useful is found, pass little or no evidence. If useful self-knowledge appears, include it. If the query touches multiple areas, let the evidence pack reflect that.
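The "evidence sensor" idea can be sketched as a threshold plus a cap: keep whatever scores above a floor, up to a few notes, and pass nothing when nothing clears it. The function names, the `min_score` and `max_notes` defaults, the prompt layout, and the example scores are all assumptions for illustration. The note texts reuse the packed evidence shown earlier.

```python
def pack_evidence(
    scored_notes: list[tuple[str, float]],
    min_score: float = 0.25,
    max_notes: int = 3,
) -> list[str]:
    """Keep the highest-scoring notes above the floor, capped at max_notes."""
    kept = [(note, score) for note, score in scored_notes if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [note for note, _ in kept[:max_notes]]


def build_prompt(query: str, evidence: list[str]) -> str:
    """Attach evidence only when the sensor found something useful."""
    if not evidence:
        return f"User: {query}"
    bullet_lines = "\n".join(f"- {note}" for note in evidence)
    return f"Evidence:\n{bullet_lines}\n\nUser: {query}"


# Scores are invented; only the note texts come from the example above.
SCORED = [
    ("a note about Singapore hawker food as a repeated preference", 0.71),
    ("a note about a laksa memory and why it stands out", 0.64),
    ("a note about chicken rice and why it feels personal", 0.58),
    ("a note about camera gear", 0.08),
]

print(build_prompt("what food do u love in singapore", pack_evidence(SCORED)))
print(build_prompt("hi", pack_evidence([("a note about camera gear", 0.05)])))
```

A greeting that matches nothing simply produces a prompt with no evidence section, which replaces the separate decision about whether to retrieve.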

The real lesson

Local retrieval is not magic.

It does not make the model sound like me.

It does not automatically know which evidence should be used as direct support versus background context.

It does not fix every ambiguous query.

It also does not mean I will never use a vector database again. If the corpus grows much larger, if lexical matching starts missing good evidence, or if the queries become more semantically indirect, I would reconsider vector or hybrid retrieval.

The point is not “vector DB bad.”

The point is “correct tool, wrong scale.”

| Question | Vector database | Local corpus |
| --- | --- | --- |
| Best fit | Large, fuzzy, high-volume corpus | Small curated corpus |
| Query cost | Query embedding plus remote call | Local lexical scoring |
| Latency | Network/API dependent | Millisecond-level hot path |
| Debuggability | Harder to inspect why a hit won | Easier to inspect matched terms |
| My case | More infrastructure than needed | Better scale fit |

The important shift was not just replacing Qdrant with local code.

The important shift was designing for the actual product.

My portfolio chat is not an enterprise knowledge assistant with millions of documents and thousands of concurrent users. It is a public conversational interface over a small, curated set of information about me.

That changes the right architecture.

It means latency matters more than theoretical scale. It means inspectability matters because I am the person curating the knowledge. It means broad retrieval can be better than fragile routing. It means a simple local index can be more mature than a heavyweight system copied from a different problem.

Build for the shape of your problem, not the shape of the architecture diagram.

Local retrieval made the system faster.

But it also unlocked something more important: it made broad retrieval cheap enough that the model could see more of the right evidence without another classifier deciding too early.

Search got faster.

Answers got more personal.

And the architecture got smaller.

But search alone was still not the full story. Some answers need more than one source. A piece of writing might provide lived context. A self-knowledge note might provide the direct stance. A photo might be the artefact that makes the answer concrete.

That is where the next layer came in: not better search, but better context packing.

Next in Building BENEDICT.EXE: When Search Is Not Enough. Fast retrieval found candidate sources. The next problem was deciding which sources should travel together.