The Hardest Chat Bugs Were Software Bugs

The funniest part of building BENEDICT.EXE is that many of the painful failures did not come from the model being incapable.

They looked like model failures because the user-facing thing was a model response. The chat said something weird, missed context, stayed in the wrong mode, or showed a strange terminal status.

So the natural instinct was to blame the model.

But over and over again, the actual issue was normal software engineering: stale state, loose stream contracts, evidence not reaching the model, workflow gates making decisions too early, automated checks trusting the wrong signal, or fallback logic hiding the real failure.

The model was not always innocent, of course. Models can still be generic, verbose, overly polished, or just wrong.

But many of the painful bugs came from the harness around the model.

[Figure: pixel-art engineering notebook illustration showing a chat failure flow from Symptom to Harness to Regression Test.]

Caption: Most failures looked like model weirdness from the outside, but the fixes lived in the harness: state, streaming contracts, retrieval, logging, and automated checks.

The terminal looked broken even when the answer was fine

One bug captured this perfectly.

The model would answer, and the answer itself could be fine. Then the terminal would show something like:

STATUS: That got interrupted mid-reply. Send your message again and I'll rerun it.

That message made the whole interaction feel broken even when the model had already produced a valid answer.

The user does not care whether the server technically generated text. The terminal experience is the product. If the UI says the reply was interrupted, the product feels wrong.

The fix was not “make the model smarter.”

It was to tighten the stream protocol. The server and client had to agree on what done, error, trailing chunks, and terminal statuses meant. If that contract was loose, the UI could show nonsense after a valid answer.
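As a minimal sketch of what a tighter contract could look like on the client side, assume the stream carries three event types (`chunk`, `done`, `error`). The event names, the `ReplyView` class, and its fields are hypothetical, not the actual BENEDICT.EXE protocol. The idea is only that the first terminal event wins, and anything arriving after it is ignored.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

INTERRUPTED = "That got interrupted mid-reply. Send your message again and I'll rerun it."


class StreamState(Enum):
    STREAMING = "streaming"
    DONE = "done"
    ERROR = "error"


@dataclass
class ReplyView:
    """Client-side view of one streamed reply (hypothetical names)."""

    state: StreamState = StreamState.STREAMING
    text: str = ""
    status: Optional[str] = None

    def apply(self, event: dict) -> None:
        # Once a terminal event has arrived, trailing events change nothing.
        if self.state is not StreamState.STREAMING:
            return
        kind = event.get("type")
        if kind == "chunk":
            self.text += event.get("text", "")
        elif kind == "done":
            self.state = StreamState.DONE
        elif kind == "error":
            self.state = StreamState.ERROR
            self.status = INTERRUPTED

    def close(self) -> None:
        # The connection ended without a terminal event: a real interruption.
        if self.state is StreamState.STREAMING:
            self.state = StreamState.ERROR
            self.status = INTERRUPTED
```

With this rule, a stray `error` or late chunk after `done` cannot put an interruption status on top of a valid answer, while a connection that closes with no terminal event still gets reported as interrupted.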

That is such a normal software bug.

It just happened to sit around an AI response.

The harness is the product

When I say “harness,” I mean all the software around the model.

The model receives a prompt and returns tokens. The product decides what context the model gets, how much state carries across turns, what counts as a workflow continuation, what gets streamed to the user, what gets logged, and what gets checked afterwards.

For BENEDICT.EXE, the user experience depends on whether this harness can answer quickly, use my actual context, preserve state, and not embarrass itself in the terminal.

That includes routing, retrieval, evidence packing, prompt construction, workflow state, streaming, logging, and automated checks.

The model is important, but it is not the whole product.

The harness is what turns model capability into a usable interface.

Bad answers are often missing context

One clean example was the Singapore food question:

what food do u love in singapore

The system had self-knowledge about food in Singapore.

Expected behaviour:

retrieve food-related self-knowledge
answer personally

Buggy behaviour:

classify as smalltalk
skip the relevant retrieval lane
model sees no food evidence
answer generically

If you only look at the final answer, it feels like the model failed to be personal.

But the model never had the personal evidence.

That is not a model capability problem. That is a routing and retrieval problem.

This pattern showed up repeatedly. The answer quality was bad not because the model could not synthesise, but because the surrounding system made the wrong decision before generation.
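One way to picture the fix is to stop letting the intent label decide whether retrieval runs. The sketch below is a toy, not the real router: the corpus entries are placeholders, and `classify_intent`, `match_topics`, and `plan_turn` are hypothetical names with deliberately crude logic.

```python
from dataclasses import dataclass

# Toy self-knowledge corpus; the notes are placeholders, not real data.
SELF_KNOWLEDGE = {
    "food": ["<note about food in Singapore>"],
    "work": ["<note about current projects>"],
}

# Hypothetical keyword lists standing in for a real topic matcher.
TOPIC_KEYWORDS = {
    "food": {"food", "eat", "dish"},
    "work": {"project", "job", "work"},
}


@dataclass
class Plan:
    intent: str
    evidence: list


def classify_intent(message: str) -> str:
    """Crude stand-in for an intent classifier: short messages are smalltalk."""
    return "smalltalk" if len(message.split()) < 10 else "question"


def match_topics(message: str) -> list:
    """Return every topic whose keywords appear in the message."""
    words = set(message.lower().split())
    return [topic for topic, keys in TOPIC_KEYWORDS.items() if words & keys]


def plan_turn(message: str) -> Plan:
    intent = classify_intent(message)
    # Retrieval depends on topic match, not on the intent label, so casual
    # phrasing cannot skip the self-knowledge lane.
    evidence = [note for topic in match_topics(message) for note in SELF_KNOWLEDGE[topic]]
    return Plan(intent=intent, evidence=evidence)
```

Here `plan_turn("what food do u love in singapore")` is still labelled smalltalk, but the food note reaches the prompt anyway, which is the whole point.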

State bugs had the same shape.

If a user is halfway through scheduling a meeting and then replies with “tomorrow works,” the system should continue the active meeting flow. If the client loses session context, the server still needs to recover the persisted workflow state before deciding where the message should go.

If it fails to do that, the assistant suddenly feels forgetful.

From the user’s perspective, it looks like bad memory.

From the system’s perspective, the state machine made a routing decision before loading the state it needed.

Those are very different diagnoses.

The second one is much more useful.
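The ordering matters more than the code. A rough sketch, assuming a server-side store keyed by user (`WORKFLOW_STORE`, `route`, and the state fields are all hypothetical names): recover the workflow state first, then decide where the message goes.

```python
from typing import Optional

# Hypothetical server-side store of persisted workflow state, keyed by user.
WORKFLOW_STORE = {
    "user-1": {"flow": "schedule_meeting", "step": "awaiting_date"},
}


def route(message: str, user_id: str, client_state: Optional[dict]) -> str:
    """Pick a destination for the message after loading workflow state."""
    # Prefer the state the client sent; if the client lost it, fall back to
    # the persisted copy before making any routing decision.
    state = client_state or WORKFLOW_STORE.get(user_id)
    if state and state.get("flow"):
        return f"continue:{state['flow']}"
    return "new_turn"
```

With the state loaded first, `route("tomorrow works", "user-1", None)` continues the meeting flow even though the client sent no session context.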

Automated checks can lie

I also learned not to blindly trust automated evaluation results.

An automated check is just another piece of software. If the judge logic is wrong, the score can be wrong too.

Here is the general version of a bug I hit:

judge verdict: fail
failed dimensions: none listed

bad harness interpretation: no failed dimensions means pass
correct interpretation: the explicit verdict says fail, so investigate the mismatch

That kind of mistake is subtle but dangerous.

The whole point of an automated evaluation is to tell you whether the system improved. If the evaluation harness accidentally flips a failure into a pass, the benchmark starts inflating your confidence.

That is worse than having no evaluation.

No evaluation tells you that you are blind.

A lying evaluation tells you that you can see.

The lesson is not “do not measure.” The lesson is that automated checks need contracts too. The pass/fail rule should be explicit. Inconclusive cases should be labelled as inconclusive. Latency metrics should not be mixed with answer-quality metrics. Source-retrieval metrics should not be treated as the same thing as conversation quality.
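An explicit pass/fail rule for the verdict mismatch described earlier might look like the sketch below. The `Outcome` values, the `Result` fields, and the `resolve` function are illustrative assumptions, not the real judge schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


@dataclass
class Result:
    outcome: Outcome
    mismatch: bool  # True when the verdict and the dimensions disagree


def resolve(verdict: Optional[str], failed_dimensions: Sequence[str]) -> Result:
    """Turn raw judge output into an outcome without guessing."""
    label = (verdict or "").strip().lower()
    if label == "fail":
        # The explicit verdict wins. An empty dimension list is flagged for
        # investigation instead of being read as a pass.
        return Result(Outcome.FAIL, mismatch=not failed_dimensions)
    if label == "pass":
        if failed_dimensions:
            # A pass that lists failed dimensions contradicts itself.
            return Result(Outcome.INCONCLUSIVE, mismatch=True)
        return Result(Outcome.PASS, mismatch=False)
    # Missing or unrecognised verdicts are never silently counted as passes.
    return Result(Outcome.INCONCLUSIVE, mismatch=False)
```

The design choice is that nothing is inferred from absence: `resolve("fail", [])` stays a failure and is marked as a mismatch, and anything the harness cannot read lands in `INCONCLUSIVE`.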

Otherwise, the measurement layer becomes another source of bugs.

Fallbacks are not free

AI applications have a strong tendency to grow fallbacks.

If the model output is weird, repair it. If retrieval misses, try another backend. If a judge fails, use another judge. If parsing breaks, guess the missing structure.

Sometimes that is the right call.

But too many fallbacks can make a system harder to understand.

They can hide failures, inflate metrics, add latency, and create strange behaviour that only appears on edge paths.

The question I started asking was:

What exact failure does this fallback handle?
Is that failure common enough to justify the complexity?
Does the fallback preserve the same contract?
How will I know when it ran?
Could it silently make the result worse?

If I could not answer those, the fallback was suspect.
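Two of those questions, "what exact failure does this handle?" and "how will I know when it ran?", can be forced into the code. This is a hedged sketch with hypothetical names (`with_fallback`, `fallback_counts`): the fallback must name the one exception type it covers, and every run is counted and logged.

```python
import logging
from typing import Callable, Dict, Type, TypeVar

T = TypeVar("T")

logger = logging.getLogger("fallbacks")
fallback_counts: Dict[str, int] = {}


def with_fallback(
    name: str,
    primary: Callable[[], T],
    fallback: Callable[[], T],
    handles: Type[Exception],
) -> T:
    """Run primary; use fallback only for the one failure type it declares."""
    try:
        return primary()
    except handles as exc:
        # Count and log every run so the fallback can never fire silently.
        fallback_counts[name] = fallback_counts.get(name, 0) + 1
        logger.warning("fallback %r ran after %s: %s", name, type(exc).__name__, exc)
        return fallback()
```

Any exception outside `handles` still propagates, so the fallback cannot quietly absorb a failure it was never designed for, and `fallback_counts` gives a cheap way to check whether a branch runs often enough to justify keeping it.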

That was a recurring theme in this whole build. More branches did not automatically mean more robustness. Sometimes more branches just meant more ways to be wrong before the model even answered.

What I would do differently now

If I restarted this project today, I would begin by writing down the actual shape of the product.

This is a small public portfolio. Conversations are usually short-lived. Traffic is low. First impression matters a lot. The corpus is small and curated. The chat needs grounding, but not enterprise-scale RAG. It needs safety, but not a giant multi-agent compliance pipeline. It needs to feel fast because nobody owes a portfolio chat their patience.

That would have saved me a lot of architecture cosplay.

The mistake was not trying to build something production-quality. The mistake was importing production patterns from a different scale and problem shape.

Production quality for this project meant measurement, clear contracts, explicit state transitions, inspectable retrieval, failure logging, behaviour checks, rate limits, safety boundaries, and a hot path small enough to reason about.

It did not mean adding more agents, more model calls, more validators, or a vector database just because RAG diagrams usually have one.

That, honestly, is how the project grew up.

I started by thinking more engineering complexity would make the system better.

I ended up learning that the better engineering was knowing what to remove.

BENEDICT.EXE is not finished.

It is much faster than the first version. The retrieval architecture fits the corpus better. The answers are more grounded and personal. The system is easier to reason about than the over-engineered version I started with.

But it still does not fully sound like me.

That is the next frontier.

Prompting can only go so far. Retrieval can tell the model what I know, what I like, and what I have written. It cannot fully teach the model how I speak, when I would be casual, what jokes I would make, or how I would phrase something in a real conversation.

The long-term stretch goal is a local or fine-tuned model that feels much closer to talking to me directly.

But before I get there, I am glad I went through this stage.

Because this project taught me that the hard part of a “digital me” is not just feeding a model facts about myself.

It is building the system around the model well enough that the experience feels coherent, fast, grounded, and worth using.

That is the part people underestimate.

The model is only one part.

The harness is the product.

Benedict.