The First Rule of AI Latency Is Measuring the Wait

The first version of BENEDICT.EXE worked.

Technically.

You could type a question. It would think. It would call the model. It would maybe retrieve some information. Eventually, it would reply with something that was usually reasonable.

But the experience felt broken.

Some messages took 10 to 15 seconds.

That is not “a bit slow.” That is enough time for a visitor to assume the site is broken, lose trust, or leave. On a portfolio, the feature itself is part of the signal. If the thing that is supposed to show technical ability feels unusable, it becomes a negative signal.

And there was another annoying thought:

ChatGPT is fast. Mine is much simpler. Why is mine so much slower?

That question forced me to stop treating latency as an implementation detail and start treating it as part of the product.

Silence is expensive

For chat interfaces, silence feels different from normal page loading.

If I click a page and it takes two seconds to load, that is not ideal, but I understand the interaction. I clicked. The browser navigates. Something appears.

In chat, the user sends a message and waits in a conversational rhythm. A long silent pause feels like the system missed you. It feels like the message got lost. It feels dead.

That is why I started caring about time to first token as much as total latency.

User sends message
  -> silent wait
  -> first token appears
  -> rest of answer streams

If the first token appears quickly, the user knows the system is alive. Even if the full answer takes a bit longer, the interaction feels less broken.

If nothing appears for five seconds, the product feels dead even if the final response is technically correct.

The user-facing metric was not just “how long until everything is done.” It was “how long does the user have to stare at nothing?”

Define the wait before optimising it

One of the earlier mistakes was treating “response time” like one number.

It is not one number.

| Term | Plain meaning | Why I cared |
| --- | --- | --- |
| TTFT | Time until the first streamed token at the boundary I am measuring | Best proxy for when the terminal feels alive |
| Prefill | Model-side work before it starts generating | Strongly affected by prompt and evidence size |
| Decode | Model generating the visible answer | Strongly affected by output length |
| p50 | Median case | What normal turns feel like |
| p95 | Tail case | What makes the product feel randomly broken |

In plain English, TTFT is the awkward silence before the terminal shows the first visible token. Prefill is the model reading the prompt before it can speak. Decode is the model writing the answer after it starts.

My TTFT numbers come from server-side generation traces, so they are not a true browser submit-to-visible-text metric. They were still the best local proxy I had for model prefill and first-token wait.

The corrected mental model was simpler:

TTFT = waiting for the model to start
Decode = model generating the visible answer
Total = everything until the answer is complete

Once I separated those, the numbers started making more sense.
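As a rough illustration of that split (a sketch, not the code BENEDICT.EXE actually runs), the three numbers can be taken from any token stream with two timestamps plus a start time. The names `TurnTiming` and `time_stream` are made up for this example, and the injectable `clock` parameter is only there so the function can be checked without real waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TurnTiming:
    ttft_ms: float    # request start -> first streamed token
    decode_ms: float  # first streamed token -> end of stream
    total_ms: float   # request start -> end of stream


def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> TurnTiming:
    """Consume a token stream and split the wait into TTFT, decode, and total."""
    start = clock()
    first = None
    for _ in tokens:
        if first is None:
            first = clock()  # the moment the silence ends
    end = clock()
    if first is None:
        first = end  # empty stream: the whole wait counts as TTFT
    return TurnTiming(
        ttft_ms=(first - start) * 1000,
        decode_ms=(end - first) * 1000,
        total_ms=(end - start) * 1000,
    )
```

Measured this way, a short reply and a long reply can share the same TTFT while differing widely in decode time, which is the distinction a single response-time number hides.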

Short replies were not necessarily slow at decoding. They just had a bigger fixed prefill cost relative to their output length.

That sounds obvious in hindsight, but it matters. If your metric is wrong, your conclusions will also be wrong.

[Figure: pixel-art engineering notebook illustration showing a chat latency flow from Send to First Token to Done.]
The useful latency split was not one giant response-time number. It was the quiet wait before first token, then the visible stream after it.

Vibes are a terrible profiler

I already knew V1 was not optimised. At that stage, I was focused on getting the main functionality out: natural language chat, retrieval, session memory, basic safety, and the terminal experience.

So I expected optimisation work.

But “make it faster” is not a plan.

The useful work started when I had enough logging and automated checks to answer more concrete questions:

How long did rate limiting take?
How long did retrieval take?
How long before the first token?
How long until the full response completed?
Which queries were normal cases, and which ones created tail spikes?
Were we slow because of the model, my code, remote services, or unnecessary orchestration?

Without those measurements, every optimisation would have been vibes.

And vibes are a terrible profiler.
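To make those questions answerable, each hot-path stage needs its own timer. The sketch below is one minimal way to do that in Python. It is an assumption about the general shape, not the project's actual logging code, and `StageTimer` and the stage names are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collects wall-clock durations in milliseconds for named stages of one turn."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised, and add to any
            # earlier time recorded under the same stage name.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self) -> str:
        """Name of the stage that consumed the most time in this turn."""
        return max(self.durations_ms, key=lambda name: self.durations_ms[name])
```

Wrapping each step in `with timer.stage("retrieval"):` and logging `timer.durations_ms` once per turn is enough to see whether the time went to the model, to remote services, or to orchestration.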

The receipts

The numbers below are not a single perfect production time series. They came from local snapshots while the system was changing. I am including them because the direction matters more than pretending the measurements came from a clean lab.

| Snapshot | What changed | p50 total | p95 total | What it showed |
| --- | --- | --- | --- | --- |
| Original baseline | Multi-stage chat brain with remote retrieval | 9,470ms | 14,954ms | The feature felt broken |
| Staged control path | Same general architecture, measured more carefully | 5,354ms | 9,924ms | Better, but still paying a planning tax |
| Single-pass spike | Removed the extra planning call | 2,811ms | 5,976ms | Architecture mattered more than micro-optimisations |
| Cleaned single-pass path | Local retrieval, prompt cleanup, fewer hot-path steps | 1,853ms | 3,692ms | The remaining wait was mostly model-shaped |
| Local retrieval only | Retrieval measured by itself | 1ms | 2ms | Search was no longer the bottleneck |
| No extra reasoning budget | Main answer call without hidden reasoning budget | 1,559ms | 1,930ms | Faster while still passing the covered checks |

The no-extra-reasoning result needs context.

Originally, I thought leaving model thinking on would make answers feel more curated and personal, especially when the system had fetched evidence about me. The worry was that without that extra reasoning budget, the chat would become flat or generic.

In practice, for these portfolio-chat turns, the extra hidden reasoning hurt latency a lot and did not buy enough visible quality. In the 28-case run I kept, 27 cases were conclusive under the quality-check contract while latency dropped into the range I was aiming for.

That does not mean every real user turn will always land there. Cold starts, long evidence, provider variance, photo-heavy answers, and network conditions still matter.

But the shape had changed.

The system was no longer spending multiple seconds doing avoidable work before the model could even answer.
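For readers who want to produce p50 and p95 figures like the ones in the table, here is a small Python sketch using the nearest-rank method. This is one common definition. I am not claiming it is the exact method behind the numbers above, and tools that interpolate between samples will give slightly different values on small runs.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

With only a few dozen samples, p95 is decided by the slowest one or two turns, which is why a single cold start or provider hiccup can move it so much.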

The hot path had to become ruthless

The biggest wins did not come from making every component slightly faster.

They came from asking whether certain components needed to run at all.

[Figure: pixel-art engineering notebook diagram showing rate limit, intent, workflow, retrieval, remote RAG, verification, and logging all feeding into an overloaded chat brain.]
The early hot path had too many things pointing at the brain before the user could see an answer.

Some of these pieces were useful. Some were necessary. Some were only necessary for certain turns. Some were there because I had copied the shape of a more serious production system without fully asking whether my portfolio needed that shape.

I am not saying production quality does not matter. The useful production parts stayed: automated checks, logging, telemetry, rate limits, validation, latency breakdowns, and regression checks.

What did not survive were the heavyweight defaults that did not match the scale of the product.

The project did not need every message to go through a multi-stage brain just to answer a simple portfolio question. It did not need an expensive verifier to re-check curated facts against themselves. It did not need remote retrieval to search a tiny corpus.

The hot path had to become boring.

And boring was good.

At first, latency work looked like a performance problem.

Eventually, it became clear that it was an architecture problem.

[Figure: pixel-art before and after architecture diagram comparing a staged runtime with a simpler local search, context packing, one model call, and streaming path.]
The biggest latency drop came from removing avoidable stages, not from shaving milliseconds off every small function.

The old system could pay for a semantic planning call and then another grounded generation call. The simplified runtime cut out that staged tax. Once the extra stage was gone, the remaining bottlenecks became more honest: model prefill, prompt size, evidence context, and provider behaviour.

That was a better place to be.

Hidden orchestration latency is frustrating because it is self-inflicted. Model latency is still annoying, but at least it is closer to the actual cost of asking the model to do useful work.

Another important shift was streaming earlier. Some earlier answers waited for too much work to finish before showing anything. That made the system safer in a literal sense, but the user could stare at a blank terminal for several seconds.

For a portfolio chat, that was painful.

Bad feel:
wait 6s
full answer appears
 
Better feel:
wait 700ms
answer starts streaming
full answer completes later

Same backend time? Maybe.

Same user experience? Definitely not.
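The difference can be shown with a tiny simulation in Python. Everything here is illustrative: `model_tokens` is a hypothetical stand-in for a model stream with made-up arrival times, chosen to mirror the 700ms and 6s figures above, and no real model or network is involved.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (time in seconds when text becomes visible, text)


def model_tokens() -> Iterator[Event]:
    """Hypothetical model stream: each token with its arrival time."""
    yield 0.7, "Hi"
    yield 1.5, " there"
    yield 6.0, "!"


def buffered(stream: Iterable[Event]) -> List[Event]:
    """Hold everything back, then show the full answer when the last token arrives."""
    items = list(stream)
    if not items:
        return []
    done_at = items[-1][0]
    return [(done_at, "".join(text for _, text in items))]


def streamed(stream: Iterable[Event]) -> List[Event]:
    """Show each token at the moment it arrives."""
    return list(stream)


def first_visible(events: List[Event]) -> float:
    """Seconds the user stares at nothing before any text appears."""
    return events[0][0]
```

Both paths finish at the same time in this toy example. Only the streamed one ends the silence at 0.7 seconds instead of 6.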

I did not choose the 1-2 second target through some perfect UX research process. It just felt conversational. At 10 to 15 seconds, the system felt broken. Around 3 seconds, it was usable but still noticeably slow. Around 1 to 2 seconds for normal turns, it started to feel like something you could actually chat with.

Measurement made deletion safe

The biggest reason I could simplify the system was not courage.

It was measurement.

Automated checks were there from the start, but they were built as needed. When I worked on conversation behaviour, I wrote conversation checks. When latency became the issue, I added latency measurements. When the brain architecture became messy, I added operation checks.

They were not perfect. They were not some grand benchmarking platform. But they made the system observable enough that I could delete pieces without flying blind.

Measure the current wait: split total latency into rate limit, retrieval, generation, TTFT, decode, and post-processing where possible.
Find self-inflicted work: separate model/provider cost from work the application chooses to do before the model can answer.
Delete before optimising: if a verifier, classifier, fallback, or remote call does not need to run on a turn, the fastest version is not running it.
Keep the quality gate: after deleting, rerun behaviour and latency checks. A faster system that gives worse answers is not a win.
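The last step can be written down as an explicit rule. The Python sketch below is a hypothetical gate, not the project's real check suite: `RunSummary`, `accept_change`, and the budget parameter are names invented for illustration, and a real gate would likely look at per-case results rather than a single pass count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    passed: int    # behaviour checks that passed
    total: int     # behaviour checks that ran
    p95_ms: float  # tail latency for the run


def accept_change(before: RunSummary, after: RunSummary, p95_budget_ms: float) -> bool:
    """Accept a deletion only if quality held and tail latency fits the budget."""
    if after.total != before.total:
        return False  # different check sets are not comparable
    quality_held = after.passed >= before.passed
    within_budget = after.p95_ms <= p95_budget_ms
    return quality_held and within_budget
```

The point of the rule is that speed alone never accepts a change: a run that is faster but passes fewer checks is rejected.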

That is the real lesson from the latency work.

Not “use this model.”

Not “use this database.”

Not “parallelise everything.”

The lesson is:

You cannot optimise a wait you have not decomposed.

Once I could see where the time went, the path forward became much clearer. Cut the extra model calls. Move slow non-critical work out of the hot path. Stop verifying things that did not need verification. Stop doing remote retrieval for a tiny corpus. Stream earlier. Keep the critical path small.

The result was not just faster numbers.

It was a portfolio chat that finally started to feel usable.

And once it was usable, the next question became obvious: why was I using a vector database to search around a hundred indexed items in the first place?

Benedict.
