How I Built an Evaluation Harness for My RAG Chatbot

5 min read

How I Tested the Chatbot I Built

Last year I wrote about building Gus and Mitch, the RAG-powered dog chatbots on my portfolio site. That post covered the architecture: Pinecone for vector storage, GPT-3.5-turbo for generation, and a retrieval pipeline that pulls relevant context before answering questions.

What it did not cover was how I knew any of it was actually working well.

That question stayed in the back of my head. Every time I tweaked a prompt or adjusted a parameter, I was relying on vibes. I would ask a few questions, decide it seemed fine, and move on. That is not a great engineering process for a system where the failure modes are subtle and the outputs are non-deterministic.

So I built an evaluation harness called puppy-evals.

What it measures

The harness measures three things independently, because combining them into a single score hides the failure modes.

Retrieval quality: Did the right Pinecone chunks come back? This is pure math. For each test question, I defined which knowledge base sections should appear in the results. The score is recall@k: of the top K chunks retrieved, what fraction of the expected ones showed up.

Factual grounding: Did the answer use the retrieved context, or did it invent things? This is scored 1-5 by Claude Haiku, given the question, the retrieved chunks, and the chatbot's answer.

Persona consistency: Did Gus sound like Gus? Did Mitch sound like Mitch? Also scored 1-5 by Claude Haiku, given each puppy's persona description and the answer.

I used Claude Haiku as the judge rather than GPT (the model being evaluated) to avoid same-model bias. A model judging its own outputs tends to agree with itself more than it should.

The golden set

I wrote 60 test questions across both puppies: 30 for Gus, 30 for Mitch. Each question fell into one of four categories: factual (questions with known answers in the knowledge base), personality (questions testing whether the voice holds up), boundary (things the chatbots should decline to answer), and adversarial (prompt injection attempts and off-topic provocations).

Writing these questions by hand was deliberate. The golden set is not generated data. Each entry reflects a real thing a visitor might ask, or a real failure mode I wanted to catch.

What the baseline showed

On the first complete run, both puppies retrieved the expected chunks 86.7% of the time. Grounding was strong for Gus (4.20 out of 5) and notably weaker for Mitch (2.93 out of 5). Persona consistency was mediocre for both.

That Mitch grounding gap was interesting. Her enthusiastic style was causing the model to add details beyond what the retrieved chunks actually supported. More confident tone, less factual precision.

What changed after two experiments

Increasing the retrieval depth from top 3 chunks to top 5 eliminated all retrieval misses for both puppies. The 13% gap at baseline meant expected chunks existed in Pinecone but were ranked just below the cutoff. Mitch's grounding also jumped from 2.93 to 3.43 with more context available, confirming the earlier suspicion: she needed more to work with.

Rewriting Gus's system prompt had the largest single impact of any change. The original prompt listed vocabulary words to use. The revised version included two concrete example responses showing his sentence structure and rhythm. Persona scores jumped from 3.47 to 4.13 for Gus. Showing what the voice looks like outperformed describing it.

The dashboard

All results are stored in Postgres and surfaced in a small Next.js dashboard. The most useful view is comparison: select two runs, see exactly which questions regressed, which improved, and by how much. That is the moment when eval infrastructure stops feeling like overhead.

Why this matters beyond my chatbot

Most LLM-powered applications ship without any systematic evaluation. Teams make changes, do a quick manual check, and deploy. When something regresses, they often do not find out until a user notices.

Evaluation infrastructure is not glamorous to build. It is also genuinely rare. The ability to measure whether a change to a prompt, a model, or a retrieval parameter actually improved things is a meaningful skill as more products get built on top of language models.

The code and full experiment writeups are on GitHub at github.com/rockwellwindsor/puppy-evals.