On Evaluating the Quality of Retrieval

First of all, I rambled a little bit in this Notion page. I didn't want to use any AI to clean it up, other than grammar fixes, because I think it's better for me to go off the top of my head. I'm recording this with Wispr Flow and talking through it out loud. That said, I’ll add a few AI-generated key points from my rambles here:

Key Points:

I was thinking a bit about the question you asked about how I would evaluate TwinMind's retrieval, so I went back to the code I still had and did some research. At its core, TwinMind has two distinct retrieval paths we need to evaluate: a fact lookup that searches by keyword, and an agent-loop lookup.

“Before I ship a change to a multi-stage data pipeline, how do I know it did not make correctness worse, and how do I know the accuracy/cost tradeoff is acceptable?”

Regarding the A/B testing idea, I think a more important move is to pick up signals from each user about the retrieval itself. Rather than A/B testing, I would track a few signals you can capture daily and watch over time. For TwinMind, those signals would be: how quickly the user copies the answer, the sentiment of their replies to the LLM (tracking how often it is negative), and how often they click 'regenerate' on an answer. Even with 80,000 users, you get a good idea of how they interact with it. You roll the daily numbers up into a weekly aggregate, measure the variation, and see whether it's trending up or down.
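As a rough sketch of that weekly roll-up: the row layout (day, answers shown, regenerate clicks, negative replies) and the function names are my own assumptions for illustration, not TwinMind's actual schema.

```python
from collections import defaultdict
from datetime import date

def weekly_rates(daily_rows):
    """Group daily counts by ISO week and return per-week rates.

    daily_rows: iterable of (day, answers_shown, regenerate_clicks, negative_replies).
    This row layout is a hypothetical stand-in for whatever the event log really stores.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for day, answers, regenerates, negatives in daily_rows:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        totals[key][0] += answers
        totals[key][1] += regenerates
        totals[key][2] += negatives

    rates = {}
    for key in sorted(totals):
        answers, regenerates, negatives = totals[key]
        if answers == 0:
            continue  # no answers shown that week, so no rate to report
        rates[key] = {
            "regenerate_rate": regenerates / answers,
            "negative_rate": negatives / answers,
        }
    return rates

def trend(rates, metric):
    """Change in a metric between the first and the last week (negative means it went down)."""
    weeks = sorted(rates)
    if len(weeks) < 2:
        return 0.0
    return rates[weeks[-1]][metric] - rates[weeks[0]][metric]
```

A falling regenerate rate after a pipeline change would be a good sign, a rising one a reason to look closer. The same shape works for copy latency if you store a per-week median instead of a rate.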

I would also create a test user with a fake memory background: a curated, hand-made set of information, so we know exactly what we should be retrieving, along with a set of hand-curated questions. In CI, every time we change the data pipelines, we check how the replies our agents give to those specific questions have changed. (In your specific case with VOYGR, I would approach this by grabbing information from sources we know for sure are correct, pairing that curated data with curated answers, running your data-gathering pipeline, and comparing the results.)
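Here is a minimal sketch of what that CI check could look like. The questions, the expected facts, and `answer_fn` (a stand-in for whatever calls the agent) are all made up for illustration, and the substring check is the simplest possible grader, not a recommendation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical curated cases for the fake test user: (question, fact the reply must state).
CASES: List[Tuple[str, str]] = [
    ("Where did I meet Ana?", "Lisbon"),
    ("When is the demo?", "Friday"),
]

def run_suite(answer_fn: Callable[[str], str], cases) -> Dict[str, bool]:
    """Ask every curated question and record whether the reply states the expected fact."""
    return {q: expected.lower() in answer_fn(q).lower() for q, expected in cases}

def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Questions that passed before the pipeline change and fail after it."""
    return sorted(q for q, ok in baseline.items() if ok and not candidate.get(q, False))
```

In CI you would run the suite against the current pipeline and the changed one, and fail the build if `regressions` returns anything.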

Another important factor is to change and evaluate only one step at a time, so we can isolate what caused quality to improve or get worse. For example, looking through the TwinMind backend, most of the memory retrieval actually comes from transcript summaries. That's a step that has to be evaluated separately: if something never made it from the transcript into the summary, it's never going to be retrieved. (I assume this goes for the data VOYGR is pulling as well: if the data was never pulled, we can't really evaluate whether the output is good.)
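To make the "one step at a time" point concrete, here is a small sketch that finds the first stage where a known fact disappears. The stage names (summary, retrieval, answer) and the plain substring check are my assumptions for illustration, not how the TwinMind backend is actually structured.

```python
def first_failing_stage(fact, stage_outputs):
    """Return the name of the first stage whose output no longer contains the fact.

    stage_outputs: ordered list of (stage_name, output_text), earliest stage first.
    Returns None if the fact survives every stage.
    """
    needle = fact.lower()
    for name, text in stage_outputs:
        if needle not in text.lower():
            return name
    return None

# Hypothetical outputs for one curated fact as it moves through the pipeline.
stages = [
    ("summary", "Met Ana in Lisbon to discuss pricing."),
    ("retrieval", "Discussed pricing with the team."),
    ("answer", "You discussed pricing."),
]
lost_at = first_failing_stage("Lisbon", stages)
```

Here the fact made it into the summary but was dropped at retrieval, so retrieval is the step to look at, not the summarizer.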

You asked how you'd know if one answer is better than another. You can check it with an LLM as a judge. This is one way I can think of doing that:

I would first manually check the quality of the judge's verdicts on the specific datasets where we know the answers.
I would then trust it to check the new answers we get from retrieval every time we make changes.
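The first step can be sketched as a simple agreement check between the judge and hand labels. `judge_fn` is a hypothetical wrapper around whatever LLM call you use, and the 0.9 threshold in the usage note is an arbitrary example, not a recommended value.

```python
def judge_agreement(examples, judge_fn):
    """Compare judge verdicts with hand labels on answers we already know.

    examples: list of (question, answer, human_says_good) tuples.
    judge_fn: hypothetical callable (question, answer) -> bool wrapping the LLM judge.
    Returns (agreement_rate, list of (question, answer) pairs where they disagree).
    """
    if not examples:
        raise ValueError("need at least one labeled example")
    disagreements = []
    for question, answer, human_says_good in examples:
        if judge_fn(question, answer) != human_says_good:
            disagreements.append((question, answer))
    agreement = 1 - len(disagreements) / len(examples)
    return agreement, disagreements
```

If agreement comes out below whatever bar you set (say 0.9), you read the disagreements, tweak the judge prompt, and run it again before trusting it on new answers.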

We can use an LLM as a judge to score how good an answer is, or to compare pairs of old and new answers. Obviously, you first have to set the judge up so that you actually trust it, which means manually evaluating a lot of its verdicts. If you don't trust it, you tweak the judge.

If we want to catch more subtle changes, we need a lot more questions and testing data, which we can build up over time. But there are cheaper ways to measure quality too. One way is keyword matching, where we manually pre-define keywords that need to be in any given answer. A lot of simple answers can be evaluated properly like this, and it's extremely cheap because we're not calling any APIs. The next test could be checking embedding similarity against the curated data we made, which also doesn't require an LLM as a judge.
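Both cheap checks fit in a few lines. This is only a sketch: the keyword check is a plain case-insensitive substring match, and the similarity function assumes you already have embedding vectors from whatever model you use (producing those vectors is not shown here).

```python
import math

def keyword_score(answer, required_keywords):
    """Fraction of the pre-defined keywords that appear in the answer (case-insensitive)."""
    if not required_keywords:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)
```

You would compare the embedding of the new answer with the embedding of the curated answer and flag anything that drops below a threshold you pick after looking at real examples.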

The LLM-as-a-judge idea reminded me of an auto-improvement loop I've been building for content. It's a skill that launches three sub-agents and a judge. The basic idea is: a sub-agent critiques the text → generates an improved version → the judge rates the versions without knowing which is which or which model produced each. From there, it can tell you which version is the best.
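The blind part is the piece worth sketching. This is not the actual skill, just a minimal illustration: `judge_fn` is a hypothetical callable that sees only the shuffled texts and returns the index of the one it prefers.

```python
import random

def blind_pick(candidates, judge_fn, rng=None):
    """Let the judge choose the best text without seeing where each one came from.

    candidates: dict mapping a source name (e.g. which sub-agent or model) to its text.
    judge_fn: hypothetical callable taking a list of texts and returning the index of the best.
    rng: optional random.Random, so the shuffle can be seeded in tests.
    Returns the source name of the winning text.
    """
    rng = rng if rng is not None else random.Random()
    items = list(candidates.items())
    rng.shuffle(items)  # hide any ordering that could hint at the source
    texts = [text for _, text in items]
    index = judge_fn(texts)
    if not 0 <= index < len(texts):
        raise ValueError("judge returned an index outside the candidate list")
    return items[index][0]
```

The source names only come back after the judge has decided, so the judge cannot favor a particular model or the "new" version.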

Which, in the end, is basically the same idea.