Eval First, Model Second: Inside the TrialCore Iteration Loop

TL;DR

No public benchmark matches the queries our users actually send, so we built a custom eval for TrialCore.
Ground truth comes from three tracks (citation-based, manual, structured-metadata), assembled into a 460-query graded set.
The candidate pool is unioned from deliberately diverse sources. TrialCore itself contributed only 1.6%, so the system isn't grading its own output.
The LLM judge fills out a per-facet binary checklist and we derive the grade in code; a router sends the hardest slices to a stronger model.
A CLI drives the iteration loop. A change only ships if it wins on the iteration slice and survives a re-run on the full eval set.

You can't improve what you can't measure, and most public benchmarks don't measure what your users actually care about. This post is about how we built the evaluation infrastructure for TrialCore, our clinical-trials retrieval service, and the iteration loop on top of it. We'll explain the jargon as it shows up. Note that the methodology generalizes well beyond search.

Why a custom eval

TrialCore answers questions like "phase 2 trials of GLP-1 agonists for NASH" or "head-to-head pembrolizumab vs chemotherapy in NSCLC". There is no public benchmark that matches the shape of these queries, the graded relevance we need, or the archetypes our users actually send. So we built one with these goals in mind:

Measure recall against a saturated pool of candidates, not a pool produced by the system being tested
Use graded labels so we can distinguish "found a related trial" from "found the pivotal trial"
Stratify across query archetypes so a regression on one slice can't hide behind a win on another
Be cheap enough to iterate on weekly without compromising label quality

That last constraint drives most of the interesting tradeoffs.

What we measure

Each query in the eval comes with a graded set of identifiers plus metadata like archetype, complexity, and therapeutic area. The metrics, in plain English:

Recall@k - of the trials a domain expert would call relevant, what fraction did we surface in the top k?
Precision@k - of the top k, what fraction are relevant?
NDCG@k - a graded ranking score that rewards putting highly relevant items above merely relevant ones
Recall grade 2@k - recall restricted to the highest-relevance trials

We optimize primarily for average recall grade 2@10: of the trials a domain expert would call exactly what the query asked for, what fraction shows up in the top 10? That maps to our agent-pipeline use case, where a downstream LLM re-reads the top k, so exact position matters less than presence.
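To make the definitions concrete, here is a minimal Python sketch of the four metrics. It is not TrialCore's implementation. It assumes grades of 0/1/2 per trial ID, a ranked list without duplicates, and the common exponential gain (2^grade - 1) for NDCG; the source does not say which gain function is used. The function names and trial IDs are made up for illustration.

```python
import math

def recall_at_k(ranked, grades, k, min_grade=1):
    """Fraction of trials graded >= min_grade that appear in the top k."""
    relevant = {t for t, g in grades.items() if g >= min_grade}
    if not relevant:
        return None  # undefined for this query; skip it when averaging
    return len(relevant & set(ranked[:k])) / len(relevant)

def precision_at_k(ranked, grades, k, min_grade=1):
    """Fraction of the top k positions holding a trial graded >= min_grade."""
    hits = sum(1 for t in ranked[:k] if grades.get(t, 0) >= min_grade)
    return hits / k

def _dcg(gains):
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked, grades, k):
    """Graded ranking score: DCG of the ranking divided by the best possible DCG."""
    gains = [2 ** grades.get(t, 0) - 1 for t in ranked[:k]]
    ideal = sorted((2 ** g - 1 for g in grades.values()), reverse=True)[:k]
    ideal_dcg = _dcg(ideal)
    return _dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded pool and system ranking for one query.
grades = {"T-A": 2, "T-B": 2, "T-C": 1, "T-D": 0}
ranked = ["T-C", "T-A", "T-X", "T-D"]

print(recall_at_k(ranked, grades, k=3))               # 2 of 3 relevant trials found
print(recall_at_k(ranked, grades, k=3, min_grade=2))  # recall grade 2@3: 1 of 2 found
print(precision_at_k(ranked, grades, k=3))
print(ndcg_at_k(ranked, grades, k=3))
```

Recall grade 2@k is just recall with min_grade=2, which is why the headline metric rewards presence of the exact-match trials rather than their position.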

Ground truth: three tracks

Hand-labeled retrieval ground truth is slow and expensive. We use three sources with different bias profiles and tag every row with its track:

Track 1 — Review. Systematic reviews are expert-curated bundles of trials on a topic. We sample them, extract the cited IDs, classify each review's archetype, and use an LLM to generate synthetic queries from the title and abstract in two styles (clean LLM-generated, messy user-style) and three complexity levels. High-confidence seed positives at low cost; the bias is toward well-cited trials.

Track 2 — Manual. A small CSV of queries we wrote by hand with curated ID bundles. Things like "find glp-1 triagonist trials" or just a list of IDs the retriever should round-trip. Small but high-signal for the messy queries real users actually send.

Track 3 — Faceted. The first two tracks share a citation bias: a trial only shows up if someone thought it was relevant. To probe blind spots, we generate queries from structured metadata in the underlying database itself, then use the matching trials as graded ground truth by construction. This surfaces queries about specific phases, sponsor classes, and non-oncology subgroups the review track tends to miss.

Assembled, deduped, and verified against the live index, this gives us a 460-query evaluation set with graded ID bundles across the three tracks. Iterations run on a sampled slice for speed; confirmation runs against the full set.
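One way to draw the iteration slice is sketched below. The post does not say how the slice is sampled, so treat this as an assumption: it picks round-robin across archetypes with a fixed seed, so that small archetypes are still represented and the slice is reproducible between runs. The field name "archetype" and the archetype labels are hypothetical.

```python
import random
from collections import defaultdict

def sample_slice(queries, size, seed=0):
    """Pick `size` queries, cycling across archetypes so each one is represented."""
    rng = random.Random(seed)  # fixed seed keeps the slice stable across runs
    by_archetype = defaultdict(list)
    for q in queries:
        by_archetype[q["archetype"]].append(q)
    for bucket in by_archetype.values():
        rng.shuffle(bucket)

    # Visit archetypes in a stable order and take one query from each in turn.
    buckets = [by_archetype[name] for name in sorted(by_archetype)]
    picked = []
    while len(picked) < size and any(buckets):
        for bucket in buckets:
            if bucket and len(picked) < size:
                picked.append(bucket.pop())
    return picked

# Hypothetical 460-query set with three made-up archetype labels.
labels = ["pivotal", "head_to_head", "phase_filter"]
queries = [{"id": i, "archetype": labels[i % 3]} for i in range(460)]
iteration_slice = sample_slice(queries, size=50)
print(len(iteration_slice))
```

Equal allocation over-represents rare archetypes relative to the full set, which is a deliberate tradeoff here; proportional allocation would be the alternative if the slice should mirror the full set's mix.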

The pool problem

Here's a trap: if your pool of candidate "correct answers" only contains results from the system being tested, your recall numbers are artificially high. You're measuring "of the things my system returned, what fraction did my system rank highly", which is close to meaningless.

To get an honest read, we union deliberately diverse sources when building the candidate pool: domain citations, a keyword baseline, the system under test itself, web search, and a handful more chosen specifically for low overlap with the others.

The TrialCore retriever contributed only 1.6% of the unique trials in the pool — the system isn't grading its own homework. One of the blind-spot sources contributed ~45%: it enumerates pivotal trials that the citation-based tracks miss entirely.

After the union, an LLM judge grades each candidate against the query, and the system under test is evaluated against that graded pool.

   +----------------------+
   |  diverse sources     |
   |  per query:          |
   |                      |
   |  - domain citations  |
   |  - a keyword baseline|
   |  - the system itself |
   |  - web search        |
   |  - ...               |
   +----------+-----------+
              |
              v
   +----------------------+
   |  union + dedupe      |
   +----------+-----------+
              |
              v
   +----------------------+
   |  LLM judge           |
   |  (graded per query)  |
   +----------+-----------+
              |
              v
   +----------------------+
   |  per-query graded    |
   |  ground truth        |
   +----------+-----------+
              |
              v
   +----------------------+
   |  measure system      |
   |  under test          |
   |  -> recall@k, NDCG@k |
   |  -> segment by slice |
   +----------------------+
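The union step in the diagram, together with a per-source contribution check, could look roughly like the sketch below. It assumes "contribution" means the share of pooled trials that only one source found, which is one plausible reading of the percentages quoted above. The source names and trial IDs are placeholders.

```python
from collections import defaultdict

def build_pool(candidates_by_source):
    """Union candidate IDs across sources, remembering which sources found each ID."""
    found_by = defaultdict(set)
    for source, trial_ids in candidates_by_source.items():
        for trial_id in trial_ids:
            found_by[trial_id].add(source)
    return dict(found_by)

def unique_contribution(found_by, sources):
    """Share of the pool that each source found and no other source did."""
    counts = {source: 0 for source in sources}
    for finders in found_by.values():
        if len(finders) == 1:
            counts[next(iter(finders))] += 1
    total = len(found_by)
    return {source: n / total for source, n in counts.items()}

# Placeholder candidate lists for a single query.
candidates = {
    "citations": ["T1", "T2", "T3"],
    "keyword": ["T2", "T4"],
    "system": ["T1", "T2", "T5"],
    "web": ["T6", "T7", "T3"],
}
pool = build_pool(candidates)
shares = unique_contribution(pool, candidates)
for source, share in shares.items():
    print(f"{source}: {share:.1%} of the pool found only by this source")
```

A low share for the system under test is the healthy outcome: it means the pool would look almost the same without it.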

The judge

Using an LLM to grade relevance is the standard cheap-grading move, but the judge is the single biggest source of label noise, and every downstream metric depends on it. So we spent a disproportionate fraction of our eval budget on it. Three lessons stand out.

Booleans, then derive the grade. Our first judge asked the LLM to emit a grade (0/1/2) directly. Cheap models would routinely short-circuit: latch onto the first dimension that didn't match, dump a plausible-sounding reason, and emit grade 0 without ever evaluating the other dimensions. The fix was to split the task: the LLM commits to a few per-dimension booleans (each capturing one independent facet of relevance), and we then compute the grade from those booleans in code. Same model, same cost, but the schema makes short-circuiting impossible. Cohen's κ (a chance-corrected agreement score: 1 is perfect agreement, 0 is what chance alone would produce) against a strong-model reference jumped from 0.43 to 0.62. Whenever you ask an LLM for a structured judgment with independent dimensions, force it to commit to each one separately and compute the rollup in code.
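As an illustration of the checklist pattern, the sketch below validates a judge response and derives the grade in code. The facet names and the rollup rule are invented for this example; the post does not list the actual facets or how they map to grades.

```python
import json

# Hypothetical facets; each should be answerable independently of the others.
FACETS = ("condition_matches", "intervention_matches", "design_matches")

def parse_checklist(raw_json):
    """Reject any judge response that skips a facet or answers with a non-boolean."""
    data = json.loads(raw_json)
    for facet in FACETS:
        if facet not in data:
            raise ValueError(f"judge skipped facet: {facet}")
        if not isinstance(data[facet], bool):
            raise ValueError(f"facet {facet} must be true or false")
    return {facet: data[facet] for facet in FACETS}

def derive_grade(checklist):
    """Assumed rollup: all facets match -> 2, core facets match -> 1, else 0."""
    if all(checklist.values()):
        return 2
    if checklist["condition_matches"] and checklist["intervention_matches"]:
        return 1
    return 0

response = '{"condition_matches": true, "intervention_matches": true, "design_matches": false}'
print(derive_grade(parse_checklist(response)))  # related trial, but not an exact match
```

Because the grade never appears in the model's output, the model cannot jump straight to a 0; a response that omits a facet fails validation and can be retried.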

When the prompt changes, regenerate the reference. We benchmark a cheap judge against a "gold" checkpoint produced by stronger reference models and human evaluation. When we changed the judge's schema, we initially compared the new output against the old gold and got inflated agreement, because the old gold had the same short-circuit defect we were trying to fix. Whenever you change a prompt or schema in a way that could shift behavior, regenerate the reference. Keep the old one as a backup so you can later answer "did this edit improve absolute label quality, or just shift both sides into a new correlated mistake?"
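For reference, agreement between a cheap judge and a gold set can be computed with a few lines of Python. This sketch implements plain (unweighted) Cohen's κ; the post does not say whether a weighted variant is used for the ordinal 0/1/2 grades, and the label lists are made up. Running it against both the old and the regenerated reference is one way to see whether an edit changed agreement with each.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up grades for eight (query, trial) pairs.
cheap_judge = [2, 1, 0, 0, 2, 1, 0, 0]
gold = [2, 1, 0, 1, 2, 0, 0, 0]
print(round(cohens_kappa(cheap_judge, gold), 2))  # 0.6
```

Here the two label lists agree on 6 of 8 items (0.75), chance agreement is 0.375, and κ = (0.75 - 0.375) / (1 - 0.375) = 0.6.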

When prompt edits over-regress, route instead. We tried to close the cheap-vs-strong judge gap with prompt edits. Three targeted ones lifted κ from 0.53 to 0.61. The fourth, a paragraph targeting one of the harder query types, fixed the targeted cluster but regressed everywhere else: overall κ fell to 0.55, because the new rule misfired on superficially similar queries that didn't actually need the special treatment. We rejected the edit and replaced it with a router: on the slices where the cheap judge was worst, call the expensive model; everywhere else, use the cheap one. That lifted overall κ to 0.75 at ~30% of the all-expensive cost. When a global prompt edit can't separate the cases it's meant to fix from the cases it harms, route instead.
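A router of this kind can be very small. The sketch below assumes slices are keyed by (archetype, complexity) and that per-slice κ for the cheap judge has already been measured; the slice keys, the threshold, and the κ values are illustrative, not taken from the post.

```python
def pick_hard_slices(cheap_kappa_by_slice, threshold):
    """Slices where the cheap judge agrees too poorly with the reference."""
    return {s for s, kappa in cheap_kappa_by_slice.items() if kappa < threshold}

def route_judge(query_meta, hard_slices):
    """Return which judge tier should grade candidates for this query."""
    slice_key = (query_meta["archetype"], query_meta["complexity"])
    return "strong" if slice_key in hard_slices else "cheap"

# Illustrative per-slice agreement for the cheap judge.
cheap_kappa_by_slice = {
    ("head_to_head", "high"): 0.41,
    ("head_to_head", "low"): 0.70,
    ("phase_filter", "high"): 0.66,
}
hard_slices = pick_hard_slices(cheap_kappa_by_slice, threshold=0.5)

print(route_judge({"archetype": "head_to_head", "complexity": "high"}, hard_slices))  # strong
print(route_judge({"archetype": "phase_filter", "complexity": "high"}, hard_slices))  # cheap
```

The routing decision depends only on query metadata that exists before judging, so it cannot misfire on "superficially similar" content the way a prompt rule can.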

From eval to improvement: the experiment loop

A good eval is necessary but not sufficient. Most of the leverage afterwards is in running disciplined iterations, without succumbing to cherry-picking metrics, p-hacking against a small slice, or forgetting which version of the system produced which numbers. We built a small CLI on top of the eval. Each iteration is six steps:

1. propose      ─ Read the journal + code, draft 1–3 ranked hypotheses
2. start id     ─ CLI: create exp/id branch, assert clean tree
3. implement    ─ Edit code on the experiment branch
4. eval  id     ─ CLI: run the eval, capture metrics + diff + verdict
5. analyze      ─ Write a post-run analysis, append to JOURNAL.md
6. promote id   ─ CLI: re-run on the full eval set to confirm

Each hypothesis is a markdown file with a target metric, a predicted direction and magnitude, and explicit guardrails. The verdict is rule-based: win (target moved ≥ ½ the predicted magnitude in the predicted direction, no guardrail broken), null, or regression. The half-magnitude threshold is permissive on the 50-query iteration slice because per-run variance is high; the full eval set is the real bar. A hypothesis that wins on the iteration slice but doesn't survive the full set doesn't ship.

eval complete (50-query iteration slice)
        |
        v
   +----------------------------+
   | guardrail broken?          |-- yes --> regression
   +-------------+--------------+
                 | no
                 v
   +----------------------------+
   | target moved in predicted  |-- no (wrong dir) --> regression
   | direction?                 |-- no (flat)      --> null
   +-------------+--------------+
                 | yes
                 v
   +----------------------------+
   | moved >= 1/2 predicted     |-- no --> null
   | magnitude?                 |
   +-------------+--------------+
                 | yes
                 v
              win on 50-query
                 |
                 v
   +----------------------------+
   | re-run on full eval set -- |-- no --> do not ship
   | still a win vs parent?     |
   +-------------+--------------+
                 | yes
                 v
               ship
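The slice-level part of the decision tree above translates almost directly into code. This is a sketch rather than the CLI's actual logic; in particular, the flat_eps parameter (how small a move counts as "flat") is an assumption, since the post does not define it.

```python
def slice_verdict(predicted_delta, measured_delta, guardrails_ok, flat_eps=0.0):
    """Rule-based verdict on the iteration slice: 'win', 'null', or 'regression'."""
    if predicted_delta == 0:
        raise ValueError("a hypothesis must predict a non-zero move")
    if not guardrails_ok:
        return "regression"
    # Project the measured move onto the predicted direction.
    signed = measured_delta if predicted_delta > 0 else -measured_delta
    if signed < -flat_eps:
        return "regression"  # moved the wrong way
    if signed <= flat_eps:
        return "null"        # flat
    if signed >= abs(predicted_delta) / 2:
        return "win"         # at least half the predicted magnitude
    return "null"

# Deltas in percentage points on the target metric.
print(slice_verdict(predicted_delta=1.0, measured_delta=3.74, guardrails_ok=True))  # win
print(slice_verdict(predicted_delta=1.0, measured_delta=0.3, guardrails_ok=True))   # null
print(slice_verdict(predicted_delta=1.0, measured_delta=-0.5, guardrails_ok=True))  # regression
```

A "win" here only earns the change a re-run on the full eval set; the ship decision is made there.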

The other key piece is the journal, JOURNAL.md: an append-only ledger of every attempt, one row each. It's what we read when proposing the next hypothesis. It encodes hard-won negative results so we don't relitigate them six iterations later.
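An append-only ledger needs very little machinery. The sketch below writes one Markdown table row per attempt and only ever opens the file in append mode; the column layout and the experiment ID are assumptions, not the real journal schema.

```python
from pathlib import Path

def format_journal_row(exp_id, hypothesis, verdict, delta_pp):
    """One attempt, one row; pipes in free text are escaped to keep the table intact."""
    summary = hypothesis.replace("|", "\\|")
    return f"| {exp_id} | {summary} | {verdict} | {delta_pp:+.2f}pp |\n"

def append_journal_row(journal_path, exp_id, hypothesis, verdict, delta_pp):
    """Append a row; existing rows are never rewritten."""
    row = format_journal_row(exp_id, hypothesis, verdict, delta_pp)
    with Path(journal_path).open("a", encoding="utf-8") as fh:
        fh.write(row)
    return row

print(format_journal_row("exp-017", "soft ranking term instead of strict filter", "win", 3.74), end="")
```

Keeping failures in the same table as wins is the point: a rejected row is what stops the same idea from being retried later.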

What this produced (and didn't)

Two examples from recent iterations.

A win: moving one of our retrieval signals out of a strict filter and into a soft ranking term. Predicted +1.0pp on the headline metric. Measured +3.74pp on the 50-query iteration slice, then +2.24pp on the full eval set. Shipped.

A rejection: an aggregator change across the keyword scoring fields that looked like a clean structural improvement. Result: a WIN vs the frozen baseline (+3.07pp), but a regression against the immediate parent on every metric, including a 5.7pp collapse on the very archetype the hypothesis was designed to help. Rejected, journaled, and flagged for the next person tempted to try it.

The discipline of "WIN vs baseline is necessary but not sufficient" has already caught two would-be regressions.
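The two-sided check can be expressed as a small gate. This is a sketch with invented metric names and numbers, and it assumes every metric is higher-is-better with zero tolerance for regressions against the parent; a real gate would likely allow for run-to-run noise.

```python
def ship_gate(candidate, baseline, parent, headline):
    """Ship only if the headline metric beats the baseline and nothing regresses vs the parent."""
    beats_baseline = candidate[headline] > baseline[headline]
    regressed_vs_parent = sorted(m for m in parent if candidate[m] < parent[m])
    return beats_baseline and not regressed_vs_parent, regressed_vs_parent

# Invented full-set numbers for illustration only.
baseline = {"recall_g2@10": 0.60, "ndcg@10": 0.55}
parent = {"recall_g2@10": 0.66, "ndcg@10": 0.58}
candidate = {"recall_g2@10": 0.63, "ndcg@10": 0.57}

ok, regressed = ship_gate(candidate, baseline, parent, headline="recall_g2@10")
print(ok, regressed)  # beats the baseline, but regresses against the parent on both metrics
```

Comparing only against a frozen baseline lets a change take credit for gains that earlier shipped changes already produced; the parent comparison removes that credit.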

Takeaways

Build the eval before the optimization loop. Without trustworthy ground truth and a saturated pool, you'll be optimizing for your own system's quirks, not for the thing you care about.
The judge is the most important model in your stack. Save the reference set, regenerate it when the prompt changes, and never compare against an obsolete reference.
Booleans, then derive. Force structured LLM judgments to commit to each dimension separately and roll up in code. Short-circuiting is otherwise the default failure mode.
Route, don't over-tune. When a prompt edit fixes one slice and breaks another, a router (cheap on easy cases, expensive on hard ones) is almost always cheaper and faster than another round of prompt edits.
WIN vs baseline ≠ ship. A change must also not regress against its parent, and it must survive a run against the full eval set. Keep a journal of every attempt, including failures, so the team doesn't pay for the same dead end twice.
