Relevance Evaluation

How to measure whether search and retrieval are actually getting better — instead of just feeling like they are. This page is the methodology; the levers it tells you to pull are in Ranking tuning and Hybrid search.

Why evaluation matters

Without a metric, every ranking change feels like an improvement. With a metric, you know which changes actually help, which regress something else, and when to stop tuning.

Three things make evaluation tractable in practice:

  1. A golden set of queries with labeled correct results.
  2. A measurement script that runs the golden set against any configuration.
  3. A deployment gate that ships only configurations that beat the baseline.

That's the whole loop.

Building a golden set

The golden set is the contract every ranking change measures against.

Picking queries

| Source | Target count | Why |
| --- | --- | --- |
| Top queries by volume | 50 | Catch most user traffic. |
| Zero-result queries | 20 | Reveal corpus / alias gaps. |
| Domain-jargon queries | 10 | Acronyms, product names, internal terms. |
| Long-tail / underserved queries | 10 | The 80/20 long tail still matters at scale. |
| Known failures from user feedback | 10 | Real escalations from real users. |

Aim for 80–120 queries. Too few and noise dominates; too many and labeling is unmaintainable.

Labels

For each query, label the expected top-3 UIDs, ideally ranked:

query_id,query,expected_uids,priority,notes,added_at
q-0001,"macbook screen flicker","uid-1; uid-2; uid-3",p1,"flagship product",2026-01-15
q-0002,"battery drain overnight","uid-9; uid-12",p2,"known acronym BD",2026-01-15
q-0003,"error 0x80004005","uid-44",p1,"exact identifier",2026-01-15

Conventions that work:

  • Label by UID, not by text — text drifts, UIDs don't.
  • Three labelers in parallel, then resolve disagreements. Single-labeler golden sets reflect one person's biases.
  • Tag queries (priority, category) so you can slice the report.
  • Date-stamp the labels. Quarterly refreshes are mandatory.

Capturing the golden set in the graph

The CSV is the human-friendly input format. The graph is what the evaluation endpoints actually read from — that way the same labels are versioned alongside the rest of your workspace, queryable, and reachable from any other endpoint.

Node schema — GoldenQuery

A GoldenQuery node represents one labeled query in the suite. The QueryId is the natural key so re-imports update the same node.

| Field | Type | Notes |
| --- | --- | --- |
| QueryId | string (key) | Stable id from the CSV, e.g. q-0001. |
| QueryText | string | The user-facing query string passed to search. |
| Priority | string | p1 / p2 / p3 — used to weight aggregates and gate regressions. |
| Category | string | Free-form tag (product, error-code, jargon, …). |
| Surface | string | Which UI surface the query belongs to (global-search, support). |
| ExpectedUids | list<string> | Labeled correct UIDs in expected rank order. |
| Notes | string | Free-form labeler note. |
| AddedAt | Time | Date-stamp for hygiene refreshes. |

Edge schema — Expects

Each GoldenQuery also gets one outgoing Expects edge per labeled target. The edge makes the dataset traversable from either side ("which queries label this case?") and survives even when a labeled node is later renamed or moved.

| Edge | From | To | Notes |
| --- | --- | --- | --- |
| Expects | GoldenQuery | labeled target node | One per (query, target); order is kept in ExpectedUids. |

Endpoint: evaluation/import-golden-set

Bulk-import the labeling CSV into the graph. Re-runnable — existing GoldenQuery nodes are updated in place.

// Endpoint: evaluation/import-golden-set
// Body: the raw CSV (same shape as the table above).
var csv   = Body;
var lines = csv.Split('\n').Skip(1).Where(l => !string.IsNullOrWhiteSpace(l));
int count = 0;

foreach (var line in lines)
{
    var cols          = line.Split(',');
    var queryId       = cols[0].Trim().Trim('"');
    var queryText     = cols[1].Trim().Trim('"');
    var expectedUids  = cols[2].Trim().Trim('"')
                            .Split(';', StringSplitOptions.RemoveEmptyEntries)
                            .Select(u => u.Trim())
                            .ToList();
    var priority      = cols[3].Trim().Trim('"');
    var notes         = cols.Length > 4 ? cols[4].Trim().Trim('"') : "";

    var query = await Graph.GetOrAddLockedAsync(N.GoldenQuery.Type, queryId);
    try
    {
        query.SetString(N.GoldenQuery.QueryText,    queryText);
        query.SetString(N.GoldenQuery.Priority,     priority);
        query.SetString(N.GoldenQuery.Notes,        notes);
        query.SetStringList(N.GoldenQuery.ExpectedUids, expectedUids);
        query.SetTime  (N.GoldenQuery.AddedAt,      Time.Now);

        foreach (var uid in expectedUids)
        {
            var target     = UID128.Parse(uid);
            var targetType = Graph.GetNodeTypeUIDForUID(target);
            query.AddUniqueEdge(E.Expects, target, targetType);
        }

        await Graph.CommitAsync(query);
        count++;
    }
    catch
    {
        Graph.AbandonChanges(query);
        throw;
    }
}

return Ok($"Imported {count} golden queries");
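
One caveat on the Split(',') above: it works as long as quoted fields never contain commas, which holds for the sample CSV. If your query texts can contain commas, swap in a quote-aware splitter. A minimal sketch follows; SplitCsvLine is a local helper written for this page, not a platform API.

// Minimal quote-aware CSV field splitter (sketch). Handles double-quoted
// fields and doubled quotes ("") as escaped quotes; embedded newlines are
// not supported. Usage inside the loop above: var cols = SplitCsvLine(line);
static List<string> SplitCsvLine(string line)
{
    var fields   = new List<string>();
    var current  = new System.Text.StringBuilder();
    var inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        var c = line[i];
        if (c == '"')
        {
            // A doubled quote inside a quoted field is an escaped quote.
            if (inQuotes && i + 1 < line.Length && line[i + 1] == '"') { current.Append('"'); i++; }
            else inQuotes = !inQuotes;
        }
        else if (c == ',' && !inQuotes)
        {
            fields.Add(current.ToString().Trim());
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }

    fields.Add(current.ToString().Trim());
    return fields;
}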

Metrics

The five numbers worth tracking:

| Metric | What it measures | When it's the right metric |
| --- | --- | --- |
| Precision@1 | Is the first result correct? | Surfaces where users click only the top hit ("I'm feeling lucky"). |
| Precision@3 | Fraction of the top 3 that are correct. | Most "user types and scrolls a bit" surfaces. |
| NDCG@10 | Position-weighted quality of the top 10; rewards correct results higher up. | When ordering of the top 10 matters. |
| Recall@50 | Did a correct UID appear anywhere in the top 50? | A floor check: "the engine isn't blind to this query." |
| Zero-result rate | Fraction of golden queries returning nothing. | A failure mode that ranking can't fix. |

NDCG is the most useful single number for "is ranking improving?" — but track all five; they reveal different failure modes.
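
A quick worked example of the gain used by the NDCG helper below (1 / log2(rank + 1), counting ranks from 1): a query with a single labeled UID that shows up at rank 3 gets DCG@10 = 1 / log2(4) = 0.5, the ideal DCG (that label at rank 1) is 1 / log2(2) = 1.0, so NDCG@10 = 0.5. The same label at rank 1 scores 1.0; outside the top 10 it scores 0.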

Endpoint: evaluation/metrics

Pure helpers. Doesn't read or write the graph — it exists only to be imported by the scoring endpoints via //ImportEndpoint("evaluation/metrics").

// Endpoint: evaluation/metrics
// Shared metric helpers. Imported by evaluation/score-query and any other
// endpoint that needs to compute Precision@k / NDCG@k against a label set.
public static class Metrics
{
    public static double PrecisionAtK(IReadOnlyList<UID128> retrieved, ISet<UID128> expected, int k)
    {
        if (k <= 0 || retrieved.Count == 0) return 0;
        var hits = retrieved.Take(k).Count(expected.Contains);
        return hits / (double)k;
    }

    public static double ReciprocalRank(IReadOnlyList<UID128> retrieved, ISet<UID128> expected)
    {
        for (int i = 0; i < retrieved.Count; i++)
            if (expected.Contains(retrieved[i])) return 1.0 / (i + 1);
        return 0.0;
    }

    public static double DcgAtK(IReadOnlyList<UID128> retrieved, ISet<UID128> expected, int k)
    {
        double sum = 0;
        var cap = Math.Min(k, retrieved.Count);
        for (int i = 0; i < cap; i++)
            if (expected.Contains(retrieved[i]))
                sum += 1.0 / Math.Log2(i + 2);
        return sum;
    }

    public static double NdcgAtK(IReadOnlyList<UID128> retrieved, IReadOnlyList<UID128> expectedRanked, int k)
    {
        var expectedSet = new HashSet<UID128>(expectedRanked);
        var ideal       = DcgAtK(expectedRanked, expectedSet, k);
        return ideal == 0 ? 0 : DcgAtK(retrieved, expectedSet, k) / ideal;
    }

    public static double RecallAtK(IReadOnlyList<UID128> retrieved, ISet<UID128> expected, int k)
        => retrieved.Take(k).Any(expected.Contains) ? 1.0 : 0.0;
}

Running the suite

Three endpoints, each with a single responsibility:

| Endpoint | Reads | Writes | Calls |
| --- | --- | --- | --- |
| evaluation/run-search | search index | | |
| evaluation/score-query | golden set + search index | | evaluation/run-search |
| evaluation/run-suite | golden set | | evaluation/score-query (per query) |

Endpoint: evaluation/run-search

This is the one place that knows how search is invoked. Swap this body when you change the retrieval surface (BM25 vs hybrid vs semantic) without touching any of the scoring code.

// Endpoint: evaluation/run-search
// Returns the top-50 UIDs for a single query. Read-only.
public class RunSearchRequest  { public string Query;   public double? Alpha; }
public class RunSearchResponse { public List<UID128> RetrievedUids; }

var req = Body.FromJson<RunSearchRequest>();

// req.Alpha is accepted but unused in this keyword-only body; a hybrid body
// would use it to blend keyword and semantic rankings (see the sketch below).
var uids = Q()
    .StartSearch(N.SupportCase.Type, N.SupportCase.Body, SearchExpression.Parse(req.Query))
    .Take(50)
    .ToUIDList();

return new RunSearchResponse { RetrievedUids = uids };
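
RunSearchRequest.Alpha is carried through the scoring endpoints but ignored by the keyword-only body above. As a sketch of what a hybrid body could do with it, here is a rank-based fusion of two ranked UID lists, weighting the keyword ranking by (1 - α) and a semantic ranking by α. How the semantic list is produced is out of scope here, and BlendByAlpha is an illustrative helper, not part of the platform.

// Sketch: weighted reciprocal-rank fusion of two ranked lists.
// alpha = 0 reproduces the keyword order, alpha = 1 the semantic order.
static List<UID128> BlendByAlpha(IReadOnlyList<UID128> keywordRanked,
                                 IReadOnlyList<UID128> semanticRanked,
                                 double alpha, int take = 50)
{
    var scores = new Dictionary<UID128, double>();

    void Accumulate(IReadOnlyList<UID128> ranked, double weight)
    {
        for (int i = 0; i < ranked.Count; i++)
        {
            var contribution = weight / (i + 1);   // rank 1 → weight, rank 2 → weight/2, ...
            scores[ranked[i]] = scores.TryGetValue(ranked[i], out var s)
                ? s + contribution
                : contribution;
        }
    }

    Accumulate(keywordRanked,  1.0 - alpha);
    Accumulate(semanticRanked, alpha);

    return scores.OrderByDescending(kv => kv.Value)
                 .Take(take)
                 .Select(kv => kv.Key)
                 .ToList();
}

With something like that in place, the body above becomes: run keyword retrieval, run semantic retrieval, then return BlendByAlpha(keyword, semantic, req.Alpha ?? 0.5).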

Endpoint: evaluation/score-query

Pulls the labels for one GoldenQuery out of the graph, calls evaluation/run-search for the same text, and uses the imported Metrics helpers. The endpoint name in the RunEndpointAsync call matches the path declared in the endpoint above.

// Endpoint: evaluation/score-query
//ImportEndpoint("evaluation/metrics")

public class ScoreQueryRequest  { public string QueryId; public double? Alpha; }
public class ScoreQueryResponse
{
    public string  QueryId;
    public double  PrecisionAt1, PrecisionAt3, NdcgAt10, RecallAt50;
    public bool    Zero;
}

var req = Body.FromJson<ScoreQueryRequest>();

var goldenNode = Q().StartAt(N.GoldenQuery.Type, req.QueryId).AsEnumerable().FirstOrDefault();
if (goldenNode is null) return NotFound($"GoldenQuery '{req.QueryId}' not in graph");

var queryText    = goldenNode.GetString(N.GoldenQuery.QueryText);
var expectedList = goldenNode.GetStringList(N.GoldenQuery.ExpectedUids)
                            .Select(UID128.Parse)
                            .ToList();
var expectedSet  = new HashSet<UID128>(expectedList);

var search = await RunEndpointAsync<RunSearchResponse>(
    "evaluation/run-search",
    new RunSearchRequest { Query = queryText, Alpha = req.Alpha });

var retrieved = search.RetrievedUids;

return new ScoreQueryResponse
{
    QueryId      = req.QueryId,
    PrecisionAt1 = Metrics.PrecisionAtK(retrieved, expectedSet, 1),
    PrecisionAt3 = Metrics.PrecisionAtK(retrieved, expectedSet, 3),
    NdcgAt10     = Metrics.NdcgAtK   (retrieved, expectedList, 10),
    RecallAt50   = Metrics.RecallAtK (retrieved, expectedSet, 50),
    Zero         = retrieved.Count == 0,
};

Endpoint: evaluation/run-suite

Iterates every GoldenQuery in the graph, defers per-query work to evaluation/score-query, and aggregates. Relay status messages keep long runs observable.

// Endpoint: evaluation/run-suite (Pooling, Read Only)
public class RunSuiteRequest  { public double? Alpha; }
public class RunSuiteResponse
{
    public double PrecisionAt1, PrecisionAt3, NdcgAt10, RecallAt50, ZeroResult;
    public List<ScoreQueryResponse> PerQuery;
}

var req     = Body.FromJson<RunSuiteRequest>() ?? new RunSuiteRequest();
var queries = Q().StartAt(N.GoldenQuery.Type).ToList();
var scored  = new List<ScoreQueryResponse>(queries.Count);

for (int i = 0; i < queries.Count; i++)
{
    var qid    = queries[i].GetString(N.GoldenQuery.QueryId);
    var result = await RunEndpointAsync<ScoreQueryResponse>(
        "evaluation/score-query",
        new ScoreQueryRequest { QueryId = qid, Alpha = req.Alpha });

    scored.Add(result);
    await RelayStatusAsync($"Scored {qid} ({i + 1}/{queries.Count})");
}

double Mean(Func<ScoreQueryResponse, double> f) => scored.Count == 0 ? 0 : scored.Average(f);

return new RunSuiteResponse
{
    PrecisionAt1 = Math.Round(Mean(r => r.PrecisionAt1), 3),
    PrecisionAt3 = Math.Round(Mean(r => r.PrecisionAt3), 3),
    NdcgAt10     = Math.Round(Mean(r => r.NdcgAt10),     3),
    RecallAt50   = Math.Round(Mean(r => r.RecallAt50),   3),
    ZeroResult   = Math.Round(scored.Count(r => r.Zero) / (double)Math.Max(scored.Count, 1), 3),
    PerQuery     = scored,
};

Wire evaluation/run-suite into CI on a nightly schedule. Diff against the previous run.

Regression testing

Aggregate metrics smooth over individual failures. Track per-query regression alongside the aggregate.

Endpoint: evaluation/per-query-diff

Takes the per-query results (the PerQuery lists) from two prior run-suite runs and returns the regressions worth alerting on.

// Endpoint: evaluation/per-query-diff
public class DiffRequest
{
    public List<ScoreQueryResponse> Baseline;
    public List<ScoreQueryResponse> Candidate;
}

var req     = Body.FromJson<DiffRequest>();
var byId    = req.Candidate.ToDictionary(c => c.QueryId);
var alerts  = new List<object>();

foreach (var b in req.Baseline)
{
    if (!byId.TryGetValue(b.QueryId, out var c)) continue;

    var droppedToZero = b.PrecisionAt3 > 0.0 && c.PrecisionAt3 == 0.0;
    var bigNdcgDrop   = (b.NdcgAt10 - c.NdcgAt10) > 0.5;

    if (droppedToZero || bigNdcgDrop)
    {
        alerts.Add(new
        {
            b.QueryId,
            DeltaPrecisionAt3 = c.PrecisionAt3 - b.PrecisionAt3,
            DeltaNdcgAt10     = c.NdcgAt10     - b.NdcgAt10,
            DroppedToZero     = droppedToZero,
        });
    }
}

return alerts;

Alert on:

  • Any query whose Precision@3 drops to 0.0 from a non-zero baseline.
  • Any query that drops more than 0.5 in NDCG@10.

These are the regressions users will report. Ship without inspecting them and you'll be cleaning up afterwards.
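
The deployment gate from the top of the page can then be a pure comparison of two suite runs. A minimal sketch under stated assumptions: the previous RunSuiteResponse is stored wherever your CI keeps artifacts, the 0.01 tolerance is a placeholder for run-to-run noise, and Gate / GateResult are illustrative names, not part of the platform.

// Sketch of a ship / don't-ship decision over two RunSuiteResponse payloads.
// The tolerance absorbs small run-to-run noise on the aggregates; any
// per-query alert from evaluation/per-query-diff blocks the release outright.
public class GateResult { public bool Ship; public List<string> Reasons = new(); }

static GateResult Gate(RunSuiteResponse baseline, RunSuiteResponse candidate,
                       int perQueryAlerts, double tolerance = 0.01)
{
    var result = new GateResult { Ship = true };

    void Check(string name, double before, double after, bool higherIsBetter = true)
    {
        var regressed = higherIsBetter ? after < before - tolerance
                                       : after > before + tolerance;
        if (regressed)
        {
            result.Ship = false;
            result.Reasons.Add($"{name} regressed: {before:0.000} -> {after:0.000}");
        }
    }

    Check("Precision@1",      baseline.PrecisionAt1, candidate.PrecisionAt1);
    Check("Precision@3",      baseline.PrecisionAt3, candidate.PrecisionAt3);
    Check("NDCG@10",          baseline.NdcgAt10,     candidate.NdcgAt10);
    Check("Recall@50",        baseline.RecallAt50,   candidate.RecallAt50);
    Check("Zero-result rate", baseline.ZeroResult,   candidate.ZeroResult, higherIsBetter: false);

    if (perQueryAlerts > 0)
    {
        result.Ship = false;
        result.Reasons.Add($"{perQueryAlerts} per-query regression alert(s)");
    }

    return result;
}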

Hybrid retrieval evaluation

For hybrid search, evaluate at multiple α values and pick the peak.

| α | Precision@3 | NDCG@10 | Recall@50 |
| --- | --- | --- | --- |
| 0.0 | 0.62 | 0.69 | 0.84 |
| 0.25 | 0.71 | 0.75 | 0.88 |
| 0.5 | 0.78 | 0.81 | 0.92 |
| 0.75 | 0.79 | 0.82 | 0.91 |
| 1.0 | 0.74 | 0.78 | 0.86 |

The example above suggests α = 0.75 for this corpus. Sweep, plot, lock the winner.

Endpoint: evaluation/sweep-alpha

Drives evaluation/run-suite at each α and emits the comparison row by row. The hybrid weight is just a parameter — evaluation/run-search is the one place that has to know what to do with it.

// Endpoint: evaluation/sweep-alpha (Pooling, Read Only)
var alphas = new[] { 0.0, 0.25, 0.5, 0.75, 1.0 };
var rows   = new List<object>();

foreach (var alpha in alphas)
{
    await RelayStatusAsync($"Running suite at α = {alpha}");
    var suite = await RunEndpointAsync<RunSuiteResponse>(
        "evaluation/run-suite",
        new { Alpha = alpha });

    rows.Add(new
    {
        Alpha        = alpha,
        PrecisionAt3 = suite.PrecisionAt3,
        NdcgAt10     = suite.NdcgAt10,
        RecallAt50   = suite.RecallAt50,
    });
}

return rows;
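
Picking the winner from the sweep output is then one sort. A sketch, assuming the rows are read back into a small typed shape (SweepRow is illustrative, matching the anonymous objects returned above): highest NDCG@10 wins, Recall@50 breaks ties.

// Illustrative row shape matching the anonymous objects returned above.
public class SweepRow { public double Alpha, PrecisionAt3, NdcgAt10, RecallAt50; }

// Highest NDCG@10 wins; Recall@50 breaks ties. Returns the chosen alpha.
static double PickAlpha(IReadOnlyList<SweepRow> rows) =>
    rows.OrderByDescending(r => r.NdcgAt10)
        .ThenByDescending(r => r.RecallAt50)
        .First()
        .Alpha;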

When metrics disagree

It happens. NDCG up but zero-result rate also up means the engine got better at the queries it can still answer while losing some it could answer before. Inspect manually before declaring victory.

The hierarchy of trust (from most to least):

  1. Per-query manual inspection — what did a user actually see?
  2. Per-query metric — did this specific query improve?
  3. Aggregate metric — across all queries.
  4. Click-through rate from real users — slowest signal, but the truth.

Golden-set hygiene

  • Version-control the CSV alongside the code; re-run evaluation/import-golden-set whenever it changes so the graph stays in sync.
  • One labeler per query for the first pass; a second labeler reviews, and disagreements are resolved before re-import.
  • Refresh quarterly. Drop queries whose intended answer no longer exists.
  • Tag queries by Surface (global-search, product-page-search, support-search) so you can slice the suite output.
  • Don't sample from logs in a way that under-represents zero-result queries. They matter more than the volume suggests.

Where to go next

The levers this page tells you to pull are covered in Ranking tuning and Hybrid search.
