Relevance Evaluation

How to measure whether search and retrieval are actually getting better — instead of just feeling like they are. This page is the methodology; the levers it tells you to pull are in Ranking tuning and Hybrid search.

Why evaluation matters

Without a metric, every ranking change feels like an improvement. With a metric, you know which changes actually help, which regress something else, and when to stop tuning.

Three things make evaluation tractable in practice:

  1. A golden set of queries with labeled correct results.
  2. A measurement script that runs the golden set against any configuration.
  3. A deployment gate that ships only configurations that beat the baseline.

That's the whole loop.

Building a golden set

The golden set is the contract every ranking change measures against.

Picking queries

| Source | Target count | Why |
| --- | --- | --- |
| Top queries by volume | 50 | Catch most user traffic. |
| Zero-result queries | 20 | Reveal corpus / alias gaps. |
| Domain-jargon queries | 10 | Acronyms, product names, internal terms. |
| Long-tail / underserved queries | 10 | The 80/20 long tail still matters at scale. |
| Known failures from user feedback | 10 | Real escalations from real users. |

Aim for 80–120 queries. Too few and noise dominates; too many and labeling is unmaintainable.

Labels

For each query, label the expected top-3 UIDs, ideally ranked:

query_id,query,expected_uids,priority,notes,added_at
q-0001,"macbook screen flicker","uid-1; uid-2; uid-3",p1,"flagship product",2026-01-15
q-0002,"battery drain overnight","uid-9; uid-12",p2,"known acronym BD",2026-01-15
q-0003,"error 0x80004005","uid-44",p1,"exact identifier",2026-01-15

Conventions that work:

  • Label by UID, not by text — text drifts, UIDs don't.
  • Three labelers in parallel, then resolve disagreements. Single-labeler golden sets reflect one person's biases.
  • Tag queries (priority, category) so you can slice the report.
  • Date-stamp the labels. Quarterly refreshes are mandatory.

Capturing the golden set in the graph

The CSV is the human-friendly input format. The graph is what the evaluation endpoints actually read from — that way the same labels are versioned alongside the rest of your workspace, queryable, and reachable from any other endpoint.

Node schema — GoldenQuery

A GoldenQuery node represents one labeled query in the suite. The QueryId is the natural key so re-imports update the same node.

| Field | Type | Notes |
| --- | --- | --- |
| QueryId | string (key) | Stable id from the CSV, e.g. q-0001. |
| QueryText | string | The user-facing query string passed to search. |
| Priority | string | p1 / p2 / p3 — used to weight aggregates and gate regressions. |
| Category | string | Free-form tag (product, error-code, jargon, …). |
| Surface | string | Which UI surface the query belongs to (global-search, support). |
| ExpectedUids | list<string> | Labeled correct UIDs in expected rank order. |
| Notes | string | Free-form labeler note. |
| AddedAt | Time | Date-stamp for hygiene refreshes. |

Edge schema — Expects

Each GoldenQuery also gets one outgoing Expects edge per labeled target. The edge makes the dataset traversable from either side ("which queries label this case?") and survives even when a labeled node is later renamed or moved.

| Edge | From | To | Notes |
| --- | --- | --- | --- |
| Expects | GoldenQuery | labeled target node | One per (query, target); order is kept in ExpectedUids. |

Endpoint: evaluation/import-golden-set

Bulk-import the labeling CSV into the graph. Re-runnable — existing GoldenQuery nodes are updated in place.

// Endpoint: evaluation/import-golden-set
// Body: the raw CSV (same shape as the table above).
var csv   = Body;
var lines = csv.Split('\n').Skip(1).Where(l => !string.IsNullOrWhiteSpace(l));
int count = 0;

foreach (var line in lines)
{
    var cols          = line.Split(',');
    var queryId       = cols[0].Trim().Trim('"');
    var queryText     = cols[1].Trim().Trim('"');
    var expectedUids  = cols[2].Trim().Trim('"')
                            .Split(';', StringSplitOptions.RemoveEmptyEntries)
                            .Select(u => u.Trim())
                            .ToList();
    var priority      = cols[3].Trim().Trim('"');
    var notes         = cols.Length > 4 ? cols[4].Trim().Trim('"') : "";

    var query = await Graph.GetOrAddLockedAsync(N.GoldenQuery.Type, queryId);
    try
    {
        query.SetString(N.GoldenQuery.QueryText,    queryText);
        query.SetString(N.GoldenQuery.Priority,     priority);
        query.SetString(N.GoldenQuery.Notes,        notes);
        query.SetStringList(N.GoldenQuery.ExpectedUids, expectedUids);
        query.SetTime  (N.GoldenQuery.AddedAt,      Time.Now);

        foreach (var uid in expectedUids)
        {
            var target     = UID128.Parse(uid);
            var targetType = Graph.GetNodeTypeUIDForUID(target);
            query.AddUniqueEdge(E.Expects, target, targetType);
        }

        await Graph.CommitAsync(query);
        count++;
    }
    catch
    {
        Graph.AbandonChanges(query);
        throw;
    }
}

return Ok($"Imported {count} golden queries");
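
One caveat on the Split(',') above: it works as long as quoted fields never contain commas, which holds for the sample CSV. If your query texts can contain commas, swap in a quote-aware splitter. A minimal sketch follows; SplitCsvLine is a local helper written for this page, not a platform API.

// Minimal quote-aware CSV field splitter (sketch). Handles double-quoted
// fields and doubled quotes ("") as escaped quotes; embedded newlines are
// not supported. Usage inside the loop above: var cols = SplitCsvLine(line);
static List<string> SplitCsvLine(string line)
{
    var fields   = new List<string>();
    var current  = new System.Text.StringBuilder();
    var inQuotes = false;

    for (int i = 0; i < line.Length; i++)
    {
        var c = line[i];
        if (c == '"')
        {
            // A doubled quote inside a quoted field is an escaped quote.
            if (inQuotes && i + 1 < line.Length && line[i + 1] == '"') { current.Append('"'); i++; }
            else inQuotes = !inQuotes;
        }
        else if (c == ',' && !inQuotes)
        {
            fields.Add(current.ToString().Trim());
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }

    fields.Add(current.ToString().Trim());
    return fields;
}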

Metrics

The five numbers worth tracking:

| Metric | What it measures | When it's the right metric |
| --- | --- | --- |
| Precision@1 | Is the first result correct? | Surfaces where users click only the top hit ("I'm feeling lucky"). |
| Precision@3 | Fraction of the top 3 that are correct. | Most "user types and scrolls a bit" surfaces. |
| NDCG@10 | Position-weighted quality of the top 10; rewards correct results higher up. | When ordering of the top 10 matters. |
| Recall@50 | Did a correct UID appear anywhere in the top 50? | A floor check: "the engine isn't blind to this query." |
| Zero-result rate | Fraction of golden queries returning nothing. | A failure mode that ranking can't fix. |

NDCG is the most useful single number for "is ranking improving?" — but track all five; they reveal different failure modes.
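
A quick worked example of the gain used by the NDCG helper below (1 / log2(rank + 1), counting ranks from 1): a query with a single labeled UID that shows up at rank 3 gets DCG@10 = 1 / log2(4) = 0.5, the ideal DCG (that label at rank 1) is 1 / log2(2) = 1.0, so NDCG@10 = 0.5. The same label at rank 1 scores 1.0; outside the top 10 it scores 0.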

Endpoint: evaluation/metrics

Pure helpers. Doesn't read or write the graph — it exists only to be imported by the scoring endpoints via //ImportEndpoint("evaluation/metrics").

// Endpoint: evaluation/metrics
// Shared metric helpers. Imported by evaluation/score-query and any other
// endpoint that needs to compute Precision@k / NDCG@k against a label set.
public static class Metrics
{
    public static double PrecisionAtK(IReadOnlyList<UID128> retrieved, ISet<UID128> expected, int k)
    {
        if (k <= 0 || retrieved.Count == 0) return 0;
        var hits = retrieved.Take(k).Count(expected.Contains);
        return hits / (double)k;
    }

    public static double ReciprocalRank(IReadOnlyList<UID128> retrieved, ISet<UID128> expected)
    {
        for (int i = 0; i < retrieved.Count; i++)
            if (expected.Contains(retrieved[i])) return 1.0 / (i + 1);
        return 0.0;
    }

    public static double DcgAtK(IReadOnlyList<UID128> retrieved, ISet<UID128> expected, int k)
    {
        double sum = 0;
        var cap = Math.Min(k, retrieved.Count);
        for (int i = 0; i < cap; i++)
            if (expected.Contains(retrieved[i]))
                sum += 1.0 / Math.Log2(i + 2);
        return sum;
    }

    public static double NdcgAtK(IReadOnlyList<UID128> retrieved, IReadOnlyList<UID128> expectedRanked, int k)
    {
        var expectedSet = new HashSet<UID128>(expectedRanked);
        var ideal       = DcgAtK(expectedRanked, expectedSet, k);
        return ideal == 0 ? 0 : DcgAtK(retrieved, expectedSet, k) / ideal;
    }

    public static double RecallAtK(IReadOnlyList<UID128> retrieved, ISet<UID128> expected, int k)
        => retrieved.Take(k).Any(expected.Contains) ? 1.0 : 0.0;
}

Running the suite

Three endpoints, each with a single responsibility:

| Endpoint | Reads | Writes | Calls |
| --- | --- | --- | --- |
| evaluation/run-search | search index | | |
| evaluation/score-query | golden set + search index | | evaluation/run-search |
| evaluation/run-suite | golden set | | evaluation/score-query (per query) |

Endpoint: evaluation/run-search

This is the one place that knows how search is invoked. Swap this body when you change the retrieval surface (BM25 vs hybrid vs semantic) without touching any of the scoring code.

// Endpoint: evaluation/run-search
// Returns the top-50 UIDs for a single query. Read-only.
public class RunSearchRequest  { public string Query;   public double? Alpha; }
public class RunSearchResponse { public List<UID128> RetrievedUids; }

var req = Body.FromJson<RunSearchRequest>();

// req.Alpha is accepted but unused in this keyword-only body; a hybrid body
// would use it to blend keyword and semantic rankings (see the sketch below).
var uids = Q()
    .StartSearch(N.SupportCase.Type, N.SupportCase.Body, SearchExpression.Parse(req.Query))
    .Take(50)
    .ToUIDList();

return new RunSearchResponse { RetrievedUids = uids };
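
RunSearchRequest.Alpha is carried through the scoring endpoints but ignored by the keyword-only body above. As a sketch of what a hybrid body could do with it, here is a rank-based fusion of two ranked UID lists, weighting the keyword ranking by (1 - α) and a semantic ranking by α. How the semantic list is produced is out of scope here, and BlendByAlpha is an illustrative helper, not part of the platform.

// Sketch: weighted reciprocal-rank fusion of two ranked lists.
// alpha = 0 reproduces the keyword order, alpha = 1 the semantic order.
static List<UID128> BlendByAlpha(IReadOnlyList<UID128> keywordRanked,
                                 IReadOnlyList<UID128> semanticRanked,
                                 double alpha, int take = 50)
{
    var scores = new Dictionary<UID128, double>();

    void Accumulate(IReadOnlyList<UID128> ranked, double weight)
    {
        for (int i = 0; i < ranked.Count; i++)
        {
            var contribution = weight / (i + 1);   // rank 1 → weight, rank 2 → weight/2, ...
            scores[ranked[i]] = scores.TryGetValue(ranked[i], out var s)
                ? s + contribution
                : contribution;
        }
    }

    Accumulate(keywordRanked,  1.0 - alpha);
    Accumulate(semanticRanked, alpha);

    return scores.OrderByDescending(kv => kv.Value)
                 .Take(take)
                 .Select(kv => kv.Key)
                 .ToList();
}

With something like that in place, the body above becomes: run keyword retrieval, run semantic retrieval, then return BlendByAlpha(keyword, semantic, req.Alpha ?? 0.5).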

Endpoint: evaluation/score-query

Pulls the labels for one GoldenQuery out of the graph, calls evaluation/run-search for the same text, and uses the imported Metrics helpers. The endpoint name in the RunEndpointAsync call matches the path declared in the endpoint above.

// Endpoint: evaluation/score-query
//ImportEndpoint("evaluation/metrics")

public class ScoreQueryRequest  { public string QueryId; public double? Alpha; }
public class ScoreQueryResponse
{
    public string  QueryId;
    public double  PrecisionAt1, PrecisionAt3, NdcgAt10, RecallAt50;
    public bool    Zero;
}

var req = Body.FromJson<ScoreQueryRequest>();

var goldenNode = Q().StartAt(N.GoldenQuery.Type, req.QueryId).AsEnumerable().FirstOrDefault();
if (goldenNode is null) return NotFound($"GoldenQuery '{req.QueryId}' not in graph");

var queryText    = goldenNode.GetString(N.GoldenQuery.QueryText);
var expectedList = goldenNode.GetStringList(N.GoldenQuery.ExpectedUids)
                            .Select(UID128.Parse)
                            .ToList();
var expectedSet  = new HashSet<UID128>(expectedList);

var search = await RunEndpointAsync<RunSearchResponse>(
    "evaluation/run-search",
    new RunSearchRequest { Query = queryText, Alpha = req.Alpha });

var retrieved = search.RetrievedUids;

return new ScoreQueryResponse
{
    QueryId      = req.QueryId,
    PrecisionAt1 = Metrics.PrecisionAtK(retrieved, expectedSet, 1),
    PrecisionAt3 = Metrics.PrecisionAtK(retrieved, expectedSet, 3),
    NdcgAt10     = Metrics.NdcgAtK   (retrieved, expectedList, 10),
    RecallAt50   = Metrics.RecallAtK (retrieved, expectedSet, 50),
    Zero         = retrieved.Count == 0,
};

Endpoint: evaluation/run-suite

Iterates every GoldenQuery in the graph, defers per-query work to evaluation/score-query, and aggregates. Relay status messages keep long runs observable.

// Endpoint: evaluation/run-suite (Pooling, Read Only)
public class RunSuiteRequest  { public double? Alpha; }
public class RunSuiteResponse
{
    public double PrecisionAt1, PrecisionAt3, NdcgAt10, RecallAt50, ZeroResult;
    public List<ScoreQueryResponse> PerQuery;
}

var req     = Body.FromJson<RunSuiteRequest>() ?? new RunSuiteRequest();
var queries = Q().StartAt(N.GoldenQuery.Type).ToList();
var scored  = new List<ScoreQueryResponse>(queries.Count);

for (int i = 0; i < queries.Count; i++)
{
    var qid    = queries[i].GetString(N.GoldenQuery.QueryId);
    var result = await RunEndpointAsync<ScoreQueryResponse>(
        "evaluation/score-query",
        new ScoreQueryRequest { QueryId = qid, Alpha = req.Alpha });

    scored.Add(result);
    await RelayStatusAsync($"Scored {qid} ({i + 1}/{queries.Count})");
}

double Mean(Func<ScoreQueryResponse, double> f) => scored.Count == 0 ? 0 : scored.Average(f);

return new RunSuiteResponse
{
    PrecisionAt1 = Math.Round(Mean(r => r.PrecisionAt1), 3),
    PrecisionAt3 = Math.Round(Mean(r => r.PrecisionAt3), 3),
    NdcgAt10     = Math.Round(Mean(r => r.NdcgAt10),     3),
    RecallAt50   = Math.Round(Mean(r => r.RecallAt50),   3),
    ZeroResult   = Math.Round(scored.Count(r => r.Zero) / (double)Math.Max(scored.Count, 1), 3),
    PerQuery     = scored,
};

Wire evaluation/run-suite into CI on a nightly schedule. Diff against the previous run.

Regression testing

Aggregate metrics smooth over individual failures. Track per-query regression alongside the aggregate.

Endpoint: evaluation/per-query-diff

Takes the per-query results (the PerQuery lists) from two prior run-suite runs and returns the regressions worth alerting on.

// Endpoint: evaluation/per-query-diff
public class DiffRequest
{
    public List<ScoreQueryResponse> Baseline;
    public List<ScoreQueryResponse> Candidate;
}

var req     = Body.FromJson<DiffRequest>();
var byId    = req.Candidate.ToDictionary(c => c.QueryId);
var alerts  = new List<object>();

foreach (var b in req.Baseline)
{
    if (!byId.TryGetValue(b.QueryId, out var c)) continue;

    var droppedToZero = b.PrecisionAt3 > 0.0 && c.PrecisionAt3 == 0.0;
    var bigNdcgDrop   = (b.NdcgAt10 - c.NdcgAt10) > 0.5;

    if (droppedToZero || bigNdcgDrop)
    {
        alerts.Add(new
        {
            b.QueryId,
            DeltaPrecisionAt3 = c.PrecisionAt3 - b.PrecisionAt3,
            DeltaNdcgAt10     = c.NdcgAt10     - b.NdcgAt10,
            DroppedToZero     = droppedToZero,
        });
    }
}

return alerts;

Alert on:

  • Any query whose Precision@3 drops to 0.0 from a non-zero baseline.
  • Any query that drops more than 0.5 in NDCG@10.

These are the regressions users will report. Ship without inspecting them and you'll be cleaning up afterwards.
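
The deployment gate from the top of the page can then be a pure comparison of two suite runs. A minimal sketch under stated assumptions: the previous RunSuiteResponse is stored wherever your CI keeps artifacts, the 0.01 tolerance is a placeholder for run-to-run noise, and Gate / GateResult are illustrative names, not part of the platform.

// Sketch of a ship / don't-ship decision over two RunSuiteResponse payloads.
// The tolerance absorbs small run-to-run noise on the aggregates; any
// per-query alert from evaluation/per-query-diff blocks the release outright.
public class GateResult { public bool Ship; public List<string> Reasons = new(); }

static GateResult Gate(RunSuiteResponse baseline, RunSuiteResponse candidate,
                       int perQueryAlerts, double tolerance = 0.01)
{
    var result = new GateResult { Ship = true };

    void Check(string name, double before, double after, bool higherIsBetter = true)
    {
        var regressed = higherIsBetter ? after < before - tolerance
                                       : after > before + tolerance;
        if (regressed)
        {
            result.Ship = false;
            result.Reasons.Add($"{name} regressed: {before:0.000} -> {after:0.000}");
        }
    }

    Check("Precision@1",      baseline.PrecisionAt1, candidate.PrecisionAt1);
    Check("Precision@3",      baseline.PrecisionAt3, candidate.PrecisionAt3);
    Check("NDCG@10",          baseline.NdcgAt10,     candidate.NdcgAt10);
    Check("Recall@50",        baseline.RecallAt50,   candidate.RecallAt50);
    Check("Zero-result rate", baseline.ZeroResult,   candidate.ZeroResult, higherIsBetter: false);

    if (perQueryAlerts > 0)
    {
        result.Ship = false;
        result.Reasons.Add($"{perQueryAlerts} per-query regression alert(s)");
    }

    return result;
}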

Hybrid retrieval evaluation

For hybrid search, evaluate at multiple α values and pick the peak.

| α | Precision@3 | NDCG@10 | Recall@50 |
| --- | --- | --- | --- |
| 0.0 | 0.62 | 0.69 | 0.84 |
| 0.25 | 0.71 | 0.75 | 0.88 |
| 0.5 | 0.78 | 0.81 | 0.92 |
| 0.75 | 0.79 | 0.82 | 0.91 |
| 1.0 | 0.74 | 0.78 | 0.86 |

The example above suggests α = 0.75 for this corpus. Sweep, plot, lock the winner.

Endpoint: evaluation/sweep-alpha

Drives evaluation/run-suite at each α and emits the comparison row by row. The hybrid weight is just a parameter — evaluation/run-search is the one place that has to know what to do with it.

// Endpoint: evaluation/sweep-alpha (Pooling, Read Only)
var alphas = new[] { 0.0, 0.25, 0.5, 0.75, 1.0 };
var rows   = new List<object>();

foreach (var alpha in alphas)
{
    await RelayStatusAsync($"Running suite at α = {alpha}");
    var suite = await RunEndpointAsync<RunSuiteResponse>(
        "evaluation/run-suite",
        new { Alpha = alpha });

    rows.Add(new
    {
        Alpha        = alpha,
        PrecisionAt3 = suite.PrecisionAt3,
        NdcgAt10     = suite.NdcgAt10,
        RecallAt50   = suite.RecallAt50,
    });
}

return rows;
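
Picking the winner from the sweep output is then one sort. A sketch, assuming the rows are read back into a small typed shape (SweepRow is illustrative, matching the anonymous objects returned above): highest NDCG@10 wins, Recall@50 breaks ties.

// Illustrative row shape matching the anonymous objects returned above.
public class SweepRow { public double Alpha, PrecisionAt3, NdcgAt10, RecallAt50; }

// Highest NDCG@10 wins; Recall@50 breaks ties. Returns the chosen alpha.
static double PickAlpha(IReadOnlyList<SweepRow> rows) =>
    rows.OrderByDescending(r => r.NdcgAt10)
        .ThenByDescending(r => r.RecallAt50)
        .First()
        .Alpha;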

When metrics disagree

It happens. NDCG up but zero-result rate also up means the engine got better at the queries it can still answer while losing some it could answer before. Inspect manually before declaring victory.

The hierarchy of trust (from most to least):

  1. Per-query manual inspection — what did a user actually see?
  2. Per-query metric — did this specific query improve?
  3. Aggregate metric — across all queries.
  4. Click-through rate from real users — slowest signal, but the truth.

Golden-set hygiene

  • Version-control the CSV alongside the code; re-run evaluation/import-golden-set whenever it changes so the graph stays in sync.
  • One labeler per query for the first pass; a second labeler reviews, and disagreements are resolved before re-import.
  • Refresh quarterly. Drop queries whose intended answer no longer exists.
  • Tag queries by Surface (global-search, product-page-search, support-search) so you can slice the suite output.
  • Don't sample from logs in a way that under-represents zero-result queries. They matter more than the volume suggests.

Where to go next

The levers this page tells you to pull are covered in Ranking tuning and Hybrid search.
