Ranking Tuning
A workflow for getting "the right results to the top" — without breaking yesterday's results in the process. This page is the step-by-step version of the high-level guidance on the other search pages.
The whole loop is: build a golden set → measure → change one thing → measure again → ship or revert. Tuning is a science only if you have a baseline.
Step 1 — Build a golden set
A golden set is a list of queries with their correct top-3 results, by UID. Without it you're guessing.
Pull the inputs from real usage:
| Source | What to include |
|---|---|
| Query logs | Top 50 queries by volume. |
| Zero-result queries | Top 20 queries that returned nothing. |
| Synonym / acronym cases | 10 queries that depend on product nomenclature. |
| Failure stories from users | 10 queries that real users escalated as "couldn't find X". |
Format as a flat CSV — easy to diff, easy to extend:
```csv
query_id,query,expected_uids,notes
q-0001,"macbook screen flicker","uid-1; uid-2; uid-3","priority high"
q-0002,"battery drain overnight","uid-9; uid-12","known acronym BD"
q-0003,"error 0x80004005","uid-44","exact identifier"
…
```
Aim for at least 80 queries before you trust any score. Refresh the labels every quarter — products move, "correct" results drift.
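A minimal loader for that format, assuming the semicolon-separated expected_uids column shown above (the function name and file handling are illustrative, not a prescribed tool):

```python
import csv

def load_golden_set(path: str) -> dict[str, list[str]]:
    """Map query text -> ordered list of expected UIDs."""
    golden = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            uids = [u.strip() for u in row["expected_uids"].split(";") if u.strip()]
            if not uids:
                raise ValueError(f"{row['query_id']}: no expected UIDs")
            golden[row["query"]] = uids
    if len(golden) < 80:
        print(f"warning: only {len(golden)} labeled queries; scores will be noisy")
    return golden
```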
Step 2 — Measure the baseline
Run every query in the golden set against the current production configuration and compute:
| Metric | What it measures |
|---|---|
| Precision@1 | Was the top result correct? |
| Precision@3 | Were any of the top 3 correct? |
| NDCG@10 | Quality of the top-10 ranking, weighted by position. |
| Recall@50 | Did the correct UID appear in the first 50 hits at all? |
| Zero-result rate | Fraction of golden queries that returned nothing. |
A 30-line script with the search API plus pandas is enough. Save the result as baseline.csv and commit it; it's the contract every change measures against.
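A sketch of that script. run_query is a placeholder for your search API client and must return a ranked list of UIDs; precision here is the binary per-query form this page uses throughout:

```python
import math
import pandas as pd

def ndcg_at_10(ranked: list[str], expected: list[str]) -> float:
    # Binary relevance: 1 if the UID is in the golden labels, else 0.
    dcg = sum(1 / math.log2(i + 2) for i, uid in enumerate(ranked[:10]) if uid in expected)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), 10)))
    return dcg / ideal if ideal else 0.0

def evaluate(golden: dict[str, list[str]], run_query) -> pd.DataFrame:
    rows = []
    for query, expected in golden.items():
        ranked = run_query(query)  # ranked UIDs from the config under test
        rows.append({
            "query": query,
            "p@1": int(bool(ranked) and ranked[0] in expected),
            "p@3": int(any(uid in expected for uid in ranked[:3])),
            "ndcg@10": ndcg_at_10(ranked, expected),
            "recall@50": int(any(uid in expected for uid in ranked[:50])),
            "zero": int(not ranked),
        })
    return pd.DataFrame(rows)

# df = evaluate(load_golden_set("golden.csv"), run_query)
# df.mean(numeric_only=True).to_csv("baseline.csv")  # the committed contract
```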
Step 3 — Change one thing
One. Thing. Not two. Otherwise you can't attribute the effect to the cause. Make the change in a staging workspace, never in production.
The high-leverage levers, ordered by typical impact:
Field selection
Index fewer fields, not more. Boilerplate footers, signature blocks, status codes — none of them help ranking, all of them dilute scores. List every indexed field, justify each.
Field boosts
Boosts compose multiplicatively. A title boost of 5.0 against a body boost of 1.0 is a 5× ratio — usually too aggressive. Start at 2.0 for high-signal fields (Title, Summary), keep body at 1.0, and only adjust if golden-set numbers say so.
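A toy illustration with made-up field scores, showing why a 5× title boost drowns out body evidence:

```python
# Made-up BM25-style scores for one document's fields.
field_scores = {"title": 1.8, "summary": 1.2, "body": 2.5}

aggressive = {"title": 5.0, "summary": 5.0, "body": 1.0}
moderate = {"title": 2.0, "summary": 1.5, "body": 1.0}

for name, boosts in (("aggressive", aggressive), ("moderate", moderate)):
    total = sum(boosts[f] * s for f, s in field_scores.items())
    print(name, round(total, 2))
# aggressive 17.5: title and summary dominate even though body is the best match
# moderate 7.9: body evidence still moves the final score
```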
Facets and scoping
The cheapest precision win: a graph-driven TargetUIDs filter that narrows the candidate set. If the user is on a customer page, every search from that page should be scoped to that customer's nodes. No ranking trick beats not retrieving the wrong tenant in the first place.
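A sketch of what that scoping looks like in a request, assuming a dict-style payload; the request shape and the graph-walk helper are illustrative, not a documented API:

```python
def expand_customer_nodes(customer_uid: str) -> list[str]:
    # Placeholder: in practice, walk the graph from the customer node and
    # collect the UIDs of everything reachable under it.
    return [customer_uid]

def build_scoped_request(query: str, customer_uid: str) -> dict:
    return {
        "Query": query,
        # Candidates limited to the customer's nodes: the ranker never
        # sees the wrong tenant at all.
        "TargetUIDs": expand_customer_nodes(customer_uid),
        "Count": 50,
    }
```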
Hybrid blend (α)
See Hybrid search for the algorithm. Sweep α ∈ {0.0, 0.25, 0.5, 0.75, 1.0} against the golden set and pick the peak NDCG.
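A sweep sketch reusing the evaluate helper from Step 2; make_runner(alpha) is a caller-supplied factory that binds the blend weight into a query function:

```python
def sweep_alpha(golden, make_runner, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for alpha in grid:
        results[alpha] = evaluate(golden, make_runner(alpha))["ndcg@10"].mean()
        print(f"alpha={alpha:.2f}  NDCG@10={results[alpha]:.3f}")
    return max(results, key=results.get)  # the α with peak NDCG@10
```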
Vector cutoff
Without a similarity threshold, every query returns as many hits as you ask for, regardless of actual relevance. Set the cutoff to the lowest similarity score that still gives a correct top-3 on golden queries; typically 0.3–0.5 for cosine similarity.
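One way to derive that number from the golden set itself: take the minimum similarity observed on a correct top-3 hit and cut just below it. run_query_scored is a placeholder returning ranked (uid, cosine score) pairs:

```python
def derive_cutoff(golden, run_query_scored, margin=0.02):
    correct_scores = []
    for query, expected in golden.items():
        for uid, score in run_query_scored(query)[:3]:
            if uid in expected:
                correct_scores.append(score)
    if not correct_scores:
        raise ValueError("no correct top-3 hits to calibrate against")
    # Cut just below the weakest correct hit so no golden answer is dropped.
    return min(correct_scores) - margin
```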
Chunking
For long body fields, vector chunking determines whether the right sentence gets embedded. Smaller chunks lift recall for narrow phrases; larger chunks preserve context. Re-embedding is expensive — change this last, not first.
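For reference, a minimal fixed-size chunker with overlap; the sizes are starting points to sweep against the golden set, not recommendations:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    assert 0 <= overlap < size, "overlap must be smaller than chunk size"
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```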
Recency / TimeDecay
For time-sensitive corpora (news, support cases), bias toward fresh results. Set TimeDecay.Offset to how many days stay full-score and TimeDecay.Scale to how many days the score decays over. Too aggressive and last week's relevant document disappears; too gentle and stale results dominate.
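The decay shape this implies, assuming an exponential curve (your engine's exact decay function may differ):

```python
import math

def time_decay(age_days: float, offset: float, scale: float) -> float:
    if age_days <= offset:
        return 1.0  # within Offset: full score
    return math.exp(-(age_days - offset) / scale)  # then decay over ~Scale days

# offset=7, scale=30: a 7-day-old doc keeps full score;
# a 37-day-old doc is multiplied by e^-1 ≈ 0.37.
```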
Step 4 — Re-measure
Run the same script you ran in Step 2 and compute the delta on each metric; a diff sketch follows this list. Track three signals:
- Direction. Is precision@3 up? NDCG@10 up?
- Magnitude. Is the change meaningful (≥ 2 percentage points) or noise?
- Regression. Did any single query drop from precision@3 = 1 to precision@3 = 0? Inspect those manually — they're the easiest user-visible regressions to ship by accident.
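The diff sketch referenced above, operating on two DataFrames produced by the Step 2 evaluate helper:

```python
import pandas as pd

def diff_runs(base: pd.DataFrame, cand: pd.DataFrame) -> None:
    metrics = ["p@1", "p@3", "ndcg@10", "recall@50", "zero"]
    delta = cand[metrics].mean() - base[metrics].mean()
    print(delta.round(3))  # direction and magnitude in one glance
    merged = base.merge(cand, on="query", suffixes=("_base", "_cand"))
    regressed = merged[(merged["p@3_base"] == 1) & (merged["p@3_cand"] == 0)]
    # These are the user-visible breaks: inspect each one by hand.
    print("regressed queries:", regressed["query"].tolist())
```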
Stop iterating when:
- the metric you care about is up,
- no query regressed by more than one rank position on average, and
- the change is small enough to roll forward and back cleanly.
Step 5 — Roll out safely
Treat ranking changes like code changes. The deployment shape that works:
- Staging. Apply the change. Run the golden set. Eyeball the top 10 for the 10 worst-regressed queries.
- Shadow mode. Run the new config alongside production for a day; log results from both, do not show new results to users. Diff the impressions.
- Canary. Route 10% of traffic to the new config; a routing sketch follows this list. Watch CTR / refinement / time-to-result for one business day.
- Promote. Move to 100% only if canary metrics match or beat the baseline.
- Keep the revert script ready. Ranking changes that look good for an hour can degrade over a week as queries shift.
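The routing sketch referenced above: a deterministic hash split, so each user sees the same config for the whole canary window (the hash choice and user-ID keying are illustrative, not a prescribed mechanism):

```python
import hashlib

def use_canary(user_id: str, percent: int = 10) -> bool:
    # Stable bucket in [0, 100) derived from the user ID.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```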
Vector-specific tuning
Treat vector tuning as a slow loop — every change forces a re-embed.
- Pick fields with intent. Long, descriptive text. Skip status fields.
- Set a real cutoff. No production system should run with applyCutoff: false.
- Match the model to the domain. General-purpose models (MiniLM) under-index on jargon-heavy corpora; consider domain-tuned or larger external models if the golden set shows it.
- Re-embed in off-hours. The index is unavailable until the rebuild completes.
Common pitfalls
- Tuning without evaluation. You will mistake noise for signal.
- Boosting everything. Boost is relative — if every field is boosted, nothing is.
- Indexing too much. Less is more. Cut noise before scoring it.
- Changing two things at once. You won't know what helped or what hurt.
- No revert path. Some changes (re-embed, chunk size) take hours to undo. Stage everything first.
Templates
Golden set CSV
```csv
query_id,query,expected_uids,notes,added_at
q-0001,"macbook screen flicker","uid-1; uid-2; uid-3","p1",2025-12-01
q-0002,"battery drain overnight","uid-9; uid-12","",2025-12-01
```
Run report
Configuration: prod-vs-α0.6, 2026-01-12
Golden set: 84 queries
| metric | baseline | candidate | delta |
|---|---|---|---|
| precision@1 | 0.62 | 0.67 | +0.05 |
| precision@3 | 0.78 | 0.83 | +0.05 |
| NDCG@10 | 0.71 | 0.75 | +0.04 |
| recall@50 | 0.91 | 0.92 | +0.01 |
| zero-result | 2 | 1 | −1 |
worst regression: q-0073 (precision@3 1 → 0)