Ranking Tuning
A workflow for getting "the right results to the top" — without breaking yesterday's results in the process. This page is the step-by-step version of the high-level guidance on the other search pages.
The whole loop is: build a golden set → measure → change one thing → measure again → ship or revert. Tuning is a science only if you have a baseline.
Step 1 — Build a golden set
A golden set is a list of queries with their correct top-3 results, by UID. Without it you're guessing.
Pull the inputs from real usage:
| Source | What to include |
|---|---|
| Query logs | Top 50 queries by volume. |
| Zero-result queries | Top 20 queries that returned nothing. |
| Synonym / acronym cases | 10 queries that depend on product nomenclature. |
| Failure stories from users | 10 queries that real users escalated as "couldn't find X". |
Format as a flat CSV — easy to diff, easy to extend:
```csv
query_id,query,expected_uids,notes
q-0001,"macbook screen flicker","uid-1; uid-2; uid-3","priority high"
q-0002,"battery drain overnight","uid-9; uid-12","known acronym BD"
q-0003,"error 0x80004005","uid-44","exact identifier"
…
```
Aim for at least 80 queries before you trust any score. Refresh the labels every quarter — products move, "correct" results drift.
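A minimal loader for that format, assuming the semicolon-separated expected_uids column shown above (the function name and file handling are illustrative, not a prescribed tool):

```python
import csv

def load_golden_set(path: str) -> dict[str, list[str]]:
    """Map query text -> ordered list of expected UIDs."""
    golden = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            uids = [u.strip() for u in row["expected_uids"].split(";") if u.strip()]
            if not uids:
                raise ValueError(f"{row['query_id']}: no expected UIDs")
            golden[row["query"]] = uids
    if len(golden) < 80:
        print(f"warning: only {len(golden)} labeled queries; scores will be noisy")
    return golden
```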
Step 2 — Measure the baseline
Run every query in the golden set against the current production configuration and compute:
| Metric | What it measures |
|---|---|
| Precision@1 | Was the top result correct? |
| Precision@3 | Were any of the top 3 correct? |
| NDCG@10 | Quality of the top-10 ranking, weighted by position. |
| Recall@50 | Did the correct UID appear in the first 50 hits at all? |
| Zero-result rate | Fraction of golden queries that returned nothing. |
A 30-line script with the search API plus pandas is enough. Save the result as baseline.csv and commit it; it's the contract every change measures against.
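A sketch of that script. run_query is a placeholder for your search API client and must return a ranked list of UIDs; precision here is the binary per-query form this page uses throughout:

```python
import math
import pandas as pd

def ndcg_at_10(ranked: list[str], expected: list[str]) -> float:
    # Binary relevance: 1 if the UID is in the golden labels, else 0.
    dcg = sum(1 / math.log2(i + 2) for i, uid in enumerate(ranked[:10]) if uid in expected)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), 10)))
    return dcg / ideal if ideal else 0.0

def evaluate(golden: dict[str, list[str]], run_query) -> pd.DataFrame:
    rows = []
    for query, expected in golden.items():
        ranked = run_query(query)  # ranked UIDs from the config under test
        rows.append({
            "query": query,
            "p@1": int(bool(ranked) and ranked[0] in expected),
            "p@3": int(any(uid in expected for uid in ranked[:3])),
            "ndcg@10": ndcg_at_10(ranked, expected),
            "recall@50": int(any(uid in expected for uid in ranked[:50])),
            "zero": int(not ranked),
        })
    return pd.DataFrame(rows)

# df = evaluate(load_golden_set("golden.csv"), run_query)
# df.mean(numeric_only=True).to_csv("baseline.csv")  # the committed contract
```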
Step 3 — Change one thing
One. Thing. Not two. Otherwise you can't attribute the effect to the cause. Make the change in a staging workspace, never in production.
The high-leverage levers, ordered by typical impact:
Field selection
Index fewer fields, not more. Boilerplate footers, signature blocks, status codes — none of them help ranking, all of them dilute scores. List every indexed field, justify each.
Field boosts
Boosts compose multiplicatively. A title boost of 5.0 against a body boost of 1.0 is a 5× ratio — usually too aggressive. Start at 2.0 for high-signal fields (Title, Summary), keep body at 1.0, and only adjust if golden-set numbers say so.
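A toy illustration with made-up field scores, showing why a 5× title boost drowns out body evidence:

```python
# Made-up BM25-style scores for one document's fields.
field_scores = {"title": 1.8, "summary": 1.2, "body": 2.5}

aggressive = {"title": 5.0, "summary": 5.0, "body": 1.0}
moderate = {"title": 2.0, "summary": 1.5, "body": 1.0}

for name, boosts in (("aggressive", aggressive), ("moderate", moderate)):
    total = sum(boosts[f] * s for f, s in field_scores.items())
    print(name, round(total, 2))
# aggressive 17.5: title and summary dominate even though body is the best match
# moderate 7.9: body evidence still moves the final score
```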
Facets and scoping
The cheapest precision win: a graph-driven TargetUIDs filter that narrows the candidate set. If the user is on a customer page, every search from that page should be scoped to that customer's nodes. No ranking trick beats not retrieving the wrong tenant in the first place.
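A sketch of what that scoping looks like in a request, assuming a dict-style payload; the request shape and the graph-walk helper are illustrative, not a documented API:

```python
def expand_customer_nodes(customer_uid: str) -> list[str]:
    # Placeholder: in practice, walk the graph from the customer node and
    # collect the UIDs of everything reachable under it.
    return [customer_uid]

def build_scoped_request(query: str, customer_uid: str) -> dict:
    return {
        "Query": query,
        # Candidates limited to the customer's nodes: the ranker never
        # sees the wrong tenant at all.
        "TargetUIDs": expand_customer_nodes(customer_uid),
        "Count": 50,
    }
```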
Hybrid blend (α)
See Hybrid search for the algorithm. Sweep α ∈ {0.0, 0.25, 0.5, 0.75, 1.0} against the golden set and pick the peak NDCG.
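A sweep sketch reusing the evaluate helper from Step 2; make_runner(alpha) is a caller-supplied factory that binds the blend weight into a query function:

```python
def sweep_alpha(golden, make_runner, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for alpha in grid:
        results[alpha] = evaluate(golden, make_runner(alpha))["ndcg@10"].mean()
        print(f"alpha={alpha:.2f}  NDCG@10={results[alpha]:.3f}")
    return max(results, key=results.get)  # the α with peak NDCG@10
```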
Vector cutoff
Without a similarity threshold, every query returns as many hits as you ask for, regardless of actual relevance. Set the cutoff to the lowest similarity score that still gives a correct top-3 on golden queries; typically 0.3–0.5 for cosine similarity.
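One way to derive that number from the golden set itself: take the minimum similarity observed on a correct top-3 hit and cut just below it. run_query_scored is a placeholder returning ranked (uid, cosine score) pairs:

```python
def derive_cutoff(golden, run_query_scored, margin=0.02):
    correct_scores = []
    for query, expected in golden.items():
        for uid, score in run_query_scored(query)[:3]:
            if uid in expected:
                correct_scores.append(score)
    if not correct_scores:
        raise ValueError("no correct top-3 hits to calibrate against")
    # Cut just below the weakest correct hit so no golden answer is dropped.
    return min(correct_scores) - margin
```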
Chunking
For long body fields, vector chunking determines whether the right sentence gets embedded. Smaller chunks lift recall for narrow phrases; larger chunks preserve context. Re-embedding is expensive — change this last, not first.
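For reference, a minimal fixed-size chunker with overlap; the sizes are starting points to sweep against the golden set, not recommendations:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    assert 0 <= overlap < size, "overlap must be smaller than chunk size"
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```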
Recency / TimeDecay
For time-sensitive corpora (news, support cases), bias toward fresh results. Set TimeDecay.Offset to how many days stay full-score and TimeDecay.Scale to how many days the score decays over. Too aggressive and last week's relevant document disappears; too gentle and stale results dominate.
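The decay shape this implies, assuming an exponential curve (your engine's exact decay function may differ):

```python
import math

def time_decay(age_days: float, offset: float, scale: float) -> float:
    if age_days <= offset:
        return 1.0  # within Offset: full score
    return math.exp(-(age_days - offset) / scale)  # then decay over ~Scale days

# offset=7, scale=30: a 7-day-old doc keeps full score;
# a 37-day-old doc is multiplied by e^-1 ≈ 0.37.
```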
Step 4 — Re-measure
Run the same script you ran in Step 2 and compute the delta on each metric; a diff sketch follows this list. Track three signals:
- Direction. Is precision@3 up? NDCG@10 up?
- Magnitude. Is the change meaningful (≥ 2 percentage points) or noise?
- Regression. Did any single query drop from precision@3 = 1 to precision@3 = 0? Inspect those manually — they're the easiest user-visible regressions to ship by accident.
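The diff sketch referenced above, operating on two DataFrames produced by the Step 2 evaluate helper:

```python
import pandas as pd

def diff_runs(base: pd.DataFrame, cand: pd.DataFrame) -> None:
    metrics = ["p@1", "p@3", "ndcg@10", "recall@50", "zero"]
    delta = cand[metrics].mean() - base[metrics].mean()
    print(delta.round(3))  # direction and magnitude in one glance
    merged = base.merge(cand, on="query", suffixes=("_base", "_cand"))
    regressed = merged[(merged["p@3_base"] == 1) & (merged["p@3_cand"] == 0)]
    # These are the user-visible breaks: inspect each one by hand.
    print("regressed queries:", regressed["query"].tolist())
```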
Stop iterating when:
- the metric you care about is up,
- no query regressed by more than one rank position on average, and
- the change is small enough to roll forward and back cleanly.
Step 5 — Roll out safely
Treat ranking changes like code changes. The deployment shape that works:
- Staging. Apply the change. Run the golden set. Eyeball the top 10 for the 10 worst-regressed queries.
- Shadow mode. Run the new config alongside production for a day; log results from both, do not show new results to users. Diff the impressions.
- Canary. Route 10% of traffic to the new config; a routing sketch follows this list. Watch CTR / refinement / time-to-result for one business day.
- Promote. Move to 100% only if canary metrics match or beat the baseline.
- Keep the revert script ready. Ranking changes that look good for an hour can degrade over a week as queries shift.
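The routing sketch referenced above: a deterministic hash split, so each user sees the same config for the whole canary window (the hash choice and user-ID keying are illustrative, not a prescribed mechanism):

```python
import hashlib

def use_canary(user_id: str, percent: int = 10) -> bool:
    # Stable bucket in [0, 100) derived from the user ID.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```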
Vector-specific tuning
Treat vector tuning as a slow loop — every change forces a re-embed.
- Pick fields with intent. Long, descriptive text. Skip status fields.
- Set a real cutoff. No production system should run with applyCutoff: false.
- Match the model to the domain. General-purpose models (MiniLM) under-index on jargon-heavy corpora; consider domain-tuned or larger external models if the golden set shows it.
- Re-embed in off-hours. The index is unavailable until the rebuild completes.
Common pitfalls
- Tuning without evaluation. You will mistake noise for signal.
- Boosting everything. Boost is relative — if every field is boosted, nothing is.
- Indexing too much. Less is more. Cut noise before scoring it.
- Changing two things at once. You won't know what helped or what hurt.
- No revert path. Some changes (re-embed, chunk size) take hours to undo. Stage everything first.
Templates
Golden set CSV
```csv
query_id,query,expected_uids,notes,added_at
q-0001,"macbook screen flicker","uid-1; uid-2; uid-3","p1",2025-12-01
q-0002,"battery drain overnight","uid-9; uid-12","",2025-12-01
```
Run report
Configuration: prod-vs-α0.6, 2026-01-12
Golden set: 84 queries
| metric | baseline | candidate | delta |
|---|---|---|---|
| precision@1 | 0.62 | 0.67 | +0.05 |
| precision@3 | 0.78 | 0.83 | +0.05 |
| NDCG@10 | 0.71 | 0.75 | +0.04 |
| recall@50 | 0.91 | 0.92 | +0.01 |
| zero-result | 2 | 1 | −1 |
worst regression: q-0073 (precision@3 1 → 0)