Grounded Answer Evaluation
How to measure whether an LLM-driven answer surface — grounded Q&A, RAG, support-bot — is actually trustworthy. Retrieval evaluation (covered in Relevance evaluation) tells you whether the engine found the right sources. This page is about whether the model used them correctly.
There are four jobs to evaluate:
- Faithfulness. Is every claim in the answer actually supported by the cited sources?
- Coverage. Does the answer use the sources that retrieval surfaced?
- Citation correctness. Do the cited UIDs actually exist and contain the claim?
- Refusal calibration. Does the system refuse when it should — and answer when it should?
Building the eval set
The same shape as the relevance golden set, but the labels are richer.
question_id,question,expected_answer,expected_citations,must_refuse,notes
qa-0001,"Why does my MacBook Air battery drain overnight?","Firmware 3.2 fixes it","uid-case-42; uid-case-77",false,"flagship case"
qa-0002,"What's the fix for error 0x80004005?","Reinstall the driver","uid-kb-12",false,""
qa-0003,"What's the latest stock price of AAPL?","","",true,"off-domain, should refuse"
qa-0004,"How do I cancel my subscription?","See the billing section","uid-kb-44",false,""
| Column | Meaning |
|---|---|
| question | The user's question. |
| expected_answer | The factual core. Use short paraphrases — not exact strings. |
| expected_citations | UIDs that the answer should cite. |
| must_refuse | true if the system should decline (off-domain, not in corpus). |
| notes | Why this question is in the set. |
Aim for 60–100 questions, weighted toward:
- common user questions (volume in support logs);
- known hallucination triggers (questions that drift outside the corpus);
- adjacent questions whose answers live in nearby but not exactly matching sources;
- refusal cases (off-domain, missing data, sensitive).
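If the golden set lives in the CSV format above, a minimal loader looks something like this (the ;-separated citations cell and the golden.csv file name match the example; everything else is an illustrative sketch):

```python
import csv
from dataclasses import dataclass


@dataclass
class GoldenQuestion:
    question_id: str
    question: str
    expected_answer: str
    expected_citations: list[str]
    must_refuse: bool
    notes: str = ""


def load_golden_set(path: str = "golden.csv") -> list[GoldenQuestion]:
    """Parse the golden set CSV into typed records."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            GoldenQuestion(
                question_id=row["question_id"],
                question=row["question"],
                expected_answer=row["expected_answer"],
                # expected_citations is one cell with ";"-separated UIDs
                expected_citations=[
                    c.strip() for c in row["expected_citations"].split(";") if c.strip()
                ],
                must_refuse=row["must_refuse"].strip().lower() == "true",
                notes=row.get("notes", ""),
            )
            for row in csv.DictReader(f)
        ]
```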
Metrics
Faithfulness
The most important metric. For each answer:
- Decompose the answer into atomic claims.
- For each claim, check whether any cited source contains it (verbatim or paraphrased).
- Faithfulness = supported_claims / total_claims.
Two ways to compute:
| Approach | Pros | Cons |
|---|---|---|
| Human label | Most accurate. | Slow. ~3 minutes per answer. |
| LLM judge | Fast. Cheap. | Biased; same-model judges over-credit. |
LLM-as-judge is good enough for trend tracking; human labels are good enough for go/no-go.
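A sketch of the LLM-judge variant. The call_llm stub, the prompts, and the helper names are illustrative, not a fixed API:

```python
def call_llm(prompt: str) -> str:
    """Stub for whatever LLM client you use; replace with a real call."""
    raise NotImplementedError


def decompose_claims(answer: str) -> list[str]:
    """Ask the judge model to split the answer into atomic claims, one per line."""
    out = call_llm(
        "Split the following answer into short, self-contained factual claims, "
        "one per line, no numbering:\n\n" + answer
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def claim_supported(claim: str, sources: list[str]) -> bool:
    """Ask the judge whether any cited source supports the claim, verbatim or paraphrased."""
    verdict = call_llm(
        "Sources:\n" + "\n---\n".join(sources)
        + f"\n\nClaim: {claim}\n\nDoes any source support this claim? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")


def faithfulness(answer: str, cited_sources: list[str]) -> float:
    """supported_claims / total_claims; an answer with no claims scores 1.0 by convention."""
    claims = decompose_claims(answer)
    if not claims:
        return 1.0
    supported = sum(claim_supported(c, cited_sources) for c in claims)
    return supported / len(claims)
```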
Coverage
coverage = |cited_in_answer ∩ retrieved_relevant| / |retrieved_relevant|
Did the answer use the right sources, given what retrieval returned? Low coverage with high faithfulness means the model is ignoring relevant sources.
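As a set operation this is one function; retrieved_relevant_uids here means the retrieval-relevant UIDs for the question (names are illustrative):

```python
def coverage(cited_uids: set[str], retrieved_relevant_uids: set[str]) -> float:
    """Fraction of retrieval-relevant sources the answer actually cites."""
    if not retrieved_relevant_uids:
        return 1.0  # nothing relevant was retrieved, so nothing was ignored
    return len(cited_uids & retrieved_relevant_uids) / len(retrieved_relevant_uids)
```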
Citation correctness
Three checks, all binary:
- Exists. Each cited UID resolves to a real node.
- Reachable. The user who asked the question has permission to see each citation.
- Supports. The cited node actually contains the claim it's anchored to.
Score: citation_correct = the AND of the three per citation, averaged over the answer's citations.
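A sketch of the per-citation scoring, with the three checks injected as callables so the scorer stays agnostic to your storage and permission layers. All names are illustrative; the supports check can reuse the claim-support judge from the faithfulness sketch:

```python
from typing import Callable


def citation_correctness(
    citations: list[tuple[str, str]],           # (uid, claim the citation is anchored to)
    node_exists: Callable[[str], bool],          # uid -> does it resolve to a real node?
    user_can_read: Callable[[str], bool],        # uid -> can the asking user see it?
    node_supports: Callable[[str, str], bool],   # (uid, claim) -> does the node contain it?
) -> float:
    """AND of exists / reachable / supports per citation, averaged over citations."""
    if not citations:
        return 0.0  # an uncited answer has nothing verifiable
    passed = 0
    for uid, claim in citations:
        passed += node_exists(uid) and user_can_read(uid) and node_supports(uid, claim)
    return passed / len(citations)
```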
Refusal calibration
For questions labeled must_refuse = true:
- Did the system refuse? (should_refuse_did = 1 if yes, else 0)
For questions labeled must_refuse = false:
- Did the system answer? (should_answer_did = 1 if yes, else 0)
Calibration = mean of both. A perfectly calibrated system scores 1.0. A system that never refuses (pathological hallucination) scores 0.5: perfect on the should-answer questions, zero on the must-refuse ones. Pathological over-refusal can fall to 0.0 on a set with no must_refuse = true questions.
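A sketch of the calibration score, assuming one result record per golden question with the refused and should_refuse fields from the rubric below; weighting the two groups equally is an assumption:

```python
def refusal_calibration(results: list[dict]) -> float:
    """Mean of (refusal rate on must-refuse questions) and (answer rate on the rest)."""
    refuse_qs = [r for r in results if r["should_refuse"]]
    answer_qs = [r for r in results if not r["should_refuse"]]
    scores = []
    if refuse_qs:
        scores.append(sum(r["refused"] for r in refuse_qs) / len(refuse_qs))
    if answer_qs:
        scores.append(sum(not r["refused"] for r in answer_qs) / len(answer_qs))
    return sum(scores) / len(scores) if scores else 0.0
```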
Hallucination risk
A composite signal worth tracking on its own.
hallucination_risk = 1 - (faithfulness × citation_correct)
A 5% hallucination risk on user-facing surfaces is roughly the threshold where users start to lose trust. Track it weekly; alert if it climbs.
A scoring rubric
For each Q&A pair, compute and store:
| Field | Type | Notes |
|---|---|---|
| faithfulness | float [0, 1] | Supported claims / total claims. |
| coverage | float [0, 1] | Cited / retrieved-relevant. |
| citation_correct | float [0, 1] | Exists × reachable × supports. |
| refused | bool | Did the system decline to answer? |
| should_refuse | bool | Ground truth from the golden set. |
| latency_ms | int | End-to-end answer time. |
| tool_calls | int | How many endpoint calls during the loop. |
| hallucination_risk | float [0, 1] | 1 − faithfulness × citation_correct. |
Aggregate weekly. Slice by question category. Plot the trend.
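A weekly aggregation can be a few lines of pandas; the JSONL path and the extra week and category fields are assumptions for this sketch:

```python
import pandas as pd

# One record per evaluated Q&A pair, shaped like the rubric above; the file
# path and the "week" / "category" fields are assumptions for this sketch.
df = pd.read_json("eval-results.jsonl", lines=True)
df["hallucination_risk"] = 1 - df["faithfulness"] * df["citation_correct"]

weekly = (
    df.groupby(["week", "category"])[
        ["faithfulness", "coverage", "citation_correct",
         "hallucination_risk", "latency_ms"]
    ]
    .mean()
    .round(3)
)
print(weekly)
```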
Common failure modes
| Symptom | Likely cause |
|---|---|
| High coverage, low faithfulness | Model paraphrases sources inaccurately. Tighten the prompt (see Prompting patterns). |
| Low coverage, high faithfulness | Retrieval surfaced relevant docs but model ignored them. Inspect prompt — context too long? sources truncated? |
| High citation incorrectness | Model invents UIDs. Reject answers whose citations don't exist; force retry. |
| Over-refusal | Refusal rules too aggressive, or retrieval threshold too high. |
| Under-refusal (lots of hallucination) | No refusal pathway, or model has no way to express uncertainty. |
| Long latency without quality lift | Too many tool calls. Cap the agent loop; cache popular reads. |
CI integration
The eval set should run on every change to:
- the retrieval pipeline (search config, embedding model, chunk size),
- the LLM provider or model version,
- the prompt template,
- any tool the agent can call.
Treat the suite like a unit-test gate: don't merge if hallucination_risk regresses by more than the noise threshold (typically 0.02).
# CI gate
python eval.py --suite golden.csv --config staging
risk=$(jq -r '.hallucination_risk' eval-report.json)
if awk -v r="$risk" 'BEGIN { exit !(r > 0.05) }'; then
  echo "Hallucination risk exceeded threshold (${risk} > 0.05)"
  exit 1
fi
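The gate above enforces an absolute ceiling. To also enforce the regression rule, one option is to compare the new report against the last accepted baseline; the baseline-report.json path is an assumption, and the 0.02 threshold comes from the paragraph above:

```python
# regression_gate.py: a sketch. Paths and the 0.02 noise threshold are assumptions.
import json
import sys

NOISE_THRESHOLD = 0.02  # tune to the observed run-to-run variance of your suite

with open("eval-report.json") as f:
    current = json.load(f)["hallucination_risk"]
with open("baseline-report.json") as f:
    baseline = json.load(f)["hallucination_risk"]

if current - baseline > NOISE_THRESHOLD:
    print(f"hallucination_risk regressed: {baseline:.3f} -> {current:.3f}")
    sys.exit(1)
```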
Auditability
For every user-facing answer in production, persist:
{
"question": "...",
"user_uid": "...",
"retrieved_uids": ["...", "..."],
"context_tokens": 3214,
"model": "claude-sonnet-4-6",
"prompt_hash": "sha256:...",
"answer": "...",
"citations": ["...", "..."],
"tool_calls": 4,
"latency_ms": 2104,
"feedback": null
}
When a user reports "the bot lied to me," this is the only thing that lets you diagnose what happened. Treat it as required.
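A minimal persistence sketch, assuming an append-only JSONL file is acceptable; the path and the added timestamp field are assumptions:

```python
import json
import time


def log_answer_record(record: dict, path: str = "answer-audit.jsonl") -> None:
    """Append one audit record per user-facing answer; field names mirror the example above."""
    record.setdefault("timestamp", int(time.time()))
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```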
Refusal isn't failure
Refusal is a feature. Production grounded Q&A surfaces should refuse 10–30% of long-tail queries. If your refusal rate is 0%, you're hallucinating somewhere; track down which questions slipped past the guardrails.
Where to go next
- Prompting patterns — templates, refusal patterns.
- LLM agents — the multi-step loop you're evaluating.
- Relevance evaluation — the retrieval side of the same problem.
- Metrics reference — wiring per-tool error rates.