Grounded Answer Evaluation
How to measure whether an LLM-driven answer surface — grounded Q&A, RAG, support-bot — is actually trustworthy. Retrieval evaluation (covered in Relevance evaluation) tells you whether the engine found the right sources. This page is about whether the model used them correctly.
There are four jobs to evaluate:
- Faithfulness. Is every claim in the answer actually supported by the cited sources?
- Coverage. Does the answer use the sources that retrieval surfaced?
- Citation correctness. Do the cited UIDs actually exist and contain the claim?
- Refusal calibration. Does the system refuse when it should — and answer when it should?
Building the eval set
The same shape as the relevance golden set, but the labels are richer.
question_id,question,expected_answer,expected_citations,must_refuse,notes
qa-0001,"Why does my MacBook Air battery drain overnight?","Firmware 3.2 fixes it","uid-case-42; uid-case-77",false,"flagship case"
qa-0002,"What's the fix for error 0x80004005?","Reinstall the driver","uid-kb-12",false,""
qa-0003,"What's the latest stock price of AAPL?","","",true,"off-domain, should refuse"
qa-0004,"How do I cancel my subscription?","See the billing section","uid-kb-44",false,""
| Column | Meaning |
|---|---|
| question | The user's question. |
| expected_answer | The factual core. Use short paraphrases — not exact strings. |
| expected_citations | UIDs that the answer should cite. |
| must_refuse | true if the system should decline (off-domain, not in corpus). |
| notes | Why this question is in the set. |
Aim for 60–100 questions, weighted toward:
- common user questions (volume in support logs);
- known hallucination triggers (questions that drift outside the corpus);
- adjacent questions whose answers live in nearby but not exactly matching sources;
- refusal cases (off-domain, missing data, sensitive).
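If the golden set lives in the CSV format above, a minimal loader looks something like this (the ;-separated citations cell and the golden.csv file name match the example; everything else is an illustrative sketch):

```python
import csv
from dataclasses import dataclass


@dataclass
class GoldenQuestion:
    question_id: str
    question: str
    expected_answer: str
    expected_citations: list[str]
    must_refuse: bool
    notes: str = ""


def load_golden_set(path: str = "golden.csv") -> list[GoldenQuestion]:
    """Parse the golden set CSV into typed records."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            GoldenQuestion(
                question_id=row["question_id"],
                question=row["question"],
                expected_answer=row["expected_answer"],
                # expected_citations is one cell with ";"-separated UIDs
                expected_citations=[
                    c.strip() for c in row["expected_citations"].split(";") if c.strip()
                ],
                must_refuse=row["must_refuse"].strip().lower() == "true",
                notes=row.get("notes", ""),
            )
            for row in csv.DictReader(f)
        ]
```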
Metrics
Faithfulness
The most important metric. For each answer:
- Decompose the answer into atomic claims.
- For each claim, check whether any cited source contains it (verbatim or paraphrased).
- Faithfulness = supported_claims / total_claims.
Two ways to compute:
| Approach | Pros | Cons |
|---|---|---|
| Human label | Most accurate. | Slow. ~3 minutes per answer. |
| LLM judge | Fast. Cheap. | Biased; same-model judges over-credit. |
LLM-as-judge is good enough for trend tracking; human labels are good enough for go/no-go.
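A sketch of the LLM-judge variant. The call_llm stub, the prompts, and the helper names are illustrative, not a fixed API:

```python
def call_llm(prompt: str) -> str:
    """Stub for whatever LLM client you use; replace with a real call."""
    raise NotImplementedError


def decompose_claims(answer: str) -> list[str]:
    """Ask the judge model to split the answer into atomic claims, one per line."""
    out = call_llm(
        "Split the following answer into short, self-contained factual claims, "
        "one per line, no numbering:\n\n" + answer
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def claim_supported(claim: str, sources: list[str]) -> bool:
    """Ask the judge whether any cited source supports the claim, verbatim or paraphrased."""
    verdict = call_llm(
        "Sources:\n" + "\n---\n".join(sources)
        + f"\n\nClaim: {claim}\n\nDoes any source support this claim? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")


def faithfulness(answer: str, cited_sources: list[str]) -> float:
    """supported_claims / total_claims; an answer with no claims scores 1.0 by convention."""
    claims = decompose_claims(answer)
    if not claims:
        return 1.0
    supported = sum(claim_supported(c, cited_sources) for c in claims)
    return supported / len(claims)
```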
Coverage
coverage = |cited_in_answer ∩ retrieved_relevant| / |retrieved_relevant|
Did the answer use the right sources, given what retrieval returned? Low coverage with high faithfulness means the model is ignoring relevant sources.
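As a set operation this is one function; retrieved_relevant_uids here means the retrieval-relevant UIDs for the question (names are illustrative):

```python
def coverage(cited_uids: set[str], retrieved_relevant_uids: set[str]) -> float:
    """Fraction of retrieval-relevant sources the answer actually cites."""
    if not retrieved_relevant_uids:
        return 1.0  # nothing relevant was retrieved, so nothing was ignored
    return len(cited_uids & retrieved_relevant_uids) / len(retrieved_relevant_uids)
```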
Citation correctness
Three checks, all binary:
- Exists. Each cited UID resolves to a real node.
- Reachable. The user who asked the question has permission to see each citation.
- Supports. The cited node actually contains the claim it's anchored to.
Score: citation_correct = the AND of the three per citation, averaged over the answer's citations.
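A sketch of the per-citation scoring, with the three checks injected as callables so the scorer stays agnostic to your storage and permission layers. All names are illustrative; the supports check can reuse the claim-support judge from the faithfulness sketch:

```python
from typing import Callable


def citation_correctness(
    citations: list[tuple[str, str]],           # (uid, claim the citation is anchored to)
    node_exists: Callable[[str], bool],          # uid -> does it resolve to a real node?
    user_can_read: Callable[[str], bool],        # uid -> can the asking user see it?
    node_supports: Callable[[str, str], bool],   # (uid, claim) -> does the node contain it?
) -> float:
    """AND of exists / reachable / supports per citation, averaged over citations."""
    if not citations:
        return 0.0  # an uncited answer has nothing verifiable
    passed = 0
    for uid, claim in citations:
        passed += node_exists(uid) and user_can_read(uid) and node_supports(uid, claim)
    return passed / len(citations)
```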
Refusal calibration
For questions labeled must_refuse = true:
- Did the system refuse? (should_refuse_did = 1 if yes, else 0)
For questions labeled must_refuse = false:
- Did the system answer? (should_answer_did = 1 if yes, else 0)
Calibration = mean of both. A perfectly calibrated system scores 1.0. A system that never refuses (pathological hallucination) scores 0.5: perfect on the should-answer questions, zero on the must-refuse ones. Pathological over-refusal can fall to 0.0 on a set with no must_refuse = true questions.
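A sketch of the calibration score, assuming one result record per golden question with the refused and should_refuse fields from the rubric below; weighting the two groups equally is an assumption:

```python
def refusal_calibration(results: list[dict]) -> float:
    """Mean of (refusal rate on must-refuse questions) and (answer rate on the rest)."""
    refuse_qs = [r for r in results if r["should_refuse"]]
    answer_qs = [r for r in results if not r["should_refuse"]]
    scores = []
    if refuse_qs:
        scores.append(sum(r["refused"] for r in refuse_qs) / len(refuse_qs))
    if answer_qs:
        scores.append(sum(not r["refused"] for r in answer_qs) / len(answer_qs))
    return sum(scores) / len(scores) if scores else 0.0
```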
Hallucination risk
A composite signal worth tracking on its own.
hallucination_risk = 1 - (faithfulness × citation_correct)
A 5% hallucination risk on user-facing surfaces is roughly the threshold where users start to lose trust. Track it weekly; alert if it climbs.
A scoring rubric
For each Q&A pair, compute and store:
| Field | Type | Notes |
|---|---|---|
| faithfulness | float [0, 1] | Supported claims / total claims. |
| coverage | float [0, 1] | Cited / retrieved-relevant. |
| citation_correct | float [0, 1] | Exists × reachable × supports. |
| refused | bool | Did the system decline to answer? |
| should_refuse | bool | Ground truth from the golden set. |
| latency_ms | int | End-to-end answer time. |
| tool_calls | int | How many endpoint calls during the loop. |
| hallucination_risk | float [0, 1] | 1 − faithfulness × citation_correct. |
Aggregate weekly. Slice by question category. Plot the trend.
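A weekly aggregation can be a few lines of pandas; the JSONL path and the extra week and category fields are assumptions for this sketch:

```python
import pandas as pd

# One record per evaluated Q&A pair, shaped like the rubric above; the file
# path and the "week" / "category" fields are assumptions for this sketch.
df = pd.read_json("eval-results.jsonl", lines=True)
df["hallucination_risk"] = 1 - df["faithfulness"] * df["citation_correct"]

weekly = (
    df.groupby(["week", "category"])[
        ["faithfulness", "coverage", "citation_correct",
         "hallucination_risk", "latency_ms"]
    ]
    .mean()
    .round(3)
)
print(weekly)
```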
Common failure modes
| Symptom | Likely cause |
|---|---|
| High coverage, low faithfulness | Model paraphrases sources inaccurately. Tighten the prompt (see Prompting patterns). |
| Low coverage, high faithfulness | Retrieval surfaced relevant docs but model ignored them. Inspect prompt — context too long? sources truncated? |
| High citation incorrectness | Model invents UIDs. Reject answers whose citations don't exist; force retry. |
| Over-refusal | Refusal rules too aggressive, or retrieval threshold too high. |
| Under-refusal (lots of hallucination) | No refusal pathway, or model has no way to express uncertainty. |
| Long latency without quality lift | Too many tool calls. Cap the agent loop; cache popular reads. |
CI integration
The eval set should run on every change to:
- the retrieval pipeline (search config, embedding model, chunk size),
- the LLM provider or model version,
- the prompt template,
- any tool the agent can call.
Treat the suite like a unit-test gate: don't merge if hallucination_risk regresses by more than the noise threshold (typically 0.02).
# CI gate
python eval.py --suite golden.csv --config staging
risk=$(jq -r '.hallucination_risk' eval-report.json)
if awk -v r="$risk" 'BEGIN { exit !(r > 0.05) }'; then
  echo "Hallucination risk exceeded threshold (${risk} > 0.05)"
  exit 1
fi
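The gate above enforces an absolute ceiling. To also enforce the regression rule, one option is to compare the new report against the last accepted baseline; the baseline-report.json path is an assumption, and the 0.02 threshold comes from the paragraph above:

```python
# regression_gate.py: a sketch. Paths and the 0.02 noise threshold are assumptions.
import json
import sys

NOISE_THRESHOLD = 0.02  # tune to the observed run-to-run variance of your suite

with open("eval-report.json") as f:
    current = json.load(f)["hallucination_risk"]
with open("baseline-report.json") as f:
    baseline = json.load(f)["hallucination_risk"]

if current - baseline > NOISE_THRESHOLD:
    print(f"hallucination_risk regressed: {baseline:.3f} -> {current:.3f}")
    sys.exit(1)
```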
Auditability
For every user-facing answer in production, persist:
{
"question": "...",
"user_uid": "...",
"retrieved_uids": ["...", "..."],
"context_tokens": 3214,
"model": "claude-sonnet-4-6",
"prompt_hash": "sha256:...",
"answer": "...",
"citations": ["...", "..."],
"tool_calls": 4,
"latency_ms": 2104,
"feedback": null
}
When a user reports "the bot lied to me," this is the only thing that lets you diagnose what happened. Treat it as required.
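A minimal persistence sketch, assuming an append-only JSONL file is acceptable; the path and the added timestamp field are assumptions:

```python
import json
import time


def log_answer_record(record: dict, path: str = "answer-audit.jsonl") -> None:
    """Append one audit record per user-facing answer; field names mirror the example above."""
    record.setdefault("timestamp", int(time.time()))
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```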
Refusal isn't failure
Refusal is a feature. Production grounded Q&A surfaces should refuse 10–30% of long-tail queries. If your refusal rate is 0%, you're hallucinating somewhere; track down which questions slipped past the guardrails.
Where to go next
- Prompting patterns — templates, refusal patterns.
- LLM agents — the multi-step loop you're evaluating.
- Relevance evaluation — the retrieval side of the same problem.
- Metrics reference — wiring per-tool error rates.