Semantic Similarity

Semantic similarity answers "how related are these two items by meaning?" without depending on shared keywords. It's the foundation for "similar docs," "duplicate detection," "related cases," and "you might also like" surfaces.

This page is the practitioner's guide: when to use it, how to call it, how to set thresholds, and what to put on the UI. For provider setup, see LLM configuration. For the API surface, see Embeddings API.

Use cases

| Surface | Pattern |
| --- | --- |
| "Similar documents" panel | Top-N similar by embedding, filtered by type. |
| "Similar tickets" auto-suggest | Top-N similar tickets to the open one, scoped to the same product line. |
| Product recommendations | Similar products by description, scoped to the user's region. |
| Duplicate detection | Pairs above a high similarity threshold (e.g. > 0.95) flagged for review or merge. |
| Theme clustering | All items embedded → group by similarity → label clusters. |
| Lookalike audiences | Find customers semantically similar to a seed set. |

"Similar docs" — the canonical pattern

// Naive attempt: start the query at the article node itself.
var seedUid = Node.GetUID("Article", articleId);

var candidates = Q()
    .StartAt("Article", articleId)
    .EmitWithScores();   // returns only the seed node — not what we want; see below

The right shape: seed the similarity index with text from the article rather than starting at the node itself.

return Q().StartAtSimilarText(
              text:        article.Body.FirstParagraph(),
              count:       50,
              nodeTypes:   new[] { "Article" },
              applyCutoff: true)
          .Where(n => n.UID != seedUid)   // exclude the seed
          .Take(10)
          .EmitWithScores();

For "more like this UID" without recomputing the embedding, call the REST endpoint:

POST /api/embeddings/similar/{seedUid}?count=10

It returns the top-N UIDs directly. See Embeddings API → /similar/{uid}.
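
From C#, a minimal sketch of that call. The base address and the response shape (assumed here to be a JSON array of UID strings) are assumptions; verify them against the Embeddings API reference.

using System.Net.Http.Json;   // for ReadFromJsonAsync

// Sketch only: base URL and response shape are assumptions.
using var http = new HttpClient { BaseAddress = new Uri("https://your-workspace.example.com") };
var response = await http.PostAsync($"/api/embeddings/similar/{seedUid}?count=10", content: null);
response.EnsureSuccessStatusCode();
var similarUids = await response.Content.ReadFromJsonAsync<string[]>();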

Similar tickets / cases

Constrain by the dimensions that matter — product, customer, time window — before scoring.

return Q().StartAtSimilarText(
              text:        openTicket.Summary,
              count:       500,
              nodeTypes:   new[] { "SupportCase" })
          .IsRelatedTo(Node.GetUID("Product", openTicket.ProductId))
          .Where(n => n.GetTime(N.SupportCase.OpenedAt) > DateTimeOffset.UtcNow.AddYears(-1))
          .Take(5)
          .EmitWithScores();

The graph filter (IsRelatedTo) is much cheaper than the vector branch, so push as much constraint as possible into the graph.

Similar products

Embed the product description (and optionally category + features) once during ingestion. At query time:

return Q().StartAtSimilarText(text, count: 100, nodeTypes: new[] { "Product" })
          .Where(n => n.GetString(N.Product.Region) == userRegion)
          .Where(n => n.GetFloat(N.Product.Price) <= maxPrice)
          .Take(20)
          .EmitWithScores();
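
At ingestion time, the text to embed can be composed from several fields. A minimal sketch, assuming a hypothetical Product shape; how the composed text reaches the similarity index depends on your ingestion pipeline:

using System.Linq;

record Product(string Description, string Category, string[] Features);

// Sketch: concatenate description, category, and features into one
// embeddable text. Skip empty fields so they don't dilute the vector.
static string BuildEmbedText(Product p) =>
    string.Join("\n", new[] { p.Description, p.Category, string.Join(", ", p.Features) }
        .Where(s => !string.IsNullOrWhiteSpace(s)));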

UX nudge: surface 3–8 results, not 20. Past 10, similarity scores fall off and users start ignoring the list.

Threshold guidance

A similarity score is meaningful only relative to your data. The thresholds below are starting points — calibrate against a labeled set (see Relevance evaluation).

| Score (cosine, normalized to [0, 1]) | Typical interpretation |
| --- | --- |
| ≥ 0.95 | Probable duplicate. Flag for review. |
| 0.85 – 0.95 | Very similar. Safe to suggest as related. |
| 0.70 – 0.85 | Related. Show in a "you might also like" rail. |
| 0.50 – 0.70 | Loosely related. Useful for browsing only. |
| < 0.50 | Probably noise. Don't surface. |

Set applyCutoff = true and configure the per-index threshold to filter the long tail server-side. Without a cutoff, a query about "marine biology" against a corpus of marketing docs still returns its top-N — just lower-scored.
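
For the UI, a minimal mapping from score to the buckets above (a sketch; the thresholds are the uncalibrated starting points from the table):

// Sketch: bucket a normalized cosine score into a UI label.
// Calibrate the thresholds against labeled data before shipping.
static string SimilarityLabel(float score) => score switch
{
    >= 0.95f => "Probable duplicate",
    >= 0.85f => "Very similar",
    >= 0.70f => "Related",
    >= 0.50f => "Loosely related",
    _        => "Not surfaced"   // below cutoff: don't show at all
};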

UX recommendations

  • Show 3–8 items, not 50. Relevance falls off past 10.
  • Label the score, don't show the number. "Very similar," "Related," "Loosely related" — bucketed labels beat decimals for non-technical users.
  • Anchor to the seed. Always show what the recommendations are similar to so the user can recalibrate.
  • Diversity matters. If the top 5 are near-duplicates, group them. Variety beats redundancy.
  • Refresh boundaries. "Similar items, last 12 months" beats "similar items, all time" for time-sensitive corpora.
  • Cite the dimension. "Similar by description" vs "Similar by who bought it" — the user should know which one they're seeing.

Duplicate detection workflow

  1. Generate pairs. For each new node, query top-K similar with a high threshold (≥ 0.95).
  2. Score the pair. Combine embedding similarity with cheap signals: shared keys, identical hash on a normalized field, identical author + day.
  3. Decide (see the sketch after this list).
    • All signals agree → auto-merge.
    • Embedding-high, signals-weak → queue for human review.
    • Embedding-weak → skip.
  4. Audit. Keep the merge log; embeddings drift after re-embeds, so review old merges quarterly.
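
A minimal sketch of the decision in step 3, assuming the cheap signals have already been computed as booleans (the parameter names are illustrative, not a real API):

enum DupAction { AutoMerge, HumanReview, Skip }

// Sketch: combine embedding similarity with cheap signals.
static DupAction Decide(float similarity, bool sharedKeys, bool sameContentHash, bool sameAuthorAndDay)
{
    if (similarity < 0.95f)
        return DupAction.Skip;                        // embedding-weak → skip

    var agreeing = (sharedKeys ? 1 : 0) + (sameContentHash ? 1 : 0) + (sameAuthorAndDay ? 1 : 0);

    return agreeing == 3
        ? DupAction.AutoMerge                         // all signals agree → auto-merge
        : DupAction.HumanReview;                      // embedding-high, signals weak → review
}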

Theme clustering

For small corpora (≤ 100k items), pull all embeddings, run HDBSCAN or k-means in Python, label clusters by sampling top items. For larger corpora, use the workspace's UMAP projection (/api/embeddings/projected) as the input and cluster the projected points.
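
For illustration, a tiny in-process k-means over raw embedding vectors (a sketch; for real workloads use a library implementation as noted above):

using System;
using System.Linq;

// Sketch: minimal k-means over embedding vectors. Returns, for each
// point, the index of the cluster it was assigned to.
static int[] KMeans(float[][] points, int k, int iterations = 20)
{
    var rnd = new Random(42);
    var centroids = points.OrderBy(_ => rnd.Next())   // seed with k random points
                          .Take(k)
                          .Select(p => (float[])p.Clone())
                          .ToArray();
    var assignment = new int[points.Length];

    for (var iter = 0; iter < iterations; iter++)
    {
        // Assign each point to its nearest centroid (squared Euclidean).
        for (var i = 0; i < points.Length; i++)
            assignment[i] = Enumerable.Range(0, k)
                                      .OrderBy(c => Distance2(points[i], centroids[c]))
                                      .First();

        // Move each centroid to the mean of its assigned points.
        for (var c = 0; c < k; c++)
        {
            var members = points.Where((_, i) => assignment[i] == c).ToArray();
            if (members.Length == 0) continue;        // keep an empty cluster's old centroid
            for (var d = 0; d < centroids[c].Length; d++)
                centroids[c][d] = members.Average(m => m[d]);
        }
    }
    return assignment;
}

static float Distance2(float[] a, float[] b)
{
    var sum = 0f;
    for (var i = 0; i < a.Length; i++) { var diff = a[i] - b[i]; sum += diff * diff; }
    return sum;
}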

Common pitfalls

  • One embedding for everything. Different surfaces need different fields embedded. A title-only embedding is great for "similar headlines" and bad for "similar narratives."
  • Forgetting the seed. "Similar to article 42" returning article 42 looks broken. Always exclude.
  • Stale embeddings. Switching models without re-embedding silently degrades every similarity surface. See Reindexing and re-embedding.
  • No cutoff in production. Every query returns count results regardless. Set the threshold.