Embeddings

Embeddings are dense-vector representations of text. Curiosity Workspace produces them automatically for the fields you mark as vector-indexed, stores them in a vector index, and uses them for:

  • vector search — find documents by meaning, not by keyword match;
  • similar-items features — surface related cases, related documents, related products;
  • RAG grounding — fetch the right chunks before calling an LLM.

This page is the operational guide: which provider, which fields, which chunk size, how to evaluate, how to migrate when you change models. For the search-side concerns (top-k, threshold, hybrid), see Vector Search. For the chat-side concerns, see LLM Configuration.

Workflow

flowchart LR
    Field[Indexed long-text field] -->|on commit| Chunker[Chunker]
    Chunker --> Provider["Embedding provider<br/>(OpenAI / Azure / Anthropic-paired / local)"]
    Provider --> Vector[Vector]
    Vector --> Index[(Vector index)]
    Query[User query] --> Provider2[Embedding provider]
    Provider2 --> QVec[Query vector]
    QVec --> Index
    Index --> Results[Nearest neighbors]

The same provider is used to embed corpus content (write side) and queries (read side). If they differ, vectors aren't comparable and retrieval breaks.

Choosing what to embed

Embed the fields a user would describe in plain English. Skip the rest.

| Good candidates | Why |
| --- | --- |
| Ticket / case bodies | Long, descriptive, paraphrased by users |
| Knowledge-base article bodies | Same |
| Meeting / call transcripts | Long; queries are conversational |
| Long summaries that capture the gist | The summary is doing the embedding work for you |
| Product / part descriptions | Descriptive, often searched by characteristic |

| Poor candidates | Why |
| --- | --- |
| IDs, SKUs, serial numbers | Exact match wins; text search is faster |
| Short labels, statuses, enums | Belong in facets |
| Boilerplate (footers, signatures) | Wastes index space and drags down retrieval quality |
| Raw HTML / JSON before parsing | Parse first, embed the prose |

A useful rule of thumb: if a field is fewer than 5 words, it's almost never worth embedding.

Chunking

For fields longer than the model's effective context, enable chunking; the workspace then produces one embedding per chunk.

| Decision | Recommendation |
| --- | --- |
| Chunk size | 200–800 tokens; start at 512 |
| Overlap | 10–20% (e.g., 64 tokens for 512-token chunks) |
| Boundary preference | Paragraph → sentence → token window |
| Per-field tuning | Short summaries: no chunking. Long bodies: chunked. Transcripts: chunk by speaker turn when possible |

Failures from bad chunking are easy to recognize:

  • Chunks too small: many fragments per document, queries hit them out of context;
  • Chunks too large: chunks aren't semantically tight, recall drops;
  • No overlap: a query landing on a boundary misses the match.
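
The chunker itself is built in, but to make the numbers above concrete, here is a minimal sketch of a paragraph-first strategy with an overlapping token-window fallback. It is illustrative only: it skips the sentence step, approximates tokens by whitespace-separated words (the real tokenizer counts differently), and mirrors the 512/64 defaults from the table.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative chunker: pack whole paragraphs into chunks of up to maxTokens,
// and fall back to a sliding window with overlap for oversized paragraphs.
public static class Chunker
{
    public static List<string> Chunk(string text, int maxTokens = 512, int overlap = 64)
    {
        var chunks = new List<string>();
        var current = new List<string>();

        foreach (var paragraph in text.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            var words = paragraph.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

            if (words.Length > maxTokens)
            {
                Flush(chunks, current);
                // Oversized paragraph: slide a fixed-size window with overlap
                // so a query landing on a boundary still finds a match.
                for (int start = 0; start < words.Length; start += maxTokens - overlap)
                    chunks.Add(string.Join(" ", words.Skip(start).Take(maxTokens)));
                continue;
            }

            if (current.Count + words.Length > maxTokens)
                Flush(chunks, current);

            current.AddRange(words);
        }

        Flush(chunks, current);
        return chunks;
    }

    private static void Flush(List<string> chunks, List<string> current)
    {
        if (current.Count == 0) return;
        chunks.Add(string.Join(" ", current));
        current.Clear();
    }
}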

Picking a provider

See the LLM Configuration provider matrix. Short version:

  • Hosted (OpenAI, Azure OpenAI) — easiest to set up, best quality on most corpora, sends text to the provider.
  • Local (built-in MiniLM, FastText, or a self-hosted server) — slightly lower quality, no data leaves your network.
  • Anthropic — pair with one of the above for embeddings; Claude doesn't offer an embedding service.

When in doubt:

  • Start with a small hosted model (text-embedding-3-small). Cheap, fast, good baseline.
  • If quality is insufficient, upgrade to text-embedding-3-large and re-embed.
  • If data residency requires it, switch to a local model and re-embed.

Configuring embeddings on a node type

Two ways to do it:

Via the UI
  1. Settings → Search → Indexes.
  2. Pick the node type.
  3. For the field you want embedded, toggle Vector index on.
  4. Set chunk size and overlap.
  5. Save. A background task backfills embeddings for existing nodes.
Via the connector schema (recommended for prod)
[Node]
public class Ticket
{
    [Key] public string Id { get; set; }
    [Property] public string Subject { get; set; }

    // Body is chunked (512 tokens, 64-token overlap) and vector-indexed.
    [Property]
    [VectorIndex(ChunkSize = 512, ChunkOverlap = 64)]
    public string Body { get; set; }

    [Timestamp] public DateTimeOffset CreatedAt { get; set; }
}

This configuration promotes with the rest of your connector code and survives schema migrations because it lives in source.

Cost control

Embedding cost is dominated either by provider tokens (hosted) or by hardware (local). To keep it predictable:

  • Embed only what you need (the "what to embed" table above).
  • Skip unchanged content. The connector can hash the field and only re-upsert when the hash changes (see the sketch after this list).
  • Set a daily token budget under Settings → AI Settings → Quotas.
  • Monitor vector index size and embed queue depth. A growing queue means the provider is throttling or down.
  • Defer full rebuilds to off-hours. Switching models forces a full re-embed of every embedded field.
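
A minimal sketch of that hash gate is shown below. It assumes you track the hash yourself; ShouldReembed and the in-memory store are hypothetical names for illustration, not a built-in Curiosity API, and a real connector would persist the hash next to the node.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Sketch: only call the embedding provider when the field's content hash changes.
public static class EmbedGate
{
    private static readonly Dictionary<string, string> LastSeenHash = new();

    public static bool ShouldReembed(string nodeId, string fieldText)
    {
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(fieldText ?? string.Empty)));

        if (LastSeenHash.TryGetValue(nodeId, out var previous) && previous == hash)
            return false; // unchanged -> skip the embedding call

        LastSeenHash[nodeId] = hash;
        return true;
    }
}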

Re-embedding

You must rebuild embeddings when you:

  • switch embedding providers or models;
  • change chunk size / overlap;
  • add a new vector-indexed field on existing data;
  • restore a backup made on a different model.

The rebuild runs in the background; queries fall back to text retrieval during the rebuild, then swap atomically. See Reindexing and re-embedding.

Evaluation

A small evaluation set saves more time than a thousand vibes-based tweaks.

  1. Collect 30–100 golden queries representative of real user intent.
  2. For each, record the expected matching nodes (the "right answer").
  3. After any embedding change, re-run the queries and compare to the gold set.
  4. Track precision@5 and recall@10 over time.

Common metrics:

| Metric | Meaning | What it tells you |
| --- | --- | --- |
| Precision@k | Of the top-k results, how many are good? | Whether the top is clean |
| Recall@k | Of all good results, how many made the top-k? | Whether the model is finding them at all |
| MRR | Mean reciprocal rank of the first correct result | Where the right answer lands |
| NDCG@k | Weighted relevance of the ranked list | Quality of ordering, not just inclusion |

If your evaluation is small enough, just eyeball it. If it's larger, capture metrics in a custom endpoint and chart them.
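
If you do script it, the per-query arithmetic is small. The sketch below is illustrative (the type and method names are not a Curiosity API): it scores one golden query given the result IDs in rank order and the IDs you recorded as the right answer. Averaging the reciprocal rank across all golden queries gives MRR.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: precision@k, recall@k and reciprocal rank for a single golden query.
public static class RetrievalMetrics
{
    public static (double precisionAtK, double recallAtK, double reciprocalRank)
        Score(IReadOnlyList<string> rankedResultIds, ISet<string> expectedIds, int k)
    {
        var topK = rankedResultIds.Take(k).ToList();
        int hits = topK.Count(expectedIds.Contains);

        // Precision@k divides by k even if fewer than k results came back.
        double precision = (double)hits / k;
        double recall = expectedIds.Count == 0 ? 0 : (double)hits / expectedIds.Count;

        // Reciprocal rank of the first correct result (0 if none was found).
        int firstHit = rankedResultIds
            .Select((id, i) => (id, rank: i + 1))
            .Where(x => expectedIds.Contains(x.id))
            .Select(x => x.rank)
            .FirstOrDefault();
        double rr = firstHit == 0 ? 0 : 1.0 / firstHit;

        return (precision, recall, rr);
    }
}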

Versioning and migrations

Treat the embedding model as part of your schema:

  • Record the model name and version with the index ("text-embedding-3-small @ v3.0").
  • When you switch, document why (quality, cost, residency) in the release notes.
  • Plan a re-embed window — for large corpora, this can be hours.
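
For the first point, one lightweight way to keep that record is a small metadata object stored next to the index. The shape below is an assumption for illustration, not a built-in Curiosity type.

using System;

// Hypothetical metadata record: pin the embedding model, its settings and
// the date of the last full re-embed alongside the vector index.
public sealed record EmbeddingIndexInfo(
    string Model,               // e.g. "text-embedding-3-small"
    string ModelVersion,        // e.g. "v3.0"
    int Dimensions,             // vector size the index was built with
    int ChunkSize,
    int ChunkOverlap,
    DateTimeOffset LastFullReembed);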

Common pitfalls

  • Switching models without re-embedding. Vectors are not comparable across models; retrieval will be silently broken.
  • Embedding everything. Doubles your index size, halves the value of each vector.
  • No chunking on long fields. The model truncates or averages; recall collapses.
  • No evaluation. Quality regresses silently and only surfaces in user complaints.
  • Mixing the production and dev provider keys in one workspace. Costs are confusing and rate limits don't add up the way you'd expect.
