Performance Tuning

Performance issues in Curiosity Workspace generally fall into four buckets: ingestion throughput, search latency, chat / RAG latency, and resource pressure. This page gives target metrics for each, the levers that move them, and a profiling workflow.

For capacity planning by data size, see Scaling. For the operational metrics that drive these targets, see Monitoring.

Target metrics

These are workable defaults — your domain may shift them up or down, but use them as a starting point.

| Surface | Metric | Good | Investigate when |
|---|---|---|---|
| Search (text) | P50 latency | < 80 ms | > 300 ms |
| Search (text) | P95 latency | < 250 ms | > 800 ms |
| Search (hybrid) | P95 latency | < 500 ms | > 1.5 s |
| Search | Result count for a typical query | 5–50 hits | 0 hits or > 200 hits |
| Chat turn | End-to-end latency | < 4 s | > 10 s |
| Chat turn | Tool calls per turn | 1–3 | > 5 |
| Ingestion | Sustained writes/s (small nodes) | 1k–5k | < 200 |
| Ingestion | P95 commit latency | < 1 s | > 5 s |
| Embeddings | New nodes embedded per minute | matches your provider's RPS budget | embedding queue depth growing |
| Container | CPU under steady load | < 60% | > 80% sustained |
| Container | Resident memory | grows then plateaus | grows unbounded |
| Container | Disk IOPS | < 50% of provisioned | > 80% sustained |

If a metric isn't reachable from the built-in Monitoring dashboard, instrument it from a connector or a custom endpoint and ship to your monitoring stack.

Hardware sizing baseline

| Workload | CPU | RAM | Disk | Disk IOPS |
|---|---|---|---|---|
| Local dev (single user) | 4 cores | 8 GB | 50 GB SSD | any |
| Small team (≤ 50 users, < 1M nodes) | 4–8 cores | 16 GB | 200 GB SSD | 3 000 |
| Mid-size (≤ 500 users, < 10M nodes) | 8–16 cores | 32 GB | 500 GB SSD | 5 000 |
| Large (≤ 5 000 users, < 50M nodes) | 16–32 cores | 64–128 GB | 1 TB+ NVMe | 10 000+ |

Increasing RAM has the most reliable effect because the graph engine memory-maps its indexes; more RAM means more of the working set stays resident.

Ingestion performance

Symptoms: slow connector runs, commit timeouts, queue depth growth.

Levers, roughly in order of impact:

  • Batch commits. Call CommitPendingAsync() every 100–500 items, not once per row and not once per million. (A batching sketch follows this list.)
  • Stable keys. Unstable keys cause the engine to do extra deduplication work and balloon graph size.
  • Skip unchanged records. Compare a hash of the source row against the node's last-modified-hash property; skip the upsert if unchanged.
  • Parallelize source reads, not graph writes. The graph is a single writer; parallel writers serialize on commit. Multiple connectors against different node types can run in parallel safely.
  • Defer edge creation for related items that come from a separate source — write nodes first, edges second.
  • Move file parsing off-host when ingesting many large files. Parsers are out-of-process anyway; give them more CPU.
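
A minimal sketch of that loop in C#. CommitPendingAsync() is the call named above; source, graph, ComputeRowHash, GetStoredHash, and UpsertRowAsNode are hypothetical stand-ins for your own connector code:

```csharp
const int BatchSize = 250;                 // inside the 100–500 range above
var pending = 0;

foreach (var row in source.ReadRows())     // reads can be parallelized upstream
{
    var hash = ComputeRowHash(row);        // e.g. SHA-256 of the serialized row
    if (hash == GetStoredHash(row.Key))
        continue;                          // unchanged record: skip the upsert

    UpsertRowAsNode(graph, row, hash);     // stable key derived from the source row
    if (++pending >= BatchSize)
    {
        await graph.CommitPendingAsync();  // batched commit
        pending = 0;
    }
}

if (pending > 0)
    await graph.CommitPendingAsync();      // flush the final partial batch
```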

Benchmark with a representative subset before sizing for the whole corpus. Doubling node count rarely doubles ingestion time on tuned pipelines.

Search performance

Symptoms: search latency above target, or a P95 latency regression after a config change.

Levers:

  • Index only useful fields. Every indexed field adds memory and parser CPU. Drop boilerplate (legal footers, signatures, generic disclaimers).
  • Type scope every query with BeforeTypesFacet. A search for "MacBook" across every type is slower and less precise than the same search restricted to Ticket.
  • Use TargetUIDs for "search within context". A graph-derived target set is cheaper than a post-filter.
  • Tune field boosts so titles dominate body text on short queries. See Ranking Tuning.
  • Hybrid search costs more than text. Use hybrid where it earns its keep (long descriptive content) and text-only for short identifier searches.
  • Cache expensive aggregates behind a custom endpoint with a short TTL when they don't need to be live. (A caching sketch follows this list.)
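
A minimal caching sketch using .NET's Microsoft.Extensions.Caching.Memory. The endpoint wiring is omitted; DashboardStats and ComputeExpensiveAggregateAsync are placeholders for your own aggregate:

```csharp
using Microsoft.Extensions.Caching.Memory;

// One shared cache instance; entries expire 30 seconds after creation, so the
// aggregate is recomputed at most twice a minute regardless of request volume.
static readonly MemoryCache Cache = new(new MemoryCacheOptions());

async Task<DashboardStats> GetStatsAsync() =>
    await Cache.GetOrCreateAsync("dashboard-stats", async entry =>
    {
        entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(30); // short TTL
        return await ComputeExpensiveAggregateAsync();                    // the slow part
    });
```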

Common slow query patterns and fixes:

| Pattern | Fix |
|---|---|
| Q().StartAt(type).Where(...) over millions of nodes | Add a more specific StartAt (by key or UID) or scope with Out(...) from a smaller starting set |
| Multi-hop traversal without Take() | Bound each hop; add .Take(N) at the right intermediate stage |
| Q().StartAtSimilarText(query) over a huge corpus, then filter | Compute the target set first (graph), then call StartAtSimilarText within TargetUIDs |
| Search returning thousands of hits | Add facets the user actually uses; don't deliver large unfiltered result sets |
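
A sketch of the third fix, graph first and similarity second. Q(), StartAt, Out, Take, StartAtSimilarText, and TargetUIDs are the calls named in the table; the node type, edge name, and the exact way the target set is materialized and passed are illustrative assumptions:

```csharp
// 1. Compute a small target set from the graph: start at one specific node
//    and take one bounded hop, instead of scanning the whole corpus.
var targetUids = Q().StartAt("Project", projectKey)  // specific start, not a full type scan
                    .Out("HasDocument")              // bounded single hop
                    .Take(500)                       // explicit cap on the set
                    .ToUIDs();                       // assumed materializer

// 2. Run the similarity search only inside that set via TargetUIDs.
var hits = Q().StartAtSimilarText(query, targetUIDs: targetUids); // parameter shape assumed
```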

Chat / RAG performance

Symptoms: chat turns slow, users abandon mid-answer, tools timing out.

Levers:

  • Bound retrieval at the tool level. Take(8) is enough for most RAG; passing 50 chunks to the LLM is wasteful and reduces answer quality. (See the sketch after this list.)
  • Cap snippet size. scope.ChatAI.GetTextFromNode(uid, limit: 4_000) instead of the full document.
  • Pick a faster chat model if quality is acceptable (claude-haiku-4-5, gpt-4o-mini, a local 7B/13B-class model).
  • Reduce tool count per chat surface. Five tools is plenty for most chats; ten is too many.
  • Use a fallback provider so a slow primary doesn't bottleneck every turn.
  • Stream responses. Time-to-first-token matters more than total wall-clock for perceived latency.
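
A sketch of a bounded retrieval tool combining both caps. scope.ChatAI.GetTextFromNode(uid, limit: 4_000) is the call shown above; RetrieveCandidates is a placeholder for your own retrieval query:

```csharp
// At most 8 chunks, at most 4 000 characters each: a compact, high-signal
// context instead of whole documents.
var snippets = RetrieveCandidates(question)   // placeholder retrieval query
    .Take(8)                                  // bound retrieval at the tool level
    .Select(uid => scope.ChatAI.GetTextFromNode(uid, limit: 4_000))
    .ToList();

return string.Join("\n---\n", snippets);      // hand the LLM a bounded context
```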

Embedding throughput

Embedding is bottlenecked by the provider's request budget (hosted) or your local hardware (self-hosted).

  • Chunk size matters. Too small → many short calls; too large → fewer but slower calls and worse retrieval. Start at 512 tokens; tune from there.
  • Batch embedding calls at the provider level when possible (most hosted providers accept arrays of inputs).
  • Throttle in front of the provider. Hitting the rate limit looks like "embeddings unavailable" to users. (A throttling sketch follows this list.)
  • Rebuilds are expensive. Schedule full re-embeds in off-hours. See Reindexing and re-embedding.
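
A generic client-side throttle sketch in C#. Nothing here is Workspace-specific; chunkBatches and EmbedBatchAsync stand in for your pending work and your provider call:

```csharp
// Cap in-flight embedding requests at 4 so bursts stay under the provider's
// request budget; tune the count to your rate limit.
var gate = new SemaphoreSlim(4);

var tasks = chunkBatches.Select(async batch =>
{
    await gate.WaitAsync();
    try     { return await EmbedBatchAsync(batch); }  // placeholder provider call
    finally { gate.Release(); }
});

var vectors = await Task.WhenAll(tasks);
```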

Resource pressure

Symptoms: container OOM-kill, sustained CPU > 80%, disk IOPS saturation.

Levers:

  • RAM: the most useful single lever. The graph engine maps indexes; tight RAM forces page faults. Aim for resident memory ≈ working set + 25% headroom.
  • CPU: the typical bottlenecks are parsers during ingestion bursts and time spent waiting on LLM calls to hosted providers. Add CPU; consider lifting the LLM provider's tier.
  • Disk IOPS: SSD baseline; NVMe for write-heavy environments. EBS gp3 lets you provision IOPS independent of capacity.
  • Embedding queue: if it's growing without bound, the provider is throttling or down. See Troubleshooting → embeddings.

Profiling workflow

When a metric regresses, walk this loop:

  1. Confirm the regression against the trailing 7-day baseline — not against your gut.
  2. Bisect the cause: most regressions follow a deploy, a config change, a data growth event, or a provider change. Look at the change window.
  3. Reproduce on staging if possible.
  4. Profile the slow path: enable MSK_LOG_LEVEL=Debug temporarily; look for slow-query and slow-commit warnings.
  5. Apply the smallest fix that addresses the bottleneck. Resist the urge to refactor.
  6. Reset MSK_LOG_LEVEL=Information after diagnosis — debug logs are voluminous.
  7. If the metric that would have caught this earlier wasn't already monitored, add it.

Anti-patterns

  • Premature horizontal scaling. Workspace scales vertically first; throwing replicas at a CPU-bound box doesn't help.
  • Embedding everything. Increases memory and embed-time cost without proportional retrieval benefit.
  • Boost-everything ranking. Boosts are a priority order; if everything is "high priority", nothing is.
  • Single giant commit. Memory-heavy and slow to recover from a mid-run crash.
  • One mega-endpoint that does retrieval + LLM + post-processing + write-back. Split it; cache the deterministic parts.

Next steps

  • Scaling: capacity planning by data size.
  • Monitoring: the operational metrics behind the targets on this page.
  • Ranking Tuning: field boosts and result ordering.
  • Reindexing and re-embedding: scheduling full rebuilds off-hours.
  • Troubleshooting: symptom-by-symptom diagnosis, including embedding outages.
