Performance Tuning
Performance issues in Curiosity Workspace generally fall into four buckets: ingestion throughput, search latency, chat / RAG latency, and resource pressure. This page gives target metrics for each, the levers that move them, and a profiling workflow.
For capacity planning by data size, see Scaling. For the operational metrics that drive these targets, see Monitoring.
Target metrics
These are workable defaults — your domain may shift them up or down, but use them as a starting point.
| Surface | Metric | Good | Investigate when |
|---|---|---|---|
| Search (text) | P50 latency | < 80 ms | > 300 ms |
| Search (text) | P95 latency | < 250 ms | > 800 ms |
| Search (hybrid) | P95 latency | < 500 ms | > 1.5 s |
| Search | Result count for a typical query | 5–50 hits | 0 hits or > 200 hits |
| Chat turn | End-to-end latency | < 4 s | > 10 s |
| Chat turn | Tool calls per turn | 1–3 | > 5 |
| Ingestion | Sustained writes/s (small nodes) | 1k–5k | < 200 |
| Ingestion | P95 commit latency | < 1 s | > 5 s |
| Embeddings | New nodes embedded per minute | matches your provider's RPS budget | embedding queue depth growing |
| Container | CPU under steady load | < 60% | > 80% sustained |
| Container | Resident memory | grows then plateaus | grows unbounded |
| Container | Disk IOPS | < 50% of provisioned | > 80% sustained |
If a metric isn't reachable from the built-in Monitoring dashboard, instrument it from a connector or a custom endpoint and ship to your monitoring stack.
Hardware sizing baseline
| Workload | CPU | RAM | Disk | Disk IOPS |
|---|---|---|---|---|
| Local dev (single user) | 4 cores | 8 GB | 50 GB SSD | any |
| Small team (≤ 50 users, < 1M nodes) | 4–8 cores | 16 GB | 200 GB SSD | 3 000 |
| Mid-size (≤ 500 users, < 10M nodes) | 8–16 cores | 32 GB | 500 GB SSD | 5 000 |
| Large (≤ 5 000 users, < 50M nodes) | 16–32 cores | 64–128 GB | 1 TB+ NVMe | 10 000+ |
Increasing RAM has the most reliable effect because the graph engine memory-maps indexes; more RAM = more of the working set resident.
Ingestion performance
Symptoms: slow connector runs, commit timeouts, queue depth growth.
Levers, roughly in order of impact:
- Batch commits. Call `CommitPendingAsync()` every 100–500 items, not once per row and not once per million (see the sketch after this list).
- Stable keys. Unstable keys cause the engine to do extra deduplication work and balloon graph size.
- Skip unchanged records. Compare a hash of the source row against the node's last-modified-hash property; skip the upsert if unchanged.
- Parallelize source reads, not graph writes. The graph is a single writer; parallel writers serialize on commit. Multiple connectors against different node types can run in parallel safely.
- Defer edge creation for related items that come from a separate source — write nodes first, edges second.
- Move file parsing off-host when ingesting many large files. Parsers are out-of-process anyway; give them more CPU.
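The first three levers combine naturally in a connector loop. Below is a minimal sketch, assuming a hypothetical source reader (`source.ReadRowsAsync()`) and hypothetical graph helpers (`GetNodePropertyAsync`, `UpsertNodeAsync`); only `CommitPendingAsync()` is taken from this page, and the property name is illustrative.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hedged sketch: `source`, `graph`, ReadRowsAsync, GetNodePropertyAsync and
// UpsertNodeAsync are illustrative placeholders, not confirmed Workspace APIs.
const int BatchSize = 250;                                 // commit every 100–500 items
var pending = 0;

await foreach (var row in source.ReadRowsAsync())
{
    // Skip unchanged records: compare a hash of the source row
    // against the node's last-modified-hash property.
    var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(row.RawJson)));
    if (hash == await graph.GetNodePropertyAsync(row.Key, "last-modified-hash"))
        continue;

    await graph.UpsertNodeAsync(row.Key, row, hash);       // stable key, derived from the source system

    if (++pending >= BatchSize)
    {
        await graph.CommitPendingAsync();                  // batch commits, not one per row
        pending = 0;
    }
}

if (pending > 0)
    await graph.CommitPendingAsync();                      // flush the final partial batch
```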
Benchmark with a representative subset before sizing for the whole corpus. Doubling node count rarely doubles ingestion time on tuned pipelines.
Search performance
Symptoms: search latency above target, search-latency p95 regression after a config change.
Levers:
- Index only useful fields. Every indexed field adds memory and parser CPU. Drop boilerplate (legal footers, signatures, generic disclaimers).
- Type scope every query with `BeforeTypesFacet`. A search for "MacBook" across every type is slower and less precise than the same search restricted to `Ticket`.
- Use `TargetUIDs` for "search within context". A graph-derived target set is cheaper than a post-filter.
- Tune field boosts so titles dominate body text on short queries. See Ranking Tuning.
- Hybrid search costs more than text. Use hybrid where it earns its keep (long descriptive content) and text-only for short identifier searches.
- Cache expensive aggregates behind a custom endpoint with a short TTL when they don't need to be live.
Common slow query patterns and fixes:
| Pattern | Fix |
|---|---|
| `Q().StartAt(type).Where(...)` over millions of nodes | Add a more specific `StartAt` (by key or UID) or scope with `Out(...)` from a smaller starting set |
| Multi-hop traversal without `Take()` | Bound each hop; add `.Take(N)` at the right intermediate stage |
| `Q().StartAtSimilarText(query)` over a huge corpus, then filter | Compute the target set first (graph), then call `StartAtSimilarText` within `TargetUIDs` |
| Search returning thousands of hits | Add facets the user actually uses; don't deliver large unfiltered result sets |
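The `StartAtSimilarText` fix deserves a sketch: compute a small, graph-derived target set first, then run the semantic search inside it. The sketch below is hedged; `Q()`, `StartAt`, `Out`, `Take`, `StartAtSimilarText`, and `TargetUIDs` are the identifiers used on this page, while the node types, keys, and terminal calls (`GetUIDsAsync`, `GetAsync`) are assumptions that may differ from your SDK version.

```csharp
// Illustrative only: N.Project, N.Ticket, projectKey, GetUIDsAsync and GetAsync
// are placeholders; check the query API of your Workspace version.
var targetUids = await Q().StartAt(N.Project, projectKey)   // small, graph-derived starting set
                          .Out(N.Ticket)                     // hop to the related tickets
                          .Take(500)                          // bound the hop
                          .GetUIDsAsync();                    // placeholder terminal call

var hits = await Q().StartAtSimilarText(query)               // semantic search
                    .TargetUIDs(targetUids)                    // restricted to the precomputed target set
                    .Take(8)
                    .GetAsync();                               // placeholder terminal call
```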
Chat / RAG performance
Symptoms: chat turns slow, users abandon mid-answer, tools timing out.
Levers:
- Bound retrieval at the tool level. `Take(8)` is enough for most RAG; passing 50 chunks to the LLM is wasteful and reduces answer quality (see the sketch after this list).
- Cap snippet size. `scope.ChatAI.GetTextFromNode(uid, limit: 4_000)` instead of the full document.
- Pick a faster chat model if quality is acceptable (`claude-haiku-4-5`, `gpt-4o-mini`, a local 7B/13B-class model).
- Reduce tool count per chat surface. Five tools is plenty for most chats; ten is too many.
- Use a fallback provider so a slow primary doesn't bottleneck every turn.
- Stream responses. Time-to-first-token matters more than total wall-clock for perceived latency.
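The first two levers fit in a few lines of tool code. A minimal sketch follows, assuming a hypothetical `GetUIDsAsync` terminal on the query; `Take(8)` and `scope.ChatAI.GetTextFromNode(uid, limit: 4_000)` are quoted from this page.

```csharp
// Hedged sketch of a bounded retrieval step inside a chat tool.
var uids = await Q().StartAtSimilarText(question)
                    .Take(8)                                  // 8 chunks is enough for most RAG
                    .GetUIDsAsync();                          // placeholder terminal call

var snippets = new List<string>();
foreach (var uid in uids)
{
    // Cap snippet size instead of passing full documents to the LLM.
    snippets.Add(scope.ChatAI.GetTextFromNode(uid, limit: 4_000));
}

var context = string.Join("\n\n---\n\n", snippets);           // bounded context handed to the model
```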
Embedding throughput
Embedding is bottlenecked by the provider's request budget (hosted) or your local hardware (self-hosted).
- Chunk size matters. Too small → many short calls; too large → fewer but slower calls and worse retrieval. Start at 512 tokens; tune from there.
- Batch embedding calls at the provider level when possible (most hosted providers accept arrays of inputs).
- Throttle in front of the provider. Hitting the rate limit looks like "embeddings unavailable" to users.
- Rebuilds are expensive. Schedule full re-embeds in off-hours. See Reindexing and re-embedding.
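Client-side batching and throttling can be plain .NET. The sketch below assumes a hypothetical `provider.EmbedBatchAsync(...)` call that accepts an array of inputs; everything else is standard library.

```csharp
// `chunks` is the list of text chunks to embed; EmbedBatchAsync is a hypothetical provider call.
var limiter = new SemaphoreSlim(4);                        // at most 4 requests in flight
var tasks = chunks.Chunk(64).Select(async batch =>         // batch inputs where the provider accepts arrays
{
    await limiter.WaitAsync();                             // stay under the provider's rate limit
    try     { return await provider.EmbedBatchAsync(batch); }
    finally { limiter.Release(); }
});

var vectors = (await Task.WhenAll(tasks)).SelectMany(v => v).ToList();
```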
Resource pressure
Symptoms: container OOM-kill, sustained CPU > 80%, disk IOPS saturation.
Levers:
- RAM: the most useful single lever. The graph engine memory-maps indexes; tight RAM forces page faults. Aim for resident memory ≈ working set + 25% headroom.
- CPU: the typical bottlenecks are parsers during ingestion bursts and waiting on hosted LLM calls. Add CPU for the former; consider a higher LLM provider tier for the latter.
- Disk IOPS: SSD as a baseline; NVMe for write-heavy environments. EBS `gp3` lets you provision IOPS independently of capacity.
- Embedding queue: if it's growing without bound, the provider is throttling or down. See Troubleshooting → embeddings.
Profiling workflow
When a metric regresses, walk this loop:
- Confirm the regression against the trailing 7-day baseline — not against your gut.
- Bisect the cause: most regressions follow a deploy, a config change, a data growth event, or a provider change. Look at the change window.
- Reproduce on staging if possible.
- Profile the slow path: enable `MSK_LOG_LEVEL=Debug` temporarily; look for slow-query and slow-commit warnings.
- Apply the smallest fix that addresses the bottleneck. Resist the urge to refactor.
- Reset `MSK_LOG_LEVEL=Information` after diagnosis — debug logs are voluminous.
- Add the metric that would have caught this earlier if it wasn't already monitored.
Anti-patterns
- Premature horizontal scaling. Workspace scales vertically first; throwing replicas at a CPU-bound box doesn't help.
- Embedding everything. Increases memory and embed-time cost without proportional retrieval benefit.
- Boost-everything ranking. Boosts are a priority order; if everything is "high priority", nothing is.
- Single giant commit. Memory-heavy and slow to recover from a mid-run crash.
- One mega-endpoint that does retrieval + LLM + post-processing + write-back. Split it; cache the deterministic parts.
Next steps
- Capacity planning: Scaling.
- Tuning relevance: Search Optimization, Ranking Tuning.
- Operational visibility: Monitoring.
- Symptom-first debugging: Troubleshooting.