Performance Tuning
Performance issues in Curiosity Workspace generally fall into four buckets: ingestion throughput, search latency, chat / RAG latency, and resource pressure. This page gives target metrics for each, the levers that move them, and a profiling workflow.
For capacity planning by data size, see Scaling. For the operational metrics that drive these targets, see Monitoring.
Target metrics
These are workable defaults — your domain may shift them up or down, but use them as a starting point.
| Surface | Metric | Good | Investigate when |
|---|---|---|---|
| Search (text) | P50 latency | < 80 ms | > 300 ms |
| Search (text) | P95 latency | < 250 ms | > 800 ms |
| Search (hybrid) | P95 latency | < 500 ms | > 1.5 s |
| Search | Result count for a typical query | 5–50 hits | 0 hits or > 200 hits |
| Chat turn | End-to-end latency | < 4 s | > 10 s |
| Chat turn | Tool calls per turn | 1–3 | > 5 |
| Ingestion | Sustained writes/s (small nodes) | 1k–5k | < 200 |
| Ingestion | P95 commit latency | < 1 s | > 5 s |
| Embeddings | New nodes embedded per minute | matches your provider's RPS budget | embedding queue depth growing |
| Container | CPU under steady load | < 60% | > 80% sustained |
| Container | Resident memory | grows then plateaus | grows unbounded |
| Container | Disk IOPS | < 50% of provisioned | > 80% sustained |
If a metric isn't reachable from the built-in Monitoring dashboard, instrument it from a connector or a custom endpoint and ship to your monitoring stack.
Hardware sizing baseline
| Workload | CPU | RAM | Disk | Disk IOPS |
|---|---|---|---|---|
| Local dev (single user) | 4 cores | 8 GB | 50 GB SSD | any |
| Small team (≤ 50 users, < 1M nodes) | 4–8 cores | 16 GB | 200 GB SSD | 3 000 |
| Mid-size (≤ 500 users, < 10M nodes) | 8–16 cores | 32 GB | 500 GB SSD | 5 000 |
| Large (≤ 5 000 users, < 50M nodes) | 16–32 cores | 64–128 GB | 1 TB+ NVMe | 10 000+ |
Increasing RAM has the most reliable effect because the graph engine memory-maps indexes; more RAM = more of the working set resident.
Ingestion performance
Symptoms: slow connector runs, commit timeouts, queue depth growth.
Levers, roughly in order of impact:
- Batch commits. Call `CommitPendingAsync()` every 100–500 items, not once per row and not once per million (see the sketch after this list).
- Stable keys. Unstable keys cause the engine to do extra deduplication work and balloon graph size.
- Skip unchanged records. Compare a hash of the source row against the node's last-modified-hash property; skip the upsert if unchanged.
- Parallelize source reads, not graph writes. The graph is a single writer; parallel writers serialize on commit. Multiple connectors against different node types can run in parallel safely.
- Defer edge creation for related items that come from a separate source — write nodes first, edges second.
- Move file parsing off-host when ingesting many large files. Parsers are out-of-process anyway; give them more CPU.
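The first three levers combine naturally in a connector loop. Below is a minimal sketch, assuming a hypothetical source reader (`source.ReadRowsAsync()`) and hypothetical graph helpers (`GetNodePropertyAsync`, `UpsertNodeAsync`); only `CommitPendingAsync()` is taken from this page, and the property name is illustrative.

```csharp
using System.Security.Cryptography;
using System.Text;

// Hedged sketch: `source`, `graph`, ReadRowsAsync, GetNodePropertyAsync and
// UpsertNodeAsync are illustrative placeholders, not confirmed Workspace APIs.
const int BatchSize = 250;                                 // commit every 100–500 items
var pending = 0;

await foreach (var row in source.ReadRowsAsync())
{
    // Skip unchanged records: compare a hash of the source row
    // against the node's last-modified-hash property.
    var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(row.RawJson)));
    if (hash == await graph.GetNodePropertyAsync(row.Key, "last-modified-hash"))
        continue;

    await graph.UpsertNodeAsync(row.Key, row, hash);       // stable key, derived from the source system

    if (++pending >= BatchSize)
    {
        await graph.CommitPendingAsync();                  // batch commits, not one per row
        pending = 0;
    }
}

if (pending > 0)
    await graph.CommitPendingAsync();                      // flush the final partial batch
```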
Benchmark with a representative subset before sizing for the whole corpus. Doubling node count rarely doubles ingestion time on tuned pipelines.
Search performance
Symptoms: search latency above target, search-latency p95 regression after a config change.
Levers:
- Index only useful fields. Every indexed field adds memory and parser CPU. Drop boilerplate (legal footers, signatures, generic disclaimers).
- Type scope every query with `BeforeTypesFacet`. A search for "MacBook" across every type is slower and less precise than the same search restricted to `Ticket`.
- Use `TargetUIDs` for "search within context". A graph-derived target set is cheaper than a post-filter.
- Tune field boosts so titles dominate body text on short queries. See Ranking Tuning.
- Hybrid search costs more than text. Use hybrid where it earns its keep (long descriptive content) and text-only for short identifier searches.
- Cache expensive aggregates behind a custom endpoint with a short TTL when they don't need to be live.
Common slow query patterns and fixes:
| Pattern | Fix |
|---|---|
| `Q().StartAt(type).Where(...)` over millions of nodes | Add a more specific `StartAt` (by key or UID) or scope with `Out(...)` from a smaller starting set |
| Multi-hop traversal without `Take()` | Bound each hop; add `.Take(N)` at the right intermediate stage |
| `Q().StartAtSimilarText(query)` over a huge corpus, then filter | Compute the target set first (graph), then call `StartAtSimilarText` within `TargetUIDs` |
| Search returning thousands of hits | Add facets the user actually uses; don't deliver large unfiltered result sets |
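The `StartAtSimilarText` fix deserves a sketch: compute a small, graph-derived target set first, then run the semantic search inside it. The sketch below is hedged; `Q()`, `StartAt`, `Out`, `Take`, `StartAtSimilarText`, and `TargetUIDs` are the identifiers used on this page, while the node types, keys, and terminal calls (`GetUIDsAsync`, `GetAsync`) are assumptions that may differ from your SDK version.

```csharp
// Illustrative only: N.Project, N.Ticket, projectKey, GetUIDsAsync and GetAsync
// are placeholders; check the query API of your Workspace version.
var targetUids = await Q().StartAt(N.Project, projectKey)   // small, graph-derived starting set
                          .Out(N.Ticket)                     // hop to the related tickets
                          .Take(500)                          // bound the hop
                          .GetUIDsAsync();                    // placeholder terminal call

var hits = await Q().StartAtSimilarText(query)               // semantic search
                    .TargetUIDs(targetUids)                    // restricted to the precomputed target set
                    .Take(8)
                    .GetAsync();                               // placeholder terminal call
```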
Chat / RAG performance
Symptoms: chat turns slow, users abandon mid-answer, tools timing out.
Levers:
- Bound retrieval at the tool level. `Take(8)` is enough for most RAG; passing 50 chunks to the LLM is wasteful and reduces answer quality (see the sketch after this list).
- Cap snippet size. `scope.ChatAI.GetTextFromNode(uid, limit: 4_000)` instead of the full document.
- Pick a faster chat model if quality is acceptable (`claude-haiku-4-5`, `gpt-4o-mini`, a local 7B/13B-class model).
- Reduce tool count per chat surface. Five tools is plenty for most chats; ten is too many.
- Use a fallback provider so a slow primary doesn't bottleneck every turn.
- Stream responses. Time-to-first-token matters more than total wall-clock for perceived latency.
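The first two levers fit in a few lines of tool code. A minimal sketch follows, assuming a hypothetical `GetUIDsAsync` terminal on the query; `Take(8)` and `scope.ChatAI.GetTextFromNode(uid, limit: 4_000)` are quoted from this page.

```csharp
// Hedged sketch of a bounded retrieval step inside a chat tool.
var uids = await Q().StartAtSimilarText(question)
                    .Take(8)                                  // 8 chunks is enough for most RAG
                    .GetUIDsAsync();                          // placeholder terminal call

var snippets = new List<string>();
foreach (var uid in uids)
{
    // Cap snippet size instead of passing full documents to the LLM.
    snippets.Add(scope.ChatAI.GetTextFromNode(uid, limit: 4_000));
}

var context = string.Join("\n\n---\n\n", snippets);           // bounded context handed to the model
```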
Embedding throughput
Embedding is bottlenecked by the provider's request budget (hosted) or your local hardware (self-hosted).
- Chunk size matters. Too small → many short calls; too large → fewer but slower calls and worse retrieval. Start at 512 tokens; tune from there.
- Batch embedding calls at the provider level when possible (most hosted providers accept arrays of inputs).
- Throttle in front of the provider. Hitting the rate limit looks like "embeddings unavailable" to users.
- Rebuilds are expensive. Schedule full re-embeds in off-hours. See Reindexing and re-embedding.
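Client-side batching and throttling can be plain .NET. The sketch below assumes a hypothetical `provider.EmbedBatchAsync(...)` call that accepts an array of inputs; everything else is standard library.

```csharp
// `chunks` is the list of text chunks to embed; EmbedBatchAsync is a hypothetical provider call.
var limiter = new SemaphoreSlim(4);                        // at most 4 requests in flight
var tasks = chunks.Chunk(64).Select(async batch =>         // batch inputs where the provider accepts arrays
{
    await limiter.WaitAsync();                             // stay under the provider's rate limit
    try     { return await provider.EmbedBatchAsync(batch); }
    finally { limiter.Release(); }
});

var vectors = (await Task.WhenAll(tasks)).SelectMany(v => v).ToList();
```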
Resource pressure
Symptoms: container OOM-kill, sustained CPU > 80%, disk IOPS saturation.
Levers:
- RAM: the most useful single lever. The graph engine memory-maps indexes; tight RAM forces page faults. Aim for resident memory ≈ working set + 25% headroom.
- CPU: the typical bottlenecks are parsers during ingestion bursts and waiting on hosted LLM calls. Add CPU for the former; consider a higher LLM provider tier for the latter.
- Disk IOPS: SSD as a baseline; NVMe for write-heavy environments. EBS `gp3` lets you provision IOPS independently of capacity.
- Embedding queue: if it's growing without bound, the provider is throttling or down. See Troubleshooting → embeddings.
Profiling workflow
When a metric regresses, walk this loop:
- Confirm the regression against the trailing 7-day baseline — not against your gut.
- Bisect the cause: most regressions follow a deploy, a config change, a data growth event, or a provider change. Look at the change window.
- Reproduce on staging if possible.
- Profile the slow path: enable `MSK_LOG_LEVEL=Debug` temporarily; look for slow-query and slow-commit warnings.
- Apply the smallest fix that addresses the bottleneck. Resist the urge to refactor.
- Reset `MSK_LOG_LEVEL=Information` after diagnosis — debug logs are voluminous.
- Add the metric that would have caught this earlier if it wasn't already monitored.
Anti-patterns
- Premature horizontal scaling. Workspace scales vertically first; throwing replicas at a CPU-bound box doesn't help.
- Embedding everything. Increases memory and embed-time cost without proportional retrieval benefit.
- Boost-everything ranking. Boosts are a priority order; if everything is "high priority", nothing is.
- Single giant commit. Memory-heavy and slow to recover from a mid-run crash.
- One mega-endpoint that does retrieval + LLM + post-processing + write-back. Split it; cache the deterministic parts.
Next steps
- Capacity planning: Scaling.
- Tuning relevance: Search Optimization, Ranking Tuning.
- Operational visibility: Monitoring.
- Symptom-first debugging: Troubleshooting.