# Scaling
This page is a practical scaling guide: how to size a workspace by tier, how to grow each of the three axes (data volume, ingestion throughput, query load), and how backup behaves as the corpus grows.
For per-query tuning see Performance tuning. For operational deployment see Deployment.
## Scale tiers
Use these as starting points — every workspace's mix of nodes/edges/queries is different, but the tiers below are calibrated against real production deployments.
| Tier | Target users | Graph size (nodes) | Queries / day | Single-VM CPU / RAM | Notes |
|---|---|---|---|---|---|
| Pilot | up to 50 | up to 1M | up to 50k | 8 vCPU / 32 GB | Single VM. SSD storage. Suitable for early adopters. |
| Small | up to 500 | up to 10M | up to 500k | 16 vCPU / 64 GB | Single VM with read-only replica for failover. |
| Medium | up to 5 000 | up to 100M | up to 5M | 32 vCPU / 128 GB | Add a dedicated GPU worker if running OCR/STT or large embedding models. |
| Large | up to 25 000 | up to 1B | up to 50M | 64+ vCPU / 256+ GB | Multi-node cluster. NVMe storage. Dedicated workers for ingestion and embedding. |
| XL | 25 000+ | 1B+ | 50M+ | engagement-specific | Talk to Curiosity for architecture review. |
These tiers are about the workspace server itself. External LLM providers (OpenAI, Anthropic, etc.) have their own cost/throughput characteristics — see LLM configuration.
## Vertical sizing
For single-VM deployments at Pilot through Medium tiers, the bottleneck moves from CPU to RAM to disk I/O as the graph grows. Rules of thumb:
| Component | Sizing heuristic |
|---|---|
| RAM | Keep ≥ 50% of the hot indexes in memory. Indexes typically grow at 1.5× to 3× the raw text size. |
| CPU | Search latency is CPU-bound. Provision roughly peak QPS × P95 latency (in seconds) × 1.5 headroom, in cores. |
| Disk | SSD or NVMe is mandatory. Spinning disks turn 100 ms searches into 5 s. |
| Network | 1 Gbps is plenty for most workspaces. 10 Gbps only if you're pushing > 100 MB/s of ingestion traffic. |
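To make the CPU heuristic concrete, here is a minimal sketch of the arithmetic (Little's law: in-flight requests ≈ QPS × latency); the QPS and latency figures below are illustrative, not taken from the tier table:

```python
import math

def cores_needed(peak_qps: float, p95_latency_s: float, headroom: float = 1.5) -> int:
    """Little's law: concurrent in-flight requests ~= QPS x latency (seconds).
    For CPU-bound search, provision roughly one core per concurrent request,
    plus headroom."""
    return math.ceil(peak_qps * p95_latency_s * headroom)

# e.g. a peak of 400 QPS at a 60 ms P95:
print(cores_needed(400, 0.060))  # 36 vCPUs
```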
### Storage estimate
Rough storage = `raw text size × (1 + index_factor + embedding_factor)`, where:

- `index_factor` ≈ 1.5–3 (text indexes, NLP annotations, graph structure)
- `embedding_factor` ≈ `vector_dim × 4 bytes / avg_chunk_chars` (e.g. 1536-dim float32 over 500-char chunks ≈ 12× raw text)
For a 100 GB raw text corpus with embeddings: expect 1.5–2 TB of storage.
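The same estimate as a small calculator; a sketch using the factors above, with defaults you should replace by measured values:

```python
def storage_estimate_gb(raw_text_gb: float,
                        index_factor: float = 2.0,   # typical range 1.5-3
                        vector_dim: int = 1536,
                        bytes_per_float: int = 4,    # float32
                        avg_chunk_chars: int = 500) -> float:
    """Rough storage = raw x (1 + index_factor + embedding_factor)."""
    embedding_factor = vector_dim * bytes_per_float / avg_chunk_chars
    return raw_text_gb * (1 + index_factor + embedding_factor)

# 100 GB corpus, 1536-dim float32 embeddings over 500-char chunks:
print(storage_estimate_gb(100))  # ~1529 GB, the low end of the 1.5-2 TB estimate
```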
## Ingestion scaling
The graph is single-writer per workspace; the ingestion pipeline is what you scale.
| Bottleneck | Symptom | Fix |
|---|---|---|
| Single connector slow | One source's sync window growing each run. | Parallelize: partition the source (by date, region, account) and run N connector processes. |
| Connector → workspace network | Commit calls taking > 1 s. | Larger `SetAutoCommitCost`; fewer commits with more nodes each. |
| Workspace queue backed up | Latency between commit and visibility climbing. | Add ingestion-worker capacity. Pause indexing during the load (`PauseIndexing`). |
| NLP / embedding pipeline | Re-process queue takes hours/days after ingest spikes. | Add GPU workers for embeddings. Increase batch sizes in pipeline config. |
| File extraction (OCR/STT) | PDF/audio backlog growing. | Add OCR/STT worker pool. GPU is 10–30× faster for STT. |
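As an illustration of the first fix (partition the source, run N connector processes), a generic sketch; `sync_partition` is a hypothetical stand-in for your connector's sync logic:

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import date, timedelta

def sync_partition(start: date, end: date) -> int:
    """Hypothetical stand-in: sync one date slice of the source and
    return the number of records written."""
    return 0  # replace with real connector logic

def month_slices(start: date, end: date):
    """Split [start, end) into calendar-month partitions."""
    cur = start
    while cur < end:
        nxt = (cur.replace(day=1) + timedelta(days=32)).replace(day=1)
        yield cur, min(nxt, end)
        cur = nxt

if __name__ == "__main__":
    slices = list(month_slices(date(2024, 1, 1), date(2025, 1, 1)))
    # N slices sync in parallel instead of one monolithic run.
    with ProcessPoolExecutor(max_workers=4) as pool:
        written = pool.map(sync_partition, *zip(*slices))
        print(sum(written))
```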
Per-worker throughput targets:
- Ingestion writes: 5 000–20 000 nodes/sec per worker (graph + simple indexes).
- Embeddings (CPU, local model): 50–200 docs/sec.
- Embeddings (external API): provider-bound. Batch aggressively.
- OCR: 5–15 pages/sec CPU, 50–150 pages/sec GPU.
- STT (Whisper small, GPU): 10–15× real-time.
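These targets translate directly into capacity arithmetic. A sketch for draining a re-process backlog, using the CPU embedding figure from the list above (the backlog size is illustrative):

```python
def drain_hours(backlog_docs: int, docs_per_sec_per_worker: float, workers: int) -> float:
    """Time until a re-process queue empties at steady per-worker throughput."""
    return backlog_docs / (docs_per_sec_per_worker * workers) / 3600

# 10M-doc embedding backlog at ~100 docs/sec per CPU worker:
print(drain_hours(10_000_000, 100, 4))   # ~6.9 hours
print(drain_hours(10_000_000, 100, 16))  # ~1.7 hours
```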
## Query scaling
Search is read-heavy. Five levers:
| Lever | Effect |
|---|---|
| Read replicas | Distribute search load. Eventually consistent. See Read-only replicas. |
| Index sharding | Smaller per-shard indexes → faster lookups. Adds operational cost. |
| Query-side caching (endpoints) | Cache popular aggregates. Most "expensive" endpoints have a handful of common variants. |
| Hybrid blend tuning | Vector branch is the more expensive one — tune α to keep latency in budget. |
| Graph-constrained search | Pushing constraints into `TargetUIDs` collapses the candidate set the search engine touches. |
The single biggest query-scaling win is almost always narrower start sets. A graph-scoped `TargetUIDs` cuts both candidate retrieval and ACL filtering. See Search ranking tuning.
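For the query-side caching lever, a minimal TTL-cache sketch; `compute_aggregate` is hypothetical, standing in for an expensive endpoint body:

```python
import time
from functools import lru_cache

TTL_S = 60  # how stale a popular aggregate is allowed to be

def compute_aggregate(endpoint: str, params: dict):
    """Hypothetical expensive endpoint body (hits the workspace)."""
    return {"endpoint": endpoint, **params}  # placeholder result

@lru_cache(maxsize=1024)
def _cached(endpoint: str, params: tuple, epoch: int):
    # epoch changes every TTL_S seconds, forcing one recompute per window
    return compute_aggregate(endpoint, dict(params))

def cached_aggregate(endpoint: str, params: dict):
    """Serve popular aggregates from cache, recomputing at most once per TTL window."""
    return _cached(endpoint, tuple(sorted(params.items())), int(time.time() // TTL_S))
```

This works precisely because most "expensive" endpoints have a handful of common variants: a small LRU keyed on the sorted parameters absorbs the bulk of the load.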
## Multi-tenant patterns
Two viable shapes:
1. **Tenant = workspace.** Each tenant gets a dedicated workspace; full isolation, simple ACLs, trivial DR. Best when tenants don't share data and procurement is per-tenant.
2. **Tenant = graph subset.** One workspace, every node and edge tagged by tenant, ACLs enforce isolation. Best when tenants share lookup data (catalogs, knowledge bases).
For (2), ensure:
- A `Tenant` node per tenant; every user/team/data node links to it.
- All user-facing search calls use `CreateSearchAsUserAsync` (permission-aware).
- Tenant ID is in every endpoint's input — never trust a client-supplied tenant.
- Quotas per tenant (rate limits on endpoint calls, ingestion volume).
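For the quota item, a minimal per-tenant token-bucket sketch; the rates are illustrative, and production deployments would usually enforce this at the gateway:

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float                  # tokens refilled per second
    burst: float                 # bucket capacity
    tokens: float | None = None  # starts full on first use
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        if self.tokens is None:
            self.tokens = self.burst
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per tenant: e.g. 50 endpoint calls/sec, bursts up to 200.
buckets: dict[str, TokenBucket] = defaultdict(lambda: TokenBucket(rate=50, burst=200))

def check_quota(tenant_id: str) -> bool:
    return buckets[tenant_id].allow()
```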
## Backup at scale
| Tier | Backup strategy |
|---|---|
| Pilot/Small | Daily full snapshot. Restore tested quarterly. |
| Medium | Daily full + hourly incremental. Restore tested monthly. Cross-region copy. |
| Large/XL | Continuous snapshots + WAL-style streaming to standby region. Target RPO < 5 min. |
Snapshot size grows with the corpus. At 1B+ nodes a full snapshot can take many hours — incremental + change-data-capture is the only path to short RPOs.
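Illustrative arithmetic behind that claim; the throughput and churn figures are assumptions, not measurements:

```python
def snapshot_hours(snapshot_gb: float, throughput_mb_s: float) -> float:
    """Wall-clock time to stream a snapshot at a given sustained throughput."""
    return snapshot_gb * 1024 / throughput_mb_s / 3600

# A 10 TB full snapshot at 500 MB/s sustained:
print(snapshot_hours(10_000, 500))  # ~5.7 h -> full-snapshot RPO is hours
# An incremental covering ~1% daily churn of the same corpus:
print(snapshot_hours(100, 500))     # ~0.06 h (~3.4 min) -> RPO < 5 min is feasible
```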
The backup configuration knobs live in the operations guide — see Backup and restore.
## Disaster recovery
Define and test:
- RPO (recovery point objective): how much data can you afford to lose?
- RTO (recovery time objective): how quickly must you be back up?
- Failover path: cold standby? warm replica? active-active?
- Restore drill: at least quarterly. The first time you restore should not be during an outage.
## Practical scaling checklist
- Schema and indexes audited annually; drop fields that aren't queried.
- Ingestion batched and idempotent; no full re-syncs in production.
- Search calls graph-scoped wherever possible.
- Hybrid search blend tuned with a golden set.
- Embedding model and chunking pinned; re-embed jobs scheduled, not ad-hoc.
- Endpoint metrics in your monitoring stack; alerts on latency, error, and query-tracker drift.
- Backup snapshots cross-region; restore drilled.
- Capacity reviewed quarterly against the tier table.
## Where to go next
- Performance tuning — per-query tuning.
- Search optimization — schema and facets.
- Reindexing and re-embedding — re-build operations.
- Monitoring — built-in dashboards.
- Metrics reference — wiring into external monitoring.