Storage and indexing

How Curiosity Workspace persists data on disk, how indexes are laid out alongside it, and what that means for capacity planning, backups, and recovery.

On-disk layout

A running workspace writes to a single root directory pointed at by MSK_GRAPH_STORAGE. Underneath it, the workspace organizes data into several logical areas:

$MSK_GRAPH_STORAGE/
├── graph/                # nodes, edges, properties (the graph database)
├── text-index/           # text search index
├── vector-index/         # vector / embedding index
├── parsers/              # parsed documents (intermediate Document nodes)
├── audit/                # audit log
└── journal/              # write journal (can be redirected via MSK_GRAPH_JOURNAL_FOLDER)

$MSK_GRAPH_BACKUP_FOLDER/  # rolling backups, if configured
$MSK_LOG_PATH/             # logs, if file-based logging is configured

The directory structure may evolve between releases; treat the whole MSK_GRAPH_STORAGE directory as the unit of backup and restore.
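
For a quick orientation in a running deployment, the sketch below resolves the storage-related environment variables and prints where each area lives. Only the variable names come from this page; the fallback of the journal to a subfolder of the root when MSK_GRAPH_JOURNAL_FOLDER is unset is an illustrative assumption, not a documented contract.

using System;
using System.IO;

class StorageLayout
{
    static void Main()
    {
        // Root of all persistent data; treat this whole directory as the backup/restore unit.
        var root = Environment.GetEnvironmentVariable("MSK_GRAPH_STORAGE")
                   ?? throw new InvalidOperationException("MSK_GRAPH_STORAGE is not set.");

        // The journal can optionally live on a separate (faster) volume.
        // Falling back to a "journal" subfolder here is an illustrative assumption.
        var journal = Environment.GetEnvironmentVariable("MSK_GRAPH_JOURNAL_FOLDER")
                      ?? Path.Combine(root, "journal");

        // Optional locations for rolling backups and file-based logs.
        var backups = Environment.GetEnvironmentVariable("MSK_GRAPH_BACKUP_FOLDER");
        var logs    = Environment.GetEnvironmentVariable("MSK_LOG_PATH");

        Console.WriteLine($"Graph storage root : {root}");
        Console.WriteLine($"Journal folder     : {journal}");
        Console.WriteLine($"Backup folder      : {backups ?? "(not configured)"}");
        Console.WriteLine($"Log path           : {logs ?? "(file logging not configured)"}");
    }
}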

What's in memory vs on disk

| Data | In memory | On disk |
| --- | --- | --- |
| Graph nodes and edges (working set) | Yes (memory-mapped) | Yes (persistent storage) |
| Active text-index segments | Yes (mmap) | Yes |
| Active vector-index segments | Yes (mmap) | Yes |
| Parsed file content | No (streamed from disk on demand) | Yes |
| Backups | No | Yes |
| Journal entries | Buffered | Yes (durable on commit) |

The engine memory-maps graph and index segments and relies on the OS page cache to keep hot data resident. More RAM = larger working set staying in memory = lower query latency. That's the primary scaling lever before you reach for sharding.
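
To turn that rule of thumb into a rough check, the sketch below sums the on-disk size of the graph and index folders and compares it with the memory visible to the process. The folder names mirror the layout above; the comparison is a capacity-planning heuristic, not an engine API.

using System;
using System.IO;
using System.Linq;

class WorkingSetCheck
{
    // Total size of all files under a folder, or 0 if it does not exist.
    static long DirSize(string path) =>
        Directory.Exists(path)
            ? new DirectoryInfo(path).EnumerateFiles("*", SearchOption.AllDirectories).Sum(f => f.Length)
            : 0;

    static void Main()
    {
        var root = Environment.GetEnvironmentVariable("MSK_GRAPH_STORAGE") ?? "/data";

        // Hot, memory-mapped data: the graph plus the text and vector index segments.
        long hotBytes = DirSize(Path.Combine(root, "graph"))
                      + DirSize(Path.Combine(root, "text-index"))
                      + DirSize(Path.Combine(root, "vector-index"));

        // Memory available to this process (container limits are respected on .NET Core 3.0+).
        long ramBytes = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes;

        Console.WriteLine($"Graph + indexes on disk : {hotBytes / (1 << 30)} GiB");
        Console.WriteLine($"Available memory        : {ramBytes / (1 << 30)} GiB");
        Console.WriteLine(hotBytes <= ramBytes
            ? "Working set can stay fully resident in the page cache."
            : "Working set exceeds RAM; expect page-cache misses on cold queries.");
    }
}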

Persistence guarantees

  • A successful await graph.CommitPendingAsync() durably writes the change to the journal before returning. A crash immediately after the call cannot lose the change.
  • Index updates are applied asynchronously after the commit returns. There is a brief window during which a node exists in the graph but not yet in the search index, so application code that needs read-your-writes should query the graph directly, not the index (see the sketch after this list).
  • The journal is replayed on every startup. A workspace that boots without errors is consistent.
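
A minimal sketch of the read-your-writes pattern, assuming a hypothetical Article node type plus hypothetical write, lookup, and search helpers; only graph.CommitPendingAsync() is taken from this page, so substitute whatever write and query APIs your application already uses.

// Write a node and make it durable. After CommitPendingAsync returns, the change
// has been written to the journal and will survive an immediate crash.
var article = graph.AddOrUpdate(new Article { Title = "Quarterly report" }); // hypothetical write API
await graph.CommitPendingAsync();

// Read-your-writes: fetch the node from the graph itself (hypothetical key lookup).
// This is guaranteed to see the committed change.
var fresh = await graph.GetByKeyAsync<Article>(article.Key);

// A text-index search at this point may still miss the new node, because index
// updates are applied asynchronously; rely on it only for eventually-consistent search.
// var hits = await graph.SearchTextAsync("Quarterly report"); // hypothetical, may lag briefly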

Sizing

Rough estimates to budget storage at design time:

| Workload | Indicative size |
| --- | --- |
| Graph (nodes + edges) | ~1.5× the raw property bytes you commit |
| Text index | ~1× the sum of indexed text bytes |
| Vector index | ~(embedded text bytes) × (embedding dims × 4 bytes) ÷ (chunk size) |
| Journal headroom | 5 GB minimum; ~20% of the graph size in steady state |
| Backups (rolling) | 1× the live graph for the most recent snapshot |

A starter PVC of 200 GB is appropriate for hundreds of thousands of documents with embeddings; scale up before the volume reaches 80% utilization.
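
To make the table concrete, here is the same arithmetic applied to a hypothetical corpus; every input (document count, average property and text sizes, embedding dimensions, chunk size) is an illustrative assumption, and the multipliers are those from the table above.

using System;

class SizingEstimate
{
    static void Main()
    {
        // Illustrative corpus: 500k documents, ~40 KB of properties and ~30 KB of
        // indexed/embedded text each, 768-dim float32 embeddings, 1 KB chunks.
        double documents        = 500_000;
        double propertyBytes    = documents * 40_000;
        double indexedTextBytes = documents * 30_000;
        double embeddingDims    = 768;
        double chunkBytes       = 1_000;

        double graph     = 1.5 * propertyBytes;                              // ~1.5x raw property bytes
        double textIndex = 1.0 * indexedTextBytes;                           // ~1x indexed text bytes
        double vectors   = indexedTextBytes / chunkBytes * embeddingDims * 4; // chunks x dims x 4 bytes
        double journal   = Math.Max(5e9, 0.20 * graph);                       // 5 GB floor, ~20% of graph
        double backup    = 1.0 * graph;                                       // most recent rolling snapshot

        double total = graph + textIndex + vectors + journal + backup;

        Console.WriteLine($"Graph        : {graph / 1e9:F0} GB");
        Console.WriteLine($"Text index   : {textIndex / 1e9:F0} GB");
        Console.WriteLine($"Vector index : {vectors / 1e9:F0} GB");
        Console.WriteLine($"Journal      : {journal / 1e9:F0} GB");
        Console.WriteLine($"Backup       : {backup / 1e9:F0} GB");
        Console.WriteLine($"Total        : {total / 1e9:F0} GB");
    }
}

For this hypothetical corpus the total lands around 130 GB, comfortably inside the 200 GB starter volume suggested above.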

Storage class recommendations

| Platform | Recommended | Notes |
| --- | --- | --- |
| Linux host with local disk | NVMe SSD | Fastest option. |
| Linux host with attached disk | gp3 (AWS), Premium SSD (Azure), pd-ssd (GCP) | Block storage, low latency. |
| Kubernetes | ReadWriteOnce, SSD-class | Always single-writer. |
| Shared filesystems (NFS / EFS) | Tolerated for non-prod | Slower; the index files are sensitive to latency. |
| Object storage (S3, GCS, Azure Blob) | Not supported as primary storage | Use only for backups via a sync sidecar. |

Backups

  • Snapshot the volume that hosts MSK_GRAPH_STORAGE. Because reads are lock-free, a snapshot taken while the workspace is running is consistent.
  • For platforms without native snapshots, set MSK_GRAPH_BACKUP_FOLDER and schedule a backup task that writes consistent point-in-time copies into it; then ship the folder off-host (a minimal sketch follows this list).
  • Always back up the secrets (MSK_JWT_KEY, MSK_GRAPH_MASTER_KEY, MSK_ADMIN_PASSWORD, MSK_LICENSE) separately. A graph backup is useless if you can't decrypt it.
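
If you ship the folder yourself rather than through a platform tool, the sketch below is a naive one-way copy of MSK_GRAPH_BACKUP_FOLDER to another mount. The destination path and the size/timestamp skip rule are illustrative assumptions; in practice you would run this from a scheduler and point it at off-host storage.

using System;
using System.IO;

class ShipBackups
{
    static void Main()
    {
        var source = Environment.GetEnvironmentVariable("MSK_GRAPH_BACKUP_FOLDER")
                     ?? throw new InvalidOperationException("MSK_GRAPH_BACKUP_FOLDER is not set.");
        var target = "/mnt/offhost-backups"; // hypothetical off-host mount

        foreach (var file in Directory.EnumerateFiles(source, "*", SearchOption.AllDirectories))
        {
            var relative = Path.GetRelativePath(source, file);
            var dest = Path.Combine(target, relative);
            Directory.CreateDirectory(Path.GetDirectoryName(dest)!);

            // Naive sync: copy only files that are new or have changed since the last run.
            var src = new FileInfo(file);
            var dst = new FileInfo(dest);
            if (!dst.Exists || dst.Length != src.Length || dst.LastWriteTimeUtc < src.LastWriteTimeUtc)
                src.CopyTo(dest, overwrite: true);
        }

        // Remember: ship the secrets (MSK_JWT_KEY, MSK_GRAPH_MASTER_KEY, MSK_ADMIN_PASSWORD,
        // MSK_LICENSE) through a separate, secure channel; a graph backup is useless without them.
    }
}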

See Backup and restore.

Re-creating indexes

If a text or vector index ever needs rebuilding (because you changed the recipe, switched embedding providers, or restored an older backup onto a newer workspace), the engine does this in the background:

  1. A new index is built alongside the old one.
  2. Queries continue against the old index until the new one finishes.
  3. The engine swaps atomically — no downtime.

See Reindexing and re-embedding for the operational details.

Storage on different platforms

  • Docker host: bind-mount a local SSD directory at /data. See Docker.
  • Kubernetes: volumeClaimTemplates provisioning a ReadWriteOnce block-storage PVC. See Kubernetes.
  • AWS: EBS gp3, snapshot via DLM. See AWS.
  • Azure: Premium SSD managed disk. See Azure.
  • GCP: SSD Persistent Disk. See GCP.
  • OpenShift: ODF / Ceph RBD / vSphere CSI / platform default. See OpenShift.
  • Windows: NTFS volume on a dedicated SSD. See Windows.
