Read-only replicas

A read-only replica is a second Curiosity Workspace process that follows a primary, applies its writes in near-real-time, and serves read traffic. It exists to spread search and graph-query load off the primary and to provide a warm standby for failover.

This page covers the architecture, the environment flags, how to bring a replica online, and the operational expectations. For multi-region disaster recovery and snapshot-based DR, see Backup and restore.

Licensed feature

Replicas require the Replicas feature flag on your license. Check Admin → License before planning a deployment. If the feature is disabled, the replica process will refuse to start.

How it works

flowchart LR
    client[Client / load balancer] -->|writes| primary
    client -->|reads| primary
    client -->|reads| replica1
    client -->|reads| replica2
    primary[(Primary)] -->|WAL stream<br/>gRPC :42999| replica1[(Replica A)]
    primary -->|WAL stream<br/>gRPC :42999| replica2[(Replica B)]
    replica1 -->|metrics + last sync seq| primary
    replica2 -->|metrics + last sync seq| primary

The replica protocol has three phases:

  1. Register. On startup, the replica generates a unique ReplicaUID and registers itself with the primary over HTTPS (the primary's normal API port). Authentication uses a shared MSK_JWT_KEY between primary and replica.
  2. Catch up. The replica streams the primary's underlying RocksDB files over gRPC (TCP port 42999) into its own local storage. This file copy happens once per replica lifetime.
  3. Tail. The replica subscribes to the primary's write-ahead log and applies new batches sequentially, typically every 250 ms. It periodically reports the last applied sequence number back to the primary so lag can be observed.

There is no rsync of backup snapshots, no shared block storage, and no requirement that primary and replica share a filesystem. Every replica keeps its own complete copy of the data.
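
The tail phase in step 3 amounts to a small apply-and-report loop. A Python sketch under stated assumptions — fetch_batches, apply_batch, and the (seq, batch) shape are hypothetical stand-ins, not the product's API:

```python
def tail_once(fetch_batches, apply_batch, last_seq):
    """One tail iteration: apply pending WAL batches in sequence order
    and return the new last-applied sequence number, which the replica
    then reports back to the primary so lag can be observed.

    fetch_batches(after_seq) -> ordered list of (seq, batch)  [hypothetical]
    apply_batch(batch)       -> applies one batch locally     [hypothetical]
    """
    for seq, batch in fetch_batches(last_seq):
        apply_batch(batch)
        last_seq = seq
    return last_seq
```

In the real process this runs roughly every 250 ms; the sketch shows a single iteration so the sequential-apply logic is easy to see.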

What replicas serve

| Traffic | Where it goes |
| --- | --- |
| Search queries | Replica (round-robin) or primary. |
| Graph queries (Q().StartAt(…).Emit()) | Replica or primary. |
| Node reads (GET /api/node/{uid}) | Replica or primary. |
| Embedding queries (/api/embeddings/*) | Replica or primary. |
| Writes (any AddOrUpdate, Link, Delete) | Primary only. Replicas reject write requests. |
| Schema changes | Primary only. |
| Admin operations (license, SSO config, …) | Primary only. |
| File uploads | Primary only. |
| Endpoint execution | Replica if the endpoint is marked read-only; primary otherwise. |
| AI tool calls | Same as endpoints. |

Endpoints can be marked read-only-safe in their code; the front-end and SDK call ReadOnlyCompatible() on those requests to allow routing to replicas. Anything that mutates the graph calls ForcePrimary() and is routed to the primary.
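
As an illustration of that rule (not the SDK's actual API — ReadOnlyCompatible() and ForcePrimary() live in the C# SDK; the URLs here are examples), the routing decision reduces to:

```python
from itertools import cycle

primary = "https://workspace-primary.example.com"
replicas = ["https://workspace-replica-a.example.com",
            "https://workspace-replica-b.example.com"]
rr = cycle(replicas)  # round-robin iterator over the replica pool

def route(read_only_compatible: bool) -> str:
    """Pick a target URL: read-only-safe calls rotate across replicas,
    everything that mutates the graph goes to the primary."""
    if read_only_compatible and replicas:
        return next(rr)
    return primary
```

Write requests never consult the replica pool, so an empty or fully-retired pool only degrades read fan-out, not correctness.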

Indexing on replicas

Replicas rebuild text and vector indexes locally as the WAL stream arrives — they do not copy index files from the primary. Two consequences:

  • A freshly-promoted replica is fully self-sufficient. Failing the primary over to it does not lose indexes.
  • The replica's CPU and disk-I/O cost is comparable to the primary's. Replicas are not "cheap read mirrors" — size them like primaries.

When the primary adds a new index, it pushes a LoadNewIndexOnReplica notification over gRPC; the replica picks it up and begins building locally. The same applies to index removal.

Environment flags

On the replica

| Variable | Required | Value | Notes |
| --- | --- | --- | --- |
| MSK_REPLICA | yes | true | Switches the process into read-only mode. |
| MSK_PRIMARY_ADDRESS | yes | full URL of the primary, e.g. https://workspace-primary.example.com | Used for initial registration and WAL streaming. |
| MSK_JWT_KEY | yes | shared secret string | Must match the primary's MSK_JWT_KEY exactly. Used for inter-node auth. |
| MSK_GRAPH_STORAGE | yes | local path | Where the replica stores its copy of the data. Must be on local SSD/NVMe. |
| MSK_PUBLIC_ADDRESS | yes | public URL the replica is reachable at | Used in registration so the primary can call the replica back. |
| MSK_SERVER_ADDRESS | recommended | internal URL for primary→replica calls | Falls back to MSK_PUBLIC_ADDRESS if unset. |
| MSK_GRAPH_MASTER_KEY | if encryption is on | same value as primary | Replicas cannot read encrypted data without the same key. |
| MSK_LICENSE | yes | license that includes the Replicas feature | Replicas validate their own license at startup. |
| Other MSK_* settings | as needed | Provider keys, SSO, TLS, etc. | Most settings can be set on both. The replica reads configuration from the graph after the initial catch-up. |
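
A pre-flight check of the required variables can save a failed start. A minimal sketch — the variable list mirrors the table above, but the check itself is ours, not part of the product:

```python
REQUIRED = ["MSK_REPLICA", "MSK_PRIMARY_ADDRESS", "MSK_JWT_KEY",
            "MSK_GRAPH_STORAGE", "MSK_PUBLIC_ADDRESS", "MSK_LICENSE"]

def missing_replica_vars(env: dict) -> list[str]:
    """Return the required replica variables that are unset or empty."""
    return [v for v in REQUIRED if not env.get(v)]
```

Run it against dict(os.environ) before starting the service; add MSK_GRAPH_MASTER_KEY to the list if the primary uses graph encryption.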

On the primary

The primary doesn't need a flag to enable replication — it accepts replica registrations as soon as one shows up with a valid JWT. What it does need:

| Variable | Required | Notes |
| --- | --- | --- |
| MSK_JWT_KEY | yes | Same value the replicas will present. |
| MSK_PUBLIC_ADDRESS | yes | Reachable from each replica. |

Make sure TCP port 42999 is open from each replica to the primary — that's the gRPC channel the WAL stream uses. It's internal-only; never expose it on the public internet.

CORS on the primary

When a replica registers, the primary automatically adds the replica's public address to its CORS origin list. No manual configuration needed.

Bringing a replica online

flowchart LR
    A[Provision host] --> B[Install workspace]
    B --> C[Set MSK_REPLICA + MSK_PRIMARY_ADDRESS + MSK_JWT_KEY]
    C --> D[Start the process]
    D --> E[Replica registers with primary]
    E --> F[Initial catch-up over gRPC]
    F --> G[Tailing the WAL]
    G --> H[Replica serves reads]

Step-by-step

1

Provision a host

Size it like a primary: same CPU class, same RAM, NVMe storage. A replica that's smaller than the primary will fall behind under load.

2

Install the workspace binary

Use the same version as the primary. A replica more than one minor version older or newer than the primary will refuse to start. See Upgrades and migrations → Version compatibility.

3

Set the replica-specific env vars

export MSK_REPLICA=true
export MSK_PRIMARY_ADDRESS=https://workspace-primary.example.com
export MSK_JWT_KEY="$(cat /etc/curiosity/jwt-key)"   # same as primary
export MSK_GRAPH_STORAGE=/var/lib/curiosity-replica
export MSK_PUBLIC_ADDRESS=https://workspace-replica-a.example.com
export MSK_LICENSE="$(cat /etc/curiosity/license)"

If the primary uses graph encryption, also set MSK_GRAPH_MASTER_KEY to the same value.

4

Start the process

systemctl start curiosity-workspace

Tail the log. You should see, in order:

Registering replica with primary at https://workspace-primary.example.com
Catching up with primary (streaming RocksDB files)…
Loading read-only graph from storage
Replica online; tailing WAL from sequence …

If catch-up fails, the replica exits with 0xDEAD — by design. Restart after fixing the cause (network, JWT mismatch, version skew).

5

Verify

curl https://workspace-replica-a.example.com/api/graph/replica
# {"isReplica":true,"hasReplicas":false,"replicaUID":"…","replicaName":"…"}

curl https://workspace-primary.example.com/api/graph/replica
# {"isReplica":false,"hasReplicas":true}

hasReplicas: true on the primary confirms it's tracking changes for at least one replica.
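
The same check can be automated. A sketch that assumes only the response shapes shown above (HTTP fetching is omitted; pass in the parsed JSON bodies):

```python
def replication_healthy(primary_status: dict, replica_status: dict) -> bool:
    """True when the replica identifies as a replica and the primary
    reports that it is tracking at least one replica."""
    return (replica_status.get("isReplica") is True
            and primary_status.get("hasReplicas") is True)
```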

Routing reads to replicas

The front-end and the C# SDK both support a replica URL list and route reads across it round-robin. Two ways to configure it:

Inline in the workspace settings

In Admin → Workspace Configuration, set the ReplicaURLs list. The setting propagates to clients on next page load.

Via the front-end global

Embed the list as a global JavaScript variable on the served page:

<script>
  window.MSKREPLICAURLS = "https://workspace-replica-a.example.com;https://workspace-replica-b.example.com";
</script>

The front-end picks this up at startup and splits on ;. Each entry is health-checked before use; failed replicas are retired and re-tested at intervals of 5–30 minutes.
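
The parsing and retirement behaviour can be approximated as follows (a sketch of the described behaviour; the real front-end logic is internal, and the re-test scheduling is omitted):

```python
def parse_replica_urls(raw: str) -> list[str]:
    """Split a ;-separated MSKREPLICAURLS value, dropping empty entries."""
    return [u.strip() for u in raw.split(";") if u.strip()]

def healthy_pool(urls, is_healthy):
    """Keep only replicas that pass a health check. Failed entries are
    retired from the pool and would be re-tested on a later cycle."""
    return [u for u in urls if is_healthy(u)]
```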

Behind a load balancer

If you sit a load balancer in front of the workspace, terminate two paths:

  • /api/** calls marked ReadOnlyCompatible() → primary + replicas, weighted however you want.
  • Everything else → primary only.

Most teams use the SDK / front-end replica list rather than a load balancer because the routing logic (and health checks) is already there.

Health checks

| Endpoint | Returns |
| --- | --- |
| GET /api/graph/replica | { isReplica, hasReplicas, replicaUID?, replicaName? }. |
| GET /api/graph/counters | Document and edge counts. Use to compare replica vs primary for lag detection. |
| GET /api/admin/metrics/* | Internal metrics, including replica status and replication errors. |

A simple "is the replica caught up?" check is to compare node counts between primary and replica. For stricter monitoring, compare the last-applied sequence number — exposed in the internal metrics — against the primary's current sequence.
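
A sequence-based lag check might look like this (the alert threshold and the way the two sequence numbers are obtained are assumptions for illustration):

```python
def replica_lag(primary_seq: int, replica_seq: int) -> int:
    """Outstanding WAL entries; 0 means the replica is fully caught up."""
    return max(0, primary_seq - replica_seq)

def lag_alert(primary_seq: int, replica_seq: int, threshold: int = 1000) -> bool:
    """Fire when the replica trails the primary by more than `threshold`."""
    return replica_lag(primary_seq, replica_seq) > threshold
```

Tune the threshold to your write rate: under the ~250 ms tail interval, a healthy replica's lag should hover near a fraction of a second's worth of writes.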

Lag, errors, and failure modes

| Symptom | Likely cause |
| --- | --- |
| Replica refuses to start with 0xDEAD immediately | JWT mismatch, primary unreachable, or version skew. Check the log line before exit. |
| Replica catches up but lag grows over time | Replica is underpowered. Match the primary's hardware spec. |
| Replica returns stale results | Expected within the ~250 ms tail interval; under high write load, lag can grow. |
| Replica restarts unexpectedly | Process or host crash. The primary detects this and re-initiates the catch-up flow. |
| hasReplicas: false on the primary despite a running replica | Replica isn't registered. Inspect the replica's startup log for registration errors. |
| Write requests succeed against a replica | They shouldn't — replicas reject writes at the HTTP layer. If they don't, you're hitting the primary by accident. |

Operational recommendations

  • Minimum two replicas in production. Single-replica setups give no read-load headroom during a replica failure.
  • Geographic placement. Put replicas in the same network zone as the read traffic they serve. Cross-region replication works but adds latency to every WAL batch.
  • Same hardware tier as primary. Replicas rebuild indexes locally; underpowered replicas fall behind silently.
  • JWT rotation. When rotating MSK_JWT_KEY, do it in a maintenance window. The current implementation does not support overlapping keys; replicas must restart after the rotation.
  • Don't replicate to dev/staging. Replicas are for production read scaling, not for environment promotion. Use Workspace export/import for that.
  • Backup is still your responsibility. Replicas are not backups. A logical corruption on the primary replicates to every replica within milliseconds. Maintain real backups per Backup and restore.

Failover

Curiosity does not yet ship an automatic primary→replica promotion mechanism. If the primary fails:

  1. Pick the most up-to-date replica (highest applied WAL sequence).
  2. Stop it.
  3. Restart it without MSK_REPLICA and without MSK_PRIMARY_ADDRESS. It boots as a normal primary.
  4. Update DNS / load-balancer config to point clients at the new primary.
  5. Tear down the old primary (or, after repair, re-introduce it as a replica).

Plan and rehearse this procedure. The first time you do it should be in a fire drill, not in a real outage.
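
Picking the candidate in step 1 is a simple max over applied WAL sequence numbers. A sketch (the replica names and how you collect each replica's sequence are placeholders):

```python
def promotion_candidate(replicas: dict[str, int]) -> str:
    """Pick the most up-to-date replica for promotion: the one with the
    highest last-applied WAL sequence number. Keys are replica names or
    URLs, values their last applied sequence."""
    return max(replicas, key=replicas.get)
```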

Cost model

| Resource | Per-replica cost vs primary |
| --- | --- |
| CPU | 70–100% of primary (replicas rebuild indexes; the workload is similar). |
| RAM | 70–100% of primary. |
| Disk | ≈ 100% of primary (full copy of the data plus indexes). |
| Network | Outbound from primary scales with write volume × N replicas. |
| Embedding / LLM API calls | Zero — replicas do not call external providers; they receive embeddings via the WAL. |

© 2026 Curiosity. All rights reserved.