Read-only replicas
A read-only replica is a second Curiosity Workspace process that follows a primary, applies its writes in near-real-time, and serves read traffic. It exists to take search and graph-query load off the primary and to provide a warm standby for failover.
This page covers the architecture, the environment flags, how to bring a replica online, and the operational expectations. For multi-region disaster recovery and snapshot-based DR, see Backup and restore.
Licensed feature
Replicas require the Replicas feature flag on your license. Check Admin → License before planning a deployment. If the feature is disabled, the replica process will refuse to start.
How it works
The replica protocol has three phases:
- Register. On startup, the replica generates a unique `ReplicaUID` and registers itself with the primary over HTTPS (the primary's normal API port). Authentication uses a shared `MSK_JWT_KEY` between primary and replica.
- Catch up. The replica streams the primary's underlying RocksDB files over gRPC (TCP port `42999`) into its own local storage. This is a one-shot file copy that happens once per replica lifetime.
- Tail. The replica subscribes to the primary's write-ahead log and applies new batches sequentially, typically every 250 ms. It periodically reports the last applied sequence number back to the primary so lag can be observed.
There is no rsync of backup snapshots, no shared block storage, and no requirement that primary and replica share a filesystem. Every replica keeps its own complete copy of the data.
What replicas serve
| Traffic | Where it goes |
|---|---|
| Search queries | Replica (round-robin) or primary. |
| Graph queries (`Q().StartAt(…).Emit()`) | Replica or primary. |
| Node reads (`GET /api/node/{uid}`) | Replica or primary. |
| Embedding queries (`/api/embeddings/*`) | Replica or primary. |
| Writes (any `AddOrUpdate`, `Link`, `Delete`) | Primary only. Replicas reject write requests. |
| Schema changes | Primary only. |
| Admin operations (license, SSO config, …) | Primary only. |
| File uploads | Primary only. |
| Endpoint execution | Replica if the endpoint is marked read-only; primary otherwise. |
| AI tool calls | Same as endpoints. |
Endpoints can be marked read-only-safe in their code; the front-end and SDK call `ReadOnlyCompatible()` on those requests to allow routing to replicas. Anything that mutates the graph calls `ForcePrimary()` and is routed to the primary.
Indexing on replicas
Replicas rebuild text and vector indexes locally as the WAL stream arrives — they do not copy index files from the primary. Two consequences:
- A freshly-promoted replica is fully self-sufficient. Failing the primary over to it does not lose indexes.
- The replica's CPU and disk-I/O cost is comparable to the primary's. Replicas are not "cheap read mirrors" — size them like primaries.
When the primary adds a new index, it pushes a `LoadNewIndexOnReplica` notification over gRPC; the replica picks it up and begins building locally. The same applies to index removal.
Environment flags
On the replica
| Variable | Required | Value | Notes |
|---|---|---|---|
| `MSK_REPLICA` | yes | `true` | Switches the process into read-only mode. |
| `MSK_PRIMARY_ADDRESS` | yes | Full URL of the primary, e.g. `https://workspace-primary.example.com` | Used for initial registration and WAL streaming. |
| `MSK_JWT_KEY` | yes | Shared secret string | Must match the primary's `MSK_JWT_KEY` exactly. Used for inter-node auth. |
| `MSK_GRAPH_STORAGE` | yes | Local path | Where the replica stores its copy of the data. Must be on local SSD/NVMe. |
| `MSK_PUBLIC_ADDRESS` | yes | Public URL the replica is reachable at | Used in registration so the primary can call the replica back. |
| `MSK_SERVER_ADDRESS` | recommended | Internal URL for primary→replica calls | Falls back to `MSK_PUBLIC_ADDRESS` if unset. |
| `MSK_GRAPH_MASTER_KEY` | if encryption is on | Same value as primary | Replicas cannot read encrypted data without the same key. |
| `MSK_LICENSE` | yes | License that includes the Replicas feature | Replicas validate their own license at startup. |
| Other `MSK_*` settings | as needed | Provider keys, SSO, TLS, etc. | Most settings can be set on both. The replica reads configuration from the graph after the initial catch-up. |
On the primary
The primary doesn't need a flag to enable replication — it accepts replica registrations as soon as one shows up with a valid JWT. What it does need:
| Variable | Required | Notes |
|---|---|---|
| `MSK_JWT_KEY` | yes | Same value the replicas will present. |
| `MSK_PUBLIC_ADDRESS` | yes | Reachable from each replica. |
Make sure TCP port 42999 is open from each replica to the primary — that's the gRPC channel the WAL stream uses. It's internal-only; never expose it on the public internet.
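As a concrete sketch, assuming a Linux primary managed with ufw and placeholder replica IPs (both the tool and the addresses are assumptions, not part of the product):

```bash
# On the primary: allow the WAL gRPC port only from known replica hosts.
# ufw and the 10.0.1.x addresses are assumptions; substitute your own
# firewall tooling and your replicas' internal IPs.
ufw allow from 10.0.1.11 to any port 42999 proto tcp   # replica A
ufw allow from 10.0.1.12 to any port 42999 proto tcp   # replica B
ufw deny 42999/tcp                                     # block everyone else
```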
CORS on the primary
When a replica registers, the primary automatically adds the replica's public address to its CORS origin list. No manual configuration needed.
Bringing a replica online
Step-by-step
Provision a host
Size it like a primary: same CPU class, same RAM, NVMe storage. A replica that's smaller than the primary will fall behind under load.
Install the workspace binary
Use the same version as the primary. A replica more than one minor version older or newer than the primary will refuse to start. See Upgrades and migrations → Version compatibility.
Set the replica-specific env vars
export MSK_REPLICA=true
export MSK_PRIMARY_ADDRESS=https://workspace-primary.example.com
export MSK_JWT_KEY="$(cat /etc/curiosity/jwt-key)" # same as primary
export MSK_GRAPH_STORAGE=/var/lib/curiosity-replica
export MSK_PUBLIC_ADDRESS=https://workspace-replica-a.example.com
export MSK_LICENSE="$(cat /etc/curiosity/license)"
If the primary uses graph encryption, also set `MSK_GRAPH_MASTER_KEY` to the same value.
Start the process
systemctl start curiosity-workspace
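If the service runs under systemd, as the start command above suggests, the log can be followed with journald (unit name taken from that command; adjust if yours differs):

```bash
journalctl -u curiosity-workspace -f
```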
Tail the log. You should see, in order:
Registering replica with primary at https://workspace-primary.example.com
Catching up with primary (streaming RocksDB files)…
Loading read-only graph from storage
Replica online; tailing WAL from sequence …
If catch-up fails, the replica exits with `0xDEAD` — by design. Restart after fixing the cause (network, JWT mismatch, version skew).
Verify
curl https://workspace-replica-a.example.com/api/graph/replica
# {"isReplica":true,"hasReplicas":false,"replicaUID":"…","replicaName":"…"}
curl https://workspace-primary.example.com/api/graph/replica
# {"isReplica":false,"hasReplicas":true}
`hasReplicas: true` on the primary confirms it's tracking changes for at least one replica.
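For a scripted version of the same check, something like the following works (field names taken from the responses above; requires jq):

```bash
# Fail loudly if the host is not actually running in replica mode.
replica=https://workspace-replica-a.example.com
if [ "$(curl -sf "$replica/api/graph/replica" | jq -r .isReplica)" != "true" ]; then
  echo "ERROR: $replica is not in replica mode" >&2
  exit 1
fi
```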
Routing reads to replicas
The front-end and the C# SDK both support a replica URL list and route reads to it round-robin. Two ways to configure it:
Inline in the workspace settings
In Admin → Workspace Configuration, set the `ReplicaURLs` list. The setting propagates to clients on next page load.
Via the front-end global
Embed the list as a global JavaScript variable on the served page:
<script>
window.MSKREPLICAURLS = "https://workspace-replica-a.example.com;https://workspace-replica-b.example.com";
</script>
The front-end picks this up at startup and splits on `;`. Each entry is health-checked before use; failed replicas are taken out of rotation and re-tested on a 5–30 minute interval.
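For external monitoring you can approximate that health check yourself; a minimal sketch against the documented status endpoint (the replica URLs are placeholders):

```bash
# Probe each replica; any failed or slow request counts as "down".
for r in https://workspace-replica-a.example.com \
         https://workspace-replica-b.example.com; do
  if curl -sf --max-time 2 "$r/api/graph/replica" >/dev/null; then
    echo "$r: up"
  else
    echo "$r: down"
  fi
done
```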
Behind a load balancer
If you sit a load balancer in front of the workspace, terminate two paths:
- `/api/**` calls marked `ReadOnlyCompatible()` → primary + replicas, weighted however you want.
- Everything else → primary only.
Most teams use the SDK / front-end replica list rather than a load balancer because the routing logic (and health checks) is already there.
Health checks
| Endpoint | Returns |
|---|---|
| `GET /api/graph/replica` | `{ isReplica, hasReplicas, replicaUID?, replicaName? }` |
| `GET /api/graph/counters` | Document and edge counts. Use to compare replica vs primary for lag detection. |
| `GET /api/admin/metrics/*` | Internal metrics, including replica status and replication errors. |
A simple "is the replica caught up?" check is to compare node counts between primary and replica. For stricter monitoring, compare the last-applied sequence number — exposed in the internal metrics — against the primary's current sequence.
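A sketch of the node-count comparison, assuming `/api/graph/counters` returns JSON with a node-count field (the `.nodes` field name below is an assumption; check your actual payload):

```bash
# Coarse lag signal: compare node counts between primary and replica.
primary=https://workspace-primary.example.com
replica=https://workspace-replica-a.example.com

p=$(curl -sf "$primary/api/graph/counters" | jq -r '.nodes')   # field name assumed
r=$(curl -sf "$replica/api/graph/counters" | jq -r '.nodes')

echo "primary=$p replica=$r behind=$((p - r))"
```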
Lag, errors, and failure modes
| Symptom | Likely cause |
|---|---|
| Replica refuses to start with `0xDEAD` immediately | JWT mismatch, primary unreachable, or version skew. Check the log line before exit. |
| Replica catches up but lag grows over time | Replica is underpowered. Match the primary's hardware spec. |
| Replica returns stale results | Expected within the ~250 ms tail interval; under high write load, lag can grow. |
| Replica restarts unexpectedly | Process or host crash. The primary detects this and re-initiates the catch-up flow. |
| `hasReplicas: false` on the primary despite a running replica | Replica isn't registered. Inspect the replica's startup log for registration errors. |
| Write requests succeed against a replica | They shouldn't — replicas reject writes at the HTTP layer. If they don't, you're hitting the primary by accident. |
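To rule out the last symptom, send a deliberately mutating request to the replica and confirm it is rejected. The path below is purely a placeholder; use any write endpoint from your own deployment:

```bash
# A replica should answer any write with a non-2xx status.
code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
  https://workspace-replica-a.example.com/api/your-write-endpoint)   # placeholder path
echo "replica answered $code (a 2xx here means you are actually hitting the primary)"
```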
Operational recommendations
- Minimum two replicas in production. Single-replica setups give no read-load headroom during a replica failure.
- Geographic placement. Put replicas in the same network zone as the read traffic they serve. Cross-region replication works but adds latency to every WAL batch.
- Same hardware tier as primary. Replicas rebuild indexes locally; underpowered replicas fall behind silently.
- JWT rotation. When rotating `MSK_JWT_KEY`, do it in a maintenance window. The current implementation does not support overlapping keys; replicas must restart after the rotation (a sketch follows this list).
- Don't replicate to dev/staging. Replicas are for production read scaling, not for environment promotion. Use Workspace export/import for that.
- Backup is still your responsibility. Replicas are not backups. A logical corruption on the primary replicates to every replica within milliseconds. Maintain real backups per Backup and restore.
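A rotation sketch for the JWT recommendation above, assuming the key lives in /etc/curiosity/jwt-key (as in the setup example earlier on this page) and the service runs under systemd:

```bash
# 1. Generate the new shared key and distribute it to the primary and every replica.
openssl rand -hex 32 > /etc/curiosity/jwt-key      # repeat (or copy) on each host
# 2. Restart the primary, then each replica. With no overlapping-key support,
#    replicas cannot reconnect until both sides hold the new key.
systemctl restart curiosity-workspace              # run on every host
```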
Failover
Curiosity does not yet ship an automatic primary→replica promotion mechanism. If the primary fails:
- Pick the most up-to-date replica (highest applied WAL sequence).
- Stop it.
- Restart it without `MSK_REPLICA` and without `MSK_PRIMARY_ADDRESS`. It boots as a normal primary (see the sketch below).
- Update DNS / load-balancer config to point clients at the new primary.
- Tear down the old primary (or, after repair, re-introduce it as a replica).
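A sketch of the restart step, assuming the environment lives in a file such as /etc/curiosity/workspace.env (the file path and the use of systemd are assumptions about your install):

```bash
# On the chosen replica: strip the replica flags, then boot as a primary.
systemctl stop curiosity-workspace
sed -i '/^MSK_REPLICA=/d; /^MSK_PRIMARY_ADDRESS=/d' /etc/curiosity/workspace.env
systemctl start curiosity-workspace
```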
Plan and rehearse this procedure. The first time you do it should be in a fire drill, not in a real outage.
Cost model
| Resource | Per-replica cost vs primary |
|---|---|
| CPU | 70–100% of primary (replicas rebuild indexes; the workload is similar). |
| RAM | 70–100% of primary. |
| Disk | ≈ 100% of primary (full copy of the data plus indexes). |
| Network | Outbound from primary scales with write volume × N replicas. |
| Embedding / LLM API calls | Zero — replicas do not call external providers; they receive embeddings via the WAL. |
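As an illustrative calculation: a primary producing 5 MB/s of WAL with three replicas sends roughly 5 × 3 = 15 MB/s of replication traffic outbound (figures invented for the example).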
Where to go next
- Backup and restore — replicas are not a substitute for backups.
- Scaling — when to add replicas vs. scale the primary vertically.
- Upgrades and migrations — version compatibility and rolling upgrades.
- Configuration reference — full `MSK_*` flag list.
- Monitoring — replica metrics in the built-in dashboards.