Read-only replicas
A read-only replica is a second Curiosity Workspace process that follows a primary, applies its writes in near-real-time, and serves read traffic. It exists to take search and graph-query load off the primary and to provide a warm standby for failover.
This page covers the architecture, the environment flags, how to bring a replica online, and the operational expectations. For multi-region disaster recovery and snapshot-based DR, see Backup and restore.
Licensed feature
Replicas require the Replicas feature flag on your license. Check Admin → License before planning a deployment. If the feature is disabled, the replica process will refuse to start.
How it works
The replica protocol has three phases:
- Register. On startup, the replica generates a unique `ReplicaUID` and registers itself with the primary over HTTPS (the primary's normal API port). Authentication uses a shared `MSK_JWT_KEY` between primary and replica.
- Catch up. The replica streams the primary's underlying RocksDB files over gRPC (TCP port `42999`) into its own local storage. This is a one-shot file copy that happens once per replica lifetime.
- Tail. The replica subscribes to the primary's write-ahead log and applies new batches sequentially, typically every 250 ms. It periodically reports the last applied sequence number back to the primary so lag can be observed.
There is no rsync of backup snapshots, no shared block storage, and no requirement that primary and replica share a filesystem. Every replica keeps its own complete copy of the data.
What replicas serve
| Traffic | Where it goes |
|---|---|
| Search queries | Replica (round-robin) or primary. |
| Graph queries (`Q().StartAt(…).Emit()`) | Replica or primary. |
| Node reads (`GET /api/node/{uid}`) | Replica or primary. |
| Embedding queries (`/api/embeddings/*`) | Replica or primary. |
| Writes (any `AddOrUpdate`, `Link`, `Delete`) | Primary only. Replicas reject write requests. |
| Schema changes | Primary only. |
| Admin operations (license, SSO config, …) | Primary only. |
| File uploads | Primary only. |
| Endpoint execution | Replica if the endpoint is marked read-only; primary otherwise. |
| AI tool calls | Same as endpoints. |
Endpoints can be marked read-only-safe in their code; the front-end and SDK call `ReadOnlyCompatible()` on those requests to allow routing to replicas. Anything that mutates the graph calls `ForcePrimary()` and is routed to the primary.
Indexing on replicas
Replicas rebuild text and vector indexes locally as the WAL stream arrives — they do not copy index files from the primary. Two consequences:
- A freshly-promoted replica is fully self-sufficient. Failing the primary over to it does not lose indexes.
- The replica's CPU and disk-I/O cost is comparable to the primary's. Replicas are not "cheap read mirrors" — size them like primaries.
When the primary adds a new index, it pushes a `LoadNewIndexOnReplica` notification over gRPC; the replica picks it up and begins building locally. The same applies to index removal.
Environment flags
On the replica
| Variable | Required | Value | Notes |
|---|---|---|---|
| `MSK_REPLICA` | yes | `true` | Switches the process into read-only mode. |
| `MSK_PRIMARY_ADDRESS` | yes | Full URL of the primary, e.g. `https://workspace-primary.example.com` | Used for initial registration and WAL streaming. |
| `MSK_JWT_KEY` | yes | Shared secret string | Must match the primary's `MSK_JWT_KEY` exactly. Used for inter-node auth. |
| `MSK_GRAPH_STORAGE` | yes | Local path | Where the replica stores its copy of the data. Must be on local SSD/NVMe. |
| `MSK_PUBLIC_ADDRESS` | yes | Public URL the replica is reachable at | Used in registration so the primary can call the replica back. |
| `MSK_SERVER_ADDRESS` | recommended | Internal URL for primary→replica calls | Falls back to `MSK_PUBLIC_ADDRESS` if unset. |
| `MSK_GRAPH_MASTER_KEY` | if encryption is on | Same value as primary | Replicas cannot read encrypted data without the same key. |
| `MSK_LICENSE` | yes | License that includes the Replicas feature | Replicas validate their own license at startup. |
| Other `MSK_*` settings | as needed | Provider keys, SSO, TLS, etc. | Most settings can be set on both. The replica reads configuration from the graph after the initial catch-up. |
On the primary
The primary doesn't need a flag to enable replication — it accepts replica registrations as soon as one shows up with a valid JWT. What it does need:
| Variable | Required | Notes |
|---|---|---|
| `MSK_JWT_KEY` | yes | Same value the replicas will present. |
| `MSK_PUBLIC_ADDRESS` | yes | Reachable from each replica. |
Make sure TCP port 42999 is open from each replica to the primary — that's the gRPC channel the WAL stream uses. It's internal-only; never expose it on the public internet.
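As a concrete sketch, assuming a Linux primary managed with ufw and placeholder replica IPs (both the tool and the addresses are assumptions, not part of the product):

```bash
# On the primary: allow the WAL gRPC port only from known replica hosts.
# ufw and the 10.0.1.x addresses are assumptions; substitute your own
# firewall tooling and your replicas' internal IPs.
ufw allow from 10.0.1.11 to any port 42999 proto tcp   # replica A
ufw allow from 10.0.1.12 to any port 42999 proto tcp   # replica B
ufw deny 42999/tcp                                     # block everyone else
```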
CORS on the primary
When a replica registers, the primary automatically adds the replica's public address to its CORS origin list. No manual configuration needed.
Bringing a replica online
Step-by-step
Provision a host
Size it like a primary: same CPU class, same RAM, NVMe storage. A replica that's smaller than the primary will fall behind under load.
Install the workspace binary
Use the same version as the primary. A replica more than one minor version older or newer than the primary will refuse to start. See Upgrades and migrations → Version compatibility.
Set the replica-specific env vars
export MSK_REPLICA=true
export MSK_PRIMARY_ADDRESS=https://workspace-primary.example.com
export MSK_JWT_KEY="$(cat /etc/curiosity/jwt-key)" # same as primary
export MSK_GRAPH_STORAGE=/var/lib/curiosity-replica
export MSK_PUBLIC_ADDRESS=https://workspace-replica-a.example.com
export MSK_LICENSE="$(cat /etc/curiosity/license)"
If the primary uses graph encryption, also set `MSK_GRAPH_MASTER_KEY` to the same value.
Start the process
systemctl start curiosity-workspace
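If the service runs under systemd, as the start command above suggests, the log can be followed with journald (unit name taken from that command; adjust if yours differs):

```bash
journalctl -u curiosity-workspace -f
```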
Tail the log. You should see, in order:
Registering replica with primary at https://workspace-primary.example.com
Catching up with primary (streaming RocksDB files)…
Loading read-only graph from storage
Replica online; tailing WAL from sequence …
If catch-up fails, the replica exits with `0xDEAD` — by design. Restart after fixing the cause (network, JWT mismatch, version skew).
Verify
curl https://workspace-replica-a.example.com/api/graph/replica
# {"isReplica":true,"hasReplicas":false,"replicaUID":"…","replicaName":"…"}
curl https://workspace-primary.example.com/api/graph/replica
# {"isReplica":false,"hasReplicas":true}
`hasReplicas: true` on the primary confirms it's tracking changes for at least one replica.
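For a scripted version of the same check, something like the following works (field names taken from the responses above; requires jq):

```bash
# Fail loudly if the host is not actually running in replica mode.
replica=https://workspace-replica-a.example.com
if [ "$(curl -sf "$replica/api/graph/replica" | jq -r .isReplica)" != "true" ]; then
  echo "ERROR: $replica is not in replica mode" >&2
  exit 1
fi
```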
Routing reads to replicas
The front-end and the C# SDK both support a replica URL list and route reads to it round-robin. Two ways to configure it:
Inline in the workspace settings
In Admin → Workspace Configuration, set the `ReplicaURLs` list. The setting propagates to clients on next page load.
Via the front-end global
Embed the list as a global JavaScript variable on the served page:
<script>
window.MSKREPLICAURLS = "https://workspace-replica-a.example.com;https://workspace-replica-b.example.com";
</script>
The front-end picks this up at startup and splits on `;`. Each entry is health-checked before use; failed replicas are taken out of rotation and re-tested on a 5–30 minute interval.
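For external monitoring you can approximate that health check yourself; a minimal sketch against the documented status endpoint (the replica URLs are placeholders):

```bash
# Probe each replica; any failed or slow request counts as "down".
for r in https://workspace-replica-a.example.com \
         https://workspace-replica-b.example.com; do
  if curl -sf --max-time 2 "$r/api/graph/replica" >/dev/null; then
    echo "$r: up"
  else
    echo "$r: down"
  fi
done
```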
Behind a load balancer
If you sit a load balancer in front of the workspace, terminate two paths:
- `/api/**` calls marked `ReadOnlyCompatible()` → primary + replicas, weighted however you want.
- Everything else → primary only.
Most teams use the SDK / front-end replica list rather than a load balancer because the routing logic (and health checks) is already there.
Health checks
| Endpoint | Returns |
|---|---|
| `GET /api/graph/replica` | `{ isReplica, hasReplicas, replicaUID?, replicaName? }` |
| `GET /api/graph/counters` | Document and edge counts. Use to compare replica vs primary for lag detection. |
| `GET /api/admin/metrics/*` | Internal metrics, including replica status and replication errors. |
A simple "is the replica caught up?" check is to compare node counts between primary and replica. For stricter monitoring, compare the last-applied sequence number — exposed in the internal metrics — against the primary's current sequence.
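A sketch of the node-count comparison, assuming `/api/graph/counters` returns JSON with a node-count field (the `.nodes` field name below is an assumption; check your actual payload):

```bash
# Coarse lag signal: compare node counts between primary and replica.
primary=https://workspace-primary.example.com
replica=https://workspace-replica-a.example.com

p=$(curl -sf "$primary/api/graph/counters" | jq -r '.nodes')   # field name assumed
r=$(curl -sf "$replica/api/graph/counters" | jq -r '.nodes')

echo "primary=$p replica=$r behind=$((p - r))"
```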
Lag, errors, and failure modes
| Symptom | Likely cause |
|---|---|
| Replica refuses to start with `0xDEAD` immediately | JWT mismatch, primary unreachable, or version skew. Check the log line before exit. |
| Replica catches up but lag grows over time | Replica is underpowered. Match the primary's hardware spec. |
| Replica returns stale results | Expected within the ~250 ms tail interval; under high write load, lag can grow. |
| Replica restarts unexpectedly | Process or host crash. The primary detects this and re-initiates the catch-up flow. |
| `hasReplicas: false` on the primary despite a running replica | Replica isn't registered. Inspect the replica's startup log for registration errors. |
| Write requests succeed against a replica | They shouldn't — replicas reject writes at the HTTP layer. If they don't, you're hitting the primary by accident. |
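To rule out the last symptom, send a deliberately mutating request to the replica and confirm it is rejected. The path below is purely a placeholder; use any write endpoint from your own deployment:

```bash
# A replica should answer any write with a non-2xx status.
code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
  https://workspace-replica-a.example.com/api/your-write-endpoint)   # placeholder path
echo "replica answered $code (a 2xx here means you are actually hitting the primary)"
```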
Operational recommendations
- Minimum two replicas in production. Single-replica setups give no read-load headroom during a replica failure.
- Geographic placement. Put replicas in the same network zone as the read traffic they serve. Cross-region replication works but adds latency to every WAL batch.
- Same hardware tier as primary. Replicas rebuild indexes locally; underpowered replicas fall behind silently.
- JWT rotation. When rotating `MSK_JWT_KEY`, do it in a maintenance window. The current implementation does not support overlapping keys; replicas must restart after the rotation (a sketch follows this list).
- Don't replicate to dev/staging. Replicas are for production read scaling, not for environment promotion. Use Workspace export/import for that.
- Backup is still your responsibility. Replicas are not backups. A logical corruption on the primary replicates to every replica within milliseconds. Maintain real backups per Backup and restore.
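A rotation sketch for the JWT recommendation above, assuming the key lives in /etc/curiosity/jwt-key (as in the setup example earlier on this page) and the service runs under systemd:

```bash
# 1. Generate the new shared key and distribute it to the primary and every replica.
openssl rand -hex 32 > /etc/curiosity/jwt-key      # repeat (or copy) on each host
# 2. Restart the primary, then each replica. With no overlapping-key support,
#    replicas cannot reconnect until both sides hold the new key.
systemctl restart curiosity-workspace              # run on every host
```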
Failover
Curiosity does not yet ship an automatic primary→replica promotion mechanism. If the primary fails:
- Pick the most up-to-date replica (highest applied WAL sequence).
- Stop it.
- Restart it without `MSK_REPLICA` and without `MSK_PRIMARY_ADDRESS`. It boots as a normal primary (see the sketch below).
- Update DNS / load-balancer config to point clients at the new primary.
- Tear down the old primary (or, after repair, re-introduce it as a replica).
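A sketch of the restart step, assuming the environment lives in a file such as /etc/curiosity/workspace.env (the file path and the use of systemd are assumptions about your install):

```bash
# On the chosen replica: strip the replica flags, then boot as a primary.
systemctl stop curiosity-workspace
sed -i '/^MSK_REPLICA=/d; /^MSK_PRIMARY_ADDRESS=/d' /etc/curiosity/workspace.env
systemctl start curiosity-workspace
```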
Plan and rehearse this procedure. The first time you do it should be in a fire drill, not in a real outage.
Cost model
| Resource | Per-replica cost vs primary |
|---|---|
| CPU | 70–100% of primary (replicas rebuild indexes; the workload is similar). |
| RAM | 70–100% of primary. |
| Disk | ≈ 100% of primary (full copy of the data plus indexes). |
| Network | Outbound from primary scales with write volume × N replicas. |
| Embedding / LLM API calls | Zero — replicas do not call external providers; they receive embeddings via the WAL. |
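As an illustrative calculation: a primary producing 5 MB/s of WAL with three replicas sends roughly 5 × 3 = 15 MB/s of replication traffic outbound (figures invented for the example).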
Where to go next
- Backup and restore — replicas are not a substitute for backups.
- Scaling — when to add replicas vs. scale the primary vertically.
- Upgrades and migrations — version compatibility and rolling upgrades.
- Configuration reference — full `MSK_*` flag list.
- Monitoring — replica metrics in the built-in dashboards.