# Deploying on Kubernetes
Kubernetes is the recommended platform for production Curiosity Workspace deployments. The image is the same one used in Docker; what changes is how persistence, secrets, and traffic ingress are wired up.
This page walks through a complete StatefulSet-based deployment with TLS at an Ingress, plus options for Helm and Kustomize.
## Why a StatefulSet
Workspace is a stateful, single-writer application. It holds the graph in memory and persists it to disk under `MSK_GRAPH_STORAGE`. A StatefulSet gives you:

- a stable pod name (`curiosity-0`) you can reference in backups and runbooks;
- a `PersistentVolumeClaim` that follows the pod across reschedules;
- ordered, controlled restarts on updates.

A Deployment with a `ReadWriteOnce` PVC also works for a single replica, but a StatefulSet is the convention.
## Prerequisites
- A working Kubernetes cluster (`kubectl` context configured).
- A storage class that supports `ReadWriteOnce` block storage. AWS EBS, Azure Disk, and GCP Persistent Disk all work.
- An Ingress controller (NGINX, Traefik, ALB, AGIC, etc.) and a way to terminate TLS in front of the workspace.
- A secret store (sealed secrets, External Secrets Operator, Vault Agent, AWS Secrets Manager integration, …).
- 16+ GB RAM available on the target node, ideally with anti-affinity rules to keep the pod off noisy neighbors.
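The anti-affinity hint above can be sketched as a pod-spec fragment. This is a sketch, not part of the manifest below; the `tier: batch` label is a hypothetical marker for noisy workloads, so adjust the selector to match how your own workloads are labeled.

```yaml
# Sketch: nudge the scheduler to keep the workspace pod off nodes that
# already run pods labeled tier=batch (hypothetical label for noisy jobs).
# Goes under .spec.template.spec of the StatefulSet.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: tier
                operator: In
                values: [batch]
          topologyKey: kubernetes.io/hostname
```

`preferred…` keeps scheduling soft: if no other node fits, the pod still runs rather than staying Pending.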
## Complete manifest
This manifest deploys one Workspace pod with persistent storage, a Service, an Ingress with TLS, secrets, healthchecks, and resource requests. Save it as `curiosity.yaml` and apply with `kubectl apply -f`.
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: curiosity
---
apiVersion: v1
kind: Secret
metadata:
  name: curiosity-secrets
  namespace: curiosity
type: Opaque
stringData:
  MSK_ADMIN_PASSWORD: <generate-with-secret-store>
  MSK_JWT_KEY: <32+-bytes-from-secret-store>
  MSK_GRAPH_MASTER_KEY: <32+-bytes-from-secret-store>
  MSK_LICENSE: <your-license-token>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: curiosity-config
  namespace: curiosity
data:
  MSK_GRAPH_STORAGE: "/data/curiosity"
  MSK_GRAPH_BACKUP_FOLDER: "/data/backups"
  MSK_GRAPH_JOURNAL_FOLDER: "/data/journal"
  MSK_PUBLIC_ADDRESS: "https://workspace.example.com"
  MSK_USE_HSTS: "true"
  MSK_REDIRECT_TO_HTTPS: "false" # TLS is terminated at the Ingress
---
apiVersion: v1
kind: Service
metadata:
  name: curiosity
  namespace: curiosity
spec:
  selector:
    app: curiosity
  ports:
    - name: http
      port: 8080
      targetPort: 8080
  clusterIP: None # headless: gives the pod a stable DNS name
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: curiosity
  namespace: curiosity
spec:
  serviceName: curiosity
  replicas: 1
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: OnDelete # require an explicit delete to roll the pod
  selector:
    matchLabels:
      app: curiosity
  template:
    metadata:
      labels:
        app: curiosity
    spec:
      terminationGracePeriodSeconds: 120
      securityContext:
        runAsUser: 10000
        runAsGroup: 10000
        fsGroup: 10000
        runAsNonRoot: true
      containers:
        - name: curiosity
          image: curiosityai/curiosity:v1.42.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          envFrom:
            - configMapRef:
                name: curiosity-config
            - secretRef:
                name: curiosity-secrets
          volumeMounts:
            - name: data
              mountPath: /data
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 32Gi
          readinessProbe:
            httpGet:
              path: /api/login/check
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 6
          livenessProbe:
            httpGet:
              path: /api/login/check
              port: http
            initialDelaySeconds: 120
            periodSeconds: 30
            failureThreshold: 5
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: gp3 # change per platform
        resources:
          requests:
            storage: 200Gi
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: curiosity
  namespace: curiosity
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [workspace.example.com]
      secretName: curiosity-tls
  rules:
    - host: workspace.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: curiosity
                port:
                  number: 8080
```
Apply, then watch readiness:
```bash
kubectl apply -f curiosity.yaml
kubectl -n curiosity rollout status statefulset/curiosity
kubectl -n curiosity get pods,svc,ingress
```
## Tuning per platform
| Platform | Recommended `storageClassName` | Notes |
|---|---|---|
| AWS EKS (EC2 nodes) | `gp3` (EBS CSI) | Provision the EBS CSI driver. See AWS. |
| AWS EKS (Fargate) | — | Fargate has no EBS — fall back to EFS (`ReadWriteMany`) and a Deployment. |
| Azure AKS | `managed-csi` | Default storage class works. See Azure. |
| Google GKE | `premium-rwo` or `standard-rwo` | Use the regional disk classes for HA. See GCP. |
| OpenShift | `ocs-storagecluster-ceph-rbd` (or platform default) | Add an appropriate SecurityContextConstraints object. See OpenShift. |
## Helm and Kustomize
Curiosity does not yet ship an official Helm chart. Two community-friendly approaches:
- **Kustomize** — keep the manifest above as a base, and overlay per environment with `kustomization.yaml` patches that change `image:`, `replicas:`, secrets, and resource limits.
- **Helm** — wrap the manifest in a chart of your own. The variables that change between environments are: image tag, storage class and size, ingress hostname, secret references, and resource requests/limits.
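A minimal Kustomize overlay along these lines might look as follows. The directory layout and the `v1.43.0` tag are assumptions for illustration, not files shipped by Curiosity:

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # the manifest above, saved under base/
images:
  - name: curiosityai/curiosity
    newTag: v1.43.0       # pin this environment's image tag here
patches:
  - target:
      kind: StatefulSet
      name: curiosity
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 64Gi
```

Render with `kubectl kustomize overlays/prod` to inspect, or `kubectl apply -k overlays/prod` to deploy.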
## Operations
- **Backups:** snapshot the PVC via your CSI's `VolumeSnapshot`, or rely on `MSK_GRAPH_BACKUP_FOLDER` written to a separate PVC that you can replicate off-cluster. See Backup and restore.
- **Logs:** the container writes to stdout/stderr; collect via your platform's aggregator (CloudWatch, Stackdriver, ELK).
- **Metrics:** per-endpoint and per-tool metrics are available via `/api/endpoints/metrics` and `/api/chatai/tools/metrics` (admin-scoped). See Monitoring.
- **Scaling:** the Workspace runs as a single writer per graph. Scale up the pod (more CPU/RAM) before scaling out.
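As a sketch of the snapshot option, assuming your CSI driver exposes a `VolumeSnapshotClass` named `csi-snapclass` (a placeholder; list yours with `kubectl get volumesnapshotclass`):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: curiosity-manual-snapshot
  namespace: curiosity
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder: use your class
  source:
    # PVC created by the volumeClaimTemplate: <template name>-<pod name>
    persistentVolumeClaimName: data-curiosity-0
```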
## Upgrades
Roll the pod by setting a new image tag and deleting the existing pod (because `updateStrategy` is `OnDelete`):
```bash
kubectl -n curiosity set image statefulset/curiosity \
  curiosity=curiosityai/curiosity:v1.43.0
kubectl -n curiosity delete pod curiosity-0
```
The new pod starts, attaches the existing PVC, and replays its journal before becoming ready. Always take a snapshot first and follow Upgrades and migrations.
## Common pitfalls
- **`ReadWriteMany` storage with this manifest** — the workspace is a single writer. Pin to `ReadWriteOnce` block storage unless you're explicitly running on a shared filesystem like EFS, and even then use a `Deployment` with 1 replica.
- **PVC that's too small** — graph storage grows with data + indexes + embeddings. Resize the PVC ahead of capacity events; expanding a PVC online requires the storage class to allow it (`allowVolumeExpansion: true`).
- **Missing `MSK_PUBLIC_ADDRESS`** — when behind an Ingress, generated links (SSO callbacks, email links) will be wrong without this set.
- **Aggressive liveness probe** — boot can take a minute or two on cold caches. Keep `initialDelaySeconds` generous (120s) to avoid restart loops.
- **`RollingUpdate` with one replica** — the new pod can't attach to the PVC while the old one holds it. Use `OnDelete` or set `maxUnavailable: 1`.
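For the PVC-resize pitfall, a storage class that permits online expansion looks roughly like this sketch (shown for the AWS EBS CSI driver; the provisioner and parameters vary per platform):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true   # without this, PVC resize requests are rejected
```

With expansion allowed, growing the volume is just an edit to the PVC's `spec.resources.requests.storage`.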
## See also
- Configuration reference — every `MSK_*` variable in the manifest.
- Deployment — production readiness across all platforms.
- Backup and restore — VolumeSnapshot patterns.
- Troubleshooting — when something doesn't come up.