
Ingestion Pipelines

An ingestion pipeline is the operational workflow that keeps your workspace data correct over time. A pipeline is made up of one or more connectors (the code), scheduled tasks (the trigger), and enrichment steps (NLP, embeddings, derived edges).

This page covers the lifecycle, the standard templates you can adapt, and the operational concerns that come with running pipelines in production.

Lifecycle

flowchart LR
    Design[Design schema] --> Initial[Initial load]
    Initial --> Validate[Validate counts and edges]
    Validate --> Incremental[Incremental sync]
    Incremental --> Enrich[Enrichment]
    Enrich --> Incremental
    Incremental --> Reconcile[Periodic reconciliation]
    Reconcile --> Incremental

A pipeline is never "done" — it has a steady state of incremental syncs punctuated by periodic reconciliation.

Standard templates

Full load

When: one-time bootstrap, small datasets that fit in memory, rebuilds after schema changes.

Shape:

await foreach (var row in source.StreamAllAsync())
{
    graph.TryAdd(/* ... */);
    if (++count % 500 == 0) await graph.CommitPendingAsync();
}
await graph.CommitPendingAsync();

Risks: expensive on large sources; can produce duplicates if the source emits the same record twice. Always run with stable keys.

Incremental sync (watermark-based)

When: most production pipelines.

Shape:

var since = await state.LoadCursorAsync();
var next  = DateTimeOffset.UtcNow;

await foreach (var row in source.StreamSinceAsync(since))
{
    graph.TryAdd(/* ... */);
    if (++count % 500 == 0) await graph.CommitPendingAsync();
}
await graph.CommitPendingAsync();
await state.SaveCursorAsync(next);

Notes: the cursor advances only on success. Save state at the end of the run, not in the middle.

Change-feed / webhook

When: near-real-time requirements, source supports a change feed.

Shape: a long-running worker that consumes events, upserts nodes, and commits every N events or every M seconds — whichever comes first. Idempotent writers handle out-of-order delivery.
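As a sketch of that commit-every-N-or-M pattern — `StreamEventsAsync` and the event mapping are assumptions about your source's change-feed client, not a built-in API:

```csharp
// Long-running change-feed worker: commit every 500 events or every
// 10 seconds, whichever comes first. Upserts by stable key stay idempotent
// even when the feed delivers events out of order or more than once.
var pending   = 0;
var lastFlush = DateTimeOffset.UtcNow;

await foreach (var evt in source.StreamEventsAsync(cancellationToken))
{
    graph.TryAdd(/* map evt to a node keyed by its stable source id */);

    pending++;
    if (pending >= 500 || DateTimeOffset.UtcNow - lastFlush > TimeSpan.FromSeconds(10))
    {
        await graph.CommitPendingAsync();
        pending   = 0;
        lastFlush = DateTimeOffset.UtcNow;
    }
}
await graph.CommitPendingAsync();
```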

Delete sync (reconciliation)

When: deletes in source matter (compliance, off-boarding, archival).

Shape: periodic scan that compares source primary keys to graph nodes; soft- or hard-deletes the difference. Run hourly/daily; less frequently if your deletes are rare.

var sourceKeys = new HashSet<string>(await source.ListAllKeysAsync());
await foreach (var node in graph.AllOfTypeAsync("Ticket"))
{
    var id = node.GetString("Id");
    if (!sourceKeys.Contains(id))
        graph.RemoveNode(node.UID);
}
await graph.CommitPendingAsync();

Backfill

When: filling in newly added properties, re-embedding after a model switch, applying NLP to historical content.

Shape: a scheduled task that walks the graph in pages, updates each affected node, and commits. Bounded by Take() and a stored progress cursor so multiple runs cover the whole graph without overlap.
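A minimal sketch of one such run, using the paging query API shown elsewhere on this page. `Skip()`, the offset-cursor helpers, and the embedding call are illustrative assumptions:

```csharp
// Process one bounded page per run; the stored offset makes repeated runs
// cover the whole graph without overlap.
var offset = await state.LoadBackfillOffsetAsync();
var page   = await Q().StartAt("Ticket").Skip(offset).Take(1000).EmitAsync("N");

foreach (var ticket in page)
{
    // Example backfill work: re-embed after a model switch.
    ticket.Set("Embedding", await Embeddings.ComputeAsync(ticket.GetString("Body")));
}
await graph.CommitPendingAsync();

// Advance only after a successful commit; reset once the last page is reached.
await state.SaveBackfillOffsetAsync(page.Count < 1000 ? 0 : offset + page.Count);
```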

Enrichment

When: derived signals — entity links, similarity edges, computed aggregates.

Shape: a scheduled task that walks recent (or all) nodes, applies a deterministic transformation, and writes derived edges or properties.

foreach (var ticket in await Q().StartAt("Ticket").Take(1000).EmitAsync("N"))
{
    var entities = await NLP.ExtractAsync(ticket.GetString("Body"));
    foreach (var e in entities)
    {
        var entityNode = graph.TryAdd(new Entity { Name = e.Canonical });
        graph.Link(ticket, entityNode, "MentionsEntity");
    }
}
await graph.CommitPendingAsync();

Triggers and scheduling

| Trigger | Latency | Operational cost | Best for |
| --- | --- | --- | --- |
| Cron / scheduled task | Minutes to hours | Lowest | Most pipelines |
| External orchestrator (Airflow, Prefect, GitHub Actions) | Same as cron | Adds an external system | Multi-stage DAGs that span systems |
| Webhook / event stream | Seconds | Highest (always-on worker) | Near-real-time UX |
| Manual | n/a | None | One-off imports |

For schedule-based ingestion that lives inside the workspace, see Scheduled Tasks. For external orchestrators, see Pipeline Orchestration.

Idempotency: the cardinal rule

A pipeline must be safe to run twice. Concretely:

  • Re-running over the same window must not change node or edge counts.
  • A failure mid-run must not corrupt the cursor.
  • A duplicate input row must not produce two nodes.

The cheap test: run the connector twice in a row against the same source state. Count nodes and edges before and after. Differences mean the pipeline is not idempotent.
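That test is easy to script. `RunConnectorAsync`, `CountNodesAsync`, and `CountEdgesAsync` are illustrative names for your connector entry point and whatever count queries your workspace exposes:

```csharp
// Run the connector twice against a frozen source state; counts must not move.
await RunConnectorAsync();
var nodes1 = await graph.CountNodesAsync();
var edges1 = await graph.CountEdgesAsync();

await RunConnectorAsync();
var nodes2 = await graph.CountNodesAsync();
var edges2 = await graph.CountEdgesAsync();

if (nodes1 != nodes2 || edges1 != edges2)
    throw new Exception(
        $"Not idempotent: nodes {nodes1} -> {nodes2}, edges {edges1} -> {edges2}");
```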

Retries and dead letters

Transient failures (network, rate limits, brief source unavailability) should retry with exponential backoff. Persistent failures (malformed records, schema mismatches) should land in a dead-letter queue, not crash the run.

async Task<T> WithRetry<T>(Func<Task<T>> op)
{
    var delay = TimeSpan.FromSeconds(2);
    for (var attempt = 1; ; attempt++)
    {
        try { return await op(); }
        catch (Exception ex) when (attempt < 5 && IsTransient(ex))
        {
            // Exponential backoff with random jitter, capped at 60 seconds per attempt.
            await Task.Delay(delay + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 500)));
            delay = TimeSpan.FromSeconds(Math.Min(delay.TotalSeconds * 2, 60));
        }
    }
}

For dead letters: keep a side store of records that failed N times. Surface a metric and a daily report so they get attention.
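One way to sketch that path — the `deadLetters` store and the `DeadLetter` shape are assumptions for illustration, not a built-in API:

```csharp
// After retries are exhausted, park the record in a side store
// instead of failing the whole run.
try
{
    await WithRetry(() => IngestAsync(record));
}
catch (Exception ex)
{
    await deadLetters.SaveAsync(new DeadLetter
    {
        Key      = record.Key,
        Payload  = record.RawJson,
        Error    = ex.Message,
        FailedAt = DateTimeOffset.UtcNow,
    });
}
```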

Observability

Every pipeline run should emit:

  • start/end timestamps and duration;
  • counts processed, written, skipped, failed;
  • per-stage timings (read, transform, commit);
  • the resulting cursor;
  • the top N error reasons.

Forward these to your monitoring stack. The workspace's Monitoring page has the built-in dashboards; extend them with your own metrics endpoint if your pipelines run outside the workspace.
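A run summary can be one structured record per run; the shape below is a suggestion that mirrors the bullet list above, not a built-in type:

```csharp
// Emitted once per pipeline run and forwarded to the monitoring stack.
public sealed record PipelineRunReport(
    string         Pipeline,
    DateTimeOffset StartedAt,
    DateTimeOffset EndedAt,
    int            Processed,
    int            Written,
    int            Skipped,
    int            Failed,
    IReadOnlyDictionary<string, TimeSpan> StageTimings, // read / transform / commit
    string         ResultingCursor,
    IReadOnlyList<string> TopErrors);
```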

Secrets

All credentials — source-side API keys, the Curiosity ingestion token — live in a secret manager. Pipelines read them from env vars at startup. Never commit them to source. See Token scopes and Security.
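For example, failing fast at startup rather than mid-run (the variable name is illustrative):

```csharp
// Read the ingestion token from the environment; refuse to start without it.
var token = Environment.GetEnvironmentVariable("CURIOSITY_INGESTION_TOKEN")
            ?? throw new InvalidOperationException(
                   "CURIOSITY_INGESTION_TOKEN is not set");
```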

Promoting through environments

Same as application code:

  1. Develop against a local Workspace.
  2. Validate in staging with a representative subset of source data.
  3. Promote to production after counts, latencies, and error rates look right in staging.
  4. Roll back by reverting the connector binary and replaying from a cursor before the bad run.

See Deployment.

Operational checklist

  • Pipeline is idempotent (verified by running twice).
  • Cursor advances forward only and survives a mid-run crash.
  • Deletes in source are reflected in graph within one reconciliation cycle.
  • Retries on transient failures; dead-letter for persistent failures.
  • Metrics: count, duration, error rate, cursor lag.
  • Alerts: ingestion failure rate > 5%, cursor lag > expected SLA.
  • Secrets in a secret manager, not in source.
  • Connector binary versioned and pinned in CI.
  • Backup tested before any breaking schema change.

Next steps

© 2026 Curiosity. All rights reserved.