# Ingestion Pipelines

In Curiosity Workspace, an “ingestion pipeline” is the operational workflow that gets data from the outside world into your workspace and keeps it correct over time.

Pipelines can be implemented as:

  • connectors (code-based ingestion with full control)
  • configured integrations (when the source system is supported and configuration is sufficient)
  • scheduled tasks (for periodic sync, reindexing, or enrichment)

## A standard ingestion pipeline lifecycle

### 1) Design

  • define node/edge schemas and keys
  • decide which fields will be searchable (text and/or embeddings)
  • decide which relationships are necessary for navigation and filtering
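
The Python sketch below illustrates the kind of decisions made at this stage for a hypothetical ticketing source. The `NodeSchema`/`EdgeSchema` classes, node types, and field names are assumptions made for this example only, not a Curiosity Workspace API.

```python
from dataclasses import dataclass, field

# Illustrative schema description only; not a Curiosity Workspace API.
@dataclass
class NodeSchema:
    name: str
    key_field: str                                              # stable unique key for upserts
    text_fields: list[str] = field(default_factory=list)        # full-text searchable
    embedding_fields: list[str] = field(default_factory=list)   # embedded for semantic search

@dataclass
class EdgeSchema:
    name: str
    source: str
    target: str

SCHEMA = {
    "nodes": [
        NodeSchema("Ticket", key_field="ticket_id",
                   text_fields=["title", "description"],
                   embedding_fields=["description"]),
        NodeSchema("Person", key_field="email", text_fields=["name"]),
    ],
    "edges": [
        EdgeSchema("reported-by", source="Ticket", target="Person"),
    ],
}
```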

### 2) Initial load

  • ingest the historical dataset
  • validate counts and relationships
  • build initial search/vector indexes
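
A minimal sketch of an initial load, assuming a `source` client for the external system and a `graph` client for workspace ingestion; the method names on both are placeholders, not real APIs. The point is the shape of the step: batch the historical data in, then validate counts before indexing.

```python
import logging

log = logging.getLogger("ingestion.initial_load")

def initial_load(source, graph, batch_size=1000):
    """Ingest the full historical dataset in batches, then validate counts."""
    ingested = 0
    for batch in source.read_all(batch_size=batch_size):   # assumed source API
        graph.upsert_nodes("Ticket", batch)                 # assumed ingestion API
        ingested += len(batch)
        log.info("ingested %d records so far", ingested)

    # Compare source and workspace counts before building search/vector indexes.
    expected, actual = source.count(), graph.count_nodes("Ticket")
    if expected != actual:
        raise RuntimeError(f"count mismatch: source={expected}, workspace={actual}")
```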

### 3) Incremental updates

  • ingest deltas (new/updated/deleted records)
  • keep keys stable and updates idempotent
  • reindex where required
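
A sketch of a delta sync under the same placeholder clients as above, plus a `state` dictionary used as a checkpoint. The property that matters is idempotency: every record is addressed by its stable key, so replaying the same delta overwrites rather than duplicates.

```python
from datetime import datetime, timezone

def sync_deltas(source, graph, state):
    """Apply new/updated/deleted records since the last successful run."""
    since = state.get("last_sync")                     # checkpoint from the previous run
    started = datetime.now(timezone.utc)

    for record in source.changed_since(since):         # assumed source API
        if record.get("deleted"):
            graph.delete_node("Ticket", key=record["ticket_id"])
        else:
            # Upsert keyed on ticket_id: re-running is an overwrite, not a duplicate.
            graph.upsert_node("Ticket", key=record["ticket_id"], data=record)

    state["last_sync"] = started                       # advance the checkpoint only on success
```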

### 4) Enrichment (optional)

  • run NLP pipelines and entity capture
  • link entities into the graph
  • compute derived edges (e.g., clusters, “similar to” relationships) if your use case requires it
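
The sketch below shows the general shape of an enrichment pass, reusing the hypothetical schema from the Design step; `ner_model` stands in for whichever NLP pipeline you run, and all method names are assumptions.

```python
def enrich(graph, ner_model):
    """Capture entities from ticket descriptions and link them into the graph."""
    for ticket in graph.iter_nodes("Ticket"):                      # assumed ingestion API
        for entity in ner_model.extract(ticket["description"]):    # assumed NLP API
            # Each captured entity becomes (or reuses) a node keyed on its surface form...
            graph.upsert_node("Entity", key=entity.text, data={"label": entity.label})
            # ...and is linked back to the ticket it was found in.
            graph.upsert_edge("mentions", source=ticket["ticket_id"], target=entity.text)
```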

## Operational considerations

  • Idempotency: re-running a pipeline should not create duplicates.
  • Backfills: when schemas change, plan for reprocessing.
  • Observability: log ingestion counts, errors, and time spent per stage.
  • Secrets: store tokens/credentials in a secret manager, not in code.
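
Two of these concerns, observability and secrets, are cheap to get right from the start. The sketch below is one generic way to do it in Python; the environment variable name and stage names are assumptions for this example.

```python
import logging
import os
import time
from contextlib import contextmanager

log = logging.getLogger("ingestion")

@contextmanager
def stage(name):
    """Log how long a pipeline stage took and whether it failed."""
    start = time.monotonic()
    try:
        yield
        log.info("stage %s finished in %.1fs", name, time.monotonic() - start)
    except Exception:
        log.exception("stage %s failed after %.1fs", name, time.monotonic() - start)
        raise

# Credentials come from the environment (populated by your secret manager),
# never from source code.
API_TOKEN = os.environ.get("SOURCE_API_TOKEN")

with stage("initial_load"):
    pass  # e.g. initial_load(source, graph)
```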

## Triggers and scheduling

Common triggers:

  • schedule (hourly/daily)
  • webhook/event stream (near real-time)
  • manual (for one-off imports or backfills)

See APIs & Extensibility → Scheduled Tasks.
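
As a generic illustration of these triggers (not the Scheduled Tasks feature itself), the sketch below wraps a pipeline function in a simple hourly loop; `pipeline` is a placeholder callable.

```python
import time

def run_on_schedule(pipeline, interval_seconds=3600):
    """Hourly schedule, the simplest trigger.

    A webhook/event-stream trigger would instead call pipeline() from an HTTP
    or queue handler as events arrive; a manual trigger just calls it once.
    """
    while True:
        pipeline()                        # e.g. lambda: sync_deltas(source, graph, state)
        time.sleep(interval_seconds)
```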

## Next steps