Connectors
A connector is an external program (typically a long-running C# console app) that reads from a source system and writes nodes/edges/ACLs into a Curiosity Workspace using the Curiosity.Library SDK. Connectors are the canonical way to keep a workspace in sync with the truth in your source systems.
If your data lives somewhere standard and you don't need custom mapping, the built-in integrations in the UI may be sufficient. Connectors give you full control over schemas, keys, edges, and ACL ingestion — which you'll want as soon as the data shape matters.
Connector lifecycle
A well-formed connector run does five things:
- Authenticate to the workspace with an API token scoped to ingestion.
- Register schemas (idempotent).
- Read deltas from the source (initial sync = full; subsequent runs = incremental).
- Upsert nodes and edges with stable keys.
- Commit in bounded batches; record a cursor for the next run.
Available built-in integrations
Configurable from Settings → Integrations, no code required:
- Filesystem — local files and network shares.
- Web — crawl and index public/internal websites.
- Database — JDBC-style connections to PostgreSQL, MySQL, SQL Server, etc.
- SaaS connectors — popular business systems (Slack, Jira, ServiceNow, Confluence, Microsoft 365, Google Drive, and others depending on your license).
For systems not in the list — or when you need custom mapping, custom keys, or ACL ingestion that the built-in connector doesn't model — build a custom connector.
Minimal connector (C#)
The smallest end-to-end connector that ingests typed entities with edges and ACLs:
using Curiosity.Library;
[Node]
public class Customer
{
[Key] public string Id { get; set; }
[Property] public string Name { get; set; }
[Property] public string Tier { get; set; }
}
[Node]
public class Ticket
{
[Key] public string Id { get; set; }
[Property] public string Subject { get; set; }
[Property] public string Body { get; set; }
[Timestamp] public DateTimeOffset CreatedAt { get; set; }
}
public static class Edges
{
public const string HasTicket = nameof(HasTicket);
public const string TicketOf = nameof(TicketOf);
}
using var workspace = await Workspace.ConnectAsync(
baseUrl: Environment.GetEnvironmentVariable("WORKSPACE_URL"),
apiToken: Environment.GetEnvironmentVariable("WORKSPACE_TOKEN"));
var graph = workspace.Graph;
await graph.CreateNodeSchemaAsync<Customer>();
await graph.CreateNodeSchemaAsync<Ticket>();
await graph.CreateEdgeSchemaAsync(typeof(Edges));
var enterprise = await graph.CreateTeamAsync("Enterprise Support", "Enterprise customers");
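// 'source' and 'lastCursor' below are placeholders for your own source-system client and its persisted sync cursor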
await foreach (var row in source.StreamSinceAsync(lastCursor))
{
var customer = graph.TryAdd(new Customer { Id = row.CustomerId, Name = row.CustomerName, Tier = row.Tier });
var ticket = graph.TryAdd(new Ticket { Id = row.TicketId, Subject = row.Subject, Body = row.Body, CreatedAt = row.CreatedAt });
graph.Link(customer, ticket, Edges.HasTicket, Edges.TicketOf);
if (row.Tier == "Enterprise")
graph.RestrictAccessToTeam(ticket, enterprise);
if (row.Index % 500 == 0)
await graph.CommitPendingAsync();
}
await graph.CommitPendingAsync();
await source.SaveCursorAsync();
Run it with a scoped API token:
export WORKSPACE_URL=http://localhost:8080
export WORKSPACE_TOKEN=<ingestion-scoped token>
dotnet run --project FirstApp.Connector
For an end-to-end developer walkthrough (with NLP extraction, embeddings, and a UI), see Build your first enterprise AI app.
Connector responsibilities
A production-grade connector needs to do all of these. None are optional in real environments:
| Responsibility | What "good" looks like |
|---|---|
| Schemas | Registered once at startup; evolution handled with versioned migrations. |
| Keys | Stable IDs from source. Never random. Never depend on row order. |
| Edges | Created explicitly with named edge types. Both directions when readability matters. |
| ACLs | RestrictAccessToTeam / RestrictAccessToUser mirroring source-system permissions. |
| Commits | Batched (100–500 items). One CommitPendingAsync() per batch; one final flush. |
| Cursors | A persistent watermark (timestamp + sequence) so reruns are idempotent. |
| Deletes | Tombstone propagation, or periodic reconciliation against source. |
| Observability | Per-batch counts, durations, error counts. Failures surface clearly. |
| Retries | Exponential backoff on transient failures (network, 429, 5xx); see the sketch after this table. |
| Secrets | API token + source credentials in a secret manager, never in source. |
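For the retries row above, a minimal backoff wrapper, assuming transient failures surface as exceptions; the isTransient predicate and the IsTransientError name in the usage line are your own code, not part of the SDK:
// Minimal exponential-backoff wrapper (1s, 2s, 4s, ...) around any async call, e.g. a batch commit.
// 'isTransient' is your own check for retryable failures (network errors, HTTP 429/5xx).
static async Task RetryAsync(Func<Task> action, Func<Exception, bool> isTransient, int maxAttempts = 5)
{
    for (var attempt = 1; ; attempt++)
    {
        try { await action(); return; }
        catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
        {
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
        }
    }
}
// Usage: await RetryAsync(() => graph.CommitPendingAsync(), IsTransientError);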
Permission ingestion patterns
ACLs are why a workspace connector is fundamentally different from simply pumping data into a search index. You typically have one of three permission shapes:
Source-mirrored ACLs (recommended)
Read the source's permission model (groups, sharing rules, projects) and call RestrictAccessToTeam / RestrictAccessToUser to mirror it. Membership changes flow on the next run.
foreach (var share in row.Shares)
{
var team = await graph.CreateTeamAsync(share.GroupName, share.GroupDescription);
graph.RestrictAccessToTeam(ticket, team);
}
Tier-based ACLs
The source doesn't have a permission model, but you have a known segmentation rule (free vs paid, region, business unit).
if (row.Tier == "Enterprise")
graph.RestrictAccessToTeam(ticket, enterpriseTeam);
Public-by-default with explicit private overrides
Most content is public; a small subset is restricted.
if (row.IsConfidential)
graph.RestrictAccessToTeam(ticket, restrictedTeam);
// else: default visibility = public
See Access Control Model and Permission model architecture.
Incremental sync patterns
| Pattern | When to use | Notes |
|---|---|---|
| Full refresh | Small datasets, weekly runs | Simplest. Expensive at scale. |
| Watermark-based incremental | Sources with reliable timestamps | Pull updated_after = <last cursor>. Most common pattern. |
| Change-feed / webhook | Sources with native change feeds | Near-real-time. Most complex; needs idempotent writers. |
| Reconciliation pass | Anywhere deletes are critical | Periodic full scan that tombstones missing records. |
Whatever pattern you pick, the writes must be idempotent: re-running the connector should not create duplicates or change node counts.
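For the watermark pattern, a minimal sketch of loading and saving the cursor between runs; the SyncCursor shape, file location, and JSON format are illustrative choices, not SDK requirements:
using System.Text.Json;

// A watermark cursor: last-seen timestamp plus a sequence number to break ties.
public record SyncCursor(DateTimeOffset UpdatedAfter, long Sequence);

public static class CursorStore
{
    public static SyncCursor Load(string path) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<SyncCursor>(File.ReadAllText(path))!
            : new SyncCursor(DateTimeOffset.MinValue, 0);

    public static void Save(string path, SyncCursor cursor) =>
        File.WriteAllText(path, JsonSerializer.Serialize(cursor));
}

// Only advance the cursor after a successful commit; because keys are stable,
// replaying the same window on a re-run cannot create duplicate nodes.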
Delete handling
The graph engine doesn't auto-delete data when source rows disappear. You have to do it explicitly. Three workable approaches:
- Tombstone column in source. Soft-delete in graph (row.IsDeleted = true).
- Reconciliation pass comparing source primary keys to graph nodes; delete the difference (sketched below).
- Audit-driven deletes triggered by source webhooks.
Hard-delete nodes with graph.RemoveNode(uid) if you need them gone (vs. soft-deleted). Both forms remove them from search.
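A sketch of the reconciliation approach, assuming two hypothetical helpers (GetSourceIdsAsync and GetIngestedTicketsAsync) that you would implement yourself, for example by querying the source and keeping a record of ingested keys and node UIDs next to the cursor:
// Reconciliation pass: anything ingested earlier that no longer exists in the source is removed.
var sourceIds = await GetSourceIdsAsync();        // hypothetical: primary keys currently in the source
var ingested  = await GetIngestedTicketsAsync();  // hypothetical: (sourceId, nodeUid) pairs written so far

foreach (var (sourceId, nodeUid) in ingested)
{
    if (!sourceIds.Contains(sourceId))
        graph.RemoveNode(nodeUid);                // hard delete; or flag a tombstone property instead
}
await graph.CommitPendingAsync();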
Connector testing checklist
- Schema registration is idempotent (run twice; no errors).
- Re-running ingestion does not change node/edge counts.
- Cursor advances forward only.
- Source credentials and the workspace API token are read from env vars (or a secret manager), never embedded.
- Deletes in source are reflected in the workspace within one run cycle.
- An end-user test account sees the data it should and only that data.
- Failure modes (source down, API token expired, body parse error) surface as exceptions with enough context to debug.
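For the first two checks, a sketch of a smoke test against a disposable test workspace; RunConnectorAsync and CountNodesAsync are hypothetical helpers standing in for your connector entry point and whatever counting query you have available:
// Run the connector twice over the same source data and assert the second run changes nothing.
await RunConnectorAsync();                        // hypothetical: one full connector run
var first = await CountNodesAsync("Ticket");      // hypothetical: count nodes of a given type

await RunConnectorAsync();                        // second run over unchanged source data
var second = await CountNodesAsync("Ticket");

if (first != second)
    throw new Exception($"Ingestion is not idempotent: {first} -> {second} Ticket nodes");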
Connector packaging
- Local dev: dotnet run.
- CI / scheduled job: package as a self-contained dotnet publish or a small Docker image.
- Inside Kubernetes: a CronJob or a long-running Deployment with a sidecar.
- From within the workspace: as a Scheduled Task. For light, periodic ingestion, this avoids deploying a separate service.
Common pitfalls
- Unstable keys cause duplicate nodes on every run. The single most common ingestion bug.
- Missing edges make the graph unusable for navigation, faceting, and graph-scoped search.
- Ingesting unstructured text into one giant property — split into appropriate fields so search and embeddings can do their job.
- No ACL ingestion — every user sees every record. Set RestrictAccessTo* from day one.
- Unbounded commits — calling CommitPendingAsync() once at the end of a million-row run will use too much memory. Commit in batches.
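For the first pitfall, a minimal illustration of unstable vs. stable keys, reusing the Ticket entity from the example above:
// Bad: a fresh key every run, so every run creates new duplicate nodes.
graph.TryAdd(new Ticket { Id = Guid.NewGuid().ToString(), Subject = row.Subject });

// Good: the source system's own primary key, stable across runs, so re-ingestion hits the same node.
graph.TryAdd(new Ticket { Id = row.TicketId, Subject = row.Subject });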
Next steps
- Walk a complete tutorial: Build your first enterprise AI app.
- Design your schema first: Schema Design.
- Operationalize ingestion: Ingestion Pipelines and Pipeline Orchestration.
- Implement a custom connector with the SDK: Workspace Customization → Data Connector.
- Reference: Token scopes for the right API token, Schema Reference for the schema attributes.