Performance
The defaults are tuned for "thousands to hundreds of thousands of records". For larger workloads — millions of nodes, multi-GB JSON files, full backfills — a handful of knobs make a 10× difference.
Auto-commit threshold
By default, CommitPendingAsync is yours to call. For long-running ingest loops, let the library auto-flush when the buffer crosses a threshold:
graph.SetAutoCommitCost(everyNodes: 10_000);
A larger threshold:
- amortizes commit overhead better (fewer HTTP round-trips),
- but loses more work on a crash before the next commit.
10_000 is a reasonable starting point. Tune up if your batches are small; tune down if your records are large (file contents, long bodies).
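As a sketch of what this looks like in a long-running ingest loop, using the SetAutoCommitCost call above (here FetchPageAsync is a hypothetical paged source API and Map a hypothetical per-record mapper), the only manual commit left is the final flush:

graph.SetAutoCommitCost(everyNodes: 10_000);

string? cursor = null;
do
{
    // FetchPageAsync is a stand-in for your source's paged read
    var (records, nextCursor) = await source.FetchPageAsync(cursor);
    foreach (var record in records)
    {
        Map(graph, record); // buffered; the library flushes each time the threshold is crossed
    }
    cursor = nextCursor;
} while (cursor is not null);

await graph.CommitPendingAsync(); // flush whatever is left under the threshold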
Pause indexing for backfills
The search index updates after every commit. For an initial backfill of millions of records, you don't need the index to be fresh until the end:
graph.PauseIndexing("initial-backfill");
try
{
    await IngestEverythingAsync(graph);
    await graph.CommitPendingAsync();
}
finally
{
    graph.ResumeIndexing("initial-backfill"); // index rebuilds once
}
The string argument is a label: multiple paused operations can coexist, and indexing resumes only when every label has been released. Always call ResumeIndexing in a finally block so a thrown exception doesn't leave the workspace stuck with stale search results.
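A sketch of the label semantics (the label strings and the two ingest helpers are hypothetical):

graph.PauseIndexing("backfill-devices");
graph.PauseIndexing("backfill-parts");

await IngestDevicesAsync(graph);
graph.ResumeIndexing("backfill-devices"); // index still paused: "backfill-parts" is outstanding

await IngestPartsAsync(graph);
graph.ResumeIndexing("backfill-parts");   // last label released: the index rebuilds here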
Use Node.FromKey to skip reads
Linking by key avoids a fetch entirely:
// FAST — link stored by key, becomes active when both nodes exist
graph.Link(
    partNode,
    Node.FromKey(nameof(Device), deviceName),
    Edges.PartOf,
    Edges.HasPart);
Versus the read-first pattern, which costs one round-trip per link:
// SLOW — fetch the device node first
var device = await graph.GetNodeByKeyAsync(nameof(Device), deviceName);
graph.Link(partNode, device, Edges.PartOf, Edges.HasPart);
Reserve the read-first version for the rare case where you need to inspect a property before deciding to link.
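A sketch of that rare case (the "Status" property and the GetString accessor are illustrative assumptions, not part of the examples above):

// Read first only when the link depends on node state
var device = await graph.GetNodeByKeyAsync(nameof(Device), deviceName);
if (device is not null && device.GetString("Status") != "decommissioned") // hypothetical property accessor
{
    graph.Link(partNode, device, Edges.PartOf, Edges.HasPart);
}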
Stream large source files
Don't load the whole source into memory:
using System.Text.Json;

using var stream = File.OpenRead("dump.json");
await foreach (var record in JsonSerializer.DeserializeAsyncEnumerable<Record>(stream))
{
    if (record is null) continue; // DeserializeAsyncEnumerable yields nullable elements
    Map(graph, record);
}
This keeps peak memory bounded regardless of source file size.
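If the dump is newline-delimited JSON rather than one large array, the same bounded-memory property holds with a line-by-line reader (a sketch; Record and Map as above, and the .ndjson filename is illustrative):

using System.Text.Json;

using var reader = new StreamReader("dump.ndjson");
string? line;
while ((line = await reader.ReadLineAsync()) is not null)
{
    if (line.Length == 0) continue; // tolerate blank lines
    var record = JsonSerializer.Deserialize<Record>(line);
    if (record is null) continue;
    Map(graph, record);
}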
Parallel ingestion (with care)
If the source supports concurrent reads, you can ingest in parallel. Two gotchas:
- Write conflicts on shared nodes. Two workers calling AddOrUpdate for the same key race; the last writer wins on properties. For most sources this is fine, but track it (the sharding sketch below shows one way to avoid it).
- Connection budget. Each parallel worker holds one IGraph connection. Don't outscale the workspace's connection pool: start with 4 and measure.
var workers = Enumerable.Range(0, 4).Select(async i =>
{
    using var g = Graph.Connect(/* same endpoint/token, different connectorName per worker */);
    await IngestShardAsync(g, shardIndex: i, totalShards: 4);
});
await Task.WhenAll(workers);
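One way to sidestep the write-conflict gotcha is deterministic sharding: each record is owned by exactly one worker, chosen from a stable hash of its key. A sketch of IngestShardAsync under that scheme (FetchAllAsync is a hypothetical streaming read of the whole source, Map the usual mapper, and each Record is assumed to carry a string Key):

static async Task IngestShardAsync(IGraph g, int shardIndex, int totalShards)
{
    await foreach (var record in FetchAllAsync())
    {
        // Mask off the sign bit so the modulo is always non-negative
        var owner = (record.Key.GetHashCode() & int.MaxValue) % totalShards;
        if (owner != shardIndex) continue; // another worker owns this record

        Map(g, record);
    }
    await g.CommitPendingAsync();
}

With ownership decided per record, nodes derived from a record are only ever written by that record's worker; shared lookup nodes can still race, so keep their mapping idempotent.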
Profile before optimizing
Most connectors are I/O-bound — the source API is the bottleneck, not Curiosity. Before tuning:
using System.Diagnostics;

var sw = Stopwatch.StartNew();
var batch = await source.FetchAsync(...);
Console.WriteLine($"fetch {sw.ElapsedMilliseconds}ms ({batch.Count} records)");

sw.Restart();
foreach (var r in batch) Map(graph, r);
Console.WriteLine($"map {sw.ElapsedMilliseconds}ms");

sw.Restart();
await graph.CommitPendingAsync();
Console.WriteLine($"commit {sw.ElapsedMilliseconds}ms");
If commit dominates, raise the auto-commit threshold; if fetch dominates, parallelize the source side; if map dominates, you've written a CPU-heavy mapper (rare).