Curiosity

RSS / Atom recipe

Source: RssSample/ · one or more RSS / Atom feeds. Built on System.ServiceModel.Syndication so all common dialects share a single code path.

Owns in the academic graph: university news items, feeds, authors, categories.

What it teaches

  • Dedup by entry ID (guid for RSS, id for Atom) using a persisted HashSet<string>.
  • Polling pattern with a "seen" file for cross-run state.
  • Partial-success ingestion — feed-level failures don't take down the whole run.
  • Optional heuristic feed → university mapping (the kind of domain logic that lives in your Ingest method).
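The partial-success point above can be sketched as a per-feed try/catch. This is a minimal illustration, not the sample's actual code: the PartialSuccess class, the RunAllAsync method, and the ingestFeed delegate are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper (not part of RssSample): isolates failures per feed.
public static class PartialSuccess
{
    // Runs ingestFeed once per feed. A failing feed is recorded and skipped,
    // so the remaining feeds are still processed.
    public static async Task<List<(string Feed, Exception Error)>> RunAllAsync(
        IEnumerable<string> feeds, Func<string, Task> ingestFeed)
    {
        var failures = new List<(string Feed, Exception Error)>();
        foreach (var feed in feeds)
        {
            try
            {
                await ingestFeed(feed);
            }
            // Cancellation is allowed to propagate; everything else is feed-local.
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                failures.Add((feed, ex));
            }
        }
        return failures;
    }
}
```

The caller can log the returned failures and still exit successfully when at least one feed was ingested; whether that is the right exit policy depends on your pipeline.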

The dedup + polling loop

var seen = new SeenEntryStore(seenPath);  // file-backed HashSet<string>

foreach (var (spec, display, university) in feeds)
{
    var newInFeed = 0;
    var dupInFeed = 0;

    await foreach (var entry in source.ReadAsync(spec))
    {
        if (!seen.MarkNew(entry.EntryId))
        {
            dupInFeed++;
            continue;  // already ingested in a prior run
        }

        NewsIngest.Ingest(graph, entry, display, university);
        newInFeed++;
    }

    logger.LogInformation("{Feed}: {New} new, {Dup} already seen",
                          display, newInFeed, dupInFeed);
    await graph.CommitPendingAsync();
}

seen.Save();  // persist dedup set for the next run
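The loop relies on a file-backed set with MarkNew and Save. One way such a store might look is sketched below. The class name SeenEntryStoreSketch, the one-ID-per-line file format, and the write-then-move save are assumptions made for illustration; the SeenEntryStore in the sample may be implemented differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of a file-backed dedup set (not the sample's own class).
public sealed class SeenEntryStoreSketch
{
    private readonly string _path;
    private readonly HashSet<string> _seen;

    public SeenEntryStoreSketch(string path)
    {
        _path = path;
        // Assumed on-disk format: one entry ID per line (IDs must not contain newlines).
        _seen = File.Exists(path)
            ? new HashSet<string>(File.ReadAllLines(path), StringComparer.Ordinal)
            : new HashSet<string>(StringComparer.Ordinal);
    }

    // Returns true the first time an ID is offered, false on every later call.
    public bool MarkNew(string entryId) => _seen.Add(entryId);

    public void Save()
    {
        var dir = Path.GetDirectoryName(Path.GetFullPath(_path));
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

        // Write to a temp file and then move it into place, so a crash
        // mid-write does not leave a truncated state file behind.
        var tmp = _path + ".tmp";
        File.WriteAllLines(tmp, _seen);
        File.Move(tmp, _path, overwrite: true);
    }
}
```

Because the set is only persisted by Save, a run that crashes before saving will see the same entries as new on the next run, so the ingest step should tolerate being repeated.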

Configuration

Variable            Purpose                                           Default
RECIPE_FEED_URLS    Comma-separated feed URLs (blank → local mode)    (blank)
RECIPE_LOCAL_ROOT   Local fallback root                               data/feeds/
RECIPE_SEEN_PATH    Dedup state file                                  data/.seen
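As a rough sketch, the three variables could be read like this. The RecipeConfig record and its member names are hypothetical; only the variable names and defaults come from the table above.

```csharp
using System;

// Hypothetical config holder (not part of RssSample).
public sealed record RecipeConfig(string[] FeedUrls, string LocalRoot, string SeenPath)
{
    // A blank RECIPE_FEED_URLS means local mode.
    public bool IsLocalMode => FeedUrls.Length == 0;

    public static RecipeConfig FromEnvironment()
    {
        // Split on commas, trimming whitespace and dropping empty segments.
        var urls = (Environment.GetEnvironmentVariable("RECIPE_FEED_URLS") ?? "")
            .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        return new RecipeConfig(
            urls,
            OrDefault("RECIPE_LOCAL_ROOT", "data/feeds/"),
            OrDefault("RECIPE_SEEN_PATH", "data/.seen"));
    }

    // Falls back to the documented default when the variable is unset or blank.
    private static string OrDefault(string name, string fallback)
    {
        var value = Environment.GetEnvironmentVariable(name);
        return string.IsNullOrWhiteSpace(value) ? fallback : value;
    }
}
```

Calling RecipeConfig.FromEnvironment() once at startup and passing the result down keeps environment lookups out of the ingest loop.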

Reuse notes

  • Dedup by entry ID, not URL — URLs can change while IDs are stable.
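  • Dedup by ID alone skips an item that is corrected and republished under the same ID; compare updatedAt on the entry with the value recorded when it was last ingested to catch those.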
  • Stateless source → dedup is the right tool; monotonic source → watermark is the right tool. (Postgres recipe shows the latter.)
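The updatedAt note above could be implemented by storing a timestamp per entry ID instead of a bare set. The sketch below is one possible shape, kept in memory for brevity; VersionedSeenSet, EntryStatus, and Classify are hypothetical names, and updatedAt is assumed to be available as a DateTimeOffset.

```csharp
using System;
using System.Collections.Generic;

public enum EntryStatus { New, Updated, Unchanged }

// Hypothetical sketch: remembers the latest updatedAt seen for each entry ID.
public sealed class VersionedSeenSet
{
    private readonly Dictionary<string, DateTimeOffset> _lastUpdated =
        new(StringComparer.Ordinal);

    public EntryStatus Classify(string entryId, DateTimeOffset updatedAt)
    {
        if (!_lastUpdated.TryGetValue(entryId, out var previous))
        {
            _lastUpdated[entryId] = updatedAt;
            return EntryStatus.New;        // never seen this ID
        }

        if (updatedAt > previous)
        {
            _lastUpdated[entryId] = updatedAt;
            return EntryStatus.Updated;    // same ID, newer timestamp: re-ingest
        }

        return EntryStatus.Unchanged;      // same ID, same or older timestamp: skip
    }
}
```

In the polling loop, New and Updated would both lead to an ingest call, while Unchanged would be counted as a duplicate. Persisting the dictionary would follow the same pattern as the seen file.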