RSS / Atom recipe
Source: RssSample/ · one or more RSS / Atom feeds. Built on System.ServiceModel.Syndication so all common dialects share a single code path.
Owns in the academic graph: university news items, feeds, authors, categories.
What it teaches
- Dedup by entry ID (`guid` for RSS, `id` for Atom) using a persisted `HashSet<string>`.
- Polling pattern with a "seen" file for cross-run state.
- Partial-success ingestion — feed-level failures don't take down the whole run.
- Optional heuristic feed → university mapping (the kind of domain logic that lives in your `Ingest` method).
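
The partial-success point can be sketched as a per-feed try/catch that records the failure and moves on to the next feed. This is a minimal illustration, not the sample's actual code: `FeedRunner`, `FeedResult`, and the `ingestFeed` delegate are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical result record: one per feed, whether it succeeded or failed.
public sealed record FeedResult(string Feed, int NewEntries, Exception? Error)
{
    public bool Succeeded => Error is null;
}

public static class FeedRunner
{
    // Runs ingestFeed for every feed. A feed that throws is recorded as failed
    // and skipped, so the remaining feeds are still processed.
    public static async Task<IReadOnlyList<FeedResult>> RunAllAsync(
        IEnumerable<string> feeds,
        Func<string, Task<int>> ingestFeed)
    {
        var results = new List<FeedResult>();
        foreach (var feed in feeds)
        {
            try
            {
                var added = await ingestFeed(feed);
                results.Add(new FeedResult(feed, added, null));
            }
            // Cancellation is deliberately not swallowed: it should stop the whole run.
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                results.Add(new FeedResult(feed, 0, ex));
            }
        }
        return results;
    }
}
```

The caller can then log or report the failed feeds from the returned list while still committing whatever the healthy feeds produced.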
The dedup + polling loop
```csharp
var seen = new SeenEntryStore(seenPath); // file-backed HashSet<string>

foreach (var (spec, display, university) in feeds)
{
    var newInFeed = 0;
    var dupInFeed = 0;

    await foreach (var entry in source.ReadAsync(spec))
    {
        if (!seen.MarkNew(entry.EntryId))
        {
            dupInFeed++;
            continue; // already ingested in a prior run
        }

        NewsIngest.Ingest(graph, entry, display, university);
        newInFeed++;
    }

    logger.LogInformation("{Feed}: {New} new, {Dup} already seen",
        display, newInFeed, dupInFeed);
    await graph.CommitPendingAsync();
}

seen.Save(); // persist dedup set for the next run
```
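
For orientation, here is one way a file-backed store with the `MarkNew` / `Save` shape used above could look. It is a sketch under stated assumptions, not the sample's real `SeenEntryStore`: the one-ID-per-line file format, the `Count` property, and the temp-file-then-move save are choices made here, and the format assumes entry IDs never contain newlines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of a file-backed "seen" set: one entry ID per line.
public sealed class SeenEntryStore
{
    private readonly string _path;
    private readonly HashSet<string> _ids;

    public SeenEntryStore(string path)
    {
        _path = path;
        // A missing file just means nothing has been ingested yet.
        _ids = File.Exists(path)
            ? new HashSet<string>(File.ReadAllLines(path), StringComparer.Ordinal)
            : new HashSet<string>(StringComparer.Ordinal);
    }

    public int Count => _ids.Count;

    // True the first time an ID is seen, false for duplicates.
    public bool MarkNew(string entryId) => _ids.Add(entryId);

    // Write to a temp file, then move it into place, so an interrupted save
    // is less likely to leave a truncated state file behind.
    public void Save()
    {
        var dir = Path.GetDirectoryName(Path.GetFullPath(_path));
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

        var tmp = _path + ".tmp";
        File.WriteAllLines(tmp, _ids);
        File.Move(tmp, _path, overwrite: true);
    }
}
```

`HashSet<string>.Add` already returns `false` for an element that is present, which is why `MarkNew` can be a single call.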
Configuration
| Variable | Purpose | Default |
|---|---|---|
| `RECIPE_FEED_URLS` | Comma-separated feed URLs (blank → local mode) | (blank) |
| `RECIPE_LOCAL_ROOT` | Local fallback root | `data/feeds/` |
| `RECIPE_SEEN_PATH` | Dedup state file | `data/.seen` |
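
A rough sketch of how these variables might be read follows. Only the variable names and defaults come from the table; the `RecipeSettings` record, its `LocalMode` property, and the injectable `getEnv` delegate are hypothetical and exist only to make the parsing easy to exercise.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical settings holder for the three variables in the table above.
public sealed record RecipeSettings(IReadOnlyList<string> FeedUrls, string LocalRoot, string SeenPath)
{
    // Blank RECIPE_FEED_URLS means "read from the local fallback root".
    public bool LocalMode => FeedUrls.Count == 0;

    // getEnv defaults to the process environment; pass a lambda to test the parsing.
    public static RecipeSettings Load(Func<string, string?>? getEnv = null)
    {
        if (getEnv is null) getEnv = Environment.GetEnvironmentVariable;

        // Split on commas, trimming whitespace and dropping empty items.
        var urls = (getEnv("RECIPE_FEED_URLS") ?? string.Empty)
            .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        return new RecipeSettings(
            urls,
            OrDefault(getEnv("RECIPE_LOCAL_ROOT"), "data/feeds/"),
            OrDefault(getEnv("RECIPE_SEEN_PATH"), "data/.seen"));
    }

    private static string OrDefault(string? value, string fallback) =>
        string.IsNullOrWhiteSpace(value) ? fallback : value.Trim();
}
```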
Reuse notes
- Dedup by entry ID, not URL — URLs can change while IDs are stable.
- Use `updatedAt` on the entry to catch corrected/republished items.
- Stateless source → dedup is the right tool; monotonic source → watermark is the right tool. (Postgres recipe shows the latter.)
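
To illustrate the `updatedAt` note, the "seen" set could be widened from a set of IDs to a map from ID to the last timestamp observed, so a republished entry with a newer timestamp is flagged rather than skipped. The sketch below is an assumption-laden illustration: `SeenEntryVersions` and `EntryStatus` are hypothetical names, persistence is left out, and it assumes the feed supplies a trustworthy timestamp (with System.ServiceModel.Syndication this would presumably come from `SyndicationItem.LastUpdatedTime`).

```csharp
using System;
using System.Collections.Generic;

public enum EntryStatus { New, Updated, Unchanged }

// Hypothetical variant of the "seen" set that also remembers each entry's last updatedAt.
public sealed class SeenEntryVersions
{
    private readonly Dictionary<string, DateTimeOffset> _lastUpdated =
        new(StringComparer.Ordinal);

    // Classifies an entry and records its timestamp when it is new or newer.
    public EntryStatus Observe(string entryId, DateTimeOffset updatedAt)
    {
        if (!_lastUpdated.TryGetValue(entryId, out var previous))
        {
            _lastUpdated[entryId] = updatedAt;
            return EntryStatus.New;
        }

        if (updatedAt > previous)
        {
            _lastUpdated[entryId] = updatedAt;
            return EntryStatus.Updated; // same ID, newer timestamp: corrected or republished
        }

        return EntryStatus.Unchanged; // same or older timestamp: treat as a duplicate
    }
}
```

An `Updated` result would typically trigger a re-ingest of the entry, while `Unchanged` follows the same skip path as a plain duplicate.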