Curiosity

Sitemap recipe

Source: SitemapSample/ · walks a sitemap.xml (or sitemap index), fetches each URL, canonicalizes, deduplicates, and hashes the body for change detection.

Owns in the academic graph: university web pages, hosts, sections, tags.

What it teaches

  • URL canonicalization — lowercase host, drop default port, strip fragment + tracking params, honor <link rel="canonical">.
  • Content hash (SHA-256) for change detection across reruns.
  • Sitemap index recursion — expand <sitemap> pointers automatically.
  • Polite delays between requests (configurable ms).
  • HTML parsing with HtmlAgilityPack for title / H1 / meta tags.
  • Partial-success ingestion — log failures, never stop the run.

The dedup loop

public interface ISitemapSource
{
    IAsyncEnumerable<SitemapEntry> ListUrlsAsync(string sitemapUrlOrPath);
    Task<ScrapedPage?>             FetchAsync(string url);
}

var attempted = 0; var deduped = 0; var ok = 0;

await foreach (var entry in source.ListUrlsAsync(sitemap))
{
    attempted++;
    var page = await source.FetchAsync(entry.Url);
    if (page is null)            { deduped++; continue; }  // canonical URL already seen
    if (page.StatusCode >= 400)  { continue; }             // skip HTTP errors

    WebsiteIngest.Ingest(graph, page, entry.LastModified);
    if (++ok % 25 == 0) await graph.CommitPendingAsync();
}

Under the hood, HttpSitemapSource.FetchAsync:

  • Canonicalizes the URL (lowercase host, drop port 80/443, strip #, drop ?utm_*).
  • Dedupes against a HashSet<string> of seen canonicals — returns null on repeat.
  • Respects <link rel="canonical"> from the fetched page.
  • Stores SHA-256 of body text on the _WebPage.ContentHash property.

Configuration

Variable Purpose Default
RECIPE_SITEMAP_URL https://example.com/sitemap.xml (blank → local mode) (blank)
RECIPE_POLITENESS_MS Inter-request delay 500
RECIPE_LOCAL_ROOT Local fallback root data/

Reuse notes

  • Always read robots.txt first and set a meaningful User-Agent header.
  • Use ContentHash to short-circuit re-ingestion on unchanged pages.
  • For multi-domain crawls, rate-limit per host, not globally.