Sitemap recipe
Source: SitemapSample/ · walks a sitemap.xml (or sitemap index), fetches each URL, canonicalizes, deduplicates, and hashes the body for change detection.
Owns in the academic graph: university web pages, hosts, sections, tags.
What it teaches
- URL canonicalization: lowercase the host, drop the default port, strip the fragment and tracking parameters, and honor `<link rel="canonical">`.
- Content hash (SHA-256) for change detection across reruns.
- Sitemap index recursion: expand `<sitemap>` pointers automatically.
- Polite delays between requests (configurable, in ms).
- HTML parsing with HtmlAgilityPack for title / H1 / meta tags.
- Partial-success ingestion: log failures, never stop the run.
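To make the sitemap index recursion concrete, here is a minimal sketch of telling a sitemap index apart from a plain URL set. It assumes the standard sitemaps.org 0.9 namespace; `SitemapXml` and `Parse` are hypothetical names for illustration, not part of SitemapSample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not the sample's actual parser.
public static class SitemapXml
{
    // Namespace defined by the sitemaps.org protocol.
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns child sitemap URLs for a <sitemapindex> root,
    // or page URLs for a <urlset> root. The other list is empty.
    public static (List<string> Sitemaps, List<string> Pages) Parse(string xml)
    {
        var root = XDocument.Parse(xml).Root!;

        // Collects the <loc> text of every direct child with the given name.
        List<string> Locs(string element) => root.Elements(Ns + element)
            .Select(e => e.Element(Ns + "loc")?.Value.Trim())
            .Where(loc => !string.IsNullOrEmpty(loc))
            .Select(loc => loc!)
            .ToList();

        return root.Name.LocalName == "sitemapindex"
            ? (Locs("sitemap"), new List<string>())
            : (new List<string>(), Locs("url"));
    }
}
```

A caller would fetch each entry in `Sitemaps`, parse it the same way, and repeat until only `Pages` remain.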
The dedup loop
    public interface ISitemapSource
    {
        // Yields every URL in a sitemap, expanding sitemap indexes.
        IAsyncEnumerable<SitemapEntry> ListUrlsAsync(string sitemapUrlOrPath);

        // Returns null when the canonical URL has already been seen.
        Task<ScrapedPage?> FetchAsync(string url);
    }

    var attempted = 0; var deduped = 0; var ok = 0;
    await foreach (var entry in source.ListUrlsAsync(sitemap))
    {
        attempted++;
        var page = await source.FetchAsync(entry.Url);
        if (page is null) { deduped++; continue; }            // canonical URL already seen
        if (page.StatusCode >= 400) { continue; }             // skip HTTP errors, keep going
        WebsiteIngest.Ingest(graph, page, entry.LastModified);
        if (++ok % 25 == 0) await graph.CommitPendingAsync(); // commit in batches of 25
    }
    await graph.CommitPendingAsync(); // flush whatever remains after the last full batch
Under the hood, `HttpSitemapSource.FetchAsync`:
- Canonicalizes the URL (lowercase host, drop port 80/443, strip the `#` fragment, drop `?utm_*` parameters).
- Dedupes against a `HashSet<string>` of seen canonicals and returns `null` on repeat.
- Respects `<link rel="canonical">` from the fetched page.
- Stores the SHA-256 of the body text on the `_WebPage.ContentHash` property.
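The canonicalize, dedupe, and hash steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `HttpSitemapSource`: `CanonicalDeduper` and its members are hypothetical names, only `utm_*` parameters are treated as tracking parameters, and the `<link rel="canonical">` step is left out because it needs the fetched HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; not the sample's actual HttpSitemapSource.
public sealed class CanonicalDeduper
{
    private readonly HashSet<string> _seen = new(StringComparer.Ordinal);

    // Lowercases the host, drops the scheme's default port,
    // strips the fragment, and removes utm_* query parameters.
    public static string Canonicalize(string url)
    {
        var uri = new Uri(url, UriKind.Absolute);
        var builder = new UriBuilder(uri)
        {
            Host = uri.Host.ToLowerInvariant(),
            Fragment = string.Empty,
        };
        if (uri.IsDefaultPort) builder.Port = -1; // -1 omits the port

        var kept = uri.Query.TrimStart('?')
            .Split('&', StringSplitOptions.RemoveEmptyEntries)
            .Where(p => !p.StartsWith("utm_", StringComparison.OrdinalIgnoreCase));
        builder.Query = string.Join("&", kept);
        return builder.Uri.AbsoluteUri;
    }

    // True the first time a canonical URL is offered, false on repeats.
    public bool TryMarkSeen(string url, out string canonical)
    {
        canonical = Canonicalize(url);
        return _seen.Add(canonical);
    }

    // Lowercase hex SHA-256 of the UTF-8 encoded body text.
    public static string Sha256Hex(string bodyText)
    {
        using var sha = SHA256.Create();
        var digest = sha.ComputeHash(Encoding.UTF8.GetBytes(bodyText));
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}
```

Keeping the seen set keyed on the canonical form is what lets `https://example.com:443/b#x` and `https://EXAMPLE.com/b?utm_medium=email` collapse into a single fetch.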
Configuration
| Variable | Purpose | Default |
|---|---|---|
| `RECIPE_SITEMAP_URL` | Sitemap to crawl, e.g. `https://example.com/sitemap.xml` (blank → local mode) | (blank) |
| `RECIPE_POLITENESS_MS` | Inter-request delay in milliseconds | 500 |
| `RECIPE_LOCAL_ROOT` | Local fallback root | `data/` |
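One way to read these variables is sketched below. The variable names and defaults come from the table; the `RecipeSettings` record, and the choice to fall back to the default when a value is missing, invalid, or negative, are assumptions made for illustration.

```csharp
using System;

// Hypothetical settings holder; names and defaults mirror the table above.
public sealed record RecipeSettings(string? SitemapUrl, int PolitenessMs, string LocalRoot)
{
    // A blank RECIPE_SITEMAP_URL means local mode.
    public bool IsLocalMode => SitemapUrl is null;

    // Pure parsing step, kept separate so it can be tested without touching the environment.
    public static RecipeSettings Parse(string? url, string? politenessMs, string? localRoot) =>
        new(
            string.IsNullOrWhiteSpace(url) ? null : url.Trim(),
            int.TryParse(politenessMs, out var ms) && ms >= 0 ? ms : 500,
            string.IsNullOrWhiteSpace(localRoot) ? "data/" : localRoot.Trim());

    public static RecipeSettings FromEnvironment() => Parse(
        Environment.GetEnvironmentVariable("RECIPE_SITEMAP_URL"),
        Environment.GetEnvironmentVariable("RECIPE_POLITENESS_MS"),
        Environment.GetEnvironmentVariable("RECIPE_LOCAL_ROOT"));
}
```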
Reuse notes
- Always read `robots.txt` first and set a meaningful User-Agent header.
- Use `ContentHash` to short-circuit re-ingestion on unchanged pages.
- For multi-domain crawls, rate-limit per host, not globally.
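For the per-host rate limiting mentioned above, a minimal sketch might track the next allowed request time for each host. `PerHostThrottle` is a hypothetical class, not part of the sample, and it assumes a fixed delay such as the `RECIPE_POLITENESS_MS` value.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: spaces out requests to the same host by a fixed delay.
public sealed class PerHostThrottle
{
    private readonly TimeSpan _delay;
    private readonly Dictionary<string, DateTimeOffset> _nextAllowed =
        new(StringComparer.OrdinalIgnoreCase);
    private readonly object _gate = new();

    public PerHostThrottle(TimeSpan delay) => _delay = delay;

    // Reserves the next slot for the host and returns how long the caller should wait.
    public TimeSpan Reserve(string host, DateTimeOffset now)
    {
        lock (_gate)
        {
            var slot = _nextAllowed.TryGetValue(host, out var next) && next > now ? next : now;
            _nextAllowed[host] = slot + _delay;
            return slot - now;
        }
    }

    // Convenience wrapper: waits until the host's next slot before a request.
    public async Task WaitAsync(Uri url, CancellationToken ct = default)
    {
        var wait = Reserve(url.Host, DateTimeOffset.UtcNow);
        if (wait > TimeSpan.Zero) await Task.Delay(wait, ct);
    }
}
```

Calling `await throttle.WaitAsync(new Uri(entry.Url))` before each fetch would let requests to different hosts proceed without waiting on one another, while back-to-back requests to one host stay spaced apart.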