Web Crawler
A generic crawler for public websites. Walks a starting URL (or sitemap), respects robots.txt, honors per-host politeness delays, and stores extracted content as _WebPage nodes.
Web / feeds · Anonymous
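The robots.txt check and per-host politeness delay described above can be sketched with Python's standard library. This is only an illustration, not the connector's actual code: `PolitenessGate`, `load_robots`, the `ExampleCrawler` user agent, and the one-second default delay are all hypothetical names and values chosen for the example.

```python
import time
from typing import Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay: float = 1.0, clock=time.monotonic):
        self.default_delay = default_delay  # assumed default, in seconds
        self.clock = clock                  # injectable so tests need not sleep
        self._next_allowed = {}             # host -> timestamp

    def wait_time(self, url: str) -> float:
        """Seconds the caller should still wait before fetching this URL."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self._next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url: str, delay: Optional[float] = None) -> None:
        """Note that the host was just fetched and start its delay window."""
        host = urlsplit(url).netloc.lower()
        pause = self.default_delay if delay is None else delay
        self._next_allowed[host] = self.clock() + pause


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser
```

A crawl loop would call `parser.can_fetch(user_agent, url)` before each request, sleep for `gate.wait_time(url)`, and then call `gate.record_fetch(url, delay=parser.crawl_delay(user_agent))` so that a `Crawl-delay` directive, when present, overrides the default.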
What gets ingested
| Element | Mapped to |
|---|---|
| HTML page | _WebPage (title, body text, headings, meta) |
| Inline image | _Image + _Blob |
| Linked file (PDF, DOCX, …) | _FileEntry + _Blob |
| Embedded video (YouTube, etc.) | _FileEntry (downloaded via FFmpeg so the audio can be transcribed) |
Authentication
- Type: none. The crawler can be configured to send a custom User-Agent string.
Access control mapping
- Content is public: _AccessGroup.Public.
Sync cadence
- Default cron: daily at 03:00 UTC.
- Incremental sync: URL canonicalization plus a content hash. A page whose canonical URL has already been seen is deduplicated, and content identical to a previously seen copy (by SHA-256) is skipped.
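A minimal sketch of how canonical URLs and SHA-256 hashes can drive incremental sync follows. The normalization rules (lowercased host, dropped default ports and fragments, sorted query, removed `utm_*` parameters) and the names `canonicalize`, `SyncIndex`, and `TRACKING_PARAMS` are assumptions made for this example; the connector's exact rules are not documented here.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"
    path = parts.path or "/"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted((k, v) for k, v in pairs if k not in TRACKING_PARAMS))
    return urlunsplit((scheme, host, path, query, ""))  # fragment is dropped


class SyncIndex:
    """Remembers canonical URLs and content hashes across sync runs."""

    def __init__(self):
        self.hash_by_url = {}     # canonical URL -> last SHA-256 hex digest
        self.seen_hashes = set()  # every digest stored so far

    def classify(self, url: str, body: bytes) -> str:
        key = canonicalize(url)
        digest = hashlib.sha256(body).hexdigest()
        previous = self.hash_by_url.get(key)
        if previous == digest:
            return "unchanged"            # same URL, same content: skip
        self.hash_by_url[key] = digest
        if digest in self.seen_hashes:
            return "duplicate"            # same content under another URL: skip
        self.seen_hashes.add(digest)
        return "new" if previous is None else "changed"
```

Only results classified as `new` or `changed` would need to be written back as nodes; `unchanged` and `duplicate` can be skipped.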
Notable
- Configurable filters: by content type, maximum file size, minimum article length, allowed/blocked domain patterns.
- Optional robots.txt enforcement. Keep it on for the open web.
- Optional video transcription pipeline: downloads the video with FFmpeg, then sends the audio to the workspace's audio-to-text pipeline.
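The configurable filters listed above could be modeled roughly as follows. The field names, default values, and glob-style domain patterns are hypothetical, chosen only to show how the four filter kinds might combine; they are not the connector's real configuration keys.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CrawlFilters:
    # All names and defaults below are illustrative assumptions.
    allowed_content_types: set = field(default_factory=lambda: {"text/html", "application/pdf"})
    max_file_size: int = 10 * 1024 * 1024       # bytes
    min_article_length: int = 200               # characters of extracted text
    allowed_domains: list = field(default_factory=lambda: ["*"])
    blocked_domains: list = field(default_factory=list)

    def domain_allowed(self, url: str) -> bool:
        """Blocked patterns win over allowed patterns."""
        host = (urlsplit(url).hostname or "").lower()
        if any(fnmatchcase(host, p) for p in self.blocked_domains):
            return False
        return any(fnmatchcase(host, p) for p in self.allowed_domains)

    def accepts(self, url: str, content_type: str, size_bytes: int,
                text_length: Optional[int] = None) -> bool:
        """Return True when a fetched resource passes every filter."""
        media_type = content_type.split(";")[0].strip().lower()
        if not self.domain_allowed(url):
            return False
        if media_type not in self.allowed_content_types:
            return False
        if size_bytes > self.max_file_size:
            return False
        # The article-length check applies only to HTML pages with extracted text.
        if media_type == "text/html" and text_length is not None:
            return text_length >= self.min_article_length
        return True
```

For example, `CrawlFilters(allowed_domains=["*.example.com"], blocked_domains=["ads.example.com"])` would accept pages from `docs.example.com` while rejecting anything served from `ads.example.com`.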