Curiosity

Web Crawler

A generic crawler for public websites. Walks a starting URL (or sitemap), respects robots.txt, honors per-host politeness delays, and stores extracted content as _WebPage nodes.

Web / feeds · Anonymous
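
The robots.txt check and per-host politeness delay described above can be sketched with Python's standard `urllib.robotparser`. This is only an illustration, not the crawler's actual implementation: the `PoliteGate` class, its method names, the `ExampleBot` user agent, and the one-second default delay are all assumptions made for the example.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: decides whether a URL may be fetched and how long to wait."""

    def __init__(self, user_agent, default_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed fallback when robots.txt sets no Crawl-delay
        self.clock = clock
        self._robots = {}     # host -> RobotFileParser
        self._next_slot = {}  # host -> earliest time of the next request

    def load_robots(self, host, robots_txt):
        """Register a robots.txt body that was already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def allowed(self, url):
        parser = self._robots.get(urlsplit(url).netloc)
        # In this sketch, hosts without a loaded robots.txt are treated as unrestricted.
        return True if parser is None else parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Return seconds to wait before fetching, and reserve the host's next slot."""
        host = urlsplit(url).netloc
        parser = self._robots.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        delay = self.default_delay if delay is None else float(delay)
        now = self.clock()
        start = max(now, self._next_slot.get(host, now))
        self._next_slot[host] = start + delay
        return start - now


gate = PoliteGate("ExampleBot")
gate.load_robots("example.com", "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
print(gate.allowed("https://example.com/blog/post"))  # permitted by the rules above
print(gate.allowed("https://example.com/private/x"))  # blocked by Disallow: /private/
```

Keeping the schedule per host means a slow or strict site does not hold back requests to other hosts.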

What gets ingested

Element                          Mapped to
HTML page                        _WebPage (title, body text, headings, meta)
Inline image                     _Image + _Blob
Linked file (PDF, DOCX, …)       _FileEntry + _Blob
Embedded video (YouTube, etc.)   Downloaded via FFmpeg and stored as a _FileEntry so the audio can be transcribed

Authentication

  • Type: none. The crawler can be configured to send a custom User-Agent string.

Access control mapping

  • Content is public and maps to _AccessGroup.Public.

Sync cadence

  • Default cron: daily at 03:00 UTC.
  • Incremental sync: URL canonicalization plus content hashing. A page whose canonical URL has already been seen is treated as a duplicate, and a page whose content is identical (by SHA-256) is skipped.
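
A minimal sketch of the incremental-sync idea, assuming a simple set of canonicalization rules. The source does not specify which rules the crawler applies, so the ones here are illustrative: lowercase the host, drop default ports and fragments, sort query parameters, and strip `utm_` tracking parameters. The names `canonicalize`, `content_hash`, and `SeenIndex` are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: parameters treated as noise
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.startswith(TRACKING_PREFIXES)
    ))
    # The fragment is dropped: it never reaches the server.
    return urlunsplit((scheme, host, parts.path or "/", query, ""))


def content_hash(body: bytes) -> str:
    """SHA-256 digest of the raw page body, as a hex string."""
    return hashlib.sha256(body).hexdigest()


class SeenIndex:
    """In-memory stand-in for whatever store tracks seen URLs and hashes."""

    def __init__(self):
        self._urls = set()
        self._hashes = set()

    def first_visit(self, url):
        """True only the first time a canonical URL is offered."""
        key = canonicalize(url)
        if key in self._urls:
            return False
        self._urls.add(key)
        return True

    def is_new_content(self, body):
        """True only the first time a given body (by SHA-256) is offered."""
        digest = content_hash(body)
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True


print(canonicalize("HTTPS://Example.com:443/docs?b=2&a=1&utm_source=x#top"))
# → https://example.com/docs?a=1&b=2
```

A real deployment would persist both sets between runs so that the daily sync only processes pages that are new or changed.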

Notable

  • Configurable filters: by content type, maximum file size, minimum article length, allowed/blocked domain patterns.
  • Optional robots.txt enforcement; keep it enabled when crawling the open web.
  • Optional video transcription: videos are downloaded with FFmpeg and sent to the workspace's audio-to-text pipeline.
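
The configurable filters listed above could be modeled roughly as follows. This is a sketch under stated assumptions, not the connector's real configuration schema: the `CrawlFilters` class, its field names, the default values, and the use of shell-style wildcards for domain patterns are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


@dataclass
class CrawlFilters:
    """Hypothetical filter settings; every default here is illustrative."""
    allowed_content_types: tuple = ("text/html", "application/pdf")
    max_file_bytes: int = 20 * 1024 * 1024
    min_article_chars: int = 200
    allowed_domains: tuple = ("*",)  # shell-style patterns (assumption)
    blocked_domains: tuple = ()

    def accepts_url(self, url):
        """Blocked patterns win over allowed patterns."""
        host = (urlsplit(url).hostname or "").lower()
        if any(fnmatchcase(host, p) for p in self.blocked_domains):
            return False
        return any(fnmatchcase(host, p) for p in self.allowed_domains)

    def accepts_response(self, content_type, size_bytes):
        """Check the MIME type (ignoring parameters such as charset) and size."""
        mime = content_type.split(";", 1)[0].strip().lower()
        return mime in self.allowed_content_types and size_bytes <= self.max_file_bytes

    def accepts_article(self, text):
        """Reject pages whose extracted text is shorter than the minimum."""
        return len(text.strip()) >= self.min_article_chars


filters = CrawlFilters(
    allowed_domains=("example.com", "*.example.com"),
    blocked_domains=("ads.example.com",),
)
print(filters.accepts_url("https://docs.example.com/guide"))  # allowed subdomain
print(filters.accepts_url("https://ads.example.com/banner"))  # blocked pattern
```

Checking the URL before the request and the content type and size before the download lets the crawler reject unwanted resources as early as possible.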