Curiosity - Web Crawler

Web Crawler

A generic crawler for public websites. It walks from a starting URL (or sitemap), respects robots.txt, honors per-host politeness delays, and stores extracted content as `_WebPage` nodes.
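
The two courtesy rules mentioned above, robots.txt checks and per-host politeness delays, can be illustrated with a minimal sketch. This is not Curiosity's implementation: `PolitenessGate`, `build_robots`, the 2-second delay, and the `ExampleCrawler/1.0` agent name are all hypothetical, and only the Python standard library is used. The robots.txt text is supplied inline so the example runs without network access.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PolitenessGate:
    """Hypothetical helper: remembers the last fetch per host."""

    def __init__(self, delay_seconds: float, clock=time.monotonic):
        self.delay_seconds = delay_seconds
        self.clock = clock
        self.last_fetch = {}  # host -> time of the previous request

    def wait_time(self, url: str) -> float:
        """Seconds to sleep before this URL's host may be hit again."""
        host = urlsplit(url).netloc.lower()
        previous = self.last_fetch.get(host)
        if previous is None:
            return 0.0
        return max(0.0, self.delay_seconds - (self.clock() - previous))

    def record(self, url: str) -> None:
        """Note that a request to this URL's host was just made."""
        self.last_fetch[urlsplit(url).netloc.lower()] = self.clock()


# Assumed robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = build_robots(ROBOTS_TXT)
gate = PolitenessGate(delay_seconds=2.0)  # assumed delay value

for url in ["https://example.com/docs", "https://example.com/private/x"]:
    if not robots.can_fetch("ExampleCrawler/1.0", url):
        continue  # disallowed by robots.txt, so never requested
    time.sleep(gate.wait_time(url))
    gate.record(url)  # a real crawler would fetch the page here
```

Keying the delay on the host rather than on the whole crawl lets requests to different sites proceed without waiting on each other, while any single site still sees at most one request per delay window.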

Web / feeds · Anonymous

What gets ingested

| Element | Mapped to |
| --- | --- |
| HTML page | `_WebPage` (title, body text, headings, meta) |
| Inline image | `_Image` + `_Blob` |
| Linked file (PDF, DOCX, …) | `_FileEntry` + `_Blob` |
| Embedded video (YouTube, etc.) | Downloaded via FFmpeg and stored as a `_FileEntry` so the audio can be transcribed |

Authentication

Type: none. The crawler can be configured to send a custom User-Agent string.
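
At the HTTP level, a custom User-Agent is just a request header. The sketch below shows what that looks like with Python's standard library; it does not show how the connector itself is configured, and `build_request` and the agent string are hypothetical.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Create an anonymous GET request that identifies the crawler."""
    return Request(url, headers={"User-Agent": user_agent})


# Assumed agent string; a contact URL helps site owners reach the operator.
request = build_request(
    "https://example.com/",
    "ExampleCrawler/1.0 (+https://example.com/bot)",
)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```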

Access control mapping

Content is public and maps to `_AccessGroup.Public`.

Sync cadence

Default cron: daily at 03:00 UTC.
Incremental sync: URL canonicalization + content hash. Pages whose canonical URL has already been seen are deduplicated, and content with an identical SHA-256 hash is skipped.
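
One way to read the incremental sync rule is sketched below, under stated assumptions: a canonical URL is visited at most once per run, and a page is re-ingested only when its SHA-256 hash differs from the one stored by an earlier run. The normalization rules in `canonicalize` (lowercase scheme and host, drop default ports and fragments, sort query parameters) and the `IncrementalSync` class are hypothetical. The connector's actual canonicalization rules are not documented here.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Assumed normalization; does not handle IPv6 literals or userinfo."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so that ordering does not create new URLs.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped


class IncrementalSync:
    """Hypothetical tracker for one crawl run."""

    def __init__(self, stored_hashes=None):
        # canonical URL -> SHA-256 hex digest recorded by earlier runs
        self.stored_hashes = dict(stored_hashes or {})
        self.seen_this_run = set()

    def should_ingest(self, url: str, body: bytes) -> bool:
        key = canonicalize(url)
        if key in self.seen_this_run:
            return False  # same page reached through a different spelling
        self.seen_this_run.add(key)
        digest = hashlib.sha256(body).hexdigest()
        if self.stored_hashes.get(key) == digest:
            return False  # content unchanged since the last sync
        self.stored_hashes[key] = digest
        return True


first_run = IncrementalSync()
first_run.should_ingest("https://example.com/a?b=2&a=1", b"hello")

# A later run starts from the hashes saved by the previous one.
next_run = IncrementalSync(first_run.stored_hashes)
```

Hashing the raw body means any byte-level change triggers a re-ingest. Hashing the extracted text instead would ignore markup-only changes, at the cost of doing extraction before the skip decision.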

Notable

Configurable filters: by content type, maximum file size, minimum article length, allowed/blocked domain patterns.
Optional robots.txt enforcement: keep it on when crawling the open web.
Optional video transcription pipeline: videos are downloaded with FFmpeg and sent to the workspace's audio-to-text pipeline.
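
The configurable filters listed above can be pictured as a single accept/reject check applied to each fetched resource. The sketch below is illustrative only: `CrawlFilters`, its field names, and its default values are hypothetical rather than the connector's real settings, and domain patterns are assumed to be shell-style globs matched with Python's `fnmatch`.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CrawlFilters:
    """Hypothetical filter settings; names and defaults are assumptions."""

    allowed_content_types: tuple = ("text/html", "application/pdf")
    max_file_bytes: int = 10 * 1024 * 1024
    min_article_chars: int = 200
    allowed_domains: tuple = ("*",)
    blocked_domains: tuple = ()

    def accepts(self, url: str, content_type: str, size_bytes: int,
                article_text: Optional[str] = None) -> bool:
        host = (urlsplit(url).hostname or "").lower()
        # A blocked pattern wins over an allowed one.
        if any(fnmatch(host, p) for p in self.blocked_domains):
            return False
        if not any(fnmatch(host, p) for p in self.allowed_domains):
            return False
        # Ignore parameters such as "; charset=utf-8".
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime not in self.allowed_content_types:
            return False
        if size_bytes > self.max_file_bytes:
            return False
        # The length check applies only when article text was extracted.
        if article_text is not None:
            if len(article_text.strip()) < self.min_article_chars:
                return False
        return True


filters = CrawlFilters(
    allowed_domains=("*.example.com",),
    blocked_domains=("ads.example.com",),
    min_article_chars=10,
)
ok = filters.accepts(
    "https://docs.example.com/guide",
    "text/html; charset=utf-8",
    size_bytes=4096,
    article_text="A short but long enough article body.",
)
```

Checking the domain first avoids inspecting headers or bodies for hosts that would be rejected anyway, and the size check runs before any text-length check so oversized files are dropped without extraction.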