Multimodal Search
Curiosity Workspace extracts searchable text from images, scanned PDFs, audio, and video. Once extracted, the text becomes a regular searchable property — text search, vector search, NLP pipelines, and graph queries all see it like any other field.
This page is the reference for what's supported, what the limits are, and what to expect operationally. For a runnable walkthrough, see the Multimodal tutorial.
Optical Character Recognition (OCR)
OCR runs automatically on uploaded files of supported types. The extracted text is stored on the file node and re-indexed by the text and embedding indexes the same way other text fields are.
Supported formats
| Class | Extensions |
|---|---|
| Common image formats | .png, .jpg, .jpeg, .gif, .bmp, .webp, .svg |
| Professional formats | .tif, .tiff, .dng, .raw, .heic, .heif, .psb |
| Office image formats | .odg, .otg, .odi |
| Scanned PDFs | .pdf files whose pages are image-only (text PDFs are extracted directly, no OCR needed). |
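As an illustration, a client can mirror the table above to predict whether an upload will go through OCR. A minimal sketch (the extension set is copied from the table; the workspace's actual routing logic is internal, and for .pdf only image-only pages are OCR'd):

```python
from pathlib import Path

# Extensions from the supported-formats table above.
OCR_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".svg",
    ".tif", ".tiff", ".dng", ".raw", ".heic", ".heif", ".psb",
    ".odg", ".otg", ".odi", ".pdf",
}

def routes_to_ocr(filename: str) -> bool:
    """Predict whether a file is a candidate for OCR, by extension."""
    return Path(filename).suffix.lower() in OCR_EXTENSIONS

print(routes_to_ocr("scan_0042.TIFF"))  # True
print(routes_to_ocr("notes.docx"))      # False
```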
Languages
Built-in OCR ships with language models for English, French, Spanish, German, and Portuguese. Additional languages can be enabled per workspace by an administrator — see Internationalization.
Limits and performance
| Limit | Default | Notes |
|---|---|---|
| Max file size | 200 MB | Configurable per workspace. |
| Max pages per PDF | 2000 | Pages beyond this are skipped, not errored. |
| Throughput (CPU, single worker) | ~5–15 pages / second (clean scan, 300 dpi) | GPU workers run roughly 10× faster. |
| Memory | ~500 MB / concurrent page | |
Quality expectations
- Clean modern scans (300+ dpi): 95–99% character accuracy. Treat as production-quality.
- Photographed documents (phone camera, tilted): 80–95%. Good enough for search, not for citations.
- Low-resolution / damaged scans: 50–80%. Useful for "is this document about X" but not for exact-quote retrieval.
- Handwriting: not supported out of the box. Add a handwriting model via the admin UI for narrow domains.
Speech-to-Text (STT)
STT transcribes audio and video files. The transcript is stored on the file node alongside word-level timestamps so the UI can deep-link to the relevant moment in a media file.
Supported formats
| Class | Extensions |
|---|---|
| Video | .mp4, .wmv, .mpeg, .avi, .mkv, .mov, .ogv, .3gp, .flv |
| Audio | .mp3, .wav, .mka, .wma, .flac, .aac, .aiff, .m4a, .oga, .weba, .webm |
Models
The workspace uses Whisper for transcription. Pick the model size per workspace based on the accuracy/cost tradeoff:
| Whisper model | Approx. RTF (CPU) | Approx. RTF (GPU) | Quality |
|---|---|---|---|
| tiny | 0.3× | 0.05× | Acceptable for clean speech. |
| base | 0.5× | 0.07× | Good for monologue / dictation. |
| small | 1.0× | 0.10× | Solid default. |
| medium | 2.5× | 0.20× | Strong, handles accents. |
| large-v3 | 6.0× | 0.30× | Best, multilingual. |
RTF = real-time factor (time to transcribe ÷ media duration). 0.1× means 1 hour of audio takes 6 minutes.
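The RTF column converts directly into wall-clock estimates. A minimal sketch of the arithmetic (plain math, no workspace API involved):

```python
def transcribe_minutes(media_minutes: float, rtf: float) -> float:
    """Wall-clock transcription time = media duration x real-time factor."""
    return media_minutes * rtf

# 1 hour of audio with the small model on GPU (RTF 0.10x):
print(transcribe_minutes(60, 0.10))  # 6.0 minutes
# The same hour on CPU with large-v3 (RTF 6.0x):
print(transcribe_minutes(60, 6.0))   # 360.0 minutes (6 hours)
```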
Features
- Searchable transcripts. Find spoken phrases in any media file.
- Timestamped navigation. Search hits link to the segment offset; the player jumps directly there.
- Language detection. Auto-detected per file. Force a language in the admin UI for known mono-lingual corpora to skip the detection step.
- Speaker diarization (optional). Mark up speaker turns. Disabled by default — turn on per workspace if needed.
Limits and performance
| Limit | Default | Notes |
|---|---|---|
| Max file size | 2 GB | Configurable. Larger files are split into segments. |
| Max duration | 8 hours | |
| Word-level timestamps | enabled | Disable for marginal throughput gains on long media. |
| Concurrent workers | 1 per GPU / 4 per CPU | Configure per worker pool. |
Cost and capacity notes
- OCR and STT both run inside the workspace by default. No external API calls; no per-file cost beyond compute.
- GPU workers move the needle. For STT especially, GPU throughput is 10–30× CPU.
- Schedule big backfills off-hours. A back-catalog of 100 000 PDF scans will saturate a CPU worker for days. Plan capacity ahead.
- Embeddings on multimodal text. Once extracted, the text is just text — and embeddings cost the usual per-token amount. Disable embedding on short OCR fields (single labels, captions) to save on rebuild costs.
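To put the backfill advice in numbers, here is a rough capacity estimate using the OCR throughput figures above. It assumes linear scaling across workers; real throughput varies with scan quality and page complexity:

```python
def ocr_backfill_days(num_docs: int, pages_per_doc: float,
                      pages_per_sec: float, workers: int = 1) -> float:
    """Rough wall-clock days to OCR a back-catalog of scanned documents."""
    total_pages = num_docs * pages_per_doc
    return total_pages / (pages_per_sec * workers) / 86_400  # 86,400 s/day

# 100,000 scanned PDFs, ~10 pages each, one CPU worker at the low end (5 pages/s):
print(round(ocr_backfill_days(100_000, 10, 5), 1))             # ~2.3 days
# Four workers cut the estimate proportionally:
print(round(ocr_backfill_days(100_000, 10, 5, workers=4), 1))  # ~0.6 days
```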
Retrieval shape
After extraction, multimodal content participates in normal retrieval. Two extras:
- `SearchHit.Highlights[]` includes the transcript fragment that matched (text) or the OCR span that matched (image with bounding box on supported viewers).
- `ChildHit` carries the page/timestamp identifier — use it to deep-link from the hit list to "page 7" or "minute 32".
See Search DSL for the full hit shape.
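To make the deep-linking concrete, here is a sketch of turning a child hit into a viewer link. The payload shape, field names, and URL fragments below are illustrative assumptions, not the actual Search DSL hit shape:

```python
# Hypothetical child-hit payloads; the real shape is defined by the Search DSL.
timestamp_hit = {"kind": "timestamp", "value": 1920}  # seconds into the media
page_hit = {"kind": "page", "value": 7}               # page number in a scan

def deep_link(file_url: str, child_hit: dict) -> str:
    """Build a viewer link that jumps to the matched page or moment."""
    if child_hit["kind"] == "timestamp":
        return f"{file_url}#t={child_hit['value']}"     # minute 32 -> #t=1920
    if child_hit["kind"] == "page":
        return f"{file_url}#page={child_hit['value']}"  # page 7 -> #page=7
    return file_url

print(deep_link("/files/all-hands.mp4", timestamp_hit))  # /files/all-hands.mp4#t=1920
print(deep_link("/files/report.pdf", page_hit))          # /files/report.pdf#page=7
```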
Related pages
- Multimodal tutorial — runnable example with sample files.
- Internationalization — adding language packs.
- Vector search — semantic retrieval over extracted text.
- Reindexing and re-embedding — after enabling new OCR/STT languages.