Multimodal Search
Curiosity Workspace extracts searchable text from images, scanned PDFs, audio, and video. Once extracted, the text becomes a regular searchable property — text search, vector search, NLP pipelines, and graph queries all see it like any other field.
This page is the reference for what's supported, what the limits are, and what to expect operationally. For a runnable walkthrough, see the Multimodal tutorial.
Optical Character Recognition (OCR)
OCR runs automatically on uploaded files of supported types. The extracted text is stored on the file node and re-indexed by the text and embedding indexes the same way other text fields are.
Supported formats
| Class | Extensions |
|---|---|
| Common image formats | .png, .jpg, .jpeg, .gif, .bmp, .webp, .svg |
| Professional formats | .tif, .tiff, .dng, .raw, .heic, .heif, .psb |
| Office image formats | .odg, .otg, .odi |
| Scanned PDFs | .pdf files whose pages are image-only (text PDFs are extracted directly, no OCR needed). |
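As an illustration, a client can mirror the table above to predict whether an upload will go through OCR. A minimal sketch (the extension set is copied from the table; the workspace's actual routing logic is internal, and for .pdf only image-only pages are OCR'd):

```python
from pathlib import Path

# Extensions from the supported-formats table above.
OCR_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".svg",
    ".tif", ".tiff", ".dng", ".raw", ".heic", ".heif", ".psb",
    ".odg", ".otg", ".odi", ".pdf",
}

def routes_to_ocr(filename: str) -> bool:
    """Predict whether a file is a candidate for OCR, by extension."""
    return Path(filename).suffix.lower() in OCR_EXTENSIONS

print(routes_to_ocr("scan_0042.TIFF"))  # True
print(routes_to_ocr("notes.docx"))      # False
```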
Languages
Built-in OCR ships with language models for English, French, Spanish, German, and Portuguese. Additional languages can be enabled per workspace by an administrator — see Internationalization.
Limits and performance
| Limit | Default | Notes |
|---|---|---|
| Max file size | 200 MB | Configurable per workspace. |
| Max pages per PDF | 2000 | Pages beyond this are skipped, not errored. |
| Throughput (CPU, single worker) | ~5–15 pages / second (clean scan, 300 dpi) | GPU workers run roughly 10× faster. |
| Memory | ~500 MB / concurrent page | |
Quality expectations
- Clean modern scans (300+ dpi): 95–99% character accuracy. Treat as production-quality.
- Photographed documents (phone camera, tilted): 80–95%. Good enough for search, not for citations.
- Low-resolution / damaged scans: 50–80%. Useful for "is this document about X" but not for exact-quote retrieval.
- Handwriting: not supported out of the box. Add a handwriting model via the admin UI for narrow domains.
Speech-to-Text (STT)
STT transcribes audio and video files. The transcript is stored on the file node alongside word-level timestamps so the UI can deep-link to the relevant moment in a media file.
Supported formats
| Class | Extensions |
|---|---|
| Video | .mp4, .wmv, .mpeg, .avi, .mkv, .mov, .ogv, .3gp, .flv |
| Audio | .mp3, .wav, .mka, .wma, .flac, .aac, .aiff, .m4a, .oga, .weba, .webm |
Models
The workspace uses Whisper for transcription. Pick the model size per workspace based on the accuracy/cost tradeoff:
| Whisper model | Approx. RTF (CPU) | Approx. RTF (GPU) | Quality |
|---|---|---|---|
| tiny | 0.3× | 0.05× | Acceptable for clean speech. |
| base | 0.5× | 0.07× | Good for monologue / dictation. |
| small | 1.0× | 0.10× | Solid default. |
| medium | 2.5× | 0.20× | Strong, handles accents. |
| large-v3 | 6.0× | 0.30× | Best, multilingual. |
RTF = real-time factor (time to transcribe ÷ media duration). 0.1× means 1 hour of audio takes 6 minutes.
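The RTF column converts directly into wall-clock estimates. A minimal sketch of the arithmetic (plain math, no workspace API involved):

```python
def transcribe_minutes(media_minutes: float, rtf: float) -> float:
    """Wall-clock transcription time = media duration x real-time factor."""
    return media_minutes * rtf

# 1 hour of audio with the small model on GPU (RTF 0.10x):
print(transcribe_minutes(60, 0.10))  # 6.0 minutes
# The same hour on CPU with large-v3 (RTF 6.0x):
print(transcribe_minutes(60, 6.0))   # 360.0 minutes (6 hours)
```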
Features
- Searchable transcripts. Find spoken phrases in any media file.
- Timestamped navigation. Search hits link to the segment offset; the player jumps directly there.
- Language detection. Auto-detected per file. Force a language in the admin UI for known mono-lingual corpora to skip the detection step.
- Speaker diarization (optional). Mark up speaker turns. Disabled by default — turn on per workspace if needed.
Limits and performance
| Limit | Default | Notes |
|---|---|---|
| Max file size | 2 GB | Configurable. Larger files are split into segments. |
| Max duration | 8 hours | |
| Word-level timestamps | enabled | Disable for marginal throughput gains on long media. |
| Concurrent workers | 1 per GPU / 4 per CPU | Configure per worker pool. |
Cost and capacity notes
- OCR and STT both run inside the workspace by default. No external API calls; no per-file cost beyond compute.
- GPU workers move the needle. For STT especially, GPU throughput is 10–30× CPU.
- Schedule big backfills off-hours. A back-catalog of 100 000 PDF scans will saturate a CPU worker for days. Plan capacity ahead.
- Embeddings on multimodal text. Once extracted, the text is just text — and embeddings cost the usual per-token amount. Disable embedding on short OCR fields (single labels, captions) to save on rebuild costs.
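To put the backfill advice in numbers, here is a rough capacity estimate using the OCR throughput figures above. It assumes linear scaling across workers; real throughput varies with scan quality and page complexity:

```python
def ocr_backfill_days(num_docs: int, pages_per_doc: float,
                      pages_per_sec: float, workers: int = 1) -> float:
    """Rough wall-clock days to OCR a back-catalog of scanned documents."""
    total_pages = num_docs * pages_per_doc
    return total_pages / (pages_per_sec * workers) / 86_400  # 86,400 s/day

# 100,000 scanned PDFs, ~10 pages each, one CPU worker at the low end (5 pages/s):
print(round(ocr_backfill_days(100_000, 10, 5), 1))             # ~2.3 days
# Four workers cut the estimate proportionally:
print(round(ocr_backfill_days(100_000, 10, 5, workers=4), 1))  # ~0.6 days
```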
Retrieval shape
After extraction, multimodal content participates in normal retrieval. Two extras:
- `SearchHit.Highlights[]` includes the transcript fragment that matched (text) or the OCR span that matched (image with bounding box on supported viewers).
- `ChildHit` carries the page/timestamp identifier — use it to deep-link from the hit list to "page 7" or "minute 32".
See Search DSL for the full hit shape.
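To make the deep-linking concrete, here is a sketch of turning a child hit into a viewer link. The payload shape, field names, and URL fragments below are illustrative assumptions, not the actual Search DSL hit shape:

```python
# Hypothetical child-hit payloads; the real shape is defined by the Search DSL.
timestamp_hit = {"kind": "timestamp", "value": 1920}  # seconds into the media
page_hit = {"kind": "page", "value": 7}               # page number in a scan

def deep_link(file_url: str, child_hit: dict) -> str:
    """Build a viewer link that jumps to the matched page or moment."""
    if child_hit["kind"] == "timestamp":
        return f"{file_url}#t={child_hit['value']}"     # minute 32 -> #t=1920
    if child_hit["kind"] == "page":
        return f"{file_url}#page={child_hit['value']}"  # page 7 -> #page=7
    return file_url

print(deep_link("/files/all-hands.mp4", timestamp_hit))  # /files/all-hands.mp4#t=1920
print(deep_link("/files/report.pdf", page_hit))          # /files/report.pdf#page=7
```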
Related pages
- Multimodal tutorial — runnable example with sample files.
- Internationalization — adding language packs.
- Vector search — semantic retrieval over extracted text.
- Reindexing and re-embedding — after enabling new OCR/STT languages.