Multimodal Search Tutorial
A runnable walkthrough: ingest a PDF, an image, and an audio file; verify extraction; query the extracted text via search and vector retrieval. For supported formats and limits see Multimodal search.
Prerequisites
- A workspace you can sign into with admin permissions.
- An ingestion token. See Token scopes.
- Three sample files:
  - a scanned PDF (we use a fictional `invoice-2024-q3.pdf`)
  - a photo of a document (`whiteboard.jpg`)
  - a short audio clip (`standup-2024-09-12.mp3`)
- The C# SDK (`Curiosity.Library`) or the Python SDK (`curiosity`). The examples below use Python.
Step 1 — Enable OCR and STT
- Open the workspace UI, sign in as admin.
- Go to Admin → NLP Configuration. Toggle OCR Engine to on, select the languages your documents use, save.
- Go to Admin → AI Integrations → Speech-to-Text. Pick a Whisper model (`small` is the sensible default), save.
These changes apply to new uploads immediately; existing files re-process the next time the workspace's indexer runs.
Step 2 — Upload the files
```python
import os

from curiosity import Graph

with Graph.connect(endpoint=os.environ["WORKSPACE_URL"],
                   token=os.environ["CURIOSITY_TOKEN"],
                   connector_name="multimodal-tutorial") as g:
    with open("invoice-2024-q3.pdf", "rb") as f:
        invoice = g.upload_file(f, "invoice-2024-q3.pdf", source_name="finance")
    with open("whiteboard.jpg", "rb") as f:
        whiteboard = g.upload_file(f, "whiteboard.jpg", source_name="meetings")
    with open("standup-2024-09-12.mp3", "rb") as f:
        standup = g.upload_file(f, "standup-2024-09-12.mp3", source_name="meetings")
    g.commit_pending()
```
OCR runs as soon as the file is committed. STT queues immediately too, but transcription takes longer — typically 1–10× real-time depending on the Whisper model and worker hardware.
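Because extraction is asynchronous, a script that reads results immediately after upload may find nothing yet. A minimal polling sketch, with `get_status` as a hypothetical callable standing in for however you read the file's processing state (e.g. a status field on the file node) — not part of the SDK itself:

```python
import time

def wait_until_processed(get_status, timeout_s=600, poll_s=10):
    """Poll until extraction finishes.

    get_status() should return "Processing" while OCR/STT is still
    running, and anything else once the file has been processed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status != "Processing":
            return status
        time.sleep(poll_s)
    raise TimeoutError("extraction did not finish within the timeout")
```

Tune `timeout_s` to the slowest modality you expect; with a large Whisper model on CPU, long audio can take many minutes.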
Step 3 — Confirm extraction
In the workspace UI, open each file. You should see:
- Invoice PDF. Extracted text in the side panel: vendor name, line items, totals.
- Whiteboard photo. Extracted snippets corresponding to the writing in the photo. Quality depends on legibility — see Multimodal search → Quality expectations.
- Standup audio. A timestamped transcript with one entry per segment.
If extraction hasn't happened yet, the file page shows a "Processing" badge. Wait a few minutes and refresh.
Step 4 — Search across modalities
The extracted text is just text — it participates in the regular search pipeline.
```python
from curiosity import Query

# All files matching the term "supplier" — PDF text, OCR'd images,
# and transcripts will all be candidates.
q = (Query()
     .start_at("File")
     .take(20)
     .emit("F"))
```
For a search-style retrieval (with facets, scoring, ranking), use the search endpoint directly:
```shell
curl -X POST "$WORKSPACE_URL/api/search" \
  -H "Authorization: Bearer $CURIOSITY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "Query": { "Value": "supplier overdue" },
    "BeforeTypesFacet": ["File"]
  }'
```
Each hit's `Highlights[]` carries the matched passage; for media files, look at `ChildHits[]` to find the page (PDF) or timestamp (audio/video) that matched.
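Walking that response shape looks roughly like this. A sketch assuming the fields named above (`Highlights[]`, `ChildHits[]`) plus a top-level `Hits` array and a `Name` per hit, which are assumptions about the response envelope, not documented guarantees:

```python
def best_passages(response):
    """Collect (file name, passage) pairs from a search response.

    Assumes each hit carries Highlights[] and, for media files,
    ChildHits[] whose own Highlights[] point at the matching page
    or timestamp.
    """
    out = []
    for hit in response.get("Hits", []):
        for passage in hit.get("Highlights", []):
            out.append((hit.get("Name"), passage))
        for child in hit.get("ChildHits", []):
            for passage in child.get("Highlights", []):
                out.append((hit.get("Name"), passage))
    return out
```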
Step 5 — Vector retrieval
If the embedding index for `File.ExtractedText` is enabled, the same query runs as semantic retrieval too:
```python
# Find the 50 most semantically similar files to the phrase.
q = (Query()
     .similar(index="File.ExtractedText", count=50, tolerance=0.5)
     .emit("F"))
```
Combine semantic and keyword retrieval as hybrid by setting `VectorSearchTypes` and `VectorSearchMode = Hybrid` on a `SearchRequest`. See Hybrid search.
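As a request body, that hybrid setup could be sketched like this. The `Query` and `BeforeTypesFacet` fields come from the search example above; the exact value of `VectorSearchTypes` (here, the node type to search over) is an assumption — check the Hybrid search page for the shape your version expects:

```python
# Hybrid: run keyword and vector retrieval together and merge results.
hybrid_request = {
    "Query": {"Value": "supplier overdue"},
    "BeforeTypesFacet": ["File"],
    "VectorSearchTypes": ["File"],
    "VectorSearchMode": "Hybrid",
}
```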
Step 6 — Deep-link to the right moment
When the hit is from a transcript, the workspace's UI handles the deep-link automatically. If you're building a custom UI, read `ChildHit.UID` and resolve it to a segment offset:
```python
hit_uid = "uid-…"  # from SearchHit.ChildHits
segment = g.node_by_uid(hit_uid)
print(segment["StartMs"], segment["EndMs"], segment["Text"])
```
Use these in a media player as `#t={startSec}` parameters.
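Since `StartMs`/`EndMs` are in milliseconds and media fragments take seconds, a small conversion helper (our own, not part of the SDK) keeps the math in one place:

```python
def media_fragment(start_ms, end_ms=None):
    """Build a media-fragment string (#t=start[,end]) from the
    segment's millisecond offsets."""
    start_s = start_ms / 1000
    if end_ms is None:
        return f"#t={start_s:g}"
    return f"#t={start_s:g},{end_ms / 1000:g}"
```

For example, a segment with `StartMs=83500` yields `#t=83.5`, which most HTML5 players accept appended to the media URL.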
Troubleshooting
| Symptom | Likely cause / fix |
|---|---|
| File uploaded but no extracted text appears. | OCR/STT not enabled, or the file's type isn't in the supported list. |
| Extraction takes very long. | Whisper is set to large-v3 on CPU — switch to small or add a GPU worker. |
| OCR text has obvious mistakes. | Source image is too low-res or tilted. Re-scan at 300 dpi or higher. |
| STT transcript stops mid-file. | File exceeds the configured max duration. Split into chunks or raise the cap. |
| Search returns the file but not the right passage. | Vector chunking too coarse. See Vector search → chunking. |
Where to go next
- Multimodal search reference for limits, formats, and quality.
- Vector search — tune retrieval over extracted text.
- Internationalization — add OCR/STT languages.
- Reindexing and re-embedding — after enabling new languages.