Multimodal Search Tutorial
A runnable walkthrough: ingest a PDF, an image, and an audio file; verify extraction; query the extracted text via search and vector retrieval. For supported formats and limits see Multimodal search.
Prerequisites
- A workspace you can sign into with admin permissions.
- An ingestion token. See Token scopes.
- Three sample files:
  - a scanned PDF (we use a fictional `invoice-2024-q3.pdf`)
  - a photo of a document (`whiteboard.jpg`)
  - a short audio clip (`standup-2024-09-12.mp3`)
- The C# SDK (`Curiosity.Library`) or the Python SDK (`curiosity`). The examples below use Python.
Step 1 — Enable OCR and STT
- Open the workspace UI, sign in as admin.
- Go to Admin → NLP Configuration. Toggle OCR Engine to on, select the languages your documents use, save.
- Go to Admin → AI Integrations → Speech-to-Text. Pick a Whisper model (`small` is the sensible default), save.
These changes apply to new uploads immediately; existing files re-process the next time the workspace's indexer runs.
Step 2 — Upload the files
```python
import os

from curiosity import Graph

with Graph.connect(endpoint=os.environ["WORKSPACE_URL"],
                   token=os.environ["CURIOSITY_TOKEN"],
                   connector_name="multimodal-tutorial") as g:
    with open("invoice-2024-q3.pdf", "rb") as f:
        invoice = g.upload_file(f, "invoice-2024-q3.pdf", source_name="finance")
    with open("whiteboard.jpg", "rb") as f:
        whiteboard = g.upload_file(f, "whiteboard.jpg", source_name="meetings")
    with open("standup-2024-09-12.mp3", "rb") as f:
        standup = g.upload_file(f, "standup-2024-09-12.mp3", source_name="meetings")
    g.commit_pending()
```
OCR runs as soon as the file is committed. STT queues immediately too, but transcription takes longer — typically 1–10× real-time depending on the Whisper model and worker hardware.
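Because extraction is asynchronous, a script that reads results immediately after upload may find nothing yet. A minimal polling sketch, with `get_status` as a hypothetical callable standing in for however you read the file's processing state (e.g. a status field on the file node) — not part of the SDK itself:

```python
import time

def wait_until_processed(get_status, timeout_s=600, poll_s=10):
    """Poll until extraction finishes.

    get_status() should return "Processing" while OCR/STT is still
    running, and anything else once the file has been processed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status != "Processing":
            return status
        time.sleep(poll_s)
    raise TimeoutError("extraction did not finish within the timeout")
```

Tune `timeout_s` to the slowest modality you expect; with a large Whisper model on CPU, long audio can take many minutes.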
Step 3 — Confirm extraction
In the workspace UI, open each file. You should see:
- Invoice PDF. Extracted text in the side panel: vendor name, line items, totals.
- Whiteboard photo. Extracted snippets corresponding to the writing in the photo. Quality depends on legibility — see Multimodal search → Quality expectations.
- Standup audio. A timestamped transcript with one entry per segment.
If extraction hasn't happened yet, the file page shows a "Processing" badge. Wait a few minutes and refresh.
Step 4 — Search across modalities
The extracted text is just text — it participates in the regular search pipeline.
```python
from curiosity import Query

# All files matching the term "supplier" — PDF text, OCR'd images,
# and transcripts will all be candidates.
q = (Query()
     .start_at("File")
     .take(20)
     .emit("F"))
```
For a search-style retrieval (with facets, scoring, ranking), use the search endpoint directly:
```shell
curl -X POST "$WORKSPACE_URL/api/search" \
  -H "Authorization: Bearer $CURIOSITY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "Query": { "Value": "supplier overdue" },
    "BeforeTypesFacet": ["File"]
  }'
```
Each hit's `Highlights[]` carries the matched passage; for media files, look at `ChildHits[]` to find the page (PDF) or timestamp (audio/video) that matched.
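Walking that response shape looks roughly like this. A sketch assuming the fields named above (`Highlights[]`, `ChildHits[]`) plus a top-level `Hits` array and a `Name` per hit, which are assumptions about the response envelope, not documented guarantees:

```python
def best_passages(response):
    """Collect (file name, passage) pairs from a search response.

    Assumes each hit carries Highlights[] and, for media files,
    ChildHits[] whose own Highlights[] point at the matching page
    or timestamp.
    """
    out = []
    for hit in response.get("Hits", []):
        for passage in hit.get("Highlights", []):
            out.append((hit.get("Name"), passage))
        for child in hit.get("ChildHits", []):
            for passage in child.get("Highlights", []):
                out.append((hit.get("Name"), passage))
    return out
```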
Step 5 — Vector retrieval
If the embedding index for `File.ExtractedText` is enabled, the same query runs as semantic retrieval too:
```python
# Find the 50 most semantically similar files to the phrase.
q = (Query()
     .similar(index="File.ExtractedText", count=50, tolerance=0.5)
     .emit("F"))
```
Combine semantic and keyword retrieval as hybrid by setting `VectorSearchTypes` and `VectorSearchMode = Hybrid` on a `SearchRequest`. See Hybrid search.
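As a request body, that hybrid setup could be sketched like this. The `Query` and `BeforeTypesFacet` fields come from the search example above; the exact value of `VectorSearchTypes` (here, the node type to search over) is an assumption — check the Hybrid search page for the shape your version expects:

```python
# Hybrid: run keyword and vector retrieval together and merge results.
hybrid_request = {
    "Query": {"Value": "supplier overdue"},
    "BeforeTypesFacet": ["File"],
    "VectorSearchTypes": ["File"],
    "VectorSearchMode": "Hybrid",
}
```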
Step 6 — Deep-link to the right moment
When the hit is from a transcript, the workspace's UI handles the deep-link automatically. If you're building a custom UI, read `ChildHit.UID` and resolve it to a segment offset:
```python
hit_uid = "uid-…"  # from SearchHit.ChildHits
segment = g.node_by_uid(hit_uid)
print(segment["StartMs"], segment["EndMs"], segment["Text"])
```
Use these in a media player as `#t={startSec}` parameters.
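Since `StartMs`/`EndMs` are in milliseconds and media fragments take seconds, a small conversion helper (our own, not part of the SDK) keeps the math in one place:

```python
def media_fragment(start_ms, end_ms=None):
    """Build a media-fragment string (#t=start[,end]) from the
    segment's millisecond offsets."""
    start_s = start_ms / 1000
    if end_ms is None:
        return f"#t={start_s:g}"
    return f"#t={start_s:g},{end_ms / 1000:g}"
```

For example, a segment with `StartMs=83500` yields `#t=83.5`, which most HTML5 players accept appended to the media URL.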
Troubleshooting
| Symptom | Likely cause / fix |
|---|---|
| File uploaded but no extracted text appears. | OCR/STT not enabled, or the file's type isn't in the supported list. |
| Extraction takes very long. | Whisper is set to large-v3 on CPU — switch to small or add a GPU worker. |
| OCR text has obvious mistakes. | Source image is too low-res or tilted. Re-scan at 300 dpi or higher. |
| STT transcript stops mid-file. | File exceeds the configured max duration. Split into chunks or raise the cap. |
| Search returns the file but not the right passage. | Vector chunking too coarse. See Vector search → chunking. |
Where to go next
- Multimodal search reference for limits, formats, and quality.
- Vector search — tune retrieval over extracted text.
- Internationalization — add OCR/STT languages.
- Reindexing and re-embedding — after enabling new languages.