Multimodal Search Tutorial

A runnable walkthrough: ingest a PDF, an image, and an audio file; verify extraction; query the extracted text via search and vector retrieval. For supported formats and limits see Multimodal search.

Prerequisites

  • A workspace you can sign into with admin permissions.
  • An ingestion token. See Token scopes.
  • Three sample files:
    • a scanned PDF (we use a fictional invoice-2024-q3.pdf)
    • a photo of a document (whiteboard.jpg)
    • a short audio clip (standup-2024-09-12.mp3)
  • The C# SDK (Curiosity.Library) or the Python SDK (curiosity). The examples below use Python.

Step 1 — Enable OCR and STT

  1. Open the workspace UI, sign in as admin.
  2. Go to Admin → NLP Configuration. Toggle OCR Engine to on, select the languages your documents use, save.
  3. Go to Admin → AI Integrations → Speech-to-Text. Pick a Whisper model (small is the sensible default), save.

These changes apply to new uploads immediately; existing files re-process the next time the workspace's indexer runs.

Step 2 — Upload the files

import os
from curiosity import Graph

# WORKSPACE_URL and CURIOSITY_TOKEN hold the workspace endpoint and the
# ingestion token from the prerequisites.
with Graph.connect(endpoint=os.environ["WORKSPACE_URL"],
                   token=os.environ["CURIOSITY_TOKEN"],
                   connector_name="multimodal-tutorial") as g:

    # Each upload is staged under a named source ("finance", "meetings").
    with open("invoice-2024-q3.pdf", "rb") as f:
        invoice = g.upload_file(f, "invoice-2024-q3.pdf", source_name="finance")

    with open("whiteboard.jpg", "rb") as f:
        whiteboard = g.upload_file(f, "whiteboard.jpg", source_name="meetings")

    with open("standup-2024-09-12.mp3", "rb") as f:
        standup = g.upload_file(f, "standup-2024-09-12.mp3", source_name="meetings")

    # Uploads stay pending until committed; committing makes them visible
    # in the workspace and kicks off extraction.
    g.commit_pending()

OCR runs as soon as the file is committed. STT queues immediately too, but transcription takes longer — typically 1–10× real-time depending on the Whisper model and worker hardware.

Step 3 — Confirm extraction

In the workspace UI, open each file. You should see:

  • Invoice PDF. Extracted text in the side panel: vendor name, line items, totals.
  • Whiteboard photo. Extracted snippets corresponding to the writing in the photo. Quality depends on legibility — see Multimodal search → Quality expectations.
  • Standup audio. A timestamped transcript with one entry per segment.

If extraction hasn't happened yet, the file page shows a "Processing" badge. Wait a few minutes and refresh.
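
Instead of refreshing, you can poll from code inside the Graph.connect session from Step 2. A minimal sketch; the upload handle's .uid attribute and the ExtractedText field name are assumptions, inferred from the File.ExtractedText embedding index used in Step 5:

import time

# Poll until extracted text lands on the file node. `invoice.uid` and the
# "ExtractedText" field are assumptions (see the File.ExtractedText index
# in Step 5); adjust to the actual SDK surface.
for _ in range(30):
    node = g.node_by_uid(invoice.uid)
    if node.get("ExtractedText"):
        print("Extraction done:", node["ExtractedText"][:200])
        break
    time.sleep(10)  # OCR finishes in seconds; STT can take minutes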

Step 4 — Search across modalities

The extracted text is just text — it participates in the regular search pipeline.

from curiosity import Query

# Enumerate File nodes. Once extraction runs, PDF text, OCR'd images, and
# transcripts are all ordinary File nodes, so all three are candidates here;
# to match a term like "supplier", use the search endpoint below.
q = (Query()
     .start_at("File")   # begin at nodes of type File
     .take(20)           # cap the result set at 20
     .emit("F"))         # emit matched files under the alias "F"
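
To execute the query you need the connected Graph handle from Step 2. The execution call below is hypothetical; the real method name depends on the SDK version:

# Hypothetical: `g.query` stands in for the SDK's query-execution method,
# and "Name" for a display field on File nodes; check the SDK reference.
for file_node in g.query(q):
    print(file_node["Name"])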

For a search-style retrieval (with facets, scoring, ranking), use the search endpoint directly:

curl -X POST "$WORKSPACE_URL/api/search" \
  -H "Authorization: Bearer $CURIOSITY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "Query": { "Value": "supplier overdue" },
    "BeforeTypesFacet": ["File"]
  }'

Each hit's Highlights[] carries the matched passage; for media files, look at ChildHits[] to find the page (PDF) or timestamp (audio/video) that matched.
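
The same request works from Python. A minimal sketch, assuming a requests-style POST and a response envelope whose hit list sits under a top-level Hits key (the exact shape may differ):

import os
import requests

resp = requests.post(
    f"{os.environ['WORKSPACE_URL']}/api/search",
    headers={"Authorization": f"Bearer {os.environ['CURIOSITY_TOKEN']}"},
    json={"Query": {"Value": "supplier overdue"}, "BeforeTypesFacet": ["File"]},
)
resp.raise_for_status()

# "Hits" as the top-level key is an assumption; Highlights[] and ChildHits[]
# are the fields described above.
for hit in resp.json().get("Hits", []):
    print(hit.get("Highlights"))             # matched passages
    for child in hit.get("ChildHits", []):
        print("  child:", child.get("UID"))  # page or segment that matched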

Step 5 — Vector retrieval

If the embedding index for File.ExtractedText is enabled, the same query runs as semantic retrieval too:

q = (Query()
     # Find the 50 most semantically similar files to the query phrase.
     # How the phrase itself is supplied depends on the SDK version; check
     # the Vector search reference.
     .similar(index="File.ExtractedText", count=50, tolerance=0.5)
     .emit("F"))

Combine semantic and keyword retrieval by setting VectorSearchTypes and VectorSearchMode = Hybrid on a SearchRequest, as sketched below. See Hybrid search.
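
A sketch of such a request against the endpoint from Step 4; restricting VectorSearchTypes to ["File"] is an assumption:

import os
import requests

resp = requests.post(
    f"{os.environ['WORKSPACE_URL']}/api/search",
    headers={"Authorization": f"Bearer {os.environ['CURIOSITY_TOKEN']}"},
    json={
        "Query": {"Value": "supplier overdue"},
        "VectorSearchTypes": ["File"],  # assumed: run vector retrieval over File nodes
        "VectorSearchMode": "Hybrid",   # combine keyword and semantic scoring
    },
)
resp.raise_for_status()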

When the hit is from a transcript, the workspace's UI handles the deep-link automatically. If you're building a custom UI, read ChildHit.UID and resolve it to a segment offset:

hit_uid = "uid-…"           # from SearchHit.ChildHits
segment = g.node_by_uid(hit_uid)
print(segment["StartMs"], segment["EndMs"], segment["Text"])

Use these in a media player URL as #t={startSec} fragments. Note that StartMs and EndMs are milliseconds, while media-fragment times are in seconds.
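
A one-line sketch of the conversion; media_url here is a hypothetical playback URL for the file:

# Media fragments (#t=start,end) take seconds; StartMs/EndMs are milliseconds.
start_sec = segment["StartMs"] / 1000
end_sec = segment["EndMs"] / 1000
player_url = f"{media_url}#t={start_sec:.1f},{end_sec:.1f}"  # media_url: hypothetical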

Troubleshooting

Each item below pairs a symptom with its likely cause and fix.

  • File uploaded but no extracted text appears: OCR/STT is not enabled, or the file's type isn't in the supported list.
  • Extraction takes very long: Whisper is set to large-v3 on CPU. Switch to small or add a GPU worker.
  • OCR text has obvious mistakes: the source image is too low-res or tilted. Re-scan at 300 dpi or higher.
  • STT transcript stops mid-file: the file exceeds the configured max duration. Split it into chunks or raise the cap.
  • Search returns the file but not the right passage: vector chunking is too coarse. See Vector search → chunking.

Where to go next

  • Multimodal search: supported formats, limits, and quality expectations.
  • Hybrid search: combining keyword and vector retrieval on a SearchRequest.
  • Vector search: embedding indexes and chunking configuration.
