OCR & File Extraction
The Curiosity Workspace ships with a built-in extractor that turns every file you upload into searchable text — automatically. Plain-text PDFs and Office documents are parsed directly; scans, photos, and image-only PDFs go through Optical Character Recognition (OCR). The extracted content is then indexed by the same text, vector, and NLP pipelines that handle everything else in the workspace, so it shows up in search results, vector similarity, graph queries, and the chat assistant without any extra plumbing.
This page is the feature overview: what runs, when, and why. For supported formats, language packs, and throughput numbers see Multimodal Search. For the developer how-to — calling the extractor from your own C# — see Extracting files to Markdown.
What the extractor does
When a file lands in the workspace, the extractor inspects its type and dispatches to the right backend:
| File class | Backend | Output |
|---|---|---|
| Text PDFs, Office, HTML, email | Direct parser (no OCR needed) | Structured text, headings, tables |
| Image-only PDFs, scanned PDFs | OCR engine, page-by-page | Plain text per page |
| Images (PNG, JPG, TIFF, HEIC, …) | OCR engine | Plain text |
| Audio and video | Speech-to-Text (Whisper) | Timestamped transcript |
| Archives (zip, tar, …) | Recursive extraction | One result per inner file |
You don't pick the backend — the extractor does. A .pdf that turns out to be image-only falls through to OCR transparently.
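The dispatch can be pictured as a type check with a content sniff for PDFs. The sketch below is illustrative only — the enum, method, and extension lists are hypothetical names, not the real Curiosity internals:

```csharp
// Hypothetical sketch of backend dispatch — names are illustrative,
// not the actual Curiosity extractor API.
enum Backend { DirectParser, Ocr, SpeechToText, ArchiveExpander }

static Backend PickBackend(string extension, bool pdfHasTextLayer)
{
    switch (extension.ToLowerInvariant())
    {
        case ".docx": case ".html": case ".eml":
            return Backend.DirectParser;          // structured text, no OCR
        case ".pdf":
            // A .pdf that turns out to be image-only falls through to OCR.
            return pdfHasTextLayer ? Backend.DirectParser : Backend.Ocr;
        case ".png": case ".jpg": case ".tiff": case ".heic":
            return Backend.Ocr;                   // plain text per image
        case ".mp3": case ".mp4": case ".wav":
            return Backend.SpeechToText;          // timestamped transcript
        case ".zip": case ".tar":
            return Backend.ArchiveExpander;       // recurse, one result per inner file
        default:
            return Backend.DirectParser;
    }
}
```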
When extraction runs
Extraction happens once, when the file is first indexed. The result is cached on the file node, so re-indexing (because metadata changed, a new index was added, or the workspace was reindexed) does not re-OCR documents you've already processed. To force a re-extraction — for example after enabling a new OCR language — clear the cached extract and re-queue the file.
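The caching behavior amounts to a read-through cache keyed on the file node. This is a minimal sketch under assumed names — `FileNode`, `CachedExtract`, and the extractor delegate are hypothetical, used only to show when OCR actually runs:

```csharp
// Hypothetical types — illustrative only, not the real Curiosity API.
class FileNode
{
    public string Uid = "";
    public string? CachedExtract;   // stored on the node after first extraction
}

static async Task<string> GetExtractAsync(
    FileNode file,
    Func<FileNode, Task<string>> runExtractor,
    bool forceReExtract = false)
{
    if (forceReExtract)
        file.CachedExtract = null;  // e.g. after enabling a new OCR language

    // Re-indexing hits this cache; OCR only runs when the cache is empty.
    return file.CachedExtract ??= await runExtractor(file);
}
```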
Markdown output
Beyond plain text, the extractor can preserve document structure as Markdown — headings, lists, and tables come through as Markdown when the source format allows it (text PDFs, Office documents, HTML). For pure OCR results the output is plain text, since scans don't carry structural cues.
Markdown output is what you want when you're feeding the extract into an LLM, a chat-with-document flow, or any pipeline that benefits from preserved structure. The simplest way to get it is ChatAI.GetFileAsMarkdownAsync(fileUID) from a Custom Endpoint or Code Index — see the how-to: Extracting files to Markdown.
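A minimal sketch of that call from a Custom Endpoint. `ChatAI.GetFileAsMarkdownAsync(fileUID)` is the documented method; the method name and surrounding scaffolding here are illustrative:

```csharp
// Inside a Custom Endpoint or Code Index; "fileUID" comes from the caller.
public static async Task<string> GetDocumentAsMarkdown(string fileUID)
{
    // Returns headings, lists, and tables as Markdown when the source
    // format carries structure (text PDFs, Office, HTML);
    // pure OCR results come back as plain text.
    var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID);
    return markdown;
}
```

The returned string can be passed straight into an LLM prompt or a chat-with-document flow; see the how-to page for the full walkthrough.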
Languages
Built-in OCR ships with English, French, Spanish, German, and Portuguese models. Additional languages can be enabled per workspace — see Internationalization.
Where the text shows up
Once extracted, the text behaves like any other field on the file node:
- Search. Full-text search and ranking treat it as the file's body.
- Vector / semantic search. Embedding indexes run over the extract and feed similarity, "find related", and RAG retrieval.
- NLP. Entity extraction, key phrases, sentiment, classifiers — all run over the extract by default.
- Chat assistant. "Chat with this document" uses the extracted text (and page numbers, when available) as grounding context.
- Highlights and deep links. Search hits surface the OCR span or transcript segment that matched, and the viewer jumps to the right page or timestamp.
Related pages
- Multimodal Search — supported formats, languages, throughput numbers, and STT details.
- Extracting files to Markdown — how-to for calling the extractor from C#.
- Internationalization — adding OCR language packs.
- Reindexing and re-embedding — after enabling new OCR/STT languages.