OCR & File Extraction
The Curiosity Workspace ships with a built-in extractor that turns every file you upload into searchable text — automatically. Plain-text PDFs and Office documents are parsed directly; scans, photos, and image-only PDFs go through Optical Character Recognition (OCR). The extracted content is then indexed by the same text, vector, and NLP pipelines that handle everything else in the workspace, so it shows up in search results, vector similarity, graph queries, and the chat assistant without any extra plumbing.
This page is the feature overview: what runs, when, and why. For supported formats, language packs, and throughput numbers see Multimodal Search. For the developer how-to — calling the extractor from your own C# — see Extracting files to Markdown.
What the extractor does
When a file lands in the workspace, the extractor inspects its type and dispatches to the right backend:
| File class | Backend | Output |
|---|---|---|
| Text PDFs, Office, HTML, email | Direct parser (no OCR needed) | Structured text, headings, tables |
| Image-only PDFs, scanned PDFs | OCR engine, page-by-page | Plain text per page |
| Images (PNG, JPG, TIFF, HEIC, …) | OCR engine | Plain text |
| Audio and video | Speech-to-Text (Whisper) | Timestamped transcript |
| Archives (zip, tar, …) | Recursive extraction | One result per inner file |
You don't pick the backend — the extractor does. A .pdf that turns out to be image-only falls through to OCR transparently.
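The dispatch can be pictured as a type check with a content sniff for PDFs. The sketch below is illustrative only — the enum, method, and extension lists are hypothetical names, not the real Curiosity internals:

```csharp
// Hypothetical sketch of backend dispatch — names are illustrative,
// not the actual Curiosity extractor API.
enum Backend { DirectParser, Ocr, SpeechToText, ArchiveExpander }

static Backend PickBackend(string extension, bool pdfHasTextLayer)
{
    switch (extension.ToLowerInvariant())
    {
        case ".docx": case ".html": case ".eml":
            return Backend.DirectParser;          // structured text, no OCR
        case ".pdf":
            // A .pdf that turns out to be image-only falls through to OCR.
            return pdfHasTextLayer ? Backend.DirectParser : Backend.Ocr;
        case ".png": case ".jpg": case ".tiff": case ".heic":
            return Backend.Ocr;                   // plain text per image
        case ".mp3": case ".mp4": case ".wav":
            return Backend.SpeechToText;          // timestamped transcript
        case ".zip": case ".tar":
            return Backend.ArchiveExpander;       // recurse, one result per inner file
        default:
            return Backend.DirectParser;
    }
}
```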
When extraction runs
Extraction happens once, when the file is first indexed. The result is cached on the file node, so re-indexing (because metadata changed, a new index was added, or the workspace was reindexed) does not re-OCR documents you've already processed. To force a re-extraction — for example after enabling a new OCR language — clear the cached extract and re-queue the file.
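The caching behavior amounts to a read-through cache keyed on the file node. This is a minimal sketch under assumed names — `FileNode`, `CachedExtract`, and the extractor delegate are hypothetical, used only to show when OCR actually runs:

```csharp
// Hypothetical types — illustrative only, not the real Curiosity API.
class FileNode
{
    public string Uid = "";
    public string? CachedExtract;   // stored on the node after first extraction
}

static async Task<string> GetExtractAsync(
    FileNode file,
    Func<FileNode, Task<string>> runExtractor,
    bool forceReExtract = false)
{
    if (forceReExtract)
        file.CachedExtract = null;  // e.g. after enabling a new OCR language

    // Re-indexing hits this cache; OCR only runs when the cache is empty.
    return file.CachedExtract ??= await runExtractor(file);
}
```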
Markdown output
Beyond plain text, the extractor can preserve document structure as Markdown — headings, lists, and tables come through as Markdown when the source format allows it (text PDFs, Office documents, HTML). For pure OCR results the output is plain text, since scans don't carry structural cues.
Markdown output is what you want when you're feeding the extract into an LLM, a chat-with-document flow, or any pipeline that benefits from preserved structure. The simplest way to get it is ChatAI.GetFileAsMarkdownAsync(fileUID) from a Custom Endpoint or Code Index — see the how-to: Extracting files to Markdown.
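A minimal sketch of that call from a Custom Endpoint. `ChatAI.GetFileAsMarkdownAsync(fileUID)` is the documented method; the method name and surrounding scaffolding here are illustrative:

```csharp
// Inside a Custom Endpoint or Code Index; "fileUID" comes from the caller.
public static async Task<string> GetDocumentAsMarkdown(string fileUID)
{
    // Returns headings, lists, and tables as Markdown when the source
    // format carries structure (text PDFs, Office, HTML);
    // pure OCR results come back as plain text.
    var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID);
    return markdown;
}
```

The returned string can be passed straight into an LLM prompt or a chat-with-document flow; see the how-to page for the full walkthrough.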
Languages
Built-in OCR ships with English, French, Spanish, German, and Portuguese models. Additional languages can be enabled per workspace — see Internationalization.
Where the text shows up
Once extracted, the text behaves like any other field on the file node:
- Search. Full-text search and ranking treat it as the file's body.
- Vector / semantic search. Embedding indexes run over the extract and feed similarity, "find related", and RAG retrieval.
- NLP. Entity extraction, key phrases, sentiment, classifiers — all run over the extract by default.
- Chat assistant. "Chat with this document" uses the extracted text (and page numbers, when available) as grounding context.
- Highlights and deep links. Search hits surface the OCR span or transcript segment that matched, and the viewer jumps to the right page or timestamp.
Related pages
- Multimodal Search — supported formats, languages, throughput numbers, and STT details.
- Extracting files to Markdown — how-to for calling the extractor from C#.
- Internationalization — adding OCR language packs.
- Reindexing and re-embedding — after enabling new OCR/STT languages.