Extract files to Markdown
When you need a file's text — a PDF, an Office document, a scanned image — as a single Markdown string, call ChatAI.GetFileAsMarkdownAsync(fileUID). The workspace's built-in extractor runs underneath: text PDFs and Office files come through with structure preserved, scans and photos go through OCR transparently, and you get back one Markdown string ready to feed into an LLM, store on a sibling node, or embed.
For an overview of what the extractor does and when, see OCR & File Extraction.
The shortest version
GetFileAsMarkdownAsync is available wherever the safe-graph ChatAI helper is — Custom Endpoints, Code Indexes, scheduled tasks, search scopes:
// Inside any code scope that exposes ChatAI
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
A few things to know:
- The call blocks on extraction, so the first invocation for a large scanned PDF can take minutes. The cancellation token is honoured at every stage.
- The result is the file's text only. Captions, thumbnails, attachments, and audio transcripts are returned by the lower-level extractor (ExtractorManager.ExtractAsync) — reach for that when you need them.
- An empty result means the file produced no text (e.g. a blank scan). It is not an error.
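The empty-result case is worth guarding for explicitly before you store or embed anything. A minimal sketch, assuming the same scope helpers (ChatAI, Logger, CancellationToken) as the examples later on this page:

```csharp
// Sketch only — fileUID comes from the surrounding scope.
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
if (string.IsNullOrWhiteSpace(markdown))
{
    // Blank scan or image with no recognisable text: nothing to store, not a failure.
    Logger.LogDebug("File {UID} produced no text", fileUID);
    return;
}
// Safe to persist or embed from here on.
```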
Recommended pattern: extract, cache on a Source node, attach to the file
For anything that runs more than once — a Code Index, an endpoint your UI calls on demand — store the extract on a separate Source node linked to the file. That way re-indexing or re-rendering doesn't trigger another OCR pass.
Why a separate `Source` node?
Storing the extracted Markdown on a sibling node rather than directly on the File keeps the file metadata clean, lets you re-extract without touching the original file, and means multiple sources (raw extract, summary, translated copy) can attach to the same file independently.
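To make the independence point concrete, here is a sketch of two Source nodes hanging off one File. The "summary-of-" key prefix and the second node's purpose are hypothetical; N._FileEntry, N.Source, and E.HasDocument match the index example later on this page:

```csharp
// Sketch: a raw extract and a summary attach to the same File independently.
var file    = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileUid.ToString());
var extract = await Graph.GetOrAddLockedAsync(N.Source.Type, "extracted-from-" + fileUid);
var summary = await Graph.GetOrAddLockedAsync(N.Source.Type, "summary-of-" + fileUid); // hypothetical key
file.AddUniqueEdge(E.HasDocument, extract);
file.AddUniqueEdge(E.HasDocument, summary);
await Graph.CommitAsync(file, extract, summary);
```

Either node can later be re-generated or deleted without touching the other, and without touching the File itself.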
In a Code Index attached to the File type
This is the canonical place to run extraction: every time a File node is added or re-indexed, the workspace calls into your code with the batch of UIDs as ToIndex. Cache the result and skip already-extracted files.
var failed = new List<UID128>();
foreach (var fileUid in ToIndex)
{
    if (CancellationToken.IsCancellationRequested)
        return ToIndex;

    if (!await TryExtractAndSaveAsync(fileUid))
        failed.Add(fileUid);
}
return failed;
async Task<bool> TryExtractAndSaveAsync(UID128 fileUid)
{
    LockedNode fileNode = null;
    LockedNode sourceNode = null;
    try
    {
        var fileKey = fileUid.ToString();
        var sourceKey = "extracted-from-" + fileKey;

        fileNode = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileKey);
        sourceNode = await Graph.GetOrAddLockedAsync(N.Source.Type, sourceKey);

        var existing = sourceNode.GetString(N.Source.ExtractedText);
        if (!string.IsNullOrWhiteSpace(existing))
        {
            Logger.LogDebug("Skipping {UID}: already extracted", fileUid);
            await Graph.CommitAsync(fileNode, sourceNode);
            return true;
        }

        var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUid, CancellationToken);

        sourceNode.SetString(N.Source.ExtractedText, markdown);
        fileNode.AddUniqueEdge(E.HasDocument, sourceNode);
        await Graph.CommitAsync(fileNode, sourceNode);

        Logger.LogInformation("Extracted {Length} chars for {UID}", markdown.Length, fileUid);
        return true;
    }
    catch (Exception ex)
    {
        Logger.LogError(ex, "Failed to extract {UID}", fileUid);
        if (fileNode != null) Graph.AbandonChanges(fileNode);
        if (sourceNode != null) Graph.AbandonChanges(sourceNode);
        return false;
    }
}
How this maps to the code index execution scope:
| Identifier | Role |
|---|---|
| ToIndex | The batch of file UIDs the indexer wants you to process this run. |
| Graph | Used to lock, mutate, and commit the File and Source nodes. |
| ChatAI | Exposes GetFileAsMarkdownAsync — the entry point for extraction. |
| CancellationToken | Honoured per iteration so big batches can be interrupted cleanly. |
| Logger | Surfaces extraction progress and failures in the workspace logs. |
The return value matters: anything you return from the code index is treated as failed UIDs and requeued with backoff. Returning ToIndex on cancellation hands the remaining batch back so the next run picks up where this one stopped.
Idempotency
Two pieces make this re-runnable safely:
- Deterministic key for the Source node. "extracted-from-" + fileKey means the same file always maps to the same source node, so a second pass finds the existing node rather than creating a duplicate.
- The early-out check. If Source.ExtractedText is non-empty, GetFileAsMarkdownAsync is never called. Re-indexing the file won't re-OCR documents you've already processed.
To force a re-extraction for a specific file — e.g. after enabling a new OCR language — clear Source.ExtractedText on its Source node and re-queue the file.
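That reset might look like the following sketch. It assumes the same Graph helpers as the index body above, and that setting the property to an empty string is enough to defeat the early-out check:

```csharp
// Sketch: clear the cached extract so the next index run re-OCRs the file.
var sourceNode = await Graph.GetOrAddLockedAsync(N.Source.Type, "extracted-from-" + fileUid.ToString());
sourceNode.SetString(N.Source.ExtractedText, "");
await Graph.CommitAsync(sourceNode);
// Then re-queue the file (e.g. from the index UI) so extraction runs again.
```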
In a Custom Endpoint
Same call, returned straight to the caller. Good when the UI wants to render a file as Markdown on demand:
// POST /api/file/extract — file UID in the request body
var fileUID = UID128.Parse(Body);
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
return Ok(markdown);
Because OCR can be slow on large PDFs, prefer Pooling mode for an endpoint that does extraction. See Creating endpoints — Mode picker.
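A slightly more defensive variant rejects malformed bodies up front. This is a sketch: it assumes UID128.Parse throws FormatException on bad input, and that a BadRequest helper exists alongside the Ok used above; adjust to whatever your endpoint scope actually exposes.

```csharp
// Sketch: return 400 for a malformed UID instead of surfacing a parse exception.
UID128 fileUID;
try
{
    fileUID = UID128.Parse(Body);
}
catch (FormatException)
{
    return BadRequest("Request body must be a single file UID"); // BadRequest is assumed
}
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
return Ok(markdown);
```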
Linking the Code Index to the File type
When you create the Code Index in the UI (Manage → Indexes → Code Indexes → + New), set its target type to File. From then on, every time a File node is added or re-indexed, this code runs against it, and the extractor — including OCR for image-only PDFs — is invoked transparently.
Related pages
- OCR & File Extraction — what the extractor does, what's supported, when it runs.
- Code Indexes — Introduction — when to reach for a code index over a connector or a search scope.
- Code Index Scope — the full list of helpers (Graph, ToIndex, ChatAI, Logger, …) available inside the body.
- Creating Endpoints — endpoint authoring, modes, and authorization.
- Multimodal Search — supported formats, languages, and throughput numbers for the extractor's OCR backend.