Extract files to Markdown
When you need a file's text — a PDF, an Office document, a scanned image — as a single Markdown string, call ChatAI.GetFileAsMarkdownAsync(fileUID). The workspace's built-in extractor runs underneath: text PDFs and Office files come through with structure preserved, scans and photos go through OCR transparently, and you get back one Markdown string ready to feed into an LLM, store on a sibling node, or embed.
For an overview of what the extractor does and when, see OCR & File Extraction.
The shortest version
GetFileAsMarkdownAsync is available wherever the safe-graph ChatAI helper is — Custom Endpoints, Code Indexes, scheduled tasks, search scopes:
// Inside any code scope that exposes ChatAI
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
A few things to know:
- The call blocks on extraction, so the first invocation for a large scanned PDF can take minutes. The cancellation token is honoured at every stage.
- The result is the file's text only. Captions, thumbnails, attachments, and audio transcripts are returned by the lower-level extractor (ExtractorManager.ExtractAsync) — reach for that when you need them.
- An empty result means the file produced no text (e.g. a blank scan). It is not an error.
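The empty-result case is worth guarding for explicitly before you store or embed anything. A minimal sketch, assuming the same scope helpers (ChatAI, Logger, CancellationToken) as the examples later on this page:

```csharp
// Sketch only — fileUID comes from the surrounding scope.
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
if (string.IsNullOrWhiteSpace(markdown))
{
    // Blank scan or image with no recognisable text: nothing to store, not a failure.
    Logger.LogDebug("File {UID} produced no text", fileUID);
    return;
}
// Safe to persist or embed from here on.
```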
Recommended pattern: extract, cache on a Source node, attach to the file
For anything that runs more than once — a Code Index, an endpoint your UI calls on demand — store the extract on a separate Source node linked to the file. That way re-indexing or re-rendering doesn't trigger another OCR pass.
Why a separate `Source` node?
Storing the extracted Markdown on a sibling node rather than directly on the File keeps the file metadata clean, lets you re-extract without touching the original file, and means multiple sources (raw extract, summary, translated copy) can attach to the same file independently.
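To make the independence point concrete, here is a sketch of two Source nodes hanging off one File. The "summary-of-" key prefix and the second node's purpose are hypothetical; N._FileEntry, N.Source, and E.HasDocument match the index example later on this page:

```csharp
// Sketch: a raw extract and a summary attach to the same File independently.
var file    = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileUid.ToString());
var extract = await Graph.GetOrAddLockedAsync(N.Source.Type, "extracted-from-" + fileUid);
var summary = await Graph.GetOrAddLockedAsync(N.Source.Type, "summary-of-" + fileUid); // hypothetical key
file.AddUniqueEdge(E.HasDocument, extract);
file.AddUniqueEdge(E.HasDocument, summary);
await Graph.CommitAsync(file, extract, summary);
```

Either node can later be re-generated or deleted without touching the other, and without touching the File itself.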
In a Code Index attached to the File type
This is the canonical place to run extraction: every time a File node is added or re-indexed, the workspace calls into your code with the batch of UIDs as ToIndex. Cache the result and skip already-extracted files.
var failed = new List<UID128>();
foreach (var fileUid in ToIndex)
{
    if (CancellationToken.IsCancellationRequested)
        return ToIndex;

    if (!await TryExtractAndSaveAsync(fileUid))
        failed.Add(fileUid);
}
return failed;
async Task<bool> TryExtractAndSaveAsync(UID128 fileUid)
{
    LockedNode fileNode = null;
    LockedNode sourceNode = null;
    try
    {
        var fileKey = fileUid.ToString();
        var sourceKey = "extracted-from-" + fileKey;

        fileNode = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileKey);
        sourceNode = await Graph.GetOrAddLockedAsync(N.Source.Type, sourceKey);

        var existing = sourceNode.GetString(N.Source.ExtractedText);
        if (!string.IsNullOrWhiteSpace(existing))
        {
            Logger.LogDebug("Skipping {UID}: already extracted", fileUid);
            await Graph.CommitAsync(fileNode, sourceNode);
            return true;
        }

        var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUid, CancellationToken);

        sourceNode.SetString(N.Source.ExtractedText, markdown);
        fileNode.AddUniqueEdge(E.HasDocument, sourceNode);
        await Graph.CommitAsync(fileNode, sourceNode);

        Logger.LogInformation("Extracted {Length} chars for {UID}", markdown.Length, fileUid);
        return true;
    }
    catch (Exception ex)
    {
        Logger.LogError(ex, "Failed to extract {UID}", fileUid);
        if (fileNode != null) Graph.AbandonChanges(fileNode);
        if (sourceNode != null) Graph.AbandonChanges(sourceNode);
        return false;
    }
}
How this maps to the code index execution scope:
| Identifier | Role |
|---|---|
| ToIndex | The batch of file UIDs the indexer wants you to process this run. |
| Graph | Used to lock, mutate, and commit the File and Source nodes. |
| ChatAI | Exposes GetFileAsMarkdownAsync — the entry point for extraction. |
| CancellationToken | Honoured per iteration so big batches can be interrupted cleanly. |
| Logger | Surfaces extraction progress and failures in the workspace logs. |
The return value matters: anything you return from the code index is treated as failed UIDs and requeued with backoff. Returning ToIndex on cancellation hands the remaining batch back so the next run picks up where this one stopped.
Idempotency
Two pieces make this re-runnable safely:
- Deterministic key for the Source node. "extracted-from-" + fileKey means the same file always maps to the same source node, so a second pass finds the existing node rather than creating a duplicate.
- The early-out check. If Source.ExtractedText is non-empty, GetFileAsMarkdownAsync is never called. Re-indexing the file won't re-OCR documents you've already processed.
To force a re-extraction for a specific file — e.g. after enabling a new OCR language — clear Source.ExtractedText on its Source node and re-queue the file.
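That reset might look like the following sketch. It assumes the same Graph helpers as the index body above, and that setting the property to an empty string is enough to defeat the early-out check:

```csharp
// Sketch: clear the cached extract so the next index run re-OCRs the file.
var sourceNode = await Graph.GetOrAddLockedAsync(N.Source.Type, "extracted-from-" + fileUid.ToString());
sourceNode.SetString(N.Source.ExtractedText, "");
await Graph.CommitAsync(sourceNode);
// Then re-queue the file (e.g. from the index UI) so extraction runs again.
```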
In a Custom Endpoint
Same call, returned straight to the caller. Good when the UI wants to render a file as Markdown on demand:
// POST /api/file/extract — file UID in the request body
var fileUID = UID128.Parse(Body);
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
return Ok(markdown);
Because OCR can be slow on large PDFs, prefer Pooling mode for an endpoint that does extraction. See Creating endpoints — Mode picker.
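A slightly more defensive variant rejects malformed bodies up front. This is a sketch: it assumes UID128.Parse throws FormatException on bad input, and that a BadRequest helper exists alongside the Ok used above; adjust to whatever your endpoint scope actually exposes.

```csharp
// Sketch: return 400 for a malformed UID instead of surfacing a parse exception.
UID128 fileUID;
try
{
    fileUID = UID128.Parse(Body);
}
catch (FormatException)
{
    return BadRequest("Request body must be a single file UID"); // BadRequest is assumed
}
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
return Ok(markdown);
```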
Linking the Code Index to the File type
When you create the Code Index in the UI (Manage → Indexes → Code Indexes → + New), set its target type to File. From then on, every time a File node is added or re-indexed, this code runs against it, and the extractor — including OCR for image-only PDFs — is invoked transparently.
Related pages
- OCR & File Extraction — what the extractor does, what's supported, when it runs.
- Code Indexes — Introduction — when to reach for a code index over a connector or a search scope.
- Code Index Scope — the full list of helpers (Graph, ToIndex, ChatAI, Logger, …) available inside the body.
- Creating Endpoints — endpoint authoring, modes, and authorization.
- Multimodal Search — supported formats, languages, and throughput numbers for the extractor's OCR backend.