Curiosity

Extract files to Markdown

When you need a file's text — a PDF, an Office document, a scanned image — as a single Markdown string, call ChatAI.GetFileAsMarkdownAsync(fileUID). The workspace's built-in extractor runs underneath: text PDFs and Office files come through with structure preserved, scans and photos go through OCR transparently, and you get back one Markdown string ready to feed into an LLM, store on a sibling node, or embed.

For an overview of what the extractor does and when, see OCR & File Extraction.

The shortest version

GetFileAsMarkdownAsync is available wherever the safe-graph ChatAI helper is — Custom Endpoints, Code Indexes, scheduled tasks, search scopes:

// Inside any code scope that exposes ChatAI
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);

A few things to know:

  • The call blocks on extraction, so the first invocation for a large scanned PDF can take minutes. The cancellation token is honoured at every stage.
  • The result is the file's text only. Captions, thumbnails, attachments, and audio transcripts are returned by the lower-level extractor (ExtractorManager.ExtractAsync) — reach for that when you need them.
  • An empty result means the file produced no text (e.g. a blank scan); it is not an error.
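The first and last points combine naturally: bound the wait and treat an empty string as "no text" rather than a failure. A minimal sketch, assuming the surrounding scope exposes ChatAI, Logger, and CancellationToken as in the examples below; the ten-minute budget is an arbitrary illustration, not a recommended value:

```csharp
// Sketch: guard a potentially slow extraction with a timeout, and treat
// an empty result as "no text" rather than an error.
// ChatAI, Logger, and CancellationToken come from the surrounding scope.
using var linked = CancellationTokenSource.CreateLinkedTokenSource(CancellationToken);
linked.CancelAfter(TimeSpan.FromMinutes(10)); // illustrative budget for a large scanned PDF

var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, linked.Token);

if (string.IsNullOrWhiteSpace(markdown))
    Logger.LogInformation("File {UID} produced no text (blank scan?)", fileUID);
```

Cancelling the linked token aborts extraction at whatever stage it has reached, since the token is honoured throughout.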

For anything that runs more than once — a Code Index, an endpoint your UI calls on demand — store the extract on a separate Source node linked to the file. That way re-indexing or re-rendering doesn't trigger another OCR pass.

Why a separate `Source` node?

Storing the extracted Markdown on a sibling node rather than directly on the File keeps the file metadata clean, lets you re-extract without touching the original file, and means multiple sources (raw extract, summary, translated copy) can attach to the same file independently.
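The "multiple sources per file" point can be sketched like this. The key prefixes ("extracted-from-", "summary-of-") are illustrative conventions, not built-in names, and the three-node CommitAsync call assumes the method accepts more than the two nodes shown elsewhere in this guide:

```csharp
// Sketch: several independent Source nodes hanging off one File node.
// N, E, and Graph come from the code scope, as in the index example below.
var fileKey = fileUid.ToString();

var fileNode = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileKey);
var extract  = await Graph.GetOrAddLockedAsync(N.Source.Type, "extracted-from-" + fileKey);
var summary  = await Graph.GetOrAddLockedAsync(N.Source.Type, "summary-of-"     + fileKey);

// Each Source attaches to the file independently; clearing or rewriting
// one of them never touches the other, or the File node itself.
fileNode.AddUniqueEdge(E.HasDocument, extract);
fileNode.AddUniqueEdge(E.HasDocument, summary);

await Graph.CommitAsync(fileNode, extract, summary);
```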

In a Code Index attached to the File type

This is the canonical place to run extraction: every time a File node is added or re-indexed, the workspace calls into your code with the batch of UIDs as ToIndex. Cache the result and skip already-extracted files.

Code Index — File extraction
var failed = new List<UID128>();

foreach (var fileUid in ToIndex)
{
    if (CancellationToken.IsCancellationRequested)
        return ToIndex;

    if (!await TryExtractAndSaveAsync(fileUid))
        failed.Add(fileUid);
}

return failed;

async Task<bool> TryExtractAndSaveAsync(UID128 fileUid)
{
    LockedNode fileNode   = null;
    LockedNode sourceNode = null;

    try
    {
        var fileKey   = fileUid.ToString();
        var sourceKey = "extracted-from-" + fileKey;
        fileNode   = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileKey);
        sourceNode = await Graph.GetOrAddLockedAsync(N.Source.Type,     sourceKey);

        var existing = sourceNode.GetString(N.Source.ExtractedText);
        if (!string.IsNullOrWhiteSpace(existing))
        {
            Logger.LogDebug("Skipping {UID}: already extracted", fileUid);
            await Graph.CommitAsync(fileNode, sourceNode);
            return true;
        }

        var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUid, CancellationToken);

        sourceNode.SetString(N.Source.ExtractedText, markdown);
        fileNode.AddUniqueEdge(E.HasDocument, sourceNode);

        await Graph.CommitAsync(fileNode, sourceNode);

        Logger.LogInformation("Extracted {Length} chars for {UID}", markdown.Length, fileUid);
        return true;
    }
    catch (Exception ex)
    {
        Logger.LogError(ex, "Failed to extract {UID}", fileUid);
        if (fileNode   != null) Graph.AbandonChanges(fileNode);
        if (sourceNode != null) Graph.AbandonChanges(sourceNode);
        return false;
    }
}

How this maps to the code index execution scope:

| Identifier | Role |
|---|---|
| ToIndex | The batch of file UIDs the indexer wants you to process this run. |
| Graph | Used to lock, mutate, and commit the File and Source nodes. |
| ChatAI | Exposes GetFileAsMarkdownAsync — the entry point for extraction. |
| CancellationToken | Honoured per iteration so big batches can be interrupted cleanly. |
| Logger | Surfaces extraction progress and failures in the workspace logs. |

The return value matters: anything you return from the code index is treated as failed UIDs and requeued with backoff. Returning ToIndex on cancellation hands the whole batch back so the next run picks up where this one stopped; files already extracted are skipped cheaply by the early-out check.

Idempotency

Two pieces make this re-runnable safely:

  • Deterministic key for the Source node. "extracted-from-" + fileKey means the same file always maps to the same source node, so a second pass finds the existing node rather than creating a duplicate.
  • The early-out check. If Source.ExtractedText is non-empty, GetFileAsMarkdownAsync is never called. Re-indexing the file won't re-OCR documents you've already processed.

To force a re-extraction for a specific file — e.g. after enabling a new OCR language — clear Source.ExtractedText on its Source node and re-queue the file.
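A sketch of that reset, reusing the key convention and the two-node CommitAsync signature from the index code above; re-queueing the file afterwards (from the UI or your indexing pipeline) is outside this snippet:

```csharp
// Sketch: clear the cached extract so the next index run re-OCRs the file.
var fileKey    = fileUid.ToString();
var fileNode   = await Graph.GetOrAddLockedAsync(N._FileEntry.Type, fileKey);
var sourceNode = await Graph.GetOrAddLockedAsync(N.Source.Type, "extracted-from-" + fileKey);

// Emptying ExtractedText defeats the early-out check in TryExtractAndSaveAsync.
sourceNode.SetString(N.Source.ExtractedText, "");
await Graph.CommitAsync(fileNode, sourceNode);
```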

In a Custom Endpoint

Same call, returned straight to the caller. Good when the UI wants to render a file as Markdown on demand:

file/extract
// POST /api/file/extract   — file UID in the request body
var fileUID = UID128.Parse(Body);
var markdown = await ChatAI.GetFileAsMarkdownAsync(fileUID, CancellationToken);
return Ok(markdown);

Because OCR can be slow on large PDFs, prefer Pooling mode for an endpoint that does extraction. See Creating endpoints — Mode picker.

Linking the Code Index to the File type

When you create the Code Index in the UI (Manage → Indexes → Code Indexes → + New), set its target type to File. From then on, every time a File node is added or re-indexed, this code runs against it, and the extractor — including OCR for image-only PDFs — is invoked transparently.

```mermaid
flowchart LR
    Upload[File uploaded] --> FileNode[(File node committed)]
    FileNode --> Queue[(Index queue)]
    Queue --> CodeIndex[Custom Code Index on File]
    CodeIndex -->|already extracted?| Skip[Skip]
    CodeIndex -->|no| Extract[ChatAI.GetFileAsMarkdownAsync]
    Extract --> Source[(Source node — ExtractedText)]
    Source --> Indexes[Text + embedding indexes]
```


© 2026 Curiosity. All rights reserved.