Curiosity - PDF + Office

PDF + Office recipe

Source: PdfSample/ · PDFs and DOCX files with optional JSON sidecar metadata per file. The only recipe in the set that targets a non-academic domain — it builds an industrial maintenance graph (equipment, procedures, parts, manufacturers, technicians, safety hazards).

What it teaches

File-blob handling with SHA-256 content hashing for change detection.
Text extraction page-by-page using PdfPig (PDF) and the OpenXML SDK (DOCX).
Chunking with overlap (800-character chunks with an 80-character overlap), which yields embedding-friendly subgraphs you can later use for RAG.
Sidecar JSON metadata that drives schema creation (no OCR guessing).
Page + chunk decomposition so vector search can backtrack from a hit to the structured entity above it.
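The content-hashing idea from the list above can be sketched as follows. This is a minimal illustration, not the recipe's actual code: the `ContentHashing` class and its method names are hypothetical, and only the use of SHA-256 over the file bytes comes from the text.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper; the recipe only states that SHA-256 content hashes
// are used for change detection.
public static class ContentHashing
{
    // Lowercase hex SHA-256 of raw bytes. Identical bytes always give the same hash.
    public static string Sha256Hex(byte[] content) =>
        Convert.ToHexString(SHA256.HashData(content)).ToLowerInvariant();

    // Streams the file through the hash so large PDFs are not loaded fully into memory.
    public static string Sha256HexOfFile(string path)
    {
        using var stream = File.OpenRead(path);
        using var sha = SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(stream)).ToLowerInvariant();
    }

    // A file needs re-ingesting when no hash was stored for it or the stored hash differs.
    public static bool HasChanged(string? previousHash, string currentHash) =>
        previousHash is null ||
        !string.Equals(previousHash, currentHash, StringComparison.OrdinalIgnoreCase);
}
```

Comparing the stored `ContentHash` with a freshly computed one lets an ingestion run skip files whose bytes have not changed, regardless of file timestamps.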

flowchart TB
    PDF[(equipment-manual.pdf)] --> Ex[Text extraction]
    Side[(equipment-manual.json)] --> Meta[Sidecar parsing]
    Ex --> Pages[ManualPage 1..N]
    Pages --> Chunks[TextChunk - 800 char with overlap]
    Meta --> Equipment[Equipment]
    Meta --> Parts[Part 1..N]
    Meta --> Procedures[Procedure 1..N]
    Equipment -. DocumentedBy .-> Pages
    Procedures -. Describes .-> Equipment
    Chunks -. ChunkOf .-> Pages

File extraction + chunking

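// Result of extracting one file: one string per page, plus the raw sidecar JSON when a sidecar exists.
public sealed record ExtractedDocument(
    string SourceFile, string SourceFileName, string ContentHash,
    IReadOnlyList<string> Pages, string? MetadataJson);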

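// Excerpt from the graph-building step; doc, meta, manual, and graph are set up earlier (not shown).
for (var i = 0; i < doc.Pages.Count; i++)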
{
    var pageNum = i + 1;
    var pageKey = $"{meta.DocumentNumber}#p{pageNum}";

    var page = graph.AddOrUpdate(new Nodes.ManualPage
    {
        Id      = pageKey,
        Number  = pageNum,
        Content = doc.Pages[i],
    });
    graph.Link(manual, page, Edges.HasPage, Edges.PageOf);

    // Split the page into chunks; the zero-padded index keeps chunk keys sortable within a page.
    var chunkIdx = 0;
    foreach (var chunk in DocumentSource.Chunk(doc.Pages[i]))
    {
        var chunkKey = $"{pageKey}#c{chunkIdx:000}";
        var chunkNode = graph.AddOrUpdate(new Nodes.TextChunk
        {
            Id      = chunkKey,
            PageNum = pageNum,
            Content = chunk,
        });
        graph.Link(chunkNode, page, Edges.ChunkedFrom, Edges.ChunkOf);
        chunkIdx++;
    }
}
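The loop above calls `DocumentSource.Chunk`, whose body is not shown. A minimal sketch of fixed-size chunking with overlap, assuming the 800/80 character settings mentioned earlier and plain character offsets (no sentence or word boundary handling), might look like this. `ChunkingSketch` is a hypothetical stand-in, not the recipe's implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for DocumentSource.Chunk; the real helper may split differently.
public static class ChunkingSketch
{
    public static IEnumerable<string> Chunk(string text, int size = 800, int overlap = 80)
    {
        // Validate eagerly so bad arguments fail at the call site, not on first enumeration.
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));
        return ChunkIterator(text ?? string.Empty, size, overlap);
    }

    private static IEnumerable<string> ChunkIterator(string text, int size, int overlap)
    {
        // Each chunk starts (size - overlap) characters after the previous one,
        // so consecutive chunks share `overlap` characters.
        var step = size - overlap;
        for (var start = 0; start < text.Length; start += step)
        {
            var length = Math.Min(size, text.Length - start);
            yield return text.Substring(start, length);

            // Stop once a chunk reaches the end, to avoid a trailing chunk
            // that would only repeat the overlap.
            if (start + length >= text.Length) yield break;
        }
    }
}
```

With the default settings each chunk starts 720 characters after the previous one, so the last 80 characters of a chunk reappear at the start of the next. That keeps a sentence that straddles a boundary fully visible in at least one chunk whenever it is shorter than the overlap.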

Configuration

| Variable | Purpose | Default |
| --- | --- | --- |
| `RECIPE_DOCS_ROOT` | Folder of PDF/DOCX + sidecar JSON | `data/manuals/` |
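As a rough sketch of how this setting could be consumed: read the variable, fall back to the documented default, and pair each PDF or DOCX with a sidecar of the same base name. The `DocsRoot` class is hypothetical, and the same-base-name convention is an assumption inferred from the `equipment-manual.pdf` / `equipment-manual.json` pair in the flowchart.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper; the recipe's own configuration code is not shown in this page.
public static class DocsRoot
{
    public const string DefaultRoot = "data/manuals/";

    // Uses RECIPE_DOCS_ROOT when set to a non-blank value, else the documented default.
    public static string Resolve()
    {
        var value = Environment.GetEnvironmentVariable("RECIPE_DOCS_ROOT");
        return string.IsNullOrWhiteSpace(value) ? DefaultRoot : value;
    }

    // Assumed convention: the sidecar sits next to the document, with a .json extension.
    public static string SidecarPathFor(string documentPath) =>
        Path.ChangeExtension(documentPath, ".json");

    // Yields each PDF/DOCX in the folder with its sidecar path, or null when none exists.
    public static IEnumerable<(string Document, string? Sidecar)> Enumerate(string root)
    {
        foreach (var file in Directory.EnumerateFiles(root).OrderBy(f => f, StringComparer.Ordinal))
        {
            var ext = Path.GetExtension(file);
            var isDocument = ext.Equals(".pdf", StringComparison.OrdinalIgnoreCase)
                          || ext.Equals(".docx", StringComparison.OrdinalIgnoreCase);
            if (!isDocument) continue;

            var sidecar = SidecarPathFor(file);
            yield return (file, File.Exists(sidecar) ? sidecar : null);
        }
    }
}
```

Returning `null` for a missing sidecar mirrors the nullable `MetadataJson` field on `ExtractedDocument`, since sidecars are optional per file.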

Reuse notes

DocumentSource.cs (PdfPig + OpenXML + sidecar JSON) is dataset-agnostic.
Chunking is an optional helper: skip it and index the raw pages if your search works better on whole pages.
The chunk → page → manual hierarchy is what makes RAG citations possible: a vector hit on a chunk traces back to the page (and the manual) it came from.
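The recipe resolves citations by following graph edges, but the key format used in the extraction loop (`{DocumentNumber}#p{page}#c{chunk}`) also encodes the hierarchy. As a sketch, a hit's id can be decoded without a graph query. The `ChunkCitation` and `ChunkKeys` names, and the sample document number `MAN-001`, are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical citation shape: which manual, which page, which chunk on that page.
public sealed record ChunkCitation(string DocumentNumber, int PageNumber, int ChunkIndex)
{
    // Rebuilds the page key in the same format the extraction loop uses.
    public string PageKey => $"{DocumentNumber}#p{PageNumber}";
}

public static class ChunkKeys
{
    // Parses keys shaped like "MAN-001#p3#c002". Markers are searched from the end,
    // so a document number that itself contains '#' is still handled.
    public static ChunkCitation Parse(string chunkKey)
    {
        var chunkMark = chunkKey.LastIndexOf("#c", StringComparison.Ordinal);
        if (chunkMark < 0) throw new FormatException($"No chunk marker in '{chunkKey}'.");

        var pageKey = chunkKey.Substring(0, chunkMark);
        var pageMark = pageKey.LastIndexOf("#p", StringComparison.Ordinal);
        if (pageMark < 0) throw new FormatException($"No page marker in '{chunkKey}'.");

        var document = pageKey.Substring(0, pageMark);
        var page = int.Parse(pageKey.Substring(pageMark + 2), CultureInfo.InvariantCulture);
        var chunk = int.Parse(chunkKey.Substring(chunkMark + 2), CultureInfo.InvariantCulture);
        return new ChunkCitation(document, page, chunk);
    }
}
```

For example, `ChunkKeys.Parse("MAN-001#p3#c002")` gives document `MAN-001`, page 3, chunk 2, which is enough to render a citation such as "MAN-001, page 3" next to a retrieved passage.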