PDF + Office recipe
Source: PdfSample/ · PDFs and DOCX files with optional JSON sidecar metadata per file. This is the only recipe in the set that targets a non-academic domain: it builds an industrial maintenance graph (equipment, procedures, parts, manufacturers, technicians, safety hazards).
What it teaches
- File-blob handling with SHA-256 content hashing for change detection.
- Text extraction page-by-page using PdfPig (PDF) and the OpenXML SDK (DOCX).
- Chunking with overlap (800-character chunks, 80-character overlap), producing embedding-friendly subgraphs you can later use for RAG.
- Sidecar JSON metadata that drives schema creation (no OCR guessing).
- Page + chunk decomposition so vector search can backtrack from a hit to the structured entity above it.
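The content-hashing bullet above can be sketched roughly as follows. This is a minimal sketch rather than the recipe's own code: `ContentHasher`, `ComputeHash`, and `HasChanged` are hypothetical names, and it assumes .NET 5 or later for `SHA256.HashData` and `Convert.ToHexString`.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;

// Hypothetical helper: the names here are illustrative, not taken from the recipe.
public static class ContentHasher
{
    // Hash the raw bytes so that renames or timestamp changes do not count as edits.
    public static string ComputeHash(byte[] content) =>
        Convert.ToHexString(SHA256.HashData(content)).ToLowerInvariant();

    // A file needs re-ingestion when no hash is stored yet or the stored hash differs.
    public static bool HasChanged(string? storedHash, byte[] content) =>
        !string.Equals(storedHash, ComputeHash(content), StringComparison.Ordinal);
}
```

A caller would typically pass `File.ReadAllBytes(path)` and compare against whatever value was stored in `ContentHash` on the previous run.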
File extraction + chunking
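```csharp
// One extracted file: source path, content hash, per-page text, and optional sidecar JSON.
public sealed record ExtractedDocument(
    string SourceFile, string SourceFileName, string ContentHash,
    IReadOnlyList<string> Pages, string? MetadataJson);
```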
```csharp
// `graph`, `manual`, `meta`, and `doc` come from the surrounding recipe code.
for (var i = 0; i < doc.Pages.Count; i++)
{
    // Page keys are 1-based: "<DocumentNumber>#p<page>".
    var pageNum = i + 1;
    var pageKey = $"{meta.DocumentNumber}#p{pageNum}";
    var page = graph.AddOrUpdate(new Nodes.ManualPage
    {
        Id = pageKey,
        Number = pageNum,
        Content = doc.Pages[i],
    });
    graph.Link(manual, page, Edges.HasPage, Edges.PageOf);

    // Chunk keys extend the page key: "<pageKey>#c<zero-padded index>".
    var chunkIdx = 0;
    foreach (var chunk in DocumentSource.Chunk(doc.Pages[i]))
    {
        var chunkKey = $"{pageKey}#c{chunkIdx:000}";
        var chunkNode = graph.AddOrUpdate(new Nodes.TextChunk
        {
            Id = chunkKey,
            PageNum = pageNum,
            Content = chunk,
        });
        graph.Link(chunkNode, page, Edges.ChunkedFrom, Edges.ChunkOf);
        chunkIdx++;
    }
}
```
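The loop relies on `DocumentSource.Chunk`, whose body is not shown here. The following is a minimal sketch of how a fixed-size chunker with overlap might look, assuming the 800/80 character settings mentioned earlier. `ChunkSketch` is a hypothetical stand-in, and the real helper may split differently (for example on sentence boundaries).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for DocumentSource.Chunk; the recipe's helper may differ.
public static class ChunkSketch
{
    public static IEnumerable<string> Chunk(string text, int size = 800, int overlap = 80)
    {
        // Validate eagerly so bad arguments fail at the call site, not on first enumeration.
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));
        return Iterate(text ?? string.Empty, size, size - overlap);
    }

    private static IEnumerable<string> Iterate(string text, int size, int step)
    {
        // Each window starts `step` characters after the previous one,
        // so consecutive chunks share `size - step` characters.
        for (var start = 0; start < text.Length; start += step)
        {
            var length = Math.Min(size, text.Length - start);
            yield return text.Substring(start, length);
            if (start + length >= text.Length) yield break; // the last window reached the end
        }
    }
}
```

With these defaults, the last 80 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk when it is shorter than the overlap.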
Configuration
| Variable | Purpose | Default |
|---|---|---|
| `RECIPE_DOCS_ROOT` | Folder of PDF/DOCX + sidecar JSON | `data/manuals/` |
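As an illustration of how the variable in the table might be consumed, here is a minimal sketch. `RecipeConfig` and its members are hypothetical names, and the sidecar naming convention (`manual.pdf` next to `manual.json`) is an assumption, since the recipe does not state how sidecars are matched to documents.

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical configuration helper; not part of the recipe's source.
public static class RecipeConfig
{
    public const string DocsRootVariable = "RECIPE_DOCS_ROOT";
    public const string DefaultDocsRoot = "data/manuals/";

    // `getEnv` is injectable so the lookup can be exercised without touching the process environment.
    public static string ResolveDocsRoot(Func<string, string?>? getEnv = null)
    {
        getEnv ??= name => Environment.GetEnvironmentVariable(name);
        var value = getEnv(DocsRootVariable);
        return string.IsNullOrWhiteSpace(value) ? DefaultDocsRoot : value;
    }

    // ASSUMPTION: the sidecar shares the document's base name, e.g. manual.pdf -> manual.json.
    public static string SidecarPathFor(string documentPath) =>
        Path.ChangeExtension(documentPath, ".json");
}
```

Because the sidecar is optional, a caller would check `File.Exists(RecipeConfig.SidecarPathFor(path))` before reading it and pass `null` as `MetadataJson` otherwise.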
Reuse notes
- `DocumentSource.cs` (PdfPig + OpenXML + sidecar JSON) is dataset-agnostic.
- Chunking is a helper; keep raw pages if your search prefers them.
- The chunk → page → manual hierarchy is what makes RAG citations possible: a vector hit on a chunk traces back to the page (and the manual) it came from.
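To make the citation path concrete, the sketch below recovers the page and manual from a chunk key, using only the key format built in the extraction loop (`<DocumentNumber>#p<page>#c<index>`). `ChunkKeySketch` is a hypothetical helper. In the graph itself you would presumably follow the chunk and page edges instead, and whether the manual node's `Id` equals the document number is an assumption.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that relies only on the "<doc>#p<page>#c<index>" key format.
public static class ChunkKeySketch
{
    public static (string ManualKey, string PageKey, int PageNumber) Trace(string chunkKey)
    {
        // Search from the end so a document number containing '#' does not confuse the split.
        var chunkSep = chunkKey.LastIndexOf("#c", StringComparison.Ordinal);
        if (chunkSep < 0) throw new FormatException($"Not a chunk key: {chunkKey}");
        var pageKey = chunkKey.Substring(0, chunkSep);

        var pageSep = pageKey.LastIndexOf("#p", StringComparison.Ordinal);
        if (pageSep < 0) throw new FormatException($"Missing page segment: {chunkKey}");

        // ASSUMPTION: the part before "#p" identifies the manual.
        var manualKey = pageKey.Substring(0, pageSep);
        var pageNumber = int.Parse(pageKey.Substring(pageSep + 2), CultureInfo.InvariantCulture);
        return (manualKey, pageKey, pageNumber);
    }
}
```

A vector hit on a chunk with the key `MAN-001#p12#c003` (an invented example) would then resolve to page 12 of `MAN-001`, which is enough to render a citation.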