Curiosity

Parquet / Avro recipe

Source: ParquetSample/ · columnar batch files (Parquet or Avro), auto-detected by extension.

Owns in the academic graph: course grades — Grade (composite key <student>/<course>/<term>), linked to Student, Course, Subject, Term.

What it teaches

  • Column projection — declare exactly the columns you need; the reader skips the rest. Often a 5–10× speedup on wide files.
  • Row-group streaming — bounded memory regardless of file size.
  • A format abstraction (IColumnarSource returning ColumnarRow dictionaries) with two implementations: ParquetSource (Parquet.Net) and AvroSource (Apache.Avro).
  • Auto-detection of the source format from the file extension.
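
A minimal sketch of how the abstraction and the extension-based detection could fit together. Only IColumnarSource and ColumnarRow are named in this recipe; the ReadRows signature, the ColumnarFormat enum, the ColumnarFormats.Detect helper, and the .parquet/.avro extension list are assumptions for illustration, not the sample's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical shape of the row type: a dictionary keyed by column name.
public sealed class ColumnarRow : Dictionary<string, object?> { }

// Hypothetical shape of the abstraction: each format streams rows lazily
// and is asked only for the columns the caller declares.
public interface IColumnarSource
{
    IEnumerable<ColumnarRow> ReadRows(string path, IReadOnlyList<string> columns);
}

public enum ColumnarFormat { Parquet, Avro }

public static class ColumnarFormats
{
    // Picks the format from the file extension, ignoring case.
    public static ColumnarFormat Detect(string path)
    {
        var ext = Path.GetExtension(path);
        if (ext.Equals(".parquet", StringComparison.OrdinalIgnoreCase)) return ColumnarFormat.Parquet;
        if (ext.Equals(".avro", StringComparison.OrdinalIgnoreCase)) return ColumnarFormat.Avro;
        throw new NotSupportedException($"Unrecognized columnar file extension: '{ext}'");
    }
}
```

With this sketch, ColumnarFormats.Detect("data/grades.parquet") yields ColumnarFormat.Parquet, and a caller would then construct the matching IColumnarSource implementation. Failing fast on an unknown extension avoids handing a file to the wrong reader.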

Column projection + typed read

public static readonly string[] Columns = new[]
{
    "student_id", "course_code", "subject", "term",
    "letter_grade", "gpa_points", "credit_hours",
};

public static void Ingest(Graph graph, ColumnarRow row)
{
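    var studentId  = row.Get<string>("student_id")   ?? string.Empty;
    var courseCode = row.Get<string>("course_code")  ?? string.Empty;
    var termName   = row.Get<string>("term")         ?? string.Empty;
    var letter     = row.Get<string>("letter_grade") ?? string.Empty;
    var gpaPoints  = row.Get<double>("gpa_points");
    // Assumption: credit_hours is stored as a double; match the type of Nodes.Grade.CreditHours.
    var credits    = row.Get<double>("credit_hours");

    // Composite key: <student>/<course>/<term>.
    var gradeKey = $"{studentId}/{courseCode}/{termName}";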
    var grade = graph.AddOrUpdate(new Nodes.Grade
    {
        Id          = gradeKey,
        Letter      = letter,
        GpaPoints   = gpaPoints,
        CreditHours = credits,
    });

    var student = graph.TryAdd(new Nodes.Student { Id = studentId });
    graph.Link(student, grade, Edges.Received, Edges.ReceivedBy);
}
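
To show where the Columns list takes effect, here is a rough sketch of a projected, row-group-at-a-time read. It assumes Parquet.Net 4.x and flat (non-nested) columns; ProjectedRead and ReadProjectedAsync are hypothetical names, and the recipe's actual ParquetSource may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Parquet;
using Parquet.Schema;

public static class ProjectedRead
{
    // Hypothetical helper: calls onRow once per row, passing only the projected columns.
    public static async Task ReadProjectedAsync(
        string path,
        IReadOnlyCollection<string> columns,
        Action<IReadOnlyDictionary<string, object?>> onRow)
    {
        using Stream file = File.OpenRead(path);
        using ParquetReader reader = await ParquetReader.CreateAsync(file);

        // Projection: keep only the schema fields named in `columns`.
        DataField[] wanted = reader.Schema.GetDataFields()
            .Where(f => columns.Contains(f.Name))
            .ToArray();

        // Streaming: hold one row group's column data in memory at a time.
        for (int g = 0; g < reader.RowGroupCount; g++)
        {
            using ParquetRowGroupReader group = reader.OpenRowGroupReader(g);

            var data = new Dictionary<string, Array>();
            foreach (DataField field in wanted)
                data[field.Name] = (await group.ReadColumnAsync(field)).Data;

            for (long r = 0; r < group.RowCount; r++)
            {
                var row = new Dictionary<string, object?>();
                foreach (var (name, values) in data)
                    row[name] = values.GetValue(r);
                onRow(row);
            }
        }
    }
}
```

Columns that are never requested are never decoded, which is where the projection savings come from, and peak memory is bounded by the largest row group rather than by the file.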

Configuration

Variable          Purpose                    Default
RECIPE_DATA_PATH  Parquet or Avro file path  data/grades.parquet
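
A minimal sketch of resolving this setting, assuming the sample reads it from an environment variable and falls back to the documented default. RecipeConfig and ResolveDataPath are hypothetical names.

```csharp
using System;

public static class RecipeConfig
{
    public const string DefaultDataPath = "data/grades.parquet";

    // Returns RECIPE_DATA_PATH when it is set and non-blank, otherwise the default path.
    public static string ResolveDataPath()
    {
        var configured = Environment.GetEnvironmentVariable("RECIPE_DATA_PATH");
        return string.IsNullOrWhiteSpace(configured) ? DefaultDataPath : configured;
    }
}
```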

Reuse notes

  • For Parquet, set the row-group size on the producer side (~100k rows) for predictable I/O.
  • For Avro, prefer schema-evolution-aware reads when your schema changes over time.
  • row.Get<T>(name) returns default(T) if a column is missing (null for string, 0 for double), so a missing column looks the same as a genuine null or zero. Validate at ingestion when columns are not guaranteed to exist.
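
One way to do that validation, sketched under assumptions: the row is modeled here as a plain dictionary because ColumnarRow's members beyond Get<T> are not shown in this recipe, and RowValidation, MissingColumns, and RequireColumns are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class RowValidation
{
    // Returns the required column names that are absent from the row.
    public static IReadOnlyList<string> MissingColumns(
        IReadOnlyDictionary<string, object?> row, IEnumerable<string> required)
    {
        return required.Where(name => !row.ContainsKey(name)).ToList();
    }

    // Throws one exception that names every missing column.
    public static void RequireColumns(
        IReadOnlyDictionary<string, object?> row, IEnumerable<string> required)
    {
        var missing = MissingColumns(row, required);
        if (missing.Count > 0)
            throw new InvalidDataException("Missing required columns: " + string.Join(", ", missing));
    }
}
```

Since a Parquet or Avro file carries one schema for all of its rows, checking the first row (or the file schema) is typically enough; there is no need to repeat the check per row.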