Entity Extraction

Entity extraction finds meaningful spans in text and turns them into structured outputs you can search, facet, and link to the graph. Examples: product names, device IDs, customer names, ticket numbers, locations.

This page covers configuration, model types, confidence thresholds, and the review loop. For where extraction sits in the bigger picture, see NLP overview. For the dictionary/pattern/ML/LLM comparison, see Types of models.

Extraction vs linking

  • Extraction finds the span in text. Output: { field, start, end, text, type, confidence }.
  • Linking maps that span to an existing graph node (or creates a new one). Output: an edge from the document to the entity node.

You can extract without linking (just annotations on text). You can't link without extracting.
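
For example, extracting the span "MacBook Air" from a document body might produce a record like this (values illustrative, following the shape above):

field: body
start: 112
end: 123
text: "MacBook Air"
type: Device
confidence: 0.93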

A minimal pipeline configuration

Extraction is configured per (nodeType, field) pair. In YAML the shape is:

pipeline:
  language: auto                  # or "en", "fr", …
  spotters:
    - kind: dictionary
      name: products
      entries:
        - value: MacBook Air
          aliases: [MBA, "MacBook Air 2024", "Mac Book Air"]
          link_to: Device:MBA-2024
        - value: MacBook Pro
          aliases: [MBP]
          link_to: Device:MBP-2024
      min_confidence: 0.8
      case_sensitive: false
    - kind: pattern
      name: ticket-id
      regex: "TICKET-\\d{4,6}"
      link_to_type: SupportCase
      link_strategy: by-key
  exclusions:
    - context_includes: ["e.g.", "for example"]
      reason: "training-mention false positive"

You typically edit this in the admin UI; the YAML above shows what's stored under the hood.

Model types

Dictionary / spotter

Curated list of canonical terms with aliases. Use for finite vocabularies that change rarely — product catalogs, customer names, country lists.

- kind: dictionary
  name: countries
  entries:
    - value: "United States"
      aliases: [USA, "U.S.", "United States of America"]
  case_sensitive: false

Strengths: high precision, fast, deterministic. Weakness: misses anything not in the list.

Pattern

Regex-style matcher. Use for structured identifiers, codes, and formats.

- kind: pattern
  name: ticket-id
  regex: "TICKET-\\d{4,6}"
  context_must_include: ["ticket", "case"]      # optional disambiguation

Strengths: catches IDs the dictionary can't enumerate. Weakness: over-fires on unrelated number strings unless you constrain by context.

ML / NER

Pre-trained model recognizes generic types: PERSON, ORG, LOCATION, DATE, MONEY. Use when the vocabulary is open-ended and a generic type is what you want.

- kind: ml-ner
  model: spacy-en-core-web-md
  types: [PERSON, ORG, DATE]
  min_confidence: 0.7

Strengths: covers unknown entities. Weakness: domain-specific accuracy is uneven; "Apple" in a fruit catalog isn't a company.

LLM extraction

A language model with a structured-output prompt. Use when rules and dictionaries can't reach the long tail — extracting intents, structured fields from quotes, summary-style outputs.

See Prompting patterns → Extraction.
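
The exact configuration of an LLM spotter depends on your deployment. As a rough sketch, assuming a kind: llm spotter with prompt and output-schema keys (both hypothetical; check your deployment's schema):

- kind: llm                        # hypothetical kind
  name: quote-fields
  prompt: >
    Extract the customer name, quoted amount, and currency.
    Return only these fields.
  output_schema:                   # hypothetical key
    customer: string
    amount: number
    currency: string
  min_confidence: 0.7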

Custom entity examples

Custom dictionary with linking

- kind: dictionary
  name: engineering-teams
  entries:
    - value: Platform Team
      link_to: Team:platform
    - value: Mobile Team
      link_to: Team:mobile
      aliases: [iOS Team, Android Team]
  link_strategy: by-uid

Each extracted mention creates a Mentions edge from the document to the team node — search results can now facet by team.
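
Conceptually, the stored edge looks something like this (shape illustrative, not the literal storage format):

edge:
  type: Mentions
  from: Document:doc-4711          # hypothetical document key
  to: Team:platform
  span: { start: 58, end: 71, text: "Platform Team" }
  confidence: 0.97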

Pattern + exclusion

- kind: pattern
  name: error-code
  regex: "0x[0-9A-Fa-f]{8}"
  exclusions:
    - context_includes: ["example", "e.g."]

Catches every Windows-style error code; ignores documentation examples.

Hybrid pipeline

spotters:
  - kind: dictionary
    name: products
    entries: [ … ]
  - kind: pattern
    name: ids
    regex: "ASSET-\\d+"
  - kind: ml-ner
    types: [PERSON, ORG]
    min_confidence: 0.85

Dictionary first (cheapest, most precise), pattern for IDs, ML for everything else.

Confidence thresholds

Every extractor emits a confidence score. The right threshold depends on what you do with the result.

Downstream use                             Suggested floor
Display the extracted entity in the UI     ≥ 0.6
Use as a facet                             ≥ 0.7
Auto-link to the graph                     ≥ 0.85
Auto-merge / canonicalize                  ≥ 0.95

Don't auto-link below 0.85 — wrong edges are expensive to clean up later.
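
In the pipeline syntax above, the floor is set per spotter via min_confidence, so applying this table means matching that key to the strictest downstream use of the spotter's output:

# Facet-only use: the 0.7 floor from the table is enough
- kind: ml-ner
  types: [PERSON, ORG]
  min_confidence: 0.7

# Output feeds auto-linking: raise the floor to 0.85
- kind: ml-ner
  types: [PERSON, ORG]
  min_confidence: 0.85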

Review workflow

  1. Sample 100–200 documents. Stratify by source, length, language.
  2. Run extraction in shadow mode. Don't write to the graph yet.
  3. Hand-label. Mark each extracted span as TP (correct) or FP (wrong); record mentions the extractor missed as FN.
  4. Compute precision / recall (worked example after this list).
    • precision = TP / (TP + FP)
    • recall = TP / (TP + FN)
  5. Iterate. High FP → add exclusions, raise threshold. Low recall → expand dictionary, add aliases.
  6. Promote to production when precision is acceptable. "Acceptable" depends on use:
    • Search/facet use: precision ≥ 0.8.
    • Auto-link: precision ≥ 0.95.
  7. Schedule re-review. Quarterly, or after every dictionary update.
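
A worked example for step 4: extraction returns 200 spans, hand-labeling marks 180 as correct (TP = 180, FP = 20), and you find 30 mentions it missed (FN = 30). Then precision = 180 / (180 + 20) = 0.90 and recall = 180 / (180 + 30) ≈ 0.86: good enough to promote for search and faceting, not yet for auto-linking.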

Common pitfalls

  • Aliases as an afterthought. "MBA," "Mac Book Air," "MacBook Air 2024" — write them all in the dictionary, not just the canonical form.
  • Patterns without context. "\d{4}" matches every year, postcode, and asset ID. Constrain with surrounding tokens.
  • No exclusions. "For example, error 0x12345678" is a documentation mention, not a real error. Exclude.
  • Linking before extraction is trustworthy. Wrong edges in the graph are worse than no edges.
  • Single-model bias. Hybrid pipelines (dictionary + pattern + ML) consistently beat any single model.
  • No re-extraction after schema change. Adding a new spotter? Re-process the existing corpus, or the new rules will apply only to new documents.
