Entity Extraction
Entity extraction finds meaningful spans in text and turns them into structured outputs you can search, facet, and link to the graph. Examples: product names, device IDs, customer names, ticket numbers, locations.
This page covers configuration, model types, confidence thresholds, and the review loop. For where extraction sits in the bigger picture, see NLP overview. For the dictionary/pattern/ML/LLM comparison, see Types of models.
Extraction vs linking
- Extraction finds the span in text. Output: { field, start, end, text, type, confidence }.
- Linking maps that span to an existing graph node (or creates a new one). Output: an edge from the document to the entity node.
You can extract without linking (just annotations on text). You can't link without extracting.
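The two record shapes above can be sketched as plain data classes. This is an illustrative sketch only — the field names of the extraction output follow the page, while the `LinkEdge` shape and all identifiers are assumptions, not this system's actual types:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    # A span-level annotation: where the mention sits, in which field.
    field: str
    start: int
    end: int
    text: str
    type: str
    confidence: float

@dataclass
class LinkEdge:
    # The linking step's output: an edge from a document to an entity node.
    source_doc: str
    target_node: str
    via: Extraction

# Extraction without linking is valid; linking always carries an extraction.
m = Extraction(field="body", start=10, end=21, text="MacBook Air",
               type="Device", confidence=0.92)
edge = LinkEdge(source_doc="doc-42", target_node="Device:MBA-2024", via=m)
```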
A minimal pipeline configuration
Configured per (nodeType, field). In YAML the shape is:
```yaml
pipeline:
  language: auto              # or "en", "fr", …
  spotters:
    - kind: dictionary
      name: products
      entries:
        - value: MacBook Air
          aliases: [MBA, "MacBook Air 2024", "Mac Book Air"]
          link_to: Device:MBA-2024
        - value: MacBook Pro
          aliases: [MBP]
          link_to: Device:MBP-2024
      min_confidence: 0.8
      case_sensitive: false
    - kind: pattern
      name: ticket-id
      regex: "TICKET-\\d{4,6}"
      link_to_type: SupportCase
      link_strategy: by-key
      exclusions:
        - context_includes: ["e.g.", "for example"]
          reason: "training-mention false positive"
```
You typically edit this in the admin UI; the YAML above shows what's stored under the hood.
Model types
Dictionary / spotter
Curated list of canonical terms with aliases. Use for finite vocabularies that change rarely — product catalogs, customer names, country lists.
```yaml
- kind: dictionary
  name: countries
  entries:
    - value: "United States"
      aliases: [USA, "U.S.", "United States of America"]
  case_sensitive: false
```
Strengths: high precision, fast, deterministic. Weakness: misses anything not in the list.
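The matching behavior of a dictionary spotter can be sketched in a few lines — a minimal, illustrative implementation (function and variable names are assumptions) of case-insensitive lookup over canonical values plus aliases:

```python
def dictionary_spotter(text, entries, case_sensitive=False):
    """Return (start, end, canonical) for every dictionary hit.
    `entries` maps a canonical value to its list of aliases."""
    haystack = text if case_sensitive else text.lower()
    hits = []
    for canonical, aliases in entries.items():
        for term in [canonical, *aliases]:
            needle = term if case_sensitive else term.lower()
            pos = haystack.find(needle)
            while pos != -1:
                # Every alias hit resolves to the canonical value.
                hits.append((pos, pos + len(needle), canonical))
                pos = haystack.find(needle, pos + 1)
    return sorted(hits)

entries = {"MacBook Air": ["MBA", "Mac Book Air"]}
hits = dictionary_spotter("My mba broke", entries)  # alias match, case folded
```

Deterministic and fast, as noted above — but "My new laptop broke" yields nothing, which is exactly the list's blind spot.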
Pattern
Regex-style matcher. Use for structured identifiers, codes, and formats.
```yaml
- kind: pattern
  name: ticket-id
  regex: "TICKET-\\d{4,6}"
  context_must_include: ["ticket", "case"]   # optional disambiguation
```
Strengths: catches IDs the dictionary can't enumerate. Weakness: over-fires on unrelated number strings unless you constrain by context.
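One way the context constraint can work is a window check around each regex match — a sketch under assumed semantics (the real `context_must_include` behavior and the 40-character window are illustrative):

```python
import re

def pattern_spotter(text, regex, context_must_include=None, window=40):
    """Return regex matches; if context terms are given, require one nearby."""
    out = []
    for m in re.finditer(regex, text):
        # Look at a window of characters on both sides of the match.
        ctx = text[max(0, m.start() - window): m.end() + window].lower()
        if context_must_include and not any(t in ctx for t in context_must_include):
            continue  # bare number string with no disambiguating context
        out.append(m.group())
    return out

hits = pattern_spotter("Please reopen ticket TICKET-12345.",
                       r"TICKET-\d{4,6}",
                       context_must_include=["ticket", "case"])
```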
ML / NER
Pre-trained model recognizes generic types: PERSON, ORG, LOCATION, DATE, MONEY. Use when the vocabulary is open-ended and a generic type is what you want.
```yaml
- kind: ml-ner
  model: spacy-en-core-web-md
  types: [PERSON, ORG, DATE]
  min_confidence: 0.7
```
Strengths: covers unknown entities. Weakness: domain-specific accuracy is uneven; "Apple" in a fruit catalog isn't a company.
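The `types` and `min_confidence` options amount to a post-filter over raw NER output. The sketch below shows that filter over mock spans (the span dicts and scores are made up for illustration; the model itself is not invoked here):

```python
def filter_ner(spans, types, min_confidence):
    """Keep spans whose label is configured and whose score clears the floor."""
    allowed = set(types)
    return [s for s in spans
            if s["type"] in allowed and s["confidence"] >= min_confidence]

# Mock output from a generic NER model.
raw = [
    {"text": "Tim Cook", "type": "PERSON", "confidence": 0.93},
    {"text": "Apple",    "type": "ORG",    "confidence": 0.55},  # uncertain: dropped
    {"text": "Tuesday",  "type": "DATE",   "confidence": 0.88},
]
kept = filter_ner(raw, types=["PERSON", "ORG", "DATE"], min_confidence=0.7)
```

Note the low-confidence "Apple" span is exactly the fruit-catalog ambiguity the threshold exists to absorb.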
LLM extraction
A language model with a structured-output prompt. Use when rules and dictionaries can't reach the long tail — extracting intents, structured fields from quotes, summary-style outputs.
See Prompting patterns → Extraction.
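The structured-output approach can be sketched as prompt construction plus strict validation of the reply. Everything here — the schema shape, the prompt wording, the function names — is a hypothetical illustration, not this system's actual prompt or API:

```python
import json

# Hypothetical output schema the model is asked to follow.
SCHEMA = {"entities": [{"text": "string", "type": "string"}]}

def build_prompt(text):
    return (
        "Extract product and ticket mentions from the text below.\n"
        f"Respond with JSON matching this shape: {json.dumps(SCHEMA)}\n"
        f"Text: {text}"
    )

def parse_response(raw):
    """Validate the model's reply before anything downstream trusts it."""
    data = json.loads(raw)
    return [e for e in data.get("entities", []) if "text" in e and "type" in e]

# Simulated model reply, parsed and validated.
ents = parse_response('{"entities": [{"text": "TICKET-1234", "type": "SupportCase"}]}')
```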
Custom entity examples
Custom dictionary with linking
```yaml
- kind: dictionary
  name: engineering-teams
  entries:
    - value: Platform Team
      link_to: Team:platform
    - value: Mobile Team
      link_to: Team:mobile
      aliases: [iOS Team, Android Team]
  link_strategy: by-uid
```
Each extracted mention creates a Mentions edge from the document to the team node — search results can now facet by team.
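The mention-to-edge step can be sketched as a lookup from canonical values to node IDs — an illustrative sketch only (the edge dict shape and names are assumptions):

```python
def link_mentions(doc_id, canonicals, link_map):
    """Turn dictionary hits into Mentions edges from the document to graph nodes."""
    edges = []
    for canonical in canonicals:
        node = link_map.get(canonical)
        if node:  # only link mentions the dictionary can resolve to a node
            edges.append({"from": doc_id, "to": node, "kind": "Mentions"})
    return edges

link_map = {"Platform Team": "Team:platform", "Mobile Team": "Team:mobile"}
edges = link_mentions("doc-7", ["Platform Team"], link_map)
```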
Pattern + exclusion
```yaml
- kind: pattern
  name: error-code
  regex: "0x[0-9A-Fa-f]{8}"
  exclusions:
    - context_includes: ["example", "e.g."]
```
Catches every Windows-style error code; ignores documentation examples.
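The exclusion check is the mirror image of a context requirement: match first, then drop hits whose surroundings contain an excluded phrase. A minimal sketch, with illustrative names and an assumed 30-character context window:

```python
import re

def spot_with_exclusions(text, regex, excluded_context, window=30):
    """Drop matches whose nearby context contains an excluded phrase."""
    hits = []
    for m in re.finditer(regex, text):
        ctx = text[max(0, m.start() - window): m.end() + window].lower()
        if any(p in ctx for p in excluded_context):
            continue  # documentation mention, not a real occurrence
        hits.append(m.group())
    return hits

excl = ["example", "e.g."]
real = spot_with_exclusions("Crash with 0xDEADBEEF", r"0x[0-9A-Fa-f]{8}", excl)
doc  = spot_with_exclusions("For example, error 0x12345678", r"0x[0-9A-Fa-f]{8}", excl)
```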
Hybrid pipeline
```yaml
spotters:
  - kind: dictionary
    name: products
    entries: [ … ]
  - kind: pattern
    name: ids
    regex: "ASSET-\\d+"
  - kind: ml-ner
    types: [PERSON, ORG]
    min_confidence: 0.85
```
Dictionary first (cheapest, most precise), pattern for IDs, ML for everything else.
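One way to realize that ordering is to run spotters in priority order and let the first spotter to claim a span win. This is a sketch under assumptions — the real merge strategy is not specified on this page, and both spotter functions are toy stand-ins:

```python
import re

def run_pipeline(text, spotters):
    """Run spotters in order; earlier (cheaper, more precise) spotters win overlaps."""
    claimed, results = set(), []
    for spot in spotters:
        for start, end, label in spot(text):
            if any(s < end and start < e for s, e in claimed):
                continue  # overlaps a higher-priority hit
            claimed.add((start, end))
            results.append((start, end, label))
    return sorted(results)

def dict_spot(text):  # toy dictionary spotter
    i = text.find("MacBook Air")
    return [(i, i + len("MacBook Air"), "Device")] if i != -1 else []

def id_spot(text):    # toy pattern spotter
    return [(m.start(), m.end(), "Asset") for m in re.finditer(r"ASSET-\d+", text)]

hits = run_pipeline("MacBook Air tagged ASSET-991", [dict_spot, id_spot])
```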
Confidence thresholds
Every extractor emits a confidence score. The right threshold depends on what you do with the result.
| Downstream use | Suggested floor |
|---|---|
| Display the extracted entity in the UI | ≥ 0.6 |
| Use as a facet | ≥ 0.7 |
| Auto-link to the graph | ≥ 0.85 |
| Auto-merge / canonicalize | ≥ 0.95 |
Don't auto-link below 0.85 — wrong edges are expensive to clean up later.
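The table above folds naturally into a routing helper that maps a score to the strongest action it permits. The action names below are illustrative, not part of the product's API:

```python
def route(confidence):
    """Map a confidence score to the strongest permitted downstream action."""
    if confidence >= 0.95:
        return "auto-merge"   # safe to canonicalize
    if confidence >= 0.85:
        return "auto-link"    # safe to write a graph edge
    if confidence >= 0.7:
        return "facet"        # usable for search facets
    if confidence >= 0.6:
        return "display"      # show in the UI only
    return "discard"
```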
Review workflow
- Sample 100–200 documents. Stratify by source, length, language.
- Run extraction in shadow mode. Don't write to the graph yet.
- Hand-label. Mark each extracted span as TP (correct), FP (wrong), or FN (missed).
- Compute precision / recall.
- precision = TP / (TP + FP)
- recall = TP / (TP + FN)
- Iterate. High FP → add exclusions, raise threshold. Low recall → expand dictionary, add aliases.
- Promote to production when precision is acceptable. "Acceptable" depends on use:
- Search/facet use: precision ≥ 0.8.
- Auto-link: precision ≥ 0.95.
- Schedule re-review. Quarterly, or after every dictionary update.
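The precision/recall computation in step 4 is small enough to sketch directly, with guards for empty denominators:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# E.g. 80 correct spans, 20 wrong, 20 missed across the labeled sample:
p, r = precision_recall(tp=80, fp=20, fn=20)  # both 0.8 — facet-ready, not link-ready
```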
Common pitfalls
- Aliases as an afterthought. "MBA," "Mac Book Air," "MacBook Air 2024" — write them all in the dictionary, not just the canonical form.
- Patterns without context. "\d{4}" matches every year, postcode, and asset ID. Constrain with surrounding tokens.
- No exclusions. "For example, error 0x12345678" is a documentation mention, not a real error. Exclude it.
- Linking before extraction is trustworthy. Wrong edges in the graph are worse than no edges.
- Single-model bias. Hybrid pipelines (dictionary + pattern + ML) consistently beat any single model.
- No re-extraction after schema change. Adding a new spotter? Re-process the existing corpus, or the new rules will apply only to new documents.
Where to go next
- Types of models — dictionary / pattern / ML / LLM head-to-head.
- Prompting patterns → Extraction — LLM-driven extraction.
- Embeddings — when semantic similarity covers what extraction can't.
- Reindexing and re-embedding — re-processing after rule changes.