
Types of NLP Models

A side-by-side look at the four kinds of extraction model Curiosity Workspace can run: dictionaries, patterns, ML/NER, and LLM extraction. Pick the right one — or the right combination — based on your vocabulary, volume, and accuracy needs.

To decide whether extraction is the right tool in the first place, see NLP overview. For configuration details, see Entity extraction.

Comparison at a glance

Aspect | Dictionary / spotter | Pattern (regex) | ML / NER | LLM extraction
Best for | Finite vocabularies | Structured identifiers | Generic types (PERSON, ORG, …) | Open-ended fields, intents, summaries
Precision | Very high | Medium–high | Medium | Medium–high (depends on prompt)
Recall | Limited to the list | Wide on shape; misses novel formats | Wide for trained types | Widest
Latency | Microseconds | Microseconds | Milliseconds | Hundreds of ms to seconds
Cost / 1k docs | Negligible | Negligible | CPU/GPU compute | Provider API call (per token)
Maintenance | Update list as vocabulary grows | Update regex on new shapes | Re-train when accuracy drifts | Update prompt; track model versions
Determinism | Deterministic | Deterministic | Mostly deterministic | Stochastic (set temperature = 0)
Multilingual | Per-language dictionary | Pattern-dependent | Per-language model | Inherent (most modern LLMs)
Confidence | Binary or boost-weighted | Binary or context-weighted | Probabilistic score | Self-reported, unreliable

Dictionary / spotter

Curated list of canonical terms with aliases, matched against tokenized text.

Pick this when:

  • Your vocabulary is enumerable — product catalog, customer list, internal team names.
  • You need deterministic results for compliance.
  • You can keep the list updated as the catalog changes.

Avoid when:

  • The space is open-ended (every possible person name, every potential brand).
  • Your alias coverage is poor — you'll under-extract.

Configuration: see Entity extraction → Dictionary / spotter.
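
The configuration lives there rather than in code, but a minimal Python sketch shows what alias-based spotting does conceptually. The vocabulary, tokenization, and matching details below are illustrative assumptions, not the Workspace implementation.

```python
# Minimal sketch of a dictionary/spotter: canonical terms with aliases,
# matched case-insensitively against the text. Illustrative only; not
# the Workspace implementation.
import re

# Canonical term -> aliases (illustrative vocabulary).
DICTIONARY = {
    "MacBook Air": ["macbook air", "mba"],
    "Acme GmbH": ["acme gmbh", "acme"],
}

# Flatten to (alias, canonical), longest aliases first so they win overlaps.
LOOKUP = sorted(
    ((alias, canon) for canon, aliases in DICTIONARY.items() for alias in aliases),
    key=lambda pair: -len(pair[0]),
)

def spot(text: str):
    """Return (canonical, start, end) for every alias found in the text."""
    hits = []
    lowered = text.lower()
    for alias, canon in LOOKUP:
        for m in re.finditer(r"\b" + re.escape(alias) + r"\b", lowered):
            hits.append((canon, m.start(), m.end()))
    return hits

print(spot("Customer reports their MacBook Air (MBA) overheats."))
# [('MacBook Air', 23, 34), ('MacBook Air', 36, 39)]
```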

Pattern (regex)

Shape-based matching for codes and identifiers.

Pick this when:

  • Your entity has a consistent format — TICKET-12345, 0xDEADBEEF, SKU-AB-1234.
  • The format is unambiguous in context.

Avoid when:

  • The pattern is too generic (\d{4} matches years, postcodes, ages, asset IDs — all of them).
  • The format changes across vendors or sources without you knowing.

Always pair patterns with context constraints (context_must_include) and exclusions (context_includes → reject) to control over-firing.
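
Below is a minimal Python sketch of that idea: a shape pattern guarded by nearby-context rules. The keyword lists and window logic are illustrative assumptions that mirror the intent of context_must_include and reject; the real options are configured under Entity extraction rather than coded by hand.

```python
import re

PATTERN = re.compile(r"\bTICKET-\d{5}\b")
MUST_INCLUDE = {"ticket", "support", "case"}   # at least one must appear nearby
REJECT = {"invoice", "po"}                     # any of these nearby rejects the match

def extract_ticket_ids(text: str, window: int = 40):
    """Return pattern matches whose surrounding context passes both guards."""
    hits = []
    for m in PATTERN.finditer(text):
        context = text[max(0, m.start() - window): m.end() + window].lower()
        words = set(re.findall(r"[a-z]+", context))
        if words & REJECT:
            continue                           # exclusion: bad context nearby
        if words & MUST_INCLUDE:
            hits.append(m.group())             # constraint: good context nearby
    return hits

print(extract_ticket_ids("Support escalated TICKET-48213 after the outage."))  # ['TICKET-48213']
print(extract_ticket_ids("Invoice reference TICKET-99999 looks wrong."))       # []
```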

ML / NER

Pre-trained named-entity recognizers. Curiosity ships spaCy-class models per language and integrates with external NER services.

Pick this when:

  • You need generic types (PERSON, ORG, LOCATION, MONEY, DATE) and don't have time to curate dictionaries.
  • Your text is well-formed prose (news, emails, articles).

Avoid when:

  • Domain ambiguity is high — "Apple" in a tech corpus is unambiguous; in a grocery corpus it isn't.
  • You need to extract specific business entities, not generic types.

Mix with dictionaries: run the dictionary first for known entities, fall back to NER for anything else.
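
To see what an NER pass actually returns, here is a short example using the open-source spaCy library. This is an illustrative stand-in only; Curiosity's bundled per-language models are configured in the workspace rather than loaded from code like this.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Maria Chen from Contoso signed a $2.4M renewal in Berlin last March.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (model-dependent): PERSON, ORG, MONEY, GPE, DATE spans
```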

LLM extraction

Use a language model with a structured-output prompt.

Pick this when:

  • The field is genuinely open-ended — extracting "customer's intent" from a conversation, structured fields from a contract clause.
  • You can afford the latency and per-call cost (typically batch / offline workflows).
  • You can stomach occasional drift in output format — and you have validation in place.

Avoid when:

  • You have high-volume, low-latency requirements (real-time ingestion).
  • The vocabulary is small and stable — dictionaries are cheaper and more precise.
  • You need exact reproducibility — LLM outputs drift between model versions.

See Prompting patterns → Extraction for templates.
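
As a rough illustration of the shape such a template takes, the sketch below pairs a structured-output prompt with output validation and temperature 0. call_llm() is a placeholder for whichever provider client you use, and the field names are invented for the example.

```python
import json

# Structured-output prompt: fixed keys, closed value sets where possible.
PROMPT = """Extract the following fields from the support email below.
Respond with JSON only, using exactly these keys:
  intent  - one of: refund, repair, question, complaint
  product - the product name mentioned, or null
  urgency - low, medium, or high

Email:
{email}
"""

REQUIRED_KEYS = {"intent", "product", "urgency"}
ALLOWED_INTENTS = {"refund", "repair", "question", "complaint"}

def extract_fields(email: str, call_llm) -> dict:
    # call_llm is a placeholder for your provider client; temperature 0 for determinism.
    raw = call_llm(PROMPT.format(email=email), temperature=0)
    data = json.loads(raw)                      # raises if the model drifted from JSON
    if set(data) != REQUIRED_KEYS or data["intent"] not in ALLOWED_INTENTS:
        raise ValueError(f"LLM output failed validation: {data!r}")
    return data
```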

Combining model types

Most production systems run at least two model types in a pipeline.

Combination | Where it shines
Dictionary → Pattern | Known products + ticket/asset IDs in support text.
Dictionary → ML NER | Known accounts/customers + generic people/orgs in CRM notes.
Pattern → LLM (long-tail) | Catch IDs cheaply, send everything else to a model for soft extraction.
Dictionary → Pattern → ML → LLM | "Throw everything at it" for high-stakes corpora where coverage matters.

The order matters: the cheaper extractors run first and remove the high-certainty cases so the expensive ones don't reprocess them.
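
A minimal sketch of that ordering, with placeholder extractor functions standing in for the stages above: each stage only sees spans that a cheaper stage has not already claimed.

```python
def run_pipeline(text, stages):
    """stages: list of (name, extractor) ordered cheapest first.
    Each extractor returns a list of (label, start, end) spans."""
    claimed, results = [], []
    for name, extractor in stages:
        for label, start, end in extractor(text):
            # Skip anything overlapping a span a cheaper stage already claimed.
            if any(start < e and end > s for s, e in claimed):
                continue
            claimed.append((start, end))
            results.append((name, label, start, end))
    return results
```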

Migration paths

  • Start with dictionaries. They give the most signal per hour of work. Build the vocabulary from the canonical entities already in your graph (MapAsync(...) in the SDK).
  • Add patterns next. Find the high-volume IDs the support team copy-pastes.
  • Add ML when generic types matter. Run it on the longest fields, not headers.
  • Add LLM last. Only when the gap left by the other three is worth the cost.

Selecting per use case

Translate the business question into a model choice:

Business question | Best model type
"Show me every ticket about MacBook Air." | Dictionary (product list).
"List every error code that appeared in the last week." | Pattern.
"Which customers and partners did this email mention?" | Dictionary + ML NER.
"What does the customer actually want in this 800-word email?" | LLM.
"Build a topic taxonomy from 100k support cases." | LLM (offline batch).

