
Types of NLP Models

A side-by-side look at the four kinds of extraction model Curiosity Workspace can run: dictionaries, patterns, ML/NER, and LLM extraction. Pick the right one — or the right combination — based on your vocabulary, volume, and accuracy needs.

To decide whether extraction is the right tool in the first place, see NLP overview. For configuration details, see Entity extraction.

Comparison at a glance

Aspect | Dictionary / spotter | Pattern (regex) | ML / NER | LLM extraction
Best for | Finite vocabularies | Structured identifiers | Generic types (PERSON, ORG, …) | Open-ended fields, intents, summaries
Precision | Very high | Medium–high | Medium | Medium–high (depends on prompt)
Recall | Limited to the list | Wide on shape; misses novel formats | Wide for trained types | Widest
Latency | Microseconds | Microseconds | Milliseconds | Hundreds of ms to seconds
Cost / 1k docs | Negligible | Negligible | CPU/GPU compute | Provider API call (per token)
Maintenance | Update list as vocabulary grows | Update regex on new shapes | Re-train when accuracy drifts | Update prompt; track model versions
Determinism | Deterministic | Deterministic | Mostly deterministic | Stochastic (set temperature = 0)
Multilingual | Per-language dictionary | Pattern-dependent | Per-language model | Inherent (most modern LLMs)
Confidence | Binary or boost-weighted | Binary or context-weighted | Probabilistic score | Self-reported, unreliable

Dictionary / spotter

Curated list of canonical terms with aliases, matched against tokenized text.

Pick this when:

  • Your vocabulary is enumerable — product catalog, customer list, internal team names.
  • You need deterministic results for compliance.
  • You can keep the list updated as the catalog changes.

Avoid when:

  • The space is open-ended (every possible person name, every potential brand).
  • Your alias coverage is poor — you'll under-extract.

Configuration: see Entity extraction → Dictionary / spotter.
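
The configuration lives there rather than in code, but a minimal Python sketch shows what alias-based spotting does conceptually. The vocabulary, tokenization, and matching details below are illustrative assumptions, not the Workspace implementation.

```python
# Minimal sketch of a dictionary/spotter: canonical terms with aliases,
# matched case-insensitively against the text. Illustrative only; not
# the Workspace implementation.
import re

# Canonical term -> aliases (illustrative vocabulary).
DICTIONARY = {
    "MacBook Air": ["macbook air", "mba"],
    "Acme GmbH": ["acme gmbh", "acme"],
}

# Flatten to (alias, canonical), longest aliases first so they win overlaps.
LOOKUP = sorted(
    ((alias, canon) for canon, aliases in DICTIONARY.items() for alias in aliases),
    key=lambda pair: -len(pair[0]),
)

def spot(text: str):
    """Return (canonical, start, end) for every alias found in the text."""
    hits = []
    lowered = text.lower()
    for alias, canon in LOOKUP:
        for m in re.finditer(r"\b" + re.escape(alias) + r"\b", lowered):
            hits.append((canon, m.start(), m.end()))
    return hits

print(spot("Customer reports their MacBook Air (MBA) overheats."))
# [('MacBook Air', 23, 34), ('MacBook Air', 36, 39)]
```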

Pattern (regex)

Shape-based matching for codes and identifiers.

Pick this when:

  • Your entity has a consistent format — TICKET-12345, 0xDEADBEEF, SKU-AB-1234.
  • The format is unambiguous in context.

Avoid when:

  • The pattern is too generic (\d{4} matches years, postcodes, ages, asset IDs — all of them).
  • The format changes across vendors or sources without you knowing.

Always pair patterns with context constraints (context_must_include) and exclusions (context_includes → reject) to control over-firing.
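
Below is a minimal Python sketch of that idea: a shape pattern guarded by nearby-context rules. The keyword lists and window logic are illustrative assumptions that mirror the intent of context_must_include and reject; the real options are configured under Entity extraction rather than coded by hand.

```python
import re

PATTERN = re.compile(r"\bTICKET-\d{5}\b")
MUST_INCLUDE = {"ticket", "support", "case"}   # at least one must appear nearby
REJECT = {"invoice", "po"}                     # any of these nearby rejects the match

def extract_ticket_ids(text: str, window: int = 40):
    """Return pattern matches whose surrounding context passes both guards."""
    hits = []
    for m in PATTERN.finditer(text):
        context = text[max(0, m.start() - window): m.end() + window].lower()
        words = set(re.findall(r"[a-z]+", context))
        if words & REJECT:
            continue                           # exclusion: bad context nearby
        if words & MUST_INCLUDE:
            hits.append(m.group())             # constraint: good context nearby
    return hits

print(extract_ticket_ids("Support escalated TICKET-48213 after the outage."))  # ['TICKET-48213']
print(extract_ticket_ids("Invoice reference TICKET-99999 looks wrong."))       # []
```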

ML / NER

Pre-trained named-entity recognizers. Curiosity ships spaCy-class models per language and integrates with external NER services.

Pick this when:

  • You need generic types (PERSON, ORG, LOCATION, MONEY, DATE) and don't have time to curate dictionaries.
  • Your text is well-formed prose (news, emails, articles).

Avoid when:

  • Domain ambiguity is high — "Apple" in a tech corpus is unambiguous; in a grocery corpus it isn't.
  • You need to extract specific business entities, not generic types.

Mix with dictionaries: run the dictionary first for known entities, fall back to NER for anything else.
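
To see what an NER pass actually returns, here is a short example using the open-source spaCy library. This is an illustrative stand-in only; Curiosity's bundled per-language models are configured in the workspace rather than loaded from code like this.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Maria Chen from Contoso signed a $2.4M renewal in Berlin last March.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (model-dependent): PERSON, ORG, MONEY, GPE, DATE spans
```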

LLM extraction

Use a language model with a structured-output prompt.

Pick this when:

  • The field is genuinely open-ended — extracting "customer's intent" from a conversation, structured fields from a contract clause.
  • You can afford the latency and per-call cost (typically batch / offline workflows).
  • You can stomach occasional drift in output format — and you have validation in place.

Avoid when:

  • You have high-volume, low-latency requirements (real-time ingestion).
  • The vocabulary is small and stable — dictionaries are cheaper and more precise.
  • You need exact reproducibility — LLM outputs drift between model versions.

See Prompting patterns → Extraction for templates.
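
As a rough illustration of the shape such a template takes, the sketch below pairs a structured-output prompt with output validation and temperature 0. call_llm() is a placeholder for whichever provider client you use, and the field names are invented for the example.

```python
import json

# Structured-output prompt: fixed keys, closed value sets where possible.
PROMPT = """Extract the following fields from the support email below.
Respond with JSON only, using exactly these keys:
  intent  - one of: refund, repair, question, complaint
  product - the product name mentioned, or null
  urgency - low, medium, or high

Email:
{email}
"""

REQUIRED_KEYS = {"intent", "product", "urgency"}
ALLOWED_INTENTS = {"refund", "repair", "question", "complaint"}

def extract_fields(email: str, call_llm) -> dict:
    # call_llm is a placeholder for your provider client; temperature 0 for determinism.
    raw = call_llm(PROMPT.format(email=email), temperature=0)
    data = json.loads(raw)                      # raises if the model drifted from JSON
    if set(data) != REQUIRED_KEYS or data["intent"] not in ALLOWED_INTENTS:
        raise ValueError(f"LLM output failed validation: {data!r}")
    return data
```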

Combining model types

Most production systems run at least two model types in a pipeline.

Combination | Where it shines
Dictionary → Pattern | Known products + ticket/asset IDs in support text.
Dictionary → ML NER | Known accounts/customers + generic people/orgs in CRM notes.
Pattern → LLM (long-tail) | Catch IDs cheaply, send everything else to a model for soft extraction.
Dictionary → Pattern → ML → LLM | "Throw everything at it" for high-stakes corpora where coverage matters.

The order matters: the cheaper extractors run first and remove the high-certainty cases so the expensive ones don't reprocess them.
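
A minimal sketch of that ordering, with placeholder extractor functions standing in for the stages above: each stage only sees spans that a cheaper stage has not already claimed.

```python
def run_pipeline(text, stages):
    """stages: list of (name, extractor) ordered cheapest first.
    Each extractor returns a list of (label, start, end) spans."""
    claimed, results = [], []
    for name, extractor in stages:
        for label, start, end in extractor(text):
            # Skip anything overlapping a span a cheaper stage already claimed.
            if any(start < e and end > s for s, e in claimed):
                continue
            claimed.append((start, end))
            results.append((name, label, start, end))
    return results
```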

Migration paths

  • Start with dictionaries. They give the most signal per hour of work. Build the vocabulary from the canonical entities already in your graph (MapAsync(...) in the SDK).
  • Add patterns next. Find the high-volume IDs the support team copy-pastes.
  • Add ML when generic types matter. Run it on the longest fields, not headers.
  • Add LLM last. Only when the gap left by the other three is worth the cost.

Selecting per use case

Translate the business question into a model choice:

Business question | Best model type
"Show me every ticket about MacBook Air." | Dictionary (product list).
"List every error code that appeared in the last week." | Pattern.
"Which customers and partners did this email mention?" | Dictionary + ML NER.
"What does the customer actually want in this 800-word email?" | LLM.
"Build a topic taxonomy from 100k support cases." | LLM (offline batch).

