Types of NLP Models
A side-by-side look at the four kinds of extraction model Curiosity Workspace can run: dictionaries, patterns, ML/NER, and LLM extraction. Pick the right one — or the right combination — based on your vocabulary, volume, and accuracy needs.
For whether extraction is the right tool at all, see NLP overview. For configuration, see Entity extraction.
Comparison at a glance
| Aspect | Dictionary / spotter | Pattern (regex) | ML / NER | LLM extraction |
|---|---|---|---|---|
| Best for | Finite vocabularies | Structured identifiers | Generic types (PERSON, ORG, …) | Open-ended fields, intents, summaries |
| Precision | Very high | Medium–high | Medium | Medium–high (depends on prompt) |
| Recall | Limited to the list | Wide on shape; misses novel formats | Wide for trained types | Widest |
| Latency | Microseconds | Microseconds | Milliseconds | Hundreds of ms to seconds |
| Cost / 1k docs | Negligible | Negligible | CPU/GPU compute | Provider API call (per token) |
| Maintenance | Update list as vocab grows | Update regex on new shapes | Re-train when accuracy drifts | Update prompt; track model versions |
| Determinism | Deterministic | Deterministic | Mostly deterministic | Stochastic (set temperature = 0) |
| Multilingual | Per-language dictionary | Pattern-dependent | Per-language model | Inherent (most modern LLMs) |
| Confidence | Binary or boost-weighted | Binary or context-weighted | Probabilistic score | Self-reported, unreliable |
Dictionary / spotter
Curated list of canonical terms with aliases, matched against tokenized text.
Pick this when:
- Your vocabulary is enumerable — product catalog, customer list, internal team names.
- You need deterministic results for compliance.
- You can keep the list updated as the catalog changes.
Avoid when:
- The space is open-ended (every possible person name, every potential brand).
- Your alias coverage is poor — you'll under-extract.
Configuration: see Entity extraction → Dictionary / spotter.
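A minimal sketch of how a spotter works: a curated list of canonical terms with aliases, matched as whole words against the text. All names here are illustrative, not the Curiosity Workspace API.

```python
import re

# Canonical term -> aliases. In practice this comes from your catalog or graph.
DICTIONARY = {
    "MacBook Air": ["macbook air", "mba"],
    "MacBook Pro": ["macbook pro", "mbp"],
}

# Flatten to alias -> canonical once, lowercased for case-insensitive matching.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in DICTIONARY.items()
    for alias in aliases
}

def spot(text: str) -> list[str]:
    """Return canonical entities whose aliases appear as whole words."""
    lowered = text.lower()
    hits = []
    for alias, canonical in ALIAS_TO_CANONICAL.items():
        if re.search(rf"\b{re.escape(alias)}\b", lowered):
            hits.append(canonical)
    return hits

print(spot("Customer asked about the MBA and the MacBook Pro."))
# → ['MacBook Air', 'MacBook Pro']
```

Note the behavior the comparison table promises: the match is deterministic and effectively free, but recall is exactly as good as the alias list — "the MacBook laptop" would find nothing.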
Pattern (regex)
Shape-based matching for codes and identifiers.
Pick this when:
- Your entity has a consistent format — `TICKET-12345`, `0xDEADBEEF`, `SKU-AB-1234`.
- The format is unambiguous in context.
Avoid when:
- The pattern is too generic (`\d{4}` matches years, postcodes, ages, asset IDs — all of them).
- The format changes across vendors or sources without you knowing.
Always pair patterns with context constraints (`context_must_include`) and exclusions (`context_includes → reject`) to control over-firing.
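The context-constraint idea can be sketched in a few lines. This is not the Curiosity config syntax; it just shows why a required keyword plus an exclusion keyword near the match controls over-firing.

```python
import re

TICKET = re.compile(r"\bTICKET-\d{5}\b")
MUST_INCLUDE = ("ticket", "issue")   # at least one must appear near the match
REJECT = ("invoice",)                # any of these nearby rejects the match

def extract_tickets(text: str, window: int = 40) -> list[str]:
    """Keep a regex hit only if its surrounding context passes both checks."""
    lowered = text.lower()
    results = []
    for m in TICKET.finditer(text):
        ctx = lowered[max(0, m.start() - window): m.end() + window]
        if any(w in ctx for w in MUST_INCLUDE) and not any(w in ctx for w in REJECT):
            results.append(m.group())
    return results

print(extract_tickets("Please reopen ticket TICKET-12345 for the customer."))
# → ['TICKET-12345']
print(extract_tickets("Invoice attached for TICKET-12345."))
# → []
```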
ML / NER
Pre-trained named-entity recognizers. Curiosity ships spaCy-class models per language and integrates with external NER services.
Pick this when:
- You need generic types (PERSON, ORG, LOCATION, MONEY, DATE) and don't have time to curate dictionaries.
- Your text is well-formed prose (news, emails, articles).
Avoid when:
- Domain ambiguity is high — "Apple" in a tech corpus is unambiguous; in a grocery corpus it isn't.
- You need to extract specific business entities, not generic types.
Mix with dictionaries: run the dictionary first for known entities, fall back to NER for anything else.
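The dictionary-first, NER-fallback mix looks like this in outline. `ner_extract` is a stub standing in for whichever NER backend you run (e.g. a spaCy model's `doc.ents`); the function and entity names are illustrative.

```python
KNOWN_ACCOUNTS = {"acme corp": "Acme Corp"}  # curated, deterministic

def ner_extract(text: str) -> list[tuple[str, str]]:
    # Stub for a real model call, so the sketch runs without a model download.
    return [("Jane Doe", "PERSON")]

def extract(text: str) -> list[tuple[str, str]]:
    entities = []
    lowered = text.lower()
    # 1. Dictionary first: high-precision hits for known business entities.
    for alias, canonical in KNOWN_ACCOUNTS.items():
        if alias in lowered:
            entities.append((canonical, "ACCOUNT"))
    # 2. NER fallback covers anything the dictionary doesn't know.
    known = {name for name, _ in entities}
    for name, label in ner_extract(text):
        if name not in known:
            entities.append((name, label))
    return entities

print(extract("Jane Doe from Acme Corp called about renewal."))
# → [('Acme Corp', 'ACCOUNT'), ('Jane Doe', 'PERSON')]
```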
LLM extraction
Use a language model with a structured-output prompt.
Pick this when:
- The field is genuinely open-ended — extracting "customer's intent" from a conversation, structured fields from a contract clause.
- You can afford the latency and per-call cost (typically batch / offline workflows).
- You can stomach occasional drift in output format — and you have validation in place.
Avoid when:
- You have high-volume, low-latency requirements (real-time ingestion).
- The vocabulary is small and stable — dictionaries are cheaper and more precise.
- You need exact reproducibility — LLM outputs drift between model versions.
See Prompting patterns → Extraction for templates.
Recommended combinations
Most production systems run at least two model types in a pipeline.
| Combination | Where it shines |
|---|---|
| Dictionary → Pattern | Known products + ticket/asset IDs in support text. |
| Dictionary → ML NER | Known accounts/customers + generic people/orgs in CRM notes. |
| Pattern → LLM (long-tail) | Catch IDs cheaply, send everything else to a model for soft extraction. |
| Dictionary → Pattern → ML → LLM | "Throw everything at it" for high-stakes corpora where coverage matters. |
The order matters: run the cheaper extractors first so they handle the high-certainty cases and the expensive ones don't reprocess them.
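The cheap-first ordering can be sketched as a simple stage loop. The two stages are toy functions, not Curiosity configs; the point is that later (more expensive) stages skip values an earlier stage already extracted.

```python
import re

def dictionary_stage(text: str) -> list[tuple[str, str]]:
    # Deterministic, microsecond-scale lookup.
    return [("Acme Corp", "ACCOUNT")] if "acme" in text.lower() else []

def pattern_stage(text: str) -> list[tuple[str, str]]:
    # Shape-based matching, also cheap.
    return [(m, "TICKET") for m in re.findall(r"TICKET-\d{5}", text)]

def run_pipeline(text: str, stages) -> list[tuple[str, str]]:
    found: list[tuple[str, str]] = []
    for stage in stages:
        seen = {value for value, _ in found}
        for value, label in stage(text):
            # Earlier (cheaper) stages win; later stages skip their hits.
            if value not in seen:
                found.append((value, label))
    return found

print(run_pipeline("Acme Corp opened TICKET-12345.",
                   [dictionary_stage, pattern_stage]))
# → [('Acme Corp', 'ACCOUNT'), ('TICKET-12345', 'TICKET')]
```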
Migration paths
- Start with dictionaries. They give the most signal per hour of work. Build the vocabulary from the canonical entities already in your graph (`MapAsync(...)` in the SDK).
- Add patterns next. Find the high-volume IDs the support team copy-pastes.
- Add ML when generic types matter. Run it on the longest fields, not headers.
- Add LLM last. Only when the gap left by the other three is worth the cost.
Selecting per use case
Translate the business question into a model choice:
| Business question | Best model type |
|---|---|
| "Show me every ticket about MacBook Air." | Dictionary (product list). |
| "List every error code that appeared in the last week." | Pattern. |
| "Which customers and partners did this email mention?" | Dictionary + ML NER. |
| "What does the customer actually want in this 800-word email?" | LLM. |
| "Build a topic taxonomy from 100k support cases." | LLM (offline batch). |
Where to go next
- Entity extraction — full configuration and review workflow.
- NLP overview — how extraction fits with embeddings and the graph.
- Prompting patterns → Extraction — LLM templates.
- Search optimization — turning extracted entities into facets.