# NLP Overview
Curiosity Workspace ships six distinct text-processing capabilities. They overlap in vocabulary ("AI", "entities", "extraction"), but they solve different problems. This page is the map.
Use it to answer "which knob do I turn for this problem?" before diving into the per-capability pages.
## The six capabilities
| Capability | What it does | When to use |
|---|---|---|
| NLP pipelines | Per-field configurable sequence: tokenization, language detection, spotters, etc. | Whenever you ingest text. The default pipeline does the right thing for most fields. |
| Embeddings | Vector representation of text for semantic similarity and vector search. | "Find similar X." Hybrid search. Recommendations. |
| Entity extraction | Find product names, IDs, people, etc. inside unstructured text. | When the signal you care about is buried in free-form text, not a structured field. |
| Entity linking | Connect extracted entity mentions to actual graph nodes. | After extraction, when you want navigation: "every ticket that mentions this device." |
| LLM extraction | Use a language model to pull structured fields out of free text. | Long-tail extraction where rules and dictionaries don't generalize — quotes, contracts, summaries. |
| Graph enrichment | Use extracted/linked entities to materialize new edges or summary nodes. | When you want the graph itself to grow as new content arrives. |
## NLP pipelines
A pipeline is the per-field workflow text runs through during ingestion. Typical steps:
- Language detection — what language is this?
- Tokenization — split into words/tokens, language-aware.
- Normalization — lowercase, strip punctuation, etc.
- Spotters — dictionary and pattern matchers.
- Linkers — connect mentions to graph nodes.
You configure pipelines per (nodeType, field) in the admin UI. The default is sensible — only customize when extraction quality demands it.
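The step sequence above can be sketched as a toy pipeline. Every function here is a stand-in, not the Workspace implementation: a real language detector uses statistical models, and real spotters support multi-token patterns.

```python
import re

def detect_language(text: str) -> str:
    # Stub: a real detector scores character/word statistics per language.
    return "en"

def tokenize(text: str) -> list[str]:
    # Naive word split; real tokenization is language-aware.
    return re.findall(r"\w+(?:-\w+)*", text)

def normalize(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]

def spot(tokens: list[str], dictionary: set[str]) -> list[str]:
    # Dictionary spotter: keep tokens that appear in the lookup set.
    return [t for t in tokens if t in dictionary]

def run_pipeline(text: str, dictionary: set[str]) -> dict:
    lang = detect_language(text)
    tokens = normalize(tokenize(text))
    return {"language": lang, "tokens": tokens, "mentions": spot(tokens, dictionary)}

result = run_pipeline("Customer reports MacBook overheating", {"macbook"})
print(result["mentions"])  # ['macbook']
```

The point of the sketch is the ordering: normalization runs before spotting, so dictionary entries only need to exist in lowercase form.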
## Embeddings
Embeddings are the semantics layer. They turn text into vectors so two strings can be compared by meaning. The mechanics are detailed in Embeddings and the call surface in Embeddings API.
Use embeddings when keyword search alone can't capture intent — paraphrases, conceptual queries, "more like this." Don't use them for things keyword search handles well (exact IDs, codes).
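"Compared by meaning" usually means cosine similarity between vectors. A minimal illustration with toy 3-dimensional vectors (real embeddings have hundreds of dimensions and come from the Embeddings API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]   # toy vector for the user's query
doc_a = [0.8, 0.2, 0.1]   # a paraphrase of the query
doc_b = [0.0, 0.1, 0.9]   # an unrelated document

# The paraphrase scores higher even though it shares no exact keywords.
assert cosine(query, doc_a) > cosine(query, doc_b)
```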
## Entity extraction vs entity linking
Extraction = "I see MacBook Air 2024 in this sentence." It produces a span and a candidate label.
Linking = "MacBook Air 2024 in this sentence corresponds to graph node Device:MBA-2024." It produces an edge in the graph.
You can run extraction without linking (just text annotations). You can't link without extraction. See Entity extraction and Types of models.
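The two stages compose like this. The pattern and the node ID are illustrative assumptions (a toy regex extractor and a hard-coded lookup table), not the Workspace model zoo:

```python
import re

# Hypothetical mini-graph: mention surface form -> graph node ID.
GRAPH_NODES = {"macbook air 2024": "Device:MBA-2024"}

def extract(text: str) -> list[dict]:
    # Extraction: produce spans and their surface text.
    pattern = re.compile(r"MacBook Air \d{4}", re.IGNORECASE)
    return [{"span": m.span(), "text": m.group()} for m in pattern.finditer(text)]

def link(mentions: list[dict]) -> list[dict]:
    # Linking: resolve each extracted mention to a graph node, if one exists.
    for m in mentions:
        m["node"] = GRAPH_NODES.get(m["text"].lower())
    return mentions

ticket = "Ticket: my MacBook Air 2024 won't charge."
print(link(extract(ticket)))
```

Note the dependency direction: `link` consumes `extract`'s output, which is why linking without extraction is impossible, while extraction alone still yields useful text annotations.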
## LLM extraction
Where rules and dictionaries fail — open-ended fields like "what action did the customer ask for" — call an LLM with a structured-output prompt. See Prompting patterns → Extraction.
LLM extraction is slower and more expensive than rule-based; reserve it for fields where the long tail matters more than per-record cost.
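A structured-output extraction call typically looks like the sketch below. The prompt schema, field names, and `call_llm` are all assumptions; `call_llm` is stubbed so the sketch runs, and in practice it would hit your model endpoint:

```python
import json

# Hypothetical schema: ask for JSON only, with fixed field names.
PROMPT = """Extract from the ticket below. Reply with JSON only:
{{"requested_action": "...", "sentiment": "positive|neutral|negative"}}

Ticket: {ticket}"""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return '{"requested_action": "refund", "sentiment": "negative"}'

def llm_extract(ticket: str) -> dict:
    raw = call_llm(PROMPT.format(ticket=ticket))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}  # treat unparseable model output as "no extraction"

print(llm_extract("I want my money back, this thing never worked."))
```

Two habits worth copying from the sketch: constrain the output format in the prompt, and guard the parse, because even structured-output modes occasionally return malformed JSON.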
## Graph enrichment
The composite move: every time a ticket arrives, run extraction, link mentions to existing devices/customers, optionally summarize with an LLM, and write the summary + links back into the graph as a durable artifact. Future search hits then carry the structured links for free.
See Graph design patterns for enrichment patterns.
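The composite move reads as a short function. The graph helpers (`add_node`, `add_edge`) and the in-memory dict are stand-ins for your actual graph API, and the summary step is a truncation where a real flow would call an LLM:

```python
# Toy in-memory graph; substitute real graph writes.
graph = {"nodes": {}, "edges": []}

def add_node(node_id: str, data: dict) -> None:
    graph["nodes"][node_id] = data

def add_edge(src: str, rel: str, dst: str) -> None:
    graph["edges"].append((src, rel, dst))

def enrich_ticket(ticket_id: str, text: str, linked_entities: list[str]) -> None:
    # 1. Materialize a durable summary node (truncation stands in for an LLM summary).
    summary_id = f"Summary:{ticket_id}"
    add_node(summary_id, {"text": text[:50]})
    add_edge(ticket_id, "HAS_SUMMARY", summary_id)
    # 2. Materialize an edge to every entity linked during ingestion.
    for entity in linked_entities:
        add_edge(ticket_id, "MENTIONS", entity)

enrich_ticket("Ticket:42", "MacBook Air 2024 won't charge.", ["Device:MBA-2024"])
print(len(graph["edges"]))  # 2
```

Because the links are written back as edges, later queries ("every ticket that mentions this device") become plain graph traversals instead of repeated text analysis.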
## How NLP fits into the platform
- Graph. Linked entities become edges. Enrichment nodes become new graph data.
- Search. Extracted fields can be indexed for text or used as facets.
- AI. Grounded answers and tool-using agents rely on extracted/linked entities to know what to look up.
## When to use NLP — and when not to
Use NLP when:
- the value you need lives inside unstructured text (tickets, notes, transcripts);
- you need facets that don't exist as explicit fields;
- you want navigation from a mention in text to the entity it references.
Skip NLP when:
- the source already exposes the structured field — map it in the connector;
- the entity vocabulary is open-ended and small-volume — call an LLM at query time instead of pre-extracting;
- the field is too short to add signal (status codes, single-token labels).
## Where to go next
- Embeddings — semantic similarity, vector indexes.
- Entity extraction — rules, dictionaries, patterns.
- Types of models — comparing dictionary / pattern / ML / LLM approaches.
- Prompting patterns — LLM-driven extraction and grounded Q&A.