NLP Overview

Curiosity Workspace ships six distinct text-processing capabilities. They overlap in vocabulary ("AI", "entities", "extraction") but they solve different problems. This page is the map.

Use it to answer "which knob do I turn for this problem?" before diving into the per-capability pages.

The six capabilities

| Capability | What it does | When to use |
| --- | --- | --- |
| NLP pipelines | Per-field configurable sequence: tokenization, language detection, spotters, etc. | Whenever you ingest text. The default pipeline does the right thing for most fields. |
| Embeddings | Vector representation of text for semantic similarity and vector search. | "Find similar X." Hybrid search. Recommendations. |
| Entity extraction | Find product names, IDs, people, etc. inside unstructured text. | When the signal you care about is buried in free-form text, not a structured field. |
| Entity linking | Connect extracted entity mentions to actual graph nodes. | After extraction, when you want navigation: "every ticket that mentions this device." |
| LLM extraction | Use a language model to pull structured fields out of free text. | Long-tail extraction where rules and dictionaries don't generalize — quotes, contracts, summaries. |
| Graph enrichment | Use extracted/linked entities to materialize new edges or summary nodes. | When you want the graph itself to grow as new content arrives. |

The decision tree:

```mermaid
flowchart TD
    A[Is the signal already structured?] -->|yes| skip[Use connector mapping]
    A -->|no| B{What do you need?}
    B -->|Similarity, paraphrase| emb[Embeddings + vector search]
    B -->|Find named things in text| C{Enumerable vocabulary?}
    C -->|yes - product list, IDs| ext[Entity extraction: dictionaries + patterns]
    C -->|no - free-form| llm[LLM extraction]
    ext --> link[Optional: entity linking → graph]
    llm --> link
    link --> enrich[Optional: graph enrichment]
```

NLP pipelines

A pipeline is the per-field workflow text runs through during ingestion. Typical steps:

  1. Language detection — what language is this?
  2. Tokenization — split into words/tokens, language-aware.
  3. Normalization — lowercase, strip punctuation, etc.
  4. Spotters — dictionary and pattern matchers.
  5. Linkers — connect mentions to graph nodes.

You configure pipelines per (nodeType, field) in the admin UI. The default is sensible — only customize when extraction quality demands it.
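The five steps above can be sketched as composed functions. This is an illustrative toy, not the platform's implementation — the real pipeline is configured per (nodeType, field) in the admin UI, and every function name here is a stand-in:

```python
def detect_language(text: str) -> str:
    # Stand-in: a real detector classifies the language; we hardcode English.
    return "en"

def tokenize(text: str) -> list[str]:
    # Stand-in: real tokenization is language-aware, not whitespace-only.
    return text.split()

def normalize(tokens: list[str]) -> list[str]:
    # Lowercase and strip surrounding punctuation.
    return [t.lower().strip(".,!?") for t in tokens]

def spot(tokens: list[str], dictionary: set[str]) -> list[str]:
    # Dictionary spotter: flag tokens found in a known vocabulary.
    return [t for t in tokens if t in dictionary]

def run_pipeline(text: str, dictionary: set[str]) -> dict:
    lang = detect_language(text)
    tokens = normalize(tokenize(text))
    mentions = spot(tokens, dictionary)
    return {"language": lang, "tokens": tokens, "mentions": mentions}
```

Each step consumes the previous step's output, which is why ordering matters: spotters match against normalized tokens, so changing normalization changes what they can find.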

Embeddings

Embeddings are the semantics layer. They turn text into vectors so two strings can be compared by meaning. The mechanics are detailed in Embeddings and the call surface in Embeddings API.

Use embeddings when keyword search alone can't capture intent — paraphrases, conceptual queries, "more like this." Don't use them for things keyword search handles well (exact IDs, codes).
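Under the hood, "compared by meaning" means comparing vectors, typically by cosine similarity. A minimal sketch of that comparison (the vectors themselves would come from the Embeddings API; these two-dimensional toys are just for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compare two embedding vectors by the angle between them.
    1.0 = same direction (similar meaning), 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Vector search is this comparison run at scale: the query is embedded once, then scored against pre-computed document vectors.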

Entity extraction vs entity linking

Extraction = "I see MacBook Air 2024 in this sentence." It produces a span and a candidate label.

Linking = "MacBook Air 2024 in this sentence corresponds to graph node Device:MBA-2024." It produces an edge in the graph.

You can run extraction without linking (just text annotations). You can't link without extraction. See Entity extraction and Types of models.
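The split is easy to see in code. A hedged sketch, assuming a dictionary-based spotter and a hypothetical mention-to-node lookup table (the node ID `Device:MBA-2024` follows the example above; nothing here is the platform's actual API):

```python
# Hypothetical lookup from normalized mention text to a graph node ID.
MENTION_TO_NODE = {"macbook air 2024": "Device:MBA-2024"}

def extract(text: str, vocabulary: list[str]) -> list[dict]:
    """Extraction: find known phrases, return their spans and labels."""
    lowered = text.lower()
    mentions = []
    for phrase in vocabulary:
        start = lowered.find(phrase)
        if start != -1:
            mentions.append({"span": (start, start + len(phrase)),
                             "label": phrase})
    return mentions

def link(mentions: list[dict]) -> list[dict]:
    """Linking: resolve each extracted mention to a graph node (or None)."""
    return [{**m, "node": MENTION_TO_NODE.get(m["label"])} for m in mentions]
```

Note the dependency direction: `link` consumes the output of `extract`, which is exactly why you can extract without linking but not the reverse.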

LLM extraction

Where rules and dictionaries fail — open-ended fields like "what action did the customer ask for" — call an LLM with a structured-output prompt. See Prompting patterns → Extraction.

LLM extraction is slower and more expensive than rule-based; reserve it for fields where the long tail matters more than per-record cost.
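A structured-output prompt generally pairs the free text with a schema the model must fill. A minimal sketch of assembling such a prompt — the wording and schema shape are assumptions, not the patterns from the Prompting patterns page:

```python
import json

def build_extraction_prompt(text: str, schema: dict) -> str:
    """Assemble an extraction prompt that asks for JSON matching a schema.
    Illustrative only: adapt the instructions to your model."""
    return (
        "Extract the following fields from the text below. "
        "Respond only with JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        f"Text:\n{text}"
    )
```

The per-record cost argument follows directly: every call embeds the full text in the prompt, so token usage scales with document length in a way that dictionary spotters do not.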

Graph enrichment

The composite move: every time a ticket arrives, run extraction, link mentions to existing devices/customers, optionally summarize with an LLM, and write the summary + links back into the graph as a durable artifact. Future search hits then carry the structured links for free.
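The write-back step can be sketched as follows. This is an illustrative in-memory model, not the platform's graph API; the edge type `MENTIONS` and the ID formats are assumptions:

```python
def enrich(graph: dict, ticket_id: str, linked_nodes: list[str]) -> dict:
    """Materialize a MENTIONS edge from a ticket to each linked node,
    making the extraction result a durable part of the graph."""
    edges = graph.setdefault("edges", [])
    for node in linked_nodes:
        edges.append({"from": ticket_id, "type": "MENTIONS", "to": node})
    return graph
```

Because the edges are persisted at ingestion time, later queries ("every ticket that mentions this device") are plain graph traversals — no re-extraction needed.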

See Graph design patterns for enrichment patterns.

How NLP fits into the platform

  • Graph. Linked entities become edges. Enrichment nodes become new graph data.
  • Search. Extracted fields can be indexed for text or used as facets.
  • AI. Grounded answers and tool-using agents rely on extracted/linked entities to know what to look up.

When to use NLP — and when not to

Use NLP when:

  • the value you need lives inside unstructured text (tickets, notes, transcripts);
  • you need facets that don't exist as explicit fields;
  • you want navigation from a mention in text to the entity it references.

Skip NLP when:

  • the source already exposes the structured field — map it in the connector;
  • the entity vocabulary is open-ended and small-volume — call an LLM at query time instead of pre-extracting;
  • the field is too short to add signal (status codes, single-token labels).

Where to go next
