Entity Extraction

Entity extraction finds meaningful spans in text and turns them into structured outputs you can search, facet, and link to the graph. Examples: product names, device IDs, customer names, ticket numbers, locations.

This page covers configuration, model types, confidence thresholds, and the review loop. For where extraction sits in the bigger picture, see NLP overview. For the dictionary/pattern/ML/LLM comparison, see Types of models.

Extraction vs linking

  • Extraction finds the span in text. Output: { field, start, end, text, type, confidence }.
  • Linking maps that span to an existing graph node (or creates a new one). Output: an edge from the document to the entity node.

You can extract without linking (just annotations on text). You can't link without extracting.
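
For example, extracting the span "MacBook Air" from a document body might produce a record like this (values illustrative, following the shape above):

field: body
start: 112
end: 123
text: "MacBook Air"
type: Device
confidence: 0.93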

A minimal pipeline configuration

Extraction is configured per (nodeType, field) pair. In YAML the shape is:

pipeline:
  language: auto                  # or "en", "fr", …
  spotters:
    - kind: dictionary
      name: products
      entries:
        - value: MacBook Air
          aliases: [MBA, "MacBook Air 2024", "Mac Book Air"]
          link_to: Device:MBA-2024
        - value: MacBook Pro
          aliases: [MBP]
          link_to: Device:MBP-2024
      min_confidence: 0.8
      case_sensitive: false
    - kind: pattern
      name: ticket-id
      regex: "TICKET-\\d{4,6}"
      link_to_type: SupportCase
      link_strategy: by-key
  exclusions:
    - context_includes: ["e.g.", "for example"]
      reason: "training-mention false positive"

You typically edit this in the admin UI; the YAML above shows what's stored under the hood.

Model types

Dictionary / spotter

Curated list of canonical terms with aliases. Use for finite vocabularies that change rarely — product catalogs, customer names, country lists.

- kind: dictionary
  name: countries
  entries:
    - value: "United States"
      aliases: [USA, "U.S.", "United States of America"]
  case_sensitive: false

Strengths: high precision, fast, deterministic. Weakness: misses anything not in the list.

Pattern

Regex-style matcher. Use for structured identifiers, codes, and formats.

- kind: pattern
  name: ticket-id
  regex: "TICKET-\\d{4,6}"
  context_must_include: ["ticket", "case"]      # optional disambiguation

Strengths: catches IDs the dictionary can't enumerate. Weakness: over-fires on unrelated number strings unless you constrain by context.

ML / NER

Pre-trained model recognizes generic types: PERSON, ORG, LOCATION, DATE, MONEY. Use when the vocabulary is open-ended and a generic type is what you want.

- kind: ml-ner
  model: spacy-en-core-web-md
  types: [PERSON, ORG, DATE]
  min_confidence: 0.7

Strengths: covers unknown entities. Weakness: domain-specific accuracy is uneven; "Apple" in a fruit catalog isn't a company.

LLM extraction

A language model with a structured-output prompt. Use when rules and dictionaries can't reach the long tail — extracting intents, structured fields from quotes, summary-style outputs.

See Prompting patterns → Extraction.
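
The exact configuration of an LLM spotter depends on your deployment. As a rough sketch, assuming a kind: llm spotter with prompt and output-schema keys (both hypothetical; check your deployment's schema):

- kind: llm                        # hypothetical kind
  name: quote-fields
  prompt: >
    Extract the customer name, quoted amount, and currency.
    Return only these fields.
  output_schema:                   # hypothetical key
    customer: string
    amount: number
    currency: string
  min_confidence: 0.7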

Custom entity examples

Custom dictionary with linking

- kind: dictionary
  name: engineering-teams
  entries:
    - value: Platform Team
      link_to: Team:platform
    - value: Mobile Team
      link_to: Team:mobile
      aliases: [iOS Team, Android Team]
  link_strategy: by-uid

Each extracted mention creates a Mentions edge from the document to the team node — search results can now facet by team.
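
Conceptually, the stored edge looks something like this (shape illustrative, not the literal storage format):

edge:
  type: Mentions
  from: Document:doc-4711          # hypothetical document key
  to: Team:platform
  span: { start: 58, end: 71, text: "Platform Team" }
  confidence: 0.97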

Pattern + exclusion

- kind: pattern
  name: error-code
  regex: "0x[0-9A-Fa-f]{8}"
  exclusions:
    - context_includes: ["example", "e.g."]

Catches every Windows-style error code; ignores documentation examples.

Hybrid pipeline

spotters:
  - kind: dictionary
    name: products
    entries: [ … ]
  - kind: pattern
    name: ids
    regex: "ASSET-\\d+"
  - kind: ml-ner
    types: [PERSON, ORG]
    min_confidence: 0.85

Dictionary first (cheapest, most precise), pattern for IDs, ML for everything else.

Confidence thresholds

Every extractor emits a confidence score. The right threshold depends on what you do with the result.

Downstream use                             Suggested floor
Display the extracted entity in the UI     ≥ 0.6
Use as a facet                             ≥ 0.7
Auto-link to the graph                     ≥ 0.85
Auto-merge / canonicalize                  ≥ 0.95

Don't auto-link below 0.85 — wrong edges are expensive to clean up later.
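
In the pipeline syntax above, the floor is set per spotter via min_confidence, so applying this table means matching that key to the strictest downstream use of the spotter's output:

# Facet-only use: the 0.7 floor from the table is enough
- kind: ml-ner
  types: [PERSON, ORG]
  min_confidence: 0.7

# Output feeds auto-linking: raise the floor to 0.85
- kind: ml-ner
  types: [PERSON, ORG]
  min_confidence: 0.85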

Review workflow

  1. Sample 100–200 documents. Stratify by source, length, language.
  2. Run extraction in shadow mode. Don't write to the graph yet.
  3. Hand-label. Mark each extracted span as TP (correct) or FP (wrong); record mentions the extractor missed as FN.
  4. Compute precision / recall (worked example after this list).
    • precision = TP / (TP + FP)
    • recall = TP / (TP + FN)
  5. Iterate. High FP → add exclusions, raise threshold. Low recall → expand dictionary, add aliases.
  6. Promote to production when precision is acceptable. "Acceptable" depends on use:
    • Search/facet use: precision ≥ 0.8.
    • Auto-link: precision ≥ 0.95.
  7. Schedule re-review. Quarterly, or after every dictionary update.
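
A worked example for step 4: extraction returns 200 spans, hand-labeling marks 180 as correct (TP = 180, FP = 20), and you find 30 mentions it missed (FN = 30). Then precision = 180 / (180 + 20) = 0.90 and recall = 180 / (180 + 30) ≈ 0.86: good enough to promote for search and faceting, not yet for auto-linking.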

Common pitfalls

  • Aliases as an afterthought. "MBA," "Mac Book Air," "MacBook Air 2024" — write them all in the dictionary, not just the canonical form.
  • Patterns without context. "\d{4}" matches every year, postcode, and asset ID. Constrain with surrounding tokens.
  • No exclusions. "For example, error 0x12345678" is a documentation mention, not a real error. Exclude.
  • Linking before extraction is trustworthy. Wrong edges in the graph are worse than no edges.
  • Single-model bias. Hybrid pipelines (dictionary + pattern + ML) consistently beat any single model.
  • No re-extraction after schema change. Adding a new spotter? Re-process the existing corpus, or the new rules will apply only to new documents.
