Catalyst

The Document Class

The Document class is the primary data structure in Catalyst. It represents the text being processed and stores all the linguistic annotations generated by the NLP pipeline.

Creating a Document

You can create a new document by providing the raw text and its language.

using Catalyst;
using Mosaik.Core;

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
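
A newly constructed document holds only raw text; annotations appear after a pipeline has processed it. The following is a minimal sketch of that step, not a definitive setup. It assumes the Catalyst and Catalyst.Models.English NuGet packages are installed, and the storage folder name "catalyst-models" is an arbitrary choice. Pipeline.ForAsync, ProcessSingle, Storage, and DiskStorage are not described on this page and are used here as they appear in the Catalyst project README.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Assumption: the English model package is installed and must be registered first.
Catalyst.Models.English.Register();

// Assumption: models are cached in a local folder; the folder name is arbitrary.
Storage.Current = new DiskStorage("catalyst-models");

// Build the default pipeline for the language, then run it over one document.
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);

// After processing, the counts reflect the annotations stored on the document.
Console.WriteLine($"Spans: {doc.SpansCount}, tokens: {doc.TokensCount}");
```

The document is annotated in place, so the same doc instance can be inspected afterwards.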

Important Properties

  • Value: The original raw text of the document.
  • Language: The language of the document.
  • UID: A unique identifier for the document.
  • Metadata: A dictionary for storing arbitrary key-value pairs associated with the document.
  • Labels: A list of string labels (e.g., for document classification).
  • Spans: An enumerable of all Span objects (sentences) in the document.
  • SpansCount: The total number of spans in the document.
  • TokensCount: The total number of tokens across all spans.
  • EntitiesCount: The total number of recognized entities in the document.
  • IsParsed: A boolean indicating whether the document has been tokenized.
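
As an illustration of the properties listed above, the sketch below prints them for a single document. Describe is a hypothetical helper written for this example, not part of Catalyst. The document is not run through a pipeline here, so the counts describe an unprocessed document.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
Describe(doc);

// Hypothetical helper: writes the document-level properties to the console.
static void Describe(Document d)
{
    Console.WriteLine($"Value:         {d.Value}");
    Console.WriteLine($"Language:      {d.Language}");
    Console.WriteLine($"UID:           {d.UID}");
    Console.WriteLine($"IsParsed:      {d.IsParsed}");
    Console.WriteLine($"SpansCount:    {d.SpansCount}");
    Console.WriteLine($"TokensCount:   {d.TokensCount}");
    Console.WriteLine($"EntitiesCount: {d.EntitiesCount}");
}
```

Calling Describe before and after processing is a quick way to see which properties a pipeline fills in.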

Key Methods

TokenizedValue

Returns the tokenized text of the document. You can optionally merge recognized entities into single tokens.

string tokenized = doc.TokenizedValue(mergeEntities: true);

ToTokenList

Flattens all tokens from all spans into a single list of IToken objects.

List<IToken> allTokens = doc.ToTokenList();

AddSpan

Manually adds a span to the document by specifying the start and end character indices.

var span = doc.AddSpan(0, 10);

ToStringWithReplacements

Generates a new string in which recognized entities are replaced by the result of a custom function. In the example below, returning null leaves the original text of that entity unchanged.

string anonymized = doc.ToStringWithReplacements(entity =>
{
    if (entity.EntityType.Type == "Person") return "[REDACTED]";
    return null; // Keep original
});

Clear

Removes all tokens and spans from the document, but keeps the raw text and metadata.

doc.Clear();

Serialization and Deserialization

Catalyst provides several ways to save and load documents.

JSON Serialization

Use ToJson to convert a document to a JSON string and Document.FromJson to rebuild a document from one.

// Serialize to JSON
string json = doc.ToJson();

// Deserialize from JSON
Document doc2 = Document.FromJson(json);
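
One way to check a JSON round trip is to compare the restored document with the original. The sketch below uses only ToJson and Document.FromJson as described above. The variable names are illustrative, and the comparison covers just the raw text and language, not every annotation.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

var original = new Document("The quick brown fox jumps over the lazy dog", Language.English);

// Serialize, then rebuild a second document from the JSON string.
string json = original.ToJson();
Document restored = Document.FromJson(json);

// Compare the fields that should survive the round trip.
bool sameText = restored.Value == original.Value;
bool sameLanguage = restored.Language == original.Language;
Console.WriteLine($"Text preserved: {sameText}, language preserved: {sameLanguage}");
```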

Binary Serialization (MessagePack)

For high-performance scenarios, Catalyst supports binary serialization using MessagePack. This is often used internally when storing models or large corpora.

Immutable Documents

The ImmutableDocument class provides an immutable, memory-efficient representation of a document. Use it when the document data must not change after processing.

// Convert to ImmutableDocument
ImmutableDocument immutableDoc = doc.ToImmutable();

// Convert back to mutable Document
Document mutableDoc = immutableDoc.ToMutable();
