Internationalization

Curiosity Workspace supports multi-language content end-to-end: ingestion, NLP, embeddings, search, and UI. This page is the reference for what's supported and what to configure when you operate in more than one language.

Language support matrix

  • Text tokenization: English, French, Spanish, German, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Polish, Czech, Russian, Turkish, Japanese, Chinese (simplified and traditional), Korean, Arabic, Hindi, plus 80+ more via the underlying tokenizer.
  • Stemming / lemmatization: language-specific stemmers for all major European languages; lemmatizers for English, French, German, Spanish, Italian, and Portuguese.
  • Stop-word filtering: per-language stop-word lists, customizable via the admin UI.
  • NER (built-in): English, French, Spanish, German, Portuguese; other languages available via add-on packs.
  • OCR languages: English, French, Spanish, German, Portuguese out of the box; additional languages via Tesseract packs.
  • Speech-to-text (Whisper): ~100 languages with auto-detection; quality varies, and English is highest.
  • Local embedding models: MiniLM is English-centric and ArcticXS is English-only. For multilingual corpora, configure an external multilingual model (e.g. OpenAI text-embedding-3-large or Cohere embed-multilingual-v3).
  • UI localization: the Workspace UI ships with English, French, Spanish, German, Portuguese, and Italian; other languages via the tntc translation pipeline.

NLP and embedding implications

Multilingual support is not automatic — it's a series of decisions per workspace.

  1. Pick the right tokenizer per field. A Description field that mixes English and German benefits from auto-detected tokenization. A field always in Japanese should be pinned to Japanese.
  2. Use multilingual embeddings if your corpus is multilingual. A monolingual English model embedding French text produces low-quality vectors. Switch to a multilingual model (and re-embed) when introducing new languages; a minimal call sketch follows this list.
  3. Don't mix languages in a single vector index. Either use a multilingual model, or run one embedding index per language. Mixed-language indexes with a monolingual model degrade silently.
  4. NER language packs are additive. Adding French NER doesn't replace English; both run, each on its detected-language slice.
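
To make point 2 concrete, here is a minimal sketch of embedding text with one of the external multilingual models named in the matrix above (OpenAI's text-embedding-3-large), using OpenAI's public embeddings endpoint. Wiring the resulting vectors into a workspace index is out of scope here; this only shows the model call.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
    "Bearer", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

// One multilingual model embeds every language into the same vector space,
// so English and French records stay comparable in a single index.
var payload = JsonSerializer.Serialize(new
{
    model = "text-embedding-3-large",
    input = new[] { "machine learning", "apprentissage automatique" }
});

var response = await http.PostAsync(
    "https://api.openai.com/v1/embeddings",
    new StringContent(payload, Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode();

using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
foreach (var item in doc.RootElement.GetProperty("data").EnumerateArray())
    Console.WriteLine($"dimensions: {item.GetProperty("embedding").GetArrayLength()}");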

See Embeddings and NLP overview.

Tokenization details

The workspace's tokenizer is language-aware:

  • Whitespace-separated (Latin scripts): whitespace + punctuation splitting, with a language-specific stemmer.
  • Agglutinative (Finnish, Turkish, Korean): whitespace + a morphological splitter.
  • Logographic (Chinese, Japanese): statistical word segmentation; no reliance on spaces.
  • Right-to-left (Arabic, Hebrew): bidi-aware tokenization; stemmers handle suffix variation.

Search relevance depends critically on matching the tokenizer to the content. Mis-configured tokenizers (e.g. running the whitespace tokenizer on Chinese) silently degrade recall — every word looks like one giant token.
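
A self-contained illustration of that failure mode, in plain C# with no workspace APIs involved:

using System;

// "Machine learning is changing the world", written (as Chinese is) without spaces.
var zh = "机器学习正在改变世界";

// A whitespace tokenizer finds no delimiters, so the whole sentence becomes one token.
var tokens = zh.Split(' ', StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(tokens.Length); // prints 1: a query for just 机器学习 can never match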

Configuring language support

  1. Sign into the workspace as admin.
  2. Open Admin → Workspace Configuration.
  3. Set the default language — the workspace's UI language and the fallback for fields with no detected language.
  4. Add supported languages — each enables its tokenizer, stop-word list, and per-language analyzers.
  5. (Optional) Add NER and OCR language packs for any language you ingest content in.
  6. Update connectors to set a per-record language hint when the source provides one; without a hint, auto-detection runs at ingest time. (A connector sketch follows these steps.)
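
The snippet below is a hypothetical sketch of step 6. SourceDoc and IngestRecord are stand-ins for whatever your connector framework actually provides, not Curiosity APIs; the only point is forwarding the source's language metadata when it exists.

using System;
using System.Threading.Tasks;

// Example call: forward the language when the source has it.
await IngestRecord(new SourceDoc("Guide d'installation", "...", "fr"));

static async Task IngestRecord(SourceDoc doc)
{
    // Forward the source's language metadata when present; otherwise fall back
    // to "auto" and let ingest-time detection decide.
    var languageHint = doc.Lang ?? "auto";
    Console.WriteLine($"ingesting '{doc.Title}' with language hint '{languageHint}'");
    await Task.CompletedTask; // the real push to the workspace would happen here
}

// Hypothetical record type standing in for your connector's document shape.
record SourceDoc(string Title, string Text, string? Lang);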

Per-field language strategy

Three patterns, in order of preference:

  • Auto-detect (language: auto): most fields, and any mixed-language corpus.
  • Pin per field (language: ja): the field is always in one language (e.g. a translated description column).
  • Separate fields per language (Title_en, Title_fr): high-volume sites with formal translations; easiest to search per locale.
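
A hypothetical schema fragment showing all three patterns side by side. The Language attribute is invented for illustration and defined locally so the sketch compiles; it stands in for however your workspace schema actually pins a per-field language and is not the real SDK surface.

using System;

class Product
{
    [Language("auto")] public string Description { get; set; } // pattern 1: auto-detect, mixed languages
    [Language("ja")]   public string Summary     { get; set; } // pattern 2: pinned, always Japanese
    public string Title_en { get; set; }                       // pattern 3: one field per locale
    public string Title_fr { get; set; }
}

// Illustrative attribute only, declared here to keep the sketch self-contained.
[AttributeUsage(AttributeTargets.Property)]
class LanguageAttribute : Attribute
{
    public LanguageAttribute(string code) => Code = code;
    public string Code { get; }
}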

UI localization

The workspace front-end uses curiosity-ai/tntc for translation. Wrap user-visible strings with .t():

using TNT;
using static TNT.T;

// Wrap every user-visible string so tntc can find and extract it.
var greeting = "Hello World".t();

// Interpolated strings go through t(...) so the variable part is preserved.
var dateLine = t($"Today is {DateTime.Now}");

tntc scans your code, extracts all .t() strings, and emits a JSON file per language. Translators fill the JSON; the workspace loads the right file based on the user's locale or workspace default.

Per-language assets (logos, banners) live under /assets/{lang}/ and are picked up automatically.

Per-tenant locale

In multi-tenant workspaces, each tenant can have its own default locale and its own list of supported languages. Set the tenant-level config through the admin API or by adding properties on the tenant's _AccessGroup node:

// Tenant-level settings stored as properties on the tenant's _AccessGroup node.
graph.AddOrUpdate(tenantTeam, new {
    Locale = "fr-FR",                           // the tenant's default locale
    SupportedLanguages = new[] { "fr", "en" },  // languages enabled for this tenant
});

User sessions inherit the tenant locale unless the user explicitly overrides it.
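
That precedence reduces to a simple fallback chain. The helper below is plain C# illustrating the rule (explicit user override first, then tenant locale, then workspace default), not a workspace API:

using System;

Console.WriteLine(ResolveLocale(null, "fr-FR", "en-US"));     // fr-FR: inherits the tenant locale
Console.WriteLine(ResolveLocale("de-DE", "fr-FR", "en-US"));  // de-DE: the user's override wins

// Precedence: explicit user override > tenant locale > workspace default.
static string ResolveLocale(string? userOverride, string? tenantLocale, string workspaceDefault) =>
    userOverride ?? tenantLocale ?? workspaceDefault;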

Operational checklist

  • Default language matches the majority of your content.
  • Tokenizers enabled for every language you ingest.
  • Embedding model is multilingual if your corpus spans languages.
  • NER packs installed for every language you extract entities from.
  • OCR packs installed for every language in scanned documents.
  • UI translations cover all supported locales (or you accept English fallback).
  • Connectors set per-record language hints whenever the source provides them.

Common pitfalls

  • Adding a language without re-embedding. Existing English vectors don't help with French queries. Re-embed after switching to a multilingual model.
  • Default-language drift. Setting the default to English in a French-majority corpus produces low-quality stemming for the bulk of your content.
  • Mixing tokenizers in one field. Pin the language if the field is known to be monolingual; auto-detect otherwise. Never both.
  • Forgetting OCR languages. A French PDF processed with English-only OCR comes out as gibberish.
  • UI strings not wrapped in .t(). They show up untranslated in non-default locales.
