Internationalization
Curiosity Workspace supports multi-language content end-to-end: ingestion, NLP, embeddings, search, and UI. This page is the reference for what's supported and what to configure when you operate in more than one language.
Language support matrix
| Capability | Support |
|---|---|
| Text tokenization | English, French, Spanish, German, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Polish, Czech, Russian, Turkish, Japanese, Chinese (simplified + traditional), Korean, Arabic, Hindi, plus 80+ via the underlying tokenizer. |
| Stemming / lemmatization | Language-specific stemmers for all major European languages; lemmatizers for English, French, German, Spanish, Italian, Portuguese. |
| Stop-word filtering | Per-language stop-word lists. Customize via the admin UI. |
| NER (built-in) | English, French, Spanish, German, Portuguese. Other languages available via add-on packs. |
| OCR languages | English, French, Spanish, German, Portuguese out of the box. Additional languages via Tesseract packs. |
| Speech-to-text (Whisper) | ~100 languages with auto-detection. Quality varies; English is highest. |
| Local embedding models | MiniLM is English-centric. ArcticXS is English. For multilingual, configure an external multilingual model (e.g. OpenAI text-embedding-3-large, Cohere embed-multilingual-v3). |
| UI localization | Workspace UI ships with English, French, Spanish, German, Portuguese, Italian. Other languages via the tntc translation pipeline. |
NLP and embedding implications
Multilingual support is not automatic — it's a series of decisions per workspace.
- Pick the right tokenizer per field. A `Description` field that mixes English and German benefits from auto-detected tokenization. A field that is always in Japanese should be pinned to Japanese.
- Use multilingual embeddings if your corpus is multilingual. A monolingual English model embedding French text produces low-quality vectors. Switch to a multilingual model (and re-embed) when introducing new languages.
- Don't mix languages in a single vector index. Either use a multilingual model, or run one embedding index per language. Mixed-language indexes with a monolingual model degrade silently.
- NER language packs are additive. Adding French NER doesn't replace English; both run, each on its detected-language slice.
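The per-language index pattern above can be sketched as follows. The `detect_language` helper and the index names are hypothetical, not part of the workspace API; a real deployment would use a proper language-ID model rather than a stop-word lookup.

```python
# Sketch of the "one embedding index per language" pattern: route each
# document to the vector index for its detected language, so a
# monolingual embedding model never sees out-of-language text.

def detect_language(text: str) -> str:
    """Toy detector keyed off a few high-frequency stop-words."""
    markers = {"the": "en", "and": "en", "le": "fr", "la": "fr", "der": "de"}
    for word in text.lower().split():
        if word in markers:
            return markers[word]
    return "en"  # fall back to the workspace default language

def route_to_index(text: str, indexes: dict) -> str:
    """Pick the embedding index matching the document's language."""
    lang = detect_language(text)
    return indexes.get(lang, indexes["en"])

indexes = {"en": "vectors-en", "fr": "vectors-fr"}
```

The alternative, a single multilingual model over one shared index, avoids the routing entirely at the cost of re-embedding the whole corpus when you switch models.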
See Embeddings and NLP overview.
Tokenization details
The workspace's tokenizer is language-aware:
| Language family | Tokenization |
|---|---|
| Whitespace-separated (Latin) | Whitespace + punctuation, language-specific stemmer. |
| Agglutinative (Finnish, Turkish, Korean) | Whitespace + morphological splitter. |
| Logographic (Chinese, Japanese) | Statistical word-segmenter. No reliance on spaces. |
| Right-to-left (Arabic, Hebrew) | Bidi-aware. Stems handle suffix variations. |
Search relevance depends critically on matching the tokenizer to the content. A misconfigured tokenizer (e.g. running the whitespace tokenizer on Chinese) silently degrades recall: an entire unsegmented sentence becomes one giant token, so queries almost never match.
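The failure mode is easy to reproduce in isolation: whitespace splitting produces sensible tokens for English but leaves a Chinese phrase as a single unit, so only exact full-string matches would ever hit the index.

```python
english = "machine learning models"
chinese = "机器学习模型"  # the same phrase, written without spaces

# Whitespace tokenization is fine for space-separated languages...
print(english.split())  # ['machine', 'learning', 'models']

# ...but a logographic script has no spaces to split on: the whole
# phrase comes back as one token, which is why recall collapses.
print(chinese.split())  # ['机器学习模型']
```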
Configuring language support
- Sign into the workspace as admin.
- Open Admin → Workspace Configuration.
- Set the default language — the workspace's UI language and the fallback for fields with no detected language.
- Add supported languages — each enables its tokenizer, stop-word list, and per-language analyzers.
- (Optional) Add NER and OCR language packs for any language you ingest content in.
- Update connectors to set a per-record language hint when the source has one. Without a hint, auto-detection runs at ingest time.
Per-field language strategy
Three patterns, in order of preference:
| Pattern | When to use |
|---|---|
| Auto-detect. Field has `language: auto`. | Most fields. Mixed-language corpora. |
| Pin per field. `language: ja`. | Field is always in one language (e.g. a translated description column). |
| Separate fields per language. `Title_en`, `Title_fr`. | High-volume sites with formal translations. Easiest to search per locale. |
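Assuming a YAML-style field configuration (the exact schema is workspace-specific; this fragment only maps the three patterns above onto concrete fields):

```yaml
fields:
  Description:
    language: auto      # pattern 1: auto-detect for mixed-language corpora
  Summary_ja:
    language: ja        # pattern 2: pinned, field is always Japanese
  Title_en:
    language: en        # pattern 3: one field per locale
  Title_fr:
    language: fr
```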
UI localization
The workspace front-end uses curiosity-ai/tntc for translation. Wrap user-visible strings with `.t()`:

```csharp
using TNT;
using static TNT.T;

// Extension-method form: marks the literal for extraction and
// returns the translation for the active locale.
var greeting = "Hello World".t();

// Static t() form works with interpolated strings.
var dateLine = t($"Today is {DateTime.Now}");
```
tntc scans your code, extracts all `.t()` strings, and emits a JSON file per language. Translators fill the JSON; the workspace loads the right file based on the user's locale or workspace default.
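Assuming the per-language output is a flat string-to-string map (the exact file layout tntc emits may differ), the French file a translator receives might look like:

```json
{
  "Hello World": "Bonjour le monde",
  "Settings": "Paramètres"
}
```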
Per-language assets (logos, banners) live under /assets/{lang}/ and are picked up automatically.
Per-tenant locale
In multi-tenant workspaces, each tenant can have its own default locale and its own list of supported languages. Set the tenant-level config through the admin API or by adding properties on the tenant's `_AccessGroup` node:

```csharp
// Store the locale settings as properties on the tenant's access-group node.
graph.AddOrUpdate(tenantTeam, new {
    Locale = "fr-FR",
    SupportedLanguages = new[] { "fr", "en" },
});
```
User sessions inherit the tenant locale unless the user explicitly overrides it.
Operational checklist
- Default language matches the majority of your content.
- Tokenizers enabled for every language you ingest.
- Embedding model is multilingual if your corpus spans languages.
- NER packs installed for every language you extract entities from.
- OCR packs installed for every language in scanned documents.
- UI translations cover all supported locales (or you accept English fallback).
- Connectors set per-record language hints whenever the source provides them.
Common pitfalls
- Adding a language without re-embedding. Existing English vectors don't help with French queries. Re-embed after switching to a multilingual model.
- Default-language drift. Setting the default to English in a French-majority corpus produces low-quality stemming for the bulk of your content.
- Mixing tokenizers in one field. Pin the language if the field is known to be monolingual; auto-detect otherwise. Never both.
- Forgetting OCR languages. A French PDF processed with English-only OCR comes out as gibberish.
- UI strings not wrapped in `.t()`. They show up untranslated in non-default locales.
Where to go next
- Embeddings — multilingual model selection.
- Multimodal search — OCR / STT languages.
- User management — per-user locale preference.
- Custom front-end — building localized UIs.