LLM Configuration
The LLM and embedding providers your workspace uses are configured under Settings → AI Settings. Three things live there:
- Chat models — used by the chat assistant and any tool/endpoint that calls `Graph.CallChatModelAsync(...)`.
- Embedding models — used by vector search and similarity queries.
- Tool policies — which tools are available, response timeouts, max tokens.
This page covers the supported providers, the per-provider setup, and the policy decisions you make once a provider is wired up.
Provider compatibility matrix
| Provider | Chat | Embeddings | Notes |
|---|---|---|---|
| OpenAI | ✓ | ✓ | Hosted. Latest GPT and embedding model families. |
| Azure OpenAI | ✓ | ✓ | OpenAI models on Azure, with regional deployment + Entra ID auth. Best fit for Microsoft-shop deployments. |
| Anthropic (Claude) | ✓ | — | Hosted. Claude Opus / Sonnet / Haiku. No embedding service — pair with an OpenAI / Azure / local embedding model. |
| Local OpenAI-compatible server | ✓ | ✓ | Run any OpenAI-protocol-speaking server (Ollama, vLLM, LM Studio, your own). Strongest data-residency story. |
| Local embedding (MiniLM / FastText) | — | ✓ | Built-in CPU-friendly embedding models — no external network required for vector search. |
Mix and match: it's common to use a hosted chat provider with local embeddings, so most of your text never leaves your network.
Configuration goals
- Predictability — same prompt + same context → same answer, modulo provider-side variance.
- Safety — no unintended actions, no data leakage to the wrong provider.
- Cost control — explicit caps on tokens per response and tokens per turn.
- Traceability — every chat turn and every tool call logs which model produced what.
- Failure tolerance — degrade gracefully if the provider is unreachable.
Per-provider setup
OpenAI
- Create an API key from the OpenAI dashboard. Use a project key so you can revoke it independently.
- Settings → AI Settings → Add provider → OpenAI.
- Paste the API key. The workspace encrypts it with `MSK_GRAPH_MASTER_KEY` before storing.
- Pick a chat model (e.g., `gpt-4o`, `gpt-4o-mini`) and an embedding model (e.g., `text-embedding-3-small`).
- Click Test. The workspace makes a probe call to each model.
Limits to set: max output tokens (start at 1024), per-call timeout (30s).
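The Test probe amounts to one small chat-completion request. To reproduce it outside the workspace (say, while debugging a key or an egress proxy), here is a minimal C# sketch against OpenAI's public REST API; the endpoint and body shape are OpenAI's own, and the model name and caps mirror the examples above:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Minimal probe against the OpenAI chat completions endpoint.
// Assumes OPENAI_API_KEY is set in the environment.
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(30) }; // per-call timeout
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
    "Bearer", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

var body = """
    { "model": "gpt-4o-mini",
      "messages": [{ "role": "user", "content": "ping" }],
      "max_tokens": 16 }
    """;

var response = await http.PostAsync("https://api.openai.com/v1/chat/completions",
    new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine($"{(int)response.StatusCode}: {await response.Content.ReadAsStringAsync()}");
```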
Azure OpenAI
- Create an Azure OpenAI resource and a deployment for each model you want to use.
- Settings → AI Settings → Add provider → Azure OpenAI.
- Provide:
  - Endpoint (e.g., `https://my-resource.openai.azure.com/`).
  - API key or Entra ID identity (recommended; uses the workspace's managed identity).
  - Deployment name for chat and for embeddings.
  - API version (use the latest stable).
- Click Test.
Notes: Azure deployments are per-region, per-quota. Set realistic timeouts (30–60s) to absorb cold starts on lightly used deployments.
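The same probe against Azure OpenAI changes shape in three ways: the model is addressed by deployment name in the URL, the key travels in an `api-key` header rather than a bearer token, and `api-version` is required. A sketch with placeholder resource and deployment names (the version string is one example of a stable release):

```csharp
using System;
using System.Net.Http;
using System.Text;

// Azure OpenAI probe: deployment name in the path, api-key header, api-version query.
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(60) }; // allow for cold starts
http.DefaultRequestHeaders.Add("api-key", Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY"));

var url = "https://my-resource.openai.azure.com/openai/deployments/my-chat-deployment"
        + "/chat/completions?api-version=2024-06-01";
var body = """{ "messages": [{ "role": "user", "content": "ping" }], "max_tokens": 16 }""";

var response = await http.PostAsync(url, new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine((int)response.StatusCode);
```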
Anthropic (Claude)
- Create a Claude API key from the Anthropic Console.
- Settings → AI Settings → Add provider → Anthropic.
- Paste the API key.
- Pick a chat model (e.g., `claude-opus-4-7`, `claude-sonnet-4-6`, `claude-haiku-4-5`).
- Configure an embedding provider separately (OpenAI, Azure OpenAI, or local) — Anthropic does not currently offer an embedding service.
- Click Test.
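Anthropic's Messages API differs again: it wants two headers (`x-api-key` and `anthropic-version`), and `max_tokens` is mandatory in the body. A minimal probe sketch reusing an example model id from the list above:

```csharp
using System;
using System.Net.Http;
using System.Text;

// Probe against Anthropic's Messages endpoint. Both headers are required,
// and max_tokens must be present in the body.
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };
http.DefaultRequestHeaders.Add("x-api-key", Environment.GetEnvironmentVariable("ANTHROPIC_API_KEY"));
http.DefaultRequestHeaders.Add("anthropic-version", "2023-06-01");

var body = """
    { "model": "claude-haiku-4-5",
      "max_tokens": 16,
      "messages": [{ "role": "user", "content": "ping" }] }
    """;

var response = await http.PostAsync("https://api.anthropic.com/v1/messages",
    new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine((int)response.StatusCode);
```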
Local OpenAI-compatible server
Useful for air-gapped or data-residency-sensitive deployments.
- Run a server that speaks the OpenAI HTTP protocol — examples: Ollama, vLLM, LM Studio, Text Generation Inference.
- Settings → AI Settings → Add provider → Custom (OpenAI-compatible).
- Provide:
  - Base URL (e.g., `http://ollama.internal:11434/v1`).
  - An API key, if the server requires one.
  - The model identifier (e.g., `llama3.1:70b`).
- Click Test.
Notes: the local server runs on your own infrastructure — its capacity and latency are your problem. Size it to match your chat volume.
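Before wiring the server up, confirm it actually speaks the OpenAI protocol: any compatible server (Ollama, vLLM, LM Studio) answers `GET /v1/models`. A sketch using the example Base URL from above:

```csharp
using System;
using System.Net.Http;

// Reachability + protocol check: list the local server's models.
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
var json = await http.GetStringAsync("http://ollama.internal:11434/v1/models");
Console.WriteLine(json); // expect a "data" array listing e.g. "llama3.1:70b"
```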
Local embeddings (built-in)
The workspace can produce embeddings locally without an external provider. Useful when:
- the corpus must never leave your network;
- the embedding cost from a hosted provider would be prohibitive;
- you want predictable latency under variable load.
Setup:
- Settings → AI Settings → Embeddings → Local.
- Pick a model family (MiniLM, FastText, or another supported local encoder).
- Trigger a rebuild from Settings → Maintenance → Rebuild embeddings.
Tradeoff: local embeddings are typically smaller and less accurate than hosted models, but on enterprise corpora the gap is usually narrow enough not to matter, especially when vector results are combined with text retrieval (hybrid search).
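For intuition about what vector search computes over these embeddings, the standard score is cosine similarity between two vectors. (That the workspace's scorer is cosine specifically is an assumption here; this is just the textbook formulation.)

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]; higher is more similar.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

Console.WriteLine($"{CosineSimilarity(new float[] { 1, 0 }, new float[] { 0.9f, 0.1f }):F3}"); // ~0.994
```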
Picking the right chat model
| If you want… | Pick… |
|---|---|
| Lowest latency for tool-call-heavy chat | A small fast model (gpt-4o-mini, claude-haiku-4-5, a local 7B-class model) |
| Highest answer quality on long, dense context | A frontier model (gpt-4o, claude-opus-4-7) |
| Predictable cost at scale | A mid-tier model with a hard max_tokens cap |
| Air-gapped operation | A local 70B-class model on a GPU box |
| Strict data residency in EU/US/JP/… | A regional Azure OpenAI deployment |
Whatever you pick, set hard caps:
- Max output tokens per turn (start at 1024; raise for summarization workloads).
- Per-call timeout (start at 30s).
- Max tool calls per turn (start at 5; large values let the LLM thrash).
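A sketch of how the timeout and tool-call caps compose in calling code; the max-output-tokens cap rides along as the `max_tokens` field in the request body, as in the probes above. The tool dispatch helper is hypothetical, since the workspace enforces the configured caps itself:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical turn loop honoring the caps above.
const int MaxToolCallsPerTurn = 5;
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)); // per-call timeout

for (var call = 0; call < MaxToolCallsPerTurn; call++) // tool-call budget
{
    cts.Token.ThrowIfCancellationRequested(); // give up once the timeout fires
    var turnComplete = await RunToolAsync(cts.Token); // hypothetical dispatch
    if (turnComplete) break;
}

static Task<bool> RunToolAsync(CancellationToken ct) => Task.FromResult(true); // stub
```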
Prompt templates
Maintain a small set of reusable templates and version them with the rest of your code:
- Grounded Q&A — "Answer using ONLY the snippets below. Cite with [1]." (Prompting Patterns)
- Summarization — explicit length/structure constraint.
- Classification / extraction — strict JSON schema for output.
- Tool-using assistant — keep it short and clear; list the available tools.
Templates live in Settings → AI Settings → Prompts; export them to git for promotion.
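As a concrete example of a versionable template, here is a grounded Q&A prompt kept as a constant in code. The `{snippets}` / `{question}` placeholder syntax is illustrative only, not the workspace's template language:

```csharp
// Hypothetical grounded Q&A template, versioned with the codebase.
const string GroundedQA = """
    Answer the question using ONLY the numbered snippets below.
    Cite every claim with its snippet number, e.g. [1].
    If the snippets do not contain the answer, say so.

    Snippets:
    {snippets}

    Question: {question}
    """;
```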
Tooling and endpoint access
If your chat surface has tools available:
- Curate the catalog deliberately. Vague or overlapping tools cause unreliable selection. See AI Tools.
- Mark admin-only tools so they don't appear for regular users.
- Bound expensive tools with a hard `Take(...)` and a cancellation check.
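Here is what a bounded tool handler looks like in practice. The node source and match logic are hypothetical stand-ins; the load-bearing parts are the hard `Take(...)` and the cancellation check:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Bounded search tool: never materializes more than 100 results,
// and stops promptly if the turn is cancelled.
static IReadOnlyList<string> SearchNodes(IEnumerable<string> nodes, string term, CancellationToken ct)
{
    var results = new List<string>();
    foreach (var node in nodes.Where(n => n.Contains(term)).Take(100)) // hard result cap
    {
        ct.ThrowIfCancellationRequested(); // honor turn cancellation
        results.Add(node);
    }
    return results;
}

var hits = SearchNodes(new[] { "alpha node", "beta node" }, "node", CancellationToken.None);
Console.WriteLine(string.Join(", ", hits));
```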
Fallback and degradation
Production deployments configure a fallback provider:
- If the primary chat provider returns 5xx or times out, the workspace transparently retries against the fallback.
- If both fail, the workspace returns a `tool_invocation_failed` error (in tools) or `external_provider_timeout` (in endpoints) — see Error codes.
For embeddings, the workspace falls back to text retrieval only when the embedding provider is unavailable. New nodes won't get vector entries until the provider recovers; rebuild after recovery with Reindexing and re-embedding.
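The chat retry-then-fallback behavior, sketched as a generic helper. The workspace does this internally; `primary` and `fallback` stand in for per-provider calls:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Try the primary provider; on 5xx, network failure, or timeout, try the fallback.
static async Task<string> ChatWithFallbackAsync(
    Func<CancellationToken, Task<HttpResponseMessage>> primary,
    Func<CancellationToken, Task<HttpResponseMessage>> fallback,
    CancellationToken ct)
{
    try
    {
        var r = await primary(ct);
        if ((int)r.StatusCode < 500) // non-5xx responses are returned as-is
            return await r.Content.ReadAsStringAsync(ct);
    }
    catch (Exception e) when (e is HttpRequestException or TaskCanceledException)
    {
        // network failure or timeout: fall through to the fallback provider
    }

    var fb = await fallback(ct);
    fb.EnsureSuccessStatusCode(); // both failed: surface the error to the caller
    return await fb.Content.ReadAsStringAsync(ct);
}
```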
Cost guardrails
Every provider call is logged with token counts and (when known) cost. Operationalize this:
- Set a hard ceiling on tokens-per-day under Settings → AI Settings → Quotas.
- Alert when daily spend exceeds a threshold.
- Monitor `/api/chatai/tools/metrics` for the tools that consume the most tokens — they often hide a prompt that's growing unbounded.
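To pull those metrics for offline analysis, a minimal sketch. The response schema isn't documented on this page, so this just dumps the raw JSON; the base URL is a placeholder, and you would attach whatever auth your deployment requires:

```csharp
using System;
using System.Net.Http;

// Dump the raw tool-metrics payload for inspection.
using var http = new HttpClient { BaseAddress = new Uri("https://workspace.example.com") };
var json = await http.GetStringAsync("/api/chatai/tools/metrics");
Console.WriteLine(json);
```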
Validation
After provider config changes, walk this checklist:
- Test button succeeds for chat and embeddings.
- A search for a known phrase returns expected results (validates embeddings).
- A chat turn returns an answer with citations (validates chat + tool flow).
- Token counts appear in monitoring.
- Failure mode: temporarily block the provider's egress; verify the workspace returns a clear error rather than hanging.
Next steps
- The architecture this configuration plugs into: RAG and agent architecture.
- Prompt patterns: Prompting Patterns.
- Tool design: AI Tools.
- Embedding strategy: Vector Search, Embeddings.
- Data handling: Security → AI / model-provider data handling.