Adding a model¶
You found a great new ONNX or GGUF model. This page walks through the eight steps from "I have a download URL" to "the model is shipping in Sauti".
Sauti is structured so that most of the work is data, not code. Adding a new model variant in an existing stage is mostly a manifest edit. Adding a new model that requires a custom runner is more work — also covered below.
The eight steps¶
1. Pick a stage.
2. Download into ai-models/<stage>/.
3. Verify the SHA-256.
4. Add (or update) a manifest entry.
5. Confirm the license.
6. Update the per-platform selection table if relevant.
7. (If a new stage / new model family) write a runner.
8. Wire the runner to a Sauti subsystem (if step 7 applied).
Step 1 — Pick a stage¶
The five stages are fixed:
| Stage | What lives here |
|---|---|
| stt | Speech-to-text models (Whisper variants today). |
| llm | Large language models (Qwen3, Gemma3 deferred). |
| embeddings | Sentence encoders for RAG (MiniLM). |
| tts | Text-to-speech models (Kokoro). |
| rag | Built artefacts (knowledge.db). |
If your model doesn't fit one of these, you're introducing a new stage — which means a spec amendment. Open a discussion in memory/todo.md under ### Open Questions first.
Step 2 — Download into ai-models/<stage>/¶
Conventions:
- Single-file model: drop directly under ai-models/<stage>/ (e.g. ai-models/llm/Qwen3-1.7B-Q5_K_M.gguf).
- Multi-file model: put into a subfolder named after the model variant (e.g. ai-models/stt/whisper-small/).
- The filename in the manifest's fileName field must match the on-disk filename exactly (use forward slashes for subfolder paths: "whisper-small/encoder_model_quantized.onnx").
The Sauti repo expects model files to land here first. The build pre-processor (planned, BUILD-001) reads from here and copies the platform-appropriate subset into Assets/StreamingAssets/VoiceAI/<stage>/ at build time.
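The path conventions above can be checked mechanically. The following is a minimal sketch, not part of the Sauti repo: resolve_model_path is a hypothetical helper that maps a manifest fileName to its expected on-disk location and rejects backslashes or paths that escape the stage folder.

```python
from pathlib import Path, PurePosixPath


def resolve_model_path(models_root: str, stage: str, file_name: str) -> Path:
    """Map a manifest fileName to its expected location under ai-models/<stage>/.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if "\\" in file_name:
        raise ValueError(f"fileName must use forward slashes: {file_name!r}")
    relative = PurePosixPath(file_name)
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"fileName must stay inside the stage folder: {file_name!r}")
    # Join part by part so the result uses the host OS separator.
    return Path(models_root, stage, *relative.parts)
```

Calling resolve_model_path("ai-models", "stt", "whisper-small/encoder_model_quantized.onnx") yields the multi-file location described above, and checking .is_file() on the result confirms that the manifest and the disk agree.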
Step 3 — Verify the SHA-256¶
shasum -a 256 ai-models/llm/Qwen3-1.7B-Q5_K_M.gguf
# b0949de5b2e06cbed6aa96517f9bd8afb334584b6f95ee83479292ff4bdd8ed3 ai-models/llm/Qwen3-1.7B-Q5_K_M.gguf
Record the lowercase hex digest. This goes in the manifest's sha256 field.
The hash is the only trust anchor: the build pre-processor (planned, BUILD-001) refuses to ship files whose hash doesn't match the manifest.
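Where shasum is not available, the same digest can be computed with Python's standard library. This is a minimal sketch; sha256_of_file and matches_manifest are hypothetical helper names, and the comparison is deliberately strict so that an uppercase digest in the manifest is reported as a mismatch.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 of a file, read in chunks.

    Chunked reading keeps memory use flat for multi-gigabyte model files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path: str, expected_sha256: str) -> bool:
    """Compare a file's digest with the manifest's sha256 field (exact match)."""
    return sha256_of_file(path) == expected_sha256
```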
Step 4 — Add or update the manifest entry¶
The manifest is ai-models/<stage>/manifest.json. The schema is documented in detail at Manifest schema.
Required fields (eleven, all must-not-be-null):
{
  "fileName": "your-model-file.onnx",
  "displayName": "Human-readable name",
  "format": "ONNX | GGUF | Binary",
  "sizeBytes": 12345678,
  "language": "en",
  "sha256": "lowercase-hex-from-step-3",
  "source": {
    "type": "huggingface | github | url",
    "repo": "owner/repo",
    "url": "https://canonical-page"
  },
  "license": "Apache-2.0 | MIT | (SPDX id) | (non-SPDX label)",
  "licenseConfirmedAt": "YYYY-MM-DD",
  "targets": ["windows", "macos", "..."],
  "status": "ready"
}
Optional but often relevant:
- quantisation: "INT8", "Q5_K_M", etc.
- licenseUrl + requiresExplicitAcceptance: required when the license is non-SPDX or requires click-through.
- supportsNoThinkDirective: LLM-stage only.
- notes: free text. Record source-remap rationale, tracker IDs, anything the next contributor needs to know.
If your editor has JSON-schema support (VS Code with the JSON extension; JetBrains IDEs), validation is live as you edit because of the $schema reference at the top of every manifest.
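For a quick check outside an editor, the eleven required fields can also be verified with a few lines of Python. This is a sketch under assumptions: it only tests for presence and non-null values plus the digest format, it is not a substitute for the JSON schema, and the helper names are hypothetical.

```python
import re

# The eleven required fields listed in the template entry above.
REQUIRED_FIELDS = (
    "fileName", "displayName", "format", "sizeBytes", "language", "sha256",
    "source", "license", "licenseConfirmedAt", "targets", "status",
)


def missing_required_fields(entry: dict) -> list:
    """Return the required field names that are absent or null in a manifest entry."""
    return [name for name in REQUIRED_FIELDS if entry.get(name) is None]


def is_lowercase_sha256(value: str) -> bool:
    """True when value looks like a lowercase hex SHA-256 digest (64 characters)."""
    return re.fullmatch(r"[0-9a-f]{64}", value) is not None
```

An empty list from missing_required_fields means the entry has all eleven fields; anything else names exactly what still needs filling in.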
Step 5 — Confirm the license¶
For permissive SPDX licenses (Apache-2.0, MIT, etc.):
- Record today's date in licenseConfirmedAt (ISO-8601: "2026-05-26").
- That's it.
For non-SPDX licenses (Gemma TOS, Llama community license, etc.):
- Record the label in license (e.g. "Gemma-Terms-of-Use").
- Add licenseUrl pointing at the terms document.
- Set requiresExplicitAcceptance: true.
- Open the URL. Read the terms. Verify redistribution is permitted. If the terms require a click-through and the maintainer hasn't clicked through, status should be "deferred", not "ready".
- The Editor download tool will surface this before fetching.
The deferred-Gemma3 entry in ai-models/llm/manifest.json is the worked example.
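The license rules in this step can be expressed as a small lint. The sketch below is illustrative only: PERMISSIVE_SPDX is a two-item stand-in for a real SPDX list, maintainer_accepted is a hypothetical flag for "the maintainer has clicked through", and none of this is the Editor download tool's actual logic.

```python
from datetime import date

# Illustrative subset only; a real check would consult the full SPDX license list.
PERMISSIVE_SPDX = {"Apache-2.0", "MIT"}


def license_problems(entry: dict, maintainer_accepted: bool = False) -> list:
    """Return human-readable problems with an entry's license fields."""
    problems = []
    try:
        date.fromisoformat(entry.get("licenseConfirmedAt") or "")
    except (TypeError, ValueError):
        problems.append("licenseConfirmedAt must be an ISO-8601 date (YYYY-MM-DD)")
    if entry.get("license") not in PERMISSIVE_SPDX:
        if not entry.get("licenseUrl"):
            problems.append("non-SPDX license needs licenseUrl")
        if entry.get("requiresExplicitAcceptance") is not True:
            problems.append("non-SPDX license needs requiresExplicitAcceptance: true")
    if (entry.get("requiresExplicitAcceptance") is True
            and not maintainer_accepted
            and entry.get("status") == "ready"):
        problems.append('status should be "deferred" until the terms are accepted')
    return problems
```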
Step 6 — Update the per-platform selection table¶
If your model changes which file ships on which platform — e.g. it's a smaller variant that should replace an existing one on Quest — update:
- The targets array in your manifest entry (which platforms ship this file).
- The per-platform table in memory/voice_ai_architecture.md § 6.
- The mirror table in Architecture — per-platform model selection.
- The mirror table in Per-platform notes.
If your model is just another variant at an existing stage (e.g. a new voice file under tts/voices/), only the manifest changes.
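One way to read the targets array is as a per-platform filter over the manifest. The sketch below assumes that only entries with status "ready" ship, which the build pre-processor (planned) may or may not implement this way; files_for_platform is a hypothetical name.

```python
def files_for_platform(entries: list, platform: str) -> list:
    """Return the fileName of every ready entry whose targets include platform."""
    return [
        entry["fileName"]
        for entry in entries
        if entry.get("status") == "ready" and platform in entry.get("targets", [])
    ]
```

Running this for "quest" before and after a manifest edit is a cheap way to confirm that a new variant changes the Quest build and nothing else.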
Step 7 — Write a runner (only for new model families)¶
Skip this step if your new model uses one of the existing runners. Examples that don't need a new runner:
- Another Whisper variant: Macoron/whisper.unity handles it.
- Another GGUF LLM: LLMUnity handles it.
- Another MiniLM-shaped sentence encoder: MiniLmRagEmbedder handles it (adjust OutputDimensions if the dimension differs).
- Another Kokoro voice: KokoroTtsRunner discovers it from the voices/ directory.
You do need a new runner if you're introducing a new model family with a different ONNX input schema or a new external runtime.
Where to put it¶
- ONNX-based runner -> Assets/Sauti/Runtime/Scripts/Tts/ (or a new subfolder for the stage), or Assets/Sauti/Editor/ if it's offline-only.
- Non-ONNX runner (e.g. a new GGUF-based runtime) -> introduce a new asmdef. Discuss with the architect first.
The template¶
The canonical "raw ONNX Runtime runner" template is KokoroTtsRunner.cs. It demonstrates:
- Lazy initialisation pattern (EnsureInitialised).
- Dynamic input-name discovery from InferenceSession.InputMetadata.Keys.
- Dynamic output-name discovery (rank-based, name-agnostic).
- IDisposable for clean teardown.
- Defensive error messages that name the available inputs / outputs.
Copy the file, rename, adjust:
- The input names list in PickFirstPresent(inputKeys, ...).
- The output discovery (rank, dtype).
- The tensor shapes for your model's inputs.
- The Synthesize/Embed/...Async public method signature.
Reference: see also MiniLmRagEmbedder.cs for the embedder variant of the same pattern.
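The discovery logic in the template is language-neutral, so it can be sketched in Python to show the idea before reading the C#. This is not the code in KokoroTtsRunner.cs: the function names below are hypothetical, and the shapes are plain tuples standing in for whatever the ONNX session reports.

```python
def pick_first_present(available, candidates):
    """Return the first candidate input name the model actually exposes.

    The error names the available inputs so a mismatch is easy to diagnose.
    """
    for name in candidates:
        if name in available:
            return name
    raise KeyError(
        f"none of {list(candidates)} found; model exposes {sorted(available)}"
    )


def pick_output_by_rank(output_shapes, rank):
    """Return the first output whose shape has the given rank, ignoring its name."""
    for name, shape in output_shapes.items():
        if len(shape) == rank:
            return name
    raise KeyError(
        f"no output of rank {rank}; model exposes {dict(output_shapes)}"
    )
```

Matching outputs by rank rather than by name is what lets one runner tolerate exports of the same model family that label their tensors differently.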
Tests¶
Tests for the new runner go in Assets/Sauti/Tests/Editor/. Mirror the shape of existing tests:
- Construct the runner.
- Drive EmbedAsync / SynthesizeAsync / equivalent.
- Assert on output shape, length, value range.
- Use a small model file (or none, for shape-only tests) so tests stay fast.
Step 8 — Wire the runner to a Sauti subsystem¶
If the new runner replaces an existing one at the same stage:
- Update the orchestrator (e.g. FullVoiceLoop.cs) to add the new file to its *ModelFileNamePreference array.
- Order the array by preference: the first file that is present wins.
If the new runner introduces a new pipeline step:
- This is a spec change. Open a discussion before writing code.
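The "first present file wins" rule for the preference arrays can be sketched as a short lookup. This is an illustration in Python rather than the orchestrator's C#; first_present is a hypothetical name, and returning None when nothing is found is an assumption about how a caller would want to handle that case.

```python
from pathlib import Path


def first_present(model_dir: str, preference: list):
    """Return the first file name in preference that exists in model_dir, else None."""
    for file_name in preference:
        if Path(model_dir, file_name).is_file():
            return file_name
    return None
```

Because the scan stops at the first hit, putting a new file at the front of the array makes it the default wherever it is shipped, while platforms that lack it fall through to the next entry.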
A worked example — adding a new Kokoro voice¶
The lightest possible flow. Suppose Kokoro publishes a new voice af_jessica:
- Stage: tts.
- Download: wget https://huggingface.co/onnx-community/Kokoro-82M-ONNX/resolve/main/voices/af_jessica.bin -O ai-models/tts/voices/af_jessica.bin
- SHA-256: shasum -a 256 ai-models/tts/voices/af_jessica.bin
- Manifest: append to ai-models/tts/manifest.json:
  {
    "fileName": "voices/af_jessica.bin",
    "displayName": "Kokoro voice — af_jessica (American Female, Jessica)",
    "format": "Binary",
    "sizeBytes": 524288,
    "approxSizeMB": 1,
    "language": "en",
    "sha256": "<digest>",
    "source": {
      "type": "huggingface",
      "repo": "onnx-community/Kokoro-82M-ONNX",
      "url": "https://huggingface.co/onnx-community/Kokoro-82M-ONNX/resolve/main/voices/af_jessica.bin"
    },
    "license": "Apache-2.0",
    "licenseConfirmedAt": "<today>",
    "targets": ["windows", "macos", "linux", "ios", "android_flagship", "android_lowend", "quest"],
    "status": "ready",
    "notes": "Tracked as KOKORO-VOICES-DL-001."
  }
- License: Apache-2.0 (same as all Kokoro voices). Date confirmed.
- Per-platform table: unchanged. All voices ship on all platforms.
- Runner: unchanged. KokoroTtsRunner discovers voices from the directory.
- Wire-up: unchanged. The new voice id is automatically in runner.AvailableVoiceIds.
- Docs: add a row to Voice IDs.
Five-minute task, no code.
A worked example — adding a new LLM variant¶
Suppose Phi-4 ships a Q5_K_M GGUF and you want to add it as an alternative to Qwen3.
- Stage: llm.
- Download into ai-models/llm/phi-4-q5_k_m.gguf.
- SHA-256.
- Manifest entry in ai-models/llm/manifest.json with format: "GGUF", supportsNoThinkDirective: false (Phi-4 doesn't honour the Qwen3 directive).
- License: Phi-4 is MIT. Easy.
- Per-platform table: decide where it ships. If it's a Quest-friendlier size than Qwen3, update the Quest row in voice_ai_architecture.md § 6 to prefer Phi-4.
- Runner: unchanged. LLMUnity loads any GGUF via LLM.SetModel(path).
- Wire-up: add "phi-4-q5_k_m.gguf" to the llmModelFileNamePreference array in experiments/05-full-voice-loop/FullVoiceLoop.cs (and any other orchestrator). Choose its position based on preference order.
- Prompt assembler: because Phi-4 doesn't honour /no_think, branch the system prompt assembly per resolved model. See Voice prompt rules — non-thinking directive.
- Test: add an integration test that loads the new model and runs one short Chat call. Verify the response doesn't include <think> blocks (or other reasoning markers).
Half-day task. The bulk is the prompt-assembler branching.
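The prompt-assembler branching and the test's check can both be sketched in a few lines. This is a Python illustration of the idea, not Sauti's assembler: the function names are hypothetical, and appending /no_think on its own line at the end of the system prompt is an assumption about where the directive goes.

```python
def assemble_system_prompt(base_prompt: str, entry: dict) -> str:
    """Append /no_think only when the resolved model's manifest entry supports it."""
    if entry.get("supportsNoThinkDirective"):
        return f"{base_prompt}\n/no_think"
    return base_prompt


def contains_think_block(response: str) -> bool:
    """True when a model response carries a <think> marker, opened or closed."""
    return "<think>" in response or "</think>" in response
```

Keying the branch on the manifest field rather than on the file name means a future model needs only a manifest edit to get the right prompt.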
Cross-references¶
- Manifest schema in full: Manifest schema.
- Per-platform model selection: Architecture.
- Runner templates: KokoroTtsRunner.cs, MiniLmRagEmbedder.cs.