AI models catalogue
Every model file Sauti can load, by stage. Each row is copied verbatim from the per-stage manifest under ai-models/<stage>/manifest.json. Total bundled assets: ~1.6 GB with Qwen3 only, ~2.3 GB once Gemma3 lands.
Source of truth
These tables mirror ai-models/<stage>/manifest.json. If you change a manifest, the docs should follow. The build pre-processor reads the manifest at build time to pick the platform-relevant subset.
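As a rough illustration of picking the platform-relevant subset, the sketch below filters manifest entries by target. The manifest schema is not reproduced on this page, so the `files`, `file`, and `targets` keys and the `select_for_platform` helper are assumptions made for illustration; only the `status` field is named elsewhere in this catalogue.

```python
import json

# Hypothetical manifest shape; the real schema lives in ai-models/<stage>/manifest.json.
SAMPLE_MANIFEST = json.loads("""
{
  "files": [
    {"file": "whisper-small/encoder_model_quantized.onnx",
     "targets": ["windows", "macos", "linux", "ios", "android_flagship"],
     "status": "ready"},
    {"file": "whisper-tiny/encoder_model_quantized.onnx",
     "targets": ["quest", "android_lowend"],
     "status": "ready"}
  ]
}
""")


def select_for_platform(manifest, platform):
    """Return the file names of ready entries that target the given platform."""
    return [
        entry["file"]
        for entry in manifest["files"]
        if platform in entry["targets"] and entry["status"] == "ready"
    ]


print(select_for_platform(SAMPLE_MANIFEST, "quest"))
```

Filtering on `status` as well as the target keeps deferred entries out of a build, which matches how the deferred Gemma3 entry is described further down.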
STT: Speech-to-text
Stage: stt
Manifest: ai-models/stt/manifest.json
Runtime: asus4/onnxruntime-unity via Macoron/whisper.unity
Language: English only (language = "en")
Whisper Small (flagship)
Targets: windows, macos, linux, ios, android_flagship. Lives under ai-models/stt/whisper-small/.
| File | Size | SHA-256 (first 16 chars) | Format | Status |
|---|---|---|---|---|
| `encoder_model_quantized.onnx` | 88 MB | a43a83f3c5361cd5... | ONNX INT8 | ready |
| `decoder_model_merged_quantized.onnx` | 149 MB | ec07c3cbb64172c3... | ONNX INT8 | ready |
| `tokenizer.json` | 2 MB | 27fc476bfe7f1729... | Binary | ready |
| `config.json` | 2 KB | 457854d452f17661... | Binary | ready |
| `generation_config.json` | 4 KB | f538b28220c6a6d6... | Binary | ready |
Source: onnx-community/whisper-small — MIT licensed, license confirmed 2026-05-26.
Total Whisper Small: ~239 MB.
Whisper Tiny (Quest / low-end)
Targets: quest, android_lowend. Lives under ai-models/stt/whisper-tiny/.
| File | Size | SHA-256 (first 16 chars) | Format | Status |
|---|---|---|---|---|
| `encoder_model_quantized.onnx` | 10 MB | 2af4a414ca47aa30... | ONNX INT8 | ready |
| `decoder_model_merged_quantized.onnx` | 29 MB | 25e807a962b63493... | ONNX INT8 | ready |
| `tokenizer.json` | 2 MB | 27fc476bfe7f1729... | Binary | ready |
| `config.json` | 2 KB | 46aeea0a406afbeb... | Binary | ready |
| `generation_config.json` | 4 KB | f5c67e5a4f7102f8... | Binary | ready |
Source: onnx-community/whisper-tiny — MIT licensed.
Total Whisper Tiny: ~43 MB.
The Whisper Tiny tokenizer is byte-identical to the Whisper Small tokenizer (the SHA-256 digests match), so both variants bundled here share one tokenizer.json.
LLM: Large language model
Stage: llm
Manifest: ai-models/llm/manifest.json
Runtime: undreamai/LLMUnity (wraps llama.cpp)
Qwen3-1.7B Q5_K_M (flagship)
| Property | Value |
|---|---|
| File | Qwen3-1.7B-Q5_K_M.gguf |
| Display name | Qwen3 1.7B (GGUF Q5_K_M) |
| Format | GGUF Q5_K_M |
| Size | 1.26 GB (1 257 880 128 bytes) |
| SHA-256 | b0949de5b2e06cbed6aa96517f9bd8afb334584b6f95ee83479292ff4bdd8ed3 |
| Source | unsloth/Qwen3-1.7B-GGUF |
| License | Apache-2.0 (confirmed 2026-05-26) |
| Targets | windows, macos, linux, ios, android_flagship |
| Honours `/no_think`? | Yes |
| Status | ready |
Source remap
The original voice_ai_architecture.md spec pointed at Qwen/Qwen3-1.7B-GGUF, which only publishes the Q8_0 (1.83 GB) variant. Sauti remaps to unsloth/Qwen3-1.7B-GGUF, which provides the spec's Q5_K_M variant at 1.26 GB (the byte count listed in the table above). See the manifest notes field for the full rationale.
Gemma3-1B Q4_K_M (Quest / low-end), deferred post-v1.2
| Property | Value |
|---|---|
| File | gemma3-1b-q4_k_m.gguf |
| Display name | Gemma 3 1B Instruct (GGUF Q4_K_M) |
| Format | GGUF Q4_K_M |
| Size | 0.72 GB (751 619 276 bytes) approx |
| SHA-256 | TODO_FILL_AFTER_DOWNLOAD |
| Source | google/gemma-3-1b-it-GGUF |
| License | Gemma Terms of Use (non-SPDX) |
| Requires explicit acceptance? | Yes |
| Targets | quest, android_lowend |
| Honours `/no_think`? | No |
| Status | deferred |
Deferred to post-v1.2
The Gemma Terms of Use require manual acceptance via Hugging Face login. The team chose simplicity-of-shipping over second-LLM-variety for v1.2 — Quest builds in v1.2 fall back to Qwen3-1.7B-Q5_K_M (1.26 GB, tight on Quest 3's 8 GB RAM but functional). Future v1.3+ re-activates this entry: accept terms, download with an HF token, fill sha256 + licenseConfirmedAt, flip status to ready.
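The re-activation steps could be captured in a small helper like the one below. This is only a sketch: the `sha256`, `licenseConfirmedAt`, and `status` field names come from this page, while the `activate_entry` function, the ISO date format, and the sample entry layout are assumptions.

```python
import re
from datetime import date

# Sample of a deferred entry; keys other than sha256/status are illustrative.
DEFERRED_ENTRY = {
    "file": "gemma3-1b-q4_k_m.gguf",
    "sha256": "TODO_FILL_AFTER_DOWNLOAD",
    "status": "deferred",
}


def activate_entry(entry, sha256, confirmed_on):
    """Return a copy of a deferred entry with checksum, licence date, and status filled in."""
    # Reject placeholders and truncated digests before they reach the manifest.
    if not re.fullmatch(r"[0-9a-f]{64}", sha256):
        raise ValueError("sha256 must be 64 lowercase hex characters")
    updated = dict(entry)  # leave the caller's entry untouched
    updated["sha256"] = sha256
    updated["licenseConfirmedAt"] = confirmed_on.isoformat()  # assumed YYYY-MM-DD format
    updated["status"] = "ready"
    return updated
```

Validating the digest first means a leftover `TODO_FILL_AFTER_DOWNLOAD` placeholder can never be paired with a `ready` status.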
Embeddings: RAG encoder
Stage: embeddings
Manifest: ai-models/embeddings/manifest.json
Runtime: asus4/onnxruntime-unity
Used by: the offline build (KnowledgeBaseChunker -> MiniLmRagEmbedder) and the runtime query path. Both must use the same encoder, because query vectors are only comparable with chunk vectors produced by the same model.
all-MiniLM-L6-v2 INT8
| Property | Value |
|---|---|
| File | model_int8.onnx |
| Display name | all-MiniLM-L6-v2 (INT8) |
| Format | ONNX INT8 |
| Size | 22 MB (22 972 370 bytes) |
| SHA-256 | afdb6f1a0e45b715d0bb9b11772f032c399babd23bfc31fed1c170afc848bdb1 |
| Output dim | 384 |
| Source | Xenova/all-MiniLM-L6-v2 |
| License | Apache-2.0 (confirmed 2026-05-26) |
| Targets | all platforms |
| Status | ready |
WordPiece vocab
| Property | Value |
|---|---|
| File | vocab.txt |
| Size | 232 KB (231 508 bytes) |
| SHA-256 | 07eced375cec144d27c900241f3e339478dec958f92fddbc551f295c992038a3 |
| Vocab size | 30 522 tokens (standard bert-base-uncased) |
| Source | Xenova/all-MiniLM-L6-v2 |
Source remap
Original manifest pointed at optimum/all-MiniLM-L6-v2, which only ships FP32 model.onnx. Sauti remaps to Xenova/all-MiniLM-L6-v2 which provides onnx/model_int8.onnx. The vocab is byte-identical to the optimum copy.
TTS: Text-to-speech
Stage: tts
Manifest: ai-models/tts/manifest.json
Runtime: asus4/onnxruntime-unity via Sauti's own KokoroTtsRunner
Source: all files from onnx-community/Kokoro-82M-ONNX — Apache-2.0.
Core model + tokenizer
| File | Size | SHA-256 (first 16) | Notes |
|---|---|---|---|
| `model_quantized.onnx` | 88 MB | 0d55b15d4b735d61... | Kokoro 82M INT8. Sample rate 24 kHz. |
| `tokenizer.json` | 5 KB | ee301fc39cf903dd... | 177-entry IPA + ASCII-punct vocab. Pad token "$" has id 0. |
Voices (×11)
Each .bin is raw float32 of shape (-1, 1, 256), 524 288 bytes (= 131 072 floats = 512 × 1 × 256). The leading dim is "max token length"; the runner indexes by len(tokens) to pick the row.
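The row lookup described above can be sketched with the standard library alone. The `style_row` helper is hypothetical, little-endian byte order is an assumption (the page only says "raw float32"), and indexing directly by the token count follows the sentence above rather than the actual KokoroTtsRunner source.

```python
import struct

STYLE_DIM = 256   # last dimension of the (-1, 1, 256) shape
FLOAT_SIZE = 4    # bytes per float32


def style_row(voice_bytes, token_count):
    """Return the 256 floats of the row selected by the token count."""
    row_bytes = STYLE_DIM * FLOAT_SIZE
    if len(voice_bytes) % row_bytes != 0:
        raise ValueError("voice data is not a whole number of 256-float rows")
    rows = len(voice_bytes) // row_bytes  # 512 for a 524 288-byte file
    if not 0 <= token_count < rows:
        raise IndexError(f"token count {token_count} outside 0..{rows - 1}")
    start = token_count * row_bytes
    # "<" = little-endian (assumed), "f" = float32
    return struct.unpack(f"<{STYLE_DIM}f", voice_bytes[start:start + row_bytes])
```

A caller would read the whole file with `open(path, "rb").read()` and pass the bytes in; at 512 KB per voice there is little reason to stream it.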
Voice id convention: first letter = accent (a = American, b = British), second letter = gender (f = female, m = male). See Voice IDs for the full table.
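The naming convention is simple enough to decode mechanically. The sketch below is illustrative only; `describe_voice` is a hypothetical helper, and it does not reproduce the "(default blend)" label that the table gives the bare `af` voice.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(voice_id):
    """Turn an id such as 'bm_george' into 'British Male, George'."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or prefix[0] not in ACCENTS or prefix[1] not in GENDERS:
        raise ValueError(f"unrecognised voice id: {voice_id!r}")
    label = f"{ACCENTS[prefix[0]]} {GENDERS[prefix[1]]}"
    # Ids without a name part (e.g. 'af') yield only the accent and gender.
    return f"{label}, {name.capitalize()}" if name else label


print(describe_voice("bm_george"))  # British Male, George
print(describe_voice("af"))         # American Female
```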
| File | Display name | SHA-256 (first 16) |
|---|---|---|
| `voices/af.bin` | American Female (default blend) | a4f11d9d055a12bf... |
| `voices/af_bella.bin` | American Female, Bella | 38e12d4b9b31a751... |
| `voices/af_nicole.bin` | American Female, Nicole | f27666996f2d2277... |
| `voices/af_sarah.bin` | American Female, Sarah | fe4f8b49c272dc5e... |
| `voices/af_sky.bin` | American Female, Sky | f8017c8507ec6a55... |
| `voices/am_adam.bin` | American Male, Adam | 6d5255a4b4803f59... |
| `voices/am_michael.bin` | American Male, Michael | 9c3be118019ddb41... |
| `voices/bf_emma.bin` | British Female, Emma | fd71ce57d2d69ccb... |
| `voices/bf_isabella.bin` | British Female, Isabella | d3c6f2737d586f01... |
| `voices/bm_george.bin` | British Male, George | 68736d5397fcbc46... |
| `voices/bm_lewis.bin` | British Male, Lewis | 45b693a17544cc98... |
Total Kokoro footprint: ~88 MB model + 5 KB tokenizer + 11 × 512 KB voices = ~94 MB.
RAG: Knowledge base
Stage: rag
Manifest: none (built artefact, not a downloaded file)
| Property | Value |
|---|---|
| File | knowledge.db |
| Format | Sauti binary (magic 0x01474152, format documented at RagDatabaseBuilder.WriteDatabase) |
| Location (source-of-truth) | ai-models/rag/knowledge.db |
| Location (runtime) | Assets/StreamingAssets/VoiceAI/rag/knowledge.db |
| Built by | Sauti -> Build Knowledge Base Editor menu (RagDatabaseBuilder.BuildFromMenu) |
| Input | All *.md / *.txt under knowledge-base/ except README.md |
| Status | pending — build via Editor menu once MiniLmRagEmbedder model is in place |
See Knowledge base authoring for the full build pipeline.
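A quick sanity check on a built knowledge.db is to look at its magic number. The sketch below assumes the magic is stored as a little-endian 32-bit integer at offset 0; neither detail is stated on this page, and `has_rag_magic` is a hypothetical helper. Under that assumption the file would begin with the bytes `RAG\x01`.

```python
import struct

RAG_MAGIC = 0x01474152  # value quoted in the table above


def has_rag_magic(header):
    """True if the first four bytes decode to the magic number (little-endian assumed)."""
    if len(header) < 4:
        return False
    (value,) = struct.unpack("<I", header[:4])
    return value == RAG_MAGIC
```

Typical use would be `has_rag_magic(open(path, "rb").read(4))` before handing the file to the runtime loader. The authoritative layout remains the one documented at RagDatabaseBuilder.WriteDatabase.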
Per-platform shipping matrix
Which files end up in a given build:
| Platform | STT | LLM | Embeddings | TTS | Total bundle |
|---|---|---|---|---|---|
| Windows / Linux | Whisper Small (239 MB) | Qwen3 (1.26 GB) | MiniLM (22 MB) | Kokoro + voices (94 MB) | ~1.6 GB |
| macOS / iOS | Whisper Small (239 MB) | Qwen3 (1.26 GB) | MiniLM (22 MB) | Kokoro + voices (94 MB) | ~1.6 GB |
| Android (flagship) | Whisper Small (239 MB) | Qwen3 (1.26 GB) | MiniLM (22 MB) | Kokoro + voices (94 MB) | ~1.6 GB |
| Android (low-end) | Whisper Tiny (43 MB) | Qwen3 (1.26 GB, fallback while Gemma3 is deferred) | MiniLM (22 MB) | Kokoro + voices (94 MB) | ~1.4 GB |
| Quest 2 / 3 | Whisper Tiny (43 MB) | Qwen3 (1.26 GB, fallback while Gemma3 is deferred) | MiniLM (22 MB) | Kokoro + voices (94 MB) | ~1.4 GB |
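The totals in the last column can be reproduced by summing the per-stage sizes, treating the listed figures as decimal megabytes. The dictionary keys and the `bundle_total_gb` helper below are illustrative names, not part of the build.

```python
# Per-stage sizes in decimal MB, taken from the tables on this page.
SIZES_MB = {
    "whisper_small": 239,
    "whisper_tiny": 43,
    "qwen3": 1258,   # 1 257 880 128 bytes
    "minilm": 22,
    "kokoro": 94,
}


def bundle_total_gb(stt, llm="qwen3"):
    """Sum STT + LLM + embeddings + TTS and return decimal GB rounded to one place."""
    total_mb = SIZES_MB[stt] + SIZES_MB[llm] + SIZES_MB["minilm"] + SIZES_MB["kokoro"]
    return round(total_mb / 1000, 1)


print(bundle_total_gb("whisper_small"))  # 1.6
print(bundle_total_gb("whisper_tiny"))   # 1.4
```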
How to verify a model on disk
```shell
shasum -a 256 ai-models/llm/Qwen3-1.7B-Q5_K_M.gguf
# Expected: b0949de5b2e06cbed6aa96517f9bd8afb334584b6f95ee83479292ff4bdd8ed3
```
The Editor build pre-processor (planned, tracked as BUILD-001) will perform this verification before copying into StreamingAssets/. Mismatches abort the build with a clear error.
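Until that pre-processor exists, the same check can be scripted. The sketch below is not the BUILD-001 implementation; `sha256_of` and `verify_model` are hypothetical helpers that stream the file in chunks, so a 1.26 GB GGUF never has to fit in memory, and raise on a mismatch.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_sha256):
    """Raise ValueError if the file's digest differs from the manifest value."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
```

Calling `verify_model` with the Qwen3 path and the digest from the LLM table would mirror the `shasum` command above.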