Architecture
The canonical specification for Sauti's runtime lives in memory/voice_ai_architecture.md. This page reframes that spec for an outside reader. Wherever the two disagree, the canonical file wins.
The one-line architecture
Mic -> Whisper ONNX -> text -> Memory (history + RAG + temp KV) -> enriched prompt
|
v
Qwen3 / Gemma3 GGUF -> tokens -> Kokoro ONNX -> Audio
Four pipeline stages, each using the model format best suited to it. The two runtimes (ONNX Runtime and llama.cpp) share no memory and no GPU context; they exchange data only as C# string values.
The hybrid-runtime invariant
Sauti runs two model runtimes side by side:
| Runtime | Stages | Why |
|---|---|---|
| ONNX Runtime (via asus4/onnxruntime-unity) | STT, embeddings, TTS | Best format for these workloads. Quantised INT8 weights, CPU/GPU EPs (CoreML / DirectML / NNAPI), small footprints. |
| llama.cpp (via undreamai/LLMUnity) | LLM | Purpose-built KV-cache. Q4/Q5 GGUF weights mmap-friendly. Streaming token callback API. Metal / Vulkan offload that just works. |
The earlier "single ONNX runtime" design was reversed because GGUF + llama.cpp is materially better than ONNX for autoregressive LLM inference on consumer CPUs and mobile/VR. The hybrid cost is paid once, in build configuration; the benefit is paid back every inference.
Invariant
The two runtimes only ever exchange string over the C# boundary. No native interop between them. If that invariant is ever broken, the hybrid decision is no longer safe and must be reopened.
This invariant has consequences:
- All data crossing between runtimes is text. Audio (PCM float[]) never crosses the boundary; only the transcript or generated string does.
- A stage that needs structured data from the LLM gets it as JSON-in-a-string and parses it on the C# side. There is no "shared tensor" path.
- Failure modes are isolated. A llama.cpp crash cannot corrupt an ONNX Runtime session and vice versa.
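The JSON-in-a-string consequence can be sketched as follows. This is a minimal illustration, not Sauti's implementation: the QuestUpdate shape and the LlmJson.TryParse helper are hypothetical, and the sketch uses System.Text.Json from the .NET base library, which may need to be swapped for whichever JSON parser the Unity project already ships.

```csharp
using System;
using System.Text.Json;

// Hypothetical payload shape; the Sauti spec does not define one.
public sealed class QuestUpdate
{
    public string Quest { get; set; } = "";
    public int Progress { get; set; }
}

public static class LlmJson
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true
    };

    // The LLM runtime hands back plain text. Small models often wrap JSON in
    // extra words, so only the outermost {...} span is parsed.
    public static bool TryParse(string llmText, out QuestUpdate result)
    {
        result = null;
        if (string.IsNullOrEmpty(llmText)) return false;

        int start = llmText.IndexOf('{');
        int end = llmText.LastIndexOf('}');
        if (start < 0 || end <= start) return false;

        try
        {
            string json = llmText.Substring(start, end - start + 1);
            result = JsonSerializer.Deserialize<QuestUpdate>(json, Options);
            return result != null;
        }
        catch (JsonException)
        {
            // Malformed model output is an expected case, not a crash.
            return false;
        }
    }
}
```

Because the parse happens entirely on the C# side, a malformed reply degrades to a false return value and never touches either native runtime.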
Runtime stack
| Stage | Model | Format | Runtime | Bundled size |
|---|---|---|---|---|
| STT | Whisper Small | ONNX INT8 | asus4/onnxruntime-unity (via whisper.unity) | ~230 MB |
| STT (Quest / low-end) | Whisper Tiny | ONNX INT8 | same | ~38 MB |
| Embeddings (RAG) | all-MiniLM-L6-v2 | ONNX INT8 | asus4/onnxruntime-unity | ~22 MB |
| LLM | Qwen3-1.7B | GGUF Q5_K_M | LLMUnity (llama.cpp) | ~1.2 GB |
| LLM (Quest / low-end) | Gemma3-1B (deferred) | GGUF Q4_K_M | LLMUnity (llama.cpp) | ~0.7 GB |
| TTS | Kokoro 82M | ONNX INT8 | asus4/onnxruntime-unity | ~42 MB |
See the AI models catalogue for SHA-256 hashes and exact byte counts.
The three-layer memory architecture
+----------------------------------------------------------------+
| Layer 1   Conversation History   rolling 10-turn window        |
| Layer 2   Temporary Memory       Dict<string,string> in C#     |
| Layer 3   Vector DB (RAG)        pre-built knowledge.db        |
+----------------------------------------------------------------+
                                |
                                v
                     Combined into one prompt
                                |
                                v
                        LLM inference call
Each layer has a different lifecycle, a different write path, and a different reason to exist.
Layer 1 — Conversation history
- Scope: current session only. Cleared on app exit via await llmAgent.ClearHistory().
- Storage: List<ChatMessage> exposed as llmAgent.chat, managed by LLMUnity.LLMAgent internally.
- Behaviour: LLMUnity manages history by context-window fill (overflowStrategy + overflowTargetRatio), not by message count.
- Sauti convention: for a hard 10-turn cap, set overflowStrategy to truncate AND keep llmAgent.chat.Count <= 20 (10 user + 10 assistant messages) by trimming from the front each turn.
See Memory layers — Layer 1 for the full pattern.
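The front-trimming convention can be sketched as a small generic helper. This is an illustration under stated assumptions, not the helper in FullVoiceLoop.cs: HistoryTrimmer and TrimToLastTurns are hypothetical names, and the sketch assumes the list holds only alternating user and assistant messages, starting with a user message.

```csharp
using System;
using System.Collections.Generic;

public static class HistoryTrimmer
{
    // Keeps at most maxTurns user/assistant pairs by removing the oldest
    // messages from the front of the list. Returns how many were removed.
    // Generic so it can be tried without referencing LLMUnity's ChatMessage.
    public static int TrimToLastTurns<T>(List<T> chat, int maxTurns = 10)
    {
        if (maxTurns < 0) throw new ArgumentOutOfRangeException(nameof(maxTurns));

        int excess = chat.Count - maxTurns * 2;
        if (excess <= 0) return 0;

        // Remove whole pairs so the list still starts with a user message
        // (assumes strict user/assistant alternation).
        if (excess % 2 != 0) excess++;
        excess = Math.Min(excess, chat.Count);

        chat.RemoveRange(0, excess);
        return excess;
    }
}
```

Under these assumptions it would be called once per turn, for example as HistoryTrimmer.TrimToLastTurns(llmAgent.chat), before the next inference call.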
Layer 2 — Temporary memory
Named facts learned mid-session (player name, current quest, stated preferences). Survives turn rollovers; gone on app exit. Implemented as a pure-C# static class — no Unity API dependency, unit-testable headlessly.
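using System.Collections.Generic;
using System.Linq;

public static class TemporaryMemory
{
    // Session-scoped named facts; no Unity types, so this runs in plain unit tests.
    private static readonly Dictionary<string, string> _store = new();

    // Adds a fact, or overwrites the value already stored under the same key.
    public static void Set(string key, string value) => _store[key] = value;

    public static void Clear() => _store.Clear();

    // Renders all facts as one prompt line, or an empty string when nothing is stored.
    public static string BuildPromptBlock()
    {
        if (_store.Count == 0) return string.Empty;
        var facts = string.Join(", ", _store.Select(kv => $"{kv.Key}={kv.Value}"));
        return $"Known facts about this session: {facts}.\n";
    }
}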
Source: Assets/Sauti/Runtime/Scripts/TemporaryMemory.cs.
Layer 3 — Vector database (RAG)
Semantic search over a pre-built, read-only knowledge base (lore, NPC backstories, world facts).
- Scope: persistent, read-only, built offline, bundled with the plugin.
- Storage: flat binary file (knowledge.db) at Assets/StreamingAssets/VoiceAI/rag/knowledge.db. Built by an Editor tool that converts plain-text sources under knowledge-base/ into 384-dim embeddings.
- Embedding model: all-MiniLM-L6-v2 ONNX INT8. The same model encodes both the knowledge base (offline) and each user query (at runtime); mixing encoders breaks semantic similarity.
- Top-K: default numResults: 3.
The build pipeline is implemented by Sauti.Editor.Rag.RagDatabaseBuilder and writes to both ai-models/rag/knowledge.db (source-of-truth) and Assets/StreamingAssets/VoiceAI/rag/knowledge.db (runtime read path), keeping the two in lockstep.
See Memory layers — Layer 3 for the embedder and chunker internals.
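The retrieval step can be sketched as a brute-force cosine search over the stored embeddings. This is a minimal sketch, not Sauti's retriever: VectorSearch, Cosine, and TopK are hypothetical names, the chunk embeddings are assumed to be already loaded into memory as float arrays, and the on-disk layout of knowledge.db is not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VectorSearch
{
    // Cosine similarity between two vectors of equal length.
    public static float Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Dimension mismatch.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // A zero vector has no direction; treat it as unrelated to everything.
        if (normA == 0 || normB == 0) return 0f;
        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }

    // Returns the indices of the k chunks most similar to the query, best first.
    public static int[] TopK(float[] query, IReadOnlyList<float[]> chunks, int k = 3)
    {
        return Enumerable.Range(0, chunks.Count)
            .OrderByDescending(i => Cosine(query, chunks[i]))
            .Take(k)
            .ToArray();
    }
}
```

A linear scan like this is usually adequate for a small bundled knowledge base. The dimension check also surfaces the mixed-encoder mistake early when the two models produce different vector sizes.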
Prompt assembly — how all three layers combine
This is the single most important code shape in Sauti. The verbatim § 4.5 pattern from voice_ai_architecture.md:
string BuildPrompt(string userMessage, string[] ragChunks)
{
    var sb = new StringBuilder();
    sb.AppendLine("Respond only in plain spoken English sentences. No markdown. Under 40 words. /no_think");
    sb.Append(TemporaryMemory.BuildPromptBlock()); // Layer 2
    if (ragChunks.Length > 0) // Layer 3
    {
        sb.AppendLine("Relevant context:");
        foreach (var chunk in ragChunks) sb.AppendLine($"- {chunk}");
    }
    // Layer 1: conversation history is appended internally by LLMUnity
    return sb.ToString();
}
The system-prompt rules come from § 9 (see Voice prompt rules). The /no_think tail is a Qwen3-specific directive (§ 9.1).
The real implementation lives in experiments/05-full-voice-loop/FullVoiceLoop.cs's BuildPrompt method — wired to all three layers and the Layer 1 trim helper.
Asset flow — two locations, one source of truth
Source of truth                     Runtime read path
---------------                     -----------------
ai-models/                          Assets/StreamingAssets/VoiceAI/
  stt/                                stt/
    whisper-small/                      whisper-small/   (PC, iOS, Android flagship)
    whisper-tiny/                       whisper-tiny/    (Quest, Android low-end)
  llm/                                llm/
    Qwen3-1.7B-Q5_K_M.gguf              Qwen3-1.7B-Q5_K_M.gguf
    gemma3-1b-q4_k_m.gguf               (Quest / low-end only; deferred v1.2)
  embeddings/                         embeddings/
    model_int8.onnx                     model_int8.onnx
    vocab.txt                           vocab.txt
  rag/                                rag/
    knowledge.db ---- editor menu ----> knowledge.db
  tts/                                tts/
    model_quantized.onnx                model_quantized.onnx
    tokenizer.json                      tokenizer.json
    voices/*.bin                        voices/*.bin
ai-models/ — repo source-of-truth
The repository root contains a checked-out copy of every model file Sauti can load, organised by stage and tagged with a manifest (ai-models/<stage>/manifest.json). The manifest records: SHA-256, source URL, license, target platforms, and lifecycle status.
These files are large. They are either checked in via Git LFS or downloaded via the Editor menu Sauti -> Download Default Models (planned).
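Since each manifest records a SHA-256, a downloaded or LFS-fetched file can be checked against it before use. The sketch below shows one way to do that with the .NET base library. ModelIntegrity is a hypothetical helper, and whether Sauti performs such a check (in the Editor or at runtime) is an assumption, not something the spec states here.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ModelIntegrity
{
    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    public static string Sha256Hex(string path)
    {
        using var sha = SHA256.Create();
        using var stream = File.OpenRead(path);
        byte[] hash = sha.ComputeHash(stream);
        return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
    }

    // Compares the file's digest with the hex string recorded in a manifest.
    public static bool Matches(string path, string expectedHex) =>
        string.Equals(Sha256Hex(path), expectedHex.Trim(), StringComparison.OrdinalIgnoreCase);
}
```

Hashing a stream rather than a byte array keeps memory use flat even for the ~1.2 GB LLM file.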
Assets/StreamingAssets/VoiceAI/ — Unity runtime path
At build time the Editor build pre-processor copies the platform-relevant subset of ai-models/ into StreamingAssets/VoiceAI/. Only the models tagged for the current build target ship.
- StreamingAssets/ is read-only at runtime on every platform. Models are read from disk; never downloaded at runtime. Fully offline.
- Android caveat: StreamingAssets/ on Android lives inside a compressed .jar and cannot be memory-mapped directly. The plugin must copy each model to Application.persistentDataPath/ on first launch and load from there.
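The copy-on-first-launch step for Android can be sketched in plain C#. This is an illustration, not the plugin's code: ModelInstaller and EnsureInstalled are hypothetical names, and the openSource delegate stands in for whatever mechanism reads the packaged file (on Android that is typically a UnityWebRequest against the StreamingAssets URL, which this sketch does not show).

```csharp
using System;
using System.IO;

public static class ModelInstaller
{
    // Copies a model to destPath unless a file of the expected size is already
    // there. Returns true when a copy was performed.
    public static bool EnsureInstalled(Func<Stream> openSource, string destPath, long expectedBytes)
    {
        var existing = new FileInfo(destPath);
        if (existing.Exists && existing.Length == expectedBytes) return false;

        string dir = Path.GetDirectoryName(destPath);
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

        string tempPath = destPath + ".tmp";
        using (Stream source = openSource())
        using (var target = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
        {
            // Streamed copy: a large model is never held in memory as a whole.
            source.CopyTo(target);
        }

        // Rename last, so an interrupted copy never leaves a partial model at destPath.
        if (File.Exists(destPath)) File.Delete(destPath);
        File.Move(tempPath, destPath);
        return true;
    }
}
```

The size check is a cheap way to skip the copy on later launches. A stricter variant could compare the manifest's SHA-256 instead, at the cost of hashing the file on every start.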
Per-platform model selection
| Platform | STT | LLM | Embeddings | TTS |
|---|---|---|---|---|
| PC (Windows / Linux) | Whisper Small | Qwen3-1.7B Q5_K_M | MiniLM | Kokoro |
| Mac (Apple Silicon) | Whisper Small | Qwen3-1.7B Q5_K_M | MiniLM | Kokoro |
| iOS / visionOS | Whisper Small | Qwen3-1.7B Q5_K_M | MiniLM | Kokoro |
| Android (flagship) | Whisper Small | Qwen3-1.7B Q5_K_M | MiniLM | Kokoro |
| Quest 2 / 3 | Whisper Tiny | Qwen3-1.7B Q5_K_M [1] | MiniLM | Kokoro |
| Android (low-end) | Whisper Tiny | Qwen3-1.7B Q5_K_M [1] | MiniLM | Kokoro |
A Quest build must not ship Whisper Small or omit the Tiny variant — the Editor build pre-processor strips unused model files per target. See Per-platform notes for designer-facing guidance.
GPU acceleration — automatic, per-runtime
| Platform | STT (ONNX) | Embeddings (ONNX) | LLM (GGUF / llama.cpp) | TTS (ONNX) |
|---|---|---|---|---|
| Windows | DirectML / CUDA | DirectML | Vulkan | DirectML |
| Mac / iOS | CoreML | CoreML | Metal | CoreML |
| Android | NNAPI | NNAPI | CPU (ARM NEON) | NNAPI |
| Quest | CPU | CPU | CPU | CPU |
All runtimes auto-detect and fall back to CPU silently. No manual configuration.
Streaming — required for conversational feel
Do not wait for the full LLM response before starting TTS. Buffer LLM tokens until a sentence boundary, then synthesise immediately. This is what makes the conversation feel responsive rather than batched.
void OnLLMToken(string cumulativeText)
{
    // Find the last sentence terminator at or after the text already spoken.
    int boundary = LastIndexOfTerminator(cumulativeText, _emittedThroughOffset);
    // Require at least 8 characters before the terminator so tiny fragments are not spoken alone.
    if (boundary >= _emittedThroughOffset + 8)
    {
        string sentence = cumulativeText.Substring(_emittedThroughOffset, boundary + 1 - _emittedThroughOffset);
        _emittedThroughOffset = boundary + 1;
        ttsEngine.SpeakAsync(sentence); // Kokoro ONNX
    }
}
Cumulative, not delta
The first callback passed to LLMUnity's LLMAgent.Chat receives the cumulative response assembled so far, not per-token deltas. Sauti's sentence-boundary loop tracks an _emittedThroughOffset cursor into the cumulative string and only emits sentences past that cursor. See experiments/03-llm-chat/LlmChat.cs and experiments/05-full-voice-loop/FullVoiceLoop.cs for the verified pattern.
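The cursor pattern can be tried outside Unity with a self-contained version. This is a sketch under assumptions, not the code in FullVoiceLoop.cs: SentenceStreamer is a hypothetical class, the terminator set (period, exclamation mark, question mark) is assumed, the speak delegate stands in for the TTS call, and Flush is an addition for trailing text that never receives a terminator.

```csharp
using System;

public sealed class SentenceStreamer
{
    private const int MinChars = 8; // same threshold as the snippet above
    private static readonly char[] Terminators = { '.', '!', '?' };

    private readonly Action<string> _speak;
    private int _emittedThroughOffset;

    public SentenceStreamer(Action<string> speak) => _speak = speak;

    // Call with the cumulative response text each time the LLM callback fires.
    public void OnCumulativeText(string cumulativeText)
    {
        int boundary = LastIndexOfTerminator(cumulativeText, _emittedThroughOffset);
        if (boundary < _emittedThroughOffset + MinChars) return;

        string chunk = cumulativeText.Substring(
            _emittedThroughOffset, boundary + 1 - _emittedThroughOffset);
        _emittedThroughOffset = boundary + 1;
        _speak(chunk.Trim()); // trimmed so the leading space after a previous sentence is dropped
    }

    // Call once when generation completes to speak any unterminated remainder.
    public void Flush(string finalText)
    {
        if (finalText.Length <= _emittedThroughOffset) return;
        string rest = finalText.Substring(_emittedThroughOffset).Trim();
        _emittedThroughOffset = finalText.Length;
        if (rest.Length > 0) _speak(rest);
    }

    // Index of the last terminator at or after `from`, or -1 if there is none.
    private static int LastIndexOfTerminator(string text, int from)
    {
        if (from >= text.Length) return -1;
        int index = text.LastIndexOfAny(Terminators);
        return index >= from ? index : -1;
    }
}
```

A splitter this simple also breaks on decimals and abbreviations ("3.5", "Dr."), which a 40-word spoken reply rarely contains. A production version would likely need a stricter boundary test.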
Target latency (user speaks to hears first word):
- PC / Mac: 1.5–2 s
- Quest: 3–5 s
LLM prompt rules for voice
Every system prompt must include the four behavioural rules:
- Respond only in plain spoken English sentences.
- No markdown, asterisks, bullet points, headers, or lists.
- Keep every response under 40 words.
- Speak as if in a live conversation.
LLM output feeds directly into Kokoro TTS — markdown or list syntax becomes spoken garbage. See Voice prompt rules for the full rule set, the per-model /no_think table, and the assembled string from Sauti's reference scaffold.
Hard constraints
- Language: English only. Whisper language is fixed to "en" in WhisperManager. Other languages are out of scope for v1.0.
- Models are read-only, static files. Never retrained or updated at runtime.
- RAG knowledge base is read-only at runtime. Rebuild offline via the Editor tool.
- Temporary memory is session-scoped. Call TemporaryMemory.Clear() on scene unload.
- Conversation history is session-scoped. Call llmAgent.ClearHistory() on session end.
- No internet required or used. Ever.
- No user audio or conversation data leaves the device.
- The two runtimes share no memory and no GPU context. Only string flows across the C# boundary.
- Android: load models from Application.persistentDataPath (copy from StreamingAssets on first launch).
Where to go next
- Dive into the memory layers
  The three layers, what they store, how they trim, and how to wire all three into a single prompt.
- Extend Sauti
  Write your own ISautiRagBackend, swap MiniLM for a smaller embedder, or author a custom prompt assembler.
- API reference
  Every public class and method in the Sauti.* namespaces, with line-number links to source.
- Try the experiments
  Six runnable scenes that demonstrate each pipeline stage in isolation, then composed.
[1] v1.2 status: Quest LLM falls back to Qwen3-1.7B. Gemma3-1B Q4_K_M was the spec's intended Quest pick (smaller footprint, ~0.7 GB) but is deferred to post-v1.2 because Gemma's non-SPDX Terms of Use require manual Hugging Face acceptance. v1.2 Quest builds ship Qwen3-1.7B-Q5_K_M (~1.26 GB); on a Quest 3's 8 GB RAM, headroom is tight but functional. Future releases can re-introduce Gemma3 by flipping its manifest entry from status: deferred to status: ready after the TOS is accepted.