Architecture

The canonical specification for Sauti's runtime lives in memory/voice_ai_architecture.md. This page reframes that spec for an outside reader. If the two disagree, the canonical file wins.

The one-line architecture

Mic -> Whisper ONNX -> text -> Memory (history + RAG + temp KV) -> enriched prompt
                                                     |
                                                     v
                                       Qwen3 / Gemma3 GGUF -> tokens -> Kokoro ONNX -> Audio

Four pipeline stages, each using the model format that suits it best. The two runtimes (ONNX Runtime and llama.cpp) share no memory and no GPU context. They exchange data only as C# string values.

The hybrid-runtime invariant

Sauti runs two model runtimes side by side:

Runtime	Stages	Why
ONNX Runtime (via `asus4/onnxruntime-unity`)	STT, embeddings, TTS	Best format for these workloads. Quantised INT8 weights, CPU/GPU EPs (CoreML / DirectML / NNAPI), small footprints.
llama.cpp (via `undreamai/LLMUnity`)	LLM	Purpose-built KV-cache. Q4/Q5 GGUF weights mmap-friendly. Streaming token callback API. Metal / Vulkan offload that just works.

The earlier "single ONNX runtime" design was reversed because GGUF + llama.cpp is materially better than ONNX for autoregressive LLM inference on consumer CPUs and mobile/VR. The hybrid cost is paid once, in build configuration; the benefit recurs on every inference.

Invariant

The two runtimes only ever exchange string values across the C# boundary. There is no native interop between them. If that invariant is ever broken, the hybrid decision is no longer safe and must be reopened.

This invariant has consequences:

- All data crossing between runtimes is text. Audio (PCM float[]) never crosses the boundary; only the transcript or generated string does.
- A stage that needs structured data from the LLM gets it as JSON-in-a-string and parses it on the C# side. There is no "shared tensor" path.
- Failure modes are isolated. A llama.cpp crash cannot corrupt an ONNX Runtime session and vice versa.
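
The JSON-in-a-string consequence can be sketched as follows. This is a minimal illustration under stated assumptions, not Sauti's implementation: the `quest` / `stage` payload and the `TryParseQuestUpdate` helper are hypothetical, and `System.Text.Json` is assumed to be available (a Unity project might use a different JSON parser).

```csharp
using System;
using System.Text.Json;

public static class LlmJsonBoundary
{
    // Hypothetical payload shape: {"quest": "find the key", "stage": 2}
    public static bool TryParseQuestUpdate(string llmOutput, out string quest, out int stage)
    {
        quest = string.Empty;
        stage = 0;
        if (string.IsNullOrEmpty(llmOutput)) return false;

        // Models often wrap JSON in extra words, so keep only the outermost braces.
        int start = llmOutput.IndexOf('{');
        int end = llmOutput.LastIndexOf('}');
        if (start < 0 || end <= start) return false;

        try
        {
            using (JsonDocument doc = JsonDocument.Parse(llmOutput.Substring(start, end - start + 1)))
            {
                JsonElement root = doc.RootElement;
                if (!root.TryGetProperty("quest", out JsonElement q) || q.ValueKind != JsonValueKind.String) return false;
                if (!root.TryGetProperty("stage", out JsonElement s) || s.ValueKind != JsonValueKind.Number) return false;
                if (!s.TryGetInt32(out stage)) return false;
                quest = q.GetString() ?? string.Empty;
                return true;
            }
        }
        catch (JsonException)
        {
            // Malformed JSON is treated as "no structured data", never as a crash.
            return false;
        }
    }
}
```

Returning `false` instead of throwing keeps a malformed LLM reply from breaking the voice loop; the caller can fall back to treating the reply as plain speech.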

Runtime stack

Stage	Model	Format	Runtime	Bundled size
STT	Whisper Small	ONNX INT8	`asus4/onnxruntime-unity` (via `whisper.unity`)	~230 MB
STT (Quest / low-end)	Whisper Tiny	ONNX INT8	same	~38 MB
Embeddings (RAG)	`all-MiniLM-L6-v2`	ONNX INT8	`asus4/onnxruntime-unity`	~22 MB
LLM	Qwen3-1.7B	GGUF Q5_K_M	LLMUnity (llama.cpp)	~1.2 GB
LLM (Quest / low-end)	Gemma3-1B (deferred)	GGUF Q4_K_M	LLMUnity (llama.cpp)	~0.7 GB
TTS	Kokoro 82M	ONNX INT8	`asus4/onnxruntime-unity`	~42 MB

See the AI models catalogue for SHA-256 hashes and exact byte counts.

The three-layer memory architecture

+----------------------------------------------------------------+
|  Layer 1  Conversation History    rolling 10-turn window       |
|  Layer 2  Temporary Memory        Dict<string,string> in C#    |
|  Layer 3  Vector DB (RAG)         pre-built knowledge.db       |
+----------------------------------------------------------------+
                            |
                            v
                  Combined into one prompt
                            |
                            v
                     LLM inference call

Each layer has a different lifecycle, a different write path, and a different reason to exist.

Layer 1: Conversation history

- Scope: current session only. Cleared on app exit via await llmAgent.ClearHistory().
- Storage: List<ChatMessage> exposed as llmAgent.chat, managed by LLMUnity.LLMAgent internally.
- Behaviour: LLMUnity manages history by context-window-fill (overflowStrategy + overflowTargetRatio), not by message count.
- Sauti convention: for a hard 10-turn cap, set overflowStrategy to truncate and keep llmAgent.chat.Count <= 20 (10 user + 10 assistant messages) by trimming from the front each turn.

// Sauti-side hard cap:
while (llmAgent.chat.Count > 20) llmAgent.chat.RemoveAt(0);

See Memory layers — Layer 1 for the full pattern.
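
As a variation on the hard cap, the sketch below trims whole user/assistant pairs in one call, so the window keeps starting on a user message. It is an illustration only: `HistoryWindow.TrimToTurns` is a hypothetical helper, and it assumes the list holds strictly alternating user and assistant messages with no system message at index 0.

```csharp
using System;
using System.Collections.Generic;

public static class HistoryWindow
{
    // Removes the oldest messages so that at most maxTurns user/assistant
    // pairs remain. Returns the number of messages removed.
    public static int TrimToTurns<T>(List<T> chat, int maxTurns)
    {
        if (chat == null) throw new ArgumentNullException(nameof(chat));
        if (maxTurns < 0) throw new ArgumentOutOfRangeException(nameof(maxTurns));

        int excess = chat.Count - maxTurns * 2;
        if (excess <= 0) return 0;

        // Round up to an even count so a pair is never split in half.
        if (excess % 2 != 0) excess++;
        excess = Math.Min(excess, chat.Count);

        // One RemoveRange shifts the list once, unlike repeated RemoveAt(0).
        chat.RemoveRange(0, excess);
        return excess;
    }
}
```

With the 10-turn convention this would be called as `HistoryWindow.TrimToTurns(llmAgent.chat, 10)` once per turn, provided the alternation assumption holds for the LLMUnity version in use.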

Layer 2: Temporary memory

Named facts learned mid-session (player name, current quest, stated preferences). Survives turn rollovers; gone on app exit. Implemented as a pure-C# static class — no Unity API dependency, unit-testable headlessly.

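using System.Collections.Generic;
using System.Linq;

public static class TemporaryMemory
{
    // Session-scoped named facts. Static and Unity-free, so it can be tested headlessly.
    private static readonly Dictionary<string, string> _store = new();

    // Adds a fact, or overwrites the value already stored under the same key.
    public static void Set(string key, string value) => _store[key] = value;
    public static void Clear() => _store.Clear();

    // Renders every stored fact as one prompt line; empty when nothing is stored.
    public static string BuildPromptBlock()
    {
        if (_store.Count == 0) return string.Empty;
        var facts = string.Join(", ", _store.Select(kv => $"{kv.Key}={kv.Value}"));
        return $"Known facts about this session: {facts}.\n";
    }
}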

Source: Assets/Sauti/Runtime/Scripts/TemporaryMemory.cs.

Layer 3: Vector database (RAG)

Semantic search over a pre-built, read-only knowledge base (lore, NPC backstories, world facts).

- Scope: persistent, read-only, built offline, bundled with the plugin.
- Storage: flat binary file (knowledge.db) at Assets/StreamingAssets/VoiceAI/rag/knowledge.db. Built by an Editor tool that converts plain-text sources under knowledge-base/ into 384-dim embeddings.
- Embedding model: all-MiniLM-L6-v2 ONNX INT8. The same model encodes both the knowledge base (offline) and each user query (at runtime). Mixing encoders breaks semantic similarity.
- Top-K: numResults defaults to 3.

The build pipeline is implemented by Sauti.Editor.Rag.RagDatabaseBuilder and writes to both ai-models/rag/knowledge.db (source-of-truth) and Assets/StreamingAssets/VoiceAI/rag/knowledge.db (runtime read path), keeping the two in lockstep.

See Memory layers — Layer 3 for the embedder and chunker internals.
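
To make the retrieval step concrete, here is a minimal brute-force top-K search by cosine similarity. It is a sketch under assumptions: the on-disk layout of knowledge.db is not described on this page, so the chunk embeddings are taken as already loaded into memory, and `FlatVectorSearch` is a hypothetical name rather than Sauti's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlatVectorSearch
{
    // Cosine similarity between two vectors of equal dimension (384 for MiniLM).
    public static float Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0f; // A zero vector matches nothing.
        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }

    // Returns the indices of the numResults most similar chunks, best first.
    public static int[] TopK(float[] query, IReadOnlyList<float[]> chunks, int numResults = 3)
    {
        return Enumerable.Range(0, chunks.Count)
            .OrderByDescending(i => Cosine(query, chunks[i]))
            .Take(numResults)
            .ToArray();
    }
}
```

A linear scan like this is usually adequate for a small, read-only knowledge base. The returned indices would map back to the chunk texts that are passed to the prompt as ragChunks.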

Prompt assembly: how all three layers combine

This is the single most important code shape in Sauti. The following is the verbatim § 4.5 pattern from voice_ai_architecture.md:

string BuildPrompt(string userMessage, string[] ragChunks)
{
    var sb = new StringBuilder();
    sb.AppendLine("Respond only in plain spoken English sentences. No markdown. Under 40 words. /no_think");
    sb.Append(TemporaryMemory.BuildPromptBlock());          // Layer 2
    if (ragChunks.Length > 0)                                // Layer 3
    {
        sb.AppendLine("Relevant context:");
        foreach (var chunk in ragChunks) sb.AppendLine($"- {chunk}");
    }
    // Layer 1: conversation history is appended internally by LLMUnity
    return sb.ToString();
}

The system-prompt rules come from § 9 (see Voice prompt rules). The /no_think tail is a Qwen3-specific directive (§ 9.1).

The real implementation lives in experiments/05-full-voice-loop/FullVoiceLoop.cs's BuildPrompt method — wired to all three layers and the Layer 1 trim helper.

Asset flow: two locations, one source of truth

                  Source of truth                        Runtime read path
                  ---------------                        -----------------
                  ai-models/                             Assets/StreamingAssets/VoiceAI/
                    stt/                                   stt/
                      whisper-small/                         whisper-small/ (PC, iOS, Android flagship)
                      whisper-tiny/                          whisper-tiny/  (Quest, Android low-end)
                    llm/                                   llm/
                      Qwen3-1.7B-Q5_K_M.gguf                 Qwen3-1.7B-Q5_K_M.gguf
                      gemma3-1b-q4_k_m.gguf                  (Quest / low-end only — deferred v1.2)
                    embeddings/                            embeddings/
                      model_int8.onnx                        model_int8.onnx
                      vocab.txt                              vocab.txt
                    rag/                                   rag/
                      knowledge.db  ----- editor menu ---->   knowledge.db
                    tts/                                   tts/
                      model_quantized.onnx                   model_quantized.onnx
                      tokenizer.json                         tokenizer.json
                      voices/*.bin                           voices/*.bin

`ai-models/`: repo source of truth

The repository root contains a checked-out copy of every model file Sauti can load, organised by stage and tagged with a manifest (ai-models/<stage>/manifest.json). The manifest records: SHA-256, source URL, license, target platforms, and lifecycle status.

These files are large. They are either checked in via Git LFS or downloaded via the Editor menu Sauti -> Download Default Models (planned).
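
Because each manifest records a SHA-256, a downloaded or checked-out model file can be verified before use. The sketch below shows one way to do that with the standard .NET hashing API. `ModelIntegrity` is a hypothetical helper; reading the expected hash out of manifest.json is left out because the manifest's exact field names are not given here.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ModelIntegrity
{
    // Streams the file through SHA-256 and returns the lowercase hex digest.
    public static string Sha256Hex(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the file's digest equals the expected hex string (case-insensitive).
    public static bool Matches(string path, string expectedSha256Hex)
    {
        return string.Equals(Sha256Hex(path), expectedSha256Hex.Trim(), StringComparison.OrdinalIgnoreCase);
    }
}
```

Hashing from a stream avoids loading a multi-hundred-megabyte model into memory just to check it.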

`Assets/StreamingAssets/VoiceAI/`: Unity runtime path

At build time the Editor build pre-processor copies the platform-relevant subset of ai-models/ into StreamingAssets/VoiceAI/. Only the models tagged for the current build target ship.

- StreamingAssets/ is read-only at runtime on every platform. Models are read from disk and never downloaded at runtime. Fully offline.
- Android caveat: StreamingAssets/ on Android lives inside the compressed APK (addressed through a jar: URL) and cannot be memory-mapped directly. The plugin must copy each model to Application.persistentDataPath/ on first launch and load from there.
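
The first-launch copy can be sketched in plain C# as below. This is a hedged illustration, not the plugin's code: `ModelCache.EnsureLocalCopy` is a hypothetical helper, `persistentRoot` stands in for Application.persistentDataPath, and `openSource` stands in for whatever mechanism reads the packaged StreamingAssets entry in a Unity build.

```csharp
using System;
using System.IO;

public static class ModelCache
{
    // Copies a packaged model to a real file path once, then reuses that copy.
    public static string EnsureLocalCopy(string persistentRoot, string relativePath, Func<Stream> openSource)
    {
        string target = Path.Combine(persistentRoot, relativePath);
        if (File.Exists(target)) return target; // Already copied on an earlier launch.

        string directory = Path.GetDirectoryName(target);
        if (!string.IsNullOrEmpty(directory)) Directory.CreateDirectory(directory);

        // Write to a temp file first so an interrupted copy never looks complete.
        string temp = target + ".tmp";
        using (Stream source = openSource())
        using (FileStream destination = File.Create(temp))
        {
            source.CopyTo(destination);
        }
        File.Move(temp, target);
        return target;
    }
}
```

This version only checks for existence, so it would not notice a model that changed between app versions. Comparing against the manifest hash or a version marker would be one way to handle that.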

Per-platform model selection

Platform	STT	LLM	Embeddings	TTS
PC (Windows / Linux)	Whisper Small	Qwen3-1.7B Q5_K_M	MiniLM	Kokoro
Mac (Apple Silicon)	Whisper Small	Qwen3-1.7B Q5_K_M	MiniLM	Kokoro
iOS / visionOS	Whisper Small	Qwen3-1.7B Q5_K_M	MiniLM	Kokoro
Android (flagship)	Whisper Small	Qwen3-1.7B Q5_K_M	MiniLM	Kokoro
Quest 2 / 3	Whisper Tiny	Qwen3-1.7B Q5_K_M ¹	MiniLM	Kokoro
Android (low-end)	Whisper Tiny	Qwen3-1.7B Q5_K_M ¹	MiniLM	Kokoro

A Quest build must not ship Whisper Small or omit the Tiny variant — the Editor build pre-processor strips unused model files per target. See Per-platform notes for designer-facing guidance.
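
The selection table can be expressed as a small lookup, sketched below. The `SautiTarget` enum and `PlatformModels.Select` are hypothetical names for illustration. The model identifiers are taken from the asset-flow listing above, and the embeddings and TTS models are omitted because they are the same on every platform.

```csharp
using System;

public enum SautiTarget { Pc, Mac, Ios, AndroidFlagship, AndroidLowEnd, Quest }

public static class PlatformModels
{
    // Returns the STT folder and LLM file that a given build target should ship.
    public static (string Stt, string Llm) Select(SautiTarget target)
    {
        // v1.2 ships the same LLM everywhere; only the Whisper variant differs.
        const string llm = "Qwen3-1.7B-Q5_K_M.gguf";
        switch (target)
        {
            case SautiTarget.Quest:
            case SautiTarget.AndroidLowEnd:
                return ("whisper-tiny", llm);
            case SautiTarget.Pc:
            case SautiTarget.Mac:
            case SautiTarget.Ios:
            case SautiTarget.AndroidFlagship:
                return ("whisper-small", llm);
            default:
                throw new ArgumentOutOfRangeException(nameof(target));
        }
    }
}
```

Throwing on an unknown target, rather than defaulting to Whisper Small, makes a newly added platform fail loudly instead of silently shipping the larger model.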

GPU acceleration: automatic, per-runtime

Platform	STT (ONNX)	Embeddings (ONNX)	LLM (GGUF / llama.cpp)	TTS (ONNX)
Windows	DirectML / CUDA	DirectML	Vulkan	DirectML
Mac / iOS	CoreML	CoreML	Metal	CoreML
Android	NNAPI	NNAPI	CPU (ARM NEON)	NNAPI
Quest	CPU	CPU	CPU	CPU

Each runtime auto-detects the available accelerator and silently falls back to CPU when none is usable. No manual configuration is required.

Streaming: required for conversational feel

Do not wait for the full LLM response before starting TTS. Buffer LLM tokens until a sentence boundary, then synthesise immediately. This is what makes the conversation feel responsive rather than batched.

void OnLLMToken(string cumulativeText)
{
    int boundary = LastIndexOfTerminator(cumulativeText, _emittedThroughOffset);
    if (boundary >= _emittedThroughOffset + 8)
    {
        string sentence = cumulativeText.Substring(_emittedThroughOffset, boundary + 1 - _emittedThroughOffset);
        _emittedThroughOffset = boundary + 1;
        ttsEngine.SpeakAsync(sentence);   // Kokoro ONNX
    }
}

Cumulative, not delta

The first callback passed to LLMUnity's LLMAgent.Chat receives the cumulative assembled response so far, not per-token deltas. Sauti's sentence-boundary loop tracks an _emittedThroughOffset cursor into the cumulative string and only emits sentences past that cursor. See experiments/03-llm-chat/LlmChat.cs and experiments/05-full-voice-loop/FullVoiceLoop.cs for the verified pattern.
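
The cursor logic can be exercised without Unity or a TTS engine. The sketch below is a self-contained approximation and not the code from the experiments: `SentenceChunker`, its terminator set, and the `Flush` step are assumptions added for illustration. The 8-character minimum mirrors the guard in OnLLMToken.

```csharp
using System;

public sealed class SentenceChunker
{
    private const int MinChunkLength = 8; // Same threshold as the "+ 8" guard above.
    private static readonly char[] Terminators = { '.', '!', '?' };
    private int _emittedThroughOffset;

    // Index of the last terminator at or after startIndex, or -1 if none.
    public static int LastIndexOfTerminator(string text, int startIndex)
    {
        if (startIndex >= text.Length) return -1;
        int index = text.LastIndexOfAny(Terminators);
        return index >= startIndex ? index : -1;
    }

    // Feed the cumulative response so far; returns newly completed text or null.
    public string Push(string cumulativeText)
    {
        int boundary = LastIndexOfTerminator(cumulativeText, _emittedThroughOffset);
        if (boundary < _emittedThroughOffset + MinChunkLength) return null;

        string chunk = cumulativeText.Substring(_emittedThroughOffset, boundary + 1 - _emittedThroughOffset);
        _emittedThroughOffset = boundary + 1;
        return chunk.Trim();
    }

    // Call once generation ends to emit trailing text and reset the cursor.
    public string Flush(string finalText)
    {
        string rest = _emittedThroughOffset < finalText.Length
            ? finalText.Substring(_emittedThroughOffset).Trim()
            : string.Empty;
        _emittedThroughOffset = 0;
        return rest.Length > 0 ? rest : null;
    }
}
```

Each non-null result from `Push` or `Flush` would be handed to the TTS engine. Note that a plain terminator scan also splits on the dot in a number such as "1.5", so a production version may need a stricter boundary test.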

Target latency (from user speech to first audible word):

- PC / Mac: 1.5–2 s
- Quest: 3–5 s

LLM prompt rules for voice

Every system prompt must include the four behavioural rules:

- Respond only in plain spoken English sentences.
- No markdown, asterisks, bullet points, headers, or lists.
- Keep every response under 40 words.
- Speak as if in a live conversation.

LLM output feeds directly into Kokoro TTS — markdown or list syntax becomes spoken garbage. See Voice prompt rules for the full rule set, the per-model /no_think table, and the assembled string from Sauti's reference scaffold.

Hard constraints

- Language: English only. Whisper language is fixed to "en" in WhisperManager. Other languages are out of scope for v1.0.
- Models are read-only, static files. Never retrained or updated at runtime.
- RAG knowledge base is read-only at runtime. Rebuild offline via the Editor tool.
- Temporary memory is session-scoped. Call TemporaryMemory.Clear() on scene unload.
- Conversation history is session-scoped. Call llmAgent.ClearHistory() on session end.
- No internet required or used. Ever.
- No user audio or conversation data leaves the device.
- The two runtimes share no memory and no GPU context. Only string flows across the C# boundary.
- Android: load models from Application.persistentDataPath (copy from StreamingAssets on first launch).

Where to go next

Dive into the memory layers

The three layers, what they store, how they trim, and how to wire all three into a single prompt.

-> Memory layers

Extend Sauti

Write your own ISautiRagBackend, swap MiniLM for a smaller embedder, or author a custom prompt assembler.

-> Extending Sauti

API reference

Every public class and method in the Sauti.* namespaces, with line-number links to source.

-> API reference

Try the experiments

Six runnable scenes that demonstrate each pipeline stage in isolation, then composed.

-> Experiments

¹ v1.2 status: Quest LLM falls back to Qwen3-1.7B. Gemma3-1B Q4_K_M was the spec's intended Quest pick (smaller footprint, ~0.7 GB) but is deferred to post-v1.2 because Gemma's non-SPDX Terms of Use require manual Hugging Face acceptance. v1.2 Quest builds ship Qwen3-1.7B-Q5_K_M (~1.26 GB); on a Quest 3's 8 GB RAM, headroom is tight but functional. Future releases can re-introduce Gemma3 by flipping its manifest entry from status: deferred to status: ready after the TOS is accepted.