Experiment 03 — LLM Chat
Text -> Qwen3 / Gemma3 GGUF via LLMUnity -> streamed tokens -> on-screen text + sentence-boundary UnityEvent<string>. The sentence event is the integration seam: EXP-05 (full voice loop) plugs Kokoro TTS into it without changing this scaffold.
The scaffold lives at experiments/03-llm-chat/. The full README is at experiments/03-llm-chat/README.md.
What this experiment proves
- LLMUnity (which wraps llama.cpp) initialises against either Qwen3-1.7B-Q5_K_M.gguf (flagship) or gemma3-1b-q4_k_m.gguf (Quest / low-end; deferred to v1.2).
- Tokens stream incrementally: the on-screen label grows as tokens arrive, not in one final blob.
- The sentence-boundary buffer fires OnSentenceStreamed(sentence) per terminator (., !, or ?) that lies at least 8 characters past the last emitted offset, matching voice_ai_architecture.md § 8 verbatim.
- The four voice prompt rules from § 9 + the Qwen3 /no_think directive from § 9.1 hold under inference.
Code walkthrough
Source: experiments/03-llm-chat/LlmChat.cs.
The MonoBehaviour:
- On Awake, picks the first GGUF found in Assets/StreamingAssets/VoiceAI/llm/. Order: Qwen3, then Gemma3.
- Adds an LLMUnity.LLM component, calls SetModel(path), awaits WaitUntilReady.
- Adds an LLMUnity.LLMAgent component, assigns llmAgent.llm = _llm (the backend reference), and sets llmAgent.systemPrompt = AssembleSystemPrompt().
- On Ask(), fires _ = llmAgent.Chat(prompt, OnCumulative, HandleStreamComplete, addToHistory: true).
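The model-selection step above (first GGUF found, Qwen3 before Gemma3) can be sketched in plain C#. This is a minimal illustration, not the scaffold's actual code: the ModelResolver class, the ResolveModelPath method, and the hard-coded preference array are hypothetical names, and the sketch assumes the model directory is readable through ordinary file APIs.

```csharp
using System.IO;

// Hypothetical helper; the real LlmChat.cs may resolve the model differently.
public static class ModelResolver
{
    // Preference order taken from the walkthrough: Qwen3 first, then Gemma3.
    private static readonly string[] PreferredModels =
    {
        "Qwen3-1.7B-Q5_K_M.gguf",
        "gemma3-1b-q4_k_m.gguf",
    };

    // Returns the full path of the first preferred model that exists
    // in modelDir, or null when none of them is present.
    public static string ResolveModelPath(string modelDir)
    {
        foreach (string fileName in PreferredModels)
        {
            string candidate = Path.Combine(modelDir, fileName);
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

In a Unity project the directory argument would presumably be built from Application.streamingAssetsPath plus VoiceAI/llm; a null result is the point at which the caller should log an error instead of calling SetModel.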
The critical AssembleSystemPrompt method — verbatim from the source:
private string AssembleSystemPrompt()
{
    // voice_ai_architecture.md § 9 rules verbatim.
    // /no_think is Qwen3-specific (Gemma3 ignores per memory/api_surfaces.md). For Gemma3 builds
    // the directive is harmless but unused; tracked under VOICE-AI-SPEC-FIX-001 for a proper
    // model-branched prompt assembler.
    return
        "Respond only in plain spoken English sentences. " +
        "No markdown, asterisks, bullet points, headers, or lists. " +
        "Keep every response under 40 words. " +
        "Speak as if in a live conversation. " +
        "/no_think";
}
This is the canonical Sauti system prompt the rest of the experiments inherit from.
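The source comment mentions a model-branched prompt assembler as future work. One possible shape for it is sketched below, purely as an illustration: the VoicePrompt class, the Assemble method, and the rule "file name starts with Qwen3" are assumptions of this sketch, not the tracked fix. For a Qwen3 model it produces exactly the string returned by AssembleSystemPrompt above.

```csharp
using System;
using System.IO;

// Hypothetical sketch of a model-branched assembler; not the scaffold's code.
public static class VoicePrompt
{
    // The four voice prompt rules, copied from AssembleSystemPrompt.
    private const string BaseRules =
        "Respond only in plain spoken English sentences. " +
        "No markdown, asterisks, bullet points, headers, or lists. " +
        "Keep every response under 40 words. " +
        "Speak as if in a live conversation.";

    // Appends /no_think only when the resolved model looks like Qwen3.
    // Detecting the model by file-name prefix is an assumption of this sketch.
    public static string Assemble(string modelPath)
    {
        string fileName = Path.GetFileName(modelPath) ?? string.Empty;
        bool isQwen3 = fileName.StartsWith("Qwen3", StringComparison.OrdinalIgnoreCase);
        return isQwen3 ? BaseRules + " /no_think" : BaseRules;
    }
}
```

Keeping the base rules in one constant means both model branches share the same four rules and differ only in the trailing directive.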
Sentence-boundary buffer
The first callback to LLMAgent.Chat receives cumulative assembled text, not per-token deltas. The scaffold tracks an _emittedThroughOffset cursor:
private void OnCumulative(string cumulative)
{
    if (string.IsNullOrEmpty(cumulative)) return;
    _lastCumulativeLen = cumulative.Length;

    // Only look at text that has not been emitted yet.
    int searchStart = _emittedThroughOffset;
    int boundary = LastIndexOfTerminator(cumulative, searchStart);

    // Emit only when the terminator lies at least minSentenceOffset chars past the cursor.
    if (boundary >= searchStart + minSentenceOffset)
    {
        string sentence = cumulative.Substring(searchStart, boundary + 1 - searchStart);
        _emittedThroughOffset = boundary + 1; // advance the cursor past the terminator
        OnSentenceStreamed?.Invoke(sentence);
    }
}
When the buffer hits a ., !, or ? at least 8 characters past the last emitted offset, the text from the cursor through that terminator is extracted, the cursor advances, and OnSentenceStreamed(sentence) fires. The minSentenceOffset = 8 guard prevents one-word sentences from spamming the TTS hook.
This pattern is reused verbatim by experiments/05-full-voice-loop/FullVoiceLoop.cs — the orchestration is the same, only the upstream source of Chat calls differs.
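To see the cursor logic in isolation, the sketch below re-creates it without Unity or LLMUnity. It is an approximation under stated assumptions: SentenceBuffer and Push are hypothetical names, and LastIndexOfTerminator here is a stand-in written for this example, since the scaffold's helper is not shown in the excerpt.

```csharp
using System;

// Hypothetical, Unity-free re-creation of the OnCumulative cursor logic.
public sealed class SentenceBuffer
{
    private static readonly char[] Terminators = { '.', '!', '?' };
    private readonly int _minSentenceOffset;
    private int _emittedThroughOffset;

    public SentenceBuffer(int minSentenceOffset = 8)
    {
        _minSentenceOffset = minSentenceOffset;
    }

    // Stand-in helper (assumed behaviour): index of the last terminator
    // at or after searchStart, or -1 when there is none.
    private static int LastIndexOfTerminator(string text, int searchStart)
    {
        int index = text.LastIndexOfAny(Terminators);
        return index >= searchStart ? index : -1;
    }

    // Takes the cumulative text and returns the newly completed sentence,
    // or null when no boundary far enough past the cursor has arrived yet.
    public string Push(string cumulative)
    {
        if (string.IsNullOrEmpty(cumulative)) return null;

        int searchStart = _emittedThroughOffset;
        int boundary = LastIndexOfTerminator(cumulative, searchStart);
        if (boundary < searchStart + _minSentenceOffset) return null;

        _emittedThroughOffset = boundary + 1;
        return cumulative.Substring(searchStart, boundary + 1 - searchStart);
    }
}
```

Feeding it "Hi." returns null because the terminator is only 2 characters past the cursor. Feeding it "Hi. Stormwall lies on the northern coast." next returns the whole string: with this logic a short sentence is not dropped, it is held back and emitted together with the next boundary.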
Manual scene creation
Follow experiments/03-llm-chat/ChatScene.unity.placeholder.md. The short version:
- New empty scene; save as ChatScene.unity under experiments/03-llm-chat/.
- Empty GameObject named LlmChat. Attach LlmChat.cs.
- Canvas with: a TMP_InputField, a UI Button, a TextMeshProUGUI output label.
- Wire the button's OnClick to LlmChat.Ask. Bind the input field's text to the prompt field on LlmChat via your favourite UI-bind pattern.
- Wire LlmChat.OnToken (or OnSentenceStreamed) to update the output label.
- Press Play. Type a question. Click Ask.
Expected console output:
[Sauti][LLM] init model=Qwen3-1.7B-Q5_K_M.gguf ready=true
[Sauti][LLM] ask len=N model=Qwen3-1.7B-Q5_K_M.gguf
[Sauti][LLM] sentence "Stormwall lies on the northern coast."
[Sauti][LLM] full response: "..."
Try this
Three modifications to try:
- Watch the cumulative-not-delta behaviour. Add a Debug.Log($"cumulative len={cumulative.Length}: {cumulative}") at the top of OnCumulative. You'll see the string grow with every callback; that's the LLMUnity API contract. Diff against the previous value if you need true per-token deltas.
- Disable /no_think. Edit AssembleSystemPrompt and remove the /no_think tail. Re-run. Notice the Qwen3 reply may now include <think>...</think> blocks before the actual response. That's reasoning-mode output: useful for debugging but unwanted in a voice pipeline (Kokoro would read the think tags aloud).
- Try a longer prompt. The default scaffold prompt is short. Paste in a multi-paragraph user query. Watch the streaming behaviour change: first-token TTFA is similar, but total response time scales with output length.
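For the first modification, diffing against the previous callback value could look like the sketch below. DeltaTracker and NextDelta are hypothetical names introduced for illustration; the only assumption carried over from the text is that each callback delivers the full text so far.

```csharp
using System;

// Hypothetical helper that turns cumulative callbacks into per-callback deltas.
public sealed class DeltaTracker
{
    private string _previous = string.Empty;

    // Returns the text added since the last call. If the new value does not
    // extend the previous one, it is treated as the start of a fresh response.
    public string NextDelta(string cumulative)
    {
        if (cumulative == null) return string.Empty;

        string delta = cumulative.StartsWith(_previous, StringComparison.Ordinal)
            ? cumulative.Substring(_previous.Length)
            : cumulative;

        _previous = cumulative;
        return delta;
    }

    // Call between questions so the next response starts from an empty baseline.
    public void Reset()
    {
        _previous = string.Empty;
    }
}
```

A delta produced this way covers whatever arrived between two callbacks, which may be more than one token if the callback does not fire per token.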
Known limitations
- /no_think is hard-coded. Gemma3 doesn't honour it; the directive becomes harmless-but-pointless when Gemma3 is the resolved model. Per-model branching is tracked as VOICE-AI-SPEC-FIX-001 follow-up.
- No conversation history yet. EXP-03 is single-shot Q&A; the rolling 10-turn history per voice_ai_architecture.md § 4.1 lands in EXP-05.
- No streaming TTS. The sentence event is wired but nothing subscribes by default. EXP-05 connects a Kokoro runner.
Cross-references
- Upstream: undreamai/LLMUnity
- API surface notes: memory/api_surfaces.md, LLMUnity.LLMAgent section
- AI models — LLM
- Spec: voice_ai_architecture.md § 8 + § 9 + § 9.1
- Voice prompt rules: Voice prompt rules
- Previous experiment: 02 — STT Loopback
- Next experiment: 04 — RAG Grounding