Experiment 06 — VR Quest NPC

The VR variant of the integrated voice loop. Quest 3 controller trigger starts mic capture -> Whisper Tiny ONNX -> memory + RAG -> Qwen3 GGUF -> Kokoro TTS -> spatialised audio at the NPC's position. Demonstrates the Quest-platform path through the Sauti pipeline.

The scaffold lives at experiments/06-vr-quest-npc/. The full README is at experiments/06-vr-quest-npc/README.md.

What this experiment proves

The four pipeline stages from voice_ai_architecture.md § 0 run on Quest with the Quest-targeted model variants (Whisper Tiny + Qwen3 ¹ + MiniLM + Kokoro).
XR Toolkit controller bindings drive push-to-talk without conflicting with the existing audio capture path.
The NPC's spatial position is honoured — audio plays from the NPC GameObject's AudioSource (3D), not the player's camera.
The Sauti pipeline is the same on Quest as on flagship — only the model selection differs (handled by the runtime-detection convention from EXP-02 / 03 / 05).
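
As a rough picture of that runtime-detection convention, the sketch below walks a priority-ordered list of variants and returns the first one that is present. It is not the EXP-05 code: the class name, method name, and the injected `exists` probe are all hypothetical. The probe is injected because on Android the StreamingAssets folder is packed inside the APK, so a plain `System.IO.File.Exists` check cannot see it; how the real resolver probes for presence is not shown in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class ModelVariantResolver
{
    // Returns the first candidate the probe reports as present,
    // or null when none of the candidates ships in this build.
    public static string PickFirstPresent(
        IEnumerable<string> candidatesInPriorityOrder,
        Func<string, bool> exists)
    {
        foreach (string candidate in candidatesInPriorityOrder)
        {
            if (exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

With the Quest-first ordering `{ "whisper-tiny", "whisper-small" }` and a probe that only reports `whisper-tiny` as present, the call returns `"whisper-tiny"`. On a build that ships only `whisper-small`, the same call falls through to it.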

Code walkthrough

Source: experiments/06-vr-quest-npc/QuestVrCompanion.cs.

The MonoBehaviour mirrors FullVoiceLoop.cs for the inner orchestration (mic -> STT -> memory + RAG -> LLM -> sentence-stream), but adds:

XR controller polling. On Update, reads the right-hand controller's primary trigger via UnityEngine.XR.InputDevices.GetDeviceAtXRNode(XRNode.RightHand). Trigger-down calls StartListening; trigger-up calls StopAndProcess. (The XR binding is fenced as XR-API-001 — confirm against the modern XR Interaction Toolkit InputAction pattern when you wire your own.)
Spatial audio playback. The Kokoro TTS PCM is wrapped in an AudioClip and played through an AudioSource attached to the NPC GameObject (a child of the scene, not the camera). The AudioSource uses 3D spatial blend = 1.0, so distance attenuation works.
Quest-aware model picks. The same model-resolution code from EXP-05 picks whisper-tiny/encoder_model_quantized.onnx before whisper-small/... based on filename presence under StreamingAssets/. On a properly-built Quest APK, only whisper-tiny/ ships (the build pre-processor strips the unused variant).
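
The trigger-down / trigger-up behaviour described for XR controller polling amounts to edge detection on a per-frame boolean. The sketch below isolates that logic so it can be reasoned about without a headset. The type names are hypothetical and this is not the code in QuestVrCompanion.cs. In a MonoBehaviour's Update, the `isPressed` argument could come from `InputDevices.GetDeviceAtXRNode(XRNode.RightHand).TryGetFeatureValue(CommonUsages.triggerButton, out bool pressed)`; which feature usage the scaffold actually reads is an assumption here.

```csharp
// Hypothetical types, for illustration only.
public enum PushToTalkEvent
{
    None,
    StartListening,
    StopAndProcess
}

public sealed class PushToTalkEdgeDetector
{
    private bool _wasPressed;

    // Call once per frame with the current trigger state.
    // Reports a rising edge as StartListening, a falling edge as
    // StopAndProcess, and None while the state is unchanged.
    public PushToTalkEvent Step(bool isPressed)
    {
        PushToTalkEvent result = PushToTalkEvent.None;
        if (isPressed && !_wasPressed)
        {
            result = PushToTalkEvent.StartListening;
        }
        else if (!isPressed && _wasPressed)
        {
            result = PushToTalkEvent.StopAndProcess;
        }
        _wasPressed = isPressed;
        return result;
    }
}
```

Holding the trigger across many frames therefore yields exactly one StartListening and, on release, exactly one StopAndProcess.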

The Kokoro integration shape:

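voiceLoop.OnSpeechReady += async sentence =>
{
    // Synthesize one sentence to float PCM samples.
    float[] pcm = await _kokoro.SynthesizeAsync(sentence, voiceId);
    // Wrap the samples in a non-streaming, single-channel AudioClip.
    var clip = AudioClip.Create("vo", pcm.Length, 1, _kokoro.SampleRate, false);
    clip.SetData(pcm, 0);
    // Play from the NPC's AudioSource; this replaces any clip still playing.
    npcAudioSource.clip = clip;
    npcAudioSource.Play();
};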

npcAudioSource is an Inspector-assigned reference to the AudioSource on the NPC GameObject. The default scaffold sets spatialBlend = 1.0, minDistance = 1.5, maxDistance = 12.
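
Because assigning a new clip and calling Play() interrupts whatever the AudioSource is already playing, sentences that arrive faster than they are spoken can cut each other off. One possible way around that is a small playback queue, sketched below. The component name and its Enqueue method are hypothetical and not part of the scaffold. The sketch assumes Enqueue is called on Unity's main thread, since AudioClip.Create must run there.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical component, for illustration only.
public class SentenceClipQueue : MonoBehaviour
{
    [SerializeField] private AudioSource npcAudioSource;

    private readonly Queue<AudioClip> _pending = new Queue<AudioClip>();

    // Wraps mono PCM in an AudioClip and queues it for playback.
    public void Enqueue(float[] pcm, int sampleRate)
    {
        var clip = AudioClip.Create("vo", pcm.Length, 1, sampleRate, false);
        clip.SetData(pcm, 0);
        _pending.Enqueue(clip);
    }

    private void Update()
    {
        // Start the next sentence only once the previous one has finished.
        if (npcAudioSource.isPlaying || _pending.Count == 0)
        {
            return;
        }
        npcAudioSource.clip = _pending.Dequeue();
        npcAudioSource.Play();
    }
}
```

With this in place, the OnSpeechReady handler would call Enqueue(pcm, _kokoro.SampleRate) instead of touching the AudioSource directly.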

Manual scene creation

Follow experiments/06-vr-quest-npc/VrCompanionScene.unity.placeholder.md. The short version:

Build settings: File -> Build Settings -> Switch Platform -> Android.
XR config: Edit -> Project Settings -> XR Plugin Management -> Android tab -> check "OpenXR". Then OpenXR -> Android tab -> Interaction Profiles -> add "Oculus Touch Controller Profile".
Install XR Interaction Toolkit via Window -> Package Manager.
New empty scene; save as VrCompanionScene.unity under experiments/06-vr-quest-npc/.
Add the XR Origin (XR Rig) prefab from the XR Interaction Toolkit samples.
Add an empty GameObject called NPC. Give it any visible mesh (a stand-in capsule is fine). Position it 2 m in front of the XR Origin.
Attach an AudioSource to NPC. Set Spatial Blend = 1, Min Distance = 1.5, Max Distance = 12.
Add an empty GameObject named QuestVrCompanion. Attach QuestVrCompanion.cs. Drag the NPC's AudioSource into the npcAudioSource field.
First-time only: run Sauti -> Build Knowledge Base in the Editor (the runtime needs knowledge.db in StreamingAssets/).
Build & Run to a connected Quest 3 / Quest 2.
Press the right controller trigger to start listening. Release to stop and trigger the pipeline.

Expected: the NPC speaks from its 3D position within 3–5 s of trigger release (the Quest time-to-first-audio, TTFA, target).
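
If you would rather not depend on Inspector values surviving a scene rebuild, the AudioSource settings from the steps above can also be applied in code. The component below is a hypothetical helper, not part of the scaffold. It uses only standard AudioSource properties, and the playOnAwake and loop settings are assumptions about what a voice source wants rather than values taken from the scaffold.

```csharp
using UnityEngine;

// Hypothetical helper, for illustration only. Attach to the NPC GameObject.
[RequireComponent(typeof(AudioSource))]
public class NpcVoiceAudioSetup : MonoBehaviour
{
    private void Awake()
    {
        var source = GetComponent<AudioSource>();
        source.spatialBlend = 1.0f;  // fully 3D, so position and distance matter
        source.minDistance = 1.5f;   // full volume inside this radius (metres)
        source.maxDistance = 12f;    // outer edge of the rolloff range (metres)
        source.playOnAwake = false;  // assumption: speech starts only on demand
        source.loop = false;         // assumption: each sentence plays once
    }
}
```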

Try this

Three modifications to try:

Move the NPC away from the player. Shift NPC.transform.position 8 m further back. The audio fades with distance, which is the AudioSource's 3D rolloff between minDistance and maxDistance at work. Lower maxDistance to 4 m and, with a rolloff curve that reaches zero at Max Distance (Linear, for example), the NPC is effectively silent once the player is more than 4 m away.
Swap voices per NPC personality. Use a bm_george (British male) voice for the Stormwall captain and bf_emma (British female) for an islander envoy. Voices are set via the voiceId Inspector field. See Voice IDs.
Add a "thinking" placeholder. Kokoro inference on Quest CPU is 500 ms–1 s per sentence. Wire voiceLoop.OnTranscript to a particle effect or animation that signals "the NPC heard you and is thinking"; this improves perceived latency. Use voiceLoop.OnSpeechReady (the first sentence callback) to switch the particle off.
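
To get a feel for the numbers in the first modification, the sketch below models a straight-line falloff from full volume at minDistance to silence at maxDistance. This is a simplified stand-in that corresponds to a Linear rolloff setting; Unity's Logarithmic rolloff follows a different curve, and the document does not say which mode the scaffold uses. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class LinearRolloff
{
    // Gain in [0, 1] for a listener at the given distance (metres),
    // falling linearly from 1 at minDistance to 0 at maxDistance.
    public static float Gain(float distance, float minDistance, float maxDistance)
    {
        if (maxDistance <= minDistance)
        {
            throw new ArgumentException("maxDistance must exceed minDistance");
        }
        if (distance <= minDistance)
        {
            return 1f;
        }
        if (distance >= maxDistance)
        {
            return 0f;
        }
        return (maxDistance - distance) / (maxDistance - minDistance);
    }
}
```

Under this model, an NPC 8 m away with the scaffold defaults (1.5 m and 12 m) plays at roughly 0.38 of full volume, while the same NPC with maxDistance lowered to 4 m is silent.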

Known limitations

XR controller binding is fenced as XR-API-001. The scaffold uses UnityEngine.XR.InputDevices.GetDeviceAtXRNode + the legacy primary-button check. Confirm against the modern XR Interaction Toolkit InputAction pattern when wiring your own.
No fallback to UI button on non-XR runtime. The script disables itself if no XR device is detected at startup.
Audio synthesis on Quest CPU is the long pole. Kokoro alone is ~500 ms–1 s per sentence on Quest 3 CPU. Consider showing a "thinking" placeholder while the first sentence is being generated.
Quest 3 RAM budget is tight when running Qwen3-1.7B: the 1.2 GB model plus the ~1.5 GB Unity baseline is already about 2.7 GB before the Android OS is counted, which pushes against the 6 GB headroom on 8 GB devices. Gemma3 fits more comfortably, so strongly prefer Gemma once GEMMA-DL-001 is resolved post-v1.2.
XR Interaction Toolkit package is not yet in Packages/manifest.json. Manual install required (tracked as XR-PKG-001).
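
For the "thinking" placeholder suggested above, one minimal shape is a small state holder that turns on when a transcript arrives and off when the first sentence is ready. The class below is a hypothetical sketch, not scaffold code. It assumes both voiceLoop.OnTranscript and voiceLoop.OnSpeechReady pass a string, which this document does not confirm for OnTranscript.

```csharp
using System;

// Hypothetical helper, for illustration only.
public sealed class ThinkingIndicatorState
{
    public bool IsThinking { get; private set; }

    // Raised only when the state actually changes, so a particle effect
    // or animation is toggled once per transition.
    public event Action<bool> Changed;

    // Intended for voiceLoop.OnTranscript (assumed to pass the transcript text).
    public void OnTranscript(string transcript) => Set(true);

    // Intended for voiceLoop.OnSpeechReady (assumed to pass the sentence text).
    public void OnSpeechReady(string sentence) => Set(false);

    private void Set(bool value)
    {
        if (IsThinking == value)
        {
            return;
        }
        IsThinking = value;
        Changed?.Invoke(value);
    }
}
```

A MonoBehaviour could subscribe to Changed and call SetActive on the particle GameObject. Later sentences in the same reply leave the state untouched because it is already off.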

Cross-references

The orchestration shape: 05 — Full Voice Loop
Per-platform model selection: Architecture — per-platform
Quest-specific tips: Per-platform notes
Microphone permissions: Per-platform notes — Microphone permissions
Spec: voice_ai_architecture.md § 6, § 7, § 8, § 9
Previous experiment: 05 — Full Voice Loop

¹ v1.2 ships Qwen3 on Quest because Gemma3 is deferred. See Per-platform notes — Quest 3 RAM tightness.