Experiment 02 — STT Loopback
Mic -> Whisper ONNX -> on-screen text. The smallest end-to-end STT slice. Validates the speech-to-text pipeline before adding memory, RAG, or LLM.
The scaffold lives at experiments/02-stt-loopback/. The full README is at experiments/02-stt-loopback/README.md.
What this experiment proves
- Unity's Microphone API captures audio on the host platform without device-permission friction.
- Macoron/whisper.unity (which wraps Whisper ONNX over asus4/onnxruntime-unity) initialises against either whisper-small/... (flagship) or whisper-tiny/... (Quest / low-end).
- Audio chunks are transcribed to English text and surfaced to an on-screen label.
- The platform-aware model selection convention from Architecture § Per-platform model selection works at runtime: Small preferred, Tiny fallback.
Code walkthrough
Source: experiments/02-stt-loopback/WhisperLoopback.cs.
The MonoBehaviour:
- On Awake, picks the first Whisper model directory it finds under Assets/StreamingAssets/VoiceAI/stt/ (order: whisper-small/, then whisper-tiny/). The anchor file is encoder_model_quantized.onnx; the rest of the Whisper bundle (decoder + tokenizer + configs) is loaded by the upstream package.
- Attaches a WhisperManager component, sets ModelPath, IsModelPathInStreamingAssets = false, language = "en", then calls await manager.InitModel().
- On button press (StartListening / StopListening), opens / closes a rolling mic buffer (UnityEngine.Microphone) and passes the captured clip into await manager.GetTextAsync(audioClip).
- Surfaces the WhisperResult.Result string to a TextMeshProUGUI label and optionally fires per-segment events via the upstream OnNewSegment callback.
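The model-selection step in the first item can be sketched in plain C#. This is a minimal illustration rather than the scaffold's actual code: the class and method names (SttModelLocator, FindAnchor) are hypothetical, and it assumes the stt/ directory is readable through ordinary file APIs.

```csharp
using System.IO;

// Hypothetical helper; the real scaffold may structure this differently.
public static class SttModelLocator
{
    // Anchor file named in the walkthrough above.
    public const string AnchorFile = "encoder_model_quantized.onnx";

    // Preference order from the walkthrough: Small first, Tiny as fallback.
    private static readonly string[] PreferredDirs = { "whisper-small", "whisper-tiny" };

    // Returns the full path of the anchor file in the first model directory
    // that contains it, or null when no bundle is present under sttRoot.
    public static string FindAnchor(string sttRoot)
    {
        foreach (string dir in PreferredDirs)
        {
            string candidate = Path.Combine(sttRoot, dir, AnchorFile);
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

Returning null instead of throwing lets the caller decide how to report a missing bundle. Note that on Android, StreamingAssets is packed inside the APK, so File.Exists will not see these files there; a build for that platform would need a different lookup.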
The transcription call shape:
manager.ModelPath = Path.Combine(sttDir, "encoder_model_quantized.onnx");
manager.IsModelPathInStreamingAssets = false;
manager.language = "en";
await manager.InitModel();
WhisperResult res = await manager.GetTextAsync(audioClip);
string transcript = res.Result;
For raw PCM (Sauti's likely use case once VAD is wired):
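A minimal sketch of the raw-sample path, assuming the upstream WhisperManager.GetTextAsync(float[] samples, int frequency, int channels) overload; check it against memory/api_surfaces.md before relying on it. The RawPcmTranscriber class is hypothetical and not part of the scaffold. It runs only inside a Unity project with the upstream package installed.

```csharp
using System.Threading.Tasks;
using UnityEngine;
using Whisper;

// Hypothetical helper, not part of the scaffold.
public static class RawPcmTranscriber
{
    // Copies the clip's samples into a float[] and hands them to the
    // raw-sample overload instead of passing the AudioClip itself.
    public static async Task<string> TranscribeAsync(WhisperManager manager, AudioClip clip)
    {
        // AudioClip.GetData fills interleaved samples, so the buffer needs
        // samples-per-channel times channel count entries.
        var samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        WhisperResult res = await manager.GetTextAsync(samples, clip.frequency, clip.channels);

        // Null-conditional in case the upstream call yields no result.
        return res?.Result;
    }
}
```

Once a VAD produces sample ranges directly, the same overload can take those buffers without building an AudioClip first.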
The full upstream API: see API reference — Upstream APIs and the verified surface notes at memory/api_surfaces.md.
Manual scene creation
Follow experiments/02-stt-loopback/LoopbackScene.unity.placeholder.md. The short version:
- New empty scene; save as LoopbackScene.unity under experiments/02-stt-loopback/.
- Empty GameObject named WhisperLoopback. Attach WhisperLoopback.cs.
- Canvas with a TextMeshProUGUI label + a UI Button.
- Wire the button's OnClick to WhisperLoopback.StartListening (and a second button to StopListening if you prefer two-button vs hold-to-talk).
- Press Play. Click the button. Speak a short English phrase. Click again to stop.
Expected console output:
[Sauti][STT] init model=encoder_model_quantized.onnx (whisper-small) ok
[Sauti][STT] segment "what's the weather in Stormwall" TTFA=NNNms
Latency target: ≤ 300 ms desktop CPU TTFA per voice_ai_architecture.md § 8.
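One way to produce the TTFA figure in the segment log line is a small stopwatch wrapper. This is a sketch under assumptions: the document does not define where the TTFA clock starts, so the code below assumes it starts when capture stops and ends at the first segment. The TtfaProbe class and its members are hypothetical.

```csharp
using System.Diagnostics;

// Hypothetical helper for producing the segment log line shown above.
public sealed class TtfaProbe
{
    // Desktop CPU target quoted above, in milliseconds.
    public const long TargetMs = 300;

    private readonly Stopwatch _watch = new Stopwatch();
    private bool _reported;

    // Call when capture stops and transcription begins (assumed clock start).
    public void Start()
    {
        _reported = false;
        _watch.Restart();
    }

    // Call from the segment callback. Returns a log line for the first
    // segment after Start(), and null for later segments or if Start()
    // was never called.
    public string OnSegment(string text)
    {
        if (_reported || !_watch.IsRunning)
        {
            return null;
        }
        _reported = true;
        _watch.Stop();
        return Format(text, _watch.ElapsedMilliseconds);
    }

    // Formats the line in the same shape as the expected console output.
    public static string Format(string text, long elapsedMs)
    {
        return $"[Sauti][STT] segment \"{text}\" TTFA={elapsedMs}ms";
    }
}
```

Comparing the measured value against TargetMs in a test run gives a quick pass/fail signal for the latency target.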
Try this
Three modifications to try as you read the code:
- Switch to Whisper Tiny. Delete Assets/StreamingAssets/VoiceAI/stt/whisper-small/ (keep a backup) and rerun. The scaffold should pick whisper-tiny/ automatically. Notice the smaller model is faster but loses accuracy on long words and accents.
- Subscribe to streaming segments. The upstream package fires OnNewSegment(WhisperSegment) as the model produces output. Wire it up to display partial transcripts before the full clip is processed.
- Try a longer utterance. The current scaffold uses time-window chunking with no Voice Activity Detection. Speak for >5 seconds. Notice the transcript still arrives (Whisper handles longer audio gracefully), but latency scales with length. VAD-driven auto-stop is the natural next feature; it is not in scope for v1.2.
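For the streaming-segment item above, a minimal sketch of the subscription. It assumes the upstream OnNewSegment event delivers a WhisperSegment with a Text member and that the callback arrives on Unity's main thread; if it does not, the label update must be marshalled to the main thread first. The PartialTranscriptLabel component and its field names are hypothetical.

```csharp
using TMPro;
using UnityEngine;
using Whisper;

// Hypothetical component; field names are illustrative only.
public class PartialTranscriptLabel : MonoBehaviour
{
    [SerializeField] private WhisperManager manager;
    [SerializeField] private TextMeshProUGUI label;

    private void OnEnable()
    {
        manager.OnNewSegment += HandleSegment;
    }

    private void OnDisable()
    {
        // Unsubscribe so a disabled or destroyed label is not called back.
        manager.OnNewSegment -= HandleSegment;
    }

    // Appends each segment's text as it arrives.
    private void HandleSegment(WhisperSegment segment)
    {
        label.text += segment.Text;
    }
}
```

Because the handler appends, the label should be cleared when a new recording starts; otherwise transcripts from successive clips run together.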
Known limitations
- The .unity scene is not committed. Build manually per the placeholder.
- No VAD. Push-to-talk relies on explicit user-driven start/stop. Production-grade end-of-utterance detection (originally planned via Silero VAD) is demoted to "legacy / opt-in" per project_context.md § 4.
- English only. language = "en" is hard-coded per voice_ai_architecture.md § 10. Whisper itself is multilingual but Sauti v1.x doesn't expose that switch.
Cross-references
- Upstream: Macoron/whisper.unity
- API surface notes: memory/api_surfaces.md (Whisper.WhisperManager section)
- AI models — STT
- Spec: voice_ai_architecture.md § 2 + § 6
- Previous experiment: 01 — TTS Hello
- Next experiment: 03 — LLM Chat