Experiment 01 — TTS Hello
The smallest end-to-end TTS slice. Type a string in the Inspector -> Kokoro ONNX synthesises PCM -> Unity's `AudioSource` plays it.
The scaffold lives at experiments/01-tts-hello/. The full README is at experiments/01-tts-hello/README.md.
What this experiment proves
- The Unity project loads with the three required packages installed.
- The Kokoro model + voices + tokenizer are reachable from `Assets/StreamingAssets/VoiceAI/tts/`.
- ONNX Runtime initialises on the host platform and produces audio output.
- The Sauti `KokoroTtsRunner` works end-to-end against a known string.
This is the first experiment to write because it's the smallest. If TTS doesn't work, nothing downstream can — the voice loop ends in silence.
Code walkthrough
Source: experiments/01-tts-hello/KokoroHello.cs.
The MonoBehaviour:
- Resolves the Kokoro paths under `StreamingAssets/VoiceAI/tts/`.
- Constructs a `KokoroTtsRunner` on `Awake` (cheap, because initialisation is lazy).
- On `Start` or button press, calls `await runner.SynthesizeAsync(textToSpeak, voiceId)` and pipes the resulting `float[]` PCM into an `AudioClip` that the attached `AudioSource` plays.
The synthesis call shape (see the KokoroTtsRunner API):
float[] pcm = await runner.SynthesizeAsync("Hello from Sauti.", "af_bella");
var clip = AudioClip.Create("Kokoro", pcm.Length, 1, runner.SampleRate, false);
clip.SetData(pcm, 0);
audioSource.clip = clip;
audioSource.Play();
Key details Sauti's runner handles for you:
- IPA phonemisation of the input string via `EnglishG2P`.
- Loading the chosen voice's `.bin` file and slicing the correct style-vector row.
- Building the three input tensors (`input_ids`, `style`, `speed`) and discovering their names from the ONNX metadata.
- Pulling the audio waveform out of the dynamically-named float output.
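To make the style-vector step above concrete, here is a minimal sketch of how a row might be sliced out of a voice `.bin`. It is not Sauti's actual implementation. It assumes the layout commonly used by published Kokoro ONNX voice packs: raw little-endian float32 values forming rows of 256 floats, with the row selected by the number of input tokens. The names `VoiceStyle`, `StyleDim`, and `SliceStyleRow` are hypothetical.

```csharp
using System;

public static class VoiceStyle
{
    // Assumption: each style row holds 256 float32 values.
    public const int StyleDim = 256;

    // Returns one style row from the raw bytes of a voice .bin file.
    // Assumption: the row index is the token count, clamped to the rows available.
    // Assumption: the file and the host are both little-endian.
    public static float[] SliceStyleRow(byte[] voiceBytes, int tokenCount)
    {
        int rowBytes = StyleDim * sizeof(float);
        int rowCount = voiceBytes.Length / rowBytes;
        if (rowCount == 0)
        {
            throw new ArgumentException("Voice file is smaller than one style row.", nameof(voiceBytes));
        }

        int row = Math.Max(0, Math.Min(tokenCount, rowCount - 1));
        var style = new float[StyleDim];

        // Buffer.BlockCopy takes byte offsets and a byte count.
        Buffer.BlockCopy(voiceBytes, row * rowBytes, style, 0, rowBytes);
        return style;
    }
}
```

The returned array would then back the `style` input tensor. Clamping the index is a defensive choice in this sketch; the runner may handle over-long inputs differently.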
Manual scene creation
Follow experiments/01-tts-hello/HelloScene.unity.placeholder.md. The short version:
- Open the repo as a Unity project.
- Create a new empty scene; save it as `HelloScene.unity` under `experiments/01-tts-hello/`.
- Add an empty `GameObject` named `KokoroHello`.
- Attach `KokoroHello.cs` to it.
- Attach an `AudioSource` to the same GameObject. Disable "Play On Awake".
- In the Inspector, set Text To Speak and Voice Id (any of the 11 from Voice IDs).
- Press Play.
Expected console output:
[Sauti][TTS] init model=model_quantized.onnx ok
[Sauti][TTS] speak "Hello from Sauti." voice=af_bella TTFA=NNNms
Latency target on desktop: ~200 ms time to first audio (the TTFA value in the log line above), per `voice_ai_architecture.md` § 8.
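If you want to check a build against that target yourself, one option is to wrap the synthesis call in a stopwatch. The sketch below is an illustration only: the `Latency` class and `TimeAsync` helper are hypothetical, and how the runner computes its own logged TTFA is not shown here. Because it times the whole awaited call, the result is an upper bound on synthesis latency and does not include the time Unity takes to start playback.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class Latency
{
    // Runs the supplied synthesis delegate and reports how long it took to complete.
    public static async Task<(float[] pcm, long elapsedMs)> TimeAsync(Func<Task<float[]>> synthesize)
    {
        var stopwatch = Stopwatch.StartNew();
        float[] pcm = await synthesize();
        stopwatch.Stop();
        return (pcm, stopwatch.ElapsedMilliseconds);
    }
}
```

A call site could look like `var (pcm, ms) = await Latency.TimeAsync(() => runner.SynthesizeAsync(textToSpeak, voiceId));`, with `ms` then compared against the ~200 ms target. The first call is likely to be slower than later ones because initialisation is lazy.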
Try this
Three modifications to try as you read the code:
- Vary the voice. Cycle through `af`, `bf_emma`, `bm_george`, `am_adam`. Notice the accent / gender differences. See the Voice IDs table for the full set.
- Adjust the speed. Pass a `speed` argument to `SynthesizeFromPhonemesAsync`: `0.8f` for slower, `1.2f` for faster. Values too far from `1.0` produce artifacts.
- Phonemise externally. The default `SynthesizeAsync` runs the input through `EnglishG2P` (a best-effort pure-C# fallback). If you have a higher-quality phonemiser (misaki/espeak-ng), generate the IPA string out-of-process and call `SynthesizeFromPhonemesAsync` directly. Quality on out-of-distribution words improves noticeably.
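The first two modifications can be wired to a button with a couple of small helpers. This is a sketch under assumptions: the `TryThisKnobs` class is hypothetical, and the default clamp range of 0.8 to 1.2 simply reuses the two example values above rather than any limit documented for the model.

```csharp
using System;

public static class TryThisKnobs
{
    // The four voice IDs suggested above; see the Voice IDs table for the full set.
    public static readonly string[] Voices = { "af", "bf_emma", "bm_george", "am_adam" };

    // Maps a running press count onto the voice list, wrapping around at the end.
    public static string NextVoice(int pressCount)
    {
        int index = ((pressCount % Voices.Length) + Voices.Length) % Voices.Length;
        return Voices[index];
    }

    // Keeps the speed inside a conservative band around 1.0.
    // Assumption: the default bounds are illustrative, not limits enforced by the runner.
    public static float ClampSpeed(float speed, float min = 0.8f, float max = 1.2f)
    {
        return Math.Max(min, Math.Min(max, speed));
    }
}
```

Incrementing a counter on each button press and passing it to `NextVoice` walks through the voices in order, and `ClampSpeed` guards against Inspector values that drift far from `1.0`.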
Known limitations
- The `.unity` scene is not committed. You build it once, after first cloning the repo.
- The phonemiser is best-effort. Out-of-distribution words sound robotic or wrong. See the `EnglishG2P` caveat.
- `SynthesizeAsync` is not concurrent-safe. The underlying `InferenceSession` handles one request at a time. If you need to issue requests from several places at once, queue them externally.
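One way to queue requests externally is to put an async gate in front of the runner so that only one synthesis call is in flight at a time. The sketch below uses `SemaphoreSlim` from the .NET base library. The `SerializedSynthesizer` wrapper is hypothetical and takes the synthesis call as a delegate, so it makes no assumptions about the runner beyond the `SynthesizeAsync(text, voiceId)` shape shown earlier.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SerializedSynthesizer
{
    // A semaphore with a single slot lets one caller through at a time.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly Func<string, string, Task<float[]>> _synthesize;

    public SerializedSynthesizer(Func<string, string, Task<float[]>> synthesize)
    {
        _synthesize = synthesize ?? throw new ArgumentNullException(nameof(synthesize));
    }

    // Waits for any in-flight request to finish, then runs this one.
    public async Task<float[]> SynthesizeAsync(string text, string voiceId)
    {
        await _gate.WaitAsync();
        try
        {
            return await _synthesize(text, voiceId);
        }
        finally
        {
            // Release even if synthesis throws, so later callers are not blocked forever.
            _gate.Release();
        }
    }
}
```

Constructing it as `new SerializedSynthesizer((text, voice) => runner.SynthesizeAsync(text, voice))` keeps every caller behind the same gate. This serialises calls but does not promise strict first-in, first-out ordering; if ordering matters, an explicit queue would be the safer choice.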
Cross-references
- `KokoroTtsRunner` API
- Voice IDs catalogue
- AI models — TTS
- Spec: `voice_ai_architecture.md § 8` (streaming TTS pattern)
- Next experiment: 02 — STT Loopback