Closes the loop on on-device conversational TTS. The LLM's token stream is
now consumed by a SentenceStreamer which fires a callback the moment a
terminal-punctuation boundary appears; each sentence is enqueued to a
persistent TTS streaming session that generates and plays audio through a
single shared AudioTrack. Sentence N's audio plays while sentence N+1 is
being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and
no "wait for full response before hearing anything".

Mocked-LLM validation on the 3-sentence prompt:
"Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui."
- First sentence detected: 1 ms
- Seg 0 prefill (Hex): 567 ms
- Seg 0 generated: 4 200 ms (18 tokens, 1.4 s audio)
- Seg 1 generated: 9 100 ms (42 tokens)
- Seg 2 generated: 11 000 ms (46 tokens)
- Session closed: 33 500 ms (all audio drained)

Changes:
* tts/SentenceStreamer.kt — 50-line helper that buffers tokens and
fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4
so "Oui." / "Bonjour." count as real sentences; higher thresholds
swallowed conversational openers into the next segment and delayed
first audio. flush() emits the final partial sentence.
* Qwen3TtsEngine gains a startStreamingSession / enqueueSentence /
endStreamingSession triplet. startStreamingSession opens a 30-second
MODE_STREAM AudioTrack plus a background worker coroutine that pulls
sentences from an unlimited-capacity Channel. enqueueSentence is
non-blocking; the worker serialises generation so audio order matches
enqueue order. generateSegmentAudioVC is the per-sentence body
(tokenize → prefill build → Hexagon gen → decode) without the WAV-save
side effects of the /stream_text intent path.
* KazeiaService new intents:
- stream_llm: real LLM path. It needs the LLM loaded; the debug build
currently runs echo-mode, so this path ships but can only be exercised
with a production config.
- stream_llm_mock: fakes the LLM stream by splitting the given text on
spaces and emitting one "token" every 50 ms. That matches the
~20 tok/s rate of the on-device LLM and lets Stage 3 be validated
without turning the LLM on.
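
For illustration only, the sentence-boundary logic and the mock token feed described above could look roughly like the sketch below. It is not the code in this commit: the push method, the constructor shape, the mockLlmStream helper and the rule that a too-short fragment is merged into the next sentence are all assumptions. Only the boundary characters, minChars = 4, onSentence, flush() and the 50 ms token spacing come from the text above.

```kotlin
// Hypothetical sketch; the real tts/SentenceStreamer.kt may differ.
class SentenceStreamer(
    private val minChars: Int = 4, // threshold quoted in the commit text
    private val onSentence: (String) -> Unit,
) {
    private val buffer = StringBuilder()

    /** Feed one LLM token; fires onSentence for every sentence it completes. */
    fun push(token: String) {
        for (ch in token) {
            buffer.append(ch)
            if (ch in TERMINATORS) emitIfLongEnough()
        }
    }

    /** Emit whatever is left once the token stream has ended. */
    fun flush() {
        val rest = buffer.toString().trim()
        buffer.setLength(0)
        if (rest.isNotEmpty()) onSentence(rest)
    }

    private fun emitIfLongEnough() {
        val candidate = buffer.toString().trim()
        // Assumed behaviour: a too-short fragment stays buffered and is
        // merged into the next sentence instead of being emitted alone.
        if (candidate.length < minChars) return
        buffer.setLength(0)
        onSentence(candidate)
    }

    private companion object {
        val TERMINATORS = setOf('.', '!', '?', ';', '\n')
    }
}

// Stand-in for the stream_llm_mock intent (assumed shape): one
// space-delimited "token" every delayMs milliseconds.
fun mockLlmStream(text: String, delayMs: Long = 50, onToken: (String) -> Unit) {
    for (word in text.split(" ")) {
        if (word.isEmpty()) continue
        onToken("$word ")
        if (delayMs > 0) Thread.sleep(delayMs)
    }
}

fun main() {
    // In the real service the callback would call enqueueSentence.
    val streamer = SentenceStreamer { sentence -> println("enqueue: $sentence") }
    mockLlmStream("Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui.") {
        streamer.push(it)
    }
    streamer.flush()
}
```

A character-level boundary check like this also splits on decimal points and abbreviations. That is probably acceptable for short conversational replies, but it is a known limitation of the approach.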

Architectural notes:
- AudioTrack buffer is 30 s so generation can run ahead of playback
without blocking writes. The real-time factor (RTF, generation time
divided by audio duration) on Snapdragon 8 Elite is ~3 for short
sentences, so for a 2-3 sentence response the buffer actually drains
between segments and the user hears a short gap. This is expected,
not a bug. Masking that gap requires RTF < 1, which is out of scope.
- Hexagon KV is reset between sentences (hexReset) so the talker
doesn't see stale context. Prefill yielded cb0 = 1995 on every
sentence that starts with a capital letter, matching the Python
greedy reference. This confirms that prefill reconstruction is stable
across segments within a session.
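
To make the RTF remark concrete, here is a rough timing model, offered as a sketch under stated assumptions and not as a description of the engine. It assumes segments are generated strictly back to back, that a segment only becomes audible once it is fully generated, and that the "Seg 0 generated" figure (4 200 ms for 1.4 s of audio) is the wall time spent on that segment. The function names rtf and playbackGapsMs are hypothetical.

```kotlin
/** Real-time factor: generation time divided by the duration of the audio produced. */
fun rtf(genMs: Long, audioMs: Long): Double = genMs.toDouble() / audioMs

/**
 * Silent gaps (ms) heard before segments 1..n-1 under the assumptions
 * stated above (serial generation, a segment plays only when complete).
 */
fun playbackGapsMs(genMs: List<Long>, audioMs: List<Long>): List<Long> {
    require(genMs.size == audioMs.size) { "one audio duration per segment" }
    val gaps = mutableListOf<Long>()
    var readyAt = 0L      // time at which the current segment is fully generated
    var playbackEnd = 0L  // time at which already-queued audio runs out
    for (i in genMs.indices) {
        readyAt += genMs[i]
        // Silence occurs when the buffer drained before this segment was ready.
        if (i > 0) gaps += maxOf(0L, readyAt - playbackEnd)
        playbackEnd = maxOf(readyAt, playbackEnd) + audioMs[i]
    }
    return gaps
}

fun main() {
    println(rtf(4200, 1400))                                               // Seg 0 figures from above
    println(playbackGapsMs(listOf(3000L, 3000L), listOf(1000L, 1000L)))    // RTF 3: buffer drains
    println(playbackGapsMs(listOf(500L, 500L), listOf(1000L, 1000L)))      // RTF 0.5: no gap
}
```

Under this model a gap appears whenever the next segment takes longer to generate than the previous one takes to play, which is why only RTF < 1 removes it entirely.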

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>