kazeia

History

Kazeia Team 2f07901ff3 TTS Stage 3: LLM stream → sentence split → TTS session → shared AudioTrack

Closes the loop on on-device conversational TTS. The LLM's token stream is now consumed by a SentenceStreamer which fires a callback the moment a terminal-punctuation boundary appears; each sentence is enqueued to a persistent TTS streaming session that generates and plays audio through a single shared AudioTrack. Sentence N's audio plays while sentence N+1 is being generated on Hexagon+CP: no per-sentence AudioTrack init gap, and no "wait for full response before hearing anything".

Mocked-LLM validation on the 3-sentence prompt: "Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui."
- First sentence detected: 1 ms
- Seg 0 prefill (Hex): 567 ms
- Seg 0 generated: 4 200 ms (18 tokens, 1.4 s audio)
- Seg 1 generated: 9 100 ms (42 tokens)
- Seg 2 generated: 11 000 ms (46 tokens)
- Session closed: 33 500 ms (all audio drained)

Changes:
* tts/SentenceStreamer.kt: 50-line helper that buffers tokens and fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4 so "Oui." / "Bonjour." count as real sentences; higher thresholds swallowed conversational openers into the next segment and delayed first audio. flush() for the final partial sentence.
* Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession triplet. startStreamingSession opens a 30-second MODE_STREAM AudioTrack plus a background worker coroutine that pulls sentences from an unlimited Channel. enqueueSentence is non-blocking; the worker serialises generation so audio order matches enqueue order. generateSegmentAudioVC is the per-sentence body (tokenize → prefill build → Hexagon gen → decode) without the WAV-save side effects that the /stream_text intent path does.
* KazeiaService new intents:
  - stream_llm : real LLM path (needs LLM loaded; currently the debug build runs echo-mode so this path is shipped but requires production config to exercise).
  - stream_llm_mock : fakes the LLM stream by splitting the given text on spaces with 50 ms per "token"; this matches the ~20 tok/s rate the on-device LLM produces and lets Stage 3 be validated without flipping the LLM on.

Architectural notes:
- AudioTrack buffer is 30 s so generation can run ahead of playback without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short sentences, so for a 2-3 sentence response the buffer actually drains between segments and the user hears a short gap. This is expected, not a bug. Masking that gap requires RTF < 1, which is out of scope.
- Hexagon KV is reset between sentences (hexReset) so the talker doesn't see stale context. Prefill observed cb0 = 1995 on every sentence that starts with a capital letter, matching the Python greedy reference; this confirms prefill reconstruction is stable across segments within a session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-13 10:52:46 +02:00
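
The sentence-splitting behaviour described for tts/SentenceStreamer.kt can be illustrated with a minimal sketch. The real file is not shown here, so everything below is an assumption based on the commit message: the constructor shape, the `onToken` method name, the whitespace trimming, and the `splitMockStream` helper (which imitates the stream_llm_mock split on spaces, without the 50 ms delay) are all hypothetical.

```kotlin
// Hypothetical sketch of the splitter described in the commit message.
// It is NOT the actual tts/SentenceStreamer.kt; names and details are assumed.
class SentenceStreamer(
    private val minChars: Int = 4,
    private val onSentence: (String) -> Unit,
) {
    private val buffer = StringBuilder()

    /** Feeds one token of the LLM stream; may fire onSentence zero or more times. */
    fun onToken(token: String) {
        for (ch in token) {
            buffer.append(ch)
            if (ch in TERMINATORS) emitIfLongEnough()
        }
    }

    /** Emits whatever is still buffered once the stream has ended. */
    fun flush() {
        val rest = buffer.toString().trim()
        buffer.clear()
        if (rest.isNotEmpty()) onSentence(rest)
    }

    private fun emitIfLongEnough() {
        val candidate = buffer.toString().trim()
        // Too short to count as a sentence: keep buffering so the fragment
        // is merged into the next segment.
        if (candidate.length < minChars) return
        buffer.clear()
        onSentence(candidate)
    }

    private companion object {
        val TERMINATORS = setOf('.', '!', '?', ';', '\n')
    }
}

// Assumed helper: splits text on spaces and feeds the pieces as "tokens",
// in the spirit of stream_llm_mock but with no per-token delay.
fun splitMockStream(text: String): List<String> {
    val sentences = mutableListOf<String>()
    val streamer = SentenceStreamer { sentence -> sentences.add(sentence) }
    text.split(" ").forEach { word -> streamer.onToken("$word ") }
    streamer.flush()
    return sentences
}
```

Under these assumptions, feeding the validation prompt through `splitMockStream` yields three segments, and a trailing fragment without terminal punctuation is only emitted by `flush()`. With `minChars = 4`, "Oui." (exactly four characters) is emitted on its own, which is the behaviour the commit message says higher thresholds broke.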
app	TTS Stage 3: LLM stream → sentence split → TTS session → shared AudioTrack	2026-04-13 10:52:46 +02:00
gradle/wrapper	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
COMPILE_WHISPER_NPU.md	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
RAPPORT_TTS_NPU.md	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
RAPPORT_TTS_QWEN3_TESTS.md	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
build.gradle.kts	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
gradle.properties	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
gradlew	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
gradlew.bat	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
settings.gradle.kts	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00