kazeia

Kazeia Team f548e02283
TTS: dynamic EOS-rank boost terminates generation cleanly across voices

Replaces the fixed maxGen + length-based boost with a fully dynamic end-of-utterance detector that watches the model's own EOS logit rank.

End result on the Baer 3-segment monologue, validated by user as "FORMIDABLE" / "impeccable" with both Damien and Zelda voices:
- All 3 segments terminate via EOS (no maxGen cap hit)
- No "page beg beg" filler tail
- No abrupt cuts between segments
- Audio durations 5-8 s per segment, matching Python within ~10 %

How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt):

1. At every decode step, compute the rank of CODEC_EOS in the repetition-penalised logits. Mid-utterance the rank sits at 150-700 (model is committed to producing speech). Approaching the natural end, the rank dips toward top-50.

2. Arm the boost only when EOS rank stays below eosRankTrigger=60 for THREE consecutive steps. The 3-step requirement filters out transient single-step dips that occur during low-energy phonemes mid-sentence (without it, short sentences would terminate after ~3 s). Arming is also gated by eosBoostMinStep (50 % of expected speech length) so we never arm in the very first frames.

3. Once armed, the boost increments monotonically: each subsequent step adds boostStepsActive * eosBoostScale to the EOS logit. The accumulated boost lifts EOS above top-1 within 1-3 steps, the argmax check fires, and the loop breaks. Scale=4 gives the model a small natural decay before termination; scale=5 was perfect but slightly clipping, scale=3 wasn't strong enough to outpace the growing top-1 logit.

Other tweaks bundled in this commit because they all contribute to the clean output:

* Inter-segment gap 120 → 250 ms: gives the listener a perceived sentence boundary instead of a hard concatenation.
* fadeOut(audio, 40) on every segment: cosine roll-off over the last 40 ms so the EOS-clipped tail decays naturally instead of sample-clipping.
* top_k 50 → 200 in the fallback sample call: wider pool to keep EOS reachable when the boost just fails to hit argmax.

Voice swap is a 45 KB file push (damien_voice_prefix.bin and damien_voice_suffix.bin). Successfully tested today with Elodie (female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36) as the baseline; same Kotlin code, no rebuild needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-13 14:13:04 +02:00
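The three "How it works" steps in the commit message above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in tts/Qwen3TtsEngine.kt: the class `EosRankBooster`, the helpers `toyLogits` and `runToyDecode`, and the choice to carry the boost over from step to step are assumptions made for the sketch. Only eosRankTrigger=60, the three-step streak, eosBoostMinStep, and eosBoostScale=4 come from the message itself, and the logits passed in are assumed to be already repetition-penalised.

```kotlin
// Hypothetical sketch of the EOS-rank detector described in the commit message.
class EosRankBooster(
    private val eosId: Int,
    private val eosBoostMinStep: Int,        // e.g. 50 % of the expected speech length
    private val eosRankTrigger: Int = 60,
    private val requiredStreak: Int = 3,
    private val eosBoostScale: Float = 4f,
) {
    private var streak = 0            // consecutive steps with EOS rank below the trigger
    private var boostStepsActive = 0  // 0 while the boost is not armed
    private var accumulatedBoost = 0f

    /** Rank of EOS = number of tokens with a strictly larger logit (0 means top-1). */
    fun eosRank(logits: FloatArray): Int {
        val eos = logits[eosId]
        return logits.count { it > eos }
    }

    /** Boosts logits[eosId] in place when armed; returns true once EOS is the argmax. */
    fun step(step: Int, logits: FloatArray): Boolean {
        if (boostStepsActive == 0) {
            // Steps 1 and 2: track the rank, arm only after a sustained dip past the minimum step.
            streak = if (eosRank(logits) < eosRankTrigger) streak + 1 else 0
            val canArm = streak >= requiredStreak && step >= eosBoostMinStep
            if (!canArm) return eosRank(logits) == 0
        }
        // Step 3: once armed, the increment grows every step (assumed to accumulate).
        boostStepsActive++
        accumulatedBoost += boostStepsActive * eosBoostScale
        logits[eosId] += accumulatedBoost
        return eosRank(logits) == 0
    }
}

// Toy logits: token 0 is EOS, and exactly `eosRank` other tokens score higher than it.
fun toyLogits(eosRank: Int, vocab: Int = 100): FloatArray =
    FloatArray(vocab) { i -> if (i == 0) -(eosRank + 0.5f) else -i.toFloat() }

// Drives the booster over a scripted sequence of EOS ranks.
// Returns the step at which the loop would break, or -1 if it never does.
fun runToyDecode(ranks: List<Int>, minStep: Int): Int {
    val booster = EosRankBooster(eosId = 0, eosBoostMinStep = minStep)
    for ((step, rank) in ranks.withIndex()) {
        if (booster.step(step, toyLogits(rank))) return step
    }
    return -1
}
```

In this toy setup a single low-rank step followed by a high-rank one resets the streak, which is the filtering behaviour the message attributes to the three-step requirement. The toy logits are spaced far more evenly than real codec logits, so the number of boosted steps needed here says nothing about the 1-3 steps reported for the actual model.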
app	TTS: dynamic EOS-rank boost terminates generation cleanly across voices	2026-04-13 14:13:04 +02:00
gradle/wrapper	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
COMPILE_WHISPER_NPU.md	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
RAPPORT_TTS_NPU.md	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
RAPPORT_TTS_QWEN3_TESTS.md	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
build.gradle.kts	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
gradle.properties	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
gradlew	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
gradlew.bat	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
settings.gradle.kts	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00