kazeia/kazeia-android
Kazeia Team f548e02283 TTS: dynamic EOS-rank boost terminates generation cleanly across voices
Replaces the fixed maxGen + length-based boost with a fully dynamic
end-of-utterance detector that watches the model's own EOS logit rank.
End result on the Baer 3-segment monologue, validated by user as
"FORMIDABLE" / "impeccable" with both Damien and Zelda voices:

  - All 3 segments terminate via EOS (no maxGen cap hit)
  - No "page beg beg" filler tail
  - No abrupt cuts between segments
  - Audio durations 5-8 s per segment, within ~10 % of the Python reference

How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt):

  1. At every decode step, compute the rank of CODEC_EOS in the
     repetition-penalised logits. Mid-utterance the rank sits at
     150-700 (model is committed to producing speech). Approaching
     the natural end, the rank dips toward top-50.

  2. Arm the boost only when EOS rank stays below eosRankTrigger=60
     for THREE consecutive steps. The 3-step requirement filters out
     transient single-step dips that occur during low-energy phonemes
     mid-sentence (without it, short sentences would terminate after
     ~3 s). Arming is also gated by eosBoostMinStep (50 % of expected
     speech length) so we never arm in the very first frames.

  3. Once armed, the boost increments monotonically: each subsequent
     step adds boostStepsActive * eosBoostScale to the EOS logit. The
     accumulated boost lifts EOS above top-1 within 1-3 steps, the
     argmax check fires, and the loop breaks. Scale=4 gives the model
     a small natural decay before termination; scale=5 sounded right
     but clipped slightly, and scale=3 was too weak to outpace the
     growing top-1 logit.
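
The three steps above can be sketched in Kotlin as follows. This is a
minimal illustration, not the code in runHexGenWithPrefill: the class
EosRankBoost, the helper eosRank and the requiredStreak parameter are
hypothetical, and the sketch assumes the first boost is applied on the
step that arms the detector. Only eosRankTrigger, eosBoostMinStep,
eosBoostScale and boostStepsActive are names taken from this message.

```kotlin
/** 0-based rank of EOS: the number of logits strictly greater than the EOS logit. */
fun eosRank(logits: FloatArray, eosId: Int): Int {
    val eos = logits[eosId]
    return logits.count { it > eos }
}

/** Hypothetical detector: arms after a streak of low EOS ranks, then ramps a boost. */
class EosRankBoost(
    private val eosId: Int,
    private val eosBoostMinStep: Int,       // e.g. 50 % of the expected speech length
    private val eosRankTrigger: Int = 60,
    private val eosBoostScale: Float = 4f,
    private val requiredStreak: Int = 3,    // assumed name for the 3-step requirement
) {
    private var streak = 0
    private var boostStepsActive = 0

    val armed: Boolean get() = boostStepsActive > 0

    /**
     * Call once per decode step with the repetition-penalised logits.
     * Boosts logits[eosId] in place once armed and returns true when
     * EOS has become the argmax, i.e. when the caller should stop.
     */
    fun step(step: Int, logits: FloatArray): Boolean {
        if (boostStepsActive == 0) {
            // Count consecutive steps with EOS rank below the trigger.
            streak = if (eosRank(logits, eosId) < eosRankTrigger) streak + 1 else 0
            // Arming is gated so it never happens in the first frames.
            if (streak >= requiredStreak && step >= eosBoostMinStep) boostStepsActive = 1
        } else {
            boostStepsActive++
        }
        // The boost grows with every step after arming.
        if (boostStepsActive > 0) logits[eosId] += boostStepsActive * eosBoostScale
        return eosRank(logits, eosId) == 0
    }
}
```

In a decode loop this would be called as
`if (detector.step(t, logits)) break` before falling back to sampling.
A single-step dip resets nothing permanently: the streak simply
restarts, which is what filters out low-energy phonemes mid-sentence.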

Other tweaks bundled in this commit because they all contribute to
the clean output:

  * Inter-segment gap 120 → 250 ms: gives the listener a perceived
    sentence boundary instead of a hard concatenation.

  * fadeOut(audio, 40) on every segment: cosine roll-off over the
    last 40 ms so the tail left by EOS termination decays naturally
    instead of ending on an abrupt sample cut.

  * top_k 50 → 200 in the fallback sample call: a wider pool keeps
    EOS reachable when the boost narrowly fails to make EOS the argmax.
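
For illustration, the fade-out and the inter-segment gap could look
roughly like the sketch below. It assumes mono float PCM and a 24 kHz
sample rate, neither of which is stated in this message; joinSegments
and SAMPLE_RATE are hypothetical names, and the real fadeOut may
differ in signature and curve details.

```kotlin
import kotlin.math.PI
import kotlin.math.cos

// Assumption: output sample rate in Hz (not stated in the commit message).
const val SAMPLE_RATE = 24_000

/** Applies a raised-cosine fade over the last fadeMs milliseconds, in place. */
fun fadeOut(audio: FloatArray, fadeMs: Int, sampleRate: Int = SAMPLE_RATE): FloatArray {
    val n = minOf(audio.size, sampleRate * fadeMs / 1000)
    val start = audio.size - n
    for (i in 0 until n) {
        // Gain falls smoothly from just under 1.0 to exactly 0.0 at the last sample.
        val gain = 0.5 * (1.0 + cos(PI * (i + 1) / n))
        audio[start + i] = (audio[start + i] * gain).toFloat()
    }
    return audio
}

/** Fades each segment (on a copy) and joins them with gapMs of silence in between. */
fun joinSegments(
    segments: List<FloatArray>,
    gapMs: Int = 250,
    sampleRate: Int = SAMPLE_RATE,
): FloatArray {
    val gap = sampleRate * gapMs / 1000
    val total = segments.sumOf { it.size } + gap * maxOf(0, segments.size - 1)
    val out = FloatArray(total) // zero-initialised, so the gaps are already silent
    var pos = 0
    segments.forEachIndexed { idx, seg ->
        if (idx > 0) pos += gap
        fadeOut(seg.copyOf(), 40, sampleRate).copyInto(out, pos)
        pos += seg.size
    }
    return out
}
```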

Voice swap is a 45 KB file push (damien_voice_prefix.bin and
damien_voice_suffix.bin). Successfully tested today with Elodie
(female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36)
as the baseline — same Kotlin code, no rebuild needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:13:04 +02:00
app TTS: dynamic EOS-rank boost terminates generation cleanly across voices 2026-04-13 14:13:04 +02:00
gradle/wrapper Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
COMPILE_WHISPER_NPU.md Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
RAPPORT_TTS_NPU.md Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
RAPPORT_TTS_QWEN3_TESTS.md Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
build.gradle.kts Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
gradle.properties Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
gradlew Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
gradlew.bat Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00
settings.gradle.kts Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00