kazeia/kazeia-android/app
Kazeia Team 10fd10fd90 TTS: overlap CP↔BigVGAN — first audio 14.5s → 10.9s per segment
Streaming variant of the per-segment decode pipeline. As soon as SEQ_LEN
codes are accumulated from the talker/CP loop, BigVGAN is dispatched on
a background coroutine while the producer keeps generating the rest of
the segment. The BigVGAN consumer feeds a streaming crossfader that
emits stable audio as it arrives and holds back overlapSamples for the
next chunk's blend.
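
The hold-back behaviour described above can be sketched as follows. This is a minimal illustration, not the code in this commit: the class name, the method names and the linear blend are assumptions, and the background coroutine that would feed `push` from BigVGAN is left out so the sketch needs only the standard library.

```kotlin
/** Hypothetical streaming crossfader; names and the linear blend are assumptions. */
class StreamingCrossfader(private val overlapSamples: Int) {
    // Tail of the previous chunk, held back until the next chunk arrives.
    private var held = FloatArray(0)

    /** Accepts one decoded chunk and returns the samples that are now stable. */
    fun push(chunk: FloatArray): FloatArray {
        require(chunk.size >= overlapSamples) { "chunk shorter than overlap" }
        val blended = chunk.copyOf()
        // Crossfade the held tail into the head of the new chunk.
        for (i in held.indices) {
            val w = (i + 1).toFloat() / (held.size + 1)
            blended[i] = held[i] * (1f - w) + chunk[i] * w
        }
        // Everything except the last overlapSamples can be emitted right away.
        val stableLen = blended.size - overlapSamples
        held = blended.copyOfRange(stableLen, blended.size)
        return blended.copyOfRange(0, stableLen)
    }

    /** Flushes the held tail once the last chunk has been pushed. */
    fun finish(): FloatArray {
        val tail = held
        held = FloatArray(0)
        return tail
    }
}
```

In this sketch, K chunks of n samples each yield K * (n - overlapSamples) + overlapSamples output samples, and the first n - overlapSamples of them are available as soon as the first chunk is decoded.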

Mirrors decodeChunked's semantics exactly, so the final audio is
bit-identical except for where fadeOut is applied: it now runs on the
tail of the final emission instead of on the full buffer, so the last
40 ms are still faded.
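
To illustrate why moving the fade does not change the output, here is a small sketch. The helper names and the linear ramp are assumptions, not the project's fadeOut, and the sample rate is a caller-supplied parameter because the commit does not state it.

```kotlin
/** Samples covered by a fade of fadeMs at sampleRate (both supplied by the caller). */
fun fadeSamples(sampleRate: Int, fadeMs: Int = 40): Int = sampleRate * fadeMs / 1000

/** Linearly fades out the last n samples of buf in place (n is clamped to buf.size). */
fun fadeOutTail(buf: FloatArray, n: Int) {
    val len = minOf(n, buf.size)
    val start = buf.size - len
    for (i in 0 until len) {
        // Gain falls from (len - 1) / len down to 0 on the last sample.
        buf[start + i] *= (len - 1 - i).toFloat() / len
    }
}
```

Under this sketch, fading only the final emission gives the same samples as fading the concatenated buffer, provided the final emission is at least as long as the fade. A final emission shorter than the fade would shorten the ramp, which is one place where the two approaches could differ.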

Validated A/B on prompt 3 from the recent benchmark:
  prompt: "Je me sens un peu triste aujourdhui…"
  seg 0 first audio:  14 485 ms → 10 936 ms (−3.5 s)
  end-to-end first audio (LLM trigger → audio): 16.2 s → 12.7 s
  Stream LLM total: 33 234 ms → 28 594 ms (−4.6 s)

Short segments (<SEQ_LEN codes) and the legacy non-streaming callers
(generateSegmentAudioVC, decodeChunked, multi-segment pipelines, etc.)
are untouched. The new path is gated behind USE_STREAMING_DECODE so it
can be reverted by flipping a single const if a regression is found.
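
The gating described above might look roughly like this. Only the names USE_STREAMING_DECODE and SEQ_LEN come from the commit message; the enum, the function, and the numeric value of SEQ_LEN are placeholders chosen for illustration.

```kotlin
// Names taken from the commit message; the values here are placeholders.
const val USE_STREAMING_DECODE = true
const val SEQ_LEN = 4 // hypothetical; the real value is not stated in the commit

/** Hypothetical result type for the dispatch decision. */
enum class DecodePath { STREAMING, LEGACY }

/**
 * Picks a decode path for a segment with codeCount codes.
 * Segments shorter than SEQ_LEN stay on the legacy path, as does
 * everything when the flag is off.
 */
fun chooseDecodePath(codeCount: Int): DecodePath =
    if (USE_STREAMING_DECODE && codeCount >= SEQ_LEN) DecodePath.STREAMING
    else DecodePath.LEGACY
```

Keeping the decision in one place behind a compile-time constant means setting the flag to false sends every segment back to the legacy path without touching any caller.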

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:22:15 +02:00
src/main TTS: overlap CP↔BigVGAN — first audio 14.5s → 10.9s per segment 2026-04-14 16:22:15 +02:00
build.gradle.kts TTS tremor investigation: identify cross-arch numerical floor, gate diag flags 2026-04-13 00:15:14 +02:00
proguard-rules.pro Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch 2026-04-09 08:42:11 +02:00