kazeia

Commit Graph


Author

SHA1

Message

Date

Kazeia Team

5e416713ce

TTS Stage 1 streaming: play each segment the moment it's decoded

Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX
CP backend. First audio arrives at ~20s (vs ~65s for the full phrase
non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning
is preserved per segment because each segment now ships its own full prefill.
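
As a rough illustration of the per-segment streaming idea described above, the
sketch below generates one segment at a time, hands its audio to a callback
immediately, and clears state between segments. It is not the Kotlin
implementation: stream_segments, generate, reset and on_segment_ready are
hypothetical names, and the stand-in generator only fabricates sample lists.

```python
def stream_segments(segments, generate, reset, on_segment_ready):
    """Generate each segment in order and deliver its audio as soon as it exists.

    All four parameters are hypothetical stand-ins for illustration:
    generate(seg, idx) returns a list of samples, reset() clears any
    per-segment state, on_segment_ready(idx, audio) consumes the audio.
    """
    full = []
    for idx, seg in enumerate(segments):
        if idx > 0:
            reset()  # start each later segment from a clean state
        audio = generate(seg, idx)
        on_segment_ready(idx, audio)  # fire before the next segment starts
        full.extend(audio)            # keep a concatenated copy for inspection
    return full


# Stand-in generator: segment value n yields n samples equal to the index.
events, resets = [], []
full = stream_segments(
    [2, 1, 3],
    generate=lambda seg, idx: [idx] * seg,
    reset=lambda: resets.append(1),
    on_segment_ready=lambda idx, audio: events.append((idx, audio)),
)
```

Because the callback runs inside the loop, the first segment can start playing
while later segments are still being generated.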

Changes:

  * Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady)
    reads single- or multi-segment embeds, runs prefill + generation + VQ
    decode + BigVGAN per segment, and fires the callback with each
    segment's ShortArray the moment it's ready. Saves per-segment WAVs
    (kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav
    for offline inspection. Extracted the common generation loop into
    runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and
    streaming paths share exactly the same code (no quality drift between
    modes). Added hexReset() between segments so segment 2's prefill logits
    don't contain segment 1's KV state.

  * vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0
    it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a
    codebook row index — reading past the 2048-row buffer. The short Baer
    probe never hit this; longer phrases do. Clamp any out-of-vocab code
    to 0 at allCodebooks build time.

  * KazeiaService: new stream_pipeline intent extra wires the callback
    to an AudioTrack MODE_STREAM instance, writing each segment's audio as
    soon as it comes back. Logs time-to-first-audio.

  * prepare_tts_segments.py: the previous version only captured 1-token
    decode calls and substituted a generic 9-embed "prefill_base" pulled
    from an unrelated single-segment file — dropping the per-segment
    xvector conditioning AND the text-encoded embeddings, so Hexagon
    produced garbled mixed speech for segments 2..N. Now captures the
    multi-token prefill call too (like prepare_tts_voiceclone.py) so each
    segment is self-contained.
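
The vqDecode fix above amounts to clamping codes before they are used as row
indices. A minimal sketch, assuming a 2048-row codebook (taken from the
"2048-row buffer" wording) and a hypothetical helper name clamp_codes; the
value 2150 is only a placeholder for an out-of-vocab id, not the real
CODEC_EOS id:

```python
CODEBOOK_SIZE = 2048  # assumed from the "2048-row buffer" in the commit message


def clamp_codes(codes, codebook_size=CODEBOOK_SIZE):
    """Replace any code outside [0, codebook_size) with 0."""
    return [c if 0 <= c < codebook_size else 0 for c in codes]


# 2150 is a placeholder out-of-vocab id (hypothetical value).
frame = [17, 2047, 2150, 5]
print(clamp_codes(frame))  # → [17, 2047, 0, 5]
```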

Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the
Snapdragon 8 Elite with current config means each segment takes longer to
generate than it takes to play, so audible gaps between segments remain.
Removing the gaps requires either (a) producer/consumer parallelism across
two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the
~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug
and the .pte path contends with Hexagon talker on the DSP).
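
The gap arithmetic behind this limitation can be sketched with an idealized
timeline. The model below is an assumption, not a measurement: segments are
generated back to back, generating d seconds of audio takes rtf * d seconds,
and a segment starts playing once it is ready and the previous one has
finished. The 5-second segment durations are hypothetical.

```python
def playback_gaps(durations, rtf):
    """Silent gap in seconds before each segment after the first.

    Idealized model (an assumption): generation runs back to back and costs
    rtf * duration per segment; playback of a segment begins when it is
    ready and the previous segment has finished playing.
    """
    gaps = []
    ready = 0.0     # time at which the current segment finishes generating
    play_end = 0.0  # time at which the previous segment finishes playing
    for i, d in enumerate(durations):
        ready += rtf * d
        start = max(ready, play_end)
        if i > 0:
            gaps.append(start - play_end)
        play_end = start + d
    return gaps


slow = playback_gaps([5.0, 5.0, 5.0], rtf=4.4)  # generation slower than playback
fast = playback_gaps([5.0, 5.0, 5.0], rtf=0.8)  # generation faster than playback
```

Under this model, equal-length segments at RTF > 1 leave a gap of about
(RTF - 1) times the segment duration even with perfect overlap, which is why
parallelism alone does not remove the gaps.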

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-13 08:43:30 +02:00

Kazeia Team

3dcf73aa38

Restore KV=100 + fix as-is embeds + multi-segment support

- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
  (original was likely captured on-device, not from Python model.forward)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-09 22:26:20 +02:00

Kazeia Team

10a3904d7d

Multi-segment TTS for long text: split → generate → concatenate

- prepare_tts_segments.py: splits text at sentence boundaries,
  generates Python pre-computed embeds per segment
- Kotlin: detects multi-segment file format, processes each segment
  independently (fresh KV cache), concatenates audio
- Long text tested: 3 segments, 335 tokens, 26.8s audio, RTF 1.67

File format: n_segments, then per segment: nPrefill, nTotal, embeds[]
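
A minimal reader for a layout like this might look as follows. Everything
about the encoding is an assumption for illustration: little-endian int32
header fields, float32 embeddings, nTotal * dim values per segment, and the
names read_segments, write_segments and dim are hypothetical. The real file
may differ.

```python
import struct


def write_segments(segments, dim):
    """Serialize [(n_prefill, flat_embeds), ...] using the assumed layout."""
    out = struct.pack("<i", len(segments))
    for n_prefill, embeds in segments:
        n_total = len(embeds) // dim
        out += struct.pack("<ii", n_prefill, n_total)
        out += struct.pack(f"<{len(embeds)}f", *embeds)
    return out


def read_segments(data, dim):
    """Parse n_segments, then per segment: nPrefill, nTotal, embeds[]."""
    offset = 0
    (n_segments,) = struct.unpack_from("<i", data, offset)
    offset += 4
    segments = []
    for _ in range(n_segments):
        n_prefill, n_total = struct.unpack_from("<ii", data, offset)
        offset += 8
        count = n_total * dim  # assumed: one dim-sized vector per position
        embeds = list(struct.unpack_from(f"<{count}f", data, offset))
        offset += 4 * count
        segments.append((n_prefill, n_total, embeds))
    return segments


# Round trip with a tiny hypothetical embedding width of 4.
blob = write_segments([(2, [0.5] * 12), (1, [1.0] * 8)], dim=4)
parsed = read_segments(blob, dim=4)
```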

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-09 14:34:05 +02:00

3 Commits