Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX
CP backend. First audio arrives at ~20s (vs ~65s for the full phrase
non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning
is preserved per segment because each segment now ships its own full prefill.
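A minimal sketch of the per-segment control flow described above, written in Python for illustration only (the real code is the Kotlin engine). All names here (generate_streaming, run_segment, reset_state, on_segment_ready) are hypothetical stand-ins, not the engine's actual API:

```python
import time

def generate_streaming(segments, run_segment, reset_state, on_segment_ready):
    """Run each (prefill, trailing) segment and hand its audio to the
    callback as soon as it is ready. Returns the concatenated audio and
    the time-to-first-audio in seconds."""
    full = []
    first_audio_s = None
    start = time.monotonic()
    for idx, (prefill, trailing) in enumerate(segments):
        if idx > 0:
            # Reset between segments so this prefill does not see the
            # previous segment's KV state.
            reset_state()
        pcm = run_segment(prefill, trailing, idx)
        if first_audio_s is None:
            first_audio_s = time.monotonic() - start
        on_segment_ready(idx, pcm)
        full.extend(pcm)
    return full, first_audio_s
```

The single-segment path is then just the one-element case of the same loop, which is the point of sharing the generation code between modes.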
Changes:
* Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady)
reads single- or multi-segment embeds, runs prefill + generation + VQ
decode + BigVGAN per segment, and fires the callback with each
segment's ShortArray the moment it's ready. Saves per-segment WAVs
(kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav
for offline inspection. Extracted the common generation loop into
runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and
streaming paths share exactly the same code (no quality drift between
modes). Added hexReset() between segments so segment 2's prefill logits
don't contain segment 1's KV state.
* vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0
it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a
codebook row index — reading past the 2048-row buffer. The short Baer
probe never hit this; longer phrases do. Clamp any out-of-vocab code
to 0 at allCodebooks build time.
* KazeiaService: new stream_pipeline intent extra wires the callback
to an AudioTrack MODE_STREAM instance, writing each segment's audio as
soon as it comes back. Logs time-to-first-audio.
* prepare_tts_segments.py: the previous version only captured 1-token
decode calls and substituted a generic 9-embed "prefill_base" pulled
from an unrelated single-segment file — dropping the per-segment
xvector conditioning AND the text-encoded embeddings, so Hexagon
produced garbled mixed speech for segments 2..N. Now captures the
multi-token prefill call too (like prepare_tts_voiceclone.py) so each
segment is self-contained.
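The vqDecode clamp from the list above can be sketched as follows. This is an illustrative Python version, not the Kotlin code; clamp_codes is a hypothetical name, and the 2048 row count is taken from this message:

```python
CODEBOOK_SIZE = 2048  # rows per codebook, per the commit message

def clamp_codes(codes, codebook_size=CODEBOOK_SIZE):
    """Map any code that is not a valid codebook row index to 0, so a
    sampled special token (e.g. CODEC_EOS) can never index past the
    end of the codebook buffer."""
    return [c if 0 <= c < codebook_size else 0 for c in codes]
```

The check uses `>= codebook_size` rather than `>` because valid row indices stop at codebook_size - 1.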
Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the
Snapdragon 8 Elite with the current config means each segment takes longer
to generate than it takes to play, so audible gaps between segments remain.
Options for removing the gaps: (a) producer/consumer parallelism across
two coroutines, which does not help while RTF stays > 1, or (b) a faster CP
(the ~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN
bug and the .pte path contends with the Hexagon talker on the DSP).
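To make the RTF argument concrete, here is a small simulation of the best case for option (a): a producer that generates segments back to back while a consumer plays them. It is a sketch under stated assumptions (constant RTF, no other overhead); playback_gaps and the segment durations used with it are hypothetical:

```python
def playback_gaps(durations_s, rtf):
    """Return the silent gap (seconds) before each segment after the
    first, assuming generation runs continuously in the background and
    each segment of duration d takes rtf * d seconds to generate."""
    gaps = []
    ready = 0.0       # time at which the current segment finishes generating
    play_end = None   # time at which the previous segment finishes playing
    for d in durations_s:
        ready += rtf * d
        if play_end is None:
            start = ready
        else:
            start = max(ready, play_end)
            gaps.append(start - play_end)
        play_end = start + d
    return gaps
```

With three equal 5 s segments and RTF 4.4, each gap comes out at roughly 17 s even with full overlap; the gaps only reach zero once RTF drops to 1 or below.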
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extensive investigation of the audible "tremor" in the generated voice-cloned
audio. Conclusion is architectural, not a bug:
* Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998
* ONNX Runtime CP V2 matches PyTorch greedy CP almost exactly (0.24%
residual divergence, measured by injecting Python's captured cb0 at each
step: 14/16 codebooks match 100%, cb14 and cb15 each miss 1 token out of 53)
* BigVGAN decoder is bit-identical to PyTorch (validated earlier)
* Therefore the tremor is caused entirely by the ~28% of cb0 steps where
the tiny fp16 logit drift crosses the top-1/top-2 margin and flips the
argmax. Each flip cascades through the autoregressive chain into a
trajectory the model never saw at training time → incoherent artifacts.
The cross-architecture gap (x86 AVX-512 vs ARM64 NEON+HMX) cannot be zeroed
by any runtime swap: LibTorch Android would use NEON kernels with a
different reduction order than PyTorch x86, which is the same class of
error with a smaller but still non-zero residual. Temperature tweaking
(0.3 → 0.9) and greedy-vs-sample gave no perceptual difference: the floor
is numeric, not in the sampling layer.
Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a
thesis-relevant finding about on-device TTS deployment limits.
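The argmax-flip figure above can be measured with a comparison like the one below. This is an illustrative sketch, not the analysis script itself; flip_rate and its inputs (one logits vector per step from each backend) are hypothetical:

```python
def argmax(xs):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

def flip_rate(ref_logits, test_logits):
    """Fraction of steps whose top-1 token differs between a reference
    run and a test run, given one logits vector per step for each."""
    if not ref_logits:
        return 0.0
    flips = sum(argmax(r) != argmax(t) for r, t in zip(ref_logits, test_logits))
    return flips / len(ref_logits)
```

Two logits vectors can correlate almost perfectly and still disagree on the argmax whenever the top-1/top-2 margin is smaller than the numeric drift, which is why a 0.999998 correlation and a high flip rate are compatible.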
Cleanup:
* All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp,
force_python_codes, force_cpu_talker, force_cpu_talker_gguf) are now gated
behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release
builds skip the file checks entirely; debug builds keep the whole
experimental toolchain for re-running the analysis for demos/thesis.
* force_hexagon + force_cp_v2 stay unconditional — production routing.
* Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9).
* Native TTS pipeline (executorch-custom/jni_layer_tts.cpp,
app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to
greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing
audio past EOS where Python's seeded sampler terminated naturally).
* scripts/prepare_tts_voiceclone.py: new script that captures Python
greedy-CP reference (stochastic talker for EOS, deterministic CP) for
token-by-token comparison.
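A token-by-token comparison of this kind can be sketched as below. It assumes each run is stored as a list of steps, each step holding one code per codebook; compare_codes is a hypothetical name and not the script's actual interface:

```python
def compare_codes(ref, test):
    """Compare two [step][codebook] code matrices of equal shape.
    Returns (mismatches per codebook, overall divergence fraction)."""
    n_steps = len(ref)
    n_cb = len(ref[0])
    mismatches = [
        sum(ref[s][cb] != test[s][cb] for s in range(n_steps))
        for cb in range(n_cb)
    ]
    divergence = sum(mismatches) / (n_steps * n_cb)
    return mismatches, divergence
```

For 53 steps and 16 codebooks, one miss each in two codebooks gives 2 / 848, about 0.24%.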
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- prepare_tts_native.py: auto-splits long text at sentence/comma
boundaries, max 15 tokens per segment
- Multi-segment format: each segment gets fresh KV cache
- Formula: target_len = n_tokens × 3.2 + 5 per segment
- Tested on Edouard Baer monologue: 28 segments, 102s audio
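The splitting rule and length formula above can be sketched as follows. This is a simplified illustration, not the contents of prepare_tts_native.py: count_tokens uses whitespace words as a stand-in for the real tokenizer, sentence and comma boundaries are treated alike, a single clause longer than the cap is left unsplit, and rounding target_len to the nearest integer is an assumption:

```python
import re

MAX_TOKENS = 15  # per-segment cap from the commit message

def count_tokens(text):
    # Stand-in only: the real script presumably counts model tokens.
    return len(text.split())

def split_text(text, max_tokens=MAX_TOKENS):
    """Greedily pack clauses (split after . ! ? , ; :) into segments
    that stay within max_tokens where possible."""
    clauses = [c.strip() for c in re.split(r"(?<=[.!?,;:])\s+", text) if c.strip()]
    segments, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)
            current = clause
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments

def target_len(n_tokens):
    """Generation steps for a segment: n_tokens x 3.2 + 5 (rounded)."""
    return round(n_tokens * 3.2 + 5)
```

A full 15-token segment gets a target length of 53 steps under this formula.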
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends
Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage
Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.
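The RMS-based auto-trim can be sketched as below. It is a Python illustration of the idea rather than the C++ code, and it assumes the "15%" rule means cutting after the last frame whose RMS is at least 15% of the peak frame RMS; rms_trim and the 480-sample frame size are hypothetical choices:

```python
import math

def rms_trim(samples, frame=480, ratio=0.15):
    """Drop trailing audio after the last frame whose RMS energy is at
    least `ratio` times the loudest frame's RMS."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    rms = [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return samples  # empty or all-silent input: nothing to trim
    last = max(i for i, r in enumerate(rms) if r >= ratio * peak)
    return samples[:(last + 1) * frame]
```

Trimming on whole frames keeps the cut point stable against single-sample spikes in the trailing noise.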
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
(original was likely captured on-device, not from Python model.forward)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>