Extensive investigation of the audible "tremor" in the generated voice-cloned
audio. Conclusion is architectural, not a bug:
* Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998
* ONNX Runtime CP V2 matches PyTorch greedy CP to within 0.24% (residual
divergence measured by injecting Python's captured cb0 at each step:
14/16 codebooks match 100%, cb14/cb15 each miss 1 token out of 53)
* BigVGAN decoder is bit-identical to PyTorch (validated earlier)
* Therefore the tremor is caused entirely by the ~28% of cb0 argmax picks
that flip when the tiny fp16 logit drift crosses the top-1/top-2 margin.
Each flip cascades through the autoregressive chain into a trajectory
the model never saw at training time → incoherent artifacts.
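The flip mechanism can be illustrated with a minimal Python sketch. The logit
values and the drift vector are made up for illustration, not measured from the
talker; the only point is that the argmax changes once the perturbation exceeds
the top-1/top-2 margin.

```python
def top2_margin(logits):
    """Return (argmax index, gap between the largest and second-largest logit)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[0], logits[order[0]] - logits[order[1]]


def argmax_flips(ref_logits, drift):
    """True if adding the per-element drift changes the argmax of ref_logits."""
    ref_top, _ = top2_margin(ref_logits)
    drifted = [x + d for x, d in zip(ref_logits, drift)]
    new_top, _ = top2_margin(drifted)
    return new_top != ref_top


# Hypothetical values: one confident step and one near-tie step.
confident = [2.0, 5.0, 1.0]
near_tie = [4.999, 5.0, 1.0]
drift = [0.002, -0.002, 0.0]  # illustrative stand-in for fp16 rounding drift

print(argmax_flips(confident, drift))  # margin is far larger than the drift
print(argmax_flips(near_tie, drift))   # margin is smaller than the drift
```

Only the near-tie step flips, which is consistent with a very high logit
correlation coexisting with a sizeable fraction of changed cb0 tokens.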
The cross-architecture gap (x86 AVX-512 vs. ARM64 NEON+HMX) cannot be zeroed
by any runtime swap: LibTorch Android would use NEON kernels with a different
reduction order than PyTorch x86, giving the same class of error with a
smaller but non-zero residual. Temperature tweaking (0.3 → 0.9) and
greedy-vs-sample gave no perceptual difference: the floor is numeric, not in
the sampling layer.
Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a
thesis-relevant finding about on-device TTS deployment limits.
Cleanup:
* All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp,
force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated
behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release
builds JIT-eliminate the file checks; debug builds keep the whole
experimental toolchain for re-running the analysis for demos/thesis.
* force_hexagon + force_cp_v2 stay unconditional — production routing.
* Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9).
* Native TTS pipeline (executorch-custom/jni_layer_tts.cpp,
app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to
greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing
audio past EOS where Python's seeded sampler terminated naturally).
* scripts/prepare_tts_voiceclone.py: new script that captures Python
greedy-CP reference (stochastic talker for EOS, deterministic CP) for
token-by-token comparison.
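A rough Python sketch of why greedy argmax gives EOS "a fair chance" in the pad
zone. The logits, the EOS token id and k=3 are hypothetical; only the
temperature of 0.9 comes from the notes above, and the native code may
implement top-k differently.

```python
import math
import random


def greedy(logits):
    """Deterministic choice: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, k, temperature, rng):
    """Sample from the k largest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # stable softmax numerators
    return rng.choices(top, weights=weights, k=1)[0]


EOS = 3  # hypothetical token id
logits = [1.0, 1.2, 0.8, 1.5]  # EOS leads, but only narrowly

rng = random.Random(0)
draws = [sample_top_k(logits, k=3, temperature=0.9, rng=rng) for _ in range(1000)]
missed = sum(1 for t in draws if t != EOS)
print(greedy(logits) == EOS, missed)
```

With a narrow lead, the sampler skips EOS in a large share of draws while the
greedy pick always takes it, so generation stops at the first step where EOS
is the top logit.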
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- prepare_tts_native.py: auto-splits long text at sentence/comma
boundaries, max 15 tokens per segment
- Multi-segment format: each segment gets fresh KV cache
- Formula: target_len = n_tokens × 3.2 + 5 per segment
- Tested on Edouard Baer monologue: 28 segments, 102s audio
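A minimal Python sketch of the splitting and length formula described above.
It is not the code in prepare_tts_native.py: the word-count tokenizer, the
greedy packing strategy and the rounding up of target_len are all assumptions
made for illustration.

```python
import math
import re

MAX_TOKENS = 15  # per-segment cap from the notes above


def count_words(text):
    """Stand-in tokenizer (assumption): whitespace-separated words."""
    return len(text.split())


def split_text(text, count_tokens=count_words, max_tokens=MAX_TOKENS):
    """Greedily pack sentence/comma-delimited chunks into segments.

    A single chunk longer than max_tokens is kept whole, not split further.
    """
    # Split on whitespace that follows ., !, ? or a comma, keeping the punctuation.
    chunks = [c for c in re.split(r"(?<=[.!?,])\s+", text.strip()) if c]
    segments, current = [], ""
    for chunk in chunks:
        candidate = f"{current} {chunk}".strip()
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def target_len(n_tokens):
    """Step budget per segment: n_tokens * 3.2 + 5 (rounding up is an assumption)."""
    return math.ceil(n_tokens * 32 / 10) + 5


demo = "one two three, four five six. seven eight nine ten, eleven twelve."
for seg in split_text(demo, max_tokens=6):
    print(target_len(count_words(seg)), seg)
```

Under these assumptions a full 15-token segment gets a budget of 53 steps.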
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends
Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage
Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.
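The RMS-based trim could look roughly like the Python sketch below. It reads
the 15% figure as "cut after the last frame whose RMS is at least 15% of the
peak frame RMS", which is one interpretation of the note above; the frame
length of 480 samples is hypothetical.

```python
import math


def frame_rms(samples, frame_len):
    """RMS of each non-overlapping frame (a trailing partial frame is included)."""
    out = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return out


def trim_trailing(samples, frame_len=480, ratio=0.15):
    """Drop everything after the last frame whose RMS reaches ratio * peak RMS."""
    rms = frame_rms(samples, frame_len)
    if not rms or max(rms) == 0.0:
        return samples  # empty or silent input: nothing to trim against
    threshold = ratio * max(rms)
    last_loud = max(i for i, r in enumerate(rms) if r >= threshold)
    return samples[:(last_loud + 1) * frame_len]


speech = [0.5, -0.5] * 480    # two loud frames
noise = [0.01, -0.01] * 480   # two quiet trailing frames
trimmed = trim_trailing(speech + noise)
print(len(speech + noise), len(trimmed))
```

Searching for the last loud frame, not the first quiet one, keeps pauses
inside the utterance from being mistaken for the end of speech.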
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: embeds must come from SAME NPU model instance.
Python fp32 embeds cause divergence on NPU fp16 after ~20 steps.
Solution: Java pipeline captures embeds on-device during generation.
Captured embeds work perfectly with C++ pipeline (validated "bon").
- Added capture mode: touch /data/local/tmp/kazeia/capture_mode
- Embeds saved to captured_embeds.bin (same format as pipeline input)
- KV_LEN restored to 100 (KV=64 lost role tokens → quality loss)
- C++ uses pre-computed embeds as-is (no double codec_sum)
Production path: Java pipeline RTF 1.8 for new texts (good quality)
Replay path: C++ pipeline RTF 1.26 with captured embeds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
(original was likely captured on-device, not from Python model.forward)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-computed embeds from Python already contain codec_sum+text.
Using them as-is works correctly. Once they are exhausted, fall back to
our codec_sum + pad.
Long text: 191 tokens, 15.28s audio, RTF 1.27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads
- Restored pre-computed embeds format (codec_sum+text from Python)
- Text-only trailing embeds don't work: model needs codec_sum for EOS
For long phrases, pre-computed embeds must be generated from Python.
RTF 1.26 on short phrase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache input tensor pointers after first prepare_input_tensors call,
then memcpy directly into them for all subsequent steps.
Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls).
Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)
Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s of audio
Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations
From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.
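For reference, the RTF figure follows from the pipeline numbers above, assuming
RTF here means total processing time divided by audio duration (the definition
is not spelled out in these notes, but it matches the reported values):

```python
def rtf(processing_s, audio_s):
    """Real-time factor: seconds of compute per second of generated audio."""
    return processing_s / audio_s


AUDIO_S = 4.64      # duration of the test utterance, from the notes above
PIPELINE_S = 7.3    # total pipeline time, from the notes above

print(round(rtf(PIPELINE_S, AUDIO_S), 2))  # 1.57, reported as 1.6
```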
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.
Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.
Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").
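The corrected ordering can be sketched in Python as follows. The function name,
the one-element vectors and the choice of a single EOS embed followed by
repeated pad embeds are illustrative assumptions; the native pipeline works on
full embedding vectors in C++.

```python
def trailing_embeds(text_embeds, n_steps, eos_embed, pad_embed):
    """Per-step trailing inputs: text embeds 1:1, then one EOS embed, then pad embeds."""
    seq = list(text_embeds[:n_steps])
    if len(seq) < n_steps:
        seq.append(eos_embed)      # the fix: not a zero vector
    while len(seq) < n_steps:
        seq.append(pad_embed)      # the fix: not a zero vector
    return seq


# Hypothetical one-dimensional "embeddings" to make the order visible.
text = [[1.0], [2.0]]
tts_eos_embed = [9.0]
tts_pad_embed = [0.5]

print(trailing_embeds(text, 5, tts_eos_embed, tts_pad_embed))
```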
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full talker+CP autoregressive loop in C++ via JNI.
Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag)
- Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU
- Added headArgmaxOffset for future batch optimization
- Cancel previous pipeline on new run_pipeline intent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Warmup forward() for talker+CP during init (avoids 7s DSP compilation
on first pipeline run)
- Cancel previous pipeline job before starting new one
- Use Dispatchers.IO for pipeline intent
First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>