The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends
Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage
Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: embeds must come from SAME NPU model instance.
Python fp32 embeds cause divergence on NPU fp16 after ~20 steps.
Solution: Java pipeline captures embeds on-device during generation.
Captured embeds work perfectly with C++ pipeline (validated "bon").
- Added capture mode: touch /data/local/tmp/kazeia/capture_mode
- Embeds saved to captured_embeds.bin (same format as pipeline input)
- KV_LEN restored to 100 (KV=64 lost role tokens → quality loss)
- C++ uses pre-computed embeds as-is (no double codec_sum)
Production path: Java pipeline RTF 1.8 for new texts (good quality)
Replay path: C++ pipeline RTF 1.26 with captured embeds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
(original was likely captured on-device, not from Python model.forward)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)
Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s
Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations
From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.
Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.
Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full talker+CP autoregressive loop in C++ via JNI.
Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag)
- Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU
- Added headArgmaxOffset for future batch optimization
- Cancel previous pipeline on new run_pipeline intent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Warmup forward() for talker+CP during init (avoids 7s DSP compilation
on first pipeline run)
- Cancel previous pipeline job before starting new one
- Use Dispatchers.IO for pipeline intent
First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>