Pre-computed embeds from Python already contain codec_sum+text.
Using them as-is works correctly. Once they are exhausted, fall back to
our own codec_sum + pad.
Long text: 191 tokens, 15.28s audio, RTF 1.27
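
A minimal sketch of the selection rule described above, not the actual pipeline code. The function name, the row-per-step layout of the pre-computed embeds, and the use of plain std::vector<float> are assumptions made for illustration. It also assumes that "as-is" means no further codec_sum is added to a pre-computed row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the embedding to feed at generation step `step`.
// `precomputed` holds rows exported from Python that already contain
// codec_sum + text. `codec_sum` is the locally computed codec embedding sum
// and `pad` is the pad embedding; both must have the same hidden size.
std::vector<float> next_trailing_embed(
    const std::vector<std::vector<float>>& precomputed,
    std::size_t step,
    const std::vector<float>& codec_sum,
    const std::vector<float>& pad) {
  assert(codec_sum.size() == pad.size());
  if (step < precomputed.size()) {
    // Pre-computed row is used unchanged (assumed to include codec_sum already).
    return precomputed[step];
  }
  // Pre-computed rows exhausted: fall back to codec_sum + pad.
  std::vector<float> out(codec_sum.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = codec_sum[i] + pad[i];
  }
  return out;
}
```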
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads
- Restored pre-computed embeds format (codec_sum+text from Python)
- Text-only trailing embeds don't work: model needs codec_sum for EOS
For long phrases, pre-computed embeds must be generated from Python.
RTF 1.26 on a short phrase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache input tensor pointers after first prepare_input_tensors call,
then memcpy directly into them for all subsequent steps.
Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls).
Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51
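
The caching idea can be sketched roughly as follows. InputSlot and CachedInputs are hypothetical stand-ins, not ExecuTorch types; the sketch only assumes that the first preparation call yields stable buffers that can be written to again on later steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one prepared input tensor: a buffer that was
// allocated once by the first preparation call and stays valid afterwards.
struct InputSlot {
  float* data;
  std::size_t numel;
};

class CachedInputs {
 public:
  // Called once, right after the first (allocating) preparation call.
  void cache(const std::vector<InputSlot>& slots) { slots_ = slots; }

  bool ready() const { return !slots_.empty(); }

  // Every later step copies new values into the already allocated buffer,
  // so no allocation happens on the hot path.
  void write(std::size_t index, const float* src, std::size_t numel) {
    assert(index < slots_.size());
    assert(numel == slots_[index].numel);
    std::memcpy(slots_[index].data, src, numel * sizeof(float));
  }

 private:
  std::vector<InputSlot> slots_;
};
```

In the loop, the first step would prepare inputs and call cache(); all following steps would call write() and then execute the method.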
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)
Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s audio
Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations
From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.
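
For reference, the RTF figure follows from the totals quoted above, assuming the usual definition of processing time divided by audio duration: 7.3 s / 4.64 s is about 1.57, reported as 1.6. The helper name below is hypothetical.

```cpp
#include <cassert>

// Real-time factor: processing time divided by audio duration.
// Values below 1.0 mean synthesis runs faster than real time.
double real_time_factor(double processing_seconds, double audio_seconds) {
  assert(audio_seconds > 0.0);
  return processing_seconds / audio_seconds;
}
```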
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.
Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.
Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").
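
A sketch of the corrected tail behaviour, under the assumption that tts_eos_embed is fed exactly once after the last trailing text token and tts_pad_embed on every step after that. The function name and the Embed alias are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Embed = std::vector<float>;

// Hypothetical helper: text-side embedding for generation step `step`.
// Steps [0, trailing.size()) use the remaining text token embeddings,
// the next step uses tts_eos_embed, and all later steps use tts_pad_embed.
// The previous behaviour returned zeros for every step past the text.
const Embed& text_embed_for_step(const std::vector<Embed>& trailing,
                                 std::size_t step,
                                 const Embed& tts_eos_embed,
                                 const Embed& tts_pad_embed) {
  assert(tts_eos_embed.size() == tts_pad_embed.size());
  if (step < trailing.size()) return trailing[step];
  if (step == trailing.size()) return tts_eos_embed;
  return tts_pad_embed;
}
```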
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full talker+CP autoregressive loop in C++ via JNI.
Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag)
- Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU
- Added headArgmaxOffset for future batch optimization
- Cancel previous pipeline on new run_pipeline intent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Warmup forward() for talker+CP during init (avoids 7s DSP compilation
on first pipeline run)
- Cancel previous pipeline job before starting new one
- Use Dispatchers.IO for pipeline intent
First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>