The root cause was process-credential loss across fork+exec, not the QNN
SDK version mismatch I had hypothesized. Switching the LLM to in-process
ExecuTorch LlmModule (Zygote-forked context, accepted by adsprpcd's
FastRPC credential check) eliminated the su requirement.
The original investigation sections are kept verbatim for reference; the
new section 10 documents the actual fix, the patches applied to ExecuTorch,
the metrics validated end-to-end, and pointers to the project memory entry.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Even with /no_think in the system prompt, Qwen3 still emits an empty
<think>…</think> wrapper before the real answer. Without filtering, the
SentenceStreamer treats '<think>' as a sentence boundary and feeds three
tokens of XML into the TTS, producing audible artifacts at the start of
each reply.
The new in-callback filter buffers a small lookahead (just enough to span
"</think>"), suppresses everything between the open and close tags, and
flushes the surrounding prose to onToken in order. With the lookahead, tags
that arrive split across decoded pieces ("<thi"+"nk>") still match.
Validated end-to-end: prompt 'Bonjour, comment vas-tu ?' now streams
sentence-by-sentence to the TTS — first segment "Bonjour !" reaches the
talker at 4.6 s, no <think> sneak-through.
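A minimal Python sketch of the lookahead filter described above, not the
Kotlin callback itself. The names ThinkFilter, feed and flush are
hypothetical; only the behaviour (suppress everything between the tags,
tolerate tags split across decoded pieces) comes from this commit.

```python
class ThinkFilter:
    """Drops <think>...</think> spans from a stream of decoded pieces."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self, on_token):
        self.on_token = on_token  # receives the surviving prose, in order
        self.buf = ""             # lookahead: may hold a partial tag
        self.inside = False       # True while between OPEN and CLOSE

    def _partial_len(self, tag):
        # Length of the longest buffer suffix that is a proper prefix of tag.
        for n in range(min(len(tag) - 1, len(self.buf)), 0, -1):
            if self.buf.endswith(tag[:n]):
                return n
        return 0

    def feed(self, piece):
        self.buf += piece
        while True:
            tag = self.CLOSE if self.inside else self.OPEN
            idx = self.buf.find(tag)
            if idx >= 0:
                if not self.inside and idx > 0:
                    self.on_token(self.buf[:idx])
                self.buf = self.buf[idx + len(tag):]
                self.inside = not self.inside
                continue
            # No complete tag: hold back only what could still become one.
            cut = len(self.buf) - self._partial_len(tag)
            head, self.buf = self.buf[:cut], self.buf[cut:]
            if head and not self.inside:
                self.on_token(head)
            return

    def flush(self):
        # End of generation: whatever is left outside a think span is prose.
        if self.buf and not self.inside:
            self.on_token(self.buf)
        self.buf = ""
```

Holding back at most len("</think>") - 1 characters keeps the added latency
negligible while still matching "<thi" + "nk>".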
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End-to-end validation on OnePlus Pad 3 with stream_llm intent:
Prompt: 'Bonjour, comment vas-tu ?'
Response: 'Bonjour ! Je suis là pour t'écouter. Comment vas-tu aujourd'hui ?'
TTS: Talker(PTE) 37ms/step, CP(PTE) 73ms/step, audio synthesized.
No su, no Magisk prompts.
Two fixes since the previous commit:
1. ExecuTorchLlmEngine: pass echo=false to LlmModule.generate() — by default
the runner echoes the prompt tokens back via the callback, which fed the
ChatML wrap (<|im_start|>user …) into the SentenceStreamer and TTS.
2. jni_layer_llama.cpp: pick Runner<uint8_t> vs Runner<uint16_t> based on the
model's get_kv_io_bit_width metadata, mirroring qnn_llama_runner.cpp main().
The hard-coded uint16_t was wrong for our Qwen3-4B export (which uses 8-bit
KV I/O) and produced fluent-looking but completely random tokens
("blocked罩ug darkestSOLEQuotes作者本人 …") — same symptom whether greedy or
sampled, the smoking gun for a width-mismatched KV cache reinterpretation.
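To illustrate why a width mismatch yields garbage regardless of sampling,
here is a small Python sketch that reads an 8-bit buffer as 16-bit values.
It is an illustration only: the helper names are hypothetical and the real
selection happens in C++ from the get_kv_io_bit_width metadata.

```python
import struct


def reinterpret_u8_as_u16(kv_bytes):
    """Read a buffer of 8-bit entries as little-endian 16-bit entries."""
    count = len(kv_bytes) // 2
    return list(struct.unpack("<%dH" % count, kv_bytes[:count * 2]))


def kv_element_type(bit_width):
    """Pick the element type from the exported bit width (hypothetical helper)."""
    if bit_width == 8:
        return "uint8_t"
    if bit_width == 16:
        return "uint16_t"
    raise ValueError("unsupported KV I/O bit width: %d" % bit_width)


kv_u8 = bytes([10, 20, 30, 40])       # four valid 8-bit cache entries
kv_u16 = reinterpret_u8_as_u16(kv_u8)  # two unrelated 16-bit values
```

Each pair of neighbouring 8-bit entries fuses into one value, so every
attention read sees numbers the model never wrote.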
Other tweaks:
- temperature=0.0 in the QNN_LLAMA branch of jni_layer_llama.cpp (greedy,
matches the working qnn_llama_runner --temperature 0 invocation)
- shared_buffer=true (same as binary defaults)
- Kotlin chat template mirrors qnn_llama_runner.cpp's get_formatted_prompt for
Qwen3 (user-first, then optional system, then "<|im_start|>assistant" with
no trailing newline — that quirky ordering is what the .pte was trained on)
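A hedged Python sketch of the prompt ordering mentioned in the last tweak.
Only the block order and the missing trailing newline come from this commit;
the "<|im_end|>" delimiters and the function name build_prompt are
assumptions based on standard ChatML, not a copy of get_formatted_prompt.

```python
def build_prompt(user, system=None):
    """User block first, optional system block second, open assistant turn last."""
    # Assumption: standard ChatML delimiters around each block.
    parts = ["<|im_start|>user\n" + user + "<|im_end|>\n"]
    if system:
        parts.append("<|im_start|>system\n" + system + "<|im_end|>\n")
    # No trailing newline after the assistant tag.
    parts.append("<|im_start|>assistant")
    return "".join(parts)
```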
TTFT is ~4 s for a 77-token prompt in kv-only mode (sequential prefill, one
forward per token). To get a sub-second TTFT we'd need to re-export the model
with --model_mode hybrid, which adds a parallel prefill_forward graph; not
required for the conversational use case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root cause of the previous su-c requirement was that Qualcomm's FastRPC
kernel driver rejects processes spawned via ProcessBuilder fork+exec because
they lose supplementary GIDs on exec. Zygote-forked app processes retain the
proper init-configured credentials and are accepted by the adsprpcd service,
which is why ORT-QNN (Whisper, in-process) worked while the subprocess
qnn_llama_runner did not. Running the LLM in-process via ExecuTorch's
LlmModule bypasses the fork+exec path entirely.
What this commit does:
- ExecuTorchLlmEngine now uses org.pytorch.executorch.extension.llm.LlmModule
with MODEL_TYPE_QNN_LLAMA=4 (routes to example::Runner in jni_layer_llama.cpp,
the same C++ runner that qnn_llama_runner embeds).
- All su, ProcessBuilder, file-based prompt/response plumbing, and run_llm.sh
are gone. ChatML template is built in Kotlin; tokens stream in via LlmCallback.
Supporting changes under executorch-patches/llm_in_process_jni.patch:
1. backends/qualcomm/CMakeLists.txt — gate PyQnnManagerAdaptor on NOT ANDROID.
The original guard (CMAKE_SYSTEM_PROCESSOR MATCHES x86_64) misfires in a
nested scope during Android cross-compile and tries to build the host
Python bindings.
2. extension/android/jni/jni_layer_llama.cpp — hardcode decoder_model="qwen3"
(was "llama3") and pass eval_mode=0 (EvalMode::kKVCached) + shared_buffer=true
to match our hybrid_llama_qnn.pte which only contains kv_forward, not
prefill_forward.
Build: scripts/build_android_library.sh arm64-v8a with QNN_SDK_ROOT pointing
to /opt/Kazeia/qnn_sdk_242/qairt/2.42.0.251225 and EXECUTORCH_BUILD_QNN=ON.
Produces libexecutorch_jni.so (192 MB) with QNN v2.42 backend + the llama
runner code, plus libqnn_executorch_backend.so. Both staged in jniLibs.
Validated on OnePlus Pad 3: LlmModule.load() completes in 4.2 s, no su
prompts, Pipeline ready with STT(WhisperHybridEngine) → [VoiceCommands →
LLM] → TTS(Qwen3TtsEngine). TTS .pte still loads with the upgraded v2.42
runtime — no regression.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backup commit before attempting to unify on QNN SDK v2.37 and to remove
su -c for the LLM. Current working state:
- LLM Qwen3-4B via su -c qnn_llama_runner (v2.42 in /data/local/tmp/kazeia-et/)
- TTS talker + CP via ExecuTorch .pte JNI (v2.31 in jniLibs)
- STT Whisper via ORT-QNN 1.24.3
The report kazeia-no-root-report.md documents the no-root attempts and their
failures in detail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ExecuTorchLlmEngine: system prompt forces French, 1-2 short sentences,
/no_think so the full budget goes to the answer (Qwen3 was consuming
120+ tokens on <think>); eval_mode 0 matches our kv-mode export.
- Qwen3TtsEngine.generateSegmentAudioVC: when the Hexagon talker socket
isn't open, fall back to runInterleavedPteFromEmbeds so the Stage 3
streaming session still produces audio. Without this the session opened,
accepted sentences, and silently emitted empty PCM.
Documents the QNN SDK version-skew pitfall in ExecuTorchLlmEngine.kt
ahead of the upcoming migration to a unified v2.42 toolchain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ExecuTorchLlmEngine: eval_mode 0 (our .pte is kv-mode, not hybrid)
- KazeiaService: call llm.load() after TTS init; try/catch falls back
to echo mode if the runner or .pte are missing.
Pipeline on device: STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine).
Validated on OnePlus Pad 3: LLM ready in ~8 s, gen 21.3 tok/s, RSS 1.76 GB in the
qnn_llama_runner subprocess (out-of-process from the Kazeia app).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds executorch-patches/ with the local modifications to /opt/Kazeia/executorch
(upstream pytorch/executorch v1.2.0) required to export Qwen3-4B to QNN for the
OnePlus Pad 3 Hexagon V79. Tablet runs 18.2 tok/s (gen), TTFT 0.9 s, RSS 1.76 GB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the fixed maxGen + length-based boost with a fully dynamic
end-of-utterance detector that watches the model's own EOS logit rank.
End result on the Baer 3-segment monologue, validated by user as
"FORMIDABLE" / "impeccable" with both Damien and Zelda voices:
- All 3 segments terminate via EOS (no maxGen cap hit)
- No "page beg beg" filler tail
- No abrupt cuts between segments
- Audio durations 5-8 s per segment, matching Python within ~10 %
How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt):
1. At every decode step, compute the rank of CODEC_EOS in the
repetition-penalised logits. Mid-utterance the rank sits at
150-700 (model is committed to producing speech). Approaching
the natural end, the rank dips toward top-50.
2. Arm the boost only when EOS rank stays below eosRankTrigger=60
for THREE consecutive steps. The 3-step requirement filters out
transient single-step dips that occur during low-energy phonemes
mid-sentence (without it, short sentences would terminate after
~3 s). Arming is also gated by eosBoostMinStep (50 % of expected
speech length) so we never arm in the very first frames.
3. Once armed, the boost increments monotonically: each subsequent
step adds boostStepsActive * eosBoostScale to the EOS logit. The
accumulated boost lifts EOS above top-1 within 1-3 steps, the
argmax check fires, and the loop breaks. Scale=4 gives the model
a small natural decay before termination; scale=5 was perfect-but-
slightly-clipping, scale=3 wasn't strong enough to outpace the
growing top-1 logit.
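The three steps above can be sketched in Python as follows. This is an
illustration under assumptions, not the Kotlin in runHexGenWithPrefill: the
class and method names are hypothetical, the streak is counted only once the
min-step gate is passed, and the boost starts on the step after arming.

```python
def eos_rank(logits, eos_id):
    """0-based rank of the EOS logit (0 means EOS is the argmax)."""
    return sum(1 for v in logits if v > logits[eos_id])


class EosBoost:
    def __init__(self, rank_trigger=60, consecutive=3, min_step=0, scale=4.0):
        self.rank_trigger = rank_trigger  # eosRankTrigger
        self.consecutive = consecutive    # steps the rank must stay low
        self.min_step = min_step          # eosBoostMinStep
        self.scale = scale                # eosBoostScale
        self.streak = 0
        self.armed = False
        self.active_steps = 0             # boostStepsActive

    def apply(self, step, logits, eos_id):
        """Return a copy of logits with this step's EOS boost applied."""
        out = list(logits)
        if self.armed:
            # Monotonic boost: 1*scale, 2*scale, 3*scale, ...
            self.active_steps += 1
            out[eos_id] += self.active_steps * self.scale
            return out
        low = step >= self.min_step and eos_rank(logits, eos_id) < self.rank_trigger
        self.streak = self.streak + 1 if low else 0  # a single miss resets
        if self.streak >= self.consecutive:
            self.armed = True
        return out
```

Resetting the streak on any miss is what filters the transient single-step
dips; the growing multiplier is what lets EOS overtake a rising top-1 logit.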
Other tweaks bundled in this commit because they all contribute to
the clean output:
* Inter-segment gap 120 → 250 ms — gives the listener a perceived
sentence boundary instead of a hard concatenation.
* fadeOut(audio, 40) on every segment — cosine roll-off over the
last 40 ms so the EOS-clipped tail decays naturally instead of
sample-clipping.
* top_k 50 → 200 in the fallback sample call — wider pool to keep
EOS reachable when the boost just fails to hit argmax.
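A minimal Python sketch of a cosine fade-out like the one applied to each
segment. The function name, the raised-cosine gain curve and the 24 kHz
default sample rate are assumptions for illustration; the Kotlin fadeOut
may differ in detail.

```python
import math


def fade_out(samples, fade_ms, sample_rate=24000):
    """Apply a raised-cosine roll-off to the last fade_ms of 16-bit PCM."""
    n = min(len(samples), sample_rate * fade_ms // 1000)
    out = list(samples)
    start = len(out) - n
    for i in range(n):
        # Gain falls smoothly from just under 1.0 to exactly 0.0.
        gain = 0.5 * (1.0 + math.cos(math.pi * (i + 1) / n))
        out[start + i] = int(out[start + i] * gain)
    return out
```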
Voice swap is a 45 KB file push (damien_voice_prefix.bin and
damien_voice_suffix.bin). Successfully tested today with Elodie
(female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36)
as the baseline — same Kotlin code, no rebuild needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two small changes:
* export_tts_text_embeddings.py now takes the voice wav as an optional
second CLI arg (defaults to damien_15s_24k.wav). Lets the same script
capture voice-prefix+suffix for any speaker wav without editing the
source — used today to test Elodie alongside Damien.
* synthesizeTextStreaming + generateSegmentAudioVC only run the
trimTailLowEnergy trim when n >= maxGen. The trim's 35%-of-peak
threshold is tuned to catch "page beg beg" filler after the talker
fails to emit EOS — but it was cutting valid speech when EOS fired
early (observed on Elodie seg 1: 10.08 s → 2.92 s, a 4-second over-
trim). With the guard it's a no-op on converging generations and
only fires on the ~15% of segments that hit maxGen.
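The guard in the second change can be sketched like this in Python. The
frame-RMS trim is an assumed stand-in for trimTailLowEnergy (only the
35%-of-peak threshold and the n >= maxGen condition come from this commit);
all function names are hypothetical.

```python
import math


def frame_rms(samples, frame):
    """RMS energy of consecutive frames of `frame` samples."""
    values = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        values.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return values


def trim_tail_low_energy(samples, frame=480, threshold=0.35):
    """Cut everything after the last frame at or above threshold * peak RMS."""
    rms = frame_rms(samples, frame)
    if not rms:
        return list(samples)
    peak = max(rms)
    last = max(i for i, r in enumerate(rms) if r >= threshold * peak)
    return list(samples[:(last + 1) * frame])


def maybe_trim(samples, n_tokens, max_gen, frame=480):
    """Only trim when generation hit the cap, i.e. EOS never fired."""
    if n_tokens >= max_gen:
        return trim_tail_low_energy(samples, frame=frame)
    return list(samples)
```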
Validation after the fix (Elodie, Baer monologue):
- seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut,
the filler tail)
- seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is
- seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is
Voice prefix/suffix shape is speaker-invariant except position 7 (the
xvector). Confirmed by capturing both Damien and Elodie and diffing:
positions 0-6 and 8 identical within 1e-8, suffix identical within
1e-8, only pos 7 has a different xvector embedding (norm 10.36 vs 10.12).
That means swapping speakers on-device is a 45 KB file push — no app
rebuild, no re-export of the 297 MB vocabulary table.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the four per-sentence TTS entry points (pipeline.speak, REPEAT
voice command, echo-mode TTS, LLM-response TTS) with a single shared
pipeline.speakText() that:
* opens a Qwen3TtsEngine streaming session when the TTS backend is
Qwen3 (voice-cloning path);
* feeds the whole response through a SentenceStreamer so the first
sentence starts playing as soon as it's decoded;
* falls back to the old one-shot synthesizeAndPlay for non-Qwen3 TTS
engines (AndroidTts, Chatterbox) that don't expose a session API.
KazeiaPipeline.speakText is now public so KazeiaService can use the
same dispatch — previously each call site re-implemented the
"streaming-or-fallback" logic or just called synthesizeAndPlay and
waited for the full synthesis.
Enabling the real on-device LLM is a separate task (task #48): the
existing llama-cli binary has ggml-hexagon linked in and fails to
init the DSP (0x80000406) when the TTS Hexagon runners hold the
session. Needs either a CPU-only llama-cli build or the restored
ExecuTorch qnn_llama_runner setup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the loop on on-device conversational TTS. The LLM's token stream is
now consumed by a SentenceStreamer which fires a callback the moment a
terminal-punctuation boundary appears; each sentence is enqueued to a
persistent TTS streaming session that generates and plays audio through a
single shared AudioTrack. Sentence N's audio plays while sentence N+1 is
being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and
no "wait for full response before hearing anything".
Mocked-LLM validation on the 3-sentence prompt:
"Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui."
- First sentence detected: 1 ms
- Seg 0 prefill (Hex): 567 ms
- Seg 0 generated: 4 200 ms (18 tokens, 1.4 s audio)
- Seg 1 generated: 9 100 ms (42 tokens)
- Seg 2 generated: 11 000 ms (46 tokens)
- Session closed: 33 500 ms (all audio drained)
Changes:
* tts/SentenceStreamer.kt — 50-line helper that buffers tokens and
fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4
so "Oui." / "Bonjour." count as real sentences; higher thresholds
swallowed conversational openers into the next segment and delayed
first audio. flush() for the final partial sentence.
* Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession
triplet. startStreamingSession opens a 30-second MODE_STREAM
AudioTrack plus a background worker coroutine that pulls sentences
from an unlimited Channel. enqueueSentence is non-blocking; the worker
serialises generation so audio order matches enqueue order.
generateSegmentAudioVC is the per-sentence body (tokenize → prefill
build → Hexagon gen → decode) without the WAV-save side effects that
the /stream_text intent path does.
* KazeiaService new intents:
- stream_llm : real LLM path (needs LLM loaded; currently the
debug build runs echo-mode so this path is
shipped but requires production config to
exercise).
- stream_llm_mock : fakes the LLM stream by splitting the given
text on spaces with 50 ms per "token" —
matches the ~20 tok/s rate the on-device LLM
produces and lets Stage 3 be validated without
flipping the LLM on.
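A minimal Python sketch of the sentence-boundary logic described for
SentenceStreamer.kt. It mirrors the stated behaviour (terminators ". ! ? ;"
and newline, minChars = 4, flush for the tail) but is not a port of the
Kotlin file; method names other than flush are assumptions.

```python
class SentenceStreamer:
    TERMINATORS = ".!?;\n"

    def __init__(self, on_sentence, min_chars=4):
        self.on_sentence = on_sentence
        self.min_chars = min_chars
        self.buf = ""

    def on_token(self, token):
        for ch in token:
            self.buf += ch
            # Fire only when the pending text is long enough to be a sentence.
            if ch in self.TERMINATORS and len(self.buf.strip()) >= self.min_chars:
                self._emit()

    def flush(self):
        """Emit the final partial sentence, if any."""
        if self.buf.strip():
            self._emit()
        self.buf = ""

    def _emit(self):
        sentence, self.buf = self.buf.strip(), ""
        self.on_sentence(sentence)
```

Feeding it space-split words, as stream_llm_mock does, is enough to exercise
the boundary detection without a loaded LLM.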
Architectural notes:
- AudioTrack buffer is 30 s so generation can run ahead of playback
without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short
sentences, so for a 2-3 sentence response the buffer actually drains
between segments and the user hears a short gap — expected, not a
bug. Masking that gap requires RTF < 1 which is out of scope.
- Hexagon KV is reset between sentences (hexReset) so the talker
doesn't see stale context. Prefill observed cb0 = 1995 on every
sentence that starts with a capital letter, matching the Python
greedy reference — confirms prefill reconstruction is stable across
segments within a session.
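For reference, the real-time factor quoted in the first note is generation
time divided by audio duration. A short Python check using the seg 0 numbers
from the mocked-LLM validation above (the helper name rtf is hypothetical):

```python
def rtf(generation_ms, audio_ms):
    """Real-time factor: above 1 means generation is slower than playback."""
    return generation_ms / audio_ms


# Seg 0: 4 200 ms of generation for 1.4 s of audio.
seg0_rtf = rtf(4200, 1400)
```

With an RTF of about 3, a 1.4 s segment is fully played long before the next
one is ready, which is why the buffer drains between segments.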
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any
French (or Qwen3-supported) text with no network, no ADB push per
phrase, and quality that matches Python's reference on "Bonjour, je
suis Kazeia, je suis là pour vous écouter." — user validation:
"impeccable".
Four pieces that compose the path:
1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's
Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder,
Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex
lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces
identical token IDs to HF's Qwen2TokenizerFast on the test phrase:
[81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
4914, 9012, 90229, 2676, 13].
2. export_tts_text_embeddings.py — one-time PC export of:
* Full projected text embeddings for the entire 151936-token vocab
as fp16 (297 MB). Sanity check: live vs stored max abs diff
1.15e-4 on token 1043. Mmap'd on-device so it stays off the
Java heap and leaves room for the 125 MB cp_embeddings alloc.
* Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a
Python voice-clone capture, text-invariant across segments.
* Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1
of the same capture. Also text-invariant (diff = 0.0 across
3 different-text segments). Without it the talker never sees
"text ended" and decode falls into page/beg repetition.
* Qwen3 tokenizer vocab.json + merges.txt.
3. Qwen3TtsEngine.kt:
* mmap loader for the embeddings table + buffered fp16→fp32
lookup (halfToFloat covers subnormals/inf/NaN so pathological
tokens don't become 0).
* Stage 2 assets detected at init; missing file transparently
falls back to legacy 1050-token reduced-vocab path.
* synthesizeTextStreaming(text, onSegmentReady) — new public API:
sentence-split → BPE → build prefill as
[voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
(exact structure Python emits; verified bit-for-bit by matching
captured Baer prefill positions against text_projection(tok)+
codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode
each segment through the existing BigVGAN pipeline → callback.
* runHexGenWithPrefill — Hexagon prefill + interleaved CP decode
loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
Python's voice_clone). Degeneracy guard stops when 9 identical
cb0 in a row appear — catches the rare "page beg beg beg" tail
when EOS never fires. maxGen = ids.size*4 + 10 matches the
typical 3.3 codec-frames-per-text-token that Python produces.
* Prefill build uses the speaker's captured prefix/suffix rather
than the legacy in-code buildPrefillEmbeddings that puts only
one text token in prefill — the structure mismatch produced
garbled audio in the first attempt of this commit.
4. KazeiaService.kt: new stream_text intent extra wires text input
to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.
First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon
8 Elite (prefill + 74-token decode), versus 65 s for a 3-sentence
batch before streaming. Streaming and on-device text together
unblock the MVP chat loop.
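The prefill layout, the maxGen formula and the degeneracy guard from item 3
can be sketched in Python as below. Vectors are plain lists and the function
names are hypothetical; the structure and constants are the ones stated in
this commit.

```python
def build_prefill(voice_prefix, text_proj, codec_pad, ids, voice_suffix):
    """[voice prefix] + [text_proj(id) + codec_pad] x N + [voice suffix]."""
    body = [[t + c for t, c in zip(text_proj[i], codec_pad)] for i in ids]
    return list(voice_prefix) + body + list(voice_suffix)


def max_gen(n_ids):
    """Decode-step cap: ids.size * 4 + 10."""
    return n_ids * 4 + 10


def is_degenerate(cb0_history, run=9):
    """True once the last `run` cb0 codes are all identical."""
    return len(cb0_history) >= run and len(set(cb0_history[-run:])) == 1
```

For the 16-token test phrase, max_gen(16) is 74, which matches the 74-token
decode quoted for the first-audio latency.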
Known caveats:
* 297 MB on-device footprint for the embedding table. Acceptable on
OnePlus Pad 3; can be quantized further (int8 per-row) if storage
becomes tight.
* First init adds ~3 s for BPE vocab + merges load (151k × 2 hash-
maps). Happens once per process.
* maxGen cap means extremely long sentences may truncate. The
sentence splitter already keeps segments ≤120 chars so this
hasn't been observed in practice.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX
CP backend. First audio arrives at ~20s (vs ~65s for the full phrase
non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning
is preserved per segment because each segment now ships its own full prefill.
Changes:
* Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady)
reads single- or multi-segment embeds, runs prefill + generation + VQ
decode + BigVGAN per segment, and fires the callback with each
segment's ShortArray the moment it's ready. Saves per-segment WAVs
(kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav
for offline inspection. Extracted the common generation loop into
runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and
streaming paths share exactly the same code (no quality drift between
modes). Added hexReset() between segments so segment 2's prefill logits
don't contain segment 1's KV state.
* vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0
it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a
codebook row index — reading past the 2048-row buffer. The short Baer
probe never hit this; longer phrases do. Clamp any out-of-vocab code
to 0 at allCodebooks build time.
* KazeiaService: new stream_pipeline intent extra wires the callback
to an AudioTrack MODE_STREAM instance, writing each segment's audio as
soon as it comes back. Logs time-to-first-audio.
* prepare_tts_segments.py: the previous version only captured 1-token
decode calls and substituted a generic 9-embed "prefill_base" pulled
from an unrelated single-segment file — dropping the per-segment
xvector conditioning AND the text-encoded embeddings, so Hexagon
produced garbled mixed speech for segments 2..N. Now captures the
multi-token prefill call too (like prepare_tts_voiceclone.py) so each
segment is self-contained.
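The vqDecode fix amounts to sanitising the codes before they are used as row
indices. A minimal Python sketch, where clamp_codes is a hypothetical name
and 2150 is an arbitrary stand-in for an out-of-vocab id (the real CODEC_EOS
value is not stated here):

```python
CODEBOOK_SIZE = 2048  # rows in the codebook buffer


def clamp_codes(codes, codebook_size=CODEBOOK_SIZE):
    """Replace any code that is not a valid codebook row index with 0."""
    return [c if 0 <= c < codebook_size else 0 for c in codes]


safe = clamp_codes([1995, 2150, 0])
```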
Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the
Snapdragon 8 Elite with current config means each segment takes longer to
generate than it takes to play, so audible gaps between segments remain.
Removing the gaps requires either (a) producer/consumer parallelism across
two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the
~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug
and the .pte path contends with Hexagon talker on the DSP).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extensive investigation of the audible "tremor" in the generated voice-cloned
audio. Conclusion is architectural, not a bug:
* Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998
* ONNX Runtime CP V2 is bit-identical to PyTorch greedy CP (0.24% residual
divergence measured by injecting Python's captured cb0 at each step —
14/16 codebooks match 100%, cb14/cb15 miss 1 token out of 53)
* BigVGAN decoder is bit-identical to PyTorch (validated earlier)
* Therefore the tremor is caused entirely by the ~28% of cb0 steps where
the argmax flips because the tiny fp16 logit drift crosses the
top-1/top-2 margin. This cascades through the autoregressive chain into
a trajectory the model never saw at training time → incoherent artifacts.
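A toy Python illustration of the last point: when the top-1/top-2 margin is
below fp16 resolution, rounding alone changes the argmax. The logit values
are made up for the example and the helper names are hypothetical; this is
not a reproduction of the measured 28%.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def argmax(values):
    """Index of the first maximum."""
    return max(range(len(values)), key=values.__getitem__)


logits_fp32 = [10.001, 10.003]           # index 1 wins by a 0.002 margin
logits_fp16 = [to_fp16(v) for v in logits_fp32]
flipped = argmax(logits_fp32) != argmax(logits_fp16)
```

Near 10.0 the fp16 step is 2^-7, so both logits collapse to 10.0 and the
winner changes even though nothing upstream is wrong.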
The cross-architecture gap (x86 AVX-512 vs ARM64 NEON+HMX) cannot be zeroed
by any runtime swap: LibTorch Android would use NEON kernels with a different
reduction order than PyTorch x86, giving the same class of error with a
smaller but non-zero residual. Temperature tweaking (0.3 → 0.9) and
greedy-vs-sample gave no perceptual difference: the floor is numeric, not in
the sampling layer.
Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a
thesis-relevant finding about on-device TTS deployment limits.
Cleanup:
* All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp,
force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated
behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release
builds JIT-eliminate the file checks; debug builds keep the whole
experimental toolchain for re-running the analysis for demos/thesis.
* force_hexagon + force_cp_v2 stay unconditional — production routing.
* Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9).
* Native TTS pipeline (executorch-custom/jni_layer_tts.cpp,
app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to
greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing
audio past EOS where Python's seeded sampler terminated naturally).
* scripts/prepare_tts_voiceclone.py: new script that captures Python
greedy-CP reference (stochastic talker for EOS, deterministic CP) for
token-by-token comparison.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- prepare_tts_native.py: auto-splits long text at sentence/comma
boundaries, max 15 tokens per segment
- Multi-segment format: each segment gets fresh KV cache
- Formula: target_len = n_tokens × 3.2 + 5 per segment
- Tested on Edouard Baer monologue: 28 segments, 102s audio
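The per-segment length formula as a one-line Python helper. Rounding to the
nearest integer is an assumption, since the rounding used by
prepare_tts_native.py is not stated here; the function name is hypothetical.

```python
def target_len(n_tokens):
    """Codec steps budgeted for a segment: n_tokens x 3.2 + 5."""
    return round(n_tokens * 3.2 + 5)


longest_segment = target_len(15)  # a full 15-token segment
```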
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends
Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage
Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: embeds must come from SAME NPU model instance.
Python fp32 embeds cause divergence on NPU fp16 after ~20 steps.
Solution: Java pipeline captures embeds on-device during generation.
Captured embeds work perfectly with C++ pipeline (validated "bon").
- Added capture mode: touch /data/local/tmp/kazeia/capture_mode
- Embeds saved to captured_embeds.bin (same format as pipeline input)
- KV_LEN restored to 100 (KV=64 lost role tokens → quality loss)
- C++ uses pre-computed embeds as-is (no double codec_sum)
Production path: Java pipeline RTF 1.8 for new texts (good quality)
Replay path: C++ pipeline RTF 1.26 with captured embeds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
(original was likely captured on-device, not from Python model.forward)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-computed embeds from Python already contain codec_sum+text.
Using them as-is works correctly. Once they are exhausted, fall back to
our codec_sum + pad.
Long text: 191 tokens, 15.28s audio, RTF 1.27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads
- Restored pre-computed embeds format (codec_sum+text from Python)
- Text-only trailing embeds don't work: model needs codec_sum for EOS
For long phrases, pre-computed embeds must be generated from Python.
RTF 1.26 on short phrase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache input tensor pointers after first prepare_input_tensors call,
then memcpy directly into them for all subsequent steps.
Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls).
Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)
Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s of audio
Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations
From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.
Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.
Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").
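The corrected trailing schedule, sketched in Python. The function name and
the list-based representation are assumptions; the order (text tokens, then
tts_eos once, then tts_pad) is the one described in this fix.

```python
def trailing_embed(step, text_embeds, tts_eos_embed, tts_pad_embed):
    """Embedding to add at decode step `step` after the prefill."""
    if step < len(text_embeds):
        return text_embeds[step]   # remaining text tokens, consumed 1:1
    if step == len(text_embeds):
        return tts_eos_embed       # exactly once, right after the text
    return tts_pad_embed           # every later step, never zeros
```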
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full talker+CP autoregressive loop in C++ via JNI.
Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag)
- Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU
- Added headArgmaxOffset for future batch optimization
- Cancel previous pipeline on new run_pipeline intent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Warmup forward() for talker+CP during init (avoids 7s DSP compilation
on first pipeline run)
- Cancel previous pipeline job before starting new one
- Use Dispatchers.IO for pipeline intent
First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>