kazeia

Commit Graph

Author	SHA1	Message	Date
Kazeia Team	7dc6704e95	docs: add before/after performance comparison to no-root report Concrete measurements taken 2026-04-14 on the same Qwen3-4B .pte and the same C++ runner — only the invocation path differs (subprocess su -c vs in-process LlmModule JNI). Confirms no LLM regression and a measurable speedup on the TTS path thanks to the shared QNN context (Talker 37 ms/step vs 45-65 ms/step before). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:37:15 +02:00
Kazeia Team	6c7746c5d0	docs: add post-mortem to no-root report — issue resolved The root cause was process-credential loss across fork+exec, not the QNN SDK version mismatch I had hypothesized. Switching the LLM to in-process ExecuTorch LlmModule (Zygote-forked context, accepted by adsprpcd's FastRPC credential check) eliminated the su requirement. The original investigation sections are kept verbatim for reference; the new section 10 documents the actual fix, the patches applied to ExecuTorch, the metrics validated end-to-end, and pointers to the project memory entry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:19:27 +02:00
Kazeia Team	b57719fa5e	LLM: filter <think> tokens out of the streaming TTS path Even with /no_think in the system prompt Qwen3 still emits an empty <think>…</think> wrapper before the real answer. Without filtering, the SentenceStreamer treats '<think>' as a sentence boundary and feeds three tokens of XML into the TTS, producing audible parasites at the start of each reply. The new in-callback filter buffers a small lookahead (just enough to span "</think>"), suppresses everything between the open and close tags, and flushes the surrounding prose to onToken in order. With the lookahead, tags that arrive split across decoded pieces ("<thi"+"nk>") still match. Validated end-to-end: prompt 'Bonjour, comment vas-tu ?' now streams sentence-by-sentence to the TTS — first segment "Bonjour !" reaches the talker at 4.6 s, no <think> sneak-through. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:16:08 +02:00
Kazeia Team	f32b5ddfdd	LLM no-root: validate end-to-end pipeline, fix kv_io_bit_width detection End-to-end validation on OnePlus Pad 3 with stream_llm intent: Prompt: 'Bonjour, comment vas-tu ?' Response: 'Bonjour ! Je suis là pour t'écouter. Comment vas-tu aujourd'hui ?' TTS: Talker(PTE) 37ms/step, CP(PTE) 73ms/step, audio synthesized. No su, no Magisk prompts. Two fixes since the previous commit: 1. ExecuTorchLlmEngine: pass echo=false to LlmModule.generate() — by default the runner echoes the prompt tokens back via the callback, which fed the ChatML wrap (<\|im_start\|>user …) into the SentenceStreamer and TTS. 2. jni_layer_llama.cpp: pick Runner<uint8_t> vs Runner<uint16_t> based on the model's get_kv_io_bit_width metadata, mirroring qnn_llama_runner.cpp main(). The hard-coded uint16_t was wrong for our Qwen3-4B export (which uses 8-bit KV I/O) and produced fluent-looking but completely random tokens ("blocked罩ug darkestSOLEQuotes作者本人 …") — same symptom whether greedy or sampled, the smoking gun for a width-mismatched KV cache reinterpretation. Other tweaks: - temperature=0.0 in the QNN_LLAMA branch of jni_layer_llama.cpp (greedy, matches the working qnn_llama_runner --temperature 0 invocation) - shared_buffer=true (same as binary defaults) - Kotlin chat template mirrors qnn_llama_runner.cpp's get_formatted_prompt for Qwen3 (user-first, then optional system, then "<\|im_start\|>assistant" with no trailing newline — that quirky ordering is what the .pte was trained on) TFTT is ~4 s for a 77-token prompt on kv-only mode (sequential prefill, one forward per token). To get a sub-second TTFT we'd need to re-export the model in --model_mode hybrid which adds a parallel prefill_forward graph; not required for the conversational use case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:11:23 +02:00
Kazeia Team	809a6d4fed	LLM no-root: migrate to in-process LlmModule (JNI) — zero su calls The root cause of the previous su-c requirement was that Qualcomm's FastRPC kernel driver rejects processes spawned via ProcessBuilder fork+exec because they lose supplementary GIDs on exec. Zygote-forked app processes retain the proper init-configured credentials and are accepted by the adsprpcd service, which is why ORT-QNN (Whisper, in-process) worked while the subprocess qnn_llama_runner did not. Running the LLM in-process via ExecuTorch's LlmModule bypasses the fork+exec path entirely. What this commit does: - ExecuTorchLlmEngine now uses org.pytorch.executorch.extension.llm.LlmModule with MODEL_TYPE_QNN_LLAMA=4 (routes to example::Runner in jni_layer_llama.cpp, the same C++ runner that qnn_llama_runner embeds). - All su, ProcessBuilder, file-based prompt/response plumbing, and run_llm.sh gone. ChatML template is built in Kotlin; tokens stream in via LlmCallback. Supporting changes under executorch-patches/llm_in_process_jni.patch: 1. backends/qualcomm/CMakeLists.txt — gate PyQnnManagerAdaptor on NOT ANDROID. The original guard (CMAKE_SYSTEM_PROCESSOR MATCHES x86_64) misfires in a nested scope during Android cross-compile and tried to build the host Python bindings. 2. extension/android/jni/jni_layer_llama.cpp — hardcode decoder_model="qwen3" (was "llama3") and pass eval_mode=0 (EvalMode::kKVCached) + shared_buffer=true to match our hybrid_llama_qnn.pte which only contains kv_forward, not prefill_forward. Build: scripts/build_android_library.sh arm64-v8a with QNN_SDK_ROOT pointing to /opt/Kazeia/qnn_sdk_242/qairt/2.42.0.251225 and EXECUTORCH_BUILD_QNN=ON. Produces libexecutorch_jni.so (192 MB) with QNN v2.42 backend + the llama runner code, plus libqnn_executorch_backend.so. Both staged in jniLibs. Validated on OnePlus Pad 3: LlmModule.load() completes in 4.2 s, no su prompts, Pipeline ready with STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine). TTS .pte still loads with the upgraded v2.42 runtime — no regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 10:39:50 +02:00
Kazeia Team	6e6a2d9f82	Baseline before no-root migration: working state with root LLM Commit de sauvegarde avant la tentative d'unification QNN SDK v2.37 et suppression du su -c pour le LLM. État actuel fonctionnel : - LLM Qwen3-4B via su -c qnn_llama_runner (v2.42 dans /data/local/tmp/kazeia-et/) - TTS talker + CP via ExecuTorch .pte JNI (v2.31 dans jniLibs) - STT Whisper via ORT-QNN 1.24.3 Le rapport kazeia-no-root-report.md documente en détail les tentatives de no-root et leurs échecs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 08:19:36 +02:00
Kazeia Team	364016b7b8	LLM+TTS: short-response system prompt, PTE streaming fallback - ExecuTorchLlmEngine: system prompt forces French, 1-2 short sentences, /no_think so the full budget goes to the answer (Qwen3 was consuming 120+ tokens on <think>); eval_mode 0 matches our kv-mode export. - Qwen3TtsEngine.generateSegmentAudioVC: when the Hexagon talker socket isn't open, fall back to runInterleavedPteFromEmbeds so the Stage 3 streaming session still produces audio. Without this the session opened, accepted sentences, and silently emitted empty PCM. Documents the QNN SDK version-skew pitfall in ExecuTorchLlmEngine.kt ahead of the upcoming migration to a unified v2.42 toolchain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 00:17:08 +02:00
Kazeia Team	9930bfa392	LLM: enable Qwen3-4B NPU (21 tok/s) in service pipeline - ExecuTorchLlmEngine: eval_mode 0 (our .pte is kv-mode, not hybrid) - KazeiaService: call llm.load() after TTS init; try/catch falls back to echo mode if the runner or .pte are missing. Pipeline on device: STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine). Validated on OnePlus Pad 3: LLM ready in ~8 s, gen 21.3 tok/s, RSS 1.76 GB in the qnn_llama_runner subprocess (out-of-process from the Kazeia app). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 23:00:25 +02:00
Kazeia Team	19f934af25	LLM NPU: Qwen3-4B QNN export patches + deployment notes Adds executorch-patches/ with the local modifications to /opt/Kazeia/executorch (upstream pytorch/executorch v1.2.0) required to export Qwen3-4B to QNN for the OnePlus Pad 3 Hexagon V79. Tablet runs 18.2 tok/s (gen), TTFT 0.9 s, RSS 1.76 GB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 22:56:42 +02:00
Kazeia Team	f548e02283	TTS: dynamic EOS-rank boost terminates generation cleanly across voices Replaces the fixed maxGen + length-based boost with a fully dynamic end-of-utterance detector that watches the model's own EOS logit rank. End result on the Baer 3-segment monologue, validated by user as "FORMIDABLE" / "impeccable" with both Damien and Zelda voices: - All 3 segments terminate via EOS (no maxGen cap hit) - No "page beg beg" filler tail - No abrupt cuts between segments - Audio durations 5-8 s per segment, matching Python within ~10 % How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt): 1. At every decode step, compute the rank of CODEC_EOS in the repetition-penalised logits. Mid-utterance the rank sits at 150-700 (model is committed to producing speech). Approaching the natural end, the rank dips toward top-50. 2. Arm the boost only when EOS rank stays below eosRankTrigger=60 for THREE consecutive steps. The 3-step requirement filters out transient single-step dips that occur during low-energy phonemes mid-sentence (without it, short sentences would terminate after ~3 s). Arming is also gated by eosBoostMinStep (50 % of expected speech length) so we never arm in the very first frames. 3. Once armed, the boost increments monotonically: each subsequent step adds boostStepsActive * eosBoostScale to the EOS logit. The accumulated boost lifts EOS above top-1 within 1-3 steps, the argmax check fires, and the loop breaks. Scale=4 gives the model a small natural decay before termination; scale=5 was perfect-but- slightly-clipping, scale=3 wasn't strong enough to outpace the growing top-1 logit. Other tweaks bundled in this commit because they all contribute to the clean output: * Inter-segment gap 120 → 250 ms — gives the listener a perceived sentence boundary instead of a hard concatenation. * fadeOut(audio, 40) on every segment — cosine roll-off over the last 40 ms so the EOS-clipped tail decays naturally instead of sample-clipping. * top_k 50 → 200 in the fallback sample call — wider pool to keep EOS reachable when the boost just fails to hit argmax. Voice swap is a 45 KB file push (damien_voice_prefix.bin and damien_voice_suffix.bin). Successfully tested today with Elodie (female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36) as the baseline — same Kotlin code, no rebuild needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:13:04 +02:00
Kazeia Team	c25040a780	TTS: conditional tail-trim + export script accepts voice path arg Two small changes: * export_tts_text_embeddings.py now takes the voice wav as an optional second CLI arg (defaults to damien_15s_24k.wav). Lets the same script capture voice-prefix+suffix for any speaker wav without editing the source — used today to test Elodie alongside Damien. * synthesizeTextStreaming + generateSegmentAudioVC only run the trimTailLowEnergy trim when n >= maxGen. The trim's 35%-of-peak threshold is tuned to catch "page beg beg" filler after the talker fails to emit EOS — but it was cutting valid speech when EOS fired early (observed on Elodie seg 1: 10.08 s → 2.92 s, a 4-second over- trim). With the guard it's a no-op on converging generations and only fires on the ~15% of segments that hit maxGen. Validation after the fix (Elodie, Baer monologue): - seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut, the filler tail) - seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is - seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is Voice prefix/suffix shape is speaker-invariant except position 7 (the xvector). Confirmed by capturing both Damien and Elodie and diffing: positions 0-6 and 8 identical within 1e-8, suffix identical within 1e-8, only pos 7 has a different xvector embedding (norm 10.36 vs 10.12). That means swapping speakers on-device is a 45 KB file push — no app rebuild, no re-export of the 297 MB vocabulary table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 11:32:33 +02:00
Kazeia Team	0833d1bd21	TTS: route all synthesizeAndPlay calls through Stage 3 streaming session Replaces the four per-sentence TTS entry points (pipeline.speak, REPEAT voice command, echo-mode TTS, LLM-response TTS) with a single shared pipeline.speakText() that: * opens a Qwen3TtsEngine streaming session when the TTS backend is Qwen3 (voice-cloning path); * feeds the whole response through a SentenceStreamer so the first sentence starts playing as soon as it's decoded; * falls back to the old one-shot synthesizeAndPlay for non-Qwen3 TTS engines (AndroidTts, Chatterbox) that don't expose a session API. KazeiaPipeline.speakText is now public so KazeiaService can use the same dispatch — previously each call site re-implemented the "streaming-or-fallback" logic or just called synthesizeAndPlay and waited for the full synthesis. Enabling the real on-device LLM is a separate task (task #48): the existing llama-cli binary has ggml-hexagon linked in and fails to init the DSP (0x80000406) when the TTS Hexagon runners hold the session. Needs either a CPU-only llama-cli build or the restored ExecuTorch qnn_llama_runner setup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 11:12:14 +02:00
Kazeia Team	2f07901ff3	TTS Stage 3: LLM stream → sentence split → TTS session → shared AudioTrack Closes the loop on on-device conversational TTS. The LLM's token stream is now consumed by a SentenceStreamer which fires a callback the moment a terminal-punctuation boundary appears; each sentence is enqueued to a persistent TTS streaming session that generates and plays audio through a single shared AudioTrack. Sentence N's audio plays while sentence N+1 is being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and no "wait for full response before hearing anything". Mocked-LLM validation on the 3-sentence prompt: "Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui." - First sentence detected: 1 ms - Seg 0 prefill (Hex): 567 ms - Seg 0 generated: 4 200 ms (18 tokens, 1.4 s audio) - Seg 1 generated: 9 100 ms (42 tokens) - Seg 2 generated: 11 000 ms (46 tokens) - Session closed: 33 500 ms (all audio drained) Changes: * tts/SentenceStreamer.kt — 50-line helper that buffers tokens and fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4 so "Oui." / "Bonjour." count as real sentences; higher thresholds swallowed conversational openers into the next segment and delayed first audio. flush() for the final partial sentence. * Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession triplet. startStreamingSession opens a 30-second MODE_STREAM AudioTrack plus a background worker coroutine that pulls sentences from an unlimited Channel. enqueueSentence is non-blocking; the worker serialises generation so audio order matches enqueue order. generateSegmentAudioVC is the per-sentence body (tokenize → prefill build → Hexagon gen → decode) without the WAV-save side effects that the /stream_text intent path does. * KazeiaService new intents: - stream_llm : real LLM path (needs LLM loaded; currently the debug build runs echo-mode so this path is shipped but requires production config to exercise). - stream_llm_mock : fakes the LLM stream by splitting the given text on spaces with 50 ms per "token" — matches the ~20 tok/s rate the on-device LLM produces and lets Stage 3 be validated without flipping the LLM on. Architectural notes: - AudioTrack buffer is 30 s so generation can run ahead of playback without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short sentences, so for a 2-3 sentence response the buffer actually drains between segments and the user hears a short gap — expected, not a bug. Masking that gap requires RTF < 1 which is out of scope. - Hexagon KV is reset between sentences (hexReset) so the talker doesn't see stale context. Prefill observed cb0 = 1995 on every sentence that starts with a capital letter, matching the Python greedy reference — confirms prefill reconstruction is stable across segments within a session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:52:46 +02:00
Kazeia Team	7f1a44c23d	TTS Stage 2: on-device voice-cloning TTS for arbitrary text Removes the PC-side prepare_tts_segments.py dependency for day-to-day generation. The tablet now tokenizes, embeds, and voice-clones any French (or Qwen3-supported) text with no network, no ADB push per phrase, and quality that matches Python's reference on "Bonjour, je suis Kazeia, je suis là pour vous écouter." — user validation: "impeccable". Three pieces that compose the path: 1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder, Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces identical token IDs to HF's Qwen2TokenizerFast on the test phrase: [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915, 4914, 9012, 90229, 2676, 13]. 2. export_tts_text_embeddings.py — one-time PC export of: * Full projected text embeddings for the entire 151936-token vocab as fp16 (297 MB). Sanity check: live vs stored max abs diff 1.15e-4 on token 1043. Mmap'd on-device so it stays off the Java heap and leaves room for the 125 MB cp_embeddings alloc. * Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a Python voice-clone capture, text-invariant across segments. * Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1 of the same capture. Also text-invariant (diff = 0.0 across 3 different-text segments). Without it the talker never sees "text ended" and decode falls into page/beg repetition. * Qwen3 tokenizer vocab.json + merges.txt. 3. Qwen3TtsEngine.kt: * mmap loader for the embeddings table + buffered fp16→fp32 lookup (halfToFloat covers subnormals/inf/NaN so pathological tokens don't become 0). * Stage 2 assets detected at init; missing file transparently falls back to legacy 1050-token reduced-vocab path. * synthesizeTextStreaming(text, onSegmentReady) — new public API: sentence-split → BPE → build prefill as [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix] (exact structure Python emits; verified bit-for-bit by matching captured Baer prefill positions against text_projection(tok)+ codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode each segment through the existing BigVGAN pipeline → callback. * runHexGenWithPrefill — Hexagon prefill + interleaved CP decode loop. Feeds tts_eos once, tts_pad thereafter (same schedule as Python's voice_clone). Degeneracy guard stops when 9 identical cb0 in a row appear — catches the rare "page beg beg beg" tail when EOS never fires. maxGen = ids.size4 + 10 matches the typical 3.3 codec-frames-per-text-token that Python produces. Prefill build uses the speaker's captured prefix/suffix rather than the legacy in-code buildPrefillEmbeddings that puts only one text token in prefill — the structure mismatch produced garbled audio in the first attempt of this commit. 4. KazeiaService.kt: new stream_text intent extra wires text input to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer. First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon 8 Elite (prefill + 74-token decode), vs a 3-phrase sentence batch that was 65 s pre-streaming — streaming + on-device text together unblock the MVP chat loop. Known caveats: * 297 MB on-device footprint for the embedding table. Acceptable on OnePlus Pad 3; can be quantized further (int8 per-row) if storage becomes tight. * First init adds ~3 s for BPE vocab + merges load (151k × 2 hash- maps). Happens once per process. * maxGen cap means extremely long sentences may truncate. The sentence splitter already keeps segments ≤120 chars so this hasn't been observed in practice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:12:09 +02:00
Kazeia Team	5e416713ce	TTS Stage 1 streaming: play each segment the moment it's decoded Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX CP backend. First audio arrives at ~20s (vs ~65s for the full phrase non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning is preserved per segment because each segment now ships its own full prefill. Changes: * Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady) reads single- or multi-segment embeds, runs prefill + generation + VQ decode + BigVGAN per segment, and fires the callback with each segment's ShortArray the moment it's ready. Saves per-segment WAVs (kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav for offline inspection. Extracted the common generation loop into runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and streaming paths share exactly the same code (no quality drift between modes). Added hexReset() between segments so segment 2's prefill logits don't contain segment 1's KV state. * vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0 it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a codebook row index — reading past the 2048-row buffer. The short Baer probe never hit this; longer phrases do. Clamp any out-of-vocab code to 0 at allCodebooks build time. * KazeiaService: new stream_pipeline intent extra wires the callback to an AudioTrack MODE_STREAM instance, writing each segment's audio as soon as it comes back. Logs time-to-first-audio. * prepare_tts_segments.py: the previous version only captured 1-token decode calls and substituted a generic 9-embed "prefill_base" pulled from an unrelated single-segment file — dropping the per-segment xvector conditioning AND the text-encoded embeddings, so Hexagon produced garbled mixed speech for segments 2..N. Now captures the multi-token prefill call too (like prepare_tts_voiceclone.py) so each segment is self-contained. Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the Snapdragon 8 Elite with current config means each segment takes longer to generate than it takes to play, so audible gaps between segments remain. Removing the gaps requires either (a) producer/consumer parallelism across two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the ~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug and the .pte path contends with Hexagon talker on the DSP). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:43:30 +02:00
Kazeia Team	de878ddf5c	TTS tremor investigation: identify cross-arch numerical floor, gate diag flags Extensive investigation of the audible "tremor" in the generated voice-cloned audio. Conclusion is architectural, not a bug: * Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998 * ONNX Runtime CP V2 is bit-identical to PyTorch greedy CP (0.24% residual divergence measured by injecting Python's captured cb0 at each step — 14/16 codebooks match 100%, cb14/cb15 miss 1 token out of 53) * BigVGAN decoder is bit-identical to PyTorch (validated earlier) * Therefore the tremor is caused entirely by the ~28% of cb0 argmax flips where the tiny fp16 logits drift crosses the top-1/top-2 margin. This cascades through the autoregressive chain into a trajectory the model never saw at training time → incoherent artifacts. Cross-architecture test (x86 AVX-512 / ARM64 NEON+HMX) cannot be zeroed by any runtime swap — LibTorch Android would use NEON kernels with a different reduction order than PyTorch x86, same class of error, smaller but non-zero residual. Temperature tweaking (0.3 → 0.9) and greedy-vs-sample gave no perceptual difference: the floor is numeric, not in the sampling layer. Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a thesis-relevant finding about on-device TTS deployment limits. Cleanup: * All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp, force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release builds JIT-eliminate the file checks; debug builds keep the whole experimental toolchain for re-running the analysis for demos/thesis. * force_hexagon + force_cp_v2 stay unconditional — production routing. * Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9). * Native TTS pipeline (executorch-custom/jni_layer_tts.cpp, app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing audio past EOS where Python's seeded sampler terminated naturally). * scripts/prepare_tts_voiceclone.py: new script that captures Python greedy-CP reference (stochastic talker for EOS, deterministic CP) for token-by-token comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 00:15:14 +02:00
Kazeia Team	ee186e9049	Auto-segmentation for long texts + dynamic pipeline - prepare_tts_native.py: auto-splits long text at sentence/comma boundaries, max 15 tokens per segment - Multi-segment format: each segment gets fresh KV cache - Formula: target_len = n_tokens × 3.2 + 5 per segment - Tested on Edouard Baer monologue: 28 segments, 102s audio Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 00:08:59 +02:00
Kazeia Team	199bc4fbc9	Full native C++ TTS validated on short + long phrases Dynamic formula: target_len = n_tokens × 3.2 + 5 (calibrated) - Short "Bonjour..." (18 tokens → 62 trailing): OK - Long "Je suis Kazeia... difficiles" (30 tokens → 101 trailing): OK RMS trim disabled (garbage is loud, can't distinguish from speech). Length controlled purely by maxTokens = trailing count. Pipeline: prepare_tts_native.py "any text" → adb push → run → audio Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:51:05 +02:00
Kazeia Team	dafbe2a52b	FULL NATIVE C++ TTS pipeline — any text, perfect quality The complete solution for native TTS on NPU: 1. Python: tokenize + text_projection only (30ms, no model generation) 2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×) 3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad 4. RMS-based auto-trim of trailing noise after speech ends Key insights: - Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent - codec_sum from our NPU codes is coherent (same model instance) - Text tokens consumed 1:1, then eos padding for remaining steps - RMS trim detects 15% energy drop from peak → cuts garbage Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..." prepare_tts_native.py works for ANY text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:39:06 +02:00
Kazeia Team	09d36f2025	Root cause found + on-device embed capture + KV=100 restored Root cause: embeds must come from SAME NPU model instance. Python fp32 embeds cause divergence on NPU fp16 after ~20 steps. Solution: Java pipeline captures embeds on-device during generation. Captured embeds work perfectly with C++ pipeline (validated "bon"). - Added capture mode: touch /data/local/tmp/kazeia/capture_mode - Embeds saved to captured_embeds.bin (same format as pipeline input) - KV_LEN restored to 100 (KV=64 lost role tokens → quality loss) - C++ uses pre-computed embeds as-is (no double codec_sum) Production path: Java pipeline RTF 1.8 for new texts (good quality) Replay path: C++ pipeline RTF 1.26 with captured embeds Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:00:37 +02:00
Kazeia Team	3dcf73aa38	Restore KV=100 + fix as-is embeds + multi-segment support - KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens) - C++ uses pre-computed embeds as-is (no double codec_sum) - Multi-segment format support in Kotlin (detects n_segments header) - prepare_tts_segments.py: splits text + generates per-segment embeds - Quality issue: Python-captured embeds differ from original working file (original was likely captured on-device, not from Python model.forward) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 22:26:20 +02:00
Kazeia Team	10a3904d7d	Multi-segment TTS for long text: split → generate → concatenate - prepare_tts_segments.py: splits text at sentence boundaries, generates Python pre-computed embeds per segment - Kotlin: detects multi-segment file format, processes each segment independently (fresh KV cache), concatenates audio - Long text tested: 3 segments, 335 tokens, 26.8s audio, RTF 1.67 File format: n_segments, then per segment: nPrefill, nTotal, embeds[] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:34:05 +02:00
Kazeia Team	24157c0a68	Fix: use pre-computed embeds as-is (no double codec_sum) Pre-computed embeds from Python already contain codec_sum+text. Using them as-is works correctly. After exhausted, fallback to our codec_sum + pad. Long text: 191 tokens, 15.28s audio, RTF 1.27 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:10:23 +02:00
Kazeia Team	f6df1738c5	Add prepare_tts_embeds.py for any text + codec_sum fix - prepare_tts_embeds.py: generates pre-computed embeddings from any text via Python generate_voice_clone, capturing talker inputs - C++ pipeline: always build codec_sum + trailing (not as-is) - maxTokens: 4× trailing count (audio >> text tokens) - Long text tested: 224 Python tokens → 125 NPU tokens (10s audio) - Text-only embeds don't work (model needs Python pre-computed codec_sum) Usage: python3 scripts/prepare_tts_embeds.py "Your text" output.bin adb push output.bin /data/local/tmp/.../full_pipeline_embeds.bin Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:05:42 +02:00
Kazeia Team	173606dae7	Stable: decoder 8T optimization + restore pre-computed embeds - BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads - Restored pre-computed embeds format (codec_sum+text from Python) - Text-only trailing embeds don't work: model needs codec_sum for EOS For long phrases, pre-computed embeds must be generated from Python. RTF 1.26 on short phrase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:42:02 +02:00
Kazeia Team	42bbb96fd8	Optimize decoder: BigVGAN 8T, small models 4T → RTF 1.26 BigVGAN benefits from 8 intra-op threads (all perf cores). Pre_conv and pre_transformer kept at 4T (small, less contention). BigVGAN: 2757ms → 1872ms (-885ms), decode total: 2830ms → 2035ms Pipeline: 6438ms → 5834ms → RTF 1.26 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:00:05 +02:00
Kazeia Team	a688edc9ec	Reduce talker KV_LEN 100→64: saves 148ms (RTF 1.31) KV window of 64 sufficient for ~70 token generation (10 prefill + 58 gen). 36% less KV memcpy per talker step (28L × 2 × 64×8×128 vs 100×8×128). Generation: 3795ms → 3647ms, total: 6438ms → 6093ms Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:47:30 +02:00
Kazeia Team	4dcc4bb8b3	Fix KV buffer + revert HTP decoder (BigVGAN too complex for HTP) - Restored intermediate KV buffer for talker (direct output→input caused trembling from buffer overwrite during execute()) - BigVGAN HTP compilation takes >5min, not viable - RTF 1.35 with clean audio quality Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:37:50 +02:00
Kazeia Team	985fd9cff9	Direct output→input KV copy: RTF 1.51 → 1.31 Skip intermediate KV buffer: copy output tensors directly into next step's input pointers. Saves ~1.5GB/run of memcpy for talker (28L × 2 × 100×8×128 floats × 58 steps) and CP similarly. Generation: 4007ms → 3713ms, total: 7180ms → 6078ms Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:23:45 +02:00
Kazeia Team	14f7e5b05f	Optimize CP+talker: eliminate prepare_input_tensors per step Cache input tensor pointers after first prepare_input_tensors call, then memcpy directly into them for all subsequent steps. Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls). Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:16:38 +02:00
Kazeia Team	e647911329	Shared Module C++ pipeline: RTF 1.6 with perfect quality Key breakthrough: C++ pipeline loop using the SAME Method* instances that Java loaded (via Module::method("forward")). This gives: - Same QNN compiled graph → identical numerical results → no trembling - C++ loop → no Java Tensor/EValue allocation overhead - prepare_input_tensors + memcpy + Method::execute (like cp_et_runner) Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s Added to executorch JNI: - Module.nativeSetCpModule() — registers CP module for pipeline - Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++ - Updated executorch.jar with new native method declarations From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:05:58 +02:00
Kazeia Team	38c0e9874a	Disable C++ pipeline (QNN non-deterministic), keep Java RTF 1.8 Root cause found: QNN HTP level=1 compilation is not bitwise deterministic. Two loads of the same .pte produce slightly different hidden states → audible trembling in decoded speech. Java pipeline uses single QNN instance → no trembling, validated quality. C++ pipeline code preserved for future use when QNN context caching is fixed (would make both loads use same compiled graph). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 11:42:49 +02:00
Kazeia Team	439629c9bf	Revert "Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9)" This reverts commit `0f027c5fde`.	2026-04-09 11:03:52 +02:00
Kazeia Team	0f027c5fde	Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9) Reuse float arrays and Tensor/EValue objects across talker steps instead of creating new ones each iteration. Eliminates ~7s of GC overhead from thousands of JNI object allocations. Same validated audio quality as before, no C++ pipeline needed. Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:59:13 +02:00
Kazeia Team	8e536094df	Fix C++ pipeline eos/pad + disable for quality (keep Java default) - Fixed trailing embed handling (use pre-computed as-is) - Added eos/pad embed params to nativeRun - Improved C++ PRNG for sampling - Disabled native pipeline: slight quality regression vs Java (two separate QNN instances give different numerical results) - Java pipeline (RTF 1.8) kept as default for validated quality Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:53:19 +02:00
Kazeia Team	3b01302cfb	Fix missing eos/pad embeddings in native C++ pipeline The native pipeline was adding zeros after trailing text tokens instead of tts_eos_embed then tts_pad_embed. This caused the model to mispronounce final words (e.g. "développement" → "devopment"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:35:05 +02:00
Kazeia Team	393ce79eb5	Native C++ pipeline: RTF 1.4 (was 3.6 in Java) Full talker+CP autoregressive loop in C++ via JNI. Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:09:32 +02:00
Kazeia Team	fb6045a635	Pre-load CP heads + GPU decoder test (reverted) + headArgmaxOffset - Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag) - Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU - Added headArgmaxOffset for future batch optimization - Cancel previous pipeline on new run_pipeline intent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:57:01 +02:00
Kazeia Team	6e6c562d53	Add DSP warmup + fix pipeline thread contention - Warmup forward() for talker+CP during init (avoids 7s DSP compilation on first pipeline run) - Cancel previous pipeline job before starting new one - Use Dispatchers.IO for pipeline intent First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:24:18 +02:00
Kazeia Team	8bfe6c7445	Add NEON SIMD heads argmax for CP — 2.3× speedup CP head dot products (15 × 2048×1024) optimized with ARM NEON vfmaq_f32 (4 accumulators, 16 floats/iteration). CP/frame: 131ms → 58ms, total pipeline: 22.7s → 14.7s (RTF 3.2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 08:55:20 +02:00
Kazeia Team	389ffa7c61	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch Full Qwen3-TTS-0.6B pipeline running on Snapdragon 8 Elite NPU: - Talker (28L) and Code Predictor (5L) as .pte on QNN HTP fp16 - JNI integration, no root required - Validated audio quality: RTF 3.9 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 08:42:11 +02:00

41 Commits All Branches Search

41 Commits

All Branches