kazeia

Commit Graph

Author	SHA1	Message	Date
Kazeia Team	b5b13780f7	UI: whole-sphere Fourier-mode deformation during speech Dropped the internal waveform lines — not what we wanted visually — and replaced them with a spectrum-driven deformation of the sphere outline itself. Each of the 12 log-spaced bands drives one Fourier mode of the perimeter (band b → mode b + 2, so modes 0/1 stay circular and higher bands produce tighter ripples). Low bands pull the shape into wide asymmetric bumps that feel like formants; high bands add quick sibilant-like tremors. Phase advances faster for higher modes so tight ripples visually match high-frequency content. Overall displacement is gated by the RMS envelope so silence is quiet and loud syllables distort strongly. Fill + highlight are clipped to the deformed path so the gradient follows the shape and it reads as a single living object rather than a circle with stuff bolted on. Removed drawSpectrumBars and drawWaveformLine. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 23:47:30 +02:00
Kazeia Team	2fe46e0f15	Fix seg-2 audio dropout + switch spectrum from bars to Bézier lines Regression fix: when synthesis of segment N+1 ran longer than the playback of segment N (e.g. 5 s synth for a 1.5 s "Bonjour !"), the previous MediaPlayer was already in the Completed state by the time we queued the next one. setNextMediaPlayer() on a Completed player is a documented silent no-op — so the second sentence never started and the user only heard the first part of the reply. Rewrote playChainedMediaPlayers with per-player CompletableDeferred tracking: before calling setNext we check whether current's done has fired; after awaiting completion we verify next really auto-started (checking isPlaying / currentPosition) and call start() explicitly if the chain missed. Belt-and-suspenders against the race either way. Removed the now-unused waitForPlaybackCompletion helper. Visual change: in-sphere spectrum bars replaced with three superimposed Bézier "deforming lines", mirrored above/below a central baseline, with a soft cosine taper so the curves decay to zero at the sphere's left/right edges (matches the circular mask). Each line has its own slow-moving phase + gain + thickness + alpha so the three overlap to give depth — closer to an oscilloscope trace than an EQ. Low-level sin jitter keeps the lines alive during quiet passages, amplitude-gated so true silence is a flat line. User-facing change: no bars anymore. The sphere now "breathes" with flowing waveforms matching its voice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 23:42:43 +02:00
Kazeia Team	06dcd76dcb	UI: large central orb w/ spectrum-inside + per-voice palette Complete redesign of AudioVisualizerView based on feedback: the orb is now the app's visual face, takes the top ~60% of the chat area, and has clearly distinct behaviour in each state. - Idle: slow 5 s breathing (scale 0.88 → 1.00 via cos easing), pure round shape, soft halo in phase. No high-frequency motion. - Listening: organic blob outline built from 8 Fourier modes whose amplitude scales with live mic RMS; a thin shimmering arc rotates around the orb while mic energy is present; continuous micro-ripples pulse outward. Looks clearly 'alive and attentive' vs Idle's static breathing. - Speaking: the orb becomes a contained spectrometer. A pre- computed log-spaced spectrogram (12 bands, 120 Hz–4 kHz, Hann-windowed FFT, one column per 50 ms of audio) is rendered as vertical rounded-rectangle bars CLIPPED to the sphere outline so they really look like the sphere itself speaking. Bar heights interpolate between spectrogram frames and exponentially smooth toward the target for fluid 60 fps motion. Outer halo pulses with the RMS envelope; ripples release on envelope peaks. - Per-voice color. Eight-entry palette (Damien lavender, Elodie rose, Jerome aqua, Richard amber, Amir emerald, Didier indigo, Sid peach, Zelda periwinkle). Halo, accent, bars, ring, and ripples are all derived from a single voiceColor so switching the voice spinner tweens the entire scene to the new identity over a few frames. Color stored on both KazeiaService (for persistence across process/view rebinds) and pushed directly to the view for instant feedback at selection time. Sidecar pipeline changes: - Qwen3TtsEngine now computes per-segment spectrogram alongside the RMS envelope (new computeSpectrogram + an in-place radix-2 FFT). FFT_SIZE = 1024, hop = 50 ms, 12 log-spaced bands. SegmentReady carries both arrays; onSegmentPlaying is (sentence, durationMs, rmsEnvelope, spectrogram). - KazeiaPipeline.speakText forwards the new callback shape. - KazeiaService.VisualizerSignal.Speaking now carries the spectrogram and the new voiceColor StateFlow. - ChatActivity passes both to the view and collects voiceColor. Layout: vertical chain between audioViz (weight 3) and rvMessages (weight 2) so the orb owns ~60% of the chat panel and the chat list takes the remainder. Removed the fixed 140 dp constraint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 23:33:38 +02:00
Kazeia Team	8939c680b2	UI: épuré audio-reactive orb visualizer — replaces 3D avatar for MVP Adds a breathing lavender orb centred above the chat list that tracks the actual audio state of the app: - Idle: slow respiratory pulsation (~4 s cycle) at 20 fps. The chatbot is visually "awake" without animating loudly. - Listening: halo swells with live mic RMS from the VAD loop, so the user sees Kazeia hearing them even before Whisper has produced any transcription. Mic RMS is normalised with the same sqrt squashing the TTS envelope uses so quiet speech still reads visibly. - Speaking: amplitude + halo driven by a pre-computed RMS envelope (50 ms windows, sqrt-normalised) produced at synthesis time. Ripples fire on local peaks above 0.35 — matches speech rhythm without overwhelming. Timer is internal to the view, synced to the segment's durationMs; no MediaPlayer position polling. Architecture: - Sidecar RMS envelope. Computed in Qwen3TtsEngine.generateSegmentAudioVC right after PCM is available, packed into SegmentReady, and handed to onSegmentPlaying(sentence, durationMs, rmsEnvelope) when each MediaPlayer starts. Zero extra IO — runs on the same PCM we already write to WAV. - KazeiaService exposes VisualizerSignal (Idle \| Listening(rms) \| Speaking(env, dur)) as a StateFlow. The VAD loop pushes Listening, processLlmResponse pushes Speaking from the per-segment TTS callback, and finally clears to Idle when no mic is open. - AudioVisualizerView renders via Choreographer.FrameCallback, self- throttled to 20 fps at Idle and full refresh during Listening/ Speaking. Hardware layer. Pure Kotlin + Canvas, no deps. ~280 LOC. Layout: 140 dp strip between voiceBar and rvMessages in activity_chat.xml. No 3D engine, no Unity, no splash extension. The avatar design work remains on disk for a later phase when the TTS+streaming pipeline stabilises enough to spend time on DECA/FLAME integration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 23:20:15 +02:00
Kazeia Team	f17131aefb	UI: reveal Kazeia reply in sync with TTS audio (per-sentence, per-word) Matches the 'conversation' feel the user asked for. Previously the full LLM response appeared in the chat as soon as generation finished, then audio played 5–10 s later — text and sound felt decoupled. Now: - The KAZEIA bubble is created empty and only starts filling when the first TTS segment actually starts playing through the speaker (we already split the response by sentence for the chained- MediaPlayer pipeline; that split drives the reveal too). - Inside each sentence, words are appended one by one at a cadence of (audio duration / word count) — slower sentences reveal slower, matching speech pacing. The first word of each sentence appears immediately so audio and text stay aligned at the start. Implementation: - Qwen3TtsEngine: added `onSegmentPlaying(sentence, durationMs)` listener, invoked from the chained-MediaPlayer worker the moment each segment's MediaPlayer.start() lands. Sentence + duration are carried end-to-end via a new SegmentReady data class. - KazeiaPipeline.speakText: forwards an optional listener down to the TTS engine, same signature. - KazeiaService: new updateMessageText(id, text) helper. In processLlmResponse, the bubble is added empty before speakText and grown by a reveal coroutine per sentence; after speakText returns we snap to the full text as a safety net. No change to the stream_llm debug intent path — it still uses the old enqueueSentence flow directly and doesn't need the reveal (no UI bubble there). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:58:18 +02:00
Kazeia Team	6a958c1a10	Revert MemoryOptimizer — reclaim wasn't worth the footprint The kill-background approach worked at startup (−1.6 GB) but respawns pulled most of that back within 1–3 min, and the periodic sweep only kept the first sweep's reclaim stable rather than going lower. Net practical benefit over "just let Android manage it" is small, not worth a custom optimizer + normal-perm + file maintenance. Removes: - MemoryOptimizer.kt - KazeiaService.onCreate calls to freeRamForModels / startPeriodicOptimizer - KILL_BACKGROUND_PROCESSES permission from the manifest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:46:58 +02:00
Kazeia Team	751e3e0868	memory: periodic sweep + expand kill list (photos, calendar, contacts, vending, tachyon…) First sweep reclaimed 1.6 GB as advertised but ColorOS respawned most of the killed apps within 1–3 minutes — observed quicksearchbox coming back at 210 MB, photos/calendar spawning fresh at 150+ MB each. Two changes: 1. Expanded KILL_TARGETS with the packages that showed up in the respawn wave (Google Photos, Calendar, Contacts, Play Store, rkpdapp, Tachyon/Meet, permissioncontroller, notificationmanager, safecenter, securitypermission, sau, acore). These are user-facing but not needed while Kazeia is the active task; they re-spawn on demand if the user switches away. 2. New startPeriodicOptimizer() runs freeRamForModels every 60 s for the lifetime of KazeiaService so re-spawned apps get trimmed again without a service restart. Tied to serviceScope so it stops cleanly on destroy. Net effect observed: avail RAM stays ~1.2–1.5 GB higher than without the sweep. Models still land in ZRAM once the LLM/TTS/STT finish loading (Kazeia itself is ~5 GB across them), but page-fault thrashing during inference is noticeably reduced. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:44:09 +02:00
Kazeia Team	39babcb158	TTS+audio+memory: ColorOS playback fixes + kill-background reclaim Three unrelated fixes rolled into one so testing on the tablet stayed coherent. All were driven by what the user was observing during live audio tests, not by pre-planned refactors. 1. Audio playback actually audible. ColorOS's AudioFlinger silently muted our AudioTrack ~600 ms after play() every time (dumpsys audio showed `event:muted updated source:clientVolume` and playbackHeadPosition stuck at 0), regardless of USAGE_MEDIA / USAGE_ASSISTANT / USAGE_VOICE_COMMUNICATION, regardless of audio focus grant, regardless of FGS type including mediaPlayback. A MediaPlayer path using the SAME usage attributes works because it routes through a different AudioFlinger thread that isn't under the same background-hardening policy. `USE_MEDIAPLAYER_FALLBACK` in Qwen3TtsEngine.kt flips playback to a WAV-per-segment pipeline. Two MediaPlayer instances are chained via `setNextMediaPlayer()` so segments transition without re-arming the DAC (that re-arm was audible as "beg beg" pops between sentences). Synth of seg N+1 runs in parallel with playback of seg N via a capacity-2 Channel, hiding synthesis latency behind playback for all but the first seg. 2. Mic no longer loops TTS back into STT. The continuous- listening VAD in KazeiaService already had a guard to drop frames while `pipelineState is Speaking`, but that state was never set by any caller — so the mic kept recording during playback and fed our own speaker output back to Whisper, creating the infinite "Kazeia talks to Kazeia" loop the user observed. Both the stream_llm intent path and the main `processLlmResponse` TTS path now wrap the TTS call with `Speaking → Idle/Listening`. 3. Free 1.6 GB of RAM at service start. The OnePlus Pad 3 with ColorOS keeps ~7 GB of Google + OPLUS background services resident at idle. With Qwen3-4B (3.2 GB) + Qwen3-TTS (1 GB) + Whisper (0.5 GB) on top, most of our model weights were going to ZRAM swap — "the NPU is stuck" reports were actually page faults paging 3 GB of LLM weights back in before each inference. New `MemoryOptimizer` kills 30-ish non-essential background packages (Google optional: YouTube, Wallet, Chromecast, Messaging, AICore, Quicksearchbox; OPLUS optional: smartsidebar, cosa, pantanal, nhs, midas, …) via `ActivityManager.killBackgroundProcesses`. Measured reclaim on first run: avail RAM 8468 MB → 10112 MB, +1644 MB. Uses KILL_BACKGROUND_PROCESSES (normal perm, no user prompt); system-critical packages and the launcher/systemui are explicitly excluded from the target list. Collateral changes: - Added FOREGROUND_SERVICE_MEDIA_PLAYBACK permission + fgsType flag (didn't fix the mute on its own, but it's correct per Android 14 policy and leaving it without would be a latent compliance risk). - Kept `USE_STREAMING_DECODE` + CP↔BigVGAN overlap code intact behind the MediaPlayer-fallback branch so reverting to the AudioTrack streaming path is a single-const flip if ColorOS ever lifts the hardening (or we move to a device without it). - New AudioTrack path has a keep-alive silence watchdog and a playback-head drain wait on stop. Both were attempts to fix the mute that didn't pan out on their own; leaving them in so the streaming path stays usable on non-hardened devices. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:37:20 +02:00
Kazeia Team	0632db1ee0	UI: drop Magisk prompt — ResourceMonitor stops probing su ResourceMonitor.init ran `su -c id` at every ChatActivity launch to see if root was available, then used root to read /sys/class/kgsl/... and /sys/bus/platform/devices/soc:qcom,msm-cdsp-rm/... for GPU/NPU usage %. That probe was the only thing still triggering the Magisk auth dialog on each app start after the no-root LLM migration. Remove the root probe and the execRoot helper. GPU/NPU reads now return -1 (UI already renders "—" for negative values). The non-root /sys/class/kgsl/kgsl-3d0/gpubusy path is kept as a best-effort — it's world-readable on some devices, silently fails otherwise. CPU and RAM readouts are unaffected (never needed root). Dead-code `su -c ...` calls remain in Qwen3TtsEngine (hexStartRunner, hexStartCpRunner, hexStopRunner, etc.) and WhisperNpuSttEngine, but all are gated behind fallback paths that don't execute under the current PTE-only config (talkerPteModule != null && cpPteModule != null short- circuits before any su call). Left in place to avoid churning the TTS Hexagon fallback; can be purged in a later cleanup pass if needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:35:18 +02:00
Kazeia Team	10fd10fd90	TTS: overlap CP↔BigVGAN — first audio 14.5s → 10.9s per segment Streaming variant of the per-segment decode pipeline. As soon as SEQ_LEN codes are accumulated from the talker/CP loop, BigVGAN is dispatched on a background coroutine while the producer keeps generating the rest of the segment. The BigVGAN consumer feeds a streaming crossfader that emits stable audio as it arrives and holds back overlapSamples for the next chunk's blend. Mirrors decodeChunked's semantics exactly so final audio is bit-identical modulo the fadeOut application location (now applied to the final emission tail instead of the full buffer; the last 40ms still get faded). Validated A/B on the same prompt 3 used in the recent benchmark: prompt: "Je me sens un peu triste aujourdhui…" seg 0 first audio: 14 485 ms → 10 936 ms (−3.5 s) end-to-end first audio (LLM trigger → audio): 16.2 s → 12.7 s Stream LLM total: 33 234 ms → 28 594 ms (−4.6 s) Short segments (<SEQ_LEN codes) and the legacy non-streaming callers (generateSegmentAudioVC, decodeChunked, multi-segment pipelines, etc.) are untouched. The new path is gated behind USE_STREAMING_DECODE so it can be reverted by flipping a single const if a regression is found. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 16:22:15 +02:00
Kazeia Team	67de8d4767	LLM: enable hybrid-mode export via num_sharding=1 — TTFT 2.9s → 113ms Re-exported Qwen3-4B in hybrid mode (prefill_forward + kv_forward) with num_sharding=1 after discovering that sharding=2 produces a multi-context .pte that the LlmModule loader cannot restore (error 5010 "Context group 1 does not exist"). Single-context hybrid .pte loads cleanly through the JNI runner and the auto-detected eval_mode=1 path. The peak RAM during export hit 49 GB, which is why sharding=2 was used originally — the /swapfile (192 GB) now absorbs it. Compile wall time with sharding=1 + hybrid is ~73 min (two graphs) vs ~30 min for sharding=2 + kv-only (one graph). End-to-end on tablet, same 'Bonjour, comment vas-tu ?' prompt: Before (kv-only, short prompt): TTFT 2865 ms, total 4034 ms After (hybrid, short prompt): TTFT 113 ms, total 1471 ms Gain: -2752 ms TTFT (96% reduction, 25× faster) Response: "Bonjour ! Je vais bien, merci de me demander. Comment vas-tu ?" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:08:31 +02:00
Kazeia Team	a41619ed67	TTS: keep BigVGAN on CPU after GPU regression; LLM filter strips more tags #2 BigVGAN GPU experiment: ORT-QNN GPU EP loaded the v2_decoder_conv ONNX model successfully (session creation 463 ms, no fallback warnings) but per-phrase inference jumped to ~3.5 s vs ~2 s on CPU 8-thread. The GPU/CPU memory transfer cost dominates for this conv-heavy decoder, and the optimization went the wrong way. Comment block updated to record both the HTP and GPU paths as tried-and-rejected so future passes don't re-walk the same ground. LLM streaming filter: extend the lookahead-based <think>…</think> suppressor to also strip singleton special tokens (<\|im_start\|>, <\|im_end\|>, <\|endoftext\|>). Previously the closing <\|im_end\|> at end of the assistant's turn leaked into the SentenceStreamer and ended up as a spurious sentence at the end of the TTS output. Same lookahead-buffer trick handles split tokens. Validated end-to-end: 'Bonjour, comment vas-tu ?' → "Bonjour ! Je vais bien, merci. Comment vas-tu ?" → seg 0 "Bonjour !", seg 1 "Je vais bien, merci." (no <\|im_end\|>), BigVGAN back to 1.8 s/phrase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 13:48:37 +02:00
Kazeia Team	f4b15a72a7	LLM JNI: auto-detect eval_mode from .pte methods (kv-only vs hybrid) Replace the hardcoded eval_mode=0 in the QNN_LLAMA branch with a runtime check on the loaded module's method names: if the .pte exposes a prefill_forward graph, switch to EvalMode::kHybrid (1) — the runner can then batch the entire prompt through prefill_forward in one parallel pass instead of running 52 ms/token sequentially through kv_forward. Falls back to kKVCached (0) when only kv_forward exists, matching the current .pte behaviour exactly so this is a safe in-place upgrade ahead of the hybrid re-export. Sanity-tested with the kv-only Qwen3-4B .pte already on the tablet: Prompt 'Bonjour, ça va ?' → "Bonjour ! Ça va, merci de me demander ça. Tu as une question ?", TTFT 2728 ms, total 4158 ms — no change vs the hardcoded eval_mode=0 build. Once the hybrid Qwen3-4B export finishes (~50 min compile, both prefill_forward + kv_forward graphs), the same JNI binary will pick up the new .pte and TTFT should drop to <1 s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:45:10 +02:00
Kazeia Team	3d435f9cdd	LLM: trim system prompt to drop ~27 prefill tokens (-1.3s TTFT) The verbose 55-token system prompt was the cheapest TTFT win on the kv-only path (52 ms per prefill token). Compacting it to 25 tokens while keeping the three load-bearing constraints — Kazeia identity, French only, short replies, /no_think — measurably improved end-to-end latency. Validated 'Bonjour, comment vas-tu ?' on tablet: Before: prompt_tokens=80, TTFT=4202ms, total=5716ms After: prompt_tokens=53, TTFT=2865ms, total=4034ms (-1.3s, -32% TTFT) Reply quality preserved: "Bonjour ! Je vais bien, merci. Comment vas-tu ?" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:16:11 +02:00
Kazeia Team	7dc6704e95	docs: add before/after performance comparison to no-root report Concrete measurements taken 2026-04-14 on the same Qwen3-4B .pte and the same C++ runner — only the invocation path differs (subprocess su -c vs in-process LlmModule JNI). Confirms no LLM regression and a measurable speedup on the TTS path thanks to the shared QNN context (Talker 37 ms/step vs 45-65 ms/step before). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:37:15 +02:00
Kazeia Team	6c7746c5d0	docs: add post-mortem to no-root report — issue resolved The root cause was process-credential loss across fork+exec, not the QNN SDK version mismatch I had hypothesized. Switching the LLM to in-process ExecuTorch LlmModule (Zygote-forked context, accepted by adsprpcd's FastRPC credential check) eliminated the su requirement. The original investigation sections are kept verbatim for reference; the new section 10 documents the actual fix, the patches applied to ExecuTorch, the metrics validated end-to-end, and pointers to the project memory entry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:19:27 +02:00
Kazeia Team	b57719fa5e	LLM: filter <think> tokens out of the streaming TTS path Even with /no_think in the system prompt Qwen3 still emits an empty <think>…</think> wrapper before the real answer. Without filtering, the SentenceStreamer treats '<think>' as a sentence boundary and feeds three tokens of XML into the TTS, producing audible parasites at the start of each reply. The new in-callback filter buffers a small lookahead (just enough to span "</think>"), suppresses everything between the open and close tags, and flushes the surrounding prose to onToken in order. With the lookahead, tags that arrive split across decoded pieces ("<thi"+"nk>") still match. Validated end-to-end: prompt 'Bonjour, comment vas-tu ?' now streams sentence-by-sentence to the TTS — first segment "Bonjour !" reaches the talker at 4.6 s, no <think> sneak-through. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:16:08 +02:00
Kazeia Team	f32b5ddfdd	LLM no-root: validate end-to-end pipeline, fix kv_io_bit_width detection End-to-end validation on OnePlus Pad 3 with stream_llm intent: Prompt: 'Bonjour, comment vas-tu ?' Response: 'Bonjour ! Je suis là pour t'écouter. Comment vas-tu aujourd'hui ?' TTS: Talker(PTE) 37ms/step, CP(PTE) 73ms/step, audio synthesized. No su, no Magisk prompts. Two fixes since the previous commit: 1. ExecuTorchLlmEngine: pass echo=false to LlmModule.generate() — by default the runner echoes the prompt tokens back via the callback, which fed the ChatML wrap (<\|im_start\|>user …) into the SentenceStreamer and TTS. 2. jni_layer_llama.cpp: pick Runner<uint8_t> vs Runner<uint16_t> based on the model's get_kv_io_bit_width metadata, mirroring qnn_llama_runner.cpp main(). The hard-coded uint16_t was wrong for our Qwen3-4B export (which uses 8-bit KV I/O) and produced fluent-looking but completely random tokens ("blocked罩ug darkestSOLEQuotes作者本人 …") — same symptom whether greedy or sampled, the smoking gun for a width-mismatched KV cache reinterpretation. Other tweaks: - temperature=0.0 in the QNN_LLAMA branch of jni_layer_llama.cpp (greedy, matches the working qnn_llama_runner --temperature 0 invocation) - shared_buffer=true (same as binary defaults) - Kotlin chat template mirrors qnn_llama_runner.cpp's get_formatted_prompt for Qwen3 (user-first, then optional system, then "<\|im_start\|>assistant" with no trailing newline — that quirky ordering is what the .pte was trained on) TFTT is ~4 s for a 77-token prompt on kv-only mode (sequential prefill, one forward per token). To get a sub-second TTFT we'd need to re-export the model in --model_mode hybrid which adds a parallel prefill_forward graph; not required for the conversational use case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:11:23 +02:00
Kazeia Team	809a6d4fed	LLM no-root: migrate to in-process LlmModule (JNI) — zero su calls The root cause of the previous su-c requirement was that Qualcomm's FastRPC kernel driver rejects processes spawned via ProcessBuilder fork+exec because they lose supplementary GIDs on exec. Zygote-forked app processes retain the proper init-configured credentials and are accepted by the adsprpcd service, which is why ORT-QNN (Whisper, in-process) worked while the subprocess qnn_llama_runner did not. Running the LLM in-process via ExecuTorch's LlmModule bypasses the fork+exec path entirely. What this commit does: - ExecuTorchLlmEngine now uses org.pytorch.executorch.extension.llm.LlmModule with MODEL_TYPE_QNN_LLAMA=4 (routes to example::Runner in jni_layer_llama.cpp, the same C++ runner that qnn_llama_runner embeds). - All su, ProcessBuilder, file-based prompt/response plumbing, and run_llm.sh gone. ChatML template is built in Kotlin; tokens stream in via LlmCallback. Supporting changes under executorch-patches/llm_in_process_jni.patch: 1. backends/qualcomm/CMakeLists.txt — gate PyQnnManagerAdaptor on NOT ANDROID. The original guard (CMAKE_SYSTEM_PROCESSOR MATCHES x86_64) misfires in a nested scope during Android cross-compile and tried to build the host Python bindings. 2. extension/android/jni/jni_layer_llama.cpp — hardcode decoder_model="qwen3" (was "llama3") and pass eval_mode=0 (EvalMode::kKVCached) + shared_buffer=true to match our hybrid_llama_qnn.pte which only contains kv_forward, not prefill_forward. Build: scripts/build_android_library.sh arm64-v8a with QNN_SDK_ROOT pointing to /opt/Kazeia/qnn_sdk_242/qairt/2.42.0.251225 and EXECUTORCH_BUILD_QNN=ON. Produces libexecutorch_jni.so (192 MB) with QNN v2.42 backend + the llama runner code, plus libqnn_executorch_backend.so. Both staged in jniLibs. Validated on OnePlus Pad 3: LlmModule.load() completes in 4.2 s, no su prompts, Pipeline ready with STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine). TTS .pte still loads with the upgraded v2.42 runtime — no regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 10:39:50 +02:00
Kazeia Team	6e6a2d9f82	Baseline before no-root migration: working state with root LLM Commit de sauvegarde avant la tentative d'unification QNN SDK v2.37 et suppression du su -c pour le LLM. État actuel fonctionnel : - LLM Qwen3-4B via su -c qnn_llama_runner (v2.42 dans /data/local/tmp/kazeia-et/) - TTS talker + CP via ExecuTorch .pte JNI (v2.31 dans jniLibs) - STT Whisper via ORT-QNN 1.24.3 Le rapport kazeia-no-root-report.md documente en détail les tentatives de no-root et leurs échecs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 08:19:36 +02:00
Kazeia Team	364016b7b8	LLM+TTS: short-response system prompt, PTE streaming fallback - ExecuTorchLlmEngine: system prompt forces French, 1-2 short sentences, /no_think so the full budget goes to the answer (Qwen3 was consuming 120+ tokens on <think>); eval_mode 0 matches our kv-mode export. - Qwen3TtsEngine.generateSegmentAudioVC: when the Hexagon talker socket isn't open, fall back to runInterleavedPteFromEmbeds so the Stage 3 streaming session still produces audio. Without this the session opened, accepted sentences, and silently emitted empty PCM. Documents the QNN SDK version-skew pitfall in ExecuTorchLlmEngine.kt ahead of the upcoming migration to a unified v2.42 toolchain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 00:17:08 +02:00
Kazeia Team	9930bfa392	LLM: enable Qwen3-4B NPU (21 tok/s) in service pipeline - ExecuTorchLlmEngine: eval_mode 0 (our .pte is kv-mode, not hybrid) - KazeiaService: call llm.load() after TTS init; try/catch falls back to echo mode if the runner or .pte are missing. Pipeline on device: STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine). Validated on OnePlus Pad 3: LLM ready in ~8 s, gen 21.3 tok/s, RSS 1.76 GB in the qnn_llama_runner subprocess (out-of-process from the Kazeia app). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 23:00:25 +02:00
Kazeia Team	19f934af25	LLM NPU: Qwen3-4B QNN export patches + deployment notes Adds executorch-patches/ with the local modifications to /opt/Kazeia/executorch (upstream pytorch/executorch v1.2.0) required to export Qwen3-4B to QNN for the OnePlus Pad 3 Hexagon V79. Tablet runs 18.2 tok/s (gen), TTFT 0.9 s, RSS 1.76 GB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 22:56:42 +02:00
Kazeia Team	f548e02283	TTS: dynamic EOS-rank boost terminates generation cleanly across voices Replaces the fixed maxGen + length-based boost with a fully dynamic end-of-utterance detector that watches the model's own EOS logit rank. End result on the Baer 3-segment monologue, validated by user as "FORMIDABLE" / "impeccable" with both Damien and Zelda voices: - All 3 segments terminate via EOS (no maxGen cap hit) - No "page beg beg" filler tail - No abrupt cuts between segments - Audio durations 5-8 s per segment, matching Python within ~10 % How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt): 1. At every decode step, compute the rank of CODEC_EOS in the repetition-penalised logits. Mid-utterance the rank sits at 150-700 (model is committed to producing speech). Approaching the natural end, the rank dips toward top-50. 2. Arm the boost only when EOS rank stays below eosRankTrigger=60 for THREE consecutive steps. The 3-step requirement filters out transient single-step dips that occur during low-energy phonemes mid-sentence (without it, short sentences would terminate after ~3 s). Arming is also gated by eosBoostMinStep (50 % of expected speech length) so we never arm in the very first frames. 3. Once armed, the boost increments monotonically: each subsequent step adds boostStepsActive * eosBoostScale to the EOS logit. The accumulated boost lifts EOS above top-1 within 1-3 steps, the argmax check fires, and the loop breaks. Scale=4 gives the model a small natural decay before termination; scale=5 was perfect-but- slightly-clipping, scale=3 wasn't strong enough to outpace the growing top-1 logit. Other tweaks bundled in this commit because they all contribute to the clean output: * Inter-segment gap 120 → 250 ms — gives the listener a perceived sentence boundary instead of a hard concatenation. * fadeOut(audio, 40) on every segment — cosine roll-off over the last 40 ms so the EOS-clipped tail decays naturally instead of sample-clipping. * top_k 50 → 200 in the fallback sample call — wider pool to keep EOS reachable when the boost just fails to hit argmax. Voice swap is a 45 KB file push (damien_voice_prefix.bin and damien_voice_suffix.bin). Successfully tested today with Elodie (female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36) as the baseline — same Kotlin code, no rebuild needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:13:04 +02:00
Kazeia Team	c25040a780	TTS: conditional tail-trim + export script accepts voice path arg Two small changes: * export_tts_text_embeddings.py now takes the voice wav as an optional second CLI arg (defaults to damien_15s_24k.wav). Lets the same script capture voice-prefix+suffix for any speaker wav without editing the source — used today to test Elodie alongside Damien. * synthesizeTextStreaming + generateSegmentAudioVC only run the trimTailLowEnergy trim when n >= maxGen. The trim's 35%-of-peak threshold is tuned to catch "page beg beg" filler after the talker fails to emit EOS — but it was cutting valid speech when EOS fired early (observed on Elodie seg 1: 10.08 s → 2.92 s, a 4-second over- trim). With the guard it's a no-op on converging generations and only fires on the ~15% of segments that hit maxGen. Validation after the fix (Elodie, Baer monologue): - seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut, the filler tail) - seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is - seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is Voice prefix/suffix shape is speaker-invariant except position 7 (the xvector). Confirmed by capturing both Damien and Elodie and diffing: positions 0-6 and 8 identical within 1e-8, suffix identical within 1e-8, only pos 7 has a different xvector embedding (norm 10.36 vs 10.12). That means swapping speakers on-device is a 45 KB file push — no app rebuild, no re-export of the 297 MB vocabulary table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 11:32:33 +02:00
Kazeia Team	0833d1bd21	TTS: route all synthesizeAndPlay calls through Stage 3 streaming session Replaces the four per-sentence TTS entry points (pipeline.speak, REPEAT voice command, echo-mode TTS, LLM-response TTS) with a single shared pipeline.speakText() that: * opens a Qwen3TtsEngine streaming session when the TTS backend is Qwen3 (voice-cloning path); * feeds the whole response through a SentenceStreamer so the first sentence starts playing as soon as it's decoded; * falls back to the old one-shot synthesizeAndPlay for non-Qwen3 TTS engines (AndroidTts, Chatterbox) that don't expose a session API. KazeiaPipeline.speakText is now public so KazeiaService can use the same dispatch — previously each call site re-implemented the "streaming-or-fallback" logic or just called synthesizeAndPlay and waited for the full synthesis. Enabling the real on-device LLM is a separate task (task #48): the existing llama-cli binary has ggml-hexagon linked in and fails to init the DSP (0x80000406) when the TTS Hexagon runners hold the session. Needs either a CPU-only llama-cli build or the restored ExecuTorch qnn_llama_runner setup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 11:12:14 +02:00
Kazeia Team	2f07901ff3	TTS Stage 3: LLM stream → sentence split → TTS session → shared AudioTrack Closes the loop on on-device conversational TTS. The LLM's token stream is now consumed by a SentenceStreamer which fires a callback the moment a terminal-punctuation boundary appears; each sentence is enqueued to a persistent TTS streaming session that generates and plays audio through a single shared AudioTrack. Sentence N's audio plays while sentence N+1 is being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and no "wait for full response before hearing anything". Mocked-LLM validation on the 3-sentence prompt: "Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui." - First sentence detected: 1 ms - Seg 0 prefill (Hex): 567 ms - Seg 0 generated: 4 200 ms (18 tokens, 1.4 s audio) - Seg 1 generated: 9 100 ms (42 tokens) - Seg 2 generated: 11 000 ms (46 tokens) - Session closed: 33 500 ms (all audio drained) Changes: * tts/SentenceStreamer.kt — 50-line helper that buffers tokens and fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4 so "Oui." / "Bonjour." count as real sentences; higher thresholds swallowed conversational openers into the next segment and delayed first audio. flush() for the final partial sentence. * Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession triplet. startStreamingSession opens a 30-second MODE_STREAM AudioTrack plus a background worker coroutine that pulls sentences from an unlimited Channel. enqueueSentence is non-blocking; the worker serialises generation so audio order matches enqueue order. generateSegmentAudioVC is the per-sentence body (tokenize → prefill build → Hexagon gen → decode) without the WAV-save side effects that the /stream_text intent path does. * KazeiaService new intents: - stream_llm : real LLM path (needs LLM loaded; currently the debug build runs echo-mode so this path is shipped but requires production config to exercise). - stream_llm_mock : fakes the LLM stream by splitting the given text on spaces with 50 ms per "token" — matches the ~20 tok/s rate the on-device LLM produces and lets Stage 3 be validated without flipping the LLM on. Architectural notes: - AudioTrack buffer is 30 s so generation can run ahead of playback without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short sentences, so for a 2-3 sentence response the buffer actually drains between segments and the user hears a short gap — expected, not a bug. Masking that gap requires RTF < 1 which is out of scope. - Hexagon KV is reset between sentences (hexReset) so the talker doesn't see stale context. Prefill observed cb0 = 1995 on every sentence that starts with a capital letter, matching the Python greedy reference — confirms prefill reconstruction is stable across segments within a session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:52:46 +02:00
Kazeia Team	7f1a44c23d	TTS Stage 2: on-device voice-cloning TTS for arbitrary text Removes the PC-side prepare_tts_segments.py dependency for day-to-day generation. The tablet now tokenizes, embeds, and voice-clones any French (or Qwen3-supported) text with no network, no ADB push per phrase, and quality that matches Python's reference on "Bonjour, je suis Kazeia, je suis là pour vous écouter." — user validation: "impeccable". Three pieces that compose the path: 1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder, Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces identical token IDs to HF's Qwen2TokenizerFast on the test phrase: [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915, 4914, 9012, 90229, 2676, 13]. 2. export_tts_text_embeddings.py — one-time PC export of: * Full projected text embeddings for the entire 151936-token vocab as fp16 (297 MB). Sanity check: live vs stored max abs diff 1.15e-4 on token 1043. Mmap'd on-device so it stays off the Java heap and leaves room for the 125 MB cp_embeddings alloc. * Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a Python voice-clone capture, text-invariant across segments. * Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1 of the same capture. Also text-invariant (diff = 0.0 across 3 different-text segments). Without it the talker never sees "text ended" and decode falls into page/beg repetition. * Qwen3 tokenizer vocab.json + merges.txt. 3. Qwen3TtsEngine.kt: * mmap loader for the embeddings table + buffered fp16→fp32 lookup (halfToFloat covers subnormals/inf/NaN so pathological tokens don't become 0). * Stage 2 assets detected at init; missing file transparently falls back to legacy 1050-token reduced-vocab path. * synthesizeTextStreaming(text, onSegmentReady) — new public API: sentence-split → BPE → build prefill as [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix] (exact structure Python emits; verified bit-for-bit by matching captured Baer prefill positions against text_projection(tok)+ codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode each segment through the existing BigVGAN pipeline → callback. * runHexGenWithPrefill — Hexagon prefill + interleaved CP decode loop. Feeds tts_eos once, tts_pad thereafter (same schedule as Python's voice_clone). Degeneracy guard stops when 9 identical cb0 in a row appear — catches the rare "page beg beg beg" tail when EOS never fires. maxGen = ids.size4 + 10 matches the typical 3.3 codec-frames-per-text-token that Python produces. Prefill build uses the speaker's captured prefix/suffix rather than the legacy in-code buildPrefillEmbeddings that puts only one text token in prefill — the structure mismatch produced garbled audio in the first attempt of this commit. 4. KazeiaService.kt: new stream_text intent extra wires text input to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer. First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon 8 Elite (prefill + 74-token decode), vs a 3-phrase sentence batch that was 65 s pre-streaming — streaming + on-device text together unblock the MVP chat loop. Known caveats: * 297 MB on-device footprint for the embedding table. Acceptable on OnePlus Pad 3; can be quantized further (int8 per-row) if storage becomes tight. * First init adds ~3 s for BPE vocab + merges load (151k × 2 hash- maps). Happens once per process. * maxGen cap means extremely long sentences may truncate. The sentence splitter already keeps segments ≤120 chars so this hasn't been observed in practice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:12:09 +02:00
Kazeia Team	5e416713ce	TTS Stage 1 streaming: play each segment the moment it's decoded Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX CP backend. First audio arrives at ~20s (vs ~65s for the full phrase non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning is preserved per segment because each segment now ships its own full prefill. Changes: * Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady) reads single- or multi-segment embeds, runs prefill + generation + VQ decode + BigVGAN per segment, and fires the callback with each segment's ShortArray the moment it's ready. Saves per-segment WAVs (kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav for offline inspection. Extracted the common generation loop into runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and streaming paths share exactly the same code (no quality drift between modes). Added hexReset() between segments so segment 2's prefill logits don't contain segment 1's KV state. * vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0 it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a codebook row index — reading past the 2048-row buffer. The short Baer probe never hit this; longer phrases do. Clamp any out-of-vocab code to 0 at allCodebooks build time. * KazeiaService: new stream_pipeline intent extra wires the callback to an AudioTrack MODE_STREAM instance, writing each segment's audio as soon as it comes back. Logs time-to-first-audio. * prepare_tts_segments.py: the previous version only captured 1-token decode calls and substituted a generic 9-embed "prefill_base" pulled from an unrelated single-segment file — dropping the per-segment xvector conditioning AND the text-encoded embeddings, so Hexagon produced garbled mixed speech for segments 2..N. Now captures the multi-token prefill call too (like prepare_tts_voiceclone.py) so each segment is self-contained. Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the Snapdragon 8 Elite with current config means each segment takes longer to generate than it takes to play, so audible gaps between segments remain. Removing the gaps requires either (a) producer/consumer parallelism across two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the ~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug and the .pte path contends with Hexagon talker on the DSP). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:43:30 +02:00
Kazeia Team	de878ddf5c	TTS tremor investigation: identify cross-arch numerical floor, gate diag flags Extensive investigation of the audible "tremor" in the generated voice-cloned audio. Conclusion is architectural, not a bug: * Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998 * ONNX Runtime CP V2 is bit-identical to PyTorch greedy CP (0.24% residual divergence measured by injecting Python's captured cb0 at each step — 14/16 codebooks match 100%, cb14/cb15 miss 1 token out of 53) * BigVGAN decoder is bit-identical to PyTorch (validated earlier) * Therefore the tremor is caused entirely by the ~28% of cb0 argmax flips where the tiny fp16 logits drift crosses the top-1/top-2 margin. This cascades through the autoregressive chain into a trajectory the model never saw at training time → incoherent artifacts. Cross-architecture test (x86 AVX-512 / ARM64 NEON+HMX) cannot be zeroed by any runtime swap — LibTorch Android would use NEON kernels with a different reduction order than PyTorch x86, same class of error, smaller but non-zero residual. Temperature tweaking (0.3 → 0.9) and greedy-vs-sample gave no perceptual difference: the floor is numeric, not in the sampling layer. Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a thesis-relevant finding about on-device TTS deployment limits. Cleanup: * All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp, force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release builds JIT-eliminate the file checks; debug builds keep the whole experimental toolchain for re-running the analysis for demos/thesis. * force_hexagon + force_cp_v2 stay unconditional — production routing. * Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9). * Native TTS pipeline (executorch-custom/jni_layer_tts.cpp, app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing audio past EOS where Python's seeded sampler terminated naturally). * scripts/prepare_tts_voiceclone.py: new script that captures Python greedy-CP reference (stochastic talker for EOS, deterministic CP) for token-by-token comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 00:15:14 +02:00
Kazeia Team	ee186e9049	Auto-segmentation for long texts + dynamic pipeline - prepare_tts_native.py: auto-splits long text at sentence/comma boundaries, max 15 tokens per segment - Multi-segment format: each segment gets fresh KV cache - Formula: target_len = n_tokens × 3.2 + 5 per segment - Tested on Edouard Baer monologue: 28 segments, 102s audio Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 00:08:59 +02:00
Kazeia Team	199bc4fbc9	Full native C++ TTS validated on short + long phrases Dynamic formula: target_len = n_tokens × 3.2 + 5 (calibrated) - Short "Bonjour..." (18 tokens → 62 trailing): OK - Long "Je suis Kazeia... difficiles" (30 tokens → 101 trailing): OK RMS trim disabled (garbage is loud, can't distinguish from speech). Length controlled purely by maxTokens = trailing count. Pipeline: prepare_tts_native.py "any text" → adb push → run → audio Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:51:05 +02:00
Kazeia Team	dafbe2a52b	FULL NATIVE C++ TTS pipeline — any text, perfect quality The complete solution for native TTS on NPU: 1. Python: tokenize + text_projection only (30ms, no model generation) 2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×) 3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad 4. RMS-based auto-trim of trailing noise after speech ends Key insights: - Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent - codec_sum from our NPU codes is coherent (same model instance) - Text tokens consumed 1:1, then eos padding for remaining steps - RMS trim detects 15% energy drop from peak → cuts garbage Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..." prepare_tts_native.py works for ANY text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:39:06 +02:00
Kazeia Team	09d36f2025	Root cause found + on-device embed capture + KV=100 restored Root cause: embeds must come from SAME NPU model instance. Python fp32 embeds cause divergence on NPU fp16 after ~20 steps. Solution: Java pipeline captures embeds on-device during generation. Captured embeds work perfectly with C++ pipeline (validated "bon"). - Added capture mode: touch /data/local/tmp/kazeia/capture_mode - Embeds saved to captured_embeds.bin (same format as pipeline input) - KV_LEN restored to 100 (KV=64 lost role tokens → quality loss) - C++ uses pre-computed embeds as-is (no double codec_sum) Production path: Java pipeline RTF 1.8 for new texts (good quality) Replay path: C++ pipeline RTF 1.26 with captured embeds Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:00:37 +02:00
Kazeia Team	3dcf73aa38	Restore KV=100 + fix as-is embeds + multi-segment support - KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens) - C++ uses pre-computed embeds as-is (no double codec_sum) - Multi-segment format support in Kotlin (detects n_segments header) - prepare_tts_segments.py: splits text + generates per-segment embeds - Quality issue: Python-captured embeds differ from original working file (original was likely captured on-device, not from Python model.forward) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 22:26:20 +02:00
Kazeia Team	10a3904d7d	Multi-segment TTS for long text: split → generate → concatenate - prepare_tts_segments.py: splits text at sentence boundaries, generates Python pre-computed embeds per segment - Kotlin: detects multi-segment file format, processes each segment independently (fresh KV cache), concatenates audio - Long text tested: 3 segments, 335 tokens, 26.8s audio, RTF 1.67 File format: n_segments, then per segment: nPrefill, nTotal, embeds[] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:34:05 +02:00
Kazeia Team	24157c0a68	Fix: use pre-computed embeds as-is (no double codec_sum) Pre-computed embeds from Python already contain codec_sum+text. Using them as-is works correctly. After exhausted, fallback to our codec_sum + pad. Long text: 191 tokens, 15.28s audio, RTF 1.27 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:10:23 +02:00
Kazeia Team	f6df1738c5	Add prepare_tts_embeds.py for any text + codec_sum fix - prepare_tts_embeds.py: generates pre-computed embeddings from any text via Python generate_voice_clone, capturing talker inputs - C++ pipeline: always build codec_sum + trailing (not as-is) - maxTokens: 4× trailing count (audio >> text tokens) - Long text tested: 224 Python tokens → 125 NPU tokens (10s audio) - Text-only embeds don't work (model needs Python pre-computed codec_sum) Usage: python3 scripts/prepare_tts_embeds.py "Your text" output.bin adb push output.bin /data/local/tmp/.../full_pipeline_embeds.bin Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:05:42 +02:00
Kazeia Team	173606dae7	Stable: decoder 8T optimization + restore pre-computed embeds - BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads - Restored pre-computed embeds format (codec_sum+text from Python) - Text-only trailing embeds don't work: model needs codec_sum for EOS For long phrases, pre-computed embeds must be generated from Python. RTF 1.26 on short phrase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:42:02 +02:00
Kazeia Team	42bbb96fd8	Optimize decoder: BigVGAN 8T, small models 4T → RTF 1.26 BigVGAN benefits from 8 intra-op threads (all perf cores). Pre_conv and pre_transformer kept at 4T (small, less contention). BigVGAN: 2757ms → 1872ms (-885ms), decode total: 2830ms → 2035ms Pipeline: 6438ms → 5834ms → RTF 1.26 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:00:05 +02:00
Kazeia Team	a688edc9ec	Reduce talker KV_LEN 100→64: saves 148ms (RTF 1.31) KV window of 64 sufficient for ~70 token generation (10 prefill + 58 gen). 36% less KV memcpy per talker step (28L × 2 × 64×8×128 vs 100×8×128). Generation: 3795ms → 3647ms, total: 6438ms → 6093ms Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:47:30 +02:00
Kazeia Team	4dcc4bb8b3	Fix KV buffer + revert HTP decoder (BigVGAN too complex for HTP) - Restored intermediate KV buffer for talker (direct output→input caused trembling from buffer overwrite during execute()) - BigVGAN HTP compilation takes >5min, not viable - RTF 1.35 with clean audio quality Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:37:50 +02:00
Kazeia Team	985fd9cff9	Direct output→input KV copy: RTF 1.51 → 1.31 Skip intermediate KV buffer: copy output tensors directly into next step's input pointers. Saves ~1.5GB/run of memcpy for talker (28L × 2 × 100×8×128 floats × 58 steps) and CP similarly. Generation: 4007ms → 3713ms, total: 7180ms → 6078ms Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:23:45 +02:00
Kazeia Team	14f7e5b05f	Optimize CP+talker: eliminate prepare_input_tensors per step Cache input tensor pointers after first prepare_input_tensors call, then memcpy directly into them for all subsequent steps. Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls). Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:16:38 +02:00
Kazeia Team	e647911329	Shared Module C++ pipeline: RTF 1.6 with perfect quality Key breakthrough: C++ pipeline loop using the SAME Method* instances that Java loaded (via Module::method("forward")). This gives: - Same QNN compiled graph → identical numerical results → no trembling - C++ loop → no Java Tensor/EValue allocation overhead - prepare_input_tensors + memcpy + Method::execute (like cp_et_runner) Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s Added to executorch JNI: - Module.nativeSetCpModule() — registers CP module for pipeline - Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++ - Updated executorch.jar with new native method declarations From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:05:58 +02:00
Kazeia Team	38c0e9874a	Disable C++ pipeline (QNN non-deterministic), keep Java RTF 1.8 Root cause found: QNN HTP level=1 compilation is not bitwise deterministic. Two loads of the same .pte produce slightly different hidden states → audible trembling in decoded speech. Java pipeline uses single QNN instance → no trembling, validated quality. C++ pipeline code preserved for future use when QNN context caching is fixed (would make both loads use same compiled graph). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 11:42:49 +02:00
Kazeia Team	439629c9bf	Revert "Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9)" This reverts commit `0f027c5fde`.	2026-04-09 11:03:52 +02:00
Kazeia Team	0f027c5fde	Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9) Reuse float arrays and Tensor/EValue objects across talker steps instead of creating new ones each iteration. Eliminates ~7s of GC overhead from thousands of JNI object allocations. Same validated audio quality as before, no C++ pipeline needed. Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:59:13 +02:00
Kazeia Team	8e536094df	Fix C++ pipeline eos/pad + disable for quality (keep Java default) - Fixed trailing embed handling (use pre-computed as-is) - Added eos/pad embed params to nativeRun - Improved C++ PRNG for sampling - Disabled native pipeline: slight quality regression vs Java (two separate QNN instances give different numerical results) - Java pipeline (RTF 1.8) kept as default for validated quality Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:53:19 +02:00
Kazeia Team	3b01302cfb	Fix missing eos/pad embeddings in native C++ pipeline The native pipeline was adding zeros after trailing text tokens instead of tts_eos_embed then tts_pad_embed. This caused the model to mispronounce final words (e.g. "développement" → "devopment"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 10:35:05 +02:00

1 2

55 Commits All Branches Search

55 Commits

All Branches