Dropped the internal waveform lines — not what we wanted visually —
and replaced them with a spectrum-driven deformation of the sphere
outline itself. Each of the 12 log-spaced bands drives one Fourier
mode of the perimeter (band b → mode b + 2, so modes 0/1 stay
circular and higher bands produce tighter ripples). Low bands pull
the shape into wide asymmetric bumps that feel like formants; high
bands add quick sibilant-like tremors. Phase advances faster for
higher modes so tight ripples visually match high-frequency content.
Overall displacement is gated by the RMS envelope so silence is
quiet and loud syllables distort strongly. Fill + highlight are
clipped to the deformed path so the gradient follows the shape and
it reads as a single living object rather than a circle with stuff
bolted on.
Removed drawSpectrumBars and drawWaveformLine.
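For reference, the band-to-mode mapping described above can be sketched
in a few lines. This is a minimal illustration in Python, not the view's
Kotlin code: `deformed_radius`, `max_disp` and `phase_speed` are
hypothetical names, and the averaging over bands is an assumption.

```python
import math

def deformed_radius(theta, base_radius, band_levels, envelope, t,
                    max_disp=0.15, phase_speed=0.5):
    """Radius of the outline at angle theta (radians) and time t (seconds).

    band_levels: one level in [0, 1] per spectrum band.
    envelope:    RMS envelope in [0, 1]; it gates the whole displacement.
    """
    offset = 0.0
    for b, level in enumerate(band_levels):
        mode = b + 2                       # band b drives Fourier mode b + 2
        phase = phase_speed * mode * t     # assumed: higher modes advance faster
        offset += level * math.cos(mode * theta + phase)
    offset /= max(1, len(band_levels))     # assumed normalisation
    return base_radius * (1.0 + max_disp * envelope * offset)
```

Because only modes 2 and above are used, the mean radius over a full
turn stays at `base_radius`, and an envelope of zero gives a perfect
circle.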
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
**Regression fix**: when synthesis of segment N+1 ran longer than the
playback of segment N (e.g. 5 s synth for a 1.5 s "Bonjour !"), the
previous MediaPlayer was already in the Completed state by the time
we queued the next one. setNextMediaPlayer() on a Completed player is
a documented silent no-op — so the second sentence never started and
the user only heard the first part of the reply.
Rewrote playChainedMediaPlayers with per-player CompletableDeferred
tracking: before calling setNextMediaPlayer() we check whether the
current player's completion deferred has already fired; after awaiting
completion we verify that the next player really auto-started (checking
isPlaying / currentPosition) and call start() explicitly if the chain
missed. Belt-and-suspenders against the race either way.
Removed the now-unused waitForPlaybackCompletion helper.
**Visual change**: in-sphere spectrum bars replaced with three
superimposed Bézier "deforming lines", mirrored above/below a central
baseline, with a soft cosine taper so the curves decay to zero at the
sphere's left/right edges (matches the circular mask). Each line has
its own slow-moving phase + gain + thickness + alpha so the three
overlap to give depth — closer to an oscilloscope trace than an EQ.
Low-level sin jitter keeps the lines alive during quiet passages,
amplitude-gated so true silence is a flat line.
User-facing change: no bars anymore. The sphere now "breathes" with
flowing waveforms matching its voice.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete redesign of AudioVisualizerView based on feedback: the orb
is now the app's visual face, takes the top ~60% of the chat area,
and has clearly distinct behaviour in each state.
- **Idle**: slow 5 s breathing (scale 0.88 → 1.00 via cos easing),
pure round shape, soft halo in phase. No high-frequency motion.
- **Listening**: organic blob outline built from 8 Fourier modes
whose amplitude scales with live mic RMS; a thin shimmering arc
rotates around the orb while mic energy is present; continuous
micro-ripples pulse outward. Looks clearly 'alive and attentive'
vs Idle's static breathing.
- **Speaking**: the orb becomes a contained spectrometer. A pre-
computed log-spaced spectrogram (12 bands, 120 Hz–4 kHz,
Hann-windowed FFT, one column per 50 ms of audio) is rendered as
vertical rounded-rectangle bars CLIPPED to the sphere outline so
they really look like the sphere itself speaking. Bar heights
interpolate between spectrogram frames and exponentially smooth
toward the target for fluid 60 fps motion. Outer halo pulses with
the RMS envelope; ripples release on envelope peaks.
- **Per-voice color**. Eight-entry palette (Damien lavender,
Elodie rose, Jerome aqua, Richard amber, Amir emerald, Didier
indigo, Sid peach, Zelda periwinkle). Halo, accent, bars, ring,
and ripples are all derived from a single voiceColor so switching
the voice spinner tweens the entire scene to the new identity
over a few frames. Color stored on both KazeiaService (for
persistence across process/view rebinds) and pushed directly to
the view for instant feedback at selection time.
Sidecar pipeline changes:
- Qwen3TtsEngine now computes per-segment spectrogram alongside the
RMS envelope (new computeSpectrogram + an in-place radix-2 FFT).
FFT_SIZE = 1024, hop = 50 ms, 12 log-spaced bands. SegmentReady
carries both arrays; onSegmentPlaying is (sentence, durationMs,
rmsEnvelope, spectrogram).
- KazeiaPipeline.speakText forwards the new callback shape.
- KazeiaService.VisualizerSignal.Speaking now carries the
spectrogram and the new voiceColor StateFlow.
- ChatActivity passes both to the view and collects voiceColor.
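A rough sketch of how 12 log-spaced bands between 120 Hz and 4 kHz map
onto FFT bins, written in Python for illustration only. The function
names and the rounding of bin edges are assumptions, and the sample rate
is left as a parameter because the commit does not state it.

```python
import math

def log_band_edges(f_lo=120.0, f_hi=4000.0, bands=12):
    """Return bands + 1 edge frequencies with a constant ratio between them."""
    ratio = (f_hi / f_lo) ** (1.0 / bands)
    return [f_lo * ratio ** i for i in range(bands + 1)]

def band_bins(sample_rate, fft_size=1024, f_lo=120.0, f_hi=4000.0, bands=12):
    """Map each band to a half-open range [first, last) of FFT bin indices."""
    edges = log_band_edges(f_lo, f_hi, bands)
    hz_per_bin = sample_rate / fft_size
    ranges = []
    for lo, hi in zip(edges, edges[1:]):
        first = int(math.floor(lo / hz_per_bin))
        # Guarantee at least one bin per band, even for narrow low bands.
        last = max(first + 1, int(math.ceil(hi / hz_per_bin)))
        ranges.append((first, last))
    return ranges
```

With FFT_SIZE = 1024 the low bands span only one or two bins, which is
why the sketch forces a minimum width of one bin.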
Layout: vertical chain between audioViz (weight 3) and rvMessages
(weight 2) so the orb owns ~60% of the chat panel and the chat list
takes the remainder. Removed the fixed 140 dp constraint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a breathing lavender orb centred above the chat list that tracks
the actual audio state of the app:
- **Idle**: slow respiratory pulsation (~4 s cycle) at 20 fps. The
chatbot is visually "awake" without animating loudly.
- **Listening**: halo swells with live mic RMS from the VAD loop, so
the user sees Kazeia hearing them even before Whisper has produced
any transcription. Mic RMS is normalised with the same sqrt
squashing the TTS envelope uses so quiet speech still reads visibly.
- **Speaking**: amplitude + halo driven by a pre-computed RMS envelope
(50 ms windows, sqrt-normalised) produced at synthesis time. Ripples
fire on local peaks above 0.35 — matches speech rhythm without
overwhelming. Timer is internal to the view, synced to the segment's
durationMs; no MediaPlayer position polling.
Architecture:
- Sidecar RMS envelope. Computed in Qwen3TtsEngine.generateSegmentAudioVC
right after PCM is available, packed into SegmentReady, and handed to
onSegmentPlaying(sentence, durationMs, rmsEnvelope) when each MediaPlayer
starts. Zero extra IO — runs on the same PCM we already write to WAV.
- KazeiaService exposes VisualizerSignal (Idle | Listening(rms) |
Speaking(env, dur)) as a StateFlow. The VAD loop pushes Listening,
processLlmResponse pushes Speaking from the per-segment TTS callback,
and finally clears to Idle when no mic is open.
- AudioVisualizerView renders via Choreographer.FrameCallback, self-
throttled to 20 fps at Idle and full refresh during Listening/
Speaking. Hardware layer. Pure Kotlin + Canvas, no deps. ~280 LOC.
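The envelope and ripple logic can be sketched as follows. This is an
illustrative Python version, not the Kotlin in Qwen3TtsEngine or the
view; it assumes "sqrt-normalised" means dividing by the segment peak
and then taking the square root, and the function names are made up.

```python
import math

def rms_envelope(pcm, sample_rate, window_ms=50):
    """One value in [0, 1] per window of PCM samples (floats)."""
    win = max(1, sample_rate * window_ms // 1000)
    raw = []
    for start in range(0, len(pcm), win):
        chunk = pcm[start:start + win]
        raw.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    peak = max(raw, default=0.0)
    if peak == 0.0:
        return [0.0] * len(raw)            # pure silence stays flat
    # Assumed squashing: sqrt lifts quiet speech so it stays visible.
    return [math.sqrt(v / peak) for v in raw]

def ripple_frames(envelope, threshold=0.35):
    """Indices of local peaks above the threshold, where a ripple would fire."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i] > threshold
            and envelope[i] > envelope[i - 1]
            and envelope[i] >= envelope[i + 1]]
```

The view can then index the envelope with `elapsed_ms // 50` from its
own timer, which is what makes MediaPlayer position polling unnecessary.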
Layout: 140 dp strip between voiceBar and rvMessages in activity_chat.xml.
No 3D engine, no Unity, no splash extension. The avatar design work
remains on disk for a later phase when the TTS+streaming pipeline
stabilises enough to spend time on DECA/FLAME integration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches the 'conversation' feel the user asked for. Previously the
full LLM response appeared in the chat as soon as generation
finished, then audio played 5–10 s later — text and sound felt
decoupled. Now:
- The KAZEIA bubble is created empty and only starts filling when
the first TTS segment actually starts playing through the speaker
(we already split the response by sentence for the chained-
MediaPlayer pipeline; that split drives the reveal too).
- Inside each sentence, words are appended one by one at a cadence
of (audio duration / word count) — slower sentences reveal slower,
matching speech pacing. The first word of each sentence appears
immediately so audio and text stay aligned at the start.
Implementation:
- Qwen3TtsEngine: added `onSegmentPlaying(sentence, durationMs)`
listener, invoked from the chained-MediaPlayer worker the moment
each segment's MediaPlayer.start() lands. Sentence + duration are
carried end-to-end via a new SegmentReady data class.
- KazeiaPipeline.speakText: forwards an optional listener down to
the TTS engine, same signature.
- KazeiaService: new updateMessageText(id, text) helper. In
processLlmResponse, the bubble is added empty before speakText and
grown by a reveal coroutine per sentence; after speakText returns
we snap to the full text as a safety net.
No change to the stream_llm debug intent path — it still uses the
old enqueueSentence flow directly and doesn't need the reveal (no
UI bubble there).
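The per-sentence reveal timing amounts to the schedule below. A minimal
Python sketch under the stated rule (cadence = audio duration / word
count, first word at t = 0); `reveal_schedule` is a hypothetical name
and the real code runs as a coroutine rather than building a list.

```python
def reveal_schedule(sentence, duration_ms):
    """Return (delay_ms, text_so_far) pairs for one sentence.

    The first word is shown at 0 ms so text and audio start together;
    each later word follows at a fixed cadence of duration / word count.
    """
    words = sentence.split()
    if not words:
        return []
    step = duration_ms / len(words)
    return [(round(i * step), " ".join(words[:i + 1]))
            for i in range(len(words))]
```

For a three-word sentence lasting 1500 ms this yields reveals at 0, 500
and 1000 ms, so the last word appears one cadence step before the audio
ends.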
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The kill-background approach worked at startup (−1.6 GB) but respawns
pulled most of that back within 1–3 min, and the periodic sweep only
kept the first sweep's reclaim stable rather than going lower. Net
practical benefit over "just let Android manage it" is small, not
worth a custom optimizer + normal-perm + file maintenance.
Removes:
- MemoryOptimizer.kt
- KazeiaService.onCreate calls to freeRamForModels / startPeriodicOptimizer
- KILL_BACKGROUND_PROCESSES permission from the manifest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First sweep reclaimed 1.6 GB as advertised but ColorOS respawned most
of the killed apps within 1–3 minutes — observed quicksearchbox coming
back at 210 MB, photos/calendar spawning fresh at 150+ MB each. Two
changes:
1. Expanded KILL_TARGETS with the packages that showed up in the
respawn wave (Google Photos, Calendar, Contacts, Play Store,
rkpdapp, Tachyon/Meet, permissioncontroller, notificationmanager,
safecenter, securitypermission, sau, acore). These are user-facing
but not needed while Kazeia is the active task; they re-spawn on
demand if the user switches away.
2. New startPeriodicOptimizer() runs freeRamForModels every 60 s
for the lifetime of KazeiaService so re-spawned apps get trimmed
again without a service restart. Tied to serviceScope so it stops
cleanly on destroy.
Net effect observed: avail RAM stays ~1.2–1.5 GB higher than without
the sweep. Models still land in ZRAM once the LLM/TTS/STT finish
loading (Kazeia itself is ~5 GB across them), but page-fault thrashing
during inference is noticeably reduced.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three unrelated fixes rolled into one so testing on the tablet stayed
coherent. All were driven by what the user was observing during live
audio tests, not by pre-planned refactors.
1. **Audio playback actually audible.** ColorOS's AudioFlinger
silently muted our AudioTrack ~600 ms after play() every time
(dumpsys audio showed `event:muted updated source:clientVolume`
and playbackHeadPosition stuck at 0), regardless of USAGE_MEDIA /
USAGE_ASSISTANT / USAGE_VOICE_COMMUNICATION, regardless of audio
focus grant, regardless of FGS type including mediaPlayback. A
MediaPlayer path using the SAME usage attributes works because it
routes through a different AudioFlinger thread that isn't under
the same background-hardening policy. `USE_MEDIAPLAYER_FALLBACK`
in Qwen3TtsEngine.kt flips playback to a WAV-per-segment pipeline.
Two MediaPlayer instances are chained via `setNextMediaPlayer()`
so segments transition without re-arming the DAC (that re-arm was
audible as "beg beg" pops between sentences). Synth of seg N+1
runs in parallel with playback of seg N via a capacity-2 Channel,
hiding synthesis latency behind playback for all but the first seg.
2. **Mic no longer loops TTS back into STT.** The continuous-
listening VAD in KazeiaService already had a guard to drop frames
while `pipelineState is Speaking`, but that state was never set by
any caller — so the mic kept recording during playback and fed our
own speaker output back to Whisper, creating the infinite
"Kazeia talks to Kazeia" loop the user observed. Both the
stream_llm intent path and the main `processLlmResponse` TTS path
now wrap the TTS call with `Speaking → Idle/Listening`.
3. **Free 1.6 GB of RAM at service start.** The OnePlus Pad 3 with
ColorOS keeps ~7 GB of Google + OPLUS background services
resident at idle. With Qwen3-4B (3.2 GB) + Qwen3-TTS (1 GB) +
Whisper (0.5 GB) on top, most of our model weights were going to
ZRAM swap — "the NPU is stuck" reports were actually page faults
paging 3 GB of LLM weights back in before each inference. New
`MemoryOptimizer` kills 30-ish non-essential background packages
(Google optional: YouTube, Wallet, Chromecast, Messaging, AICore,
Quicksearchbox; OPLUS optional: smartsidebar, cosa, pantanal,
nhs, midas, …) via `ActivityManager.killBackgroundProcesses`.
Measured reclaim on first run: **avail RAM 8468 MB → 10112 MB,
+1644 MB**. Uses KILL_BACKGROUND_PROCESSES (normal perm, no user
prompt); system-critical packages and the launcher/systemui are
explicitly excluded from the target list.
Collateral changes:
- Added FOREGROUND_SERVICE_MEDIA_PLAYBACK permission + fgsType flag
(didn't fix the mute on its own, but it's correct per Android 14
policy and leaving it without would be a latent compliance risk).
- Kept `USE_STREAMING_DECODE` + CP↔BigVGAN overlap code intact
behind the MediaPlayer-fallback branch so reverting to the
AudioTrack streaming path is a single-const flip if ColorOS ever
lifts the hardening (or we move to a device without it).
- New AudioTrack path has a keep-alive silence watchdog and a
playback-head drain wait on stop. Both were attempts to fix the
mute that didn't pan out on their own; leaving them in so the
streaming path stays usable on non-hardened devices.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ResourceMonitor.init ran `su -c id` at every ChatActivity launch to see
if root was available, then used root to read /sys/class/kgsl/... and
/sys/bus/platform/devices/soc:qcom,msm-cdsp-rm/... for GPU/NPU usage %.
That probe was the only thing still triggering the Magisk auth dialog
on each app start after the no-root LLM migration.
Remove the root probe and the execRoot helper. GPU/NPU reads now return
-1 (UI already renders "—" for negative values). The non-root
/sys/class/kgsl/kgsl-3d0/gpubusy path is kept as a best-effort — it's
world-readable on some devices, silently fails otherwise. CPU and RAM
readouts are unaffected (never needed root).
Dead-code `su -c ...` calls remain in Qwen3TtsEngine (hexStartRunner,
hexStartCpRunner, hexStopRunner, etc.) and WhisperNpuSttEngine, but all
are gated behind fallback paths that don't execute under the current
PTE-only config (talkerPteModule != null && cpPteModule != null short-
circuits before any su call). Left in place to avoid churning the TTS
Hexagon fallback; can be purged in a later cleanup pass if needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streaming variant of the per-segment decode pipeline. As soon as SEQ_LEN
codes are accumulated from the talker/CP loop, BigVGAN is dispatched on
a background coroutine while the producer keeps generating the rest of
the segment. The BigVGAN consumer feeds a streaming crossfader that
emits stable audio as it arrives and holds back overlapSamples for the
next chunk's blend.
Mirrors decodeChunked's semantics exactly so final audio is bit-identical
modulo the fadeOut application location (now applied to the final
emission tail instead of the full buffer; the last 40ms still get faded).
Validated A/B on the same prompt 3 used in the recent benchmark:
prompt: "Je me sens un peu triste aujourdhui…"
seg 0 first audio: 14 485 ms → 10 936 ms (−3.5 s)
end-to-end first audio (LLM trigger → audio): 16.2 s → 12.7 s
Stream LLM total: 33 234 ms → 28 594 ms (−4.6 s)
Short segments (<SEQ_LEN codes) and the legacy non-streaming callers
(generateSegmentAudioVC, decodeChunked, multi-segment pipelines, etc.)
are untouched. The new path is gated behind USE_STREAMING_DECODE so it
can be reverted by flipping a single const if a regression is found.
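The hold-back behaviour of the streaming crossfader can be illustrated
with the Python sketch below. It is not the Kotlin implementation: the
class name, the linear blend curve and the requirement that every chunk
be at least `overlap` samples long are assumptions made for the example.

```python
class StreamingCrossfader:
    """Emit stable audio per chunk, holding back `overlap` samples for blending."""

    def __init__(self, overlap):
        self.overlap = overlap
        self.tail = []                     # held-back end of the previous chunk

    def push(self, chunk):
        chunk = list(chunk)
        if len(chunk) < self.overlap:
            raise ValueError("chunk shorter than overlap")
        n = len(self.tail)                 # 0 for the first chunk, else overlap
        for i in range(n):
            w = (i + 1) / (n + 1)          # assumed linear fade-in weight
            chunk[i] = self.tail[i] * (1.0 - w) + chunk[i] * w
        cut = len(chunk) - self.overlap
        stable, self.tail = chunk[:cut], chunk[cut:]
        return stable                      # safe to write to the audio sink now

    def finish(self):
        """Flush the last held-back samples once no more chunks will arrive."""
        out, self.tail = self.tail, []
        return out
```

Two 4-sample chunks with an overlap of 2 produce 6 output samples in
total: 2 from the first push, 2 blended samples from the second, and 2
from `finish()`.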
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-exported Qwen3-4B in hybrid mode (prefill_forward + kv_forward) with
num_sharding=1 after discovering that sharding=2 produces a multi-context
.pte that the LlmModule loader cannot restore (error 5010 "Context group 1
does not exist"). Single-context hybrid .pte loads cleanly through the JNI
runner and the auto-detected eval_mode=1 path.
The peak RAM during export hit 49 GB, which is why sharding=2 was used
originally — the /swapfile (192 GB) now absorbs it. Compile wall time with
sharding=1 + hybrid is ~73 min (two graphs) vs ~30 min for sharding=2 +
kv-only (one graph).
End-to-end on tablet, same 'Bonjour, comment vas-tu ?' prompt:
Before (kv-only, short prompt): TTFT 2865 ms, total 4034 ms
After (hybrid, short prompt): TTFT 113 ms, total 1471 ms
Gain: -2752 ms TTFT (96% reduction, 25× faster)
Response: "Bonjour ! Je vais bien, merci de me demander. Comment vas-tu ?"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#2 BigVGAN GPU experiment: ORT-QNN GPU EP loaded the v2_decoder_conv ONNX
model successfully (session creation 463 ms, no fallback warnings) but
per-phrase inference jumped to ~3.5 s vs ~2 s on CPU 8-thread. The GPU/CPU
memory transfer cost dominates for this conv-heavy decoder, and the
optimization went the wrong way. Comment block updated to record both the
HTP and GPU paths as tried-and-rejected so future passes don't re-walk the
same ground.
LLM streaming filter: extend the lookahead-based <think>…</think>
suppressor to also strip singleton special tokens (<|im_start|>,
<|im_end|>, <|endoftext|>). Previously the closing <|im_end|> at end of
the assistant's turn leaked into the SentenceStreamer and ended up as a
spurious sentence at the end of the TTS output. Same lookahead-buffer
trick handles split tokens.
Validated end-to-end: 'Bonjour, comment vas-tu ?' → "Bonjour ! Je vais
bien, merci. Comment vas-tu ?" → seg 0 "Bonjour !", seg 1 "Je vais bien,
merci." (no <|im_end|>), BigVGAN back to 1.8 s/phrase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the hardcoded eval_mode=0 in the QNN_LLAMA branch with a runtime
check on the loaded module's method names: if the .pte exposes a
prefill_forward graph, switch to EvalMode::kHybrid (1) — the runner can
then batch the entire prompt through prefill_forward in one parallel pass
instead of running 52 ms/token sequentially through kv_forward. Falls
back to kKVCached (0) when only kv_forward exists, matching the current
.pte behaviour exactly so this is a safe in-place upgrade ahead of the
hybrid re-export.
Sanity-tested with the kv-only Qwen3-4B .pte already on the tablet:
Prompt 'Bonjour, ça va ?' → "Bonjour ! Ça va, merci de me demander ça.
Tu as une question ?", TTFT 2728 ms, total 4158 ms — no change vs the
hardcoded eval_mode=0 build.
Once the hybrid Qwen3-4B export finishes (~50 min compile, both
prefill_forward + kv_forward graphs), the same JNI binary will pick up
the new .pte and TTFT should drop to <1 s.
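The selection rule itself is small. Sketched in Python for clarity (the
actual check lives in the C++ JNI layer; `pick_eval_mode` is a
hypothetical name, while the 0/1 values are the ones quoted above):

```python
KV_CACHED = 0   # EvalMode::kKVCached
HYBRID = 1      # EvalMode::kHybrid

def pick_eval_mode(method_names):
    """Choose hybrid mode only when the .pte exposes a prefill graph."""
    return HYBRID if "prefill_forward" in method_names else KV_CACHED
```

A kv-only export therefore keeps behaving exactly as with the hardcoded
value.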
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The verbose 55-token system prompt was the cheapest TTFT win on the
kv-only path (52 ms per prefill token). Compacting it to 25 tokens while
keeping the four load-bearing constraints (Kazeia identity, French only,
short replies, /no_think) measurably improved end-to-end latency.
Validated 'Bonjour, comment vas-tu ?' on tablet:
Before: prompt_tokens=80, TTFT=4202ms, total=5716ms
After: prompt_tokens=53, TTFT=2865ms, total=4034ms (-1.3s, -32% TTFT)
Reply quality preserved: "Bonjour ! Je vais bien, merci. Comment vas-tu ?"
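As a sanity check, the measured TTFT values are close to what a purely
linear prefill model predicts. A short Python calculation, assuming TTFT
on the kv-only path is dominated by 52 ms per prompt token (the helper
name is illustrative):

```python
MS_PER_PREFILL_TOKEN = 52   # kv-only path, sequential prefill

def estimated_prefill_ms(prompt_tokens):
    """Linear estimate: one kv_forward pass per prompt token."""
    return prompt_tokens * MS_PER_PREFILL_TOKEN

before = estimated_prefill_ms(80)   # 4160 ms estimated, 4202 ms measured
after = estimated_prefill_ms(53)    # 2756 ms estimated, 2865 ms measured
saved = before - after              # 1404 ms estimated, 1337 ms measured
```

The estimate lands within roughly 110 ms of each measurement, which
supports treating prompt length as the main lever on this path.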
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concrete measurements taken 2026-04-14 on the same Qwen3-4B .pte and the
same C++ runner — only the invocation path differs (subprocess su -c vs
in-process LlmModule JNI). Confirms no LLM regression and a measurable
speedup on the TTS path thanks to the shared QNN context (Talker 37 ms/step
vs 45-65 ms/step before).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root cause was process-credential loss across fork+exec, not the QNN
SDK version mismatch I had hypothesized. Switching the LLM to in-process
ExecuTorch LlmModule (Zygote-forked context, accepted by adsprpcd's
FastRPC credential check) eliminated the su requirement.
The original investigation sections are kept verbatim for reference; the
new section 10 documents the actual fix, the patches applied to ExecuTorch,
the metrics validated end-to-end, and pointers to the project memory entry.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Even with /no_think in the system prompt Qwen3 still emits an empty
<think>…</think> wrapper before the real answer. Without filtering, the
SentenceStreamer treats '<think>' as a sentence boundary and feeds three
tokens of XML into the TTS, producing audible parasites at the start of
each reply.
The new in-callback filter buffers a small lookahead (just enough to span
"</think>"), suppresses everything between the open and close tags, and
flushes the surrounding prose to onToken in order. With the lookahead, tags
that arrive split across decoded pieces ("<thi"+"nk>") still match.
Validated end-to-end: prompt 'Bonjour, comment vas-tu ?' now streams
sentence-by-sentence to the TTS — first segment "Bonjour !" reaches the
talker at 4.6 s, no <think> sneak-through.
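The lookahead idea can be sketched as below. This is an illustrative
Python version, not the Kotlin callback: `ThinkFilter` is a hypothetical
name, and instead of a fixed-size lookahead it holds back only the tail
of the buffer that could still turn into a tag.

```python
OPEN, CLOSE = "<think>", "</think>"

def _partial_suffix(text, tag):
    """Length of the longest suffix of text that is a proper prefix of tag."""
    for k in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:k]):
            return k
    return 0

class ThinkFilter:
    def __init__(self):
        self.buf = ""
        self.inside = False                # True between <think> and </think>

    def feed(self, piece):
        """Consume one decoded piece, return the text that is safe to emit."""
        self.buf += piece
        out = []
        while True:
            tag = CLOSE if self.inside else OPEN
            idx = self.buf.find(tag)
            if idx >= 0:
                if not self.inside:
                    out.append(self.buf[:idx])     # prose before the open tag
                self.buf = self.buf[idx + len(tag):]
                self.inside = not self.inside
                continue
            keep = _partial_suffix(self.buf, tag)  # possible split tag
            split = len(self.buf) - keep
            if not self.inside:
                out.append(self.buf[:split])
            self.buf = self.buf[split:]
            return "".join(out)

    def flush(self):
        """Emit whatever is still held back at end of stream."""
        out = "" if self.inside else self.buf
        self.buf = ""
        return out
```

Feeding the pieces "<thi", "nk>", "</th", "ink>", "Bonjour", " !" emits
only "Bonjour !", and a lone "<" in normal prose is released as soon as
the next piece shows it is not a tag.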
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End-to-end validation on OnePlus Pad 3 with stream_llm intent:
Prompt: 'Bonjour, comment vas-tu ?'
Response: 'Bonjour ! Je suis là pour t'écouter. Comment vas-tu aujourd'hui ?'
TTS: Talker(PTE) 37ms/step, CP(PTE) 73ms/step, audio synthesized.
No su, no Magisk prompts.
Two fixes since the previous commit:
1. ExecuTorchLlmEngine: pass echo=false to LlmModule.generate() — by default
the runner echoes the prompt tokens back via the callback, which fed the
ChatML wrap (<|im_start|>user …) into the SentenceStreamer and TTS.
2. jni_layer_llama.cpp: pick Runner<uint8_t> vs Runner<uint16_t> based on the
model's get_kv_io_bit_width metadata, mirroring qnn_llama_runner.cpp main().
The hard-coded uint16_t was wrong for our Qwen3-4B export (which uses 8-bit
KV I/O) and produced fluent-looking but completely random tokens
("blocked罩ug darkestSOLEQuotes作者本人 …") — same symptom whether greedy or
sampled, the smoking gun for a width-mismatched KV cache reinterpretation.
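The effect of the width mismatch is easy to reproduce outside the
runner. The Python snippet below is only an illustration of the
reinterpretation, with a made-up helper name; it reads a buffer of 8-bit
values as little-endian 16-bit values the way a wrongly typed
Runner<uint16_t> would see them.

```python
import struct

def reinterpret_u8_as_u16(values_u8):
    """Read a uint8 buffer as little-endian uint16, as a mismatched runner would."""
    raw = bytes(values_u8)
    count = len(raw) // 2
    return list(struct.unpack("<%dH" % count, raw[:count * 2]))

# Four valid 8-bit entries collapse into two unrelated 16-bit numbers.
print(reinterpret_u8_as_u16([1, 2, 3, 4]))   # [513, 1027]
```

Half as many elements with unrelated magnitudes is consistent with
output that looks fluent token by token but carries no meaning.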
Other tweaks:
- temperature=0.0 in the QNN_LLAMA branch of jni_layer_llama.cpp (greedy,
matches the working qnn_llama_runner --temperature 0 invocation)
- shared_buffer=true (same as binary defaults)
- Kotlin chat template mirrors qnn_llama_runner.cpp's get_formatted_prompt for
Qwen3 (user-first, then optional system, then "<|im_start|>assistant" with
no trailing newline — that quirky ordering is what the .pte was trained on)
TTFT is ~4 s for a 77-token prompt on kv-only mode (sequential prefill, one
forward per token). To get a sub-second TTFT we'd need to re-export the model
in --model_mode hybrid which adds a parallel prefill_forward graph; not
required for the conversational use case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root cause of the previous su-c requirement was that Qualcomm's FastRPC
kernel driver rejects processes spawned via ProcessBuilder fork+exec because
they lose supplementary GIDs on exec. Zygote-forked app processes retain the
proper init-configured credentials and are accepted by the adsprpcd service,
which is why ORT-QNN (Whisper, in-process) worked while the subprocess
qnn_llama_runner did not. Running the LLM in-process via ExecuTorch's
LlmModule bypasses the fork+exec path entirely.
What this commit does:
- ExecuTorchLlmEngine now uses org.pytorch.executorch.extension.llm.LlmModule
with MODEL_TYPE_QNN_LLAMA=4 (routes to example::Runner in jni_layer_llama.cpp,
the same C++ runner that qnn_llama_runner embeds).
- All su, ProcessBuilder, file-based prompt/response plumbing, and run_llm.sh
gone. ChatML template is built in Kotlin; tokens stream in via LlmCallback.
Supporting changes under executorch-patches/llm_in_process_jni.patch:
1. backends/qualcomm/CMakeLists.txt — gate PyQnnManagerAdaptor on NOT ANDROID.
The original guard (CMAKE_SYSTEM_PROCESSOR MATCHES x86_64) misfires in a
nested scope during Android cross-compile and tried to build the host
Python bindings.
2. extension/android/jni/jni_layer_llama.cpp — hardcode decoder_model="qwen3"
(was "llama3") and pass eval_mode=0 (EvalMode::kKVCached) + shared_buffer=true
to match our hybrid_llama_qnn.pte which only contains kv_forward, not
prefill_forward.
Build: scripts/build_android_library.sh arm64-v8a with QNN_SDK_ROOT pointing
to /opt/Kazeia/qnn_sdk_242/qairt/2.42.0.251225 and EXECUTORCH_BUILD_QNN=ON.
Produces libexecutorch_jni.so (192 MB) with QNN v2.42 backend + the llama
runner code, plus libqnn_executorch_backend.so. Both staged in jniLibs.
Validated on OnePlus Pad 3: LlmModule.load() completes in 4.2 s, no su
prompts, Pipeline ready with STT(WhisperHybridEngine) → [VoiceCommands →
LLM] → TTS(Qwen3TtsEngine). TTS .pte still loads with the upgraded v2.42
runtime — no regression.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backup commit before attempting to unify on QNN SDK v2.37 and to remove
su -c for the LLM. Current working state:
- LLM Qwen3-4B via su -c qnn_llama_runner (v2.42 in /data/local/tmp/kazeia-et/)
- TTS talker + CP via ExecuTorch .pte JNI (v2.31 in jniLibs)
- STT Whisper via ORT-QNN 1.24.3
The kazeia-no-root-report.md report documents the no-root attempts and
their failures in detail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ExecuTorchLlmEngine: system prompt forces French, 1-2 short sentences,
/no_think so the full budget goes to the answer (Qwen3 was consuming
120+ tokens on <think>); eval_mode 0 matches our kv-mode export.
- Qwen3TtsEngine.generateSegmentAudioVC: when the Hexagon talker socket
isn't open, fall back to runInterleavedPteFromEmbeds so the Stage 3
streaming session still produces audio. Without this the session opened,
accepted sentences, and silently emitted empty PCM.
Documents the QNN SDK version-skew pitfall in ExecuTorchLlmEngine.kt
ahead of the upcoming migration to a unified v2.42 toolchain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ExecuTorchLlmEngine: eval_mode 0 (our .pte is kv-mode, not hybrid)
- KazeiaService: call llm.load() after TTS init; try/catch falls back
to echo mode if the runner or .pte are missing.
Pipeline on device: STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine).
Validated on OnePlus Pad 3: LLM ready in ~8 s, gen 21.3 tok/s, RSS 1.76 GB in the
qnn_llama_runner subprocess (out-of-process from the Kazeia app).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds executorch-patches/ with the local modifications to /opt/Kazeia/executorch
(upstream pytorch/executorch v1.2.0) required to export Qwen3-4B to QNN for the
OnePlus Pad 3 Hexagon V79. Tablet runs 18.2 tok/s (gen), TTFT 0.9 s, RSS 1.76 GB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the fixed maxGen + length-based boost with a fully dynamic
end-of-utterance detector that watches the model's own EOS logit rank.
End result on the Baer 3-segment monologue, validated by user as
"FORMIDABLE" / "impeccable" with both Damien and Zelda voices:
- All 3 segments terminate via EOS (no maxGen cap hit)
- No "page beg beg" filler tail
- No abrupt cuts between segments
- Audio durations 5-8 s per segment, matching Python within ~10 %
How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt):
1. At every decode step, compute the rank of CODEC_EOS in the
repetition-penalised logits. Mid-utterance the rank sits at
150-700 (model is committed to producing speech). Approaching
the natural end, the rank dips toward top-50.
2. Arm the boost only when EOS rank stays below eosRankTrigger=60
for THREE consecutive steps. The 3-step requirement filters out
transient single-step dips that occur during low-energy phonemes
mid-sentence (without it, short sentences would terminate after
~3 s). Arming is also gated by eosBoostMinStep (50 % of expected
speech length) so we never arm in the very first frames.
3. Once armed, the boost increments monotonically: each subsequent
step adds boostStepsActive * eosBoostScale to the EOS logit. The
accumulated boost lifts EOS above top-1 within 1-3 steps, the
argmax check fires, and the loop breaks. Scale=4 gives the model
a small natural decay before termination; scale=5 was perfect-but-
slightly-clipping, scale=3 wasn't strong enough to outpace the
growing top-1 logit.
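The three steps above can be sketched as a small state machine. This is
a simplified Python illustration, not runHexGenWithPrefill itself: the
class and field names are hypothetical, rank 0 means top-1, and it
assumes the streak only counts once the minimum step has been reached.

```python
class EosBoost:
    def __init__(self, rank_trigger=60, needed_steps=3, min_step=0, scale=4.0):
        self.rank_trigger = rank_trigger
        self.needed_steps = needed_steps
        self.min_step = min_step
        self.scale = scale
        self.streak = 0          # consecutive steps with a low EOS rank
        self.armed = False
        self.active_steps = 0    # steps elapsed since arming

    def step(self, step_idx, logits, eos_id):
        """Return the boost to add to logits[eos_id] at this decode step."""
        if self.armed:
            self.active_steps += 1
            return self.active_steps * self.scale   # 4, 8, 12, ... for scale=4
        # Rank of EOS = number of logits strictly above it.
        rank = sum(1 for v in logits if v > logits[eos_id])
        if step_idx >= self.min_step and rank < self.rank_trigger:
            self.streak += 1
        else:
            self.streak = 0      # a single high-rank step resets the count
        self.armed = self.streak >= self.needed_steps
        return 0.0
```

The caller adds the returned value to the EOS logit before the argmax
check; a transient one-step dip never arms the boost because the streak
resets.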
Other tweaks bundled in this commit because they all contribute to
the clean output:
* Inter-segment gap 120 → 250 ms — gives the listener a perceived
sentence boundary instead of a hard concatenation.
* fadeOut(audio, 40) on every segment — cosine roll-off over the
last 40 ms so the EOS-clipped tail decays naturally instead of
sample-clipping.
* top_k 50 → 200 in the fallback sample call — wider pool to keep
EOS reachable when the boost just fails to hit argmax.
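A cosine roll-off like the fadeOut mentioned above could look like this.
Python sketch for illustration; the explicit `sample_rate` parameter and
the exact half-cosine curve are assumptions, not taken from the Kotlin
source.

```python
import math

def fade_out(samples, sample_rate, fade_ms=40):
    """Return a copy whose last fade_ms milliseconds decay to zero."""
    out = list(samples)
    n = min(len(out), sample_rate * fade_ms // 1000)
    for i in range(n):
        # Half-cosine gain: close to 1 at the start of the fade, 0 at the end.
        gain = 0.5 * (1.0 + math.cos(math.pi * (i + 1) / n))
        out[len(out) - n + i] *= gain
    return out
```

Samples before the fade window are untouched, so the call is safe to
apply to every segment regardless of how it terminated.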
Voice swap is a 45 KB file push (damien_voice_prefix.bin and
damien_voice_suffix.bin). Successfully tested today with Elodie
(female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36)
as the baseline — same Kotlin code, no rebuild needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two small changes:
* export_tts_text_embeddings.py now takes the voice wav as an optional
second CLI arg (defaults to damien_15s_24k.wav). Lets the same script
capture voice-prefix+suffix for any speaker wav without editing the
source — used today to test Elodie alongside Damien.
* synthesizeTextStreaming + generateSegmentAudioVC only run the
trimTailLowEnergy trim when n >= maxGen. The trim's 35%-of-peak
threshold is tuned to catch "page beg beg" filler after the talker
fails to emit EOS — but it was cutting valid speech when EOS fired
early (observed on Elodie seg 1: 10.08 s → 2.92 s, a 7.16 s cut, far
more than the filler tail). With the guard it's a no-op on converging generations and
only fires on the ~15% of segments that hit maxGen.
Validation after the fix (Elodie, Baer monologue):
- seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut,
the filler tail)
- seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is
- seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is
Voice prefix/suffix shape is speaker-invariant except position 7 (the
xvector). Confirmed by capturing both Damien and Elodie and diffing:
positions 0-6 and 8 identical within 1e-8, suffix identical within
1e-8, only pos 7 has a different xvector embedding (norm 10.36 vs 10.12).
That means swapping speakers on-device is a 45 KB file push — no app
rebuild, no re-export of the 297 MB vocabulary table.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the four per-sentence TTS entry points (pipeline.speak, REPEAT
voice command, echo-mode TTS, LLM-response TTS) with a single shared
pipeline.speakText() that:
* opens a Qwen3TtsEngine streaming session when the TTS backend is
Qwen3 (voice-cloning path);
* feeds the whole response through a SentenceStreamer so the first
sentence starts playing as soon as it's decoded;
* falls back to the old one-shot synthesizeAndPlay for non-Qwen3 TTS
engines (AndroidTts, Chatterbox) that don't expose a session API.
KazeiaPipeline.speakText is now public so KazeiaService can use the
same dispatch — previously each call site re-implemented the
"streaming-or-fallback" logic or just called synthesizeAndPlay and
waited for the full synthesis.
Enabling the real on-device LLM is a separate task (task #48): the
existing llama-cli binary has ggml-hexagon linked in and fails to
init the DSP (0x80000406) when the TTS Hexagon runners hold the
session. Needs either a CPU-only llama-cli build or the restored
ExecuTorch qnn_llama_runner setup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the loop on on-device conversational TTS. The LLM's token stream is
now consumed by a SentenceStreamer which fires a callback the moment a
terminal-punctuation boundary appears; each sentence is enqueued to a
persistent TTS streaming session that generates and plays audio through a
single shared AudioTrack. Sentence N's audio plays while sentence N+1 is
being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and
no "wait for full response before hearing anything".
Mocked-LLM validation on the 3-sentence prompt:
"Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui."
- First sentence detected: 1 ms
- Seg 0 prefill (Hex): 567 ms
- Seg 0 generated: 4 200 ms (18 tokens, 1.4 s audio)
- Seg 1 generated: 9 100 ms (42 tokens)
- Seg 2 generated: 11 000 ms (46 tokens)
- Session closed: 33 500 ms (all audio drained)
Changes:
* tts/SentenceStreamer.kt — 50-line helper that buffers tokens and
fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4
so "Oui." / "Bonjour." count as real sentences; higher thresholds
swallowed conversational openers into the next segment and delayed
first audio. flush() for the final partial sentence.
* Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession
triplet. startStreamingSession opens a 30-second MODE_STREAM
AudioTrack plus a background worker coroutine that pulls sentences
from an unlimited Channel. enqueueSentence is non-blocking; the worker
serialises generation so audio order matches enqueue order.
generateSegmentAudioVC is the per-sentence body (tokenize → prefill
build → Hexagon gen → decode) without the WAV-save side effects of
the stream_text intent path.
* KazeiaService new intents:
  - stream_llm: real LLM path (needs the LLM loaded; the debug build
    currently runs echo-mode, so this path is shipped but requires
    production config to exercise).
  - stream_llm_mock: fakes the LLM stream by splitting the given text
    on spaces with 50 ms per "token". This matches the ~20 tok/s rate
    the on-device LLM produces and lets Stage 3 be validated without
    flipping the LLM on.
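For illustration, the sentence-boundary logic described for SentenceStreamer can be sketched in Python roughly as follows. This is not the Kotlin source: the method names feed/flush and the exact meaning of minChars (here, length of the trimmed candidate sentence) are assumptions.

```python
class SentenceStreamer:
    """Buffers streamed tokens and emits complete sentences via a callback."""

    TERMINATORS = ".!?;\n"

    def __init__(self, on_sentence, min_chars=4):
        self.on_sentence = on_sentence
        self.min_chars = min_chars
        self._buf = ""

    def feed(self, token):
        # Scan character by character so a token holding several
        # terminators still yields one callback per sentence.
        for ch in token:
            self._buf += ch
            if ch in self.TERMINATORS:
                candidate = self._buf.strip()
                if len(candidate) >= self.min_chars:
                    self.on_sentence(candidate)
                    self._buf = ""
                # Otherwise keep buffering: the fragment is too short.

    def flush(self):
        # Emit whatever is left once the token stream ends.
        tail = self._buf.strip()
        self._buf = ""
        if tail:
            self.on_sentence(tail)


sentences = []
streamer = SentenceStreamer(sentences.append)
for tok in ["Oui", ".", " Bon", "jour", ".", " Comment", " allez", "-vous"]:
    streamer.feed(tok)
streamer.flush()
print(sentences)
```

With min_chars = 4 the short opener "Oui." is emitted on its own, which is the behaviour the commit describes as desirable for first-audio latency.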
Architectural notes:
- AudioTrack buffer is 30 s so generation can run ahead of playback
without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short
sentences, so for a 2-3 sentence response the buffer actually drains
between segments and the user hears a short gap — expected, not a
bug. Masking that gap requires RTF < 1 which is out of scope.
- Hexagon KV is reset between sentences (hexReset) so the talker
doesn't see stale context. Prefill observed cb0 = 1995 on every
sentence that starts with a capital letter, matching the Python
greedy reference — confirms prefill reconstruction is stable across
segments within a session.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any
French (or Qwen3-supported) text with no network, no ADB push per
phrase, and quality that matches Python's reference on "Bonjour, je
suis Kazeia, je suis là pour vous écouter." — user validation:
"impeccable".
Three pieces that compose the path:
1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's
Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder,
Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex
lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces
identical token IDs to HF's Qwen2TokenizerFast on the test phrase:
[81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
4914, 9012, 90229, 2676, 13].
2. export_tts_text_embeddings.py — one-time PC export of:
* Full projected text embeddings for the entire 151936-token vocab
as fp16 (297 MB). Sanity check: live vs stored max abs diff
1.15e-4 on token 1043. Mmap'd on-device so it stays off the
Java heap and leaves room for the 125 MB cp_embeddings alloc.
* Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a
Python voice-clone capture, text-invariant across segments.
* Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1
of the same capture. Also text-invariant (diff = 0.0 across
3 different-text segments). Without it the talker never sees
"text ended" and decode falls into page/beg repetition.
* Qwen3 tokenizer vocab.json + merges.txt.
3. Qwen3TtsEngine.kt:
* mmap loader for the embeddings table + buffered fp16→fp32
lookup (halfToFloat covers subnormals/inf/NaN so pathological
tokens don't become 0).
* Stage 2 assets detected at init; missing file transparently
falls back to legacy 1050-token reduced-vocab path.
* synthesizeTextStreaming(text, onSegmentReady) — new public API:
sentence-split → BPE → build prefill as
[voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
(exact structure Python emits; verified bit-for-bit by matching
captured Baer prefill positions against text_projection(tok)+
codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode
each segment through the existing BigVGAN pipeline → callback.
* runHexGenWithPrefill — Hexagon prefill + interleaved CP decode
loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
Python's voice_clone). Degeneracy guard stops when 9 identical
cb0 in a row appear — catches the rare "page beg beg beg" tail
when EOS never fires. maxGen = ids.size*4 + 10 matches the
typical 3.3 codec-frames-per-text-token that Python produces.
* Prefill build uses the speaker's captured prefix/suffix rather
than the legacy in-code buildPrefillEmbeddings that puts only
one text token in prefill — the structure mismatch produced
garbled audio in the first attempt of this commit.
4. KazeiaService.kt: new stream_text intent extra wires text input
to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.
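The prefill layout from item 3 can be sketched as below, using plain Python lists in place of tensors. build_prefill, text_proj and codec_pad are hypothetical names for this illustration; only the row structure ([voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]) comes from the commit.

```python
def build_prefill(voice_prefix, voice_suffix, token_ids, text_proj, codec_pad):
    """Return prefill rows: prefix, one summed row per text token, suffix.

    voice_prefix / voice_suffix: lists of embedding rows (lists of floats).
    text_proj: callable mapping a token id to its projected embedding row.
    codec_pad: the codec-pad embedding row added to every text row.
    """
    rows = [list(r) for r in voice_prefix]
    for tid in token_ids:
        proj = text_proj(tid)
        # Element-wise sum of the text projection and the codec pad embedding.
        rows.append([p + c for p, c in zip(proj, codec_pad)])
    rows.extend(list(r) for r in voice_suffix)
    return rows


# Toy dimensions: 9 prefix rows, 2 suffix rows, 16 text tokens, dim 4.
prefix = [[1.0] * 4 for _ in range(9)]
suffix = [[2.0] * 4 for _ in range(2)]
rows = build_prefill(prefix, suffix, list(range(16)),
                     lambda t: [float(t)] * 4, [0.5] * 4)
print(len(rows))  # 27 rows = 9 + 16 + 2
```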
First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon
8 Elite (prefill + 74-token decode), vs a 3-phrase sentence batch
that was 65 s pre-streaming — streaming + on-device text together
unblock the MVP chat loop.
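The fp16→fp32 lookup mentioned under Qwen3TtsEngine.kt needs a half-precision decoder that handles subnormals, inf and NaN. A minimal Python sketch of that decoding follows; it is not the Kotlin halfToFloat itself, and lookup_row plus the little-endian, header-less table layout are assumptions made for the example.

```python
import struct

def half_to_float(h):
    """Decode one IEEE 754 half, given as an int in 0..0xFFFF."""
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0:
        # Zero or subnormal: value is frac * 2^-24.
        return sign * frac * 2.0 ** -24
    if exp == 0x1F:
        # All-ones exponent: infinity if frac is zero, NaN otherwise.
        return sign * float("inf") if frac == 0 else float("nan")
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)

def lookup_row(table, token_id, dim):
    """Read one fp16 row from a bytes-like table (bytes or an mmap object)."""
    offset = token_id * dim * 2
    raw = struct.unpack_from("<%dH" % dim, table, offset)
    return [half_to_float(v) for v in raw]


table = struct.pack("<4e", 1.0, -2.0, 0.5, 65504.0)
print(lookup_row(table, 1, 2))  # [0.5, 65504.0]
```

Because the row offset is just token_id × dim × 2 bytes, the same function works on a memory-mapped file without loading the whole table.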
Known caveats:
* 297 MB on-device footprint for the embedding table. Acceptable on
OnePlus Pad 3; can be quantized further (int8 per-row) if storage
becomes tight.
* First init adds ~3 s for BPE vocab + merges load (151k × 2
hashmaps). Happens once per process.
* maxGen cap means extremely long sentences may truncate. The
sentence splitter already keeps segments ≤120 chars so this
hasn't been observed in practice.
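The two stopping rules (maxGen cap and the 9-identical-cb0 degeneracy guard) can be sketched as follows. step_fn, eos_id and the choice to keep the repeated tail in the output are assumptions for this illustration, not taken from runHexGenWithPrefill.

```python
def max_gen(num_text_tokens):
    """Generation cap from the commit: ids.size * 4 + 10."""
    return num_text_tokens * 4 + 10

def generate_cb0(step_fn, num_text_tokens, eos_id, max_repeat=9):
    """Collect cb0 codes until EOS, the maxGen cap, or a degenerate run.

    step_fn: callable returning the next cb0 code (stand-in for one
    talker step). The repeated tail is kept; trimming is left to the caller.
    """
    codes = []
    run = 0
    for _ in range(max_gen(num_text_tokens)):
        code = step_fn()
        if code == eos_id:
            break
        run = run + 1 if codes and code == codes[-1] else 1
        codes.append(code)
        if run >= max_repeat:
            break  # degeneracy guard: max_repeat identical codes in a row
    return codes


print(max_gen(16))  # 74, the decode length quoted for the 16-token test phrase
```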
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX
CP backend. First audio arrives at ~20s (vs ~65s for the full phrase
non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning
is preserved per segment because each segment now ships its own full prefill.
Changes:
* Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady)
reads single- or multi-segment embeds, runs prefill + generation + VQ
decode + BigVGAN per segment, and fires the callback with each
segment's ShortArray the moment it's ready. Saves per-segment WAVs
(kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav
for offline inspection. Extracted the common generation loop into
runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and
streaming paths share exactly the same code (no quality drift between
modes). Added hexReset() between segments so segment 2's prefill logits
don't contain segment 1's KV state.
* vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0
it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a
codebook row index — reading past the 2048-row buffer. The short Baer
probe never hit this; longer phrases do. Clamp any out-of-vocab code
to 0 at allCodebooks build time.
* KazeiaService: new stream_pipeline intent extra wires the callback
to an AudioTrack MODE_STREAM instance, writing each segment's audio as
soon as it comes back. Logs time-to-first-audio.
* prepare_tts_segments.py: the previous version only captured 1-token
decode calls and substituted a generic 9-embed "prefill_base" pulled
from an unrelated single-segment file — dropping the per-segment
xvector conditioning AND the text-encoded embeddings, so Hexagon
produced garbled mixed speech for segments 2..N. Now captures the
multi-token prefill call too (like prepare_tts_voiceclone.py) so each
segment is self-contained.
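The vqDecode clamp above amounts to something like the following sketch. clamp_codes is a hypothetical helper name, and it treats any index outside 0..2047 (including exactly 2048) as out of range, which is an assumption slightly stricter than the "> CODEBOOK_SIZE" wording.

```python
CODEBOOK_SIZE = 2048  # rows in each codebook buffer

def clamp_codes(codes, codebook_size=CODEBOOK_SIZE):
    """Replace any code that is not a valid codebook row index with 0."""
    return [c if 0 <= c < codebook_size else 0 for c in codes]


print(clamp_codes([0, 2047, 2048, 2150]))  # [0, 2047, 0, 0]
```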
Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the
Snapdragon 8 Elite with current config means each segment takes longer to
generate than it takes to play, so audible gaps between segments remain.
Removing the gaps requires either (a) producer/consumer parallelism across
two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the
~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug
and the .pte path contends with Hexagon talker on the DSP).
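Why option (a) alone cannot close the gaps can be seen in a small timing model. The sketch assumes every segment costs rtf × its audio duration to generate, that generation runs back-to-back and never waits for playback (the best case for a producer/consumer split), and that playback starts as soon as a segment is ready; playback_gaps is an illustrative name.

```python
def playback_gaps(audio_secs, rtf):
    """Return the silent gap (seconds) before each segment after the first."""
    gaps = []
    gen_done = 0.0   # wall-clock time when the current segment is generated
    play_end = 0.0   # wall-clock time when already-queued audio runs out
    for i, dur in enumerate(audio_secs):
        gen_done += dur * rtf
        start = max(gen_done, play_end)
        if i > 0:
            gaps.append(start - play_end)
        play_end = start + dur
    return gaps


print(playback_gaps([2.0, 2.0, 2.0], 4.0))  # [6.0, 6.0]
print(playback_gaps([2.0, 2.0, 2.0], 0.5))  # [0.0, 0.0]
```

For equal-length segments the gap works out to duration × (RTF − 1), so it only vanishes once RTF drops to 1 or below.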
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extensive investigation of the audible "tremor" in the generated voice-cloned
audio. Conclusion is architectural, not a bug:
* Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998
* ONNX Runtime CP V2 matches PyTorch greedy CP almost token-for-token:
0.24% residual divergence (2 of 848 tokens), measured by injecting
Python's captured cb0 at each step. 14/16 codebooks match 100%;
cb14 and cb15 each miss 1 token out of 53.
* BigVGAN decoder is bit-identical to PyTorch (validated earlier)
* Therefore the tremor is caused entirely by the ~28% of cb0 argmax
picks that flip when the tiny fp16 logit drift crosses the
top-1/top-2 margin. Each flip cascades through the autoregressive
chain into a trajectory the model never saw at training time →
incoherent artifacts.
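A toy illustration of the flip mechanism: when the top-1/top-2 margin is smaller than the numeric drift, argmax changes even though the logits are almost perfectly correlated. The values are made up for the example, and flip_rate is a simple position-wise comparison, which may differ from how the ~28% figure was measured.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def top2_margin(logits):
    """Gap between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def flip_rate(ref_codes, test_codes):
    """Fraction of positions where two code sequences disagree."""
    n = min(len(ref_codes), len(test_codes))
    return sum(1 for r, t in zip(ref_codes, test_codes) if r != t) / n


eps = 2.0 ** -10                      # stand-in for fp16 rounding drift
ref = [5.0, 5.0 - eps, 1.0]           # margin between top-1 and top-2 is eps
drifted = [ref[0] - eps, ref[1] + eps, ref[2]]
print(top2_margin(ref), argmax(ref), argmax(drifted))
```

Here a drift of one eps per logit is enough to move the argmax from index 0 to index 1.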
The cross-architecture divergence (x86 AVX-512 vs ARM64 NEON+HMX) cannot
be zeroed by any runtime swap: LibTorch Android would use NEON kernels
with a different reduction order than PyTorch x86, giving the same class
of error with a smaller but non-zero residual. Temperature tweaking
(0.3 → 0.9) and greedy-vs-sample gave no perceptual difference: the
floor is numeric, not in the sampling layer.
Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a
thesis-relevant finding about on-device TTS deployment limits.
Cleanup:
* All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp,
force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated
behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release
builds JIT-eliminate the file checks; debug builds keep the whole
experimental toolchain for re-running the analysis for demos/thesis.
* force_hexagon + force_cp_v2 stay unconditional — production routing.
* Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9).
* Native TTS pipeline (executorch-custom/jni_layer_tts.cpp,
app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to
greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing
audio past EOS where Python's seeded sampler terminated naturally).
* scripts/prepare_tts_voiceclone.py: new script that captures Python
greedy-CP reference (stochastic talker for EOS, deterministic CP) for
token-by-token comparison.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- prepare_tts_native.py: auto-splits long text at sentence/comma
boundaries, max 15 tokens per segment
- Multi-segment format: each segment gets fresh KV cache
- Formula: target_len = n_tokens × 3.2 + 5 per segment
- Tested on Edouard Baer monologue: 28 segments, 102s audio
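A rough sketch of the splitting rule and the target_len formula. It is not prepare_tts_native.py: whitespace-separated words stand in for tokenizer tokens, rounding to the nearest integer is assumed, and over-long clauses are hard-split, all of which are assumptions made for the example.

```python
import re

MAX_TOKENS = 15

def target_len(n_tokens):
    """target_len = n_tokens x 3.2 + 5, rounded to the nearest integer."""
    return round(n_tokens * 3.2 + 5)

def split_text(text, max_tokens=MAX_TOKENS):
    """Greedily pack clauses (split at . ! ? , ;) into segments of at most
    max_tokens words; a clause longer than the cap is hard-split."""
    clauses = [c.strip() for c in re.split(r"(?<=[.!?,;])\s+", text) if c.strip()]
    segments, current = [], []
    for clause in clauses:
        words = clause.split()
        if current and len(current) + len(words) > max_tokens:
            segments.append(" ".join(current))
            current = []
        while len(words) > max_tokens:
            segments.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        current.extend(words)
    if current:
        segments.append(" ".join(current))
    return segments


print(target_len(15))  # 53
print(split_text("Bonjour, je suis Kazeia, je suis là pour vous écouter.", 4))
```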
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends
Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage
Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.
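One possible reading of the RMS auto-trim, sketched in Python. The commit does not spell out the rule, so this assumes: compute per-frame RMS, find the loudest frame, then cut at the first later frame whose RMS falls below 15% of that peak. The frame size and function name are illustrative.

```python
import math

def rms_trim(samples, frame=480, threshold=0.15):
    """Cut everything from the first post-peak frame quieter than
    threshold x peak RMS; return the input unchanged if none is found."""
    if not samples:
        return samples
    rms = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    peak_idx = max(range(len(rms)), key=rms.__getitem__)
    floor = threshold * rms[peak_idx]
    for idx in range(peak_idx + 1, len(rms)):
        if rms[idx] < floor:
            return samples[:idx * frame]
    return samples


# Two loud frames, one near-silent frame, then trailing noise.
audio = [1000] * 960 + [10] * 480 + [500] * 480
print(len(rms_trim(audio)))  # 960
```

Under this reading a long pause inside a sentence would also trigger the cut, so it suits short single-phrase segments.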
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: embeds must come from SAME NPU model instance.
Python fp32 embeds cause divergence on NPU fp16 after ~20 steps.
Solution: Java pipeline captures embeds on-device during generation.
Captured embeds work perfectly with C++ pipeline (validated "bon", i.e. "good").
- Added capture mode: touch /data/local/tmp/kazeia/capture_mode
- Embeds saved to captured_embeds.bin (same format as pipeline input)
- KV_LEN restored to 100 (KV=64 lost role tokens → quality loss)
- C++ uses pre-computed embeds as-is (no double codec_sum)
Production path: Java pipeline RTF 1.8 for new texts (good quality)
Replay path: C++ pipeline RTF 1.26 with captured embeds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
(original was likely captured on-device, not from Python model.forward)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-computed embeds from Python already contain codec_sum+text.
Using them as-is works correctly. Once they are exhausted, fall back to
our codec_sum + pad.
Long text: 191 tokens, 15.28s audio, RTF 1.27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads
- Restored pre-computed embeds format (codec_sum+text from Python)
- Text-only trailing embeds don't work: model needs codec_sum for EOS
For long phrases, pre-computed embeds must be generated from Python.
RTF 1.26 on short phrase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache input tensor pointers after first prepare_input_tensors call,
then memcpy directly into them for all subsequent steps.
Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls).
Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)
Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s of audio
Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations
From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.
Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.
Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").
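The corrected trailing schedule can be sketched as follows: text embeds first, then tts_eos_embed exactly once, then tts_pad_embed for every remaining step. trailing_embed is a hypothetical helper, and strings stand in for the embedding vectors.

```python
def trailing_embed(step, text_embeds, tts_eos_embed, tts_pad_embed):
    """Pick the trailing embed for a generation step (0-based)."""
    n = len(text_embeds)
    if step < n:
        return text_embeds[step]   # remaining text tokens, consumed 1:1
    if step == n:
        return tts_eos_embed       # signal "text ended" exactly once
    return tts_pad_embed           # pad for all later steps (not zeros)


text = ["t0", "t1", "t2"]
print([trailing_embed(i, text, "EOS", "PAD") for i in range(6)])
# ['t0', 't1', 't2', 'EOS', 'PAD', 'PAD']
```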
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>