Commit Graph

44 Commits

Author SHA1 Message Date
Kazeia Team 2fe46e0f15 Fix seg-2 audio dropout + switch spectrum from bars to Bézier lines
**Regression fix**: when synthesis of segment N+1 ran longer than the
playback of segment N (e.g. 5 s synth for a 1.5 s "Bonjour !"), the
previous MediaPlayer was already in the Completed state by the time
we queued the next one. setNextMediaPlayer() on a Completed player is
a documented silent no-op — so the second sentence never started and
the user only heard the first part of the reply.

Rewrote playChainedMediaPlayers with per-player CompletableDeferred
tracking: before calling setNext we check whether current's done has
fired; after awaiting completion we verify next really auto-started
(checking isPlaying / currentPosition) and call start() explicitly if
the chain missed. Belt-and-suspenders against the race either way.

Removed the now-unused waitForPlaybackCompletion helper.

**Visual change**: in-sphere spectrum bars replaced with three
superimposed Bézier "deforming lines", mirrored above/below a central
baseline, with a soft cosine taper so the curves decay to zero at the
sphere's left/right edges (matches the circular mask). Each line has
its own slow-moving phase + gain + thickness + alpha so the three
overlap to give depth — closer to an oscilloscope trace than an EQ.
Low-level sin jitter keeps the lines alive during quiet passages,
amplitude-gated so true silence is a flat line.

User-facing change: no bars anymore. The sphere now "breathes" with
flowing waveforms matching its voice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 23:42:43 +02:00
Kazeia Team 06dcd76dcb UI: large central orb w/ spectrum-inside + per-voice palette
Complete redesign of AudioVisualizerView based on feedback: the orb
is now the app's visual face, takes the top ~60% of the chat area,
and has clearly distinct behaviour in each state.

- **Idle**: slow 5 s breathing (scale 0.88 → 1.00 via cos easing),
  pure round shape, soft halo in phase. No high-frequency motion.

- **Listening**: organic blob outline built from 8 Fourier modes
  whose amplitude scales with live mic RMS; a thin shimmering arc
  rotates around the orb while mic energy is present; continuous
  micro-ripples pulse outward. Looks clearly 'alive and attentive'
  vs Idle's static breathing.

- **Speaking**: the orb becomes a contained spectrometer. A pre-
  computed log-spaced spectrogram (12 bands, 120 Hz–4 kHz,
  Hann-windowed FFT, one column per 50 ms of audio) is rendered as
  vertical rounded-rectangle bars CLIPPED to the sphere outline so
  they really look like the sphere itself speaking. Bar heights
  interpolate between spectrogram frames and exponentially smooth
  toward the target for fluid 60 fps motion. Outer halo pulses with
  the RMS envelope; ripples release on envelope peaks.

- **Per-voice color**. Eight-entry palette (Damien lavender,
  Elodie rose, Jerome aqua, Richard amber, Amir emerald, Didier
  indigo, Sid peach, Zelda periwinkle). Halo, accent, bars, ring,
  and ripples are all derived from a single voiceColor so switching
  the voice spinner tweens the entire scene to the new identity
  over a few frames. Color stored on both KazeiaService (for
  persistence across process/view rebinds) and pushed directly to
  the view for instant feedback at selection time.

Sidecar pipeline changes:
- Qwen3TtsEngine now computes per-segment spectrogram alongside the
  RMS envelope (new computeSpectrogram + an in-place radix-2 FFT).
  FFT_SIZE = 1024, hop = 50 ms, 12 log-spaced bands.  SegmentReady
  carries both arrays; onSegmentPlaying is (sentence, durationMs,
  rmsEnvelope, spectrogram).
- KazeiaPipeline.speakText forwards the new callback shape.
- KazeiaService.VisualizerSignal.Speaking now carries the
  spectrogram; KazeiaService also exposes the new voiceColor StateFlow.
- ChatActivity passes both to the view and collects voiceColor.
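
A rough illustration of the 12 log-spaced bands (120 Hz to 4 kHz) over a 1024-point FFT, written in Python rather than the project's Kotlin. The 24 kHz sample rate and the function names are assumptions for the sketch; the actual computeSpectrogram may map bins to bands differently.

```python
import math

def log_band_edges(f_lo=120.0, f_hi=4000.0, bands=12):
    """Return bands+1 edge frequencies spaced evenly on a log scale."""
    ratio = f_hi / f_lo
    return [f_lo * ratio ** (i / bands) for i in range(bands + 1)]

def band_bin_ranges(sample_rate=24000, fft_size=1024, f_lo=120.0, f_hi=4000.0, bands=12):
    """Map each band to a half-open range [lo, hi) of FFT bin indices."""
    edges = log_band_edges(f_lo, f_hi, bands)
    hz_per_bin = sample_rate / fft_size  # assumed 24 kHz PCM
    ranges = []
    for lo_hz, hi_hz in zip(edges, edges[1:]):
        lo = int(lo_hz / hz_per_bin)
        hi = max(lo + 1, math.ceil(hi_hz / hz_per_bin))  # every band gets >= 1 bin
        ranges.append((lo, hi))
    return ranges
```

Each spectrogram column would then average the FFT magnitudes inside each bin range; neighbouring low bands may share a bin because the bands are narrower than the bin spacing there.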

Layout: vertical chain between audioViz (weight 3) and rvMessages
(weight 2) so the orb owns ~60% of the chat panel and the chat list
takes the remainder. Removed the fixed 140 dp constraint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 23:33:38 +02:00
Kazeia Team 8939c680b2 UI: minimalist audio-reactive orb visualizer, replaces 3D avatar for MVP
Adds a breathing lavender orb centred above the chat list that tracks
the actual audio state of the app:

- **Idle**: slow respiratory pulsation (~4 s cycle) at 20 fps. The
  chatbot is visually "awake" without animating loudly.
- **Listening**: halo swells with live mic RMS from the VAD loop, so
  the user sees Kazeia hearing them even before Whisper has produced
  any transcription. Mic RMS is normalised with the same sqrt
  squashing the TTS envelope uses so quiet speech still reads visibly.
- **Speaking**: amplitude + halo driven by a pre-computed RMS envelope
  (50 ms windows, sqrt-normalised) produced at synthesis time. Ripples
  fire on local peaks above 0.35 — matches speech rhythm without
  overwhelming. Timer is internal to the view, synced to the segment's
  durationMs; no MediaPlayer position polling.
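
The envelope and ripple trigger described above could look roughly like this in Python (the project code is Kotlin). "sqrt-normalised" is interpreted here as peak-normalise, then take the square root; that reading and the function names are assumptions.

```python
import math

def rms_envelope(pcm, sample_rate, window_ms=50):
    """RMS per window, peak-normalised then sqrt-squashed into 0..1."""
    win = max(1, sample_rate * window_ms // 1000)
    rms = []
    for start in range(0, len(pcm), win):
        frame = pcm[start:start + win]
        rms.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return [0.0] * len(rms)
    return [math.sqrt(r / peak) for r in rms]

def ripple_frames(envelope, threshold=0.35):
    """Indices of local maxima that exceed the threshold."""
    return [
        i for i in range(1, len(envelope) - 1)
        if envelope[i] > threshold
        and envelope[i] > envelope[i - 1]
        and envelope[i] >= envelope[i + 1]
    ]
```

The square root lifts quiet frames (0.25 becomes 0.5), which is why soft speech still moves the orb visibly.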

Architecture:
- Sidecar RMS envelope. Computed in Qwen3TtsEngine.generateSegmentAudioVC
  right after PCM is available, packed into SegmentReady, and handed to
  onSegmentPlaying(sentence, durationMs, rmsEnvelope) when each MediaPlayer
  starts. Zero extra IO — runs on the same PCM we already write to WAV.
- KazeiaService exposes VisualizerSignal (Idle | Listening(rms) |
  Speaking(env, dur)) as a StateFlow. The VAD loop pushes Listening,
  processLlmResponse pushes Speaking from the per-segment TTS callback,
  and finally clears to Idle when no mic is open.
- AudioVisualizerView renders via Choreographer.FrameCallback, self-
  throttled to 20 fps at Idle and full refresh during Listening/
  Speaking. Hardware layer. Pure Kotlin + Canvas, no deps. ~280 LOC.

Layout: 140 dp strip between voiceBar and rvMessages in activity_chat.xml.

No 3D engine, no Unity, no splash extension. The avatar design work
remains on disk for a later phase when the TTS+streaming pipeline
stabilises enough to spend time on DECA/FLAME integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 23:20:15 +02:00
Kazeia Team f17131aefb UI: reveal Kazeia reply in sync with TTS audio (per-sentence, per-word)
Matches the 'conversation' feel the user asked for. Previously the
full LLM response appeared in the chat as soon as generation
finished, then audio played 5–10 s later — text and sound felt
decoupled. Now:

- The KAZEIA bubble is created empty and only starts filling when
  the first TTS segment actually starts playing through the speaker
  (we already split the response by sentence for the chained-
  MediaPlayer pipeline; that split drives the reveal too).
- Inside each sentence, words are appended one by one at a cadence
  of (audio duration / word count) — slower sentences reveal slower,
  matching speech pacing. The first word of each sentence appears
  immediately so audio and text stay aligned at the start.
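
As a sketch of the cadence rule (audio duration divided by word count, first word at t = 0), in Python rather than the Kotlin coroutine used in the app; `reveal_schedule` is a hypothetical name.

```python
def reveal_schedule(sentence, duration_ms):
    """Return (delay_ms, text_so_far) pairs; the first word appears at 0 ms."""
    words = sentence.split()
    if not words:
        return []
    step = duration_ms / len(words)
    return [
        (round(i * step), " ".join(words[: i + 1]))
        for i in range(len(words))
    ]
```

For a four-word sentence lasting 2000 ms this yields updates at 0, 500, 1000 and 1500 ms.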

Implementation:
- Qwen3TtsEngine: added `onSegmentPlaying(sentence, durationMs)`
  listener, invoked from the chained-MediaPlayer worker the moment
  each segment's MediaPlayer.start() lands. Sentence + duration are
  carried end-to-end via a new SegmentReady data class.
- KazeiaPipeline.speakText: forwards an optional listener down to
  the TTS engine, same signature.
- KazeiaService: new updateMessageText(id, text) helper. In
  processLlmResponse, the bubble is added empty before speakText and
  grown by a reveal coroutine per sentence; after speakText returns
  we snap to the full text as a safety net.

No change to the stream_llm debug intent path — it still uses the
old enqueueSentence flow directly and doesn't need the reveal (no
UI bubble there).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:58:18 +02:00
Kazeia Team 6a958c1a10 Revert MemoryOptimizer — reclaim wasn't worth the footprint
The kill-background approach worked at startup (−1.6 GB) but respawns
pulled most of that back within 1–3 min, and the periodic sweep only
kept the first sweep's reclaim stable rather than going lower. Net
practical benefit over "just let Android manage it" is small, not
worth a custom optimizer + normal-perm + file maintenance.

Removes:
- MemoryOptimizer.kt
- KazeiaService.onCreate calls to freeRamForModels / startPeriodicOptimizer
- KILL_BACKGROUND_PROCESSES permission from the manifest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:46:58 +02:00
Kazeia Team 751e3e0868 memory: periodic sweep + expand kill list (photos, calendar, contacts, vending, tachyon…)
First sweep reclaimed 1.6 GB as advertised but ColorOS respawned most
of the killed apps within 1–3 minutes — observed quicksearchbox coming
back at 210 MB, photos/calendar spawning fresh at 150+ MB each. Two
changes:

1. Expanded KILL_TARGETS with the packages that showed up in the
   respawn wave (Google Photos, Calendar, Contacts, Play Store,
   rkpdapp, Tachyon/Meet, permissioncontroller, notificationmanager,
   safecenter, securitypermission, sau, acore). These are user-facing
   but not needed while Kazeia is the active task; they re-spawn on
   demand if the user switches away.

2. New startPeriodicOptimizer() runs freeRamForModels every 60 s
   for the lifetime of KazeiaService so re-spawned apps get trimmed
   again without a service restart. Tied to serviceScope so it stops
   cleanly on destroy.

Net effect observed: avail RAM stays ~1.2–1.5 GB higher than without
the sweep. Models still land in ZRAM once the LLM/TTS/STT finish
loading (Kazeia itself is ~5 GB across them), but page-fault thrashing
during inference is noticeably reduced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:44:09 +02:00
Kazeia Team 39babcb158 TTS+audio+memory: ColorOS playback fixes + kill-background reclaim
Three unrelated fixes rolled into one so testing on the tablet stayed
coherent. All were driven by what the user was observing during live
audio tests, not by pre-planned refactors.

1. **Audio playback actually audible.** ColorOS's AudioFlinger
   silently muted our AudioTrack ~600 ms after play() every time
   (dumpsys audio showed `event:muted updated source:clientVolume`
   and playbackHeadPosition stuck at 0), regardless of USAGE_MEDIA /
   USAGE_ASSISTANT / USAGE_VOICE_COMMUNICATION, regardless of audio
   focus grant, regardless of FGS type including mediaPlayback. A
   MediaPlayer path using the SAME usage attributes works because it
   routes through a different AudioFlinger thread that isn't under
   the same background-hardening policy. `USE_MEDIAPLAYER_FALLBACK`
   in Qwen3TtsEngine.kt flips playback to a WAV-per-segment pipeline.
   Two MediaPlayer instances are chained via `setNextMediaPlayer()`
   so segments transition without re-arming the DAC (that re-arm was
   audible as "beg beg" pops between sentences). Synth of seg N+1
   runs in parallel with playback of seg N via a capacity-2 Channel,
   hiding synthesis latency behind playback for all but the first seg.

2. **Mic no longer loops TTS back into STT.** The continuous-
   listening VAD in KazeiaService already had a guard to drop frames
   while `pipelineState is Speaking`, but that state was never set by
   any caller — so the mic kept recording during playback and fed our
   own speaker output back to Whisper, creating the infinite
   "Kazeia talks to Kazeia" loop the user observed. Both the
   stream_llm intent path and the main `processLlmResponse` TTS path
   now wrap the TTS call with `Speaking → Idle/Listening`.

3. **Free 1.6 GB of RAM at service start.** The OnePlus Pad 3 with
   ColorOS keeps ~7 GB of Google + OPLUS background services
   resident at idle. With Qwen3-4B (3.2 GB) + Qwen3-TTS (1 GB) +
   Whisper (0.5 GB) on top, most of our model weights were going to
   ZRAM swap — "the NPU is stuck" reports were actually page faults
   paging 3 GB of LLM weights back in before each inference. New
   `MemoryOptimizer` kills 30-ish non-essential background packages
   (Google optional: YouTube, Wallet, Chromecast, Messaging, AICore,
   Quicksearchbox; OPLUS optional: smartsidebar, cosa, pantanal,
   nhs, midas, …) via `ActivityManager.killBackgroundProcesses`.
   Measured reclaim on first run: **avail RAM 8468 MB → 10112 MB,
   +1644 MB**. Uses KILL_BACKGROUND_PROCESSES (normal perm, no user
   prompt); system-critical packages and the launcher/systemui are
   explicitly excluded from the target list.

Collateral changes:
- Added FOREGROUND_SERVICE_MEDIA_PLAYBACK permission + fgsType flag
  (didn't fix the mute on its own, but it's correct per Android 14
  policy, and omitting it would be a latent compliance risk).
- Kept `USE_STREAMING_DECODE` + CP↔BigVGAN overlap code intact
  behind the MediaPlayer-fallback branch so reverting to the
  AudioTrack streaming path is a single-const flip if ColorOS ever
  lifts the hardening (or we move to a device without it).
- New AudioTrack path has a keep-alive silence watchdog and a
  playback-head drain wait on stop. Both were attempts to fix the
  mute that didn't pan out on their own; leaving them in so the
  streaming path stays usable on non-hardened devices.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:37:20 +02:00
Kazeia Team 0632db1ee0 UI: drop Magisk prompt — ResourceMonitor stops probing su
ResourceMonitor.init ran `su -c id` at every ChatActivity launch to see
if root was available, then used root to read /sys/class/kgsl/... and
/sys/bus/platform/devices/soc:qcom,msm-cdsp-rm/... for GPU/NPU usage %.
That probe was the only thing still triggering the Magisk auth dialog
on each app start after the no-root LLM migration.

Remove the root probe and the execRoot helper. GPU/NPU reads now return
-1 (UI already renders "—" for negative values). The non-root
/sys/class/kgsl/kgsl-3d0/gpubusy path is kept as a best-effort — it's
world-readable on some devices, silently fails otherwise. CPU and RAM
readouts are unaffected (never needed root).

Dead-code `su -c ...` calls remain in Qwen3TtsEngine (hexStartRunner,
hexStartCpRunner, hexStopRunner, etc.) and WhisperNpuSttEngine, but all
are gated behind fallback paths that don't execute under the current
PTE-only config (talkerPteModule != null && cpPteModule != null short-
circuits before any su call). Left in place to avoid churning the TTS
Hexagon fallback; can be purged in a later cleanup pass if needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:35:18 +02:00
Kazeia Team 10fd10fd90 TTS: overlap CP↔BigVGAN — first audio 14.5s → 10.9s per segment
Streaming variant of the per-segment decode pipeline. As soon as SEQ_LEN
codes are accumulated from the talker/CP loop, BigVGAN is dispatched on
a background coroutine while the producer keeps generating the rest of
the segment. The BigVGAN consumer feeds a streaming crossfader that
emits stable audio as it arrives and holds back overlapSamples for the
next chunk's blend.

Mirrors decodeChunked's semantics exactly so final audio is bit-identical
modulo the fadeOut application location (now applied to the final
emission tail instead of the full buffer; the last 40ms still get faded).
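
A minimal Python sketch of a streaming crossfader that holds back `overlap` samples for the next chunk's blend. It assumes consecutive chunks overlap by exactly `overlap` samples and uses a linear blend; the class name, the blend curve, and the omission of the final 40 ms fade are all simplifications, not the Kotlin implementation.

```python
class StreamingCrossfader:
    def __init__(self, overlap):
        self.overlap = overlap
        self.tail = []  # held-back samples waiting for the next chunk

    def push(self, chunk):
        """Blend the held tail into the head of `chunk`; return stable samples."""
        n = self.overlap
        if len(chunk) < n:
            raise ValueError("chunk shorter than overlap")
        if self.tail:
            blended = []
            for i in range(n):
                w = (i + 1) / (n + 1)  # weight of the new chunk rises across the overlap
                blended.append(self.tail[i] * (1.0 - w) + chunk[i] * w)
            body = blended + list(chunk[n:])
        else:
            body = list(chunk)
        cut = len(body) - n
        self.tail = body[cut:]
        return body[:cut]

    def finish(self):
        """Emit whatever is still held back at the end of the segment."""
        tail, self.tail = self.tail, []
        return tail
```

Two 5-sample chunks with an overlap of 2 produce 8 output samples in total, since the overlapping region is emitted once.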

Validated A/B on the same prompt 3 used in the recent benchmark:
  prompt: "Je me sens un peu triste aujourdhui…"
  seg 0 first audio:  14 485 ms → 10 936 ms (−3.5 s)
  end-to-end first audio (LLM trigger → audio): 16.2 s → 12.7 s
  Stream LLM total: 33 234 ms → 28 594 ms (−4.6 s)

Short segments (<SEQ_LEN codes) and the legacy non-streaming callers
(generateSegmentAudioVC, decodeChunked, multi-segment pipelines, etc.)
are untouched. The new path is gated behind USE_STREAMING_DECODE so it
can be reverted by flipping a single const if a regression is found.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:22:15 +02:00
Kazeia Team a41619ed67 TTS: keep BigVGAN on CPU after GPU regression; LLM filter strips more tags
#2 BigVGAN GPU experiment: ORT-QNN GPU EP loaded the v2_decoder_conv ONNX
model successfully (session creation 463 ms, no fallback warnings) but
per-phrase inference jumped to ~3.5 s vs ~2 s on CPU 8-thread. The GPU/CPU
memory transfer cost dominates for this conv-heavy decoder, and the
optimization went the wrong way. Comment block updated to record both the
HTP and GPU paths as tried-and-rejected so future passes don't re-walk the
same ground.

LLM streaming filter: extend the lookahead-based <think>…</think>
suppressor to also strip singleton special tokens (<|im_start|>,
<|im_end|>, <|endoftext|>). Previously the closing <|im_end|> at end of
the assistant's turn leaked into the SentenceStreamer and ended up as a
spurious sentence at the end of the TTS output. Same lookahead-buffer
trick handles split tokens.

Validated end-to-end: 'Bonjour, comment vas-tu ?' → "Bonjour ! Je vais
bien, merci. Comment vas-tu ?" → seg 0 "Bonjour !", seg 1 "Je vais bien,
merci." (no <|im_end|>), BigVGAN back to 1.8 s/phrase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:48:37 +02:00
Kazeia Team 3d435f9cdd LLM: trim system prompt to drop ~27 prefill tokens (-1.3s TTFT)
The verbose 55-token system prompt was the cheapest TTFT win on the
kv-only path (52 ms per prefill token). Compacting it to 25 tokens while
keeping the four load-bearing constraints (Kazeia identity, French only,
short replies, /no_think) measurably improved end-to-end latency.

Validated 'Bonjour, comment vas-tu ?' on tablet:
  Before: prompt_tokens=80, TTFT=4202ms, total=5716ms
  After:  prompt_tokens=53, TTFT=2865ms, total=4034ms (-1.3s, -32% TTFT)

Reply quality preserved: "Bonjour ! Je vais bien, merci. Comment vas-tu ?"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:16:11 +02:00
Kazeia Team b57719fa5e LLM: filter <think> tokens out of the streaming TTS path
Even with /no_think in the system prompt Qwen3 still emits an empty
<think>…</think> wrapper before the real answer. Without filtering, the
SentenceStreamer treats '<think>' as a sentence boundary and feeds three
tokens of XML into the TTS, producing audible parasites at the start of
each reply.

The new in-callback filter buffers a small lookahead (just enough to span
"</think>"), suppresses everything between the open and close tags, and
flushes the surrounding prose to onToken in order. With the lookahead, tags
that arrive split across decoded pieces ("<thi"+"nk>") still match.
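
The idea can be sketched in Python as follows (the real filter lives in the Kotlin LLM callback). This version holds back only the trailing characters that could still turn into a tag, which is one way to get the split-token behaviour; `ThinkFilter` and its methods are hypothetical names.

```python
class ThinkFilter:
    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self, on_token):
        self.on_token = on_token
        self.buf = ""
        self.inside = False  # True while between <think> and </think>

    def _partial_suffix(self, tag):
        """Length of the longest buffer suffix that is a proper prefix of tag."""
        for k in range(min(len(tag) - 1, len(self.buf)), 0, -1):
            if self.buf.endswith(tag[:k]):
                return k
        return 0

    def feed(self, piece):
        self.buf += piece
        while True:
            tag = self.CLOSE if self.inside else self.OPEN
            idx = self.buf.find(tag)
            if idx >= 0:
                if not self.inside and idx > 0:
                    self.on_token(self.buf[:idx])  # prose before the open tag
                self.buf = self.buf[idx + len(tag):]
                self.inside = not self.inside
                continue
            keep = self._partial_suffix(tag)
            safe = self.buf[:len(self.buf) - keep]
            if safe and not self.inside:
                self.on_token(safe)
            self.buf = self.buf[len(self.buf) - keep:]
            return

    def flush(self):
        """Emit any held-back text at end of stream (unless still inside a tag)."""
        if self.buf and not self.inside:
            self.on_token(self.buf)
        self.buf = ""
```

Feeding the pieces `"<thi"`, `"nk>"`, `"</thi"`, `"nk>"`, `"Bon"`, `"jour !"` forwards only `"Bon"` and `"jour !"`.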

Validated end-to-end: prompt 'Bonjour, comment vas-tu ?' now streams
sentence-by-sentence to the TTS — first segment "Bonjour !" reaches the
talker at 4.6 s, no <think> sneak-through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:16:08 +02:00
Kazeia Team f32b5ddfdd LLM no-root: validate end-to-end pipeline, fix kv_io_bit_width detection
End-to-end validation on OnePlus Pad 3 with stream_llm intent:
  Prompt:   'Bonjour, comment vas-tu ?'
  Response: 'Bonjour ! Je suis là pour t'écouter. Comment vas-tu aujourd'hui ?'
  TTS:      Talker(PTE) 37ms/step, CP(PTE) 73ms/step, audio synthesized.
  No su, no Magisk prompts.

Two fixes since the previous commit:
1. ExecuTorchLlmEngine: pass echo=false to LlmModule.generate() — by default
   the runner echoes the prompt tokens back via the callback, which fed the
   ChatML wrap (<|im_start|>user …) into the SentenceStreamer and TTS.
2. jni_layer_llama.cpp: pick Runner<uint8_t> vs Runner<uint16_t> based on the
   model's get_kv_io_bit_width metadata, mirroring qnn_llama_runner.cpp main().
   The hard-coded uint16_t was wrong for our Qwen3-4B export (which uses 8-bit
   KV I/O) and produced fluent-looking but completely random tokens
   ("blocked罩ug darkestSOLEQuotes作者本人 …") — same symptom whether greedy or
   sampled, the smoking gun for a width-mismatched KV cache reinterpretation.
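
To illustrate why a width mismatch yields garbage rather than a crash, here is a small Python demonstration of the same bytes read at both widths. The byte values are arbitrary, and this only mimics the reinterpretation; it is not the runner's code.

```python
import struct

# Four 8-bit quantised KV values as they would sit in memory (arbitrary example).
kv_bytes = bytes([12, 200, 7, 131])

# Correct read for an 8-bit KV export: one value per byte.
as_u8 = list(struct.unpack("<4B", kv_bytes))

# Wrong read with a hard-coded 16-bit width: adjacent bytes fuse into
# two little-endian values that have nothing to do with the originals.
as_u16 = list(struct.unpack("<2H", kv_bytes))
```

The buffer is still valid memory at either width, so decoding proceeds without error and the model simply attends over meaningless cache contents.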

Other tweaks:
- temperature=0.0 in the QNN_LLAMA branch of jni_layer_llama.cpp (greedy,
  matches the working qnn_llama_runner --temperature 0 invocation)
- shared_buffer=true (same as binary defaults)
- Kotlin chat template mirrors qnn_llama_runner.cpp's get_formatted_prompt for
  Qwen3 (user-first, then optional system, then "<|im_start|>assistant" with
  no trailing newline — that quirky ordering is what the .pte was trained on)

TTFT is ~4 s for a 77-token prompt in kv-only mode (sequential prefill, one
forward per token). To get a sub-second TTFT we'd need to re-export the model
in --model_mode hybrid which adds a parallel prefill_forward graph; not
required for the conversational use case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:11:23 +02:00
Kazeia Team 809a6d4fed LLM no-root: migrate to in-process LlmModule (JNI) — zero su calls
The root cause of the previous su-c requirement was that Qualcomm's FastRPC
kernel driver rejects processes spawned via ProcessBuilder fork+exec because
they lose supplementary GIDs on exec. Zygote-forked app processes retain the
proper init-configured credentials and are accepted by the adsprpcd service,
which is why ORT-QNN (Whisper, in-process) worked while the subprocess
qnn_llama_runner did not. Running the LLM in-process via ExecuTorch's
LlmModule bypasses the fork+exec path entirely.

What this commit does:
- ExecuTorchLlmEngine now uses org.pytorch.executorch.extension.llm.LlmModule
  with MODEL_TYPE_QNN_LLAMA=4 (routes to example::Runner in jni_layer_llama.cpp,
  the same C++ runner that qnn_llama_runner embeds).
- All su, ProcessBuilder, file-based prompt/response plumbing, and run_llm.sh
  gone. ChatML template is built in Kotlin; tokens stream in via LlmCallback.

Supporting changes under executorch-patches/llm_in_process_jni.patch:
1. backends/qualcomm/CMakeLists.txt — gate PyQnnManagerAdaptor on NOT ANDROID.
   The original guard (CMAKE_SYSTEM_PROCESSOR MATCHES x86_64) misfires in a
   nested scope during Android cross-compile and tried to build the host
   Python bindings.
2. extension/android/jni/jni_layer_llama.cpp — hardcode decoder_model="qwen3"
   (was "llama3") and pass eval_mode=0 (EvalMode::kKVCached) + shared_buffer=true
   to match our hybrid_llama_qnn.pte which only contains kv_forward, not
   prefill_forward.

Build: scripts/build_android_library.sh arm64-v8a with QNN_SDK_ROOT pointing
to /opt/Kazeia/qnn_sdk_242/qairt/2.42.0.251225 and EXECUTORCH_BUILD_QNN=ON.
Produces libexecutorch_jni.so (192 MB) with QNN v2.42 backend + the llama
runner code, plus libqnn_executorch_backend.so. Both staged in jniLibs.

Validated on OnePlus Pad 3: LlmModule.load() completes in 4.2 s, no su
prompts, Pipeline ready with STT(WhisperHybridEngine) → [VoiceCommands →
LLM] → TTS(Qwen3TtsEngine). TTS .pte still loads with the upgraded v2.42
runtime — no regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:39:50 +02:00
Kazeia Team 6e6a2d9f82 Baseline before no-root migration: working state with root LLM
Checkpoint commit before attempting to unify on QNN SDK v2.37 and
remove su -c for the LLM. Current working state:
- LLM Qwen3-4B via su -c qnn_llama_runner (v2.42 in /data/local/tmp/kazeia-et/)
- TTS talker + CP via ExecuTorch .pte JNI (v2.31 in jniLibs)
- STT Whisper via ORT-QNN 1.24.3

The kazeia-no-root-report.md report documents the no-root attempts and
their failures in detail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:19:36 +02:00
Kazeia Team 364016b7b8 LLM+TTS: short-response system prompt, PTE streaming fallback
- ExecuTorchLlmEngine: system prompt forces French, 1-2 short sentences,
  /no_think so the full budget goes to the answer (Qwen3 was consuming
  120+ tokens on <think>); eval_mode 0 matches our kv-mode export.
- Qwen3TtsEngine.generateSegmentAudioVC: when the Hexagon talker socket
  isn't open, fall back to runInterleavedPteFromEmbeds so the Stage 3
  streaming session still produces audio. Without this the session opened,
  accepted sentences, and silently emitted empty PCM.

Documents the QNN SDK version-skew pitfall in ExecuTorchLlmEngine.kt
ahead of the upcoming migration to a unified v2.42 toolchain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:17:08 +02:00
Kazeia Team 9930bfa392 LLM: enable Qwen3-4B NPU (21 tok/s) in service pipeline
- ExecuTorchLlmEngine: eval_mode 0 (our .pte is kv-mode, not hybrid)
- KazeiaService: call llm.load() after TTS init; try/catch falls back
  to echo mode if the runner or .pte are missing.

Pipeline on device: STT(WhisperHybridEngine) → [VoiceCommands → LLM] → TTS(Qwen3TtsEngine).
Validated on OnePlus Pad 3: LLM ready in ~8 s, gen 21.3 tok/s, RSS 1.76 GB in the
qnn_llama_runner subprocess (out-of-process from the Kazeia app).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 23:00:25 +02:00
Kazeia Team f548e02283 TTS: dynamic EOS-rank boost terminates generation cleanly across voices
Replaces the fixed maxGen + length-based boost with a fully dynamic
end-of-utterance detector that watches the model's own EOS logit rank.
End result on the Baer 3-segment monologue, validated by user as
"FORMIDABLE" / "impeccable" with both Damien and Zelda voices:

  - All 3 segments terminate via EOS (no maxGen cap hit)
  - No "page beg beg" filler tail
  - No abrupt cuts between segments
  - Audio durations 5-8 s per segment, matching Python within ~10 %

How it works (runHexGenWithPrefill, in tts/Qwen3TtsEngine.kt):

  1. At every decode step, compute the rank of CODEC_EOS in the
     repetition-penalised logits. Mid-utterance the rank sits at
     150-700 (model is committed to producing speech). Approaching
     the natural end, the rank dips toward top-50.

  2. Arm the boost only when EOS rank stays below eosRankTrigger=60
     for THREE consecutive steps. The 3-step requirement filters out
     transient single-step dips that occur during low-energy phonemes
     mid-sentence (without it, short sentences would terminate after
     ~3 s). Arming is also gated by eosBoostMinStep (50 % of expected
     speech length) so we never arm in the very first frames.

  3. Once armed, the boost increments monotonically: each subsequent
     step adds boostStepsActive * eosBoostScale to the EOS logit. The
     accumulated boost lifts EOS above top-1 within 1-3 steps, the
     argmax check fires, and the loop breaks. Scale=4 gives the model
     a small natural decay before termination; scale=5 was perfect-but-
     slightly-clipping, scale=3 wasn't strong enough to outpace the
     growing top-1 logit.
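
The three steps above can be sketched in Python as a small state machine (the real logic is in Kotlin inside runHexGenWithPrefill). Two details are assumptions: the streak is counted before eosBoostMinStep but arming waits for it, and the boost starts on the step after arming.

```python
class EosBoost:
    def __init__(self, eos_id, min_step, rank_trigger=60, consecutive=3, scale=4.0):
        self.eos_id = eos_id
        self.min_step = min_step          # e.g. 50 % of expected speech length
        self.rank_trigger = rank_trigger
        self.consecutive = consecutive
        self.scale = scale
        self.low_streak = 0               # consecutive steps with a low EOS rank
        self.armed = False
        self.steps_active = 0

    def apply(self, step, logits):
        """Return a copy of `logits` with the EOS boost applied for this step."""
        out = list(logits)
        if self.armed:
            self.steps_active += 1
            out[self.eos_id] += self.steps_active * self.scale
            return out
        eos = logits[self.eos_id]
        rank = sum(1 for v in logits if v > eos)  # 0 means EOS is already top-1
        self.low_streak = self.low_streak + 1 if rank < self.rank_trigger else 0
        if self.low_streak >= self.consecutive and step >= self.min_step:
            self.armed = True
        return out
```

The caller would pass repetition-penalised logits, take the argmax of the result, and stop when it equals `eos_id`.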

Other tweaks bundled in this commit because they all contribute to
the clean output:

  * Inter-segment gap 120 → 250 ms — gives the listener a perceived
    sentence boundary instead of a hard concatenation.

  * fadeOut(audio, 40) on every segment — cosine roll-off over the
    last 40 ms so the EOS-clipped tail decays naturally instead of
    sample-clipping.

  * top_k 50 → 200 in the fallback sample call — wider pool to keep
    EOS reachable when the boost just fails to hit argmax.
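
A cosine roll-off over the last 40 ms might look like this in Python. The explicit `sample_rate` parameter and the raised-cosine shape are assumptions; the Kotlin fadeOut(audio, 40) presumably knows the rate already.

```python
import math

def fade_out(samples, sample_rate, fade_ms=40):
    """Return a copy of `samples` with a raised-cosine fade on the last fade_ms."""
    n = min(len(samples), sample_rate * fade_ms // 1000)
    out = list(samples)
    start = len(out) - n
    for i in range(n):
        # Gain falls smoothly from just under 1 to 0 at the final sample.
        gain = 0.5 * (1.0 + math.cos(math.pi * (i + 1) / n))
        out[start + i] *= gain
    return out
```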

Voice swap is a 45 KB file push (damien_voice_prefix.bin and
damien_voice_suffix.bin). Successfully tested today with Elodie
(female, norm 10.12) and Zelda (norm 9.39) using Damien (norm 10.36)
as the baseline — same Kotlin code, no rebuild needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:13:04 +02:00
Kazeia Team c25040a780 TTS: conditional tail-trim + export script accepts voice path arg
Two small changes:

  * export_tts_text_embeddings.py now takes the voice wav as an optional
    second CLI arg (defaults to damien_15s_24k.wav). Lets the same script
    capture voice-prefix+suffix for any speaker wav without editing the
    source — used today to test Elodie alongside Damien.

  * synthesizeTextStreaming + generateSegmentAudioVC only run the
    trimTailLowEnergy trim when n >= maxGen. The trim's 35%-of-peak
    threshold is tuned to catch "page beg beg" filler after the talker
    fails to emit EOS — but it was cutting valid speech when EOS fired
    early (observed on Elodie seg 1: 10.08 s → 2.92 s, a 4-second over-
    trim). With the guard it's a no-op on converging generations and
    only fires on the ~15% of segments that hit maxGen.
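
The guard can be sketched in Python like so. The trimming rule shown (drop trailing frames below 35 % of the peak RMS) is a guess at what trimTailLowEnergy does; only the `n >= maxGen` condition comes from this commit.

```python
def trim_tail_low_energy(frame_rms, threshold_ratio=0.35):
    """Return how many leading frames to keep after dropping a quiet tail."""
    peak = max(frame_rms, default=0.0)
    end = len(frame_rms)
    while end > 0 and frame_rms[end - 1] < threshold_ratio * peak:
        end -= 1
    return end

def frames_to_keep(frame_rms, n_tokens, max_gen):
    """Trim only when generation hit the cap, i.e. EOS never fired."""
    if n_tokens < max_gen:
        return len(frame_rms)  # converged via EOS: keep everything
    return trim_tail_low_energy(frame_rms)
```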

Validation after the fix (Elodie, Baer monologue):
  - seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut,
           the filler tail)
  - seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is
  - seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is

Voice prefix/suffix shape is speaker-invariant except position 7 (the
xvector). Confirmed by capturing both Damien and Elodie and diffing:
positions 0-6 and 8 identical within 1e-8, suffix identical within
1e-8, only pos 7 has a different xvector embedding (norm 10.36 vs 10.12).
That means swapping speakers on-device is a 45 KB file push — no app
rebuild, no re-export of the 297 MB vocabulary table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:32:33 +02:00
Kazeia Team 0833d1bd21 TTS: route all synthesizeAndPlay calls through Stage 3 streaming session
Replaces the four per-sentence TTS entry points (pipeline.speak, REPEAT
voice command, echo-mode TTS, LLM-response TTS) with a single shared
pipeline.speakText() that:

  * opens a Qwen3TtsEngine streaming session when the TTS backend is
    Qwen3 (voice-cloning path);
  * feeds the whole response through a SentenceStreamer so the first
    sentence starts playing as soon as it's decoded;
  * falls back to the old one-shot synthesizeAndPlay for non-Qwen3 TTS
    engines (AndroidTts, Chatterbox) that don't expose a session API.

KazeiaPipeline.speakText is now public so KazeiaService can use the
same dispatch — previously each call site re-implemented the
"streaming-or-fallback" logic or just called synthesizeAndPlay and
waited for the full synthesis.

Enabling the real on-device LLM is a separate task (task #48): the
existing llama-cli binary has ggml-hexagon linked in and fails to
init the DSP (0x80000406) when the TTS Hexagon runners hold the
session. Needs either a CPU-only llama-cli build or the restored
ExecuTorch qnn_llama_runner setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:12:14 +02:00
Kazeia Team 2f07901ff3 TTS Stage 3: LLM stream → sentence split → TTS session → shared AudioTrack
Closes the loop on on-device conversational TTS. The LLM's token stream is
now consumed by a SentenceStreamer which fires a callback the moment a
terminal-punctuation boundary appears; each sentence is enqueued to a
persistent TTS streaming session that generates and plays audio through a
single shared AudioTrack. Sentence N's audio plays while sentence N+1 is
being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and
no "wait for full response before hearing anything".

Mocked-LLM validation on the 3-sentence prompt:
  "Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui."

  - First sentence detected:    1 ms
  - Seg 0 prefill (Hex):       567 ms
  - Seg 0 generated:         4 200 ms (18 tokens, 1.4 s audio)
  - Seg 1 generated:         9 100 ms (42 tokens)
  - Seg 2 generated:        11 000 ms (46 tokens)
  - Session closed:         33 500 ms (all audio drained)

Changes:

  * tts/SentenceStreamer.kt — 50-line helper that buffers tokens and
    fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4
    so "Oui." / "Bonjour." count as real sentences; higher thresholds
    swallowed conversational openers into the next segment and delayed
    first audio. flush() for the final partial sentence.

  * Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession
    triplet. startStreamingSession opens a 30-second MODE_STREAM
    AudioTrack plus a background worker coroutine that pulls sentences
    from an unlimited Channel. enqueueSentence is non-blocking; the worker
    serialises generation so audio order matches enqueue order.
    generateSegmentAudioVC is the per-sentence body (tokenize → prefill
    build → Hexagon gen → decode) without the WAV-save side effects that
    the /stream_text intent path does.

  * KazeiaService new intents:
      - stream_llm        : real LLM path (needs LLM loaded; currently the
                            debug build runs echo-mode so this path is
                            shipped but requires production config to
                            exercise).
      - stream_llm_mock   : fakes the LLM stream by splitting the given
                            text on spaces with 50 ms per "token" —
                            matches the ~20 tok/s rate the on-device LLM
                            produces and lets Stage 3 be validated without
                            flipping the LLM on.
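
The sentence-splitting behaviour described for tts/SentenceStreamer.kt can be
sketched as below. This is a Python illustration, not the Kotlin source: the
class and method names (feed, flush, on_sentence) are assumptions, and only
the terminator set and minChars = 4 come from the commit message.

```python
from typing import Callable, List


class SentenceStreamer:
    """Buffers streamed tokens and emits a sentence at each terminal mark."""

    # Terminators listed in the commit message.
    TERMINATORS = (".", "!", "?", ";", "\n")

    def __init__(self, on_sentence: Callable[[str], None], min_chars: int = 4):
        self.on_sentence = on_sentence
        self.min_chars = min_chars
        self._buf = ""

    def feed(self, token: str) -> None:
        # Scan character by character so a boundary inside a token is caught.
        for ch in token:
            self._buf += ch
            if ch in self.TERMINATORS:
                candidate = self._buf.strip()
                if len(candidate) >= self.min_chars:
                    self.on_sentence(candidate)
                    self._buf = ""
                # Too short: keep buffering so it merges with what follows.

    def flush(self) -> None:
        # Emit the final partial sentence, if any.
        tail = self._buf.strip()
        if tail:
            self.on_sentence(tail)
        self._buf = ""


def split_stream(tokens: List[str]) -> List[str]:
    """Helper for the example: run a token list through the streamer."""
    out: List[str] = []
    streamer = SentenceStreamer(out.append)
    for tok in tokens:
        streamer.feed(tok)
    streamer.flush()
    return out
```

Feeding the mock stream word by word (each word followed by a space) yields
one callback per sentence, with flush() delivering the unterminated tail.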

Architectural notes:
  - AudioTrack buffer is 30 s so generation can run ahead of playback
    without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short
    sentences, so for a 2-3 sentence response the buffer actually drains
    between segments and the user hears a short gap — expected, not a
    bug. Masking that gap requires RTF < 1 which is out of scope.
  - Hexagon KV is reset between sentences (hexReset) so the talker
    doesn't see stale context. The prefill yields cb0 = 1995 on every
    sentence that starts with a capital letter, matching the Python
    greedy reference, which confirms prefill reconstruction is stable
    across segments within a session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:52:46 +02:00
Kazeia Team 7f1a44c23d TTS Stage 2: on-device voice-cloning TTS for arbitrary text
Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any
French (or Qwen3-supported) text with no network, no ADB push per
phrase, and quality that matches Python's reference on "Bonjour, je
suis Kazeia, je suis là pour vous écouter." — user validation:
"impeccable".

Three pieces that compose the path:

  1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's
     Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder,
     Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex
     lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces
     identical token IDs to HF's Qwen2TokenizerFast on the test phrase:
     [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
      4914, 9012, 90229, 2676, 13].

  2. export_tts_text_embeddings.py — one-time PC export of:
     * Full projected text embeddings for the entire 151936-token vocab
       as fp16 (297 MB). Sanity check: live vs stored max abs diff
       1.15e-4 on token 1043. Mmap'd on-device so it stays off the
       Java heap and leaves room for the 125 MB cp_embeddings alloc.
     * Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a
       Python voice-clone capture, text-invariant across segments.
     * Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1
       of the same capture. Also text-invariant (diff = 0.0 across
       3 different-text segments). Without it the talker never sees
       "text ended" and decode falls into page/beg repetition.
     * Qwen3 tokenizer vocab.json + merges.txt.

  3. Qwen3TtsEngine.kt:
     * mmap loader for the embeddings table + buffered fp16→fp32
       lookup (halfToFloat covers subnormals/inf/NaN so pathological
       tokens don't become 0).
     * Stage 2 assets detected at init; missing file transparently
       falls back to legacy 1050-token reduced-vocab path.
     * synthesizeTextStreaming(text, onSegmentReady) — new public API:
       sentence-split → BPE → build prefill as
         [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
       (exact structure Python emits; verified bit-for-bit by matching
       captured Baer prefill positions against text_projection(tok)+
       codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode
       each segment through the existing BigVGAN pipeline → callback.
     * runHexGenWithPrefill — Hexagon prefill + interleaved CP decode
       loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
       Python's voice_clone). Degeneracy guard stops generation once
       9 identical cb0 appear in a row, which catches the rare
       "page beg beg beg" tail when EOS never fires.
       maxGen = ids.size*4 + 10 leaves headroom over the typical
       3.3 codec frames per text token that Python produces.
     * Prefill build uses the speaker's captured prefix/suffix rather
       than the legacy in-code buildPrefillEmbeddings that puts only
       one text token in prefill — the structure mismatch produced
       garbled audio in the first attempt of this commit.

  4. KazeiaService.kt: new stream_text intent extra wires text input
     to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.
     First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon
     8 Elite (prefill + 74-token decode), vs a 3-phrase sentence batch
     that was 65 s pre-streaming — streaming + on-device text together
     unblock the MVP chat loop.
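
The prefill layout from item 3 can be sketched as follows. This is a Python
illustration under assumptions: build_prefill and its parameters are
hypothetical names, embeddings are plain lists, and the demo uses a toy
dimension of 4 instead of 1024. Only the structure
[voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix] and the
9-row / 2-row prefix and suffix sizes come from the commit message.

```python
from typing import Dict, List

Vector = List[float]


def build_prefill(
    prefix: List[Vector],
    suffix: List[Vector],
    text_proj: Dict[int, Vector],
    codec_pad: Vector,
    token_ids: List[int],
) -> List[Vector]:
    """Return prefix rows, one summed row per text token, then suffix rows."""
    body = [
        # Element-wise sum of the projected text embedding and codec pad.
        [t + p for t, p in zip(text_proj[tid], codec_pad)]
        for tid in token_ids
    ]
    return list(prefix) + body + list(suffix)


# Toy data: dimension 4 stands in for the real 1024 (assumption for brevity).
DIM = 4
ids = [81581, 11, 4759, 35631, 730, 9832, 685, 11,
       4759, 35631, 37915, 4914, 9012, 90229, 2676, 13]
prefix = [[0.0] * DIM for _ in range(9)]
suffix = [[1.0] * DIM for _ in range(2)]
codec_pad = [0.5] * DIM
text_proj = {tid: [float(tid % 7)] * DIM for tid in set(ids)}

prefill = build_prefill(prefix, suffix, text_proj, codec_pad, ids)
```

For the 16-token test phrase this gives 9 + 16 + 2 = 27 prefill positions.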

Known caveats:
  * 297 MB on-device footprint for the embedding table. Acceptable on
    OnePlus Pad 3; can be quantized further (int8 per-row) if storage
    becomes tight.
  * First init adds ~3 s for BPE vocab + merges load (151k × 2 hash-
    maps). Happens once per process.
  * maxGen cap means extremely long sentences may truncate. The
    sentence splitter already keeps segments ≤120 chars so this
    hasn't been observed in practice.
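
A minimal sketch of the two stop conditions mentioned above (the maxGen cap
and the 9-in-a-row degeneracy guard), written in Python for illustration.
The function names, the next_cb0 callback and the eos_id value are
hypothetical; whether the real engine keeps or trims the repeated tail is
not stated, and this sketch keeps it.

```python
from typing import Callable, List

DEGENERATE_RUN = 9  # identical cb0 values in a row, per the commit message


def max_gen(n_text_tokens: int) -> int:
    """Generation cap: ids.size * 4 + 10."""
    return n_text_tokens * 4 + 10


def generate_cb0(
    next_cb0: Callable[[int], int], n_text_tokens: int, eos_id: int
) -> List[int]:
    """Collect cb0 codes until EOS, a degenerate run, or the maxGen cap."""
    codes: List[int] = []
    run = 0
    for step in range(max_gen(n_text_tokens)):
        code = next_cb0(step)
        if code == eos_id:
            break  # normal termination
        # Extend the run if this code repeats the previous one, else restart.
        run = run + 1 if codes and code == codes[-1] else 1
        codes.append(code)
        if run >= DEGENERATE_RUN:
            break  # stuck on one code: EOS is unlikely to fire
    return codes
```

For the 16-token test phrase, max_gen(16) is 74, which agrees with the
74-token decode quoted for the "Bonjour..." test.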

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:12:09 +02:00
Kazeia Team 5e416713ce TTS Stage 1 streaming: play each segment the moment it's decoded
Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX
CP backend. First audio arrives at ~20s (vs ~65s for the full phrase
non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning
is preserved per segment because each segment now ships its own full prefill.

Changes:

  * Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady)
    reads single- or multi-segment embeds, runs prefill + generation + VQ
    decode + BigVGAN per segment, and fires the callback with each
    segment's ShortArray the moment it's ready. Saves per-segment WAVs
    (kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav
    for offline inspection. Extracted the common generation loop into
    runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and
    streaming paths share exactly the same code (no quality drift between
    modes). Added hexReset() between segments so segment 2's prefill logits
    don't contain segment 1's KV state.

  * vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0
    it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a
    codebook row index — reading past the 2048-row buffer. The short Baer
    probe never hit this; longer phrases do. Clamp any out-of-vocab code
    to 0 at allCodebooks build time.

  * KazeiaService: new stream_pipeline intent extra wires the callback
    to an AudioTrack MODE_STREAM instance, writing each segment's audio as
    soon as it comes back. Logs time-to-first-audio.

  * prepare_tts_segments.py: the previous version only captured 1-token
    decode calls and substituted a generic 9-embed "prefill_base" pulled
    from an unrelated single-segment file — dropping the per-segment
    xvector conditioning AND the text-encoded embeddings, so Hexagon
    produced garbled mixed speech for segments 2..N. Now captures the
    multi-token prefill call too (like prepare_tts_voiceclone.py) so each
    segment is self-contained.
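
The vqDecode fix amounts to clamping codes before they are used as codebook
row indices. A Python sketch of that idea, assuming CODEBOOK_SIZE = 2048
(from the "2048-row buffer" above); the function name and the example EOS id
2150 are hypothetical.

```python
from typing import List

CODEBOOK_SIZE = 2048  # rows per codebook, per the commit message


def clamp_codes(frames: List[List[int]]) -> List[List[int]]:
    """Replace any out-of-range code with 0 so it is a valid row index."""
    return [
        [code if 0 <= code < CODEBOOK_SIZE else 0 for code in frame]
        for frame in frames
    ]


# 2150 stands in for a CODEC_EOS vocab id above CODEBOOK_SIZE (hypothetical).
frames = [[5, 2047], [2150, 3]]
safe = clamp_codes(frames)
```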

Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the
Snapdragon 8 Elite with current config means each segment takes longer to
generate than it takes to play, so audible gaps between segments remain.
Removing the gaps requires either (a) producer/consumer parallelism across
two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the
~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug
and the .pte path contends with Hexagon talker on the DSP).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:43:30 +02:00
Kazeia Team de878ddf5c TTS tremor investigation: identify cross-arch numerical floor, gate diag flags
Extensive investigation of the audible "tremor" in the generated voice-cloned
audio. Conclusion is architectural, not a bug:

  * Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998
  * ONNX Runtime CP V2 matches PyTorch greedy CP almost exactly (0.24%
    residual divergence, measured by injecting Python's captured cb0 at
    each step: 14/16 codebooks match 100%, cb14 and cb15 each miss 1
    token out of 53)
  * BigVGAN decoder is bit-identical to PyTorch (validated earlier)
  * Therefore the tremor is caused entirely by the ~28% of cb0 argmax flips
    where the tiny fp16 logits drift crosses the top-1/top-2 margin. This
    cascades through the autoregressive chain into a trajectory the model
    never saw at training time → incoherent artifacts.
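
The argmax-flip mechanism in the last bullet can be illustrated with made-up
numbers. The logits and drift below are invented for the example and are not
measurements from the device; the point is only that a drift larger than the
top-1/top-2 margin changes the greedy choice.

```python
from typing import List


def argmax(xs: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(xs)), key=lambda i: xs[i])


def top2_margin(xs: List[float]) -> float:
    """Gap between the largest and second-largest logit."""
    ordered = sorted(xs, reverse=True)
    return ordered[0] - ordered[1]


# Invented reference logits with a very small top-1/top-2 margin.
ref_logits = [4.1000, 4.0998, 1.2]
# Invented numerical drift of the size a reduced-precision path might add.
drift = [-0.0002, 0.0002, 0.0]
dev_logits = [a + b for a, b in zip(ref_logits, drift)]

ref_choice = argmax(ref_logits)  # token picked by the reference
dev_choice = argmax(dev_logits)  # token picked after drift
```

Once one cb0 differs, every later step conditions on a different history,
which is why a per-step flip rate compounds over the autoregressive chain.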

The cross-architecture gap (x86 AVX-512 vs ARM64 NEON+HMX) cannot be zeroed
by any runtime swap: LibTorch Android would use NEON kernels with a different
reduction order than PyTorch x86, which is the same class of error with a
smaller but non-zero residual. Temperature tweaking (0.3 → 0.9) and
greedy-vs-sample gave no perceptual difference: the floor is numeric, not in
the sampling layer.

Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a
thesis-relevant finding about on-device TTS deployment limits.

Cleanup:
  * All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp,
    force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated
    behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release
    builds JIT-eliminate the file checks; debug builds keep the whole
    experimental toolchain for re-running the analysis for demos/thesis.
  * force_hexagon + force_cp_v2 stay unconditional — production routing.
  * Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9).
  * Native TTS pipeline (executorch-custom/jni_layer_tts.cpp,
    app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to
    greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing
    audio past EOS where Python's seeded sampler terminated naturally).
  * scripts/prepare_tts_voiceclone.py: new script that captures Python
    greedy-CP reference (stochastic talker for EOS, deterministic CP) for
    token-by-token comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 00:15:14 +02:00
Kazeia Team 199bc4fbc9 Full native C++ TTS validated on short + long phrases
Dynamic formula: target_len = n_tokens × 3.2 + 5 (calibrated)
- Short "Bonjour..." (18 tokens → 62 trailing): OK
- Long "Je suis Kazeia... difficiles" (30 tokens → 101 trailing): OK
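
The formula can be checked against both validated phrases. In this Python
sketch the truncation to an integer is an assumption, inferred from
18 tokens giving 62 rather than 62.6; integer arithmetic is used to avoid
floating-point rounding.

```python
def target_len(n_tokens: int) -> int:
    """n_tokens * 3.2 + 5, truncated to an integer (truncation assumed)."""
    return (n_tokens * 32) // 10 + 5


short_trailing = target_len(18)  # short "Bonjour..." phrase
long_trailing = target_len(30)   # long "Je suis Kazeia..." phrase
```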

RMS trim disabled: the trailing garbage is loud, so an energy threshold cannot distinguish it from speech.
Length controlled purely by maxTokens = trailing count.

Pipeline: prepare_tts_native.py "any text" → adb push → run → audio

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:51:05 +02:00
Kazeia Team dafbe2a52b FULL NATIVE C++ TTS pipeline — any text, perfect quality
The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends

Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage

Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:39:06 +02:00
Kazeia Team 09d36f2025 Root cause found + on-device embed capture + KV=100 restored
Root cause: embeds must come from SAME NPU model instance.
Python fp32 embeds cause divergence on NPU fp16 after ~20 steps.

Solution: Java pipeline captures embeds on-device during generation.
Captured embeds work perfectly with C++ pipeline (validated "bon").

- Added capture mode: touch /data/local/tmp/kazeia/capture_mode
- Embeds saved to captured_embeds.bin (same format as pipeline input)
- KV_LEN restored to 100 (KV=64 lost role tokens → quality loss)
- C++ uses pre-computed embeds as-is (no double codec_sum)

Production path: Java pipeline RTF 1.8 for new texts (good quality)
Replay path: C++ pipeline RTF 1.26 with captured embeds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:00:37 +02:00
Kazeia Team 3dcf73aa38 Restore KV=100 + fix as-is embeds + multi-segment support
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
  (original was likely captured on-device, not from Python model.forward)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 22:26:20 +02:00
Kazeia Team 10a3904d7d Multi-segment TTS for long text: split → generate → concatenate
- prepare_tts_segments.py: splits text at sentence boundaries,
  generates Python pre-computed embeds per segment
- Kotlin: detects multi-segment file format, processes each segment
  independently (fresh KV cache), concatenates audio
- Long text tested: 3 segments, 335 tokens, 26.8s audio, RTF 1.67

File format: n_segments, then per segment: nPrefill, nTotal, embeds[]
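
A hedged Python sketch of a reader and writer for this layout. The commit
only gives the field order, so everything else here is an assumption:
little-endian int32 headers, fp32 embeds, nTotal rows of dim floats per
segment, and the function names.

```python
import struct
from typing import List, Tuple

# (n_prefill, n_total, flat list of n_total * dim floats)
Segment = Tuple[int, int, List[float]]


def pack_segments(segments: List[Segment]) -> bytes:
    """Serialise: n_segments, then nPrefill, nTotal, embeds[] per segment."""
    out = struct.pack("<i", len(segments))
    for n_prefill, n_total, embeds in segments:
        out += struct.pack("<ii", n_prefill, n_total)
        out += struct.pack(f"<{len(embeds)}f", *embeds)
    return out


def unpack_segments(blob: bytes, dim: int) -> List[Segment]:
    """Parse the same layout back; dim is the embedding width (assumed)."""
    (n_segments,) = struct.unpack_from("<i", blob, 0)
    offset = 4
    segments: List[Segment] = []
    for _ in range(n_segments):
        n_prefill, n_total = struct.unpack_from("<ii", blob, offset)
        offset += 8
        count = n_total * dim
        embeds = list(struct.unpack_from(f"<{count}f", blob, offset))
        offset += 4 * count
        segments.append((n_prefill, n_total, embeds))
    return segments
```

Each segment carries its own nPrefill, so a reader can process segments
independently with a fresh KV cache, as described above.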

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:34:05 +02:00
Kazeia Team f6df1738c5 Add prepare_tts_embeds.py for any text + codec_sum fix
- prepare_tts_embeds.py: generates pre-computed embeddings from any text
  via Python generate_voice_clone, capturing talker inputs
- C++ pipeline: always build codec_sum + trailing (not as-is)
- maxTokens: 4× trailing count (audio >> text tokens)
- Long text tested: 224 Python tokens → 125 NPU tokens (10s audio)
- Text-only embeds don't work (model needs Python pre-computed codec_sum)

Usage: python3 scripts/prepare_tts_embeds.py "Your text" output.bin
       adb push output.bin /data/local/tmp/.../full_pipeline_embeds.bin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:05:42 +02:00
Kazeia Team 42bbb96fd8 Optimize decoder: BigVGAN 8T, small models 4T → RTF 1.26
BigVGAN benefits from 8 intra-op threads (all perf cores).
Pre_conv and pre_transformer kept at 4T (small, less contention).

BigVGAN: 2757ms → 1872ms (-885ms), decode total: 2830ms → 2035ms
Pipeline: 6438ms → 5834ms → RTF 1.26
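
The RTF figure is total pipeline time divided by audio duration. A small
Python check, assuming the clip is the 4.64 s reference quoted in other
commits in this log (this commit does not restate the duration):

```python
def rtf(generation_ms: float, audio_ms: float) -> float:
    """Real-time factor: processing time per unit of audio produced."""
    return generation_ms / audio_ms


AUDIO_MS = 4640  # 4.64 s reference clip (assumed to apply here)

rtf_after = round(rtf(5834, AUDIO_MS), 2)  # pipeline time after this change
keeps_up = rtf_after < 1.0                 # True only if faster than playback
```

Any RTF above 1 means generation is slower than playback, so the value
still has to drop below 1 before streamed segments can play without gaps.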

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 13:00:05 +02:00
Kazeia Team a688edc9ec Reduce talker KV_LEN 100→64: saves 148ms (RTF 1.31)
KV window of 64 is sufficient for ~70-token generation (10 prefill + 58 gen).
36% less KV memcpy per talker step (28L × 2 × 64×8×128 vs 28L × 2 × 100×8×128).

Generation: 3795ms → 3647ms, total: 6438ms → 6093ms
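
The 36% figure follows directly from the element counts. In this Python
sketch, reading 8 and 128 as KV heads and head dimension, and 2 as the K and
V tensors, is an assumption; the commit only gives the raw factors.

```python
LAYERS = 28      # talker layers
KV_TENSORS = 2   # K and V (assumed meaning of the "× 2")
HEADS = 8        # assumed: KV heads
HEAD_DIM = 128   # assumed: per-head dimension


def kv_elements(kv_len: int) -> int:
    """Elements copied per talker step for a given KV window length."""
    return LAYERS * KV_TENSORS * kv_len * HEADS * HEAD_DIM


before = kv_elements(100)
after = kv_elements(64)
saving = round(1 - after / before, 2)  # fraction of memcpy removed
```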

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:47:30 +02:00
Kazeia Team 4dcc4bb8b3 Fix KV buffer + revert HTP decoder (BigVGAN too complex for HTP)
- Restored intermediate KV buffer for talker (direct output→input
  caused trembling from buffer overwrite during execute())
- BigVGAN HTP compilation takes >5min, not viable
- RTF 1.35 with clean audio quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:37:50 +02:00
Kazeia Team e647911329 Shared Module C++ pipeline: RTF 1.6 with perfect quality
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)

Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s

Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations

From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:05:58 +02:00
Kazeia Team 38c0e9874a Disable C++ pipeline (QNN non-deterministic), keep Java RTF 1.8
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.

Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 11:42:49 +02:00
Kazeia Team 439629c9bf Revert "Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9)"
This reverts commit 0f027c5fde.
2026-04-09 11:03:52 +02:00
Kazeia Team 0f027c5fde Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9)
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.

Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:59:13 +02:00
Kazeia Team 8e536094df Fix C++ pipeline eos/pad + disable for quality (keep Java default)
- Fixed trailing embed handling (use pre-computed as-is)
- Added eos/pad embed params to nativeRun
- Improved C++ PRNG for sampling
- Disabled native pipeline: slight quality regression vs Java
  (two separate QNN instances give different numerical results)
- Java pipeline (RTF 1.8) kept as default for validated quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:53:19 +02:00
Kazeia Team 3b01302cfb Fix missing eos/pad embeddings in native C++ pipeline
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").
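
The corrected trailing schedule can be sketched as below, in Python for
illustration. The function name is hypothetical and the embeddings are
represented by string labels; only the order (text tokens, then tts_eos
once, then tts_pad) comes from the commit message.

```python
from typing import List, Sequence, TypeVar

E = TypeVar("E")


def trailing_input(step: int, text_embeds: Sequence[E], eos_embed: E, pad_embed: E) -> E:
    """Pick the trailing embed for a generation step."""
    if step < len(text_embeds):
        return text_embeds[step]  # remaining text tokens, one per step
    if step == len(text_embeds):
        return eos_embed          # tts_eos exactly once
    return pad_embed              # tts_pad for every later step


def schedule(n_steps: int, text_embeds: Sequence[E], eos_embed: E, pad_embed: E) -> List[E]:
    """Full schedule for n_steps, for inspection."""
    return [trailing_input(s, text_embeds, eos_embed, pad_embed) for s in range(n_steps)]


# String labels stand in for real embedding vectors.
demo = schedule(5, ["t0", "t1"], "eos", "pad")
```

The bug described above corresponds to returning a zero vector in the last
two branches, so the model never received the end-of-text signal.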

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:35:05 +02:00
Kazeia Team 393ce79eb5 Native C++ pipeline: RTF 1.4 (was 3.6 in Java)
Full talker+CP autoregressive loop in C++ via JNI.
Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:09:32 +02:00
Kazeia Team fb6045a635 Pre-load CP heads + GPU decoder test (reverted) + headArgmaxOffset
- Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag)
- Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU
- Added headArgmaxOffset for future batch optimization
- Cancel previous pipeline on new run_pipeline intent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:57:01 +02:00
Kazeia Team 6e6c562d53 Add DSP warmup + fix pipeline thread contention
- Warmup forward() for talker+CP during init (avoids 7s DSP compilation
  on first pipeline run)
- Cancel previous pipeline job before starting new one
- Use Dispatchers.IO for pipeline intent

First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:24:18 +02:00
Kazeia Team 8bfe6c7445 Add NEON SIMD heads argmax for CP — 2.3× speedup
CP head dot products (15 × 2048×1024) optimized with ARM NEON
vfmaq_f32 (4 accumulators, 16 floats/iteration).

CP/frame: 131ms → 58ms, total pipeline: 22.7s → 14.7s (RTF 3.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:55:20 +02:00
Kazeia Team 389ffa7c61 Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch
Full Qwen3-TTS-0.6B pipeline running on Snapdragon 8 Elite NPU:
  - Talker (28L) and Code Predictor (5L) as .pte on QNN HTP fp16
  - JNI integration, no root required
  - Validated audio quality: RTF 3.9

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:42:11 +02:00