Commit Graph

29 Commits

Author SHA1 Message Date
Kazeia Team 2f07901ff3 TTS Stage 3: LLM stream → sentence split → TTS session → shared AudioTrack
Closes the loop on on-device conversational TTS. The LLM's token stream is
now consumed by a SentenceStreamer which fires a callback the moment a
terminal-punctuation boundary appears; each sentence is enqueued to a
persistent TTS streaming session that generates and plays audio through a
single shared AudioTrack. Sentence N's audio plays while sentence N+1 is
being generated on Hexagon+CP — no per-sentence AudioTrack init gap, and
no "wait for full response before hearing anything".

Mocked-LLM validation on the 3-sentence prompt:
  "Bonjour. Je suis là pour vous écouter. Comment allez-vous aujourd'hui."

  - First sentence detected:    1 ms
  - Seg 0 prefill (Hex):       567 ms
  - Seg 0 generated:         4 200 ms (18 tokens, 1.4 s audio)
  - Seg 1 generated:         9 100 ms (42 tokens)
  - Seg 2 generated:        11 000 ms (46 tokens)
  - Session closed:         33 500 ms (all audio drained)

Changes:

  * tts/SentenceStreamer.kt — 50-line helper that buffers tokens and
    fires onSentence when a "." "!" "?" ";" or "\n" appears. minChars = 4
    so "Oui." / "Bonjour." count as real sentences; higher thresholds
    swallowed conversational openers into the next segment and delayed
    first audio. flush() for the final partial sentence.

  * Qwen3TtsEngine.startStreamingSession / enqueueSentence / endStreamingSession
    triplet. startStreamingSession opens a 30-second MODE_STREAM
    AudioTrack plus a background worker coroutine that pulls sentences
    from an unlimited Channel. enqueueSentence is non-blocking; the worker
    serialises generation so audio order matches enqueue order.
    generateSegmentAudioVC is the per-sentence body (tokenize → prefill
    build → Hexagon gen → decode) without the WAV-save side effects that
    the /stream_text intent path does.

  * KazeiaService new intents:
      - stream_llm        : real LLM path (needs LLM loaded; currently the
                            debug build runs echo-mode so this path is
                            shipped but requires production config to
                            exercise).
      - stream_llm_mock   : fakes the LLM stream by splitting the given
                            text on spaces with 50 ms per "token" —
                            matches the ~20 tok/s rate the on-device LLM
                            produces and lets Stage 3 be validated without
                            flipping the LLM on.

Architectural notes:
  - AudioTrack buffer is 30 s so generation can run ahead of playback
    without blocking writes. RTF on Snapdragon 8 Elite is ~3 for short
    sentences, so for a 2-3 sentence response the buffer actually drains
    between segments and the user hears a short gap — expected, not a
    bug. Masking that gap requires RTF < 1 which is out of scope.
  - Hexagon KV is reset between sentences (hexReset) so the talker
    doesn't see stale context. Prefill observed cb0 = 1995 on every
    sentence that starts with a capital letter, matching the Python
    greedy reference — confirms prefill reconstruction is stable across
    segments within a session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:52:46 +02:00
Kazeia Team 7f1a44c23d TTS Stage 2: on-device voice-cloning TTS for arbitrary text
Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any
French (or Qwen3-supported) text with no network, no ADB push per
phrase, and quality that matches Python's reference on "Bonjour, je
suis Kazeia, je suis là pour vous écouter." — user validation:
"impeccable".

Three pieces that compose the path:

  1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's
     Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder,
     Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex
     lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces
     identical token IDs to HF's Qwen2TokenizerFast on the test phrase:
     [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
      4914, 9012, 90229, 2676, 13].

  2. export_tts_text_embeddings.py — one-time PC export of:
     * Full projected text embeddings for the entire 151936-token vocab
       as fp16 (297 MB). Sanity check: live vs stored max abs diff
       1.15e-4 on token 1043. Mmap'd on-device so it stays off the
       Java heap and leaves room for the 125 MB cp_embeddings alloc.
     * Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a
       Python voice-clone capture, text-invariant across segments.
     * Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1
       of the same capture. Also text-invariant (diff = 0.0 across
       3 different-text segments). Without it the talker never sees
       "text ended" and decode falls into page/beg repetition.
     * Qwen3 tokenizer vocab.json + merges.txt.

  3. Qwen3TtsEngine.kt:
     * mmap loader for the embeddings table + buffered fp16→fp32
       lookup (halfToFloat covers subnormals/inf/NaN so pathological
       tokens don't become 0).
     * Stage 2 assets detected at init; missing file transparently
       falls back to legacy 1050-token reduced-vocab path.
     * synthesizeTextStreaming(text, onSegmentReady) — new public API:
       sentence-split → BPE → build prefill as
         [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
       (exact structure Python emits; verified bit-for-bit by matching
       captured Baer prefill positions against text_projection(tok)+
       codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode
       each segment through the existing BigVGAN pipeline → callback.
     * runHexGenWithPrefill — Hexagon prefill + interleaved CP decode
       loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
       Python's voice_clone). Degeneracy guard stops when 9 identical
       cb0 in a row appear — catches the rare "page beg beg beg" tail
       when EOS never fires. maxGen = ids.size*4 + 10 matches the
       typical 3.3 codec-frames-per-text-token that Python produces.
     * Prefill build uses the speaker's captured prefix/suffix rather
       than the legacy in-code buildPrefillEmbeddings that puts only
       one text token in prefill — the structure mismatch produced
       garbled audio in the first attempt of this commit.

  4. KazeiaService.kt: new stream_text intent extra wires text input
     to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.
     First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon
     8 Elite (prefill + 74-token decode), vs a 3-phrase sentence batch
     that was 65 s pre-streaming — streaming + on-device text together
     unblock the MVP chat loop.

Known caveats:
  * 297 MB on-device footprint for the embedding table. Acceptable on
    OnePlus Pad 3; can be quantized further (int8 per-row) if storage
    becomes tight.
  * First init adds ~3 s for BPE vocab + merges load (151k × 2 hash-
    maps). Happens once per process.
  * maxGen cap means extremely long sentences may truncate. The
    sentence splitter already keeps segments ≤120 chars so this
    hasn't been observed in practice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:12:09 +02:00
Kazeia Team 5e416713ce TTS Stage 1 streaming: play each segment the moment it's decoded
Adds a streaming multi-segment pipeline on top of the Hexagon talker + ONNX
CP backend. First audio arrives at ~20s (vs ~65s for the full phrase
non-streamed) on the Baer 16.56s reference (3-segment split). Voice cloning
is preserved per segment because each segment now ships its own full prefill.

Changes:

  * Qwen3TtsEngine.generateFromEmbedsHexagonStreaming(path, onSegmentReady)
    reads single- or multi-segment embeds, runs prefill + generation + VQ
    decode + BigVGAN per segment, and fires the callback with each
    segment's ShortArray the moment it's ready. Saves per-segment WAVs
    (kazeia_stream_seg{N}.wav) plus the concatenated kazeia_stream_full.wav
    for offline inspection. Extracted the common generation loop into
    runHexSegmentFromEmbeds(prefill, trailing, idx) so single-segment and
    streaming paths share exactly the same code (no quality drift between
    modes). Added hexReset() between segments so segment 2's prefill logits
    don't contain segment 1's KV state.

  * vqDecode buffer overrun fix: when the talker samples CODEC_EOS as cb0
    it stores a vocab id > CODEBOOK_SIZE, which vqDecode then used as a
    codebook row index — reading past the 2048-row buffer. The short Baer
    probe never hit this; longer phrases do. Clamp any out-of-vocab code
    to 0 at allCodebooks build time.

  * KazeiaService: new stream_pipeline intent extra wires the callback
    to an AudioTrack MODE_STREAM instance, writing each segment's audio as
    soon as it comes back. Logs time-to-first-audio.

  * prepare_tts_segments.py: the previous version only captured 1-token
    decode calls and substituted a generic 9-embed "prefill_base" pulled
    from an unrelated single-segment file — dropping the per-segment
    xvector conditioning AND the text-encoded embeddings, so Hexagon
    produced garbled mixed speech for segments 2..N. Now captures the
    multi-token prefill call too (like prepare_tts_voiceclone.py) so each
    segment is self-contained.

Limitation (documented, not fixed in this commit): RTF ~4.4 > 1 on the
Snapdragon 8 Elite with current config means each segment takes longer to
generate than it takes to play, so audible gaps between segments remain.
Removing the gaps requires either (a) producer/consumer parallelism across
two coroutines (doesn't help if RTF stays > 1), or (b) faster CP (the
~180ms/step ONNX MLAS CP is the bottleneck; Hexagon HMX has a known NaN bug
and the .pte path contends with Hexagon talker on the DSP).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:43:30 +02:00
Kazeia Team de878ddf5c TTS tremor investigation: identify cross-arch numerical floor, gate diag flags
Extensive investigation of the audible "tremor" in the generated voice-cloned
audio. Conclusion is architectural, not a bug:

  * Hexagon HMX fp16 talker logits correlate with PyTorch fp32 at 0.999998
  * ONNX Runtime CP V2 is bit-identical to PyTorch greedy CP (0.24% residual
    divergence measured by injecting Python's captured cb0 at each step —
    14/16 codebooks match 100%, cb14/cb15 miss 1 token out of 53)
  * BigVGAN decoder is bit-identical to PyTorch (validated earlier)
  * Therefore the tremor is caused entirely by the ~28% of cb0 argmax flips
    where the tiny fp16 logits drift crosses the top-1/top-2 margin. This
    cascades through the autoregressive chain into a trajectory the model
    never saw at training time → incoherent artifacts.

Cross-architecture test (x86 AVX-512 / ARM64 NEON+HMX) cannot be zeroed by
any runtime swap — LibTorch Android would use NEON kernels with a different
reduction order than PyTorch x86, same class of error, smaller but non-zero
residual. Temperature tweaking (0.3 → 0.9) and greedy-vs-sample gave no
perceptual difference: the floor is numeric, not in the sampling layer.

Accepted for MVP. Documented in project_tts_cross_arch_limit.md — this is a
thesis-relevant finding about on-device TTS deployment limits.

Cleanup:
  * All diagnostic flags (force_inject_pycb0, force_greedy_cb0, cb0_temp,
    force_python_codes, force_cpu_talker, force_cpu_talker_gguf) now gated
    behind BuildConfig.DEBUG via diagFlag()/diagFile() helpers. Release
    builds JIT-eliminate the file checks; debug builds keep the whole
    experimental toolchain for re-running the analysis for demos/thesis.
  * force_hexagon + force_cp_v2 stay unconditional — production routing.
  * Prefill cb0 now respects force_greedy_cb0 (was always sampleTopK 0.9).
  * Native TTS pipeline (executorch-custom/jni_layer_tts.cpp,
    app/src/main/jni/tts_pipeline.cpp): pad-zone sampling switched to
    greedy argmax so EOS gets a fair chance (temp 0.9 top-k kept producing
    audio past EOS where Python's seeded sampler terminated naturally).
  * scripts/prepare_tts_voiceclone.py: new script that captures Python
    greedy-CP reference (stochastic talker for EOS, deterministic CP) for
    token-by-token comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 00:15:14 +02:00
Kazeia Team ee186e9049 Auto-segmentation for long texts + dynamic pipeline
- prepare_tts_native.py: auto-splits long text at sentence/comma
  boundaries, max 15 tokens per segment
- Multi-segment format: each segment gets fresh KV cache
- Formula: target_len = n_tokens × 3.2 + 5 per segment
- Tested on Edouard Baer monologue: 28 segments, 102s audio

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 00:08:59 +02:00
Kazeia Team 199bc4fbc9 Full native C++ TTS validated on short + long phrases
Dynamic formula: target_len = n_tokens × 3.2 + 5 (calibrated)
- Short "Bonjour..." (18 tokens → 62 trailing): OK
- Long "Je suis Kazeia... difficiles" (30 tokens → 101 trailing): OK

RMS trim disabled (garbage is loud, can't distinguish from speech).
Length controlled purely by maxTokens = trailing count.

Pipeline: prepare_tts_native.py "any text" → adb push → run → audio

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:51:05 +02:00
Kazeia Team dafbe2a52b FULL NATIVE C++ TTS pipeline — any text, perfect quality
The complete solution for native TTS on NPU:
1. Python: tokenize + text_projection only (30ms, no model generation)
2. File: golden prefill[0:9] + text_proj + eos padding (ratio 3.5×)
3. C++ shared Module: codec_sum(our codes) + trailing text/eos/pad
4. RMS-based auto-trim of trailing noise after speech ends

Key insights:
- Shared Module C++ uses SAME QNN compiled graph as Java → self-consistent
- codec_sum from our NPU codes is coherent (same model instance)
- Text tokens consumed 1:1, then eos padding for remaining steps
- RMS trim detects 15% energy drop from peak → cuts garbage

Validated "impeccable" by user on "Bonjour, je m'appelle Kazeia..."
prepare_tts_native.py works for ANY text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:39:06 +02:00
Kazeia Team 09d36f2025 Root cause found + on-device embed capture + KV=100 restored
Root cause: embeds must come from SAME NPU model instance.
Python fp32 embeds cause divergence on NPU fp16 after ~20 steps.

Solution: Java pipeline captures embeds on-device during generation.
Captured embeds work perfectly with C++ pipeline (validated "bon").

- Added capture mode: touch /data/local/tmp/kazeia/capture_mode
- Embeds saved to captured_embeds.bin (same format as pipeline input)
- KV_LEN restored to 100 (KV=64 lost role tokens → quality loss)
- C++ uses pre-computed embeds as-is (no double codec_sum)

Production path: Java pipeline RTF 1.8 for new texts (good quality)
Replay path: C++ pipeline RTF 1.26 with captured embeds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:00:37 +02:00
Kazeia Team 3dcf73aa38 Restore KV=100 + fix as-is embeds + multi-segment support
- KV_LEN restored to 100 (KV=64 caused quality loss from evicted role tokens)
- C++ uses pre-computed embeds as-is (no double codec_sum)
- Multi-segment format support in Kotlin (detects n_segments header)
- prepare_tts_segments.py: splits text + generates per-segment embeds
- Quality issue: Python-captured embeds differ from original working file
  (original was likely captured on-device, not from Python model.forward)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 22:26:20 +02:00
Kazeia Team 10a3904d7d Multi-segment TTS for long text: split → generate → concatenate
- prepare_tts_segments.py: splits text at sentence boundaries,
  generates Python pre-computed embeds per segment
- Kotlin: detects multi-segment file format, processes each segment
  independently (fresh KV cache), concatenates audio
- Long text tested: 3 segments, 335 tokens, 26.8s audio, RTF 1.67

File format: n_segments, then per segment: nPrefill, nTotal, embeds[]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:34:05 +02:00
Kazeia Team 24157c0a68 Fix: use pre-computed embeds as-is (no double codec_sum)
Pre-computed embeds from Python already contain codec_sum+text.
Using them as-is works correctly. After exhausted, fallback to
our codec_sum + pad.

Long text: 191 tokens, 15.28s audio, RTF 1.27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:10:23 +02:00
Kazeia Team f6df1738c5 Add prepare_tts_embeds.py for any text + codec_sum fix
- prepare_tts_embeds.py: generates pre-computed embeddings from any text
  via Python generate_voice_clone, capturing talker inputs
- C++ pipeline: always build codec_sum + trailing (not as-is)
- maxTokens: 4× trailing count (audio >> text tokens)
- Long text tested: 224 Python tokens → 125 NPU tokens (10s audio)
- Text-only embeds don't work (model needs Python pre-computed codec_sum)

Usage: python3 scripts/prepare_tts_embeds.py "Your text" output.bin
       adb push output.bin /data/local/tmp/.../full_pipeline_embeds.bin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:05:42 +02:00
Kazeia Team 173606dae7 Stable: decoder 8T optimization + restore pre-computed embeds
- BigVGAN: 8 threads (2757→1872ms), pre_conv/pre_transformer: 4 threads
- Restored pre-computed embeds format (codec_sum+text from Python)
- Text-only trailing embeds don't work: model needs codec_sum for EOS

For long phrases, pre-computed embeds must be generated from Python.
RTF 1.26 on short phrase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 13:42:02 +02:00
Kazeia Team 42bbb96fd8 Optimize decoder: BigVGAN 8T, small models 4T → RTF 1.26
BigVGAN benefits from 8 intra-op threads (all perf cores).
Pre_conv and pre_transformer kept at 4T (small, less contention).

BigVGAN: 2757ms → 1872ms (-885ms), decode total: 2830ms → 2035ms
Pipeline: 6438ms → 5834ms → RTF 1.26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 13:00:05 +02:00
Kazeia Team a688edc9ec Reduce talker KV_LEN 100→64: saves 148ms (RTF 1.31)
KV window of 64 sufficient for ~70 token generation (10 prefill + 58 gen).
36% less KV memcpy per talker step (28L × 2 × 64×8×128 vs 100×8×128).

Generation: 3795ms → 3647ms, total: 6438ms → 6093ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:47:30 +02:00
Kazeia Team 4dcc4bb8b3 Fix KV buffer + revert HTP decoder (BigVGAN too complex for HTP)
- Restored intermediate KV buffer for talker (direct output→input
  caused trembling from buffer overwrite during execute())
- BigVGAN HTP compilation takes >5min, not viable
- RTF 1.35 with clean audio quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:37:50 +02:00
Kazeia Team 985fd9cff9 Direct output→input KV copy: RTF 1.51 → 1.31
Skip intermediate KV buffer: copy output tensors directly into
next step's input pointers. Saves ~1.5GB/run of memcpy for talker
(28L × 2 × 100×8×128 floats × 58 steps) and CP similarly.

Generation: 4007ms → 3713ms, total: 7180ms → 6078ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:23:45 +02:00
Kazeia Team 14f7e5b05f Optimize CP+talker: eliminate prepare_input_tensors per step
Cache input tensor pointers after first prepare_input_tensors call,
then memcpy directly into them for all subsequent steps.

Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls).
Generation: 4640ms → 4007ms (-633ms), total RTF: 1.6 → 1.51

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:16:38 +02:00
Kazeia Team e647911329 Shared Module C++ pipeline: RTF 1.6 with perfect quality
Key breakthrough: C++ pipeline loop using the SAME Method* instances
that Java loaded (via Module::method("forward")). This gives:
- Same QNN compiled graph → identical numerical results → no trembling
- C++ loop → no Java Tensor/EValue allocation overhead
- prepare_input_tensors + memcpy + Method::execute (like cp_et_runner)

Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s

Added to executorch JNI:
- Module.nativeSetCpModule() — registers CP module for pipeline
- Module.nativeRunTtsPipeline(...) — runs full talker+CP loop in C++
- Updated executorch.jar with new native method declarations

From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:05:58 +02:00
Kazeia Team 38c0e9874a Disable C++ pipeline (QNN non-deterministic), keep Java RTF 1.8
Root cause found: QNN HTP level=1 compilation is not bitwise
deterministic. Two loads of the same .pte produce slightly different
hidden states → audible trembling in decoded speech.

Java pipeline uses single QNN instance → no trembling, validated quality.
C++ pipeline code preserved for future use when QNN context caching
is fixed (would make both loads use same compiled graph).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 11:42:49 +02:00
Kazeia Team 439629c9bf Revert "Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9)"
This reverts commit 0f027c5fde.
2026-04-09 11:03:52 +02:00
Kazeia Team 0f027c5fde Pre-allocate Tensor/EValue in Java pipeline: 16s → 8.9s (RTF 1.9)
Reuse float arrays and Tensor/EValue objects across talker steps
instead of creating new ones each iteration. Eliminates ~7s of
GC overhead from thousands of JNI object allocations.

Same validated audio quality as before, no C++ pipeline needed.
Talker 35ms/step, CP 58ms/step, total 8.9s for 4.64s audio.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:59:13 +02:00
Kazeia Team 8e536094df Fix C++ pipeline eos/pad + disable for quality (keep Java default)
- Fixed trailing embed handling (use pre-computed as-is)
- Added eos/pad embed params to nativeRun
- Improved C++ PRNG for sampling
- Disabled native pipeline: slight quality regression vs Java
  (two separate QNN instances give different numerical results)
- Java pipeline (RTF 1.8) kept as default for validated quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:53:19 +02:00
Kazeia Team 3b01302cfb Fix missing eos/pad embeddings in native C++ pipeline
The native pipeline was adding zeros after trailing text tokens
instead of tts_eos_embed then tts_pad_embed. This caused the model
to mispronounce final words (e.g. "développement" → "devopment").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:35:05 +02:00
Kazeia Team 393ce79eb5 Native C++ pipeline: RTF 1.4 (was 3.6 in Java)
Full talker+CP autoregressive loop in C++ via JNI.
Talker 20ms/step, CP 44ms/step, total 6.6s for 4.64s audio.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 10:09:32 +02:00
Kazeia Team fb6045a635 Pre-load CP heads + GPU decoder test (reverted) + headArgmaxOffset
- Pre-load all 15 CP heads at first CP call (eliminates lazy-load lag)
- Tested BigVGAN on GPU Adreno: no gain (+300ms vs CPU), kept on CPU
- Added headArgmaxOffset for future batch optimization
- Cancel previous pipeline on new run_pipeline intent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:57:01 +02:00
Kazeia Team 6e6c562d53 Add DSP warmup + fix pipeline thread contention
- Warmup forward() for talker+CP during init (avoids 7s DSP compilation
  on first pipeline run)
- Cancel previous pipeline job before starting new one
- Use Dispatchers.IO for pipeline intent

First run after warmup: talker 19ms/step, CP 59ms/step → RTF ~1.9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:24:18 +02:00
Kazeia Team 8bfe6c7445 Add NEON SIMD heads argmax for CP — 2.3× speedup
CP head dot products (15 × 2048×1024) optimized with ARM NEON
vfmaq_f32 (4 accumulators, 16 floats/iteration).

CP/frame: 131ms → 58ms, total pipeline: 22.7s → 14.7s (RTF 3.2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:55:20 +02:00
Kazeia Team 389ffa7c61 Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch
Full Qwen3-TTS-0.6B pipeline running on Snapdragon 8 Elite NPU:
  - Talker (28L) and Code Predictor (5L) as .pte on QNN HTP fp16
  - JNI integration, no root required
  - Validated audio quality: RTF 3.9

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:42:11 +02:00