kazeia/kazeia-android/app/src/main
Kazeia Team 7f1a44c23d TTS Stage 2: on-device voice-cloning TTS for arbitrary text
Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any
French (or Qwen3-supported) text with no network, no ADB push per
phrase, and output quality matches the Python reference on "Bonjour,
je suis Kazeia, je suis là pour vous écouter." (user validation:
"impeccable").

Four pieces compose the path:

  1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's
     Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder,
     Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex
     lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces
     identical token IDs to HF's Qwen2TokenizerFast on the test phrase:
     [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
      4914, 9012, 90229, 2676, 13].
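
The byte-encoder step mentioned above can be sketched in Python (the
reference side of this commit) rather than Kotlin. This is the
well-known GPT-2 byte-to-character table, not the code from
Qwen3BpeTokenizer.kt; the names bytes_to_unicode, BYTE_ENCODER and
encode_bytes are illustrative assumptions.

```python
def bytes_to_unicode():
    """GPT-2 style table: every byte value gets a printable code point."""
    # Bytes that are already printable, non-space characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    next_free = 0
    # Remaining bytes (control chars, space, etc.) are shifted to 256 and up.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + next_free)
            next_free += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_ENCODER = bytes_to_unicode()


def encode_bytes(text):
    """UTF-8 encode the text, then map each byte through the table."""
    return "".join(BYTE_ENCODER[b] for b in text.encode("utf-8"))
```

BPE merges then operate on these mapped characters, so accented
French letters such as "é" (two UTF-8 bytes) enter the merge loop as
two symbols.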

  2. export_tts_text_embeddings.py — one-time PC export of:
     * Full projected text embeddings for the entire 151936-token vocab
       as fp16 (297 MB). Sanity check: live vs stored max abs diff
       1.15e-4 on token 1043. Mmap'd on-device so it stays off the
       Java heap and leaves room for the 125 MB cp_embeddings alloc.
     * Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a
       Python voice-clone capture, text-invariant across segments.
     * Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1
       of the same capture. Also text-invariant (diff = 0.0 across
       3 different-text segments). Without it the talker never sees
       "text ended" and decode falls into page/beg repetition.
     * Qwen3 tokenizer vocab.json + merges.txt.
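
The sizes quoted above can be checked with a little arithmetic. A
minimal sketch, assuming the projected embedding width is 1024 (the
same width as the 9 x 1024 prefix) and that the table is stored
row-major with no header; row_offset and split_capture are
hypothetical helpers, not functions from the export script.

```python
VOCAB_SIZE = 151936  # from the commit message
HIDDEN_DIM = 1024    # assumption: same width as the 9 x 1024 voice prefix
FP16_BYTES = 2


def table_bytes(vocab=VOCAB_SIZE, dim=HIDDEN_DIM, elem=FP16_BYTES):
    """Total size of the full-vocab embedding table in bytes."""
    return vocab * dim * elem


def row_offset(token_id, dim=HIDDEN_DIM, elem=FP16_BYTES):
    """Byte offset of one token's row, assuming row-major layout, no header."""
    return token_id * dim * elem


def split_capture(capture, n_prefix=9, n_suffix=2):
    """Split a captured prefill into (prefix, text part, suffix).

    The prefix is positions 0..8 and the suffix is positions nP-2..nP-1,
    as described in the commit message.
    """
    if len(capture) < n_prefix + n_suffix:
        raise ValueError("capture is shorter than prefix + suffix")
    end = len(capture) - n_suffix
    return capture[:n_prefix], capture[n_prefix:end], capture[end:]


if __name__ == "__main__":
    print(table_bytes() / 2**20, "MiB")
```

Under these assumptions the table is 151936 x 1024 x 2 bytes, i.e.
296.75 MiB, consistent with the 297 MB figure.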

  3. Qwen3TtsEngine.kt:
     * mmap loader for the embeddings table + buffered fp16→fp32
       lookup (halfToFloat covers subnormals/inf/NaN so pathological
       tokens don't become 0).
     * Stage 2 assets detected at init; missing file transparently
       falls back to legacy 1050-token reduced-vocab path.
     * synthesizeTextStreaming(text, onSegmentReady) — new public API:
       sentence-split → BPE → build prefill as
         [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
       (exact structure Python emits; verified bit-for-bit by matching
       captured Baer prefill positions against text_projection(tok)+
       codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode
       each segment through the existing BigVGAN pipeline → callback.
     * runHexGenWithPrefill — Hexagon prefill + interleaved CP decode
       loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
        Python's voice_clone). Degeneracy guard stops when 9 identical
        cb0 in a row appear, which catches the rare "page beg beg beg"
        tail when EOS never fires. maxGen = ids.size*4 + 10 leaves
        headroom over the typical ~3.3 codec frames per text token
        that Python produces.
     * Prefill build uses the speaker's captured prefix/suffix rather
       than the legacy in-code buildPrefillEmbeddings that puts only
       one text token in prefill — the structure mismatch produced
       garbled audio in the first attempt of this commit.
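
The fp16→fp32 lookup above depends on a half-precision decoder that
handles every bit pattern. The following is a Python sketch of such a
conversion, not the halfToFloat from Qwen3TtsEngine.kt; it follows
the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa
bits), and matches_reference is a hypothetical helper that compares it
against Python's struct module.

```python
import math
import struct


def half_to_float(h):
    """Decode a 16-bit IEEE 754 half-precision bit pattern."""
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exponent = (h >> 10) & 0x1F
    mantissa = h & 0x3FF
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, value = mantissa * 2^-24.
        return sign * mantissa * 2.0 ** -24
    if exponent == 0x1F:
        # All-ones exponent: infinity if mantissa is 0, otherwise NaN.
        return sign * math.inf if mantissa == 0 else math.nan
    # Normal number: implicit leading 1, exponent bias 15.
    return sign * (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)


def matches_reference(h):
    """Compare against struct's own binary16 decoding ('e' format)."""
    ref = struct.unpack("<e", struct.pack("<H", h))[0]
    got = half_to_float(h)
    return (math.isnan(ref) and math.isnan(got)) or ref == got
```

A decoder that flushed the exponent-0 case to zero would silently zero
out small-magnitude embedding entries, which is the failure the commit
message says it avoids.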

  4. KazeiaService.kt: new stream_text intent extra wires text input
     to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.
     First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon
     8 Elite (prefill + 74-token decode), versus 65 s for a 3-phrase
     sentence batch before streaming. Streaming and on-device text
     together unblock the MVP chat loop.
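
The prefill layout and the stopping rules described for
synthesizeTextStreaming and runHexGenWithPrefill can be sketched as
follows. This is an illustrative Python model, not the Kotlin
implementation: text_proj, codec_pad, next_cb0 and eos_id are
hypothetical stand-ins for the real embedding lookup, the CODEC_PAD
embedding, one decode step and the codec EOS id, and whether the real
loop keeps the repeated frames is an assumption.

```python
def build_prefill(prefix, suffix, token_ids, text_proj, codec_pad):
    """[voice prefix] + [text_proj(id) + codec_pad] x N + [voice suffix]."""
    body = []
    for tok in token_ids:
        projected = text_proj(tok)
        # Element-wise sum of the text embedding and the codec pad embedding.
        body.append([t + c for t, c in zip(projected, codec_pad)])
    return list(prefix) + body + list(suffix)


def text_input_for_step(step_index):
    """Text-side input during decode: tts_eos once, tts_pad thereafter."""
    return "tts_eos" if step_index == 0 else "tts_pad"


def max_gen(num_text_tokens):
    """Frame cap from the commit message: ids.size * 4 + 10."""
    return num_text_tokens * 4 + 10


def run_decode(next_cb0, num_text_tokens, eos_id, repeat_limit=9):
    """Collect cb0 codes until EOS, the frame cap, or a degenerate run."""
    frames = []
    run_length = 0
    for _ in range(max_gen(num_text_tokens)):
        cb0 = next_cb0()
        if cb0 == eos_id:
            break
        run_length = run_length + 1 if frames and frames[-1] == cb0 else 1
        frames.append(cb0)
        if run_length >= repeat_limit:
            # Degeneracy guard: 9 identical cb0 codes in a row.
            break
    return frames
```

For the 16 token IDs of the test phrase, the prefill has 9 + 16 + 2 =
27 positions and the frame cap is 16 * 4 + 10 = 74.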

Known caveats:
  * 297 MB on-device footprint for the embedding table. Acceptable on
    OnePlus Pad 3; can be quantized further (int8 per-row) if storage
    becomes tight.
  * First init adds ~3 s for BPE vocab + merges load (two hash maps
    of ~151k entries each). Happens once per process.
  * maxGen cap means extremely long sentences may truncate. The
    sentence splitter already keeps segments ≤120 chars so this
    hasn't been observed in practice.
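
The int8 per-row option mentioned in the first caveat could look
roughly like the sketch below. It assumes symmetric quantization with
one fp32 scale per row, which the commit message does not specify;
quantize_row, dequantize_row and int8_table_bytes are hypothetical
names, and the 1024 width is the same assumption as for the fp16
table.

```python
def quantize_row(row):
    """Symmetric int8 quantization of one embedding row.

    Returns (scale, codes) with codes in [-127, 127].
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return scale, codes


def dequantize_row(scale, codes):
    """Recover approximate float values from int8 codes."""
    return [scale * c for c in codes]


def int8_table_bytes(vocab=151936, dim=1024, scale_bytes=4):
    """Storage for int8 rows plus one fp32 scale per row (assumed layout)."""
    return vocab * (dim + scale_bytes)
```

With this layout the table would take roughly half the fp16 footprint,
at the cost of a per-element rounding error of at most half a scale
step.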

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:12:09 +02:00
assets               Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch                       2026-04-09 08:42:11 +02:00
java/com/kazeia      TTS Stage 2: on-device voice-cloning TTS for arbitrary text                     2026-04-13 10:12:09 +02:00
jni                  TTS tremor investigation: identify cross-arch numerical floor, gate diag flags  2026-04-13 00:15:14 +02:00
res                  Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch                       2026-04-09 08:42:11 +02:00
AndroidManifest.xml  Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch                       2026-04-09 08:42:11 +02:00