2 Commits

Author SHA1 Message Date
Kazeia Team c25040a780 TTS: conditional tail-trim + export script accepts voice path arg
Two small changes:

  * export_tts_text_embeddings.py now takes the voice wav as an optional
    second CLI arg (defaults to damien_15s_24k.wav). Lets the same script
    capture voice-prefix+suffix for any speaker wav without editing the
    source — used today to test Elodie alongside Damien.

  * synthesizeTextStreaming + generateSegmentAudioVC now run
    trimTailLowEnergy only when n >= maxGen. The trim's 35%-of-peak
    threshold is tuned to catch "page beg beg" filler after the talker
    fails to emit EOS, but it was also cutting valid speech when EOS
    fired early (observed on Elodie seg 1: 10.08 s → 2.92 s, a 4-second
    over-trim). With the guard, the trim is a no-op on converging
    generations and only fires on the ~15% of segments that hit maxGen.
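
The guard in the second bullet can be sketched in Python as below. This is
an illustration only: trim_tail_low_energy is a hypothetical stand-in for
trimTailLowEnergy (the real Kotlin implementation is not shown in this
message), and the frame size of 240 samples is an assumption. Only the
35%-of-peak ratio and the n >= maxGen condition come from the text above.

```python
def trim_tail_low_energy(samples, frame=240, ratio=0.35):
    """Hypothetical stand-in: drop trailing frames whose peak amplitude
    stays below ratio * global peak. Frame size is an assumption."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return samples
    end = len(samples)
    while end > 0:
        start = max(0, end - frame)
        # Stop at the last frame that is still loud enough to keep.
        if max(abs(s) for s in samples[start:end]) >= ratio * peak:
            break
        end = start
    return samples[:end]


def maybe_trim(samples, n_tokens, max_gen):
    """Trim only when generation hit the cap, i.e. EOS never fired."""
    if n_tokens >= max_gen:
        return trim_tail_low_energy(samples)
    return samples  # converged generation: leave the audio untouched


# Loud speech followed by a quiet tail (synthetic data).
audio = [1.0] * 480 + [0.1] * 480
capped = maybe_trim(audio, n_tokens=126, max_gen=126)     # hit maxGen: trimmed
converged = maybe_trim(audio, n_tokens=105, max_gen=138)  # EOS fired: kept
```

With this shape the trim itself is unchanged; only the decision to call it
depends on whether the segment reached maxGen.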

Validation after the fix (Elodie, Baer monologue):
  - seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut,
           the filler tail)
  - seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is
  - seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is

Voice prefix/suffix values are speaker-invariant except prefix position 7
(the xvector). Confirmed by capturing both Damien and Elodie and diffing:
prefix positions 0-6 and 8 identical within 1e-8, suffix identical within
1e-8, only pos 7 has a different xvector embedding (norm 10.36 vs 10.12).
That means swapping speakers on-device is a 45 KB file push: no app
rebuild, no re-export of the 297 MB vocabulary table.
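
The two sizes quoted above can be checked with a few lines of Python. The
row counts, widths, and dtypes are taken from the Stage 2 commit message
(9 × 1024 fp32 prefix, 2 × 1024 fp32 suffix, 151936-token fp16 table).
That the files hold raw values with no header, and that the table rows are
also 1024 wide, are assumptions.

```python
HIDDEN = 1024        # embedding width, from the "9 × 1024 fp32" prefix shape
FP32_BYTES = 4
FP16_BYTES = 2
PREFIX_ROWS = 9      # voice prefix positions 0..8
SUFFIX_ROWS = 2      # voice suffix positions nP-2..nP-1
VOCAB = 151936       # vocabulary size of the exported table

# Per-speaker payload: prefix + suffix stored as fp32 (assumed headerless).
voice_bytes = (PREFIX_ROWS + SUFFIX_ROWS) * HIDDEN * FP32_BYTES

# Shared table: one fp16 row per vocabulary entry (row width assumed 1024).
table_bytes = VOCAB * HIDDEN * FP16_BYTES

print(voice_bytes)           # 45056
print(table_bytes / 2**20)   # 296.75
```

Under those assumptions the per-speaker data is about 45 kB and the table
is about 297 MiB, consistent with the figures in the message.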

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:32:33 +02:00
Kazeia Team 7f1a44c23d TTS Stage 2: on-device voice-cloning TTS for arbitrary text
Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any
French (or Qwen3-supported) text with no network, no ADB push per
phrase, and quality that matches Python's reference on "Bonjour, je
suis Kazeia, je suis là pour vous écouter." — user validation:
"impeccable".

Four pieces that compose the path:

  1. Qwen3BpeTokenizer.kt — byte-level BPE matching Qwen2/Qwen3's
     Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder,
     Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex
     lacks UNICODE_CHARACTER_CLASS — caught in testing). Produces
     identical token IDs to HF's Qwen2TokenizerFast on the test phrase:
     [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
      4914, 9012, 90229, 2676, 13].

  2. export_tts_text_embeddings.py — one-time PC export of:
     * Full projected text embeddings for the entire 151936-token vocab
       as fp16 (297 MB). Sanity check: live vs stored max abs diff
       1.15e-4 on token 1043. Mmap'd on-device so it stays off the
       Java heap and leaves room for the 125 MB cp_embeddings alloc.
     * Damien voice PREFIX (9 × 1024 fp32) — positions 0..8 of a
       Python voice-clone capture, text-invariant across segments.
     * Damien voice SUFFIX (2 × 1024 fp32) — positions nP-2..nP-1
       of the same capture. Also text-invariant (diff = 0.0 across
       3 different-text segments). Without it the talker never sees
       "text ended" and decode falls into page/beg repetition.
     * Qwen3 tokenizer vocab.json + merges.txt.

  3. Qwen3TtsEngine.kt:
     * mmap loader for the embeddings table + buffered fp16→fp32
       lookup (halfToFloat covers subnormals/inf/NaN so pathological
       tokens don't become 0).
     * Stage 2 assets are detected at init; a missing file transparently
       falls back to the legacy 1050-token reduced-vocab path.
     * synthesizeTextStreaming(text, onSegmentReady) — new public API:
       sentence-split → BPE → build prefill as
         [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
       (exact structure Python emits; verified bit-for-bit by matching
       captured Baer prefill positions against text_projection(tok)+
       codec_embedding(CODEC_PAD)) → runHexGenWithPrefill → decode
       each segment through the existing BigVGAN pipeline → callback.
     * runHexGenWithPrefill — Hexagon prefill + interleaved CP decode
       loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
       Python's voice_clone). A degeneracy guard stops decoding once 9
       identical cb0 codes appear in a row, which catches the rare
       "page beg beg beg" tail when EOS never fires. maxGen =
       ids.size*4 + 10 leaves headroom over the typical 3.3
       codec-frames-per-text-token that Python produces.
     * Prefill build uses the speaker's captured prefix/suffix rather
       than the legacy in-code buildPrefillEmbeddings that puts only
       one text token in prefill — the structure mismatch produced
       garbled audio in the first attempt of this commit.

  4. KazeiaService.kt: new stream_text intent extra wires text input
     to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.
     First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon
     8 Elite (prefill + 74-token decode), vs 65 s for a 3-phrase
     sentence batch pre-streaming. Streaming and on-device text
     together unblock the MVP chat loop.
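
Two of the mechanisms in piece 3 can be sketched in Python: the prefill
layout and the degeneracy guard. This is a rough illustration, not the
Kotlin code. The names build_prefill and is_degenerate are hypothetical,
embeddings are plain lists of floats, and reading "text_proj(id) +
codec_pad" as an element-wise sum per position is an assumption.

```python
def build_prefill(voice_prefix, text_proj, codec_pad, voice_suffix, ids):
    """[voice prefix] + [text_proj(id) + codec_pad] x N + [voice suffix].

    voice_prefix / voice_suffix: lists of embedding rows.
    text_proj: mapping from token id to its projected embedding row.
    codec_pad: the codec-pad embedding row added to every text position.
    """
    rows = [list(r) for r in voice_prefix]
    for tok in ids:
        # Assumed element-wise sum of the text and codec-pad embeddings.
        rows.append([t + p for t, p in zip(text_proj[tok], codec_pad)])
    rows.extend(list(r) for r in voice_suffix)
    return rows


def is_degenerate(cb0_history, run=9):
    """True once the last `run` cb0 codes are all identical."""
    if len(cb0_history) < run:
        return False
    tail = cb0_history[-run:]
    return all(c == tail[0] for c in tail)


# Toy data: width 4 instead of 1024, three text tokens.
prefix = [[0.0] * 4 for _ in range(9)]
suffix = [[9.0] * 4 for _ in range(2)]
proj = {0: [1.0] * 4, 1: [2.0] * 4, 2: [3.0] * 4}
pad = [0.5] * 4
prefill = build_prefill(prefix, proj, pad, suffix, ids=[0, 1, 2])
```

With a 9-row prefix and a 2-row suffix, the prefill for N text tokens has
9 + N + 2 positions, so the toy example above yields 14 rows.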

Known caveats:
  * 297 MB on-device footprint for the embedding table. Acceptable on
    OnePlus Pad 3; can be quantized further (int8 per-row) if storage
    becomes tight.
  * First init adds ~3 s for BPE vocab + merges load (151k × 2
    hashmaps). Happens once per process.
  * maxGen cap means extremely long sentences may truncate. The
    sentence splitter already keeps segments ≤120 chars so this
    hasn't been observed in practice.
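
For the maxGen caveat, the cap can be compared with the typical decode
length in a short Python sketch. The formula ids.size*4 + 10 and the 3.3
frames-per-token figure come from this message; the helper names and the
rounding up of the typical length are assumptions for illustration.

```python
import math


def max_gen(n_text_tokens: int) -> int:
    """Decode cap as stated in this message: ids.size*4 + 10."""
    return n_text_tokens * 4 + 10


def typical_frames(n_text_tokens: int, frames_per_token: float = 3.3) -> int:
    """Typical codec-frame count, rounded up (rounding is an assumption)."""
    return math.ceil(n_text_tokens * frames_per_token)


n = 16  # number of token IDs listed for the "Bonjour..." test phrase
print(max_gen(n))         # 74
print(typical_frames(n))  # 53
```

Truncation only becomes a risk when a segment needs more than 4 frames per
text token plus the 10-frame margin.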

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:12:09 +02:00