kazeia

Kazeia Team	7f1a44c23d	TTS Stage 2: on-device voice-cloning TTS for arbitrary text	2026-04-13 10:12:09 +02:00

Removes the PC-side prepare_tts_segments.py dependency for day-to-day generation. The tablet now tokenizes, embeds, and voice-clones any French (or Qwen3-supported) text with no network and no ADB push per phrase. Quality matches Python's reference on "Bonjour, je suis Kazeia, je suis là pour vous écouter." (user validation: "impeccable").

Four pieces compose the path:

1. Qwen3BpeTokenizer.kt: byte-level BPE matching Qwen2/Qwen3's Python implementation bit-for-bit. UTF-8 + GPT-2 byte encoder, Qwen regex with \p{IsAlphabetic}/\p{IsDigit} (Android's regex lacks UNICODE_CHARACTER_CLASS; caught in testing). Produces token IDs identical to HF's Qwen2TokenizerFast on the test phrase:
   [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915, 4914, 9012, 90229, 2676, 13].

2. export_tts_text_embeddings.py: one-time PC export of:
   * Full projected text embeddings for the entire 151936-token vocab as fp16 (297 MB). Sanity check: live vs stored max abs diff 1.15e-4 on token 1043. Mmap'd on-device so it stays off the Java heap and leaves room for the 125 MB cp_embeddings alloc.
   * Damien voice PREFIX (9 × 1024 fp32): positions 0..8 of a Python voice-clone capture, text-invariant across segments.
   * Damien voice SUFFIX (2 × 1024 fp32): positions nP-2..nP-1 of the same capture. Also text-invariant (diff = 0.0 across 3 different-text segments). Without it the talker never sees "text ended" and decode falls into page/beg repetition.
   * Qwen3 tokenizer vocab.json + merges.txt.

3. Qwen3TtsEngine.kt:
   * mmap loader for the embeddings table + buffered fp16→fp32 lookup (halfToFloat covers subnormals/inf/NaN so pathological tokens don't become 0).
   * Stage 2 assets are detected at init; a missing file transparently falls back to the legacy 1050-token reduced-vocab path.
   * synthesizeTextStreaming(text, onSegmentReady), the new public API: sentence-split → BPE → build prefill as [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix] → runHexGenWithPrefill → decode each segment through the existing BigVGAN pipeline → callback. The prefill layout is the exact structure Python emits, verified bit-for-bit by matching captured Baer prefill positions against text_projection(tok) + codec_embedding(CODEC_PAD).
   * runHexGenWithPrefill: Hexagon prefill + interleaved CP decode loop. Feeds tts_eos once, tts_pad thereafter (same schedule as Python's voice_clone). A degeneracy guard stops generation when 9 identical cb0 appear in a row, which catches the rare "page beg beg beg" tail when EOS never fires. maxGen = ids.size * 4 + 10 is sized for the typical 3.3 codec frames per text token that Python produces.

   The prefill build uses the speaker's captured prefix/suffix rather than the legacy in-code buildPrefillEmbeddings, which puts only one text token in prefill. That structure mismatch produced garbled audio in the first attempt of this commit.

4. KazeiaService.kt: a new stream_text intent extra wires text input to synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer. First-audio latency on the "Bonjour..." test is ~23 s on Snapdragon 8 Elite (prefill + 74-token decode), vs 65 s for a 3-phrase sentence batch pre-streaming. Streaming and on-device text together unblock the MVP chat loop.

Known caveats:
* 297 MB on-device footprint for the embedding table. Acceptable on OnePlus Pad 3; can be quantized further (int8 per-row) if storage becomes tight.
* First init adds ~3 s for BPE vocab + merges load (151k × 2 hashmaps). Happens once per process.
* The maxGen cap means extremely long sentences may truncate. The sentence splitter already keeps segments ≤120 chars, so this hasn't been observed in practice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
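
Two numeric details in the commit message are easy to get wrong when porting: the fp16→fp32 lookup and the stopping rules of the decode loop. The Python sketch below illustrates both. It is not the Kotlin source, which is not shown on this page. The names half_to_float, max_gen, is_degenerate, and run_decode are hypothetical, as are the step callable and the eos_code parameter. The cap is read as ids.size * 4 + 10, an interpretation consistent with the 16-token test phrase and the 74-token decode quoted in the message.

```python
import math
import struct
from typing import Callable, List, Sequence


def half_to_float(h: int) -> float:
    """Decode one IEEE 754 binary16 bit pattern (0..0xFFFF) to a float."""
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0:
        # Zero or subnormal: no implicit leading 1, fixed exponent of -14,
        # so the value is frac * 2^-10 * 2^-14 = frac * 2^-24.
        return sign * frac * 2.0 ** -24
    if exp == 0x1F:
        # All-ones exponent: infinity if the fraction is zero, NaN otherwise.
        return sign * math.inf if frac == 0 else math.nan
    # Normal number: implicit leading 1 and an exponent bias of 15.
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)


def matches_reference(h: int) -> bool:
    """Cross-check against Python's built-in half-precision struct format."""
    ref = struct.unpack("<e", struct.pack("<H", h))[0]
    got = half_to_float(h)
    if math.isnan(ref):
        return math.isnan(got)
    # Compare the sign as well, so that -0.0 is told apart from +0.0.
    return got == ref and math.copysign(1.0, got) == math.copysign(1.0, ref)


def max_gen(num_text_tokens: int) -> int:
    """Generation cap, assuming the commit's formula is ids.size * 4 + 10."""
    return num_text_tokens * 4 + 10


def is_degenerate(cb0_history: Sequence[int], run_length: int = 9) -> bool:
    """True once the last run_length cb0 codes are all identical."""
    if len(cb0_history) < run_length:
        return False
    tail = cb0_history[-run_length:]
    return all(code == tail[0] for code in tail)


def run_decode(step: Callable[[int], int], num_text_tokens: int,
               eos_code: int) -> List[int]:
    """Collect cb0 codes until EOS, a degenerate tail, or the max_gen cap.

    step(i) stands in for one decode step of the real engine (hypothetical).
    """
    codes: List[int] = []
    for i in range(max_gen(num_text_tokens)):
        code = step(i)
        if code == eos_code:
            break
        codes.append(code)
        if is_degenerate(codes):
            break
    return codes
```

For the 16-token test phrase, max_gen(16) is 74, the same number as the decode length reported for the "Bonjour..." test. A real port might also trim the repeated tail before vocoding; the sketch only stops the loop.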
assets	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
java/com/kazeia	TTS Stage 2: on-device voice-cloning TTS for arbitrary text	2026-04-13 10:12:09 +02:00
jni	TTS tremor investigation: identify cross-arch numerical floor, gate diag flags	2026-04-13 00:15:14 +02:00
res	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00
AndroidManifest.xml	Initial commit: Kazeia TTS pipeline on NPU via ExecuTorch	2026-04-09 08:42:11 +02:00