TTS Stage 2: on-device voice-cloning TTS for arbitrary text

Removes the PC-side prepare_tts_segments.py dependency for day-to-day
generation. The tablet now tokenizes, embeds, and voice-clones any French
(or Qwen3-supported) text with no network and no ADB push per phrase.
Quality matches Python's reference on "Bonjour, je suis Kazeia, je suis
là pour vous écouter." (user validation: "impeccable").

Four pieces compose the path:

1. Qwen3BpeTokenizer.kt: byte-level BPE matching Qwen2/Qwen3's Python
   implementation bit-for-bit. UTF-8 + GPT-2 byte encoder, Qwen regex
   with \p{IsAlphabetic}/\p{IsDigit} (Android's regex lacks
   UNICODE_CHARACTER_CLASS; caught in testing). Produces identical token
   IDs to HF's Qwen2TokenizerFast on the test phrase:
   [81581, 11, 4759, 35631, 730, 9832, 685, 11, 4759, 35631, 37915,
   4914, 9012, 90229, 2676, 13].

2. export_tts_text_embeddings.py: one-time PC export of:
   * Full projected text embeddings for the entire 151936-token vocab
     as fp16 (297 MB). Sanity check: live vs stored max abs diff
     1.15e-4 on token 1043. Mmap'd on-device so it stays off the Java
     heap and leaves room for the 125 MB cp_embeddings alloc.
   * Damien voice PREFIX (9 × 1024 fp32): positions 0..8 of a Python
     voice-clone capture, text-invariant across segments.
   * Damien voice SUFFIX (2 × 1024 fp32): positions nP-2..nP-1 of the
     same capture. Also text-invariant (diff = 0.0 across 3
     different-text segments). Without it the talker never sees "text
     ended" and decode falls into page/beg repetition.
   * Qwen3 tokenizer vocab.json + merges.txt.

3. Qwen3TtsEngine.kt:
   * mmap loader for the embeddings table + buffered fp16→fp32 lookup
     (halfToFloat covers subnormals/inf/NaN so pathological tokens
     don't become 0).
   * Stage 2 assets detected at init; a missing file transparently
     falls back to the legacy 1050-token reduced-vocab path.
   * synthesizeTextStreaming(text, onSegmentReady), the new public API:
     sentence-split → BPE → build prefill as
     [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix]
     (exact structure Python emits; verified bit-for-bit by matching
     captured Baer prefill positions against
     text_projection(tok) + codec_embedding(CODEC_PAD))
     → runHexGenWithPrefill → decode each segment through the existing
     BigVGAN pipeline → callback.
   * runHexGenWithPrefill: Hexagon prefill + interleaved CP decode
     loop. Feeds tts_eos once, tts_pad thereafter (same schedule as
     Python's voice_clone). Degeneracy guard stops when 9 identical cb0
     in a row appear, which catches the rare "page beg beg beg" tail
     when EOS never fires. maxGen = ids.size*4 + 10 leaves headroom
     over the typical 3.3 codec frames per text token that Python
     produces.
   * Prefill build uses the speaker's captured prefix/suffix rather
     than the legacy in-code buildPrefillEmbeddings, which puts only
     one text token in prefill. That structure mismatch produced
     garbled audio in the first attempt of this commit.

4. KazeiaService.kt: new stream_text intent extra wires text input to
   synthesizeTextStreaming with an AudioTrack MODE_STREAM consumer.

First-audio latency on the "Bonjour..." test: ~23 s on Snapdragon 8
Elite (prefill + 74-token decode), vs a 3-phrase sentence batch that was
65 s pre-streaming. Streaming + on-device text together unblock the MVP
chat loop.

Known caveats:
* 297 MB on-device footprint for the embedding table. Acceptable on
  OnePlus Pad 3; can be quantized further (int8 per-row) if storage
  becomes tight.
* First init adds ~3 s for BPE vocab + merges load (151k × 2
  hash-maps). Happens once per process.
* maxGen cap means extremely long sentences may truncate. The sentence
  splitter already keeps segments ≤120 chars so this hasn't been
  observed in practice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
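The prefill layout and the maxGen budget described in the commit message can be sketched in isolation. This is a minimal illustration, not the engine's code: buildPrefill, addVec, and maxGenFor are hypothetical names, the textProj lambda stands in for the fp16 table lookup, and the vectors hold dummy values. Only the layout (9 prefix rows, one summed row per token, 2 suffix rows) and the formula ids.size*4 + 10 come from the message.

```kotlin
// Element-wise sum of two embedding rows of equal length.
fun addVec(a: FloatArray, b: FloatArray): FloatArray =
    FloatArray(a.size) { a[it] + b[it] }

// Layout from the commit message:
// [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix].
// All parameter names are illustrative assumptions.
fun buildPrefill(
    prefix: List<FloatArray>,        // captured voice-prefix rows (9 in the commit)
    suffix: List<FloatArray>,        // captured voice-suffix rows (2 in the commit)
    tokenIds: IntArray,              // BPE token IDs for one segment
    textProj: (Int) -> FloatArray,   // stand-in for the projected text-embedding lookup
    codecPad: FloatArray,            // stand-in for codec_embedding(CODEC_PAD)
): List<FloatArray> {
    val out = ArrayList<FloatArray>(prefix.size + tokenIds.size + suffix.size)
    out.addAll(prefix)
    for (id in tokenIds) out.add(addVec(textProj(id), codecPad))
    out.addAll(suffix)
    return out
}

// Generation budget quoted in the commit message.
fun maxGenFor(tokenCount: Int): Int = tokenCount * 4 + 10

fun main() {
    val dim = 1024
    val zero = FloatArray(dim)
    // The 16 token IDs listed for the "Bonjour..." test phrase.
    val ids = intArrayOf(
        81581, 11, 4759, 35631, 730, 9832, 685, 11,
        4759, 35631, 37915, 4914, 9012, 90229, 2676, 13,
    )
    val prefill = buildPrefill(
        prefix = List(9) { zero },
        suffix = List(2) { zero },
        tokenIds = ids,
        textProj = { _ -> FloatArray(dim) { 1f } },   // dummy embedding
        codecPad = FloatArray(dim) { 0.5f },          // dummy pad embedding
    )
    println(prefill.size)          // 9 + 16 + 2 = 27
    println(maxGenFor(ids.size))   // 16 * 4 + 10 = 74
}
```

For the 16-token test phrase the budget works out to 74 steps, the same figure as the 74-token decode quoted in the latency note.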
2026-04-13 10:12:09 +02:00 · 2026-04-13 10:12:09 +02:00 · 7f1a44c23d
parent 5e416713ce
commit 7f1a44c23d
4 changed files with 759 additions and 1 deletions
--- a/kazeia-android/app/src/main/java/com/kazeia/service/KazeiaService.kt
+++ b/kazeia-android/app/src/main/java/com/kazeia/service/KazeiaService.kt
@@ -123,6 +123,47 @@ class KazeiaService : Service() {
                // Audio is played by the TTS engine internally
            }
        }
+        intent?.getStringExtra("stream_text")?.let { text ->
+            // Stage 2 streaming from arbitrary text: BPE tokenize on-device,
+            // look up embeds in the full Qwen3 vocab, run the existing
+            // interleaved Hexagon generation loop, and play each segment
+            // as soon as it's decoded. No PC-side prep required.
+            log("Stream text: '${text.take(60)}${if (text.length>60) "..." else ""}'")
+            serviceScope.launch {
+                try {
+                    val qwenTts = tts as? com.kazeia.tts.Qwen3TtsEngine ?: return@launch
+                    val sr = 24000
+                    val track = android.media.AudioTrack.Builder()
+                        .setAudioAttributes(android.media.AudioAttributes.Builder()
+                            .setUsage(android.media.AudioAttributes.USAGE_MEDIA)
+                            .setContentType(android.media.AudioAttributes.CONTENT_TYPE_SPEECH)
+                            .build())
+                        .setAudioFormat(android.media.AudioFormat.Builder()
+                            .setEncoding(android.media.AudioFormat.ENCODING_PCM_16BIT)
+                            .setSampleRate(sr)
+                            .setChannelMask(android.media.AudioFormat.CHANNEL_OUT_MONO)
+                            .build())
+                        .setBufferSizeInBytes(sr * 4)
+                        .setTransferMode(android.media.AudioTrack.MODE_STREAM)
+                        .build()
+                    track.play()
+                    val tStart = System.currentTimeMillis()
+                    var firstLogged = false
+                    qwenTts.synthesizeTextStreaming(text) { segIdx, audio ->
+                        if (!firstLogged) {
+                            log("First audio out at ${System.currentTimeMillis() - tStart}ms (seg ${segIdx+1})")
+                            firstLogged = true
+                        }
+                        track.write(audio, 0, audio.size)
+                    }
+                    track.stop(); track.release()
+                    log("Stream text done at ${System.currentTimeMillis() - tStart}ms")
+                } catch (e: Exception) {
+                    log("Stream text error: ${e.message}")
+                    e.printStackTrace()
+                }
+            }
+        }
        intent?.getStringExtra("stream_pipeline")?.let { embedsPath ->
            // Stage 1 streaming pipeline: generate segment-by-segment and play each
            // segment the moment it's ready via an AudioTrack MODE_STREAM. First audio
--- a/kazeia-android/app/src/main/java/com/kazeia/tts/Qwen3BpeTokenizer.kt
+++ b/kazeia-android/app/src/main/java/com/kazeia/tts/Qwen3BpeTokenizer.kt
@@ -0,0 +1,219 @@
+package com.kazeia.tts
+
+import android.util.Log
+import org.json.JSONObject
+import java.io.File
+import java.util.regex.Pattern
+
+/**
+ * Byte-level BPE tokenizer compatible with Qwen2/Qwen3 tokenizer.
+ *
+ * Why this file exists:
+ *   The app needs to tokenize arbitrary LLM-generated French text on-device
+ *   for the TTS pipeline (the reduced 1050-token table shipped previously
+ *   couldn't handle free-form text). The reference is the HuggingFace
+ *   Qwen2TokenizerFast — a GPT-2-style byte-level BPE. We reimplement it
+ *   in Kotlin rather than linking libtokenizers.so so there's no new
+ *   native dependency and the numbers are easy to audit.
+ *
+ * Algorithm (identical to GPT-2 / Qwen2):
+ *   1. Pre-tokenize the input with a regex that groups contractions,
+ *      words, numbers, and whitespace into "word chunks" that BPE never
+ *      merges across.
+ *   2. UTF-8 encode each chunk, then map each byte (0..255) to one of
+ *      256 printable Unicode code points via a fixed "byte encoder" table
+ *      — this is the trick that lets a byte-level vocab fit inside JSON
+ *      without invalid or control characters.
+ *   3. Apply BPE: repeatedly find the pair with the lowest rank in the
+ *      merges list and merge it, until no more merges apply. Look up each
+ *      resulting super-token in vocab.json to get the final token IDs.
+ *
+ * Bit-perfect compatibility notes:
+ *   - The pre-tokenize regex below MUST match the one in tokenizer_config.json.
+ *     Qwen2 uses the GPT-2 pattern with a couple of Unicode property
+ *     extensions; Java regex supports these directly.
+ *   - Byte encoder is canonical (see bytesToUnicode()).
+ *   - Merges are rank-ordered: lower line number = higher priority, matching
+ *     HuggingFace's `merges.txt` file ordering.
+ *   - We do NOT add BOS/EOS or chat-template specials — the TTS prefill
+ *     prepends its own 9-embed voice prefix that already handles role tokens.
+ */
+class Qwen3BpeTokenizer private constructor(
+    private val vocab: HashMap<String, Int>,
+    private val merges: HashMap<Pair<String, String>, Int>,
+    private val byteEncoder: IntArray,
+) {
+    companion object {
+        private const val TAG = "Qwen3BPE"
+
+        // Qwen2/Qwen3 pre-tokenization regex, adapted from the pattern used
+        // by the HuggingFace Qwen2Tokenizer (itself adapted from GPT-2). Matches
+        // contractions ('s|'d|'ll|'ve|'re|'t), word chunks with a leading
+        // optional non-word char + letters, runs of up to 3 digits, runs of
+        // punctuation, and whitespace runs with boundary semantics.
+        //
+        // Note on the character classes: Android's Pattern does NOT support
+        // UNICODE_CHARACTER_CLASS — so plain \\p{L} / \\p{N} would collapse
+        // to ASCII-only and break French accents ("é" → wrong token). Use
+        // \\p{IsAlphabetic} and \\p{IsDigit} instead; those ARE Unicode-aware
+        // out of the box (they map to ICU's IsAlphabetic / IsDigit properties
+        // in both JDK and Android runtimes). Output matches Python's Qwen2
+        // tokenizer on French text.
+        private val PRE_TOKENIZE_PATTERN: Pattern = Pattern.compile(
+            "'s|'d|'ll|'ve|'re|'t" +
+            "|[^\\r\\n\\p{IsAlphabetic}\\p{IsDigit}]?\\p{IsAlphabetic}+" +
+            "|\\p{IsDigit}{1,3}" +
+            "| ?[^\\s\\p{IsAlphabetic}\\p{IsDigit}]+[\\r\\n]*" +
+            "|\\s*[\\r\\n]+" +
+            "|\\s+(?!\\S)" +
+            "|\\s+"
+        )
+
+        /**
+         * GPT-2 byte encoder: maps 0..255 → a printable Unicode codepoint.
+         * Ensures every possible byte has a visible, JSON-safe representation
+         * so a byte-level vocab can be stored as strings in vocab.json.
+         */
+        private fun bytesToUnicode(): IntArray {
+            val bs = mutableListOf<Int>()
+            // Printable ASCII and common Latin blocks.
+            bs.addAll(('!'.code..'~'.code).toList())
+            bs.addAll(('¡'.code..'¬'.code).toList())
+            bs.addAll(('®'.code..'ÿ'.code).toList())
+            val cs = bs.toMutableList()
+            // Every byte not in bs gets mapped to a code point past 255 so no
+            // existing character collides with it.
+            var n = 0
+            val map = IntArray(256)
+            for (b in 0..255) {
+                if (b in bs) {
+                    map[b] = b
+                } else {
+                    map[b] = 256 + n
+                    bs.add(b)  // placeholder only, we've already recorded cs from the frozen snapshot
+                    cs.add(256 + n)
+                    n += 1
+                }
+            }
+            return map
+        }
+
+        fun load(modelDir: String): Qwen3BpeTokenizer {
+            val t0 = System.currentTimeMillis()
+            val vocabFile = File("$modelDir/vocab.json")
+            val mergesFile = File("$modelDir/merges.txt")
+            require(vocabFile.exists()) { "vocab.json missing at $modelDir" }
+            require(mergesFile.exists()) { "merges.txt missing at $modelDir" }
+
+            val vocabJson = JSONObject(vocabFile.readText())
+            val vocab = HashMap<String, Int>(vocabJson.length())
+            val keys = vocabJson.keys()
+            while (keys.hasNext()) {
+                val k = keys.next()
+                vocab[k] = vocabJson.getInt(k)
+            }
+
+            val merges = HashMap<Pair<String, String>, Int>()
+            var rank = 0
+            mergesFile.useLines { lines ->
+                for (line in lines) {
+                    // Skip header / blanks. Qwen's merges.txt starts with
+                    // "#version" which we simply filter out.
+                    if (line.isBlank() || line.startsWith("#")) continue
+                    val sp = line.indexOf(' ')
+                    if (sp < 0) continue
+                    merges[Pair(line.substring(0, sp), line.substring(sp + 1))] = rank
+                    rank++
+                }
+            }
+
+            Log.i(TAG, "Loaded vocab=${vocab.size} merges=${merges.size} in ${System.currentTimeMillis()-t0}ms")
+            return Qwen3BpeTokenizer(vocab, merges, bytesToUnicode())
+        }
+    }
+
+    /**
+     * Convert a single pre-tokenized word (UTF-8 bytes encoded via the byte
+     * encoder into a string) into token IDs via BPE merges. Caches results so
+     * repeated common words (spaces, punctuation) only BPE once.
+     */
+    private val bpeCache = HashMap<String, IntArray>()
+
+    private fun bpeEncode(byteEncodedWord: String): IntArray {
+        bpeCache[byteEncodedWord]?.let { return it }
+
+        // Start with one "sub-token" per Unicode code point (code points,
+        // not chars — surrogate pairs are handled automatically since the
+        // byte encoder only produces BMP codepoints by construction).
+        val parts = ArrayList<String>(byteEncodedWord.length)
+        var i = 0
+        while (i < byteEncodedWord.length) {
+            val cp = byteEncodedWord.codePointAt(i)
+            parts.add(String(Character.toChars(cp)))
+            i += Character.charCount(cp)
+        }
+        if (parts.size < 2) {
+            val id = vocab[parts.getOrElse(0) { "" }] ?: vocab["<unk>"] ?: 0
+            val out = intArrayOf(id)
+            bpeCache[byteEncodedWord] = out
+            return out
+        }
+
+        // Greedy lowest-rank merge, classic BPE. We scan for the pair with
+        // the smallest rank, merge ALL its occurrences, then repeat. This
+        // matches HF's reference implementation.
+        while (parts.size > 1) {
+            var bestRank = Int.MAX_VALUE
+            var bestIdx = -1
+            for (k in 0 until parts.size - 1) {
+                val r = merges[Pair(parts[k], parts[k + 1])] ?: continue
+                if (r < bestRank) { bestRank = r; bestIdx = k }
+            }
+            if (bestIdx < 0) break
+            // Merge all non-overlapping occurrences of that exact pair.
+            val a = parts[bestIdx]; val b = parts[bestIdx + 1]
+            val merged = a + b
+            var k = 0
+            val out = ArrayList<String>(parts.size - 1)
+            while (k < parts.size) {
+                if (k < parts.size - 1 && parts[k] == a && parts[k + 1] == b) {
+                    out.add(merged); k += 2
+                } else {
+                    out.add(parts[k]); k += 1
+                }
+            }
+            parts.clear(); parts.addAll(out)
+        }
+
+        val ids = IntArray(parts.size)
+        for (k in parts.indices) {
+            ids[k] = vocab[parts[k]] ?: vocab["<unk>"] ?: 0
+        }
+        bpeCache[byteEncodedWord] = ids
+        return ids
+    }
+
+    /**
+     * Encode text → token IDs using the full Qwen3 vocabulary. Does NOT
+     * prepend BOS/EOS — callers can add specials themselves. Unicode
+     * characters outside ASCII (e.g. French accents) are UTF-8 encoded and
+     * go through the byte encoder, so "é" and "ï" tokenize the same way as
+     * they do in Python.
+     */
+    fun encode(text: String): IntArray {
+        val all = ArrayList<Int>(text.length / 2 + 4)
+        val matcher = PRE_TOKENIZE_PATTERN.matcher(text)
+        while (matcher.find()) {
+            val chunk = matcher.group()
+            // UTF-8 encode the chunk, then map each raw byte to its Unicode
+            // "byte-encoded" character. This produces the exact string that
+            // BPE merges operate on.
+            val bytes = chunk.toByteArray(Charsets.UTF_8)
+            val sb = StringBuilder(bytes.size)
+            for (b in bytes) sb.appendCodePoint(byteEncoder[b.toInt() and 0xff])
+            val ids = bpeEncode(sb.toString())
+            for (id in ids) all.add(id)
+        }
+        return all.toIntArray()
+    }
+}
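The byte-encoder step used by the tokenizer above can be restated as a compact standalone sketch with a worked example. The function names byteEncoderTable and byteEncode are hypothetical and not part of the commit. The mapping is the standard GPT-2 one: printable bytes map to themselves, and every other byte takes the next free code point from 256 upward.

```kotlin
// GPT-2 style byte encoder: bytes that are already printable keep their own
// code point; all other bytes get 256, 257, ... in ascending byte order.
fun byteEncoderTable(): IntArray {
    val printable = (('!'.code..'~'.code) + (0xA1..0xAC) + (0xAE..0xFF)).toSet()
    val table = IntArray(256)
    var next = 0
    for (b in 0..255) {
        table[b] = if (b in printable) b else 256 + next++
    }
    return table
}

// UTF-8 encode a pre-tokenized chunk, then map each byte through the table.
// The result is the string that BPE merges operate on.
fun byteEncode(chunk: String, table: IntArray): String {
    val sb = StringBuilder()
    for (b in chunk.toByteArray(Charsets.UTF_8)) {
        sb.appendCodePoint(table[b.toInt() and 0xff])
    }
    return sb.toString()
}

fun main() {
    val table = byteEncoderTable()
    // Space is byte 0x20, the 33rd non-printable byte, so it maps to 256 + 32.
    println(table[0x20] == 0x120)      // true: U+0120 is 'Ġ'
    // "é" is C3 A9 in UTF-8; both bytes are in the printable ranges.
    println(byteEncode(" é", table))   // ĠÃ©
}
```

This is why vocab.json entries for words with a leading space start with 'Ġ', and why accented French letters appear as two-character sequences such as "Ã©" in the merges file.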
--- a/kazeia-android/app/src/main/java/com/kazeia/tts/Qwen3TtsEngine.kt
+++ b/kazeia-android/app/src/main/java/com/kazeia/tts/Qwen3TtsEngine.kt
@@ -100,8 +100,21 @@ class Qwen3TtsEngine(
    private var decoderOnGpu: Boolean = false

    // Dual embedding tables for talker input
-    private var textEmbeds: FloatArray? = null     // [1050, 1024] - pre-projected text embeddings
+    private var textEmbeds: FloatArray? = null     // [1050, 1024] - reduced vocab, legacy fallback
    private var codecEmbedding: FloatArray? = null  // [3072, 1024] - codec/control token embeddings
+
+    // Stage 2 — on-device full-vocab text embeddings + BPE tokenizer.
+    // textEmbedsFull is 151936 × 1024 fp16 memory-mapped (~296 MB); using
+    // mmap keeps the bytes off the Java heap so the app doesn't crash when
+    // the ~125 MB cp_embeddings allocation comes next. damienVoicePrefix is
+    // the fixed 9-embed voice-cloning header that is prepended to the
+    // tokenized text to form a full prefill.
+    private var textEmbedsFullBuf: java.nio.ByteBuffer? = null
+    private var textEmbedsFullChan: java.nio.channels.FileChannel? = null
+    private val textEmbedsFullLen = 151936
+    private var damienVoicePrefix: Array<FloatArray>? = null
+    private var damienVoiceSuffix: Array<FloatArray>? = null
+    private var bpeTokenizer: Qwen3BpeTokenizer? = null
    private var ttsBosEmbed: FloatArray? = null     // [1024] - tts_bos text-side embedding
    private var ttsEosEmbed: FloatArray? = null     // [1024] - tts_eos text-side embedding
    private var ttsPadEmbed: FloatArray? = null     // [1024] - tts_pad text-side embedding
@@ -457,6 +470,72 @@ class Qwen3TtsEngine(
                // Load dual embedding tables for talker
                textEmbeds = loadNpy("$path/text_embeds_projected.npy")
                nlog("Text embeddings: ${textEmbeds!!.size / TALKER_DIM} × $TALKER_DIM")
+
+                // Stage 2 assets: full-vocab text embeddings + voice prefix + BPE.
+                // All three are optional — if any is missing we fall back to the
+                // legacy 1050-token path. This keeps the app bootable during
+                // asset rollout and avoids turning a missing file into a crash.
+                try {
+                    val fullEmbFile = File("$path/text_embeds_full_fp16.bin")
+                    val prefixFile = File("$path/damien_voice_prefix.bin")
+                    val tokDir = File("$path/qwen3_tokenizer")
+                    if (fullEmbFile.exists() && prefixFile.exists() && tokDir.isDirectory) {
+                        val tE0 = System.currentTimeMillis()
+                        // Memory-map the fp16 embeddings table instead of heap-
+                        // loading it. Without mmap, the 296 MB ByteArray plus
+                        // the 125 MB cp_embeddings FloatArray loaded a few
+                        // lines below overrun the ~536 MB large-heap limit and
+                        // the app OOMs during init. mmap pages the file via
+                        // the kernel and keeps zero bytes on the Java heap.
+                        val expectedBytes = textEmbedsFullLen.toLong() * TALKER_DIM * 2
+                        if (fullEmbFile.length() != expectedBytes) {
+                            nlog("text_embeds_full_fp16 size mismatch (got ${fullEmbFile.length()}, expected $expectedBytes) — disabling on-device text")
+                        } else {
+                            textEmbedsFullChan = java.io.RandomAccessFile(fullEmbFile, "r").channel
+                            textEmbedsFullBuf = textEmbedsFullChan!!.map(
+                                java.nio.channels.FileChannel.MapMode.READ_ONLY, 0L, expectedBytes
+                            ).order(ByteOrder.LITTLE_ENDIAN)
+                            nlog("Full-vocab text embeddings mmap: ${textEmbedsFullLen} × $TALKER_DIM fp16 (${expectedBytes/1024/1024}MB, off-heap) in ${System.currentTimeMillis()-tE0}ms")
+                        }
+
+                        val pb = ByteBuffer.wrap(prefixFile.readBytes()).order(ByteOrder.LITTLE_ENDIAN)
+                        val nPref = pb.int; val dimPref = pb.int
+                        if (nPref == 9 && dimPref == TALKER_DIM) {
+                            damienVoicePrefix = Array(9) { FloatArray(TALKER_DIM).also { arr -> for (j in 0 until TALKER_DIM) arr[j] = pb.float } }
+                            nlog("Damien voice prefix: $nPref × $dimPref")
+                        } else {
+                            nlog("damien_voice_prefix.bin header mismatch ($nPref × $dimPref) — disabling on-device text")
+                        }
+
+                        // Voice SUFFIX — the 2 fixed positions that close out
+                        // the prefill after text tokens. Empirically invariant
+                        // across segments of the same speaker (diff = 0.0).
+                        val suffixFile = File("$path/damien_voice_suffix.bin")
+                        if (suffixFile.exists()) {
+                            val sb = ByteBuffer.wrap(suffixFile.readBytes()).order(ByteOrder.LITTLE_ENDIAN)
+                            val nSuf = sb.int; val dimSuf = sb.int
+                            if (nSuf == 2 && dimSuf == TALKER_DIM) {
+                                damienVoiceSuffix = Array(2) { FloatArray(TALKER_DIM).also { arr -> for (j in 0 until TALKER_DIM) arr[j] = sb.float } }
+                                nlog("Damien voice suffix: $nSuf × $dimSuf")
+                            }
+                        } else {
+                            nlog("damien_voice_suffix.bin missing — on-device text will lack closure markers")
+                        }
+
+                        if (textEmbedsFullBuf != null && damienVoicePrefix != null && damienVoiceSuffix != null) {
+                            bpeTokenizer = Qwen3BpeTokenizer.load(tokDir.absolutePath)
+                            nlog("Stage 2 on-device text path ready (BPE + full embeds + voice prefix)")
+                        }
+                    } else {
+                        nlog("Stage 2 assets not fully present (full=$fullEmbFile, prefix=$prefixFile, tok=$tokDir) — legacy path only")
+                    }
+                } catch (e: Exception) {
+                    nlog("Stage 2 asset load failed: ${e.message} — legacy path only")
+                    textEmbedsFullBuf = null; damienVoicePrefix = null; damienVoiceSuffix = null; bpeTokenizer = null
+                    try { textEmbedsFullChan?.close() } catch (_: Exception) {}
+                    textEmbedsFullChan = null
+                }
+
                codecEmbedding = loadNpy("$path/codec_embedding.npy")
                nlog("Codec embedding: ${codecEmbedding!!.size / TALKER_DIM} × $TALKER_DIM")
                val ttsSpecial = loadNpy("$path/tts_special_embeds.npy") // [3, 1024] = bos, eos, pad
@@ -3183,6 +3262,269 @@ class Qwen3TtsEngine(
        return concat
    }

+    /**
+     * Look up a single token in the fp16 full-vocab text-embedding table and
+     * return it as fp32. Uses direct ByteBuffer arithmetic so we don't
+     * allocate a new buffer per token — for a typical 50-token sentence the
+     * inner loop runs 50 × 1024 fp16→fp32 conversions.
+     */
+    private fun textEmbFromFull(tokenId: Int): FloatArray {
+        val buf = textEmbedsFullBuf ?: error("Stage 2 full embeddings not loaded")
+        val clamped = tokenId.coerceIn(0, textEmbedsFullLen - 1)
+        val base = clamped * TALKER_DIM * 2
+        val out = FloatArray(TALKER_DIM)
+        synchronized(buf) {
+            // MappedByteBuffer has mutable position; guard in case two
+            // coroutines ever race on tokenizer output concurrently.
+            buf.position(base)
+            for (j in 0 until TALKER_DIM) {
+                val bits = buf.short
+                out[j] = halfToFloat(bits)
+            }
+        }
+        return out
+    }
+
+    /** IEEE 754 fp16 -> fp32 conversion. Handles subnormals, inf and NaN
+     *  the same way a PyTorch `tensor.to(torch.float32)` upcast does. */
+    private fun halfToFloat(h: Short): Float {
+        val bits = h.toInt() and 0xffff
+        val sign = (bits ushr 15) and 0x1
+        val exp = (bits ushr 10) and 0x1f
+        val mant = bits and 0x3ff
+        val f32: Int = when {
+            exp == 0 && mant == 0 -> sign shl 31
+            exp == 0 -> {
+                // Subnormal: normalize by shifting until leading 1 appears.
+                var e = -14; var m = mant
+                while ((m and 0x400) == 0) { m = m shl 1; e -= 1 }
+                val e32 = e + 127
+                (sign shl 31) or (e32 shl 23) or ((m and 0x3ff) shl 13)
+            }
+            exp == 0x1f -> (sign shl 31) or (0xff shl 23) or (mant shl 13)  // inf / NaN
+            else -> (sign shl 31) or ((exp - 15 + 127) shl 23) or (mant shl 13)
+        }
+        return Float.fromBits(f32)
+    }
+
+    /**
+     * Run the Hexagon talker + CP generation loop with a fully pre-built
+     * prefill (voice prefix + all text tokens). Same decode recipe as
+     * runInterleavedHexagon's inner loop: at each step the next talker
+     * input is codecSum (codec embedding of previous codes) + tts_pad
+     * (since all text has already been consumed in the prefill). On EOS,
+     * terminate early. Returns the generated [step, codebook] codes.
+     */
+    private fun runHexGenWithPrefill(prefill: List<FloatArray>, maxGen: Int): Array<IntArray> {
+        val padE = ttsPadEmbed ?: return emptyArray()
+        val eosE = ttsEosEmbed ?: return emptyArray()
+        val allCodes = mutableListOf<IntArray>()
+        val generatedCb0 = mutableListOf<Int>()
+        var totalTalkerMs = 0L; var totalCpMs = 0L
+
+        val tPrefill = System.currentTimeMillis()
+        val prefillResults = hexForward(prefill)
+        nlog("VC prefill (Hex): ${System.currentTimeMillis() - tPrefill}ms, ${prefillResults.size} steps")
+        if (prefillResults.isEmpty()) return emptyArray()
+
+        var pastHidden = prefillResults.last().first
+        val prefillLogits = prefillResults.last().second
+        for (j in CODEBOOK_SIZE until TALKER_VOCAB) { if (j != CODEC_EOS) prefillLogits[j] = Float.NEGATIVE_INFINITY }
+        var currentCb0 = sampleTopK(prefillLogits, 0.9f, 50)
+        nlog("VC prefill done: first cb0=$currentCb0")
+
+        // After the text has been fully consumed in prefill, Python's voice-
+        // clone loop feeds tts_eos once, then tts_pad for every subsequent
+        // decode step. We follow the same schedule so the model's attention
+        // sees the same "text exhausted" signal it was trained with.
+        var eosFedOnce = false
+        for (genStep in 0 until maxGen) {
+            val codes = IntArray(NUM_CODEBOOKS); codes[0] = currentCb0
+            val tCp = System.currentTimeMillis()
+            val cpCodes = runCodePredictorInterleaved(pastHidden, currentCb0)
+            for (cb in 1 until NUM_CODEBOOKS) codes[cb] = cpCodes[cb - 1]
+            allCodes.add(codes); generatedCb0.add(currentCb0)
+            totalCpMs += System.currentTimeMillis() - tCp
+
+            val codecSum = FloatArray(TALKER_DIM)
+            addEmb(codecSum, codecEmb(codes[0]))
+            for (cb in 1 until NUM_CODEBOOKS) addEmb(codecSum, cpEmb(cb - 1, codes[cb]))
+            val nextEmbed = if (!eosFedOnce) { eosFedOnce = true; sumEmb(codecSum, eosE) } else sumEmb(codecSum, padE)
+
+            val tT = System.currentTimeMillis()
+            val results = hexForward(listOf(nextEmbed))
+            totalTalkerMs += System.currentTimeMillis() - tT
+            if (results.isEmpty()) { nlog("VC: hex empty at step ${genStep+1}"); break }
+            pastHidden = results[0].first
+            val logits = results[0].second
+            for (j in CODEBOOK_SIZE until TALKER_VOCAB) { if (j != CODEC_EOS) logits[j] = Float.NEGATIVE_INFINITY }
+            val seen = HashSet<Int>(); for (prev in generatedCb0) seen.add(prev)
+            for (tok in seen) { logits[tok] = if (logits[tok] > 0) logits[tok] / 1.05f else logits[tok] * 1.05f }
+            currentCb0 = sampleTopK(logits, 0.9f, 50)
+            if (currentCb0 == CODEC_EOS) { nlog("VC EOS at step ${genStep+1}"); break }
+
+            // Degeneracy guard: when the talker fails to emit EOS it falls
+            // into a stuck loop where cb0 repeats (the "page beg beg beg"
+            // artifact audible at the tail of generated phrases). The
+            // nine-repeat threshold is the one the native .pte pipeline
+            // uses too. Here the guard fires when the last 9 entries of
+            // generatedCb0 and the freshly sampled cb0 are all identical.
+            val nHist = generatedCb0.size
+            if (nHist >= 9) {
+                val last = generatedCb0[nHist - 1]
+                var allSame = true
+                for (i in nHist - 9 until nHist) if (generatedCb0[i] != last) { allSame = false; break }
+                if (allSame && currentCb0 == last) {
+                    nlog("VC degen: cb0=$last repeated ≥9× at step ${genStep+1}, stopping")
+                    break
+                }
+            }
+        }
+        nlog("VC gen: ${allCodes.size} tokens | Talker(HEX): ${totalTalkerMs}ms | CP: ${totalCpMs}ms")
+        return allCodes.toTypedArray()
+    }
+
+    /**
+     * Split text into short segments for the streaming pipeline. Reproduces
+     * the behaviour of scripts/prepare_tts_segments.py but on-device so the
+     * tablet doesn't depend on a PC-side preprocessor. The 120-character
+     * target matches the length where the talker still terminates reliably
+     * on EOS; longer segments risk the auto-regressor's repetition penalty
+     * cutting decode short of the full phrase.
+     */
+    private fun splitSentences(text: String, maxChars: Int = 120): List<String> {
+        val first = text.trim().split(Regex("(?<=[.!?;:])\\s+"))
+        val out = mutableListOf<String>()
+        for (part in first) {
+            if (part.length <= maxChars) {
+                if (part.isNotBlank()) out.add(part.trim())
+                continue
+            }
+            // Break overlong sentences at commas, greedily packing sub-parts
+            // back together up to maxChars so we don't over-split.
+            val subs = part.split(Regex("(?<=,)\\s+"))
+            var current = ""
+            for (s in subs) {
+                if (current.isNotEmpty() && current.length + s.length > maxChars) {
+                    out.add(current.trim()); current = s
+                } else {
+                    current = if (current.isEmpty()) s else "$current $s"
+                }
+            }
+            if (current.isNotBlank()) out.add(current.trim())
+        }
+        return if (out.isEmpty()) listOf(text) else out
+    }
+
+    /**
+     * Tokenize `text` on-device and stream the synthesized audio segment by
+     * segment via `onSegmentReady`. The PC-side prep script becomes optional —
+     * this is the first path in the app where TTS runs fully offline on
+     * arbitrary LLM output.
+     *
+     * For each segment:
+     *   1. BPE tokenize the segment text (same algorithm as Python's
+     *      Qwen2Tokenizer).
+     *   2. Look up each token ID in the fp16 full-vocab table, converted to
+     *      fp32 on the fly, and add the CODEC_PAD codec embedding. One
+     *      embedding per token.
+     *   3. Build the prefill as [voice prefix] + [per-token embeddings] +
+     *      [voice suffix] and run it through runHexGenWithPrefill, capped
+     *      at maxGen codec frames.
+     *   4. Decode codes → audio via decodeChunked, emit the audio through
+     *      the callback immediately, and save one WAV per segment. The
+     *      concatenated WAV is saved after the last segment.
+     *
+     * Emits at most one callback per segment. First-audio latency ≈ prefill +
+     * one segment's decode (typically ~17-22s for a 5-6s-duration segment on
+     * Snapdragon 8 Elite). The ordering and gaps between segments are the
+     * same as generateFromEmbedsHexagonStreaming.
+     */
+    fun synthesizeTextStreaming(
+        text: String,
+        onSegmentReady: ((segIdx: Int, audio: ShortArray) -> Unit)? = null
+    ): ShortArray {
+        if (!loaded || !useHexagonTalker) {
+            nlog("synthesizeTextStreaming: Hexagon talker not ready"); return ShortArray(0)
+        }
+        if (bpeTokenizer == null || textEmbedsFullBuf == null ||
+            damienVoicePrefix == null || damienVoiceSuffix == null) {
+            nlog("synthesizeTextStreaming: Stage 2 assets missing"); return ShortArray(0)
+        }
+        val segments = splitSentences(text)
+        nlog("synthesizeTextStreaming: ${segments.size} segment(s) for ${text.length} chars")
+
+        hexReset()
+        val segmentAudios = mutableListOf<ShortArray>()
+        val gapSamples = SR * 120 / 1000
+        val gap = ShortArray(gapSamples)
+        val t0 = System.currentTimeMillis()
+
+        val prefix = damienVoicePrefix!!
+        val suffix = damienVoiceSuffix!!
+        // CODEC_PAD embedding is the per-token companion that Python voice-
+        // cloning sums into every text-encoded prefill position. Computed
+        // once here so the per-token loop stays a simple vector add.
+        val codecPadEmb = codecEmb(CODEC_PAD)
+
+        for ((segIdx, segText) in segments.withIndex()) {
+            if (segIdx > 0) hexReset()
+            val tSeg = System.currentTimeMillis()
+            val ids = bpeTokenizer!!.encode(segText)
+            nlog("Seg ${segIdx+1}/${segments.size}: '${segText.take(60)}' → ${ids.size} tokens: ${ids.toList()}")
+
+            // Voice-cloning prefill, fully reconstructed on-device — exact
+            // structure Python emits via generate_voice_clone (verified
+            // bit-for-bit by comparing captured Baer segments):
+            //   [0..8]     damienVoicePrefix (9 fixed positions, xvector@7)
+            //   [9..N-3]   text_projection(BPE_id) + codec_embedding(CODEC_PAD)
+            //   [N-2, N-1] damienVoiceSuffix (2 fixed positions, end-of-text marker)
+            // The earlier attempt skipped the suffix and used raw text
+            // projections, which produced garbled audio — the talker needs
+            // BOTH the per-token codec_pad sum AND the closure markers to
+            // know that text input has ended and decoding can begin.
+            val prefill = ArrayList<FloatArray>(prefix.size + ids.size + suffix.size)
+            for (e in prefix) prefill.add(e)
+            for (id in ids) prefill.add(sumEmb(textEmbFromFull(id), codecPadEmb))
+            for (e in suffix) prefill.add(e)
+
+            // Empirical budget: Python's voice_clone typically emits ~3.3
+            // codec frames per text token for French. Keep a small cushion
+            // so ~80% of runs terminate via EOS/degeneracy before exhausting
+            // the budget; trimming is done by the degeneracy guard inside
+            // runHexGenWithPrefill. Too-generous maxGen guarantees the tail
+            // artifacts the user hears as "page beg beg beg".
+            val maxGen = minOf(ids.size * 4 + 10, MAX_CONTEXT - 15)
+            val codes = runHexGenWithPrefill(prefill, maxGen)
+            if (codes.isEmpty()) { nlog("Seg ${segIdx+1}: empty codes"); continue }
+
+            val n = codes.size
+            val padLen = maxOf(n, SEQ_LEN)
+            val codebooks = Array(NUM_CODEBOOKS) { cb ->
+                IntArray(padLen) { t ->
+                    if (t < n) { val v = codes[t][cb]; if (v in 0 until CODEBOOK_SIZE) v else 0 } else 0
+                }
+            }
+            val audio = decodeChunked(codebooks, n)
+            val segMs = System.currentTimeMillis() - tSeg
+            nlog("Seg ${segIdx+1}/${segments.size}: $n tokens, ${audio.size/SR.toFloat()}s audio in ${segMs}ms")
+
+            segmentAudios.add(audio)
+            saveWav("/data/local/tmp/kazeia/kazeia_stream_seg${segIdx+1}.wav", audio)
+            onSegmentReady?.invoke(segIdx, audio)
+        }
+
+        if (segmentAudios.isEmpty()) return ShortArray(0)
+        val total = segmentAudios.sumOf { it.size } + maxOf(0, segmentAudios.size - 1) * gapSamples
+        val concat = ShortArray(total)
+        var off = 0
+        for ((i, s) in segmentAudios.withIndex()) {
+            System.arraycopy(s, 0, concat, off, s.size); off += s.size
+            if (i < segmentAudios.size - 1) { System.arraycopy(gap, 0, concat, off, gapSamples); off += gapSamples }
+        }
+        saveWav("/data/local/tmp/kazeia/kazeia_stream_full.wav", concat)
+        nlog("synthesizeTextStreaming total: ${System.currentTimeMillis() - t0}ms for ${concat.size/SR.toFloat()}s")
+        return concat
+    }
+
    /** Write PCM16 mono audio to a WAV file. Used by the streaming pipeline to
     *  save one file per segment plus the concatenated result for inspection. */
    private fun saveWav(path: String, audio: ShortArray) {
--- a/scripts/export_tts_text_embeddings.py
+++ b/scripts/export_tts_text_embeddings.py
@@ -0,0 +1,156 @@
+#!/usr/bin/env python3
+"""
+Export everything the tablet needs to build TTS prefill embeds for arbitrary
+LLM text, offline, without talking to a PC.
+
+Outputs (pushed to /data/local/tmp/kazeia/models/qwen3-tts-npu/):
+  - text_embeds_full_fp16.bin    : 151936 × 1024 fp16 ≈ 311 MB (≈ 297 MiB)
+       Pre-projected text embeddings for the full Qwen3 vocab. Per-token
+       lookup on-device replaces a lookup + FC1 + SiLU + FC2 + bias. Same
+       numbers PyTorch produces for text_projection(text_embedding(id)).
+
+  - damien_voice_prefix.bin      : 9 × 1024 fp32 = 36 KB, after an 8-byte int32 (rows, dim) header
+       The fixed voice-cloning prefix (positions 0..8) for speaker Damien,
+       captured from a real voice-clone run. Positions 0..6 = role/control
+       tokens, position 7 = xvector (L2 norm ~10), position 8 = trailing
+       voice-marker. Same for every phrase uttered by this speaker, so we
+       capture once here and reuse indefinitely on-device.
+
+  - damien_voice_suffix.bin      : 2 × 1024 fp32 = 8 KB, after an 8-byte int32 (rows, dim) header
+       The fixed voice-cloning SUFFIX (last 2 positions of the prefill)
+       that Python emits AFTER the text tokens. Verified bit-identical
+       across segments of different texts → invariant closure marker
+       for the voice-clone conditioning. Without it the talker misreads
+       the end of the text and produces garbled output.
+
+  - qwen3_tokenizer/             : tokenizer files copied from HF snapshot
+       tokenizer.json, vocab.json, merges.txt, tokenizer_config.json,
+       special_tokens_map.json (any file absent from the snapshot is
+       skipped). Kotlin BPE implementation reads vocab + merges at init.
+
+The combination lets the tablet build, for any text, the exact same
+prefill tensor PyTorch would build, bit-for-bit at fp16 — which is
+what our Hexagon talker consumes anyway.
+
+Usage:
+    python3 export_tts_text_embeddings.py [output_dir]
+"""
+import sys, os, struct, shutil, warnings
+os.chdir("/tmp")
+warnings.filterwarnings("ignore")
+
+OUTPUT_DIR = sys.argv[1] if len(sys.argv) > 1 else "/tmp/kazeia_tts_export"
+MODEL = "/home/alf/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-0.6B-Base/snapshots/5d83992436eae1d760afd27aff78a71d676296fc"
+VOICE = "/opt/Kazeia/voix/damien_15s_24k.wav"
+
+os.makedirs(OUTPUT_DIR, exist_ok=True)
+os.makedirs(f"{OUTPUT_DIR}/qwen3_tokenizer", exist_ok=True)
+
+import torch, numpy as np
+from qwen_tts import Qwen3TTSModel
+
+print("Loading Qwen3-TTS model (~30s, CPU)...")
+tts = Qwen3TTSModel.from_pretrained(MODEL, local_files_only=True, device_map="cpu")
+talker = tts.model.talker
+
+# ---- 1. Full projected text embeddings ----
+# Evaluate text_projection(text_embedding.weight) for EVERY vocab entry.
+# Batching keeps peak memory bounded; fp32 matmul then fp16 store preserves
+# precision up to the final quantization step.
+print("\n[1/3] Precomputing projected embeddings for full vocab...")
+vocab_size = talker.model.text_embedding.weight.shape[0]
+print(f"  Vocab size: {vocab_size}")
+BATCH = 4096
+out_path = f"{OUTPUT_DIR}/text_embeds_full_fp16.bin"
+with torch.no_grad():
+    W_emb = talker.model.text_embedding.weight        # [vocab, 2048]
+    fc1_w = talker.text_projection.linear_fc1.weight  # [2048, 2048]
+    fc1_b = talker.text_projection.linear_fc1.bias    # [2048]
+    fc2_w = talker.text_projection.linear_fc2.weight  # [1024, 2048]
+    fc2_b = talker.text_projection.linear_fc2.bias    # [1024]
+    with open(out_path, "wb") as f:
+        for start in range(0, vocab_size, BATCH):
+            end = min(start + BATCH, vocab_size)
+            x = W_emb[start:end].float()                          # [b, 2048]
+            h = torch.nn.functional.linear(x, fc1_w, fc1_b)       # [b, 2048]
+            h = torch.nn.functional.silu(h)                       # [b, 2048]
+            y = torch.nn.functional.linear(h, fc2_w, fc2_b)       # [b, 1024]
+            f.write(y.to(torch.float16).numpy().tobytes())
+            if start % (BATCH * 4) == 0:
+                print(f"  {end}/{vocab_size}  ({end*100//vocab_size}%)", flush=True)
+sz_mb = os.path.getsize(out_path) / (1024*1024)
+print(f"  -> {out_path} ({sz_mb:.1f} MB)")
+
+# Sanity check: re-read one stored row, project the same token live, compare.
+print("\n  Sanity check (token 1043):")
+with torch.no_grad():
+    live = talker.text_projection(talker.model.text_embedding(torch.tensor([1043])))[0].float().numpy()
+with open(out_path, "rb") as f:
+    f.seek(1043 * 1024 * 2)
+    stored = np.frombuffer(f.read(1024 * 2), dtype=np.float16).astype(np.float32)
+diff = float(np.abs(live - stored).max())
+print(f"    max abs diff live vs stored fp16: {diff:.2e}  (expect < 1e-3)")
+
+# ---- 2. Damien voice prefix (positions 0..8) and suffix (last 2) ----
+# Run a voice-clone and capture the multi-token prefill call, then keep the
+# first 9 rows and the last 2 rows. Those are fixed per speaker (same for
+# every phrase), so one capture suffices for the app's lifetime.
+print(f"\n[2/3] Capturing Damien voice prefix and suffix from {VOICE}...")
+captured = []
+call_shapes = []
+original_forward = talker.model.forward
+def patched(input_ids=None, inputs_embeds=None, **kwargs):
+    if inputs_embeds is not None and inputs_embeds.dim() == 3:
+        call_shapes.append(inputs_embeds.shape[1])
+        for i in range(inputs_embeds.shape[1]):
+            captured.append(inputs_embeds[0, i, :].detach().cpu().numpy().astype(np.float32))
+    return original_forward(input_ids=input_ids, inputs_embeds=inputs_embeds, **kwargs)
+talker.model.forward = patched
+# Any short sentence works: we only keep positions 0..8 and nP-2..nP-1,
+# which are text-invariant.
+_ = tts.generate_voice_clone(
+    text="Bonjour, je suis Kazeia.", ref_audio=VOICE, language="french",
+    x_vector_only_mode=True, non_streaming_mode=True,
+)
+talker.model.forward = original_forward
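+# The first captured call must be the multi-token prefill: 9 prefix rows,
+# at least one text row is expected in practice, and 2 suffix rows.
+if not call_shapes or call_shapes[0] < 11:
+    sys.exit("No multi-token prefill call captured; cannot extract prefix/suffix")
+nP = call_shapes[0]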
+print(f"  Prefill size: {nP} tokens")
+prefix_9 = np.stack(captured[:9])  # [9, 1024]
+suffix_2 = np.stack(captured[nP-2:nP])  # [2, 1024]
+
+prefix_path = f"{OUTPUT_DIR}/damien_voice_prefix.bin"
+with open(prefix_path, "wb") as f:
+    f.write(struct.pack("<i", 9))
+    f.write(struct.pack("<i", 1024))
+    f.write(prefix_9.astype(np.float32).tobytes())
+print(f"  prefix -> {prefix_path} ({os.path.getsize(prefix_path)} bytes)")
+
+suffix_path = f"{OUTPUT_DIR}/damien_voice_suffix.bin"
+with open(suffix_path, "wb") as f:
+    f.write(struct.pack("<i", 2))
+    f.write(struct.pack("<i", 1024))
+    f.write(suffix_2.astype(np.float32).tobytes())
+print(f"  suffix -> {suffix_path} ({os.path.getsize(suffix_path)} bytes)")
+
+norms_pref = [float(np.linalg.norm(prefix_9[i])) for i in range(9)]
+norms_suff = [float(np.linalg.norm(suffix_2[i])) for i in range(2)]
+print(f"  Prefix norms: {[f'{n:.2f}' for n in norms_pref]}  (pos 7 = xvector ~10, others ~1.6-1.8)")
+print(f"  Suffix norms: {[f'{n:.2f}' for n in norms_suff]}")
+
+# ---- 3. Tokenizer files ----
+# Copy the HF tokenizer artefacts so a Kotlin BPE can reproduce Python
+# encode() bit-for-bit.
+print(f"\n[3/3] Copying tokenizer to {OUTPUT_DIR}/qwen3_tokenizer/...")
+for name in ("tokenizer.json", "vocab.json", "merges.txt", "tokenizer_config.json", "special_tokens_map.json"):
+    src = os.path.join(MODEL, name)
+    if os.path.exists(src):
+        shutil.copy(src, f"{OUTPUT_DIR}/qwen3_tokenizer/{name}")
+        print(f"  {name} ({os.path.getsize(src)} bytes)")
+    else:
+        print(f"  (skipped, not present: {name})")
+
+print(f"\n=== DONE ===")
+print(f"Files ready in {OUTPUT_DIR}/")
+print(f"\nPush to tablet:")
+print(f"  adb push {OUTPUT_DIR}/text_embeds_full_fp16.bin /data/local/tmp/kazeia/models/qwen3-tts-npu/")
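+print(f"  adb push {OUTPUT_DIR}/damien_voice_prefix.bin /data/local/tmp/kazeia/models/qwen3-tts-npu/")
+print(f"  adb push {OUTPUT_DIR}/damien_voice_suffix.bin /data/local/tmp/kazeia/models/qwen3-tts-npu/")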
+print(f"  adb push {OUTPUT_DIR}/qwen3_tokenizer /data/local/tmp/kazeia/models/qwen3-tts-npu/")
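
The on-disk layouts written by this script can be read back with a few lines of stdlib Python. The sketch below is illustrative only and is not part of the commit: `read_fp16_row`, `read_voice_block`, and `build_prefill` are hypothetical helper names. It assumes the embedding table is a headerless row-major fp16 array and that the prefix/suffix files start with two little-endian int32 values (rows, dim), as the export code above writes them. `build_prefill` mirrors the [voice prefix] + [text_proj(id) + codec_pad] × N + [voice suffix] structure described in the commit message, assuming every row has the same width.

```python
import struct

DIM = 1024  # embedding width used by the export script


def read_fp16_row(path, token_id, dim=DIM):
    """Return one row of a headerless fp16 table as a list of Python floats."""
    with open(path, "rb") as f:
        f.seek(token_id * dim * 2)  # 2 bytes per fp16 value
        raw = f.read(dim * 2)
    if len(raw) != dim * 2:
        raise ValueError(f"token {token_id} is past the end of {path}")
    # 'e' is the struct format code for IEEE 754 half precision.
    return list(struct.unpack(f"<{dim}e", raw))


def read_voice_block(path):
    """Read a prefix/suffix file: int32 rows, int32 dim, then rows*dim fp32."""
    with open(path, "rb") as f:
        rows, dim = struct.unpack("<ii", f.read(8))
        raw = f.read(rows * dim * 4)
    if len(raw) != rows * dim * 4:
        raise ValueError(f"{path} is shorter than its header claims")
    flat = struct.unpack(f"<{rows * dim}f", raw)
    return [list(flat[r * dim:(r + 1) * dim]) for r in range(rows)]


def build_prefill(prefix, text_rows, codec_pad, suffix):
    """Assemble prefix rows, text rows summed with codec_pad, then suffix rows."""
    summed = [[t + c for t, c in zip(row, codec_pad)] for row in text_rows]
    return list(prefix) + summed + list(suffix)
```

With the real files, `read_voice_block` on `damien_voice_prefix.bin` should return 9 rows of 1024 floats and `read_fp16_row` on `text_embeds_full_fp16.bin` should return the same values the sanity check in section 1 compares against, up to fp16 rounding.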