kazeia/TTS_HEXAGON_NPU_GUIDE.md

# Guide: native FP16 on the Hexagon NPU for TTS
## llama.cpp + ggml-hexagon + HMX FP16 — 47.8 tok/s

---

## 1. Why it works

The QNN SDK automatically quantizes to int8/int16, which destroys the TTS output.
ggml-hexagon bypasses the QNN SDK and drives the HMX units of the Hexagon DSP
directly in **true IEEE-754 FP16**, using reverse-engineered kernels
(htp-ops-lib, Zixu Hao, EuroSys 2026).

| Approach | Precision | Speed | TTS audio |
|----------|-----------|-------|-----------|
| QNN SDK HTP | int8/int16 quantized | ~11 ms/step | Silence/noise |
| QNN SDK GPU | fp16 IEEE-754 | ~130 ms/step | Perfect |
| **ggml-hexagon HMX** | **fp16 IEEE-754** | **~21 ms/step** | **To be tested** |
| ONNX Runtime CPU | fp32 | ~107 ms/step | Perfect |
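
To make the precision argument concrete, the sketch below compares the round-trip error of symmetric per-tensor int8 quantization with that of an FP16 cast. It uses only the Python standard library. The synthetic Gaussian activations, the single outlier, the per-tensor scaling scheme, and the helper names are all illustrative assumptions; this is not the QNN SDK's actual quantizer.

```python
import random
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE-754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def int8_roundtrip(values: list[float]) -> list[float]:
    """Symmetric per-tensor int8 quantization followed by dequantization."""
    scale = max(abs(v) for v in values) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def mean_abs_error(ref: list[float], approx: list[float]) -> float:
    return sum(abs(r - a) for r, a in zip(ref, approx)) / len(ref)

random.seed(0)
# Synthetic activations (assumption): unit Gaussian plus one large outlier,
# which stretches the int8 scale and costs resolution on typical values.
acts = [random.gauss(0.0, 1.0) for _ in range(1024)] + [40.0]

err_fp16 = mean_abs_error(acts, [to_fp16(v) for v in acts])
err_int8 = mean_abs_error(acts, int8_roundtrip(acts))
print(f"fp16 mean abs error: {err_fp16:.6f}")
print(f"int8 mean abs error: {err_int8:.6f}")
```

On data like this the int8 error is much larger than the FP16 error, because a single outlier sets the quantization step for the whole tensor.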

## 2. Build (snapdragon toolchain Docker)

### Prerequisites
- Podman or Docker
- llama.cpp source tree

### Commands
```bash
cd /opt/Kazeia/llama.cpp
cp docs/backend/snapdragon/CMakeUserPresets.json .
mkdir -p build-snapdragon

# Configure
podman run --rm --userns=keep-id \
    --volume $(pwd):/workspace:Z \
    --platform linux/amd64 \
    ghcr.io/snapdragon-toolchain/arm64-android:v0.3 \
    bash -c "cd /workspace && cmake --preset arm64-android-snapdragon-release -B build-snapdragon"

# Build
podman run --rm --userns=keep-id \
    --volume $(pwd):/workspace:Z \
    --platform linux/amd64 \
    ghcr.io/snapdragon-toolchain/arm64-android:v0.3 \
    bash -c "cd /workspace && cmake --build build-snapdragon -j\$(nproc)"
```

### Outputs
```
build-snapdragon/bin/llama-cli           # CLI (ARM64)
build-snapdragon/bin/lib*.so             # Shared libs
build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v79.so  # Hexagon v79 skel (HMX FP16)
```

## 3. Converting the talker to GGUF F16

The TTS talker is a standard Qwen3 (28 layers, 1024 dim, q_norm/k_norm).
We extract its weights and build a compatible GGUF file.

### Script: weight extraction
```python
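# Extract the talker weights into a standalone HF-style state dict.
# `inner`, `skip_set` (parameter names to drop) and `codec_head` are
# assumed to be defined earlier from the loaded TTS model.
state_dict = {}
for name, param in inner.named_parameters():
    if name not in skip_set:
        state_dict[f"model.{name}"] = param.detach().clone()
# The codec embedding and codec head take the place of the usual
# token embedding and LM head.
state_dict["model.embed_tokens.weight"] = inner.codec_embedding.weight.detach().clone()
state_dict["lm_head.weight"] = codec_head.weight.detach().clone()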
```

### Script: manual GGUF creation
```python
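from gguf import GGUFWriter, GGMLQuantizationType
writer = GGUFWriter("talker_f16.gguf", "qwen3")
# ... add metadata (hidden_size, num_layers, etc.)
# ... add tensors with F16 weights, F32 norms
writer.add_tokenizer_model("none")  # no text tokenizer
# Flush everything to disk (usual gguf-py write sequence)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()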
```
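
The "add tensors" step needs each HF parameter name translated to a GGUF tensor name and assigned a storage type. The sketch below shows one way to do that mapping. The helper names are hypothetical, and the GGUF names are assumed to follow llama.cpp's usual convention for Qwen3-style models; check them against the tensor name map in gguf-py before relying on them.

```python
import re

# Top-level tensors: HF name -> GGUF name (assumed llama.cpp convention).
TOP_LEVEL = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}

# Per-layer tensors: HF module path -> GGUF suffix.
PER_LAYER = {
    "input_layernorm": "attn_norm",
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "self_attn.q_norm": "attn_q_norm",
    "self_attn.k_norm": "attn_k_norm",
    "post_attention_layernorm": "ffn_norm",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)\.weight$")

def hf_to_gguf_name(hf_name: str) -> str:
    """Translate one HF parameter name to its GGUF tensor name."""
    if hf_name in TOP_LEVEL:
        return TOP_LEVEL[hf_name]
    m = LAYER_RE.match(hf_name)
    if m and m.group(2) in PER_LAYER:
        return f"blk.{m.group(1)}.{PER_LAYER[m.group(2)]}.weight"
    raise KeyError(f"no GGUF mapping for {hf_name}")

def target_dtype(gguf_name: str) -> str:
    """F32 for norm weights, F16 for everything else."""
    return "F32" if "norm" in gguf_name else "F16"

for name in ("model.layers.0.self_attn.q_proj.weight",
             "model.layers.0.self_attn.q_norm.weight"):
    gguf_name = hf_to_gguf_name(name)
    print(name, "->", gguf_name, target_dtype(gguf_name))
```

Raising on unknown names, rather than silently skipping them, makes a missing tensor show up at conversion time instead of at model load.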

### File: `models_qnn/talker_f16.gguf` (852 MB)

## 4. Deploying to the tablet

```bash
# Push binaries
DST=/data/local/tmp/kazeia/llama-hex
adb shell "mkdir -p $DST"
adb push build-snapdragon/bin/llama-* $DST/
adb push build-snapdragon/bin/*.so $DST/
adb push build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v*.so $DST/
adb push models_qnn/talker_f16.gguf /data/local/tmp/kazeia/models/

# Benchmark
adb shell "cd $DST && LD_LIBRARY_PATH=. ./llama-bench \
    -m /data/local/tmp/kazeia/models/talker_f16.gguf \
    -mmp 0 -ngl 99 -pg 1,5"
```

## 5. Benchmark (SM8750, Hexagon v79)

```
| model           | size     | backend     | test     | tok/s    |
|-----------------|----------|-------------|----------|----------|
| qwen3 0.6B F16  | 852 MiB  | OpenCL,HTP  | pp512    | 464 ± 16 |
| qwen3 0.6B F16  | 852 MiB  | OpenCL,HTP  | tg128    | 46.4 ± 1 |
| qwen3 0.6B F16  | 852 MiB  | OpenCL,HTP  | pp1+tg5  | 47.8 ± 1 |
```

**47.8 tok/s = ~21 ms/step** (vs. 107 ms/step on CPU, about 5× faster)
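
The conversion behind these figures is simple enough to check directly. The helper name below is illustrative; the inputs are the `pp1+tg5` throughput from the table and the CPU step time from section 1.

```python
def ms_per_step(tokens_per_second: float) -> float:
    """Convert a decode throughput into a per-step latency in milliseconds."""
    return 1000.0 / tokens_per_second

hexagon_ms = ms_per_step(47.8)   # pp1+tg5 result on Hexagon
cpu_ms = 107.0                   # ONNX Runtime CPU, ms/step
speedup = cpu_ms / hexagon_ms

print(f"{hexagon_ms:.1f} ms/step, {speedup:.1f}x faster than CPU")
# prints "20.9 ms/step, 5.1x faster than CPU"
```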

## 6. Next steps

### Custom runner for TTS embeddings
The llama.cpp API supports `llama_batch.embd` for feeding embeddings
instead of token IDs. A small C++ runner still has to be written that:
1. Loads the GGUF with the Hexagon backend
2. Accepts composite embeddings (1024 floats) via stdin or a file
3. Returns the logits (3072 floats) on stdout or to a file
4. Manages the KV cache across steps
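
The runner does not exist yet, so its wire format is still open. As one possible design, the sketch below frames each step as raw little-endian float32 values: 1024 in, 3072 out. The framing, the function names, and the use of in-memory streams in place of the subprocess pipes are all assumptions made for illustration.

```python
import io
import struct

HIDDEN = 1024   # floats per composite embedding
VOCAB = 3072    # floats per logits vector

def write_embedding(stream, embedding):
    """Send one step's embedding as raw little-endian float32."""
    if len(embedding) != HIDDEN:
        raise ValueError(f"expected {HIDDEN} floats, got {len(embedding)}")
    stream.write(struct.pack(f"<{HIDDEN}f", *embedding))
    stream.flush()

def read_exact(stream, n):
    """Read exactly n bytes; pipes may return short reads."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("runner closed the stream mid-message")
        buf += chunk
    return buf

def read_logits(stream):
    """Receive one step's logits as raw little-endian float32."""
    return list(struct.unpack(f"<{VOCAB}f", read_exact(stream, VOCAB * 4)))

# Round-trip demo with in-memory streams standing in for the pipes.
to_runner = io.BytesIO()
write_embedding(to_runner, [0.0] * HIDDEN)
from_runner = io.BytesIO(struct.pack(f"<{VOCAB}f", *([0.5] * VOCAB)))
logits = read_logits(from_runner)
print(len(to_runner.getvalue()), "bytes sent,", len(logits), "logits received")
```

Fixed-size messages keep the C++ side trivial (one `fread` and one `fwrite` per step) and need no length prefix, at the cost of hard-coding the two dimensions on both ends.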

### App integration
- The runner runs as a root subprocess (like the LLM)
- The app sends the composite embeddings and reads back the logits
- Sampling (temp=0.9, top_k=50) stays on the CPU, on the app side
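
For reference, app-side sampling with these settings can be sketched in a few lines of standard-library Python. This is a generic temperature plus top-k sampler, not the app's actual implementation; the function name and the synthetic logits are assumptions.

```python
import math
import random

def sample_top_k(logits, temperature=0.9, top_k=50, rng=random):
    """Sample one index from the top_k highest logits after temperature scaling."""
    # Indices of the top_k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

# Demo on synthetic logits (3072 entries, peaked around index 1000).
rng = random.Random(0)
demo_logits = [-abs(i - 1000) / 10.0 for i in range(3072)]
token = sample_top_k(demo_logits, temperature=0.9, top_k=50, rng=rng)
print("sampled index:", token)
```

With `top_k=1` the function reduces to greedy argmax, which is a convenient check when debugging the pipeline end to end.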

### CP on the Hexagon NPU
- Same approach: convert the CP (5 layers) to GGUF F16
- With only 5 layers, it should run even faster than the talker
- Estimate: ~5 ms for the 17 CP steps