# Hexagon NPU native FP16 guide for TTS

## llama.cpp + ggml-hexagon + HMX FP16: 47.8 tok/s

---

## 1. Why it works

On the HTP backend, the QNN SDK automatically quantizes to int8/int16, which destroys TTS output.
ggml-hexagon bypasses the QNN SDK and accesses the HMX units of the Hexagon DSP
directly, in **true IEEE-754 FP16**, via reverse-engineered kernels
(htp-ops-lib, Zixu Hao, EuroSys 2026).

| Approach | Precision | Speed | TTS audio |
|----------|-----------|-------|-----------|
| QNN SDK HTP | int8/int16 (quantized) | ~11 ms/step | Silence/noise |
| QNN SDK GPU | FP16 IEEE-754 | ~130 ms/step | Perfect |
| **ggml-hexagon HMX** | **FP16 IEEE-754** | **~21 ms/step** | **To be tested** |
| ONNX Runtime CPU | FP32 | ~107 ms/step | Perfect |

## 2. Build (Snapdragon toolchain Docker image)

### Prerequisites
- Podman or Docker
- llama.cpp source tree

### Commands
```bash
cd /opt/Kazeia/llama.cpp
cp docs/backend/snapdragon/CMakeUserPresets.json .
mkdir -p build-snapdragon

# Configure (runs inside the toolchain container)
podman run --rm --userns=keep-id \
  --volume "$(pwd)":/workspace:Z \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-android:v0.3 \
  bash -c "cd /workspace && cmake --preset arm64-android-snapdragon-release -B build-snapdragon"

# Build (\$(nproc) is escaped so it is evaluated inside the container)
podman run --rm --userns=keep-id \
  --volume "$(pwd)":/workspace:Z \
  --platform linux/amd64 \
  ghcr.io/snapdragon-toolchain/arm64-android:v0.3 \
  bash -c "cd /workspace && cmake --build build-snapdragon -j\$(nproc)"
```

### Outputs
```
build-snapdragon/bin/llama-cli                              # CLI (ARM64)
build-snapdragon/bin/lib*.so                                # Shared libs
build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v79.so   # Hexagon v79 skel (HMX FP16)
```

## 3. Talker → GGUF F16 conversion

The TTS talker is a standard Qwen3 model (28 layers, hidden size 1024, q_norm/k_norm).
We extract its weights and build a compatible GGUF file.

### Script: weight extraction
```python
# Extract the talker weights into a standalone HF-style state dict.
# `inner`, `codec_head` and `skip_set` come from the model-loading code (not shown here).
state_dict = {}
for name, param in inner.named_parameters():
    if name not in skip_set:
        state_dict[f"model.{name}"] = param.detach().clone()

# The codec embedding and codec head take the place of the token embedding and LM head.
state_dict["model.embed_tokens.weight"] = inner.codec_embedding.weight.detach().clone()
state_dict["lm_head.weight"] = codec_head.weight.detach().clone()
```

### Script: manual GGUF creation
```python
from gguf import GGUFWriter, GGMLQuantizationType

writer = GGUFWriter("talker_f16.gguf", "qwen3")
# ... add metadata (hidden_size, num_layers, etc.)
# ... add tensors with F16 weights, F32 norms
writer.add_tokenizer_model("none")  # no text tokenizer
```

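The "add tensors" step needs each HF parameter name translated to a GGUF tensor name and a storage type chosen for it. The sketch below shows one way to do that. The helper names `gguf_name` and `gguf_dtype` are hypothetical, and the name table reflects the llama.cpp tensor naming convention as I understand it, so it is worth checking against the tensor mapping shipped with gguf-py before relying on it.

```python
import re

# HF Qwen3 sub-module name -> GGUF tensor stem (assumed llama.cpp convention).
_LAYER_MAP = {
    "input_layernorm": "attn_norm",
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "self_attn.q_norm": "attn_q_norm",
    "self_attn.k_norm": "attn_k_norm",
    "post_attention_layernorm": "ffn_norm",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}

# Tensors that live outside the per-layer blocks.
_TOP_MAP = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}

_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)\.weight$")


def gguf_name(hf_name: str) -> str:
    """Translate one HF parameter name into a GGUF tensor name."""
    if hf_name in _TOP_MAP:
        return _TOP_MAP[hf_name]
    match = _LAYER_RE.match(hf_name)
    if match is None or match.group(2) not in _LAYER_MAP:
        raise KeyError(f"unmapped tensor: {hf_name}")
    return f"blk.{match.group(1)}.{_LAYER_MAP[match.group(2)]}.weight"


def gguf_dtype(tensor_name: str) -> str:
    """Norm weights stay F32; every other weight is stored as F16."""
    return "F32" if "norm" in tensor_name else "F16"
```

Raising on unmapped names (rather than skipping them) makes a missing or renamed tensor fail at conversion time instead of at load time on the device.
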
### File: `models_qnn/talker_f16.gguf` (852 MB)

## 4. Deployment on the tablet

```bash
# Push binaries
DST=/data/local/tmp/kazeia/llama-hex
# Create the model directory too, so the final push does not fail on a fresh device
adb shell "mkdir -p $DST /data/local/tmp/kazeia/models"
adb push build-snapdragon/bin/llama-* $DST/
adb push build-snapdragon/bin/*.so $DST/
adb push build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v*.so $DST/
adb push models_qnn/talker_f16.gguf /data/local/tmp/kazeia/models/

# Benchmark
adb shell "cd $DST && LD_LIBRARY_PATH=. ./llama-bench \
  -m /data/local/tmp/kazeia/models/talker_f16.gguf \
  -mmp 0 -ngl 99 -pg 1,5"
```

## 5. Benchmark (SM8750, Hexagon v79)

```
| model           | size     | backend     | test     | tok/s    |
|-----------------|----------|-------------|----------|----------|
| qwen3 0.6B F16  | 852 MiB  | OpenCL,HTP  | pp512    | 464 ± 16 |
| qwen3 0.6B F16  | 852 MiB  | OpenCL,HTP  | tg128    | 46.4 ± 1 |
| qwen3 0.6B F16  | 852 MiB  | OpenCL,HTP  | pp1+tg5  | 47.8 ± 1 |
```

**47.8 tok/s ≈ 21 ms/step** (vs. 107 ms on CPU, about 5× faster)

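For reference, the conversion from throughput to per-step latency and the resulting speedup can be checked in a few lines. This is only the arithmetic behind the summary line, using the 47.8 tok/s and 107 ms figures from this guide; the function name is illustrative.

```python
def ms_per_step(tokens_per_second: float) -> float:
    """One generated token is one decoding step, so latency is the inverse of throughput."""
    return 1000.0 / tokens_per_second


npu_ms = ms_per_step(47.8)   # pp1+tg5 result on the Hexagon NPU
cpu_ms = 107.0               # ONNX Runtime CPU figure from section 1
speedup = cpu_ms / npu_ms

print(f"{npu_ms:.1f} ms/step, {speedup:.1f}x faster than CPU")
# prints "20.9 ms/step, 5.1x faster than CPU"
```
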
## 6. Next steps

### Custom runner for TTS embeddings
The llama.cpp API supports `llama_batch.embd` for feeding embeddings
instead of token IDs. A small C++ runner needs to be written that:
1. Loads the GGUF with the hexagon backend
2. Accepts composite embeddings (1024 floats) via stdin or a file
3. Returns the logits (3072 floats) on stdout or to a file
4. Manages the KV cache between steps

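On the app side, steps 2 and 3 of the list above come down to framing fixed-size float vectors. The sketch below assumes a wire format of raw little-endian float32 values with no header, which is an assumption of this example (the runner does not exist yet and nothing in this guide fixes the format); the function names are hypothetical.

```python
import struct

EMBD_DIM = 1024   # floats per composite embedding (one decoding step)
N_LOGITS = 3072   # floats per logits vector returned by the runner


def pack_embedding(embedding) -> bytes:
    """Serialize one embedding as raw little-endian float32 (assumed wire format)."""
    if len(embedding) != EMBD_DIM:
        raise ValueError(f"expected {EMBD_DIM} floats, got {len(embedding)}")
    return struct.pack(f"<{EMBD_DIM}f", *embedding)


def unpack_logits(payload: bytes) -> list:
    """Parse one logits vector from raw little-endian float32 bytes."""
    expected = N_LOGITS * 4
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    return list(struct.unpack(f"<{N_LOGITS}f", payload))
```

With this framing each step sends 4096 bytes and reads back 12288 bytes, so the app can use exact-length reads on the subprocess pipes without any delimiter.
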
### App integration
- The runner runs as a root subprocess (like the LLM)
- The app sends the composite embeddings and reads back the logits
- Sampling (temp=0.9, top_k=50) stays on the CPU, on the app side

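The CPU-side sampling step can be sketched as plain temperature plus top-k sampling over the logits vector. This is a generic illustration using the temp=0.9 and top_k=50 values above, not the app's actual sampler; the function name and the `rng` parameter are introduced here for the example.

```python
import math
import random


def sample_top_k(logits, temperature=0.9, top_k=50, rng=random):
    """Sample one index from the top_k largest logits after temperature scaling."""
    # Indices of the top_k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights itself.
    return rng.choices(top, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing audio produced by the CPU and NPU paths.
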
### CP on the Hexagon NPU
- Same approach: convert the CP (5 layers) to GGUF F16
- With only 5 layers, it should be even faster than the talker
- Estimate: ~5 ms for the 17 CP steps