# ExecuTorch patches for Kazeia

Local modifications to `/opt/Kazeia/executorch` (upstream pytorch/executorch @ v1.2.0),
required to export Qwen3-4B to QNN for the OnePlus Pad 3 (Snapdragon 8 Elite, Hexagon V79).

Not upstreamable as-is: the phi_4_mini torchtune guard is a local dependency workaround, and
the `Qwen3_4B` class matches upstream style but hasn't been submitted.

## qwen3_4b_decoder.patch

Applied to: `/opt/Kazeia/executorch/`

```
cd /opt/Kazeia/executorch && git apply ../executorch-patches/qwen3_4b_decoder.patch
```

Adds:
- `examples/qualcomm/oss_scripts/llama/__init__.py`:
  - `try/except` around the `convert_phi_4_mini_weights` import (phi_4_mini pulls in torchtune,
    which conflicts with our torchao 0.17 pin).
  - New `Qwen3_4B` class registered as `qwen3-4b` with `num_sharding=2` (4B at `num_sharding=1`
    OOMed during QNN compile even with 48 GB free RAM; `num_sharding=2` is the minimum that
    lets the compile partitioner split the HTP context).
- `examples/qualcomm/oss_scripts/llama/decoder_constants.py`:
  - Adds `"qwen3-4b": "qwen3"` to `DECODER_MODEL_VERSION`.

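`git apply` fails if the patch is already in the tree, so a re-runnable setup script may want to test for that first. The following is a minimal sketch, assuming POSIX `sh` and `git` on the host; the function name `apply_patch_once` is hypothetical and not part of this repository.

```shell
# Apply a patch only if it is not already applied.
# $1 = repository root, $2 = patch file (a relative path is resolved from $1,
# because `git -C` changes directory before running).
apply_patch_once() {
    repo=$1
    patch=$2
    # If the patch can be cleanly reversed, the tree already contains it.
    if git -C "$repo" apply --reverse --check "$patch" 2>/dev/null; then
        echo "already applied"
    # Dry run first so a failing patch leaves the tree untouched.
    elif git -C "$repo" apply --check "$patch"; then
        git -C "$repo" apply "$patch" && echo "applied"
    else
        echo "does not apply" >&2
        return 1
    fi
}

# Usage, with the paths from the command above:
# apply_patch_once /opt/Kazeia/executorch ../executorch-patches/qwen3_4b_decoder.patch
```
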
## torchtune_quantization.patch

Applied to: `/opt/Kazeia/et_venv/lib64/python3.10/site-packages/torchtune/training/quantization.py`

torchao 0.17+ removed `int4_weight_only` and `int8_dynamic_activation_int4_weight`, but
torchtune 0.6.1 still imports them. Since our Qwen3 QNN export path uses neither, the patch
wraps the import in `try/except` and sets both names to `None` on `ImportError`.

## Host env reminders (not in patches)

- symlink `libc++.so.1` and `libc++abi.so.1` in `backends/qualcomm/sdk/libcxx-14.0.0/`
- copy `build-x86/backends/qualcomm/PyQnn*.so` to `backends/qualcomm/python/`
- `QNN_SDK_ROOT=/opt/Kazeia/executorch/backends/qualcomm/sdk/qnn`
- `LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:.../sdk/libcxx-14.0.0`
- `PATH+=build-x86/third-party/flatc_ep/bin`
- `PYTHONPATH=/opt/Kazeia`

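These settings are easy to lose between shells. A minimal preflight sketch, assuming POSIX `sh`, could catch that before a long export starts. The function name `qnn_preflight` is hypothetical, and it checks only the variables and the one SDK directory named in the list above, not the symlinks or the copied `PyQnn*.so` files.

```shell
# Report which of the required variables are unset, and whether the QNN
# host library directory exists. Returns non-zero if anything is missing.
qnn_preflight() {
    status=0
    for var in QNN_SDK_ROOT LD_LIBRARY_PATH PYTHONPATH; do
        # Indirect lookup of the variable named in $var (POSIX-compatible).
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "unset: $var"
            status=1
        fi
    done
    if [ -n "${QNN_SDK_ROOT:-}" ] && [ ! -d "$QNN_SDK_ROOT/lib/x86_64-linux-clang" ]; then
        echo "missing dir: $QNN_SDK_ROOT/lib/x86_64-linux-clang"
        status=1
    fi
    return "$status"
}

# Usage: qnn_preflight || exit 1
```
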
## RAM/swap for 4B export

Peak RAM during `prepare_pt2e` + QNN compile: **46 GB anon-rss**.
On a box with 62 GB RAM + 8 GB zram this OOMs. Fix: add a 48 GiB swapfile:

```
sudo dd if=/dev/zero of=/swapfile bs=1M count=49152
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
```

The compile then uses ~59 GB RAM + 24 GB swap and completes in ~30 min wall time.
Put `--artifact` on `/home`, not `/tmp` (the 25 GB `decode_qdq.pt2` overflows tmpfs).
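
Since a failed export costs a good part of that half hour, it may be worth checking the host first. The sketch below assumes a Linux host with `/proc/meminfo` and POSIX `sh`, `awk`, and `df`. All function names are hypothetical, and the thresholds (83 from 59 + 24, and 25 for `decode_qdq.pt2`) are taken from the figures above as assumed minimums, not measured limits.

```shell
# Total RAM + swap in whole GiB. /proc/meminfo reports values in KiB.
# An alternate file can be passed as $1.
mem_plus_swap_gib() {
    awk '/^(MemTotal|SwapTotal):/ { kib += $2 }
         END { printf "%d\n", kib / 1048576 }' "${1:-/proc/meminfo}"
}

# Free space in whole GiB on the filesystem that holds directory $1.
# -P forces one output line per filesystem, -k reports KiB.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# $1 = directory intended for --artifact.
check_export_host() {
    artifact_dir=$1
    if [ "$(mem_plus_swap_gib)" -lt 83 ]; then
        echo "RAM + swap below 83 GiB"
        return 1
    fi
    if [ "$(free_gib "$artifact_dir")" -lt 25 ]; then
        echo "less than 25 GiB free in $artifact_dir"
        return 1
    fi
}

# Usage (the directory is a placeholder):
# check_export_host "$HOME/qwen3_artifacts" || exit 1
```
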