# Executorch patches for Kazeia
Local modifications to /opt/Kazeia/executorch (upstream pytorch/executorch @ v1.2.0)
required to export Qwen3-4B to QNN for OnePlus Pad 3 (Snapdragon 8 Elite, Hexagon V79).
Not upstreamable as-is: the phi_4_mini torchtune guard is a local dependency workaround,
and the Qwen3_4B class matches upstream style but hasn't been submitted.
## qwen3_4b_decoder.patch
Applied to: `/opt/Kazeia/executorch/`
```
cd /opt/Kazeia/executorch && git apply ../executorch-patches/qwen3_4b_decoder.patch
```
Adds:
- `examples/qualcomm/oss_scripts/llama/__init__.py`:
  - `try/except` around the `convert_phi_4_mini_weights` import (phi_4_mini pulls in torchtune,
    which conflicts with our torchao 0.17 pin).
  - New `Qwen3_4B` class registered as `qwen3-4b`, `num_sharding=2` (4B at `num_sharding=1`
    OOMed during QNN compile even with 48 GB free RAM; `num_sharding=2` is the minimum that
    lets the compile partitioner split the HTP context).
- `examples/qualcomm/oss_scripts/llama/decoder_constants.py`:
  - Adds `"qwen3-4b": "qwen3"` to `DECODER_MODEL_VERSION`.
## torchtune_quantization.patch
Applied to: `/opt/Kazeia/et_venv/lib64/python3.10/site-packages/torchtune/training/quantization.py`

torchao 0.17+ removed `int4_weight_only` and `int8_dynamic_activation_int4_weight`, but
torchtune 0.6.1 still imports them. Since our Qwen3 QNN export path doesn't use either,
the patch wraps the import in `try/except` and sets both names to `None` on `ImportError`.
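
For orientation, the guard has roughly the following shape. This is a minimal sketch, not the patch itself: the two names come from the description above, while the assumption that they are imported from `torchao.quantization` and the exact layout of the import statement are guesses about the torchtune source.

```python
# Sketch only: the import location (torchao.quantization) is an assumption.
try:
    from torchao.quantization import (
        int4_weight_only,
        int8_dynamic_activation_int4_weight,
    )
except ImportError:
    # These names are absent in torchao 0.17+. The Qwen3 QNN export path
    # never calls them, so None placeholders let the module import succeed.
    int4_weight_only = None
    int8_dynamic_activation_int4_weight = None
```

Any code path that does call one of the placeholders will fail with a `TypeError` (`'NoneType' object is not callable`), which is acceptable here only because the export path never reaches them.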
## Host env reminders (not in patches)
- symlink `libc++.so.1` and `libc++abi.so.1` in `backends/qualcomm/sdk/libcxx-14.0.0/`
- copy `build-x86/backends/qualcomm/PyQnn*.so` to `backends/qualcomm/python/`
- `QNN_SDK_ROOT=/opt/Kazeia/executorch/backends/qualcomm/sdk/qnn`
- `LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:.../sdk/libcxx-14.0.0`
- `PATH=$PATH:build-x86/third-party/flatc_ep/bin`
- `PYTHONPATH=/opt/Kazeia`
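
A small pre-flight check can catch a missing variable before a long export run. The sketch below is illustrative: `check_export_env` is a hypothetical helper, not part of executorch, and it only inspects the three environment variables listed above (it does not verify the symlinks, the copied `PyQnn*.so` files, or `PATH`).

```python
import os


def check_export_env(env):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: only QNN_SDK_ROOT, LD_LIBRARY_PATH and PYTHONPATH
    are inspected, using the values from the reminders above.
    """
    problems = []
    sdk_root = env.get("QNN_SDK_ROOT", "")
    if not sdk_root:
        problems.append("QNN_SDK_ROOT is not set")
    else:
        # The QNN host libraries are expected under lib/x86_64-linux-clang.
        wanted = sdk_root.rstrip("/") + "/lib/x86_64-linux-clang"
        if wanted not in env.get("LD_LIBRARY_PATH", "").split(":"):
            problems.append(f"LD_LIBRARY_PATH is missing {wanted}")
    if "/opt/Kazeia" not in env.get("PYTHONPATH", "").split(":"):
        problems.append("PYTHONPATH is missing /opt/Kazeia")
    return problems
```

Calling `check_export_env(os.environ)` in the export shell and printing the returned list is enough to use it.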
## RAM/swap for 4B export
Peak RAM during prepare_pt2e + QNN compile: **46 GB anon-rss**.
On a box with 62 GB RAM + 8 GB zram this OOMs. Fix: add a 48 GiB swapfile:
```
sudo dd if=/dev/zero of=/swapfile bs=1M count=49152
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
```
Compile then uses ~59 GB RAM + 24 GB swap and completes in ~30 min wall-clock.
Point `--artifact` at a directory under `/home`, not `/tmp` (the 25 GB `decode_qdq.pt2` overflows tmpfs).
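
Before starting a compile, it may help to confirm that RAM plus swap covers the usage observed above. The sketch below parses `/proc/meminfo`-style text; the helper names are hypothetical, and the 83 GiB floor is simply the observed ~59 GB RAM + ~24 GB swap added together (treating GB and GiB loosely), not a measured requirement.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (sizes are in KiB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def ram_plus_swap_gib(fields):
    """Total physical RAM plus total swap, in GiB."""
    kib = fields.get("MemTotal", 0) + fields.get("SwapTotal", 0)
    return kib / (1024 * 1024)


# Assumption: floor derived from the usage noted above (59 + 24).
NEEDED_GIB = 59 + 24


def has_headroom(fields, needed_gib=NEEDED_GIB):
    return ram_plus_swap_gib(fields) >= needed_gib
```

On the host, pass the contents of `/proc/meminfo` to `parse_meminfo` and check `has_headroom` on the result. Note that `SwapTotal` counts zram swap as well as the swapfile.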