# Executorch patches for Kazeia

Local modifications to /opt/Kazeia/executorch (upstream pytorch/executorch @ v1.2.0) required to export Qwen3-4B to QNN for OnePlus Pad 3 (Snapdragon 8 Elite, Hexagon V79).

Not upstreamable as-is (phi_4_mini torchtune guard is a local dependency workaround; Qwen3_4B class matches upstream style but hasn't been submitted).

## qwen3_4b_decoder.patch

Applied to: `/opt/Kazeia/executorch/`

```
cd /opt/Kazeia/executorch && git apply ../executorch-patches/qwen3_4b_decoder.patch
```

Adds:
- `examples/qualcomm/oss_scripts/llama/__init__.py`:
  - `try/except` around `convert_phi_4_mini_weights` import (phi_4_mini pulls torchtune which conflicts with our torchao 0.17 pin).
  - New `Qwen3_4B` class registered as `qwen3-4b`, `num_sharding=2` (4B at num_sharding=1 OOMed during QNN compile even with 48 GB free RAM; sharding=2 is the minimum that lets the compile partitioner split the HTP context).
- `examples/qualcomm/oss_scripts/llama/decoder_constants.py`:
  - Adds `"qwen3-4b": "qwen3"` to `DECODER_MODEL_VERSION`.

## torchtune_quantization.patch

Applied to: `/opt/Kazeia/et_venv/lib64/python3.10/site-packages/torchtune/training/quantization.py`

torchao 0.17+ removed `int4_weight_only` and `int8_dynamic_activation_int4_weight`. torchtune 0.6.1 still imports them. Since our Qwen3 QNN export path doesn't use either, wrap the import in try/except and set them to None on ImportError.

## Host env reminders (not in patches)

- symlink `libc++.so.1` and `libc++abi.so.1` in `backends/qualcomm/sdk/libcxx-14.0.0/`
- copy `build-x86/backends/qualcomm/PyQnn*.so` to `backends/qualcomm/python/`
- `QNN_SDK_ROOT=/opt/Kazeia/executorch/backends/qualcomm/sdk/qnn`
- `LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:.../sdk/libcxx-14.0.0`
- `PATH+=build-x86/third-party/flatc_ep/bin`
- `PYTHONPATH=/opt/Kazeia`

## RAM/swap for 4B export

Peak RAM during prepare_pt2e + QNN compile: **46 GB anon-rss**. On a 62 GB + 8 GB zram box this OOMs.
Fix: add a swapfile (48 GiB with the command below):

```
sudo dd if=/dev/zero of=/swapfile bs=1M count=49152
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
```

Compile then uses ~59 GB RAM + 24 GB swap and completes in ~30 min wall-clock time. Put `--artifact` on `/home`, not `/tmp` (the 25 GB `decode_qdq.pt2` overflows tmpfs).
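
The RAM, swap, and artifact-space requirements above can be checked before starting a 30-minute export. The following is a minimal sketch, not part of either patch: the function names are hypothetical, and the default thresholds (83 GiB of RAM + swap, 25 GiB of free space for artifacts) are assumptions derived from the figures in this note (~59 GB + 24 GB, and the 25 GB `decode_qdq.pt2`), treated as rough lower bounds.

```python
import shutil

GIB = 1024 ** 3


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: bytes-or-count} dict."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue  # blank or malformed line
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024  # the kernel's "kB" in this file means KiB
        fields[name.strip()] = value
    return fields


def find_problems(meminfo_text, artifact_free_bytes,
                  need_mem_gib=83, need_disk_gib=25):
    """Return human-readable problems; an empty list means the host looks OK.

    The default thresholds are assumptions taken from this note, not
    values required by ExecuTorch or the QNN SDK.
    """
    info = parse_meminfo(meminfo_text)
    total = info.get("MemTotal", 0) + info.get("SwapTotal", 0)
    problems = []
    if total < need_mem_gib * GIB:
        problems.append(
            f"RAM + swap is {total / GIB:.0f} GiB, want >= {need_mem_gib} GiB"
        )
    if artifact_free_bytes < need_disk_gib * GIB:
        problems.append(
            f"artifact dir has {artifact_free_bytes / GIB:.0f} GiB free, "
            f"want >= {need_disk_gib} GiB"
        )
    return problems


def check_host(artifact_dir):
    """Run the checks against the live host (Linux only)."""
    with open("/proc/meminfo") as f:
        text = f.read()
    return find_problems(text, shutil.disk_usage(artifact_dir).free)
```

Calling `check_host()` with the directory intended for `--artifact` returns a list of problems to fix first. Keeping the parsing separate from the file reads means the thresholds can be exercised against saved `/proc/meminfo` text from another machine.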