
# ExecuTorch patches for Kazeia

Local modifications to `/opt/Kazeia/executorch` (upstream `pytorch/executorch` @ v1.2.0),
required to export Qwen3-4B to QNN for the OnePlus Pad 3 (Snapdragon 8 Elite, Hexagon V79).

Not upstreamable as-is: the phi_4_mini torchtune guard is a local dependency workaround,
and the `Qwen3_4B` class matches upstream style but has not been submitted.

## qwen3_4b_decoder.patch

Applied to: `/opt/Kazeia/executorch/`

```
cd /opt/Kazeia/executorch && git apply ../executorch-patches/qwen3_4b_decoder.patch
```

Adds:
- `examples/qualcomm/oss_scripts/llama/__init__.py`:
  - `try/except` around the `convert_phi_4_mini_weights` import (phi_4_mini pulls in
    torchtune, which conflicts with our torchao 0.17 pin).
  - New `Qwen3_4B` class registered as `qwen3-4b` with `num_sharding=2`. At
    `num_sharding=1` the 4B model OOMed during QNN compile even with 48 GB of free RAM;
    `num_sharding=2` is the minimum that lets the compile partitioner split the HTP context.
- `examples/qualcomm/oss_scripts/llama/decoder_constants.py`:
  - Adds `"qwen3-4b": "qwen3"` to `DECODER_MODEL_VERSION`.

## torchtune_quantization.patch

Applied to: `/opt/Kazeia/et_venv/lib64/python3.10/site-packages/torchtune/training/quantization.py`

torchao 0.17+ removed `int4_weight_only` and `int8_dynamic_activation_int4_weight`, but
torchtune 0.6.1 still imports them. Our Qwen3 QNN export path uses neither, so the patch
wraps the import in `try/except` and sets both names to `None` on `ImportError`.
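
For orientation, the guard has roughly the shape below. This is a sketch, not the contents of the patch: the import location `torchao.quantization` is an assumption about torchtune's source, and `require_quantizer` is a hypothetical helper added only to show how a caller could fail loudly if one of the stubbed names is ever used.

```python
# Assumed import site; the real line lives in torchtune/training/quantization.py.
try:
    from torchao.quantization import (
        int4_weight_only,
        int8_dynamic_activation_int4_weight,
    )
except ImportError:
    # torchao no longer ships these names (or is not installed at all):
    # leave placeholders so the rest of the module still imports.
    int4_weight_only = None
    int8_dynamic_activation_int4_weight = None


def require_quantizer(symbol, name):
    """Hypothetical helper: return symbol, or raise if it was stubbed to None."""
    if symbol is None:
        raise RuntimeError(f"{name} is unavailable in the installed torchao")
    return symbol
```

Stubbing to `None` keeps the module importable; any code path that really needs one of the removed functions would still fail, just later and at the call site.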

## Host env reminders (not in patches)

- symlink `libc++.so.1` and `libc++abi.so.1` in `backends/qualcomm/sdk/libcxx-14.0.0/`
- copy `build-x86/backends/qualcomm/PyQnn*.so` to `backends/qualcomm/python/`
- `QNN_SDK_ROOT=/opt/Kazeia/executorch/backends/qualcomm/sdk/qnn`
- `LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:.../sdk/libcxx-14.0.0`
- append `build-x86/third-party/flatc_ep/bin` to `PATH`
- `PYTHONPATH=/opt/Kazeia`
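
A quick sanity check of these variables before starting an export might look like the sketch below. `check_host_env` is a hypothetical helper, not part of ExecuTorch or the patches. It only inspects the variables listed above and does not verify the symlinks or the copied `PyQnn*.so` files.

```python
import os


def check_host_env(env):
    """Return a list of problems found in an environment mapping such as os.environ."""
    problems = []

    sdk_root = env.get("QNN_SDK_ROOT", "")
    if not sdk_root:
        problems.append("QNN_SDK_ROOT is not set")
    else:
        # The QNN host libraries must be on the loader path.
        qnn_lib = os.path.join(sdk_root, "lib", "x86_64-linux-clang")
        ld_paths = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
        if qnn_lib not in ld_paths:
            problems.append(f"{qnn_lib} is missing from LD_LIBRARY_PATH")

    # flatc is expected to be reachable through PATH.
    path_dirs = env.get("PATH", "").split(os.pathsep)
    if not any(d.rstrip("/").endswith("flatc_ep/bin") for d in path_dirs):
        problems.append("flatc_ep/bin is missing from PATH")

    if not env.get("PYTHONPATH"):
        problems.append("PYTHONPATH is not set")

    return problems
```

Calling `check_host_env(os.environ)` returns an empty list when all the listed variables look plausible.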

## RAM/swap for 4B export

Peak RAM during `prepare_pt2e` + QNN compile: **46 GB anon-rss**.
On a box with 62 GB RAM + 8 GB zram this OOMs. Fix: add a 48 GiB swapfile:

```
sudo dd if=/dev/zero of=/swapfile bs=1M count=49152
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
```

The compile then uses ~59 GB RAM + 24 GB swap and completes in ~30 min wall-clock.
Put `--artifact` on `/home`, not `/tmp`: the 25 GB `decode_qdq.pt2` overflows tmpfs.
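
The numbers above can be turned into a rough preflight check. The sketch below is an illustration under stated assumptions: `parse_meminfo` and `preflight` are hypothetical names, the 83 threshold is simply 59 + 24 from the figures above, the 25 threshold is the size of `decode_qdq.pt2`, and GB is treated as GiB for simplicity. Note that `SwapTotal` also counts zram swap.

```python
import shutil

GIB = 1024 ** 3


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (bytes for kB rows)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        # Most rows are in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


def preflight(meminfo_text, artifact_dir, need_mem_gib=83, need_disk_gib=25):
    """Return a list of problems; an empty list means both checks passed."""
    info = parse_meminfo(meminfo_text)
    problems = []

    mem_plus_swap = info.get("MemTotal", 0) + info.get("SwapTotal", 0)
    if mem_plus_swap < need_mem_gib * GIB:
        problems.append(
            f"RAM + swap is {mem_plus_swap / GIB:.1f} GiB, want >= {need_mem_gib} GiB"
        )

    free = shutil.disk_usage(artifact_dir).free
    if free < need_disk_gib * GIB:
        problems.append(
            f"{artifact_dir} has {free / GIB:.1f} GiB free, want >= {need_disk_gib} GiB"
        )
    return problems
```

On the export host this could be called as `preflight(open("/proc/meminfo").read(), artifact_dir)`, with `artifact_dir` set to the directory passed to `--artifact`.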