Skip intermediate KV buffer: copy output tensors directly into next step's input pointers.

Saves ~1.5GB/run of memcpy for talker (28L × 2 × 100×8×128 floats × 58 steps) and CP similarly.

Generation: 4007ms → 3713ms, total: 7180ms → 6078ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

| Name |
|---|
| Module.java |
| cp_et_runner.cpp |
| cp_et_test_client.cpp |
| jni_layer_tts.cpp |
| tts_pipeline_jni.cpp |
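
The change described in the commit message can be illustrated with a minimal sketch. This is not the code from `cp_et_runner.cpp`: the function names `copy_via_staging` and `copy_direct`, the constants, and the raw-pointer interface are all hypothetical. It also assumes that "28L" means 28 layers and that the factor of 2 stands for the key and value tensors. The sketch contrasts a two-pass copy through a staging buffer with a single direct copy, and returns the number of bytes moved so the difference is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Sizes quoted in the commit message. The meaning of each factor is an
// assumption: 28 layers, 2 tensors per layer (K and V), 100 x 8 x 128 floats.
constexpr std::size_t kLayers = 28;
constexpr std::size_t kTensorsPerLayer = 2;
constexpr std::size_t kFloatsPerTensor = 100 * 8 * 128;
constexpr std::size_t kSteps = 58;

// Bytes moved by one full pass over the KV state.
constexpr std::size_t kv_bytes_per_step() {
    return kLayers * kTensorsPerLayer * kFloatsPerTensor * sizeof(float);
}

// Before (hypothetical): outputs -> staging buffer -> next step's inputs.
// Two memcpy passes per step. Returns the total number of bytes copied.
std::size_t copy_via_staging(const std::vector<const float*>& outputs,
                             std::vector<float>& staging,
                             const std::vector<float*>& inputs,
                             std::size_t floats_per_tensor) {
    assert(outputs.size() == inputs.size());
    assert(staging.size() >= outputs.size() * floats_per_tensor);
    const std::size_t n = floats_per_tensor * sizeof(float);
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < outputs.size(); ++i) {
        std::memcpy(staging.data() + i * floats_per_tensor, outputs[i], n);
        bytes += n;
    }
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        std::memcpy(inputs[i], staging.data() + i * floats_per_tensor, n);
        bytes += n;
    }
    return bytes;
}

// After (hypothetical): outputs -> next step's inputs, one memcpy pass.
// Source and destination buffers must not overlap.
std::size_t copy_direct(const std::vector<const float*>& outputs,
                        const std::vector<float*>& inputs,
                        std::size_t floats_per_tensor) {
    assert(outputs.size() == inputs.size());
    const std::size_t n = floats_per_tensor * sizeof(float);
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < outputs.size(); ++i) {
        std::memcpy(inputs[i], outputs[i], n);
        bytes += n;
    }
    return bytes;
}
```

Under these assumptions, and with 4-byte floats, one pass over the KV state is 28 × 2 × 102,400 × 4 = 22,937,600 bytes, or about 1.33 GB over 58 steps. Dropping the staging pass removes one such pass per step, which is the same order of magnitude as the ~1.5GB figure quoted in the commit message. The exact accounting depends on details of the runner that are not shown here.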