kazeia

Commit Graph

Author	SHA1	Message	Date
Kazeia Team	173606dae7	Stable: decoder 8T optimization + restore pre-computed embeds. BigVGAN: 8 threads (2757→1872ms); pre_conv/pre_transformer: 4 threads. Restored pre-computed embeds format (codec_sum+text from Python). Text-only trailing embeds don't work: the model needs codec_sum for EOS. For long phrases, pre-computed embeds must be generated from Python. RTF 1.26 on a short phrase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:42:02 +02:00
Kazeia Team	a688edc9ec	Reduce talker KV_LEN 100→64: saves 148ms (RTF 1.31). A KV window of 64 is sufficient for ~70-token generation (10 prefill + 58 gen). 36% less KV memcpy per talker step (28L × 2 × 64×8×128 vs 28L × 2 × 100×8×128). Generation: 3795ms → 3647ms; total: 6438ms → 6093ms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:47:30 +02:00
Kazeia Team	4dcc4bb8b3	Fix KV buffer + revert HTP decoder (BigVGAN too complex for HTP). Restored the intermediate KV buffer for the talker (direct output→input caused trembling from buffer overwrite during execute()). BigVGAN HTP compilation takes >5min, so it is not viable. RTF 1.35 with clean audio quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:37:50 +02:00
Kazeia Team	985fd9cff9	Direct output→input KV copy: RTF 1.51 → 1.31. Skip the intermediate KV buffer: copy output tensors directly into the next step's input pointers. Saves ~1.5GB/run of memcpy for the talker (28L × 2 × 100×8×128 floats × 58 steps), and similarly for CP. Generation: 4007ms → 3713ms; total: 7180ms → 6078ms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:23:45 +02:00
Kazeia Team	14f7e5b05f	Optimize CP+talker: eliminate prepare_input_tensors per step. Cache input tensor pointers after the first prepare_input_tensors call, then memcpy directly into them for all subsequent steps. Eliminates ~14000 mallocs per pipeline run (986 CP + 58 talker calls). Generation: 4640ms → 4007ms (-633ms); total RTF: 1.6 → 1.51. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:16:38 +02:00
Kazeia Team	e647911329	Shared Module C++ pipeline: RTF 1.6 with perfect quality. Key breakthrough: the C++ pipeline loop uses the SAME Method* instances that Java loaded (via Module::method("forward")). This gives: (1) same QNN compiled graph → identical numerical results → no trembling; (2) C++ loop → no Java Tensor/EValue allocation overhead; (3) prepare_input_tensors + memcpy + Method::execute (like cp_et_runner). Pipeline: talker ~20ms/step + CP ~44ms/step + decoder 2.8s = 7.3s for 4.64s of audio. Added to executorch JNI: Module.nativeSetCpModule(), which registers the CP module for the pipeline; Module.nativeRunTtsPipeline(...), which runs the full talker+CP loop in C++; and an updated executorch.jar with the new native method declarations. From RTF 4.9 (start of session) to RTF 1.6 with impeccable audio quality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:05:58 +02:00

6 Commits
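
The figures quoted in the commit messages above can be sanity-checked with a little arithmetic. The sketch below is not part of the repository. It assumes that RTF means processing time divided by the duration of the generated audio, that the `8×128` factor is heads × head dimension, and that "floats" are 4 bytes each. The function names `real_time_factor` and `kv_copy_bytes` are hypothetical helpers introduced only for illustration.

```python
def real_time_factor(processing_ms: float, audio_ms: float) -> float:
    """Processing time divided by duration of audio produced (lower is faster).

    Assumption: this is the RTF definition the commit messages use.
    """
    return processing_ms / audio_ms


def kv_copy_bytes(kv_len: int, layers: int = 28, heads: int = 8,
                  head_dim: int = 128, bytes_per_elem: int = 4) -> int:
    """Bytes copied per talker step: one K and one V tensor for every layer.

    Assumptions: 8 x 128 is heads x head_dim, and elements are 4-byte floats.
    """
    return layers * 2 * kv_len * heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Commit e647911329: 7.3 s of processing for 4.64 s of audio.
    print(f"RTF: {real_time_factor(7300, 4640):.2f}")

    # Commit a688edc9ec: KV_LEN 100 -> 64.
    before = kv_copy_bytes(100)
    after = kv_copy_bytes(64)
    print(f"per-step KV copy: {before} -> {after} bytes "
          f"({1 - after / before:.0%} less)")

    # Commit 985fd9cff9: 58 talker steps at KV_LEN 100.
    print(f"talker KV copy per run: {before * 58 / 1e9:.2f} GB")
```

Under these assumptions the talker copy at KV_LEN 100 comes to roughly 1.33 GB over 58 steps, the same order of magnitude as the "~1.5GB/run" quoted in commit 985fd9cff9, and the KV_LEN change reduces the per-step copy by 36%, as stated in commit a688edc9ec.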