Two small changes:
* export_tts_text_embeddings.py now takes the voice wav as an optional
second CLI arg (defaults to damien_15s_24k.wav). Lets the same script
capture voice-prefix+suffix for any speaker wav without editing the
source — used today to test Elodie alongside Damien.
* synthesizeTextStreaming and generateSegmentAudioVC now run
trimTailLowEnergy only when n >= maxGen (n = generated token count).
The trim's 35%-of-peak
threshold is tuned to catch "page beg beg" filler after the talker
fails to emit EOS — but it was cutting valid speech when EOS fired
early (observed on Elodie seg 1: 10.08 s → 2.92 s, removing 7.16 s).
With the guard it's a no-op on converging generations and
only fires on the ~15% of segments that hit maxGen.
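
A minimal Python sketch of the guard described in the second bullet,
for illustration only. The real trimTailLowEnergy is not shown in this
commit, so trim_tail_low_energy below is a hypothetical stand-in (frame
RMS against 35% of the peak frame), and finalize_segment, frame_len and
rel_threshold are assumed names. frame_len=480 assumes 20 ms frames at
24 kHz.

```python
import math


def trim_tail_low_energy(samples, frame_len=480, rel_threshold=0.35):
    """Hypothetical stand-in for trimTailLowEnergy (real code not shown).

    Drops everything after the last frame whose RMS is at least
    rel_threshold times the RMS of the loudest frame.
    """
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return samples
    rms = [
        math.sqrt(
            sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len])
            / frame_len
        )
        for i in range(n_frames)
    ]
    peak = max(rms)
    if peak == 0.0:
        return samples  # pure silence: nothing to anchor the threshold on
    last_loud = max(i for i, r in enumerate(rms) if r >= rel_threshold * peak)
    return samples[: (last_loud + 1) * frame_len]


def finalize_segment(samples, n_tokens, max_gen):
    """Trim only when generation hit the token cap, i.e. no EOS fired."""
    if n_tokens >= max_gen:
        return trim_tail_low_energy(samples)
    return samples  # EOS fired before the cap: keep the audio as-is
```

With this shape, a segment that stops on EOS before maxGen never
reaches the energy threshold at all, so quiet but valid speech at the
end of the segment is left alone.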

Validation after the fix (Elodie, Baer monologue):
- seg 1: 126 tokens = maxGen → trimmed 10.08 s → 8.88 s (1.2 s cut,
the filler tail)
- seg 2: 105 tokens < 138 maxGen → no trim, 8.4 s kept as-is
- seg 3: 69 tokens < 96 maxGen → no trim, 5.6 s kept as-is

The voice prefix/suffix is speaker-invariant except for position 7 (the
xvector). Confirmed by capturing both Damien and Elodie and diffing:
prefix positions 0-6 and 8 are identical within 1e-8, the suffix is
identical within 1e-8, and only position 7 differs (xvector embedding
norm 10.36 vs 10.12).
That means swapping speakers on-device is a 45 KB file push — no app
rebuild, no re-export of the 297 MB vocabulary table.
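
The per-position comparison above can be sketched roughly as follows.
This is not the script that was actually used: differing_positions and
l2_norm are hypothetical helpers, and each capture is assumed to be a
list of per-position embedding vectors (plain lists of floats).

```python
import math


def differing_positions(a, b, tol=1e-8):
    """Return indices where two captures differ by more than tol.

    a and b are assumed to be equal-length lists of embedding vectors,
    one vector per prefix (or suffix) position.
    """
    if len(a) != len(b):
        raise ValueError("captures have different numbers of positions")
    return [
        i
        for i, (va, vb) in enumerate(zip(a, b))
        if max(abs(x - y) for x, y in zip(va, vb)) > tol
    ]


def l2_norm(vec):
    """Euclidean norm of one embedding vector."""
    return math.sqrt(sum(x * x for x in vec))
```

For two prefix captures that match everywhere except the xvector slot,
differing_positions would return [7], and l2_norm on that slot gives
the per-speaker norms quoted above.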

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>