LLM: trim system prompt to drop ~27 prefill tokens (-1.3s TTFT)

The verbose 55-token system prompt was the cheapest TTFT win on the
kv-only path (52 ms per prefill token). Compacting it to 25 tokens while
keeping the three load-bearing constraints — Kazeia identity, French only,
short replies, /no_think — measurably improved end-to-end latency.

Validated 'Bonjour, comment vas-tu ?' on tablet:
  Before: prompt_tokens=80, TTFT=4202ms, total=5716ms
  After:  prompt_tokens=53, TTFT=2865ms, total=4034ms (-1.3s, -32% TTFT)

Reply quality preserved: "Bonjour ! Je vais bien, merci. Comment vas-tu ?"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Kazeia Team 2026-04-14 12:16:11 +02:00
parent 7dc6704e95
commit 3d435f9cdd
1 changed files with 4 additions and 4 deletions

View File

@ -33,10 +33,10 @@ class ExecuTorchLlmEngine(
companion object {
private const val TAG = "ExecuTorchLLM"
// /no_think disables Qwen3's chain-of-thought block so the full token
// budget goes to the actual answer. Short-response directive keeps
// TTS latency manageable.
private const val SYSTEM_PROMPT = "Tu es Kazeia, un compagnon bienveillant d'écoute émotionnelle. Réponds toujours en français, en 1 ou 2 phrases courtes (40 mots maximum). Pas de raisonnement, donne directement la réponse. /no_think"
// /no_think disables Qwen3's chain-of-thought block. Compact wording
// keeps prefill cost low: this prompt is ~25 tokens vs ~55 in the
// earlier verbose version → saves ~1.5 s of TTFT in kv-only mode.
private const val SYSTEM_PROMPT = "Tu es Kazeia, à l'écoute en français. Réponds en 1-2 phrases courtes, sans raisonnement. /no_think"
private const val MODEL_DIR = "/data/local/tmp/kazeia-et"
private const val MODEL_PATH = "$MODEL_DIR/hybrid_llama_qnn.pte"