From 3d435f9cddea57a7e6d57e8efddaaea68e80f749 Mon Sep 17 00:00:00 2001
From: Kazeia Team
Date: Tue, 14 Apr 2026 12:16:11 +0200
Subject: [PATCH] LLM: trim system prompt to drop ~27 prefill tokens (-1.3s TTFT)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Trimming the verbose 55-token system prompt was the cheapest TTFT win
on the kv-only path (52 ms per prefill token). Compacting it to 25
tokens while keeping the four load-bearing constraints (Kazeia
identity, French only, short replies, /no_think) measurably improved
end-to-end latency.

Validated 'Bonjour, comment vas-tu ?' on tablet:
  Before: prompt_tokens=80, TTFT=4202ms, total=5716ms
  After:  prompt_tokens=53, TTFT=2865ms, total=4034ms (-1.3s, -32% TTFT)

Reply quality preserved: "Bonjour ! Je vais bien, merci. Comment vas-tu ?"

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../src/main/java/com/kazeia/llm/ExecuTorchLlmEngine.kt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kazeia-android/app/src/main/java/com/kazeia/llm/ExecuTorchLlmEngine.kt b/kazeia-android/app/src/main/java/com/kazeia/llm/ExecuTorchLlmEngine.kt
index b40d080..055772a 100644
--- a/kazeia-android/app/src/main/java/com/kazeia/llm/ExecuTorchLlmEngine.kt
+++ b/kazeia-android/app/src/main/java/com/kazeia/llm/ExecuTorchLlmEngine.kt
@@ -33,10 +33,10 @@ class ExecuTorchLlmEngine(
     companion object {
         private const val TAG = "ExecuTorchLLM"
-        // /no_think disables Qwen3's chain-of-thought block so the full token
-        // budget goes to the actual answer. Short-response directive keeps
-        // TTS latency manageable.
-        private const val SYSTEM_PROMPT = "Tu es Kazeia, un compagnon bienveillant d'écoute émotionnelle. Réponds toujours en français, en 1 ou 2 phrases courtes (40 mots maximum). Pas de raisonnement, donne directement la réponse. /no_think"
+        // /no_think disables Qwen3's chain-of-thought block. Compact wording
+        // keeps prefill cost low: this prompt is ~25 tokens vs ~55 in the
+        // earlier verbose version → saves ~1.5 s of TTFT in kv-only mode.
+        private const val SYSTEM_PROMPT = "Tu es Kazeia, à l'écoute en français. Réponds en 1-2 phrases courtes, sans raisonnement. /no_think"
         private const val MODEL_DIR = "/data/local/tmp/kazeia-et"
         private const val MODEL_PATH = "$MODEL_DIR/hybrid_llama_qnn.pte"