Architecture Pipeline Kazeia

Version 2.0, 28 March 2026

Principle

The Kazeia pipeline is modular: STT and TTS are independent of each other and exchange only text with a chain of pluggable processors.

┌─────────┐     ┌──────────────────────────┐     ┌────────────┐
│   STT   │────→│   PROCESSOR CHAIN        │────→│    TTS     │
│(Whisper)│     │                          │     │ (Android/  │
│         │     │  ┌──────────────────┐    │     │ Chatterbox)│
│  Audio  │     │  │ Voice Commands   │    │     │            │
│  → Text │     │  └────────┬─────────┘    │     │    Text    │
│         │     │  ┌────────▼─────────┐    │     │    → Audio │
│         │     │  │ LLM (Qwen3 NPU)  │    │     │            │
│         │     │  └────────┬─────────┘    │     │            │
│         │     │  ┌────────▼─────────┐    │     │            │
│         │     │  │ (Future: RAG)    │    │     │            │
│         │     │  └────────┬─────────┘    │     │            │
│         │     │  ┌────────▼─────────┐    │     │            │
│         │     │  │ (Future: Emotion)│    │     │            │
│         │     │  └──────────────────┘    │     │            │
└─────────┘     └──────────────────────────┘     └────────────┘

Interfaces

SttEngine (Speech-to-Text)

interface SttEngine {
    suspend fun load(modelPath: String?)
    fun isLoaded(): Boolean
    suspend fun transcribe(audioData: ShortArray, language: String): TranscriptionResult
    fun release()
}

Implementations:

Class	Backend	Latency	NPU
`WhisperSttEngine`	whisper.cpp CPU	~1500ms	No
`WhisperNpuSttEngine`	ExecuTorch QNN	~50ms*	Yes
`AndroidSttEngine`	Google SpeechRecognizer	~500ms	No (cloud)
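
`transcribe` receives 16-bit PCM samples as a `ShortArray`, while whisper.cpp consumes 32-bit float samples in the range [-1, 1] at 16 kHz mono. The sketch below shows the kind of conversion a Whisper-backed engine might perform before inference. The helper names `pcm16ToFloat` and `durationMs` are hypothetical, and the assumption that Kazeia records at 16 kHz is not confirmed by this document.

```kotlin
// Hypothetical helpers: the names and the 16 kHz capture rate are assumptions,
// not taken from the Kazeia sources.
val SAMPLE_RATE_HZ = 16_000  // sample rate whisper.cpp expects

/** Converts signed 16-bit PCM samples to floats in [-1.0, 1.0). */
fun pcm16ToFloat(samples: ShortArray): FloatArray =
    FloatArray(samples.size) { i -> samples[i] / 32768f }

/** Duration of a mono buffer in milliseconds at the given sample rate. */
fun durationMs(sampleCount: Int, sampleRateHz: Int = SAMPLE_RATE_HZ): Long =
    sampleCount * 1000L / sampleRateHz

fun main() {
    val audio = ShortArray(SAMPLE_RATE_HZ)  // one second of silence
    val floats = pcm16ToFloat(audio)
    println("${floats.size} samples, ${durationMs(audio.size)} ms")
}
```

Dividing by 32768 rather than 32767 maps `Short.MIN_VALUE` exactly to -1.0 and keeps every sample inside the expected range.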

TtsEngine (Text-to-Speech)

interface TtsEngine {
    suspend fun load(modelPath: String?, voiceId: String?)
    fun isLoaded(): Boolean
    suspend fun synthesizeAndPlay(text: String, language: String, onStart: (() -> Unit)?, onComplete: (() -> Unit)?)
    fun stop()
    fun release()
}

Implementations:

Class	Backend	Latency	Voice cloning
`AndroidTtsEngine`	Google TTS	~200ms	No
`ChatterboxTtsEngine`	ONNX CPU/NPU	~3-10s	Yes

MessageProcessor (Middleware)

interface MessageProcessor {
    val name: String
    suspend fun initialize()
    fun isReady(): Boolean
    suspend fun process(input: String, context: ConversationContext): ProcessorResult
    fun release()
}

Implementations:

Class	Role	Priority
`VoiceCommandProcessor2`	Intercepts voice commands	1 (first)
`LlmProcessor`	Generates responses via the LLM	2
`EchoProcessor`	Echoes the input (fallback/test)	3
(Future) `EmotionProcessor`	Detects emotion in the voice	1.5
(Future) `RagProcessor`	Enriches the input with documents	1.5
(Future) `DiarizationProcessor`	Identifies the speaker	1
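
The fractional priorities (1.5) suggest that future processors are meant to slot in between existing ones without renumbering. A minimal sketch of one way to keep processors ordered by priority follows. `ProcessorRegistry` and `Registered` are hypothetical names: this document does not show how Kazeia stores priorities, and `pipeline.addProcessor` may work differently.

```kotlin
// Hypothetical registry: the source does not show how Kazeia stores priorities.
data class Registered(val name: String, val priority: Double)

class ProcessorRegistry {
    private val entries = mutableListOf<Registered>()

    /** Adds an entry and keeps the list ordered by ascending priority. */
    fun add(name: String, priority: Double) {
        entries += Registered(name, priority)
        entries.sortBy { it.priority }  // stable sort: equal priorities keep insertion order
    }

    /** Names in execution order. */
    fun order(): List<String> = entries.map { it.name }
}

fun main() {
    val registry = ProcessorRegistry()
    registry.add("LlmProcessor", 2.0)
    registry.add("EchoProcessor", 3.0)
    registry.add("VoiceCommandProcessor2", 1.0)
    registry.add("RagProcessor", 1.5)  // future processor slotted between 1 and 2
    println(registry.order())
}
```

Because the sort is stable, two processors that share a priority run in the order they were registered.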

Processing chain

The processors run in order. The first one that returns shouldContinueChain = false ends the chain.

Input: "Hello, how are you?"
  → VoiceCommandProcessor2: no command → continue
  → LlmProcessor: "I'm fine, how can I help you?" → done
Output: "I'm fine, how can I help you?"

Input: "stop"
  → VoiceCommandProcessor2: STOP_LISTENING command → done (shouldSpeak=false)
Output: (stops listening)
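
The short-circuit rule behind the two traces above can be sketched as a simple loop. This is an illustration under stated assumptions, not the Kazeia orchestrator: the types are simplified, non-suspending stand-ins (the real `process` is a `suspend` function that also takes a `ConversationContext`), and `Processor`, `runChain`, `voiceCommands`, and `llm` are hypothetical names.

```kotlin
// Simplified, non-suspending stand-ins for the Kazeia types (assumptions for this sketch).
data class ProcessorResult(
    val responseText: String,
    val shouldContinueChain: Boolean,
    val shouldSpeak: Boolean = true
)

fun interface Processor {
    fun process(input: String): ProcessorResult
}

/** Runs processors in order and stops at the first result that ends the chain. */
fun runChain(processors: List<Processor>, input: String): ProcessorResult? {
    var last: ProcessorResult? = null
    for (processor in processors) {
        val result = processor.process(input)
        last = result
        if (!result.shouldContinueChain) break
    }
    return last  // null only when the chain is empty
}

// Toy voice-command step: only recognises "stop".
val voiceCommands = Processor { input ->
    if (input.trim().lowercase() == "stop") {
        ProcessorResult("", shouldContinueChain = false, shouldSpeak = false)
    } else {
        ProcessorResult("", shouldContinueChain = true)
    }
}

// Toy LLM step: returns a canned reply and ends the chain.
val llm = Processor { _ ->
    ProcessorResult("I'm fine, how can I help you?", shouldContinueChain = false)
}

fun main() {
    val chain = listOf(voiceCommands, llm)
    println(runChain(chain, "Hello, how are you?"))
    println(runChain(chain, "stop"))
}
```

In the second call the LLM step never runs, which is why a recognised command costs no LLM inference time.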

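Adding a new processor

class MonNouveauProcessor : MessageProcessor {
    override val name = "MonProcesseur"

    // Nothing to load in this example
    override suspend fun initialize() {}
    override fun isReady(): Boolean = true

    override suspend fun process(input: String, context: ConversationContext): ProcessorResult {
        // Process the input (the emotion is hard-coded here for illustration)
        val emotion = "triste"
        val enrichedInput = "[$emotion] $input"
        // Share the enriched input through the context; the key name is illustrative
        context.metadata["enrichedInput"] = enrichedInput

        return ProcessorResult(
            responseText = "",
            shouldContinueChain = true,  // hand over to the next processor
            metadata = mapOf("emotion" to emotion)
        )
    }

    // Nothing to free in this example
    override fun release() {}
}

// Add it to the pipeline
pipeline.addProcessor(MonNouveauProcessor())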

ConversationContext

The context is shared by all processors:

data class ConversationContext(
    val history: List<ChatMessage>,         // conversation history
    val metadata: MutableMap<String, Any>,  // data shared between processors
    val language: String,                   // e.g. "fr"
    val speakerId: String?,                 // speaker identification
    val emotion: String?,                   // detected emotion
    val sessionId: String                   // session identifier
)

Processors can read from and write to metadata to communicate with each other.
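
A minimal sketch of that communication pattern: an upstream step writes a value into the shared map and a downstream step reads it back. `SketchContext`, `detectEmotion`, and `buildPrompt` are hypothetical names, and the context is trimmed down to the `metadata` field so the example stays self-contained.

```kotlin
// Trimmed-down context for illustration: only the shared metadata map is kept.
class SketchContext(val metadata: MutableMap<String, Any> = mutableMapOf())

/** Upstream step: stores a detected emotion (hard-coded here) in the shared map. */
fun detectEmotion(context: SketchContext) {
    context.metadata["emotion"] = "triste"
}

/** Downstream step: reads the emotion, if any, and prefixes the prompt with it. */
fun buildPrompt(input: String, context: SketchContext): String {
    // Values are typed Any, so a safe cast guards against a missing or mistyped entry.
    val emotion = context.metadata["emotion"] as? String
    return if (emotion != null) "[$emotion] $input" else input
}

fun main() {
    val context = SketchContext()
    println(buildPrompt("Bonjour", context))
    detectEmotion(context)
    println(buildPrompt("Bonjour", context))
}
```

Since the map holds `Any` values, each reader has to agree with the writer on both the key and the value type; a safe cast keeps a missing entry from crashing the chain.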

Current performance

Component	Backend	Latency
STT Whisper	CPU (whisper.cpp)	1500ms
STT Whisper	NPU (ExecuTorch)	~50ms*
LLM Qwen3-0.6B	NPU (ExecuTorch)	93 tok/s, TTFT 31ms
LLM Qwen3-1.7B	NPU (ExecuTorch)	46 tok/s, TTFT 27ms
TTS Android	Google	200ms
Full pipeline (CPU STT)	STT→LLM→TTS	~3-7s
Full pipeline (NPU STT)*	STT→LLM→TTS	~1-3s

*NPU STT integration in progress
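
The per-component figures can be combined into a rough end-to-end estimate. The sketch below assumes the stages run strictly one after another, treats each table entry as a fixed cost, and assumes a reply of 100 tokens; none of these assumptions comes from the measurements, and `LlmProfile`, `llmLatencyMs`, and `pipelineLatencyMs` are hypothetical names.

```kotlin
// Back-of-the-envelope estimate: stage figures come from the table,
// the reply length and the sequential model are assumptions.
data class LlmProfile(val tokensPerSecond: Double, val ttftMs: Double)

/** Time to generate `tokens` tokens: time to first token plus decode time. */
fun llmLatencyMs(profile: LlmProfile, tokens: Int): Double =
    profile.ttftMs + tokens * 1000.0 / profile.tokensPerSecond

/** Sequential estimate: STT, then the full LLM generation, then TTS latency. */
fun pipelineLatencyMs(sttMs: Double, profile: LlmProfile, tokens: Int, ttsMs: Double): Double =
    sttMs + llmLatencyMs(profile, tokens) + ttsMs

fun main() {
    val qwen06b = LlmProfile(tokensPerSecond = 93.0, ttftMs = 31.0)
    val qwen17b = LlmProfile(tokensPerSecond = 46.0, ttftMs = 27.0)
    val replyTokens = 100  // assumed reply length
    for ((label, profile) in listOf("Qwen3-0.6B" to qwen06b, "Qwen3-1.7B" to qwen17b)) {
        val cpu = pipelineLatencyMs(1500.0, profile, replyTokens, 200.0)
        val npu = pipelineLatencyMs(50.0, profile, replyTokens, 200.0)
        println("$label: CPU STT ≈ ${cpu.toInt()} ms, NPU STT ≈ ${npu.toInt()} ms")
    }
}
```

Under these assumptions the estimate lands at roughly 2.8 s to 3.9 s with CPU STT and 1.4 s to 2.5 s with NPU STT, in the same ballpark as the measured totals. Moving STT to the NPU removes about 1.45 s regardless of the model, which makes LLM decode time the dominant cost.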

Files

kazeia-android/app/src/main/java/com/kazeia/
├── core/
│   ├── LlmEngine.kt          # LLM interface
│   ├── SttEngine.kt           # STT interface
│   ├── TtsEngine.kt           # TTS interface
│   ├── VadEngine.kt           # VAD interface
│   ├── ConversationState.kt   # Pipeline states
│   └── Pipeline.kt            # MessageProcessor and PipelineOrchestrator interfaces
├── llm/
│   ├── ExecuTorchLlmEngine.kt # LLM on NPU via ExecuTorch
│   └── GenieLlmEngine.kt      # LLM via Genie SDK (abandoned)
├── stt/
│   ├── WhisperSttEngine.kt    # CPU STT via whisper.cpp
│   ├── WhisperNpuSttEngine.kt # NPU STT via ExecuTorch
│   └── AndroidSttEngine.kt    # Cloud STT via Google
├── tts/
│   ├── AndroidTtsEngine.kt    # Native Google TTS
│   └── ChatterboxTtsEngine.kt # TTS with voice cloning
├── conversation/
│   ├── LlmProcessor.kt        # LLM processor
│   ├── EchoProcessor.kt       # Echo processor
│   ├── VoiceCommandProcessor.kt   # Voice commands (JSON config)
│   ├── VoiceCommandProcessor2.kt  # MessageProcessor adapter
│   ├── PromptBuilder.kt       # Prompt construction
│   └── StoppingCriteria.kt    # Stopping criteria
├── service/
│   ├── KazeiaService.kt       # Android foreground service
│   └── KazeiaPipeline.kt      # Modular pipeline orchestrator
└── ui/
    ├── ChatActivity.kt         # User interface
    ├── ChatAdapter.kt          # RecyclerView adapter
    ├── MiniGraphView.kt        # Real-time graph
    └── ResourceMonitor.kt      # CPU/GPU/RAM monitoring

Kazeia project, Damien Micottis & Richard Loyer
