# Qwen3-TTS Integration Specification

**Version:** 2.0

**Date:** 2026-01-27

**Status:** Draft

**Author:** Makima Engineering

## Table of Contents

1. [Overview](#1-overview)
2. [Functional Requirements](#2-functional-requirements)
3. [Non-Functional Requirements](#3-non-functional-requirements)
4. [Architecture Specification](#4-architecture-specification)
5. [API Contract](#5-api-contract)
6. [Voice Asset Requirements](#6-voice-asset-requirements)
7. [Testing Strategy](#7-testing-strategy)
8. [Implementation Phases](#8-implementation-phases)
9. [Appendix](#appendix)

---

## 1. Overview

### 1.1 Purpose

This specification defines the integration of Qwen3-TTS-12Hz-0.6B-Base as a replacement for the existing Chatterbox-Turbo TTS implementation in the makima system.

The new implementation is a **pure Rust** solution using the **candle** ML framework: no Python, no separate microservice. The TTS model runs directly inside the main makima process.

The implementation will provide:

- **Streaming TTS** with near-real-time audio synthesis
- **Voice cloning** with default Makima voice (Japanese voice actress speaking English)
- **Bidirectional speech integration** for the Listen page
- **WebSocket-based streaming API** for low-latency delivery

### 1.2 Background

The current TTS implementation (Chatterbox-Turbo-ONNX) has limitations:

- No streaming support (batch-only generation)
- No HTTP/WebSocket endpoint exposed
- High latency for interactive use cases

Qwen3-TTS offers significant improvements:

- **97ms** end-to-end latency (vs.
  batch processing)
- **10 languages** supported including Japanese cross-lingual cloning
- **3-second** reference audio for voice cloning
- **Dual-track streaming architecture**

### 1.3 Scope

This specification covers:

- WebSocket endpoint `/api/v1/speak` for streaming TTS
- Pure Rust candle-based model inference running in-process
- Voice asset management
- Testing and benchmarking

Out of scope:

- ONNX export of Qwen3-TTS (not available)
- Instruction-following TTS features (base model only)
- Full replacement of STT/Listen functionality

---

## 2. Functional Requirements

### 2.1 WebSocket Endpoint: `/api/v1/speak`

The TTS service SHALL be exposed via a WebSocket endpoint at `/api/v1/speak` for streaming audio synthesis.

#### 2.1.1 Connection Flow

```
Client                              Server (Rust/Axum)
|                                  |
|--- WS Connect ------------------>|
|                                  | [Load TTS model lazily if needed]
|<-- Ready (session_id) -----------|
|                                  |
|--- Start (config) -------------->|
|<-- Started ----------------------| [Load voice prompt]
|                                  |
|--- Speak (text) ---------------->| [Direct candle model inference]
|                                  |
|<-- AudioChunk (binary) ----------|
|<-- AudioChunk (binary) ----------|
|<-- Complete ---------------------|
|                                  |
|--- Stop ------------------------>|
|<-- Stopped ----------------------|
|                                  |
```

#### 2.1.2 Voice Cloning

The system SHALL support voice cloning with the following modes:

| Mode | Description | Requirements |
|------|-------------|--------------|
| Default (Makima) | Pre-loaded Makima voice | None (auto-selected) |
| Custom Voice | User-provided reference | Audio file + transcript |
| X-Vector Only | Speaker embedding only | Audio file (no transcript) |

**Default Voice Behavior:**

- If no voice is specified, use the pre-loaded Makima voice prompt
- Makima voice SHALL be a Japanese voice actress (Tomori Kusunoki) speaking English
- Voice prompt is pre-computed at model load time for zero-latency switching

#### 2.1.3 Message Protocol

##### Client-to-Server Messages

All messages use JSON format
with a `type` field for routing.

**Start Message** - Initialize TTS session

```json
{
  "type": "start",
  "sampleRate": 24000,
  "encoding": "pcm16",
  "voice": "makima",
  "language": "English",
  "authToken": "optional-jwt-token",
  "contractId": "optional-contract-uuid"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Must be "start" |
| `sampleRate` | number | No | Output sample rate (default: 24000) |
| `encoding` | string | No | Audio encoding: "pcm16", "pcm32f" (default: "pcm16") |
| `voice` | string | No | Voice ID or "makima" (default: "makima") |
| `language` | string | No | Output language (default: "English") |
| `authToken` | string | No | JWT for authenticated sessions |
| `contractId` | string | No | Contract ID for context |

**Speak Message** - Request speech synthesis

```json
{
  "type": "speak",
  "text": "Hello, I am Makima.",
  "priority": "normal"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Must be "speak" |
| `text` | string | Yes | Text to synthesize |
| `priority` | string | No | "high" or "normal" (default: "normal") |

**Stop Message** - End session

```json
{
  "type": "stop",
  "reason": "user_requested"
}
```

**Cancel Message** - Cancel current synthesis

```json
{
  "type": "cancel"
}
```

##### Server-to-Client Messages

**Ready Message** - Session established

```json
{
  "type": "ready",
  "sessionId": "uuid-string",
  "voiceLoaded": "makima",
  "capabilities": {
    "streaming": true,
    "languages": ["English", "Japanese", "Chinese", "Korean", "German", "French", "Russian", "Portuguese", "Spanish", "Italian"]
  }
}
```

**Started Message** - TTS session configured

```json
{
  "type": "started",
  "sampleRate": 24000,
  "encoding": "pcm16",
  "voice": "makima"
}
```

**AudioChunk Message** - Streaming audio data

```json
{
  "type": "audioChunk",
  "data": "<base64-encoded PCM audio>",
  "sequenceNumber": 1,
  "isFinal": false,
  "timestampMs": 1234567890
}
```

For binary transport
(recommended for performance):

- Server MAY send raw binary WebSocket frames
- Binary frames contain PCM audio data directly
- JSON control messages indicate start/end of audio stream

**Complete Message** - Synthesis finished

```json
{
  "type": "complete",
  "durationMs": 1500,
  "charactersProcessed": 25,
  "audioLengthMs": 2100
}
```

**Error Message** - Error occurred

```json
{
  "type": "error",
  "code": "SYNTHESIS_ERROR",
  "message": "Failed to generate audio",
  "recoverable": true
}
```

**Stopped Message** - Session ended

```json
{
  "type": "stopped",
  "reason": "user_requested"
}
```

#### 2.1.4 Integration with Listen Page

The TTS endpoint SHALL integrate with the existing Listen page (`/api/v1/listen`) to enable bidirectional speech:

**Bidirectional Flow:**

```
User Speech -> /api/v1/listen (STT) -> Transcription
                                            |
                                            v
                             LLM Processing / Task Creation
                                            |
                                            v
Response Text -> /api/v1/speak (TTS) -> Audio -> User
```

**Implementation Requirements:**

1. Both endpoints SHALL support the same `contractId` for context sharing
2. TTS SHALL support interruption when new STT input is detected
3. Session management SHALL coordinate between STT and TTS

### 2.2 Voice Configuration API

#### 2.2.1 List Available Voices

```
GET /api/v1/voices
```

Response:

```json
{
  "voices": [
    {
      "id": "makima",
      "name": "Makima (Default)",
      "language": "Japanese",
      "description": "Default Makima voice (Tomori Kusunoki)",
      "isDefault": true
    }
  ]
}
```

#### 2.2.2 Upload Custom Voice (Future)

```
POST /api/v1/voices
Content-Type: multipart/form-data

audio: <binary audio file>
transcript: "Text spoken in the audio"
name: "Custom Voice"
```

---

## 3. Non-Functional Requirements

### 3.1 Latency Requirements

| Metric | Target | Maximum | Notes |
|--------|--------|---------|-------|
| First Audio Byte | < 200ms | 500ms | From text submission to first audio chunk |
| Subsequent Chunks | < 50ms | 100ms | Inter-chunk latency |
| End-to-End Latency | < 300ms | 800ms | Total time for short phrases |
| Voice Prompt Loading | < 500ms | 2000ms | One-time at session start |

**Measurement Points:**

- T0: Client sends "speak" message
- T1: First audio chunk received by client
- T2: Last audio chunk received ("complete" message)
- First Audio Latency = T1 - T0
- Total Latency = T2 - T0

### 3.2 Audio Quality Requirements

| Specification | Value |
|---------------|-------|
| Output Sample Rate | 24,000 Hz |
| Bit Depth | 16-bit (PCM16) or 32-bit float |
| Channels | Mono (1 channel) |
| Audio Codec | Raw PCM (WebSocket), WAV (download) |
| Voice Similarity | > 0.90 speaker similarity score |

**Quality Metrics:**

- MOS (Mean Opinion Score): Target > 4.0
- Speaker similarity to reference: Target > 0.90
- No audible artifacts or glitches in streaming mode

### 3.3 Hardware Requirements

The TTS model runs directly in the makima process using candle.


| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | CUDA-capable, 4GB VRAM (or Metal on macOS) | RTX 3060+ with 8GB+ VRAM |
| GPU Memory | 4GB | 8GB |
| System RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB |

**GPU Memory Breakdown:**

- Model weights (bf16): ~1.2GB
- Speech tokenizer: ~682MB
- KV cache during inference: ~1-2GB
- Safety margin: ~1GB

**CPU Fallback:**

- candle supports CPU with MKL for systems without GPU
- Latency will be higher but functional

### 3.4 Scalability Requirements

| Metric | Target |
|--------|--------|
| Concurrent Sessions | 10 per GPU |
| Requests per Second | 50 text-to-speech requests |
| Audio Throughput | 10 hours of audio per hour |

### 3.5 Availability Requirements

| Metric | Target |
|--------|--------|
| Service Uptime | 99.5% |
| Recovery Time | < 30 seconds |
| Graceful Degradation | Fall back to batch mode if streaming fails |

---

## 4. Architecture Specification

### 4.1 System Architecture

```
+-------------------------------------------------------------------------+
|                           Client Application                            |
|  +-------------+      +-------------+  +------------------------------+ |
|  |   Listen    |      |    Speak    |  |        UI Components         | |
|  |  (STT UI)   |      |  (TTS UI)   |  |   (Audio Player, Controls)   | |
|  +------+------+      +------+------+  +------------------------------+ |
+---------|--------------------|------------------------------------------+
          | WebSocket          | WebSocket
          | /api/v1/listen     | /api/v1/speak
          |                    |
+---------|--------------------|------------------------------------------+
|         |                    |   Makima Server (Rust/Axum)              |
|  +------v--------------------v------+                                   |
|  |         WebSocket Router         |                                   |
|  |    (axum WebSocket handlers)     |                                   |
|  +------+--------------------+------+                                   |
|         |                    |                                          |
|  +------v------+      +------v------+  +-----------------------------+  |
|  |   Listen    |      |    Speak    |  |        Shared State         |  |
|  |   Handler   |      |   Handler   |  | - ML Models (STT)           |  |
|  |  (STT/ML)   |      |  (TTS/ML)   |  | - TTS Model (candle)        |  |
|  +-------------+      +------+------+  | - Voice Prompt Cache        |  |
|                              |         | - Session Manager           |  |
|                              |         +-----------------------------+  |
|                       +------v------+                                   |
|                       | TTS Module  |                                   |
|                       |  (candle)   |                                   |
|                       +------+------+                                   |
|                              |                                          |
|                       +------v------+                                   |
|                       |  Qwen3-TTS  |                                   |
|                       | Components  |                                   |
|                       | - LM (28L)  |                                   |
|                       | - Code Pred |                                   |
|                       | - Speech Tok|                                   |
|                       +-------------+                                   |
+-------------------------------------------------------------------------+
```

### 4.2 TTS Module Structure

```
makima/src/tts/
├── mod.rs                      // TTS trait + factory (select Chatterbox vs Qwen3)
├── chatterbox.rs               // Existing ONNX-based Chatterbox (moved from tts.rs)
└── qwen3/
    ├── mod.rs                  // Qwen3TTS public API
    ├── model.rs                // Qwen3 LM transformer (28 layers)
    ├── code_predictor.rs       // MTP module (5 layers, 16 codebooks)
    ├── speech_tokenizer.rs     // Encoder + Decoder (causal ConvNet)
    ├── config.rs               // Model config from config.json
    └── generate.rs             // Autoregressive generation loop with KV cache
```

#### 4.2.1 TTS Trait

```rust
// makima/src/tts/mod.rs

use async_trait::async_trait;

/// Trait for text-to-speech implementations.
#[async_trait]
pub trait TtsEngine: Send + Sync {
    /// Generate audio from text with a given voice prompt.
    // NOTE: the sample type was lost from the source text; `Vec<f32>`
    // (mono PCM samples) is an assumption.
    async fn generate(
        &self,
        text: &str,
        voice_id: &str,
        language: &str,
    ) -> Result<Vec<f32>, TtsError>;

    /// Load and cache a voice prompt from reference audio.
    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;

    /// Check if the engine is ready for inference.
    fn is_ready(&self) -> bool;
}

/// Select the appropriate TTS engine based on configuration.
pub fn create_engine(config: &TtsConfig) -> Box<dyn TtsEngine> {
    match config.engine {
        TtsEngineType::Qwen3 => Box::new(qwen3::Qwen3Tts::new(config)),
        TtsEngineType::Chatterbox => Box::new(chatterbox::ChatterboxTts::new(config)),
    }
}
```

#### 4.2.2 Qwen3 Candle Implementation

The Qwen3 module implements the three core model components using candle:

1. **Language Model**: 28-layer transformer using candle-transformers' Qwen2 attention with TTS-specific modifications
2. **Code Predictor**: 5-layer MTP module predicting 16 codebook layers
3. **Speech Tokenizer**: GAN-based codec with Conv1d encoder/decoder

**Key candle features used:**

- `candle_core::Tensor` for all tensor operations
- `candle_nn::Module` for model layers
- `candle_nn::VarBuilder` for loading safetensors weights
- `candle_core::Device` for GPU/CPU selection

#### 4.2.3 Model Loading

Models are loaded lazily on first TTS request, following the pattern established by `listen.rs`:

```rust
// Models held in SharedState and initialized once, on first use.
// `tts_models` is assumed to be a lazily initialized async cell
// (for example `tokio::sync::OnceCell<TtsModels>`).
pub struct TtsModels {
    pub engine: Box<dyn TtsEngine>,
    pub voice_cache: VoicePromptCache,
}

impl AppState {
    pub async fn get_tts_models(&self) -> Result<&TtsModels, TtsError> {
        self.tts_models.get_or_try_init(|| async {
            // Load safetensors weights via candle
            // Initialize voice cache with default Makima voice
        }).await
    }
}
```

### 4.3 Speak Handler

```rust
// makima/src/server/handlers/speak.rs

use axum::extract::ws::{WebSocket, WebSocketUpgrade};
use axum::extract::State;
use axum::response::Response;
use uuid::Uuid;

/// WebSocket handler for TTS streaming.
/// Calls the TTS engine directly: no proxy, no external service.
pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(move |socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(socket: WebSocket, state: SharedState) {
    let session_id = Uuid::new_v4().to_string();

    // Get or lazily load TTS models
    let tts = match state.get_tts_models().await {
        Ok(tts) => tts,
        Err(e) => {
            // Send error and close
            return;
        }
    };

    // Handle WebSocket messages directly
    // Parse JSON commands, run inference, stream audio chunks back
}
```

### 4.4 Voice Prompt Caching

Voice prompts are cached in-memory using an LRU cache:

```rust
// makima/src/tts/mod.rs

pub struct VoicePromptCache {
    // NOTE: the inner type was lost from the source text; an
    // `lru::LruCache` keyed by voice ID is an assumption.
    cache: tokio::sync::Mutex<LruCache<String, VoicePrompt>>,
}

impl VoicePromptCache {
    pub fn new(max_size: usize) -> Self { /* ... */ }
    pub async fn get(&self, voice_id: &str) -> Option<VoicePrompt> { /* ... */ }
    pub async fn insert(&self, voice_id: String, prompt: VoicePrompt) { /* ... */ }
}
```

### 4.5 Error Handling and Recovery

#### 4.5.1 Error Categories

| Error Code | Category | Recoverable | Action |
|------------|----------|-------------|--------|
| `MODEL_LOADING` | Initialization | Yes | Wait and retry |
| `SYNTHESIS_ERROR` | Generation | Yes | Retry with same input |
| `INVALID_TEXT` | Input | No | Return error to client |
| `VOICE_NOT_FOUND` | Configuration | No | Fall back to default voice |
| `GPU_OUT_OF_MEMORY` | Resource | Yes | Clear cache, retry on CPU |
| `TIMEOUT` | Inference | Yes | Retry with backoff |

---

## 5. API Contract

### 5.1 WebSocket Message Formats

#### 5.1.1 Client-to-Server Messages

```typescript
// TypeScript type definitions for client implementation

interface StartMessage {
  type: "start";
  sampleRate?: number;             // Default: 24000
  encoding?: "pcm16" | "pcm32f";   // Default: "pcm16"
  voice?: string;                  // Default: "makima"
  language?: string;               // Default: "English"
  authToken?: string;              // JWT for authenticated sessions
  contractId?: string;             // Contract context
}

interface SpeakMessage {
  type: "speak";
  text: string;                    // Required: text to synthesize
  priority?: "normal" | "high";    // Default: "normal"
}

interface CancelMessage {
  type: "cancel";
}

interface StopMessage {
  type: "stop";
  reason?: string;
}

type ClientMessage = StartMessage | SpeakMessage | CancelMessage | StopMessage;
```

#### 5.1.2 Server-to-Client Messages

```typescript
interface ReadyMessage {
  type: "ready";
  sessionId: string;
  voiceLoaded: string;
  capabilities: {
    streaming: boolean;
    languages: string[];
  };
}

interface StartedMessage {
  type: "started";
  sampleRate: number;
  encoding: string;
  voice: string;
}

interface AudioChunkMessage {
  type: "audioChunk";
  data: string;                    // Base64-encoded PCM audio
  sequenceNumber: number;
  isFinal: boolean;
  timestampMs: number;
}

interface CompleteMessage {
  type: "complete";
  durationMs: number;
  charactersProcessed: number;
  audioLengthMs: number;
}

interface ErrorMessage {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

interface StoppedMessage {
  type: "stopped";
  reason: string;
}

type ServerMessage =
  | ReadyMessage
  | StartedMessage
  | AudioChunkMessage
  | CompleteMessage
  | ErrorMessage
  | StoppedMessage;
```

### 5.2 Error Codes

| Code | HTTP-like | Description | Recovery |
|------|-----------|-------------|----------|
| `MODEL_LOADING` | 503 | Model still loading | Wait and retry |
| `SYNTHESIS_ERROR` | 500 | Failed to generate audio | Retry |
| `INVALID_TEXT` | 400 | Text is empty or invalid | Fix input |
| `VOICE_NOT_FOUND` | 404 | Requested voice doesn't exist | Use default |
| `UNAUTHORIZED` | 401 | Invalid or missing auth token | Re-authenticate |
| `RATE_LIMITED` | 429 | Too many requests | Back off |
| `TIMEOUT` | 408 | Operation timed out | Retry |
| `CANCELLED` | 499 | Client cancelled request | N/A |

### 5.3 Session Management

#### 5.3.1 Session Lifecycle

```
DISCONNECTED -> CONNECTING -> READY -> STARTED -> SPEAKING -> READY -> ... -> STOPPED
                    |           |         |          |
                    v           v         v          v
                  ERROR       ERROR     ERROR     STOPPED
```

---

## 6. Voice Asset Requirements

### 6.1 Makima Voice Clip Specifications

#### 6.1.1 Audio Requirements

| Specification | Requirement |
|---------------|-------------|
| Duration | 5-10 seconds (minimum 3s) |
| Format | WAV (PCM) |
| Sample Rate | 24,000 Hz or higher |
| Bit Depth | 16-bit or higher |
| Channels | Mono (preferred) or Stereo |
| Content | Clear speech, natural tone |
| Background | Minimal noise/music |

#### 6.1.2 Content Guidelines

**DO:**

- Use dialogue with varied intonation
- Include multiple phonemes
- Capture natural speaking rhythm
- Extract from clean audio scenes

**DON'T:**

- Include background music
- Use shouting or whispering
- Include sound effects
- Use heavily processed audio

#### 6.1.3 Transcript Requirements

| Specification | Requirement |
|---------------|-------------|
| Format | Plain text (.txt) or JSON |
| Encoding | UTF-8 |
| Content | Exact transcription of audio |
| Language | Japanese (for Japanese reference) |

### 6.2 Storage Location and Management

#### 6.2.1 Directory Structure

```
models/
└── voices/
    ├── makima/
    │   ├── reference.wav       # Primary reference audio
    │   ├── transcript.txt      # Plain text transcript
    │   ├── transcript.json     # Structured transcript (optional)
    │   └── metadata.json       # Voice metadata
    ├── makima-alt/             # Alternative Makima clips (future)
    │   └── ...
    └── custom/                 # User-uploaded voices (future)
        └── {voice_id}/
            ├── reference.wav
            ├── transcript.txt
            └── metadata.json
```

---

## 7. Testing Strategy

### 7.1 Unit Tests

#### 7.1.1 Rust TTS Module Tests

```rust
// makima/src/tts/qwen3/tests.rs

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_config_loading() {
        let config = Qwen3Config::from_json("test_config.json").unwrap();
        assert_eq!(config.hidden_size, 1024);
        assert_eq!(config.num_layers, 28);
    }

    // The cache API in 4.4 is async, so this test runs on the tokio runtime.
    // `prompt_a`, `prompt_b` and `prompt_c` are test fixtures built elsewhere.
    #[tokio::test]
    async fn test_voice_prompt_cache_lru() {
        let cache = VoicePromptCache::new(2);
        cache.insert("a".to_string(), prompt_a).await;
        cache.insert("b".to_string(), prompt_b).await;
        cache.get("a").await; // access a
        cache.insert("c".to_string(), prompt_c).await; // should evict b
        assert!(cache.get("a").await.is_some());
        assert!(cache.get("b").await.is_none());
        assert!(cache.get("c").await.is_some());
    }

    #[test]
    fn test_speak_handler_message_parsing() {
        let json = r#"{"type": "start", "voice": "makima"}"#;
        let msg: SpeakClientMessage = serde_json::from_str(json).unwrap();
        match msg {
            SpeakClientMessage::Start(start) => {
                assert_eq!(start.voice, Some("makima".to_string()));
            }
            _ => panic!("Expected Start message"),
        }
    }
}
```

### 7.2 Integration Tests

```rust
// tests/tts_integration.rs

#[tokio::test]
async fn test_speak_websocket_flow() {
    // Start test server with TTS enabled
    let state = create_test_state_with_tts().await;
    let app = make_router(state);

    // Connect WebSocket; the server sends "ready" on connect (see 2.1.1)
    let mut ws = connect_ws("/api/v1/speak").await;
    let ready = ws.recv_json().await;
    assert_eq!(ready["type"], "ready");

    // Send start and wait for the session to be configured
    ws.send_json(json!({"type": "start", "voice": "makima"})).await;
    let started = ws.recv_json().await;
    assert_eq!(started["type"], "started");

    // Send speak
    ws.send_json(json!({"type": "speak", "text": "Hello."})).await;

    // Collect audio chunks until the "complete" control message arrives
    let mut chunks = vec![];
    loop {
        let msg = ws.recv().await;
        match msg {
            WsMsg::Binary(data) => chunks.push(data),
            WsMsg::Text(json) => {
                let data: Value = serde_json::from_str(&json).unwrap();
                if data["type"] == "complete" {
                    break;
                }
            }
        }
    }
    assert!(!chunks.is_empty());
}
```

### 7.3 Performance Targets

| Metric | Target | Acceptable | Warning |
|--------|--------|------------|---------|
| First Audio (short) | < 150ms | < 200ms | > 300ms |
| First Audio (medium) | < 200ms | < 300ms | > 500ms |
| First Audio (long) | < 300ms | < 500ms | > 800ms |
| Inter-chunk | < 30ms | < 50ms | > 100ms |
| Memory (GPU) | < 4GB | < 6GB | > 8GB |
| Memory (CPU) | < 2GB | < 4GB | > 8GB |

---

## 8. Implementation Phases

### Phase 1: Candle-Based Qwen3-TTS Module (Week 1-2)

**Deliverables:**

- [ ] `makima/src/tts/mod.rs`: TTS trait + factory
- [ ] `makima/src/tts/chatterbox.rs`: Move existing code from tts.rs
- [ ] `makima/src/tts/qwen3/model.rs`: 28-layer LM backbone (extend candle Qwen2)
- [ ] `makima/src/tts/qwen3/code_predictor.rs`: MTP module (5 layers, 16 codebooks)
- [ ] `makima/src/tts/qwen3/speech_tokenizer.rs`: ConvNet encoder/decoder + RVQ
- [ ] `makima/src/tts/qwen3/config.rs`: Model config from config.json
- [ ] `makima/src/tts/qwen3/generate.rs`: Autoregressive generation with KV cache
- [ ] Add `candle-core`, `candle-nn`, `candle-transformers` to Cargo.toml

**Success Criteria:**

- Model loads safetensors weights successfully
- Can generate audio from text via direct inference
- First audio latency < 500ms (initial, unoptimized)

### Phase 2: WebSocket Handler + Voice Assets (Week 2-3)

**Deliverables:**

- [ ] Update `makima/src/server/handlers/speak.rs`: Direct TTS handler (no proxy)
- [ ] Lazy model loading via `SharedState`
- [ ] Voice prompt caching
- [ ] Makima voice asset acquisition and processing
- [ ] Basic error handling and session management

**Success Criteria:**

- `/api/v1/speak` endpoint produces streaming audio
- Default Makima voice works
- Error handling matches specification

### Phase 3: Optimization + Integration (Week 3-4)

**Deliverables:**

- [ ] Streaming audio generation (token-by-token decoding)
- [ ] GPU memory optimization
- [ ] Listen page integration for bidirectional speech
- [ ] Session coordination between STT and TTS
- [ ] Full test suite (unit, integration)
- [ ] Latency benchmarks

**Success Criteria:**

- First audio latency < 200ms
- Memory usage < 6GB
- All tests passing
- Documentation complete

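
The Phase 3 latency criterion and the measurement points T0, T1 and T2 from Section 3.1 can be sketched as a small std-only helper for the latency benchmarks. This is an illustration rather than part of the planned module: `SpeakTimings`, `LatencyVerdict`, `classify` and `classify_first_audio` are hypothetical names, and treating the 500ms "Maximum" from Section 3.1 as an inclusive bound is an assumption.

```rust
use std::time::Duration;

/// Timestamps for one "speak" request, relative to an arbitrary clock origin.
/// Fields mirror the measurement points in Section 3.1 (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct SpeakTimings {
    pub t0: Duration, // client sends "speak"
    pub t1: Duration, // first audio chunk received by client
    pub t2: Duration, // last audio chunk received ("complete" message)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LatencyVerdict {
    WithinTarget,
    WithinMaximum,
    OverMaximum,
}

impl SpeakTimings {
    /// First Audio Latency = T1 - T0 (clamped at zero if clocks are skewed).
    pub fn first_audio_latency(&self) -> Duration {
        self.t1.saturating_sub(self.t0)
    }

    /// Total Latency = T2 - T0.
    pub fn total_latency(&self) -> Duration {
        self.t2.saturating_sub(self.t0)
    }
}

/// Compare a measured latency against a target and a maximum.
/// Assumption: the target is exclusive ("< target"), the maximum inclusive.
pub fn classify(latency: Duration, target: Duration, maximum: Duration) -> LatencyVerdict {
    if latency < target {
        LatencyVerdict::WithinTarget
    } else if latency <= maximum {
        LatencyVerdict::WithinMaximum
    } else {
        LatencyVerdict::OverMaximum
    }
}

/// First Audio Byte row of Section 3.1: target < 200ms, maximum 500ms.
pub fn classify_first_audio(timings: &SpeakTimings) -> LatencyVerdict {
    classify(
        timings.first_audio_latency(),
        Duration::from_millis(200),
        Duration::from_millis(500),
    )
}
```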

---

## Appendix

### A. Dependencies

#### Rust (Cargo.toml additions)

```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
candle-transformers = "0.8"
# Keep existing: tokenizers, hf-hub, ndarray (for compatibility)
```

### B. Environment Variables

```bash
# TTS Configuration
TTS_ENGINE=qwen3                             # "qwen3" or "chatterbox"
TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
TTS_DEVICE=cuda:0                            # "cuda:0", "metal", or "cpu"
TTS_VOICES_DIR=models/voices
TTS_DEFAULT_VOICE=makima
```

### C. References

1. [Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
2. [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
3. [Qwen3-TTS Technical Report (arXiv:2601.15621)](https://arxiv.org/abs/2601.15621)
4. [Candle: Hugging Face Rust ML Framework](https://github.com/huggingface/candle)
5. [axum WebSocket Documentation](https://docs.rs/axum/latest/axum/extract/ws/index.html)
6. [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md): Detailed feasibility analysis

### D. Glossary

| Term | Definition |
|------|------------|
| **TTS** | Text-to-Speech: Converting text input to audio output |
| **STT** | Speech-to-Text: Converting audio input to text output |
| **Voice Cloning** | Creating synthetic speech that mimics a specific speaker |
| **Voice Prompt** | Pre-computed speaker embedding for voice cloning |
| **Candle** | Hugging Face's minimalist Rust ML framework |
| **SafeTensors** | Efficient, safe model weight serialization format |
| **RVQ** | Residual Vector Quantization: multi-codebook audio tokenization |
| **MTP** | Multi-Token Prediction: code predictor generating 16 codebook layers |
| **bf16** | Brain floating-point 16-bit format for GPU computation |

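
### E. Example: Parsing `TTS_DEVICE`

The `TTS_DEVICE` values listed in Appendix B (`cuda:0`, `metal`, `cpu`) could be turned into a typed value before a compute device is selected. The sketch below is illustrative only: `TtsDevice`, `parse_tts_device` and `tts_device_from_env` are hypothetical names that appear nowhere else in this specification, case-insensitive matching is an assumption, and the mapping onto `candle_core::Device` is deliberately left out.

```rust
/// Typed form of the `TTS_DEVICE` setting (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TtsDevice {
    /// CUDA device with the given ordinal, e.g. `cuda:0`.
    Cuda(usize),
    Metal,
    Cpu,
}

/// Parse a `TTS_DEVICE` string such as "cuda:0", "metal" or "cpu".
/// Assumption: surrounding whitespace and letter case are ignored.
pub fn parse_tts_device(value: &str) -> Result<TtsDevice, String> {
    let normalized = value.trim().to_ascii_lowercase();
    match normalized.as_str() {
        "cpu" => Ok(TtsDevice::Cpu),
        "metal" => Ok(TtsDevice::Metal),
        other => match other.strip_prefix("cuda:") {
            // The part after "cuda:" must be a non-negative device ordinal.
            Some(ordinal) => ordinal
                .parse::<usize>()
                .map(TtsDevice::Cuda)
                .map_err(|_| format!("invalid CUDA ordinal in TTS_DEVICE: {value}")),
            None => Err(format!("unsupported TTS_DEVICE value: {value}")),
        },
    }
}

/// Read `TTS_DEVICE` from the environment, using `fallback` when it is unset.
/// The choice of fallback is left to the caller; the spec does not define one.
pub fn tts_device_from_env(fallback: TtsDevice) -> Result<TtsDevice, String> {
    match std::env::var("TTS_DEVICE") {
        Ok(value) => parse_tts_device(&value),
        Err(_) => Ok(fallback),
    }
}
```

Returning a `Result` keeps a misspelled value from silently falling back to CPU, which would otherwise only show up as higher latency.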