# Qwen3-TTS Implementation Plan: Pure Rust (Candle)

**Version:** 2.0
**Created:** 2026-01-27
**Status:** Final
**Authors:** makima development team
**Spec Reference:** [docs/specs/qwen3-tts-spec.md](../specs/qwen3-tts-spec.md)
**Research:** [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md)

---

## Table of Contents

1. [Overview](#1-overview)
2. [Task Breakdown](#2-task-breakdown)
3. [File Changes](#3-file-changes)
4. [Phase 1: Candle-Based TTS Module](#4-phase-1-candle-based-tts-module)
5. [Phase 2: WebSocket Handler + Voice Assets](#5-phase-2-websocket-handler--voice-assets)
6. [Phase 3: Optimization + Integration](#6-phase-3-optimization--integration)
7. [Testing Plan](#7-testing-plan)
8. [Risk Assessment](#8-risk-assessment)
9. [Dependencies & Prerequisites](#9-dependencies--prerequisites)
10. [Success Criteria](#10-success-criteria)

---

## 1. Overview

This plan details the implementation of Qwen3-TTS integration for the makima system as a **pure Rust** solution using the **candle** ML framework. There is no Python microservice and no proxy pattern; the TTS model runs directly inside the main makima process, loading safetensors weights via candle.

### Key Objectives

1. Implement Qwen3-TTS model inference natively in Rust using candle
2. Create a `makima/src/tts/` module with TTS trait, Chatterbox adapter, and Qwen3 submodule
3. Update the `/api/v1/speak` WebSocket handler to call the TTS engine directly
4. Enable streaming audio delivery with <200ms time-to-first-audio (TTFA)
5. Support voice cloning with default Makima voice

### Architecture Summary

```
Client Browser
    │
    │ WebSocket: /api/v1/speak
    ▼
Makima Server (Rust/Axum)
    │
    │ speak.rs handler → TTS Engine (in-process)
    │
    │ candle-based Qwen3-TTS inference
    │ (safetensors weights loaded directly)
    ▼
Audio Stream back to client
```

**Key architectural decisions:**

- **No Python.** All inference runs in Rust via candle.
- **No microservice.** TTS runs in-process, no separate service to deploy.
- **No proxy.** The speak handler calls the TTS engine directly.
- **Lazy loading.** Models are loaded on the first TTS request (like the listen.rs pattern).
- **SafeTensors.** Weights are loaded directly; no ONNX conversion needed.

---

## 2. Task Breakdown

### Phase 1: Candle-Based TTS Module (Priority: Critical)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P1.1 | Create `makima/src/tts/mod.rs`: TTS trait + factory + types | - | 3 |
| P1.2 | Move existing `tts.rs` to `makima/src/tts/chatterbox.rs` | P1.1 | 2 |
| P1.3 | Create `makima/src/tts/qwen3/config.rs`: model config parsing | P1.1 | 2 |
| P1.4 | Implement `makima/src/tts/qwen3/model.rs`: 28-layer LM backbone | P1.3 | 12 |
| P1.5 | Implement `makima/src/tts/qwen3/code_predictor.rs`: MTP module | P1.4 | 8 |
| P1.6 | Implement `makima/src/tts/qwen3/speech_tokenizer.rs`: ConvNet codec | P1.3 | 10 |
| P1.7 | Implement `makima/src/tts/qwen3/generate.rs`: autoregressive generation | P1.4, P1.5, P1.6 | 8 |
| P1.8 | Create `makima/src/tts/qwen3/mod.rs`: public API | P1.7 | 3 |
| P1.9 | Add candle dependencies to `Cargo.toml` | - | 1 |
| P1.10 | Unit tests for config, model layers, tokenizer | P1.4-P1.6 | 6 |

**Phase 1 Total: ~55 hours**

### Phase 2: WebSocket Handler + Voice Assets (Priority: High)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P2.1 | Rewrite `speak.rs`: direct TTS handler (remove proxy) | P1.8 | 6 |
| P2.2 | Add TTS models to `SharedState` (lazy loading via `OnceCell`) | P1.8 | 3 |
| P2.3 | Implement voice prompt caching (LRU) | P1.8 | 3 |
| P2.4 | Remove `tts_client.rs` (no longer needed) | P2.1 | 1 |
| P2.5 | Update `state.rs`: remove TTS proxy fields, add TTS model fields | P2.2 | 2 |
| P2.6 | Update `mod.rs`: remove `tts_client` module | P2.4 | 0.5 |
| P2.7 | Create voice manifest structure (`models/voices/makima/`) | - | 1 |
| P2.8 | Acquire Makima voice reference audio | - | 2 |
| P2.9 | Test voice cloning quality | P1.8, P2.8 | 2 |

**Phase 2 Total: ~20.5 hours**

### Phase 3: Optimization + Integration (Priority: Medium)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P3.1 | Implement streaming generation (token-by-token waveform decode) | P2.1 | 6 |
| P3.2 | GPU memory optimization (bf16, cache management) | P3.1 | 4 |
| P3.3 | Listen page integration for bidirectional speech | P2.1 | 4 |
| P3.4 | Latency benchmarks | P3.1 | 3 |
| P3.5 | Integration tests (WebSocket end-to-end) | P2.1 | 4 |
| P3.6 | Documentation | P3.5 | 2 |

**Phase 3 Total: ~23 hours**

---

## 3. File Changes

### New Files

```
makima/src/tts/
├── mod.rs                   // TTS trait, factory, shared types
├── chatterbox.rs            // Existing ONNX-based Chatterbox (moved from tts.rs)
└── qwen3/
    ├── mod.rs               // Qwen3TTS public API
    ├── model.rs             // Qwen3 LM transformer (28 layers)
    ├── code_predictor.rs    // MTP module (5 layers, 16 codebooks)
    ├── speech_tokenizer.rs  // Encoder + Decoder (causal ConvNet + RVQ)
    ├── config.rs            // Model config from config.json / safetensors
    └── generate.rs          // Autoregressive generation loop with KV cache
```

### Modified Files

| File | Change Description |
|------|-------------------|
| `makima/src/server/handlers/speak.rs` | Rewrite: direct TTS engine call instead of proxy |
| `makima/src/server/state.rs` | Remove `tts_client`/`tts_config` fields, add `tts_models: OnceCell` |
| `makima/src/server/mod.rs` | Remove `pub mod tts_client;` |
| `makima/src/server/handlers/mod.rs` | No change (speak already exported) |
| `makima/Cargo.toml` | Add candle-core, candle-nn, candle-transformers; remove tokio-tungstenite if unused |
| `makima/src/lib.rs` or `main.rs` | Add `pub mod tts;` |

### Deleted Files

| File | Reason |
|------|--------|
| `makima/src/server/tts_client.rs` | No longer needed (no proxy pattern) |
| `tts-service/` (entire directory) | Python service rejected; pure Rust solution |

---

## 4. Phase 1: Candle-Based TTS Module

### 4.1 TTS Trait and Factory (`tts/mod.rs`)

```rust
use async_trait::async_trait;

/// Audio chunk for streaming output.
pub struct AudioChunk {
    /// PCM samples (element type assumed to be f32).
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub is_final: bool,
}

/// TTS engine trait, implemented by Chatterbox and Qwen3.
/// `TtsError` is the module's error type, defined alongside this trait.
#[async_trait]
pub trait TtsEngine: Send + Sync {
    /// Generate audio from text.
    async fn generate(
        &self,
        text: &str,
        voice_id: &str,
        language: &str,
    ) -> Result<Vec<AudioChunk>, TtsError>;

    /// Pre-load a voice prompt.
    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;

    /// Check if the engine is ready.
    fn is_ready(&self) -> bool;
}
```

### 4.2 Qwen3 LM Backbone (`tts/qwen3/model.rs`)

Extend candle-transformers' Qwen2 model implementation:

- **28 transformer layers** with RoPE, GQA (16 heads, 8 KV heads), head dim 128
- **Hidden size:** 1024, **intermediate size:** 3072
- **Input:** text tokens + reference audio codes (concatenated)
- **Output:** zeroth codebook token logits

**Key implementation detail:** The existing `candle_transformers::models::qwen2` module provides the base attention and MLP layers.
We extend this with:

- TTS-specific input embedding (text + audio token embeddings)
- Speaker encoder concatenation
- Code predictor output head (instead of standard LM head)

### 4.3 Code Predictor (`tts/qwen3/code_predictor.rs`)

- **5-layer** transformer module
- **Input:** hidden states from the main LM
- **Output:** 16 codebook predictions (vocab size 2048 each)
- After the main LM predicts the zeroth codebook token, this module predicts the remaining 15 codebook layers in parallel

### 4.4 Speech Tokenizer (`tts/qwen3/speech_tokenizer.rs`)

Two sub-components:

**Encoder** (used for voice cloning):

- Causal 1D ConvNet converting reference audio waveform → discrete multi-codebook tokens
- 16-layer RVQ (Residual Vector Quantization)
- First codebook = semantic (WavLM-guided), remaining 15 = acoustic

**Decoder** (used for audio output):

- Causal 1D ConvNet reconstructing waveforms from discrete codes
- Input: 16 codebook indices → lookup embeddings → ConvNet → waveform
- Output: 24kHz mono audio

**candle implementation notes:**

- `candle_nn::Conv1d` for all convolution layers
- `candle_nn::Embedding` for codebook lookups
- Weight normalization handled manually

### 4.5 Autoregressive Generation (`tts/qwen3/generate.rs`)

```rust
pub async fn generate(
    model: &Qwen3Model,
    code_predictor: &CodePredictor,
    speech_tokenizer: &SpeechTokenizer,
    text_tokens: &[u32],
    voice_prompt: &VoicePrompt,
) -> Result<Vec<AudioChunk>, TtsError> {
    // 1. Encode reference audio → speaker embedding + audio codes
    let speaker_emb = speech_tokenizer.encode(&voice_prompt.audio)?;

    // 2. Prepare input: [text_tokens, audio_codes]
    let mut input = prepare_input(text_tokens, &speaker_emb)?;

    // 3. Autoregressive loop with KV cache
    let mut kv_cache = KvCache::new(model.num_layers());
    let mut generated_codes = Vec::new();
    let mut chunks = Vec::new();

    loop {
        let logits = model.forward(&input, &mut kv_cache)?;
        let next_token = sample_token(&logits);
        if next_token == EOS_TOKEN {
            break;
        }
        generated_codes.push(next_token);

        // 4. Code predictor: predict remaining 15 codebooks
        let all_codes = code_predictor.predict(&model.last_hidden_state(), next_token)?;

        // 5. Decode to audio (can be done incrementally for streaming).
        //    A streaming variant would yield each chunk here instead of collecting.
        chunks.push(speech_tokenizer.decode(&all_codes)?);

        // 6. Feed the new frame back in as the next step's input.
        //    `embed_codes` is a hypothetical helper, not an existing API.
        input = model.embed_codes(&all_codes)?;
    }

    Ok(chunks)
}
```

---

## 5. Phase 2: WebSocket Handler + Voice Assets

### 5.1 Speak Handler (Rewritten)

```rust
// makima/src/server/handlers/speak.rs
//
// Direct TTS handler: no proxy, no external service.
// Sketch only: `send_error`, `parse_client_message`, `complete_message`, and
// `ClientMessage` are assumed to be defined elsewhere in this module.

use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use futures::{SinkExt, StreamExt};
use uuid::Uuid;

pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(move |socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(mut socket: WebSocket, state: SharedState) {
    let _session_id = Uuid::new_v4().to_string();

    // Lazy-load TTS models (like listen.rs does for STT)
    let tts = match state.get_tts_models().await {
        Ok(tts) => tts,
        Err(e) => {
            send_error(&mut socket, "MODEL_LOADING", &e.to_string()).await;
            return;
        }
    };

    // Session settings, filled in by `Start` (config field names are assumptions)
    let mut voice_id = String::new();
    let mut language = String::new();

    // Session loop: parse JSON messages, dispatch to TTS engine
    let (mut sender, mut receiver) = socket.split();
    while let Some(msg) = receiver.next().await {
        match parse_client_message(msg) {
            ClientMessage::Start(config) => {
                // Load voice, send Ready
                voice_id = config.voice_id;
                language = config.language;
            }
            ClientMessage::Speak(text) => {
                // Run inference, then stream the audio chunks
                let Ok(chunks) = tts.engine.generate(&text, &voice_id, &language).await else {
                    break; // error reporting to the client is omitted in this sketch
                };
                for chunk in chunks {
                    // `.into()` covers both Vec<u8> and Bytes payload types
                    if sender.send(Message::Binary(chunk.to_pcm16().into())).await.is_err() {
                        return;
                    }
                }
                if sender.send(complete_message()).await.is_err() {
                    return;
                }
            }
            ClientMessage::Stop => break,
            ClientMessage::Cancel => { /* abort current generation */ }
        }
    }
}
```

### 5.2 State Changes

Remove from `AppState`:

- `tts_client` field
- `tts_config` field
- `with_tts()` method
- `is_tts_healthy()` method

Add to `AppState`:

- `tts_models: OnceCell`, the lazily loaded TTS engine
- `get_tts_models()` method (async, like `get_ml_models()`)

### 5.3 Voice Assets

```
models/voices/makima/
├── manifest.json      # Voice metadata
├── reference.wav      # 5-15 second reference audio
└── transcript.txt     # Exact transcript of reference audio
```

---

## 6. Phase 3: Optimization + Integration

### 6.1 Streaming Generation

The 12Hz model's causal architecture enables token-by-token waveform generation:

- Each token = ~80ms of audio (12.5 Hz)
- After generating each token, decode immediately and send the audio chunk
- The client receives audio before full generation completes

### 6.2 GPU Memory Optimization

- Load weights in bf16/f16 (candle supports both)
- Implement KV cache with fixed maximum size
- Clear cache between sessions
- CPU fallback when GPU is unavailable

### 6.3 Listen Page Integration

Following the pattern in `listen.rs`:

- TTS model protected behind `tokio::sync::Mutex`
- WebSocket endpoint emits audio chunks as tokens are generated
- Bidirectional: STT (listen) → process → TTS (speak) loop

---

## 7. Testing Plan

### 7.1 Unit Tests

| Test Area | Coverage | Key Tests |
|-----------|----------|-----------|
| Config parsing | 100% | Load config from JSON, validate fields |
| Model layers | 80% | Attention, MLP, Conv1d shapes |
| Code predictor | 85% | Multi-codebook output shapes |
| Speech tokenizer | 80% | Encode/decode round-trip |
| Voice cache | 95% | LRU eviction, TTL expiration |
| Message parsing | 100% | All client/server message types |

### 7.2 Integration Tests

| Test | Description |
|------|-------------|
| WebSocket flow | Connect → Start → Speak → Audio chunks → Complete → Stop |
| Error handling | Invalid text, unknown voice, model loading failure |
| Cancellation | Cancel mid-generation |
| Voice cloning | Generate with custom reference audio |

### 7.3 Latency Benchmarks

| Metric | Target | Acceptable | Warning |
|--------|--------|------------|---------|
| First Audio (short text) | < 150ms | < 200ms | > 300ms |
| First Audio (medium text) | < 200ms | < 300ms | > 500ms |
| First Audio (long text) | < 300ms | < 500ms | > 800ms |
| Inter-chunk latency | < 30ms | < 50ms | > 100ms |
| GPU memory | < 4GB | < 6GB | > 8GB |

---

## 8. Risk Assessment

### 8.1 Technical Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Candle implementation takes longer** | Medium | Medium | Reference Crane's Spark-TTS; use qwen3-rs as LM reference |
| **Speech tokenizer ConvNet is complex** | Medium | High | Study PyTorch source; ConvNet layers are simpler than transformers |
| **Model quality differs from PyTorch** | Low | High | Validate with reference audio; ensure bf16 precision |
| **Crane ships Qwen3-TTS first** | Medium | Positive | Adopt their implementation or use as reference |
| **GPU memory issues** | Low | Medium | 0.6B model is small (~2.5GB); fits in 4GB VRAM |

### 8.2 Contingency Plans

| Scenario | Response |
|----------|----------|
| Candle implementation blocked | Use Crane crate as dependency if they ship Qwen3-TTS |
| ConvNet decoder too complex | Implement simplified decoder; optimize later |
| Latency exceeds targets | Start with batch mode + chunked delivery (acceptable UX) |
| No GPU available | CPU fallback with candle's MKL support (degraded performance) |

---

## 9. Dependencies & Prerequisites

### 9.1 Rust Dependencies

Add to `Cargo.toml`:

```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
candle-transformers = "0.8"
# Keep existing: tokenizers, hf-hub, safetensors, ndarray
```

### 9.2 Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | CUDA 4GB VRAM / Metal (macOS) | NVIDIA RTX 3060+ (8GB+) |
| RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB |

### 9.3 Voice Asset Prerequisites

Before Phase 2 voice testing:

1. Makima voice reference audio (5-15 seconds, clean speech)
2. Accurate transcript of reference audio
3. Format: WAV 24kHz mono, 16-bit PCM

---

## 10. Success Criteria

### 10.1 Phase 1 Completion

- [ ] candle-based Qwen3 model loads safetensors weights
- [ ] Forward pass produces valid logits
- [ ] Speech tokenizer encodes/decodes audio
- [ ] Code predictor generates 16 codebook layers
- [ ] Unit tests pass with > 80% coverage

### 10.2 Phase 2 Completion

- [ ] `/api/v1/speak` endpoint produces audio from text
- [ ] No Python service required
- [ ] Voice cloning works with reference audio
- [ ] Error handling returns appropriate codes
- [ ] speak.rs calls TTS engine directly (no proxy)

### 10.3 Phase 3 Completion

- [ ] Streaming generation with < 200ms TTFA
- [ ] GPU memory usage < 6GB
- [ ] Integration tests pass
- [ ] Listen page bidirectional speech works
- [ ] Latency benchmarks documented

### 10.4 Final Acceptance Criteria

1. **Functional:** End-to-end TTS streaming via WebSocket, pure Rust, no Python
2. **Performance:** TTFA < 200ms, subsequent chunks < 100ms
3. **Quality:** Synthesized speech is intelligible and recognizable as Makima
4. **Reliability:** Error handling is robust; graceful degradation on GPU failure
5. **Architecture:** Clean `tts/` module with trait-based engine selection

---

## Appendix A: Quick Start Commands

### Development

```bash
# Build with candle GPU support
cd makima
cargo build --features cuda   # or --features metal for macOS

# Run server with TTS enabled
TTS_ENGINE=qwen3 TTS_DEVICE=cuda:0 cargo run

# Run TTS-specific tests
cargo test tts
```

### Benchmarks

```bash
# Run latency benchmarks (requires GPU)
cargo bench --bench tts_latency
```

## Appendix B: Reference Implementations

- [candle-transformers qwen2 model](https://docs.rs/candle-transformers/latest/candle_transformers/models/qwen2/index.html): base attention/MLP layers
- [qwen3-rs](https://github.com/reinterpretcat/qwen3-rs): educational Qwen3 in Rust
- [Crane](https://github.com/lucasjinreal/Crane): Rust TTS engine (Qwen3-TTS on roadmap)
- [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md): full feasibility analysis
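
## Appendix C: PCM16 Conversion Sketch

The speak handler in section 5.1 calls `chunk.to_pcm16()` and section 6.1 relies on each 12.5 Hz token covering about 80ms of 24kHz audio, but neither is spelled out in the plan. The following is a minimal sketch of both, under stated assumptions: `samples` holds `f32` values nominally in [-1.0, 1.0], the wire format is 16-bit little-endian PCM, and `samples_per_frame` is a hypothetical helper name introduced only for illustration.

```rust
/// Audio chunk as in section 4.1 (f32 sample type is an assumption).
pub struct AudioChunk {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub is_final: bool,
}

impl AudioChunk {
    /// Convert to 16-bit little-endian PCM bytes (2 bytes per sample).
    pub fn to_pcm16(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.samples.len() * 2);
        for &s in &self.samples {
            // Clamp first so out-of-range model output saturates instead of wrapping.
            let scaled = (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16;
            out.extend_from_slice(&scaled.to_le_bytes());
        }
        out
    }
}

/// Hypothetical helper: samples produced per codec frame,
/// e.g. 24_000 Hz / 12.5 Hz = 1920 samples (80ms).
pub fn samples_per_frame(sample_rate: u32, frame_rate_hz: f32) -> usize {
    (sample_rate as f32 / frame_rate_hz).round() as usize
}
```

Scaling by `i16::MAX` keeps the mapping symmetric around zero, so `-1.0` becomes `-32767` and the value `-32768` is never produced. If the decoder turns out to emit integer samples directly, the conversion step would not be needed.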