Add Qwen3-TTS streaming endpoint for voice synthesis (#40)

* Task completion checkpoint

* Task completion checkpoint

* Task completion checkpoint

* Add Qwen3-TTS research document for live TTS replacement

Research findings for replacing Chatterbox TTS with Qwen3-TTS-12Hz-0.6B-Base:
- Current TTS: Chatterbox-Turbo-ONNX with batch-only generation, no streaming
- Qwen3-TTS: 97ms end-to-end latency, streaming support, 3-second voice cloning
- Voice cloning: Requires 3s reference audio + transcript (Makima voice planned)
- Integration: Python service with WebSocket bridge (no ONNX export available)
- Languages: 10 supported including English and Japanese

Document includes:
- Current architecture analysis (makima/src/tts.rs)
- Qwen3-TTS capabilities and requirements
- Feasibility assessment for live/streaming TTS
- Audio clip requirements for voice cloning
- Preliminary technical approach with architecture diagrams

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [WIP] Heartbeat checkpoint - 2026-01-27 03:11:15 UTC

* Add Qwen3-TTS research documentation

Comprehensive research on replacing Chatterbox TTS with Qwen3-TTS-12Hz-0.6B-Base:
- Current TTS implementation analysis (Chatterbox-Turbo-ONNX in makima/src/tts.rs)
- Qwen3-TTS capabilities: 97ms streaming latency, voice cloning with 3s reference
- Cross-lingual support: Japanese voice (Makima/Tomori Kusunoki) speaking English
- Python microservice architecture recommendation (FastAPI + WebSocket)
- Implementation phases and technical approach
- Hardware requirements and dependencies

Key findings:
- Live/streaming TTS is highly feasible with 97ms latency
- Voice cloning fully supported with 0.95 speaker similarity
- Recommended: Python microservice with WebSocket streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add comprehensive Qwen3-TTS integration specification

This specification document defines the complete integration of
Qwen3-TTS-12Hz-0.6B-Base as a replacement for the existing
Chatterbox-Turbo TTS implementation. The document covers:

## Functional Requirements
- WebSocket endpoint /api/v1/speak for streaming TTS
- Voice cloning with default Makima voice (Japanese VA speaking English)
- Support for custom voice references
- Detailed client-to-server and server-to-client message protocols
- Integration with Listen page for bidirectional speech

## Non-Functional Requirements
- Latency targets: < 200ms first audio byte
- Audio quality: 24kHz, mono, PCM16/PCM32f
- Hardware requirements: CUDA GPU with 4-8GB VRAM
- Scalability: 10 concurrent sessions per GPU

## Architecture Specification
- Python TTS microservice with FastAPI/WebSocket
- Rust proxy endpoint in makima server
- Voice prompt caching mechanism (LRU cache)
- Error handling and recovery strategies

## API Contract
- Complete WebSocket message format definitions (TypeScript)
- Error codes and responses (TTS_UNAVAILABLE, SYNTHESIS_ERROR, etc.)
- Session state machine and lifecycle management

## Voice Asset Requirements
- Makima voice clip specifications (5-10s WAV, transcript required)
- Storage location: models/voices/makima/
- Metadata format for voice management

## Testing Strategy
- Unit tests for Python TTS service and Rust proxy
- Integration tests for WebSocket flow
- Latency benchmarks with performance targets
- Test data fixtures for various text lengths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add Qwen3-TTS implementation plan

Comprehensive implementation plan for replacing Chatterbox-TTS with
Qwen3-TTS streaming TTS service, including:
- Task breakdown with estimated hours for each phase
- Phase 1: Python TTS microservice (FastAPI, WebSocket)
- Phase 2: Rust proxy integration (speak.rs, tts_client.rs)
- Detailed file changes and new module structure
- Testing plan with unit, integration, and latency benchmarks
- Risk assessment with mitigation strategies
- Success criteria for each phase

Based on specification in docs/specs/qwen3-tts-spec.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add author and research references to TTS implementation plan

Add links to research documentation and author attribution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [WIP] Heartbeat checkpoint - 2026-01-27 03:25:06 UTC

* Add Python TTS service project structure (Phase 1.1-1.3)

Create the initial makima-tts Python service directory structure with:
- pyproject.toml with FastAPI, Qwen-TTS, and torch dependencies
- config.py with pydantic-settings TTSConfig class
- models.py with Pydantic message models (Start, Speak, Stop, Ready, etc.)

This implements tasks P1.1, P1.2, and P1.3 from the Qwen3-TTS
implementation plan.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add TTS engine and voice manager for Qwen3-TTS (Phase 1.4-1.5)

Implement core TTS functionality:
- tts_engine.py: Qwen3-TTS wrapper with streaming audio chunk generation
- voice_manager.py: Voice prompt caching with LRU eviction and TTL support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [WIP] Heartbeat checkpoint - 2026-01-27 03:30:06 UTC

* Add TTS proxy client and message types (Phase 2.1, 2.2, 2.4)

- Add tts_client.rs with TtsConfig, TtsCircuitBreaker, TtsError,
  TtsProxyClient, and TtsConnection structs for WebSocket proxying
- Add TTS message types to messages.rs (TtsAudioEncoding, TtsPriority,
  TtsStartMessage, TtsSpeakMessage, TtsStopMessage, TtsClientMessage,
  TtsReadyMessage, TtsAudioChunkMessage, TtsCompleteMessage,
  TtsErrorMessage, TtsStoppedMessage, TtsServerMessage)
- Export tts_client module from server mod.rs
- tokio-tungstenite already present in Cargo.toml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add TTS WebSocket handler and route (Phase 2.3, 2.5, 2.6)

- Create speak.rs WebSocket handler that proxies to Python TTS service
- Add TtsState fields (tts_client, tts_config) to AppState
- Add with_tts() builder and is_tts_healthy() methods to AppState
- Register /api/v1/speak route in the router
- Add speak module export in handlers/mod.rs

The handler forwards WebSocket messages bidirectionally between the
client and the Python TTS microservice with proper error handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add Makima voice profile assets for TTS voice cloning

Creates the voice assets directory structure with:
- manifest.json containing voice configuration (voice_id, speaker,
  language, reference audio path, and Japanese transcript placeholder)
- README.md with instructions for obtaining voice reference audio

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add Rust-native Qwen3-TTS integration research document

Research findings for integrating Qwen3-TTS-12Hz-0.6B-Base directly into
the makima Rust codebase without Python. Key conclusions:
- ONNX export is not viable (unsupported architecture)
- Candle (HF Rust ML framework) is the recommended approach
- Model weights available in safetensors format (2.52GB total)
- Three components needed: LM backbone, code predictor, speech tokenizer
- Crane project has Qwen3-TTS as highest priority (potential upstream)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [WIP] Heartbeat checkpoint - 2026-01-27 11:21:43 UTC

* [WIP] Heartbeat checkpoint - 2026-01-27 11:24:19 UTC

* [WIP] Heartbeat checkpoint - 2026-01-27 11:26:43 UTC

* feat: implement Rust-native Qwen3-TTS using candle framework

Replace monolithic tts.rs with modular tts/ directory structure:
- tts/mod.rs: TtsEngine trait, TtsEngineFactory, shared types
  (AudioChunk, TtsError), and utility functions (save_wav, resample, argmax)
- tts/chatterbox.rs: existing ONNX-based ChatterboxTTS adapted to
  implement TtsEngine trait with Mutex-wrapped sessions for Send+Sync
- tts/qwen3/mod.rs: Qwen3Tts entry point with HuggingFace model loading
- tts/qwen3/config.rs: Qwen3TtsConfig parsing from HF config.json
- tts/qwen3/model.rs: 28-layer Qwen3 transformer with RoPE, GQA
  (16 heads, 8 KV heads), SiLU MLP, RMS norm, and KV cache
- tts/qwen3/code_predictor.rs: 5-layer MTP module predicting 16 codebooks
- tts/qwen3/speech_tokenizer.rs: ConvNet encoder/decoder with 16-layer RVQ
- tts/qwen3/generate.rs: autoregressive generation loop with streaming support

Add candle-core, candle-nn, candle-transformers, safetensors to Cargo.toml.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: integrate TTS engine into speak WebSocket handler

- Update speak.rs handler to use TTS engine directly from SharedState
  instead of returning a stub "not implemented" error
- Add TtsEngine (OnceCell lazy-loaded) to AppState in state.rs with
  get_tts_engine() method for lazy initialization on first connection
- Implement full WebSocket protocol: client sends JSON speak/cancel/stop
  messages, server streams binary PCM audio chunks and audio_end signals
- Create voices/makima/manifest.json for Makima voice profile configuration
- All files compile successfully with zero errors

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: add /speak TTS page with WebSocket audio playback

Add a new /speak frontend page for text-to-speech via WebSocket. The page
accepts text input and streams synthesized PCM audio through the Web
Audio API. Includes model loading indicator, cancel support, and
connection status. Also adds a loading bar to the listen page
ControlPanel during WebSocket connection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
author: soryu <soryu@soryu.co> 2026-01-28 02:54:17 +0000
committer: GitHub <noreply@github.com> 2026-01-28 02:54:17 +0000
commit: eabd1304cce0e053cd32ec910d2f0ea429e8af14 (patch)
tree: fca3b08810a1dc0c0c610a8189a466cc23d5c547 /docs/plans/qwen3-tts-implementation-plan.md
parent: c618174e60e4632d36d7352d83399508c72b2f42 (diff)
download: soryu-eabd1304cce0e053cd32ec910d2f0ea429e8af14.tar.gz
soryu-eabd1304cce0e053cd32ec910d2f0ea429e8af14.zip
1 files changed, 514 insertions, 0 deletions
diff --git a/docs/plans/qwen3-tts-implementation-plan.md b/docs/plans/qwen3-tts-implementation-plan.md
new file mode 100644
index 0000000..76ecb33
--- /dev/null
+++ b/docs/plans/qwen3-tts-implementation-plan.md
@@ -0,0 +1,514 @@
+# Qwen3-TTS Implementation Plan — Pure Rust (Candle)
+
+**Version:** 2.0
+**Created:** 2026-01-27
+**Status:** Final
+**Authors:** makima development team
+**Spec Reference:** [docs/specs/qwen3-tts-spec.md](../specs/qwen3-tts-spec.md)
+**Research:** [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md)
+
+---
+
+## Table of Contents
+
+1. [Overview](#1-overview)
+2. [Task Breakdown](#2-task-breakdown)
+3. [File Changes](#3-file-changes)
+4. [Phase 1: Candle-Based TTS Module](#4-phase-1-candle-based-tts-module)
+5. [Phase 2: WebSocket Handler + Voice Assets](#5-phase-2-websocket-handler--voice-assets)
+6. [Phase 3: Optimization + Integration](#6-phase-3-optimization--integration)
+7. [Testing Plan](#7-testing-plan)
+8. [Risk Assessment](#8-risk-assessment)
+9. [Dependencies & Prerequisites](#9-dependencies--prerequisites)
+10. [Success Criteria](#10-success-criteria)
+
+---
+
+## 1. Overview
+
+This plan details the implementation of Qwen3-TTS integration for the makima system as a **pure Rust** solution using the **candle** ML framework. There is no Python microservice and no proxy pattern — the TTS model runs directly inside the main makima process, loading safetensors weights via candle.
+
+### Key Objectives
+
+1. Implement Qwen3-TTS model inference natively in Rust using candle
+2. Create a `makima/src/tts/` module with TTS trait, Chatterbox adapter, and Qwen3 submodule
+3. Update the `/api/v1/speak` WebSocket handler to call the TTS engine directly
+4. Enable streaming audio delivery with <200ms time-to-first-audio (TTFA)
+5. Support voice cloning with default Makima voice
+
+### Architecture Summary
+
+```
+Client Browser
+     │
+     │ WebSocket: /api/v1/speak
+     ▼
+Makima Server (Rust/Axum)
+     │
+     │ speak.rs handler → TTS Engine (in-process)
+     │
+     │ candle-based Qwen3-TTS inference
+     │ (safetensors weights loaded directly)
+     ▼
+Audio Stream back to client
+```
+
+**Key architectural decisions:**
+- **No Python.** All inference runs in Rust via candle.
+- **No microservice.** TTS runs in-process, no separate service to deploy.
+- **No proxy.** The speak handler calls the TTS engine directly.
+- **Lazy loading.** Models are loaded on the first TTS request, following the pattern in `listen.rs`.
+- **SafeTensors.** Weights loaded directly — no ONNX conversion needed.
+
+---
+
+## 2. Task Breakdown
+
+### Phase 1: Candle-Based TTS Module (Priority: Critical)
+
+| ID | Task | Depends On | Estimated Hours |
+|----|------|------------|-----------------|
+| P1.1 | Create `makima/src/tts/mod.rs` — TTS trait + factory + types | - | 3 |
+| P1.2 | Move existing `tts.rs` to `makima/src/tts/chatterbox.rs` | P1.1 | 2 |
+| P1.3 | Create `makima/src/tts/qwen3/config.rs` — Model config parsing | P1.1 | 2 |
+| P1.4 | Implement `makima/src/tts/qwen3/model.rs` — 28-layer LM backbone | P1.3 | 12 |
+| P1.5 | Implement `makima/src/tts/qwen3/code_predictor.rs` — MTP module | P1.4 | 8 |
+| P1.6 | Implement `makima/src/tts/qwen3/speech_tokenizer.rs` — ConvNet codec | P1.3 | 10 |
+| P1.7 | Implement `makima/src/tts/qwen3/generate.rs` — Autoregressive generation | P1.4, P1.5, P1.6 | 8 |
+| P1.8 | Create `makima/src/tts/qwen3/mod.rs` — Public API | P1.7 | 3 |
+| P1.9 | Add candle dependencies to `Cargo.toml` | - | 1 |
+| P1.10 | Unit tests for config, model layers, tokenizer | P1.4-P1.6 | 6 |
+
+**Phase 1 Total: ~55 hours**
+
+### Phase 2: WebSocket Handler + Voice Assets (Priority: High)
+
+| ID | Task | Depends On | Estimated Hours |
+|----|------|------------|-----------------|
+| P2.1 | Rewrite `speak.rs` — Direct TTS handler (remove proxy) | P1.8 | 6 |
+| P2.2 | Add TTS models to `SharedState` (lazy loading via `OnceCell`) | P1.8 | 3 |
+| P2.3 | Implement voice prompt caching (LRU) | P1.8 | 3 |
+| P2.4 | Remove `tts_client.rs` (no longer needed) | P2.1 | 1 |
+| P2.5 | Update `state.rs` — Remove TTS proxy fields, add TTS model fields | P2.2 | 2 |
+| P2.6 | Update `mod.rs` — Remove `tts_client` module | P2.4 | 0.5 |
+| P2.7 | Create voice manifest structure (`models/voices/makima/`) | - | 1 |
+| P2.8 | Acquire Makima voice reference audio | - | 2 |
+| P2.9 | Test voice cloning quality | P1.8, P2.8 | 2 |
+
+**Phase 2 Total: ~20.5 hours**
+
+### Phase 3: Optimization + Integration (Priority: Medium)
+
+| ID | Task | Depends On | Estimated Hours |
+|----|------|------------|-----------------|
+| P3.1 | Implement streaming generation (token-by-token waveform decode) | P2.1 | 6 |
+| P3.2 | GPU memory optimization (bf16, cache management) | P3.1 | 4 |
+| P3.3 | Listen page integration for bidirectional speech | P2.1 | 4 |
+| P3.4 | Latency benchmarks | P3.1 | 3 |
+| P3.5 | Integration tests (WebSocket end-to-end) | P2.1 | 4 |
+| P3.6 | Documentation | P3.5 | 2 |
+
+**Phase 3 Total: ~23 hours**
+
+---
+
+## 3. File Changes
+
+### New Files
+
+```
+makima/src/tts/
+├── mod.rs                      // TTS trait, factory, shared types
+├── chatterbox.rs               // Existing ONNX-based Chatterbox (moved from tts.rs)
+└── qwen3/
+    ├── mod.rs                  // Qwen3TTS public API
+    ├── model.rs                // Qwen3 LM transformer (28 layers)
+    ├── code_predictor.rs       // MTP module (5 layers, 16 codebooks)
+    ├── speech_tokenizer.rs     // Encoder + Decoder (causal ConvNet + RVQ)
+    ├── config.rs               // Model config from config.json / safetensors
+    └── generate.rs             // Autoregressive generation loop with KV cache
+```
+
+### Modified Files
+
+| File | Change Description |
+|------|-------------------|
+| `makima/src/server/handlers/speak.rs` | Rewrite: direct TTS engine call instead of proxy |
+| `makima/src/server/state.rs` | Remove `tts_client`/`tts_config` fields, add `tts_models: OnceCell<TtsModels>` |
+| `makima/src/server/mod.rs` | Remove `pub mod tts_client;` |
+| `makima/src/server/handlers/mod.rs` | No change (speak already exported) |
+| `makima/Cargo.toml` | Add candle-core, candle-nn, candle-transformers; remove tokio-tungstenite if unused |
+| `makima/src/lib.rs` or `main.rs` | Add `pub mod tts;` |
+
+### Deleted Files
+
+| File | Reason |
+|------|--------|
+| `makima/src/server/tts_client.rs` | No longer needed — no proxy pattern |
+| `tts-service/` (entire directory) | Python service rejected; pure Rust solution |
+
+---
+
+## 4. Phase 1: Candle-Based TTS Module
+
+### 4.1 TTS Trait and Factory (`tts/mod.rs`)
+
+```rust
+use async_trait::async_trait;
+
+/// Audio chunk for streaming output.
+pub struct AudioChunk {
+    pub samples: Vec<f32>,
+    pub sample_rate: u32,
+    pub is_final: bool,
+}
+
+/// TTS engine trait — implemented by Chatterbox and Qwen3.
+#[async_trait]
+pub trait TtsEngine: Send + Sync {
+    /// Generate audio from text.
+    async fn generate(
+        &self,
+        text: &str,
+        voice_id: &str,
+        language: &str,
+    ) -> Result<Vec<AudioChunk>, TtsError>;
+
+    /// Pre-load a voice prompt.
+    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;
+
+    /// Check if the engine is ready.
+    fn is_ready(&self) -> bool;
+}
+```
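Section 4.1 names a factory but shows only the trait. The following is a minimal sketch of the selection step, assuming the factory is keyed by the same `TTS_ENGINE` value used in Appendix A. `EngineKind` and `resolve_engine` are hypothetical names, and falling back to Chatterbox when the value is unset is an assumption, not behavior the plan specifies.

```rust
use std::str::FromStr;

/// Engines named in this plan. The enum itself is a hypothetical helper.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngineKind {
    Chatterbox,
    Qwen3,
}

impl FromStr for EngineKind {
    type Err = String;

    /// Parse an engine name, ignoring case and surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "chatterbox" => Ok(EngineKind::Chatterbox),
            "qwen3" => Ok(EngineKind::Qwen3),
            other => Err(format!("unknown TTS engine: {other}")),
        }
    }
}

/// Resolve the engine from an optional TTS_ENGINE value.
/// ASSUMPTION: an unset value falls back to the existing Chatterbox engine.
pub fn resolve_engine(value: Option<&str>) -> Result<EngineKind, String> {
    match value {
        Some(v) => v.parse(),
        None => Ok(EngineKind::Chatterbox),
    }
}
```

A caller could pass `std::env::var("TTS_ENGINE").ok().as_deref()` and then construct the matching `TtsEngine` implementation.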
+
+### 4.2 Qwen3 LM Backbone (`tts/qwen3/model.rs`)
+
+Extend candle-transformers' Qwen2 model implementation:
+
+- **28 transformer layers** with RoPE, GQA (16 heads, 8 KV heads), head dim 128
+- **Hidden size:** 1024, **intermediate size:** 3072
+- **Input:** text tokens + reference audio codes (concatenated)
+- **Output:** zeroth codebook token logits
+
+**Key implementation detail:** The existing `candle_transformers::models::qwen2` module provides the base attention and MLP layers. We extend this with:
+- TTS-specific input embedding (text + audio token embeddings)
+- Speaker encoder concatenation
+- Code predictor output head (instead of standard LM head)
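The dimensions above fix the projection shapes and the KV cache cost. The sketch below only does that arithmetic; `AttnDims` and its methods are illustrative names, not part of candle or of the planned module.

```rust
/// Attention dimensions quoted in section 4.2.
pub struct AttnDims {
    pub layers: usize,
    pub hidden: usize,
    pub heads: usize,
    pub kv_heads: usize,
    pub head_dim: usize,
}

impl AttnDims {
    /// Output width of the query projection.
    pub fn q_width(&self) -> usize {
        self.heads * self.head_dim
    }

    /// Output width of each of the key and value projections.
    pub fn kv_width(&self) -> usize {
        self.kv_heads * self.head_dim
    }

    /// Number of query heads that share one KV head under GQA.
    pub fn group_size(&self) -> usize {
        self.heads / self.kv_heads
    }

    /// (input, output) width of the attention output projection.
    pub fn o_proj_shape(&self) -> (usize, usize) {
        (self.q_width(), self.hidden)
    }

    /// KV cache bytes per cached position: K and V for every layer.
    pub fn kv_bytes_per_position(&self, bytes_per_elem: usize) -> usize {
        2 * self.layers * self.kv_width() * bytes_per_elem
    }
}

/// Values taken from the bullet list above.
pub const QWEN3_TTS_06B: AttnDims = AttnDims {
    layers: 28,
    hidden: 1024,
    heads: 16,
    kv_heads: 8,
    head_dim: 128,
};
```

With these numbers the query projection is 2048 wide while the hidden size is 1024, so the output projection maps 2048 back to 1024, and a bf16 cache costs 114,688 bytes per position.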
+
+### 4.3 Code Predictor (`tts/qwen3/code_predictor.rs`)
+
+- **5-layer** transformer module
+- **Input:** hidden states from the main LM
+- **Output:** predictions for the 15 remaining codebooks (vocab size 2048 each)
+- After the main LM predicts the zeroth codebook token, this module predicts the remaining 15 codebook layers in parallel, giving 16 codes per frame
+
+### 4.4 Speech Tokenizer (`tts/qwen3/speech_tokenizer.rs`)
+
+Two sub-components:
+
+**Encoder** (used for voice cloning):
+- Causal 1D ConvNet converting reference audio waveform → discrete multi-codebook tokens
+- 16-layer RVQ (Residual Vector Quantization)
+- First codebook = semantic (WavLM-guided), remaining 15 = acoustic
+
+**Decoder** (used for audio output):
+- Causal 1D ConvNet reconstructing waveforms from discrete codes
+- Input: 16 codebook indices → lookup embeddings → ConvNet → waveform
+- Output: 24kHz mono audio
+
+**candle implementation notes:**
+- `candle_nn::Conv1d` for all convolution layers
+- `candle_nn::Embedding` for codebook lookups
+- Weight normalization handled manually
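The decoder's "codebook indices → lookup embeddings" step can be illustrated with plain vectors. This is a sketch of standard residual vector quantization decoding, where the selected entry of each layer is summed. Whether the Qwen3-TTS decoder applies extra projections before its ConvNet is not covered by this plan, so treat the exact composition as an assumption. `Codebook` and `rvq_lookup` are hypothetical names.

```rust
/// One RVQ layer: `entries[i]` is the embedding vector for code index `i`.
pub struct Codebook {
    pub entries: Vec<Vec<f32>>,
}

/// Sum the selected entry from each codebook layer into one latent vector.
/// Returns None if the code count, an index, or a vector width is invalid.
pub fn rvq_lookup(codebooks: &[Codebook], codes: &[usize]) -> Option<Vec<f32>> {
    if codebooks.is_empty() || codebooks.len() != codes.len() {
        return None;
    }
    // All entries are expected to share the width of the first one.
    let dim = codebooks[0].entries.first()?.len();
    let mut latent = vec![0.0f32; dim];
    for (book, &code) in codebooks.iter().zip(codes) {
        let entry = book.entries.get(code)?;
        if entry.len() != dim {
            return None;
        }
        // Each layer refines the residual left by the previous layers.
        for (acc, v) in latent.iter_mut().zip(entry) {
            *acc += *v;
        }
    }
    Some(latent)
}
```

In the planned module the per-layer tables would be `candle_nn::Embedding` weights and the sum would be a tensor add, but the indexing logic is the same.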
+
+### 4.5 Autoregressive Generation (`tts/qwen3/generate.rs`)
+
+```rust
+pub async fn generate(
+    model: &Qwen3Model,
+    code_predictor: &CodePredictor,
+    speech_tokenizer: &SpeechTokenizer,
+    text_tokens: &[u32],
+    voice_prompt: &VoicePrompt,
+) -> Result<Vec<AudioChunk>, TtsError> {
+    // 1. Encode reference audio → speaker embedding + audio codes
+    let speaker_emb = speech_tokenizer.encode(&voice_prompt.audio)?;
+
+    // 2. Prepare the prompt: [text_tokens, audio_codes]
+    let mut input = prepare_input(text_tokens, &speaker_emb)?;
+
+    // 3. Autoregressive loop with KV cache
+    let mut kv_cache = KvCache::new(model.num_layers());
+    let mut chunks = Vec::new();
+
+    // MAX_STEPS is an assumed safety bound in case EOS is never sampled
+    for _ in 0..MAX_STEPS {
+        let logits = model.forward(&input, &mut kv_cache)?;
+        let next_token = sample_token(&logits);
+
+        if next_token == EOS_TOKEN { break; }
+
+        // 4. Code predictor: predict remaining 15 codebooks
+        let all_codes = code_predictor.predict(&model.last_hidden_state(), next_token)?;
+
+        // 5. Decode this frame to audio (incremental, so it can be streamed)
+        chunks.push(speech_tokenizer.decode(&all_codes)?);
+
+        // 6. The KV cache holds the prompt, so feed only the new frame next.
+        //    next_step_input is a placeholder for the frame embedding step.
+        input = next_step_input(&all_codes)?;
+    }
+
+    // Mark the last chunk so the caller can signal end of audio
+    if let Some(last) = chunks.last_mut() { last.is_final = true; }
+    Ok(chunks)
+}
+```
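The loop above leaves `sample_token` unspecified. The simplest choice is greedy decoding, sketched here as a standalone `argmax` over a logit slice. This is an assumption for illustration. The real sampler may use temperature or top-k, and would read logits from a candle tensor rather than a slice.

```rust
/// Greedy decoding: index of the largest logit.
/// NaN logits are skipped, and the first index wins on ties.
pub fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            // Keep the current best unless the new value is strictly larger.
            Some((_, b)) if b >= v => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as u32)
}
```

Returning `Option` makes the empty or all-NaN case explicit instead of silently producing token 0.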
+
+---
+
+## 5. Phase 2: WebSocket Handler + Voice Assets
+
+### 5.1 Speak Handler (Rewritten)
+
+```rust
+// makima/src/server/handlers/speak.rs
+//
+// Direct TTS handler — no proxy, no external service.
+
+pub async fn websocket_handler(
+    ws: WebSocketUpgrade,
+    State(state): State<SharedState>,
+) -> Response {
+    ws.on_upgrade(|socket| handle_speak_socket(socket, state))
+}
+
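+async fn handle_speak_socket(mut socket: WebSocket, state: SharedState) {
+    let session_id = Uuid::new_v4().to_string();
+
+    // Lazy-load TTS models (like listen.rs does for STT)
+    let tts = match state.get_tts_models().await {
+        Ok(tts) => tts,
+        Err(e) => {
+            send_error(&mut socket, "MODEL_LOADING", &e.to_string()).await;
+            return;
+        }
+    };
+
+    // Session loop: parse JSON messages, dispatch to TTS engine
+    let (mut sender, mut receiver) = socket.split();
+    // Set by Start; these default values are assumptions
+    let (mut voice_id, mut language) = ("makima".to_string(), "en".to_string());
+
+    // Stop on disconnect or on a transport error
+    while let Some(Ok(msg)) = receiver.next().await {
+        match parse_client_message(msg) {
+            ClientMessage::Start(config) => {
+                // Load voice, update voice_id/language from config, send Ready
+            }
+            ClientMessage::Speak(text) => {
+                // Run inference, stream audio chunks
+                let chunks = match tts.engine.generate(&text, &voice_id, &language).await {
+                    Ok(chunks) => chunks,
+                    Err(e) => {
+                        // error_message is a placeholder helper; keep the session open
+                        let _ = sender.send(error_message("SYNTHESIS_ERROR", &e.to_string())).await;
+                        continue;
+                    }
+                };
+                for chunk in chunks {
+                    // A failed send means the client has gone away
+                    if sender.send(Message::Binary(chunk.to_pcm16())).await.is_err() { return; }
+                }
+                if sender.send(complete_message()).await.is_err() { return; }
+            }
+            ClientMessage::Stop => break,
+            ClientMessage::Cancel => { /* abort current generation */ }
+        }
+    }
+}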
+```
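The handler calls `chunk.to_pcm16()`, which the plan does not define. A minimal sketch of that conversion as a free function follows, assuming the wire format is little-endian signed 16-bit PCM. The byte order is an assumption, and `to_pcm16_le` is a hypothetical name.

```rust
/// Convert f32 samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes.
pub fn to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &s in samples {
        // Clamp first so out-of-range samples saturate instead of wrapping.
        let scaled = (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&scaled.to_le_bytes());
    }
    out
}
```

Scaling by `i16::MAX` keeps the mapping symmetric, so `-1.0` becomes `-32767` rather than `-32768`.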
+
+### 5.2 State Changes
+
+Remove from `AppState`:
+- `tts_client: Option<Arc<TtsProxyClient>>`
+- `tts_config: Option<TtsConfig>`
+- `with_tts()` method
+- `is_tts_healthy()` method
+
+Add to `AppState`:
+- `tts_models: OnceCell<TtsModels>` — lazily loaded TTS engine
+- `get_tts_models()` method (async, like `get_ml_models()`)
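The load-once behavior can be shown with the standard library alone. This sketch uses `std::sync::OnceLock` and a synchronous, infallible loader. The planned `get_tts_models()` is async and returns a `Result`, so the real version would need an async once-cell (for example tokio's `OnceCell::get_or_try_init`). The `TtsModels` field and the `loads` counter are placeholders added for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for the real TtsModels; the field is hypothetical.
pub struct TtsModels {
    pub engine_name: String,
}

pub struct AppState {
    tts_models: OnceLock<TtsModels>,
    /// Counts loader runs so the once-only behavior can be observed.
    pub loads: AtomicUsize,
}

impl AppState {
    pub fn new() -> Self {
        Self {
            tts_models: OnceLock::new(),
            loads: AtomicUsize::new(0),
        }
    }

    /// Load on the first call, then return the cached models.
    pub fn get_tts_models(&self) -> &TtsModels {
        self.tts_models.get_or_init(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            // The real loader would read safetensors weights here.
            TtsModels { engine_name: "qwen3".to_string() }
        })
    }
}
```

Concurrent first callers block until one initializer finishes, so the weights are never loaded twice.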
+
+### 5.3 Voice Assets
+
+```
+models/voices/makima/
+├── manifest.json       # Voice metadata
+├── reference.wav       # 5-15 second reference audio
+└── transcript.txt      # Exact transcript of reference audio
+```
+
+---
+
+## 6. Phase 3: Optimization + Integration
+
+### 6.1 Streaming Generation
+
+The 12Hz model's causal architecture enables token-by-token waveform generation:
+- Each token = ~80ms of audio (12.5 Hz)
+- After generating each token, decode immediately and send audio chunk
+- Client receives audio before full generation completes
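The per-token figures follow directly from the two rates. A small sketch of the arithmetic, using the 24 kHz output rate from section 4.4 and the 12.5 Hz frame rate above (function names are illustrative):

```rust
/// Audio samples produced per codec frame at the given rates.
pub fn samples_per_frame(sample_rate_hz: u32, frame_rate_hz: f64) -> u32 {
    (sample_rate_hz as f64 / frame_rate_hz).round() as u32
}

/// Duration of one codec frame in milliseconds.
pub fn frame_ms(frame_rate_hz: f64) -> f64 {
    1000.0 / frame_rate_hz
}

/// Bytes per frame when each sample is sent as 16-bit mono PCM.
pub fn pcm16_bytes_per_frame(sample_rate_hz: u32, frame_rate_hz: f64) -> u32 {
    samples_per_frame(sample_rate_hz, frame_rate_hz) * 2
}
```

Each generated token therefore yields 1920 samples, or 3840 bytes of PCM16, which is the natural minimum chunk size for the WebSocket stream.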
+
+### 6.2 GPU Memory Optimization
+
+- Load weights in bf16/f16 (candle supports both)
+- Implement KV cache with fixed maximum size
+- Clear cache between sessions
+- CPU fallback when GPU is unavailable
+
+### 6.3 Listen Page Integration
+
+Following the pattern in `listen.rs`:
+- TTS model protected behind `tokio::sync::Mutex`
+- WebSocket endpoint emits audio chunks as tokens are generated
+- Bidirectional: STT (listen) → process → TTS (speak) loop
+
+---
+
+## 7. Testing Plan
+
+### 7.1 Unit Tests
+
+| Test Area | Coverage | Key Tests |
+|-----------|----------|-----------|
+| Config parsing | 100% | Load config from JSON, validate fields |
+| Model layers | 80% | Attention, MLP, Conv1d shapes |
+| Code predictor | 85% | Multi-codebook output shapes |
+| Speech tokenizer | 80% | Encode/decode round-trip |
+| Voice cache | 95% | LRU eviction, TTL expiration |
+| Message parsing | 100% | All client/server message types |
+
+### 7.2 Integration Tests
+
+| Test | Description |
+|------|-------------|
+| WebSocket flow | Connect → Start → Speak → Audio chunks → Complete → Stop |
+| Error handling | Invalid text, unknown voice, model loading failure |
+| Cancellation | Cancel mid-generation |
+| Voice cloning | Generate with custom reference audio |
+
+### 7.3 Latency Benchmarks
+
+| Metric | Target | Acceptable | Warning |
+|--------|--------|------------|---------|
+| First Audio (short text) | < 150ms | < 200ms | > 300ms |
+| First Audio (medium text) | < 200ms | < 300ms | > 500ms |
+| First Audio (long text) | < 300ms | < 500ms | > 800ms |
+| Inter-chunk latency | < 30ms | < 50ms | > 100ms |
+| GPU memory | < 4GB | < 6GB | > 8GB |
+
+---
+
+## 8. Risk Assessment
+
+### 8.1 Technical Risks
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|------------|--------|------------|
+| **Candle implementation takes longer** | Medium | Medium | Reference Crane's Spark-TTS; use qwen3-rs as LM reference |
+| **Speech tokenizer ConvNet is complex** | Medium | High | Study PyTorch source; ConvNet layers are simpler than transformers |
+| **Model quality differs from PyTorch** | Low | High | Validate with reference audio; ensure bf16 precision |
+| **Crane ships Qwen3-TTS first** | Medium | Positive | Adopt their implementation or use as reference |
+| **GPU memory issues** | Low | Medium | 0.6B model is small (~2.5GB); fits in 4GB VRAM |
+
+### 8.2 Contingency Plans
+
+| Scenario | Response |
+|----------|----------|
+| Candle implementation blocked | Use Crane crate as dependency if they ship Qwen3-TTS |
+| ConvNet decoder too complex | Implement simplified decoder; optimize later |
+| Latency exceeds targets | Start with batch mode + chunked delivery (acceptable UX) |
+| No GPU available | CPU fallback with candle's MKL support (degraded performance) |
+
+---
+
+## 9. Dependencies & Prerequisites
+
+### 9.1 Rust Dependencies
+
+Add to `Cargo.toml`:
+
+```toml
+[dependencies]
+candle-core = "0.8"
+candle-nn = "0.8"
+candle-transformers = "0.8"
+# Keep existing: tokenizers, hf-hub, safetensors, ndarray
+```
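Appendix A builds with `--features cuda` or `--features metal`, which only works if the makima crate declares those features. One way to forward them to the candle crates, assuming makima has no conflicting `[features]` entries and wants the same backend enabled in all three crates:

```toml
[features]
# Forward backend selection to candle; names match the build commands in Appendix A.
cuda = ["candle-core/cuda", "candle-nn/cuda", "candle-transformers/cuda"]
metal = ["candle-core/metal", "candle-nn/metal", "candle-transformers/metal"]
```

With neither feature enabled, candle falls back to its CPU backend.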
+
+### 9.2 Hardware Requirements
+
+| Component | Minimum | Recommended |
+|-----------|---------|-------------|
+| GPU | CUDA 4GB VRAM / Metal (macOS) | NVIDIA RTX 3060+ (8GB+) |
+| RAM | 8GB | 16GB |
+| Storage | 5GB (model weights) | 10GB |
+
+### 9.3 Voice Asset Prerequisites
+
+Before Phase 2 voice testing:
+1. Makima voice reference audio (5-15 seconds, clean speech)
+2. Accurate transcript of reference audio
+3. Format: WAV 24kHz mono, 16-bit PCM
+
+---
+
+## 10. Success Criteria
+
+### 10.1 Phase 1 Completion
+
+- [ ] candle-based Qwen3 model loads safetensors weights
+- [ ] Forward pass produces valid logits
+- [ ] Speech tokenizer encodes/decodes audio
+- [ ] Code predictor generates 16 codebook layers
+- [ ] Unit tests pass with > 80% coverage
+
+### 10.2 Phase 2 Completion
+
+- [ ] `/api/v1/speak` endpoint produces audio from text
+- [ ] No Python service required
+- [ ] Voice cloning works with reference audio
+- [ ] Error handling returns appropriate codes
+- [ ] speak.rs calls TTS engine directly (no proxy)
+
+### 10.3 Phase 3 Completion
+
+- [ ] Streaming generation with < 200ms TTFA
+- [ ] GPU memory usage < 6GB
+- [ ] Integration tests pass
+- [ ] Listen page bidirectional speech works
+- [ ] Latency benchmarks documented
+
+### 10.4 Final Acceptance Criteria
+
+1. **Functional:** End-to-end TTS streaming via WebSocket, pure Rust, no Python
+2. **Performance:** TTFA < 200ms, subsequent chunks < 100ms
+3. **Quality:** Synthesized speech is intelligible and recognizable as Makima
+4. **Reliability:** Error handling is robust; graceful degradation on GPU failure
+5. **Architecture:** Clean `tts/` module with trait-based engine selection
+
+---
+
+## Appendix A: Quick Start Commands
+
+### Development
+
+```bash
+# Build with candle GPU support
+cd makima
+cargo build --features cuda   # or --features metal for macOS
+
+# Run server with TTS enabled
+TTS_ENGINE=qwen3 TTS_DEVICE=cuda:0 cargo run
+
+# Run TTS-specific tests
+cargo test tts
+```
+
+### Benchmarks
+
+```bash
+# Run latency benchmarks (requires GPU)
+cargo bench --bench tts_latency
+```
+
+## Appendix B: Reference Implementations
+
+- [candle-transformers qwen2 model](https://docs.rs/candle-transformers/latest/candle_transformers/models/qwen2/index.html) — base attention/MLP layers
+- [qwen3-rs](https://github.com/reinterpretcat/qwen3-rs) — educational Qwen3 in Rust
+- [Crane](https://github.com/lucasjinreal/Crane) — Rust TTS engine (Qwen3-TTS on roadmap)
+- [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md) — full feasibility analysis