| author | soryu <soryu@soryu.co> | 2026-01-28 02:54:17 +0000 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2026-01-28 02:54:17 +0000 |
| commit | eabd1304cce0e053cd32ec910d2f0ea429e8af14 (patch) | |
| tree | fca3b08810a1dc0c0c610a8189a466cc23d5c547 /docs/plans/qwen3-tts-implementation-plan.md | |
| parent | c618174e60e4632d36d7352d83399508c72b2f42 (diff) | |
Add Qwen3-TTS streaming endpoint for voice synthesis (#40)
* Task completion checkpoint
* Task completion checkpoint
* Task completion checkpoint
* Add Qwen3-TTS research document for live TTS replacement
Research findings for replacing Chatterbox TTS with Qwen3-TTS-12Hz-0.6B-Base:
- Current TTS: Chatterbox-Turbo-ONNX with batch-only generation, no streaming
- Qwen3-TTS: 97ms end-to-end latency, streaming support, 3-second voice cloning
- Voice cloning: Requires 3s reference audio + transcript (Makima voice planned)
- Integration: Python service with WebSocket bridge (no ONNX export available)
- Languages: 10 supported including English and Japanese
Document includes:
- Current architecture analysis (makima/src/tts.rs)
- Qwen3-TTS capabilities and requirements
- Feasibility assessment for live/streaming TTS
- Audio clip requirements for voice cloning
- Preliminary technical approach with architecture diagrams
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* [WIP] Heartbeat checkpoint - 2026-01-27 03:11:15 UTC
* Add Qwen3-TTS research documentation
Comprehensive research on replacing Chatterbox TTS with Qwen3-TTS-12Hz-0.6B-Base:
- Current TTS implementation analysis (Chatterbox-Turbo-ONNX in makima/src/tts.rs)
- Qwen3-TTS capabilities: 97ms streaming latency, voice cloning with 3s reference
- Cross-lingual support: Japanese voice (Makima/Tomori Kusunoki) speaking English
- Python microservice architecture recommendation (FastAPI + WebSocket)
- Implementation phases and technical approach
- Hardware requirements and dependencies
Key findings:
- Live/streaming TTS is highly feasible with 97ms latency
- Voice cloning fully supported with 0.95 speaker similarity
- Recommended: Python microservice with WebSocket streaming
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add comprehensive Qwen3-TTS integration specification
This specification document defines the complete integration of
Qwen3-TTS-12Hz-0.6B-Base as a replacement for the existing Chatterbox-Turbo
TTS implementation. The document covers:
## Functional Requirements
- WebSocket endpoint /api/v1/speak for streaming TTS
- Voice cloning with default Makima voice (Japanese VA speaking English)
- Support for custom voice references
- Detailed client-to-server and server-to-client message protocols
- Integration with Listen page for bidirectional speech
## Non-Functional Requirements
- Latency targets: < 200ms first audio byte
- Audio quality: 24kHz, mono, PCM16/PCM32f
- Hardware requirements: CUDA GPU with 4-8GB VRAM
- Scalability: 10 concurrent sessions per GPU
## Architecture Specification
- Python TTS microservice with FastAPI/WebSocket
- Rust proxy endpoint in makima server
- Voice prompt caching mechanism (LRU cache)
- Error handling and recovery strategies
## API Contract
- Complete WebSocket message format definitions (TypeScript)
- Error codes and responses (TTS_UNAVAILABLE, SYNTHESIS_ERROR, etc.)
- Session state machine and lifecycle management
## Voice Asset Requirements
- Makima voice clip specifications (5-10s WAV, transcript required)
- Storage location: models/voices/makima/
- Metadata format for voice management
## Testing Strategy
- Unit tests for Python TTS service and Rust proxy
- Integration tests for WebSocket flow
- Latency benchmarks with performance targets
- Test data fixtures for various text lengths
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add Qwen3-TTS implementation plan
Comprehensive implementation plan for replacing Chatterbox-TTS with
Qwen3-TTS streaming TTS service, including:
- Task breakdown with estimated hours for each phase
- Phase 1: Python TTS microservice (FastAPI, WebSocket)
- Phase 2: Rust proxy integration (speak.rs, tts_client.rs)
- Detailed file changes and new module structure
- Testing plan with unit, integration, and latency benchmarks
- Risk assessment with mitigation strategies
- Success criteria for each phase
Based on specification in docs/specs/qwen3-tts-spec.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add author and research references to TTS implementation plan
Add links to research documentation and author attribution.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* [WIP] Heartbeat checkpoint - 2026-01-27 03:25:06 UTC
* Add Python TTS service project structure (Phase 1.1-1.3)
Create the initial makima-tts Python service directory structure with:
- pyproject.toml with FastAPI, Qwen-TTS, and torch dependencies
- config.py with pydantic-settings TTSConfig class
- models.py with Pydantic message models (Start, Speak, Stop, Ready, etc.)
This implements tasks P1.1, P1.2, and P1.3 from the Qwen3-TTS implementation plan.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add TTS engine and voice manager for Qwen3-TTS (Phase 1.4-1.5)
Implement core TTS functionality:
- tts_engine.py: Qwen3-TTS wrapper with streaming audio chunk generation
- voice_manager.py: Voice prompt caching with LRU eviction and TTL support
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* [WIP] Heartbeat checkpoint - 2026-01-27 03:30:06 UTC
* Add TTS proxy client and message types (Phase 2.1, 2.2, 2.4)
- Add tts_client.rs with TtsConfig, TtsCircuitBreaker, TtsError,
TtsProxyClient, and TtsConnection structs for WebSocket proxying
- Add TTS message types to messages.rs (TtsAudioEncoding, TtsPriority,
TtsStartMessage, TtsSpeakMessage, TtsStopMessage, TtsClientMessage,
TtsReadyMessage, TtsAudioChunkMessage, TtsCompleteMessage,
TtsErrorMessage, TtsStoppedMessage, TtsServerMessage)
- Export tts_client module from server mod.rs
- tokio-tungstenite already present in Cargo.toml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add TTS WebSocket handler and route (Phase 2.3, 2.5, 2.6)
- Create speak.rs WebSocket handler that proxies to Python TTS service
- Add TtsState fields (tts_client, tts_config) to AppState
- Add with_tts() builder and is_tts_healthy() methods to AppState
- Register /api/v1/speak route in the router
- Add speak module export in handlers/mod.rs
The handler forwards WebSocket messages bidirectionally between
the client and the Python TTS microservice with proper error handling.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add Makima voice profile assets for TTS voice cloning
Creates the voice assets directory structure with:
- manifest.json containing voice configuration (voice_id, speaker,
language, reference audio path, and Japanese transcript placeholder)
- README.md with instructions for obtaining voice reference audio
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add Rust-native Qwen3-TTS integration research document
Research findings for integrating Qwen3-TTS-12Hz-0.6B-Base directly into
the makima Rust codebase without Python. Key conclusions:
- ONNX export is not viable (unsupported architecture)
- Candle (HF Rust ML framework) is the recommended approach
- Model weights available in safetensors format (2.52GB total)
- Three components needed: LM backbone, code predictor, speech tokenizer
- Crane project has Qwen3-TTS as highest priority (potential upstream)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* [WIP] Heartbeat checkpoint - 2026-01-27 11:21:43 UTC
* [WIP] Heartbeat checkpoint - 2026-01-27 11:24:19 UTC
* [WIP] Heartbeat checkpoint - 2026-01-27 11:26:43 UTC
* feat: implement Rust-native Qwen3-TTS using candle framework
Replace monolithic tts.rs with modular tts/ directory structure:
- tts/mod.rs: TtsEngine trait, TtsEngineFactory, shared types (AudioChunk,
TtsError), and utility functions (save_wav, resample, argmax)
- tts/chatterbox.rs: existing ONNX-based ChatterboxTTS adapted to implement
TtsEngine trait with Mutex-wrapped sessions for Send+Sync
- tts/qwen3/mod.rs: Qwen3Tts entry point with HuggingFace model loading
- tts/qwen3/config.rs: Qwen3TtsConfig parsing from HF config.json
- tts/qwen3/model.rs: 28-layer Qwen3 transformer with RoPE, GQA (16 heads,
8 KV heads), SiLU MLP, RMS norm, and KV cache
- tts/qwen3/code_predictor.rs: 5-layer MTP module predicting 16 codebooks
- tts/qwen3/speech_tokenizer.rs: ConvNet encoder/decoder with 16-layer RVQ
- tts/qwen3/generate.rs: autoregressive generation loop with streaming support
Add candle-core, candle-nn, candle-transformers, safetensors to Cargo.toml.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat: integrate TTS engine into speak WebSocket handler
- Update speak.rs handler to use TTS engine directly from SharedState
instead of returning a stub "not implemented" error
- Add TtsEngine (OnceCell lazy-loaded) to AppState in state.rs with
get_tts_engine() method for lazy initialization on first connection
- Implement full WebSocket protocol: client sends JSON speak/cancel/stop
messages, server streams binary PCM audio chunks and audio_end signals
- Create voices/makima/manifest.json for Makima voice profile configuration
- All files compile successfully with zero errors
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat: add /speak TTS page with WebSocket audio playback
Add a new /speak frontend page for text-to-speech via WebSocket.
The page accepts text input and streams synthesized PCM audio through
the Web Audio API. Includes model loading indicator, cancel support,
and connection status. Also adds a loading bar to the listen page
ControlPanel during WebSocket connection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Diffstat (limited to 'docs/plans/qwen3-tts-implementation-plan.md')
| -rw-r--r-- | docs/plans/qwen3-tts-implementation-plan.md | 514 |
1 files changed, 514 insertions, 0 deletions
diff --git a/docs/plans/qwen3-tts-implementation-plan.md b/docs/plans/qwen3-tts-implementation-plan.md
new file mode 100644
index 0000000..76ecb33
--- /dev/null
+++ b/docs/plans/qwen3-tts-implementation-plan.md
@@ -0,0 +1,514 @@
+# Qwen3-TTS Implementation Plan — Pure Rust (Candle)
+
+**Version:** 2.0
+**Created:** 2026-01-27
+**Status:** Final
+**Authors:** makima development team
+**Spec Reference:** [docs/specs/qwen3-tts-spec.md](../specs/qwen3-tts-spec.md)
+**Research:** [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md)
+
+---
+
+## Table of Contents
+
+1. [Overview](#1-overview)
+2. [Task Breakdown](#2-task-breakdown)
+3. [File Changes](#3-file-changes)
+4. [Phase 1: Candle-Based TTS Module](#4-phase-1-candle-based-tts-module)
+5. [Phase 2: WebSocket Handler + Voice Assets](#5-phase-2-websocket-handler--voice-assets)
+6. [Phase 3: Optimization + Integration](#6-phase-3-optimization--integration)
+7. [Testing Plan](#7-testing-plan)
+8. [Risk Assessment](#8-risk-assessment)
+9. [Dependencies & Prerequisites](#9-dependencies--prerequisites)
+10. [Success Criteria](#10-success-criteria)
+
+---
+
+## 1. Overview
+
+This plan details the implementation of Qwen3-TTS integration for the makima system as a **pure Rust** solution using the **candle** ML framework. There is no Python microservice and no proxy pattern — the TTS model runs directly inside the main makima process, loading safetensors weights via candle.
+
+### Key Objectives
+
+1. Implement Qwen3-TTS model inference natively in Rust using candle
+2. Create a `makima/src/tts/` module with TTS trait, Chatterbox adapter, and Qwen3 submodule
+3. Update the `/api/v1/speak` WebSocket handler to call the TTS engine directly
+4. Enable streaming audio delivery with <200ms time-to-first-audio (TTFA)
+5. Support voice cloning with default Makima voice
+
+### Architecture Summary
+
+```
+Client Browser
+       │
+       │ WebSocket: /api/v1/speak
+       ▼
+Makima Server (Rust/Axum)
+       │
+       │ speak.rs handler → TTS Engine (in-process)
+       │
+       │ candle-based Qwen3-TTS inference
+       │ (safetensors weights loaded directly)
+       ▼
+Audio Stream back to client
+```
+
+**Key architectural decisions:**
+- **No Python.** All inference runs in Rust via candle.
+- **No microservice.** TTS runs in-process, no separate service to deploy.
+- **No proxy.** The speak handler calls the TTS engine directly.
+- **Lazy loading.** Models loaded on first TTS request (like listen.rs pattern).
+- **SafeTensors.** Weights loaded directly — no ONNX conversion needed.
+
+---
+
+## 2. Task Breakdown
+
+### Phase 1: Candle-Based TTS Module (Priority: Critical)
+
+| ID | Task | Depends On | Estimated Hours |
+|----|------|------------|-----------------|
+| P1.1 | Create `makima/src/tts/mod.rs` — TTS trait + factory + types | - | 3 |
+| P1.2 | Move existing `tts.rs` to `makima/src/tts/chatterbox.rs` | P1.1 | 2 |
+| P1.3 | Create `makima/src/tts/qwen3/config.rs` — Model config parsing | P1.1 | 2 |
+| P1.4 | Implement `makima/src/tts/qwen3/model.rs` — 28-layer LM backbone | P1.3 | 12 |
+| P1.5 | Implement `makima/src/tts/qwen3/code_predictor.rs` — MTP module | P1.4 | 8 |
+| P1.6 | Implement `makima/src/tts/qwen3/speech_tokenizer.rs` — ConvNet codec | P1.3 | 10 |
+| P1.7 | Implement `makima/src/tts/qwen3/generate.rs` — Autoregressive generation | P1.4, P1.5, P1.6 | 8 |
+| P1.8 | Create `makima/src/tts/qwen3/mod.rs` — Public API | P1.7 | 3 |
+| P1.9 | Add candle dependencies to `Cargo.toml` | - | 1 |
+| P1.10 | Unit tests for config, model layers, tokenizer | P1.4-P1.6 | 6 |
+
+**Phase 1 Total: ~55 hours**
+
+### Phase 2: WebSocket Handler + Voice Assets (Priority: High)
+
+| ID | Task | Depends On | Estimated Hours |
+|----|------|------------|-----------------|
+| P2.1 | Rewrite `speak.rs` — Direct TTS handler (remove proxy) | P1.8 | 6 |
+| P2.2 | Add TTS models to `SharedState` (lazy loading via `OnceCell`) | P1.8 | 3 |
+| P2.3 | Implement voice prompt caching (LRU) | P1.8 | 3 |
+| P2.4 | Remove `tts_client.rs` (no longer needed) | P2.1 | 1 |
+| P2.5 | Update `state.rs` — Remove TTS proxy fields, add TTS model fields | P2.2 | 2 |
+| P2.6 | Update `mod.rs` — Remove `tts_client` module | P2.4 | 0.5 |
+| P2.7 | Create voice manifest structure (`models/voices/makima/`) | - | 1 |
+| P2.8 | Acquire Makima voice reference audio | - | 2 |
+| P2.9 | Test voice cloning quality | P1.8, P2.8 | 2 |
+
+**Phase 2 Total: ~20.5 hours**
+
+### Phase 3: Optimization + Integration (Priority: Medium)
+
+| ID | Task | Depends On | Estimated Hours |
+|----|------|------------|-----------------|
+| P3.1 | Implement streaming generation (token-by-token waveform decode) | P2.1 | 6 |
+| P3.2 | GPU memory optimization (bf16, cache management) | P3.1 | 4 |
+| P3.3 | Listen page integration for bidirectional speech | P2.1 | 4 |
+| P3.4 | Latency benchmarks | P3.1 | 3 |
+| P3.5 | Integration tests (WebSocket end-to-end) | P2.1 | 4 |
+| P3.6 | Documentation | P3.5 | 2 |
+
+**Phase 3 Total: ~23 hours**
+
+---
+
+## 3. File Changes
+
+### New Files
+
+```
+makima/src/tts/
+├── mod.rs                    // TTS trait, factory, shared types
+├── chatterbox.rs             // Existing ONNX-based Chatterbox (moved from tts.rs)
+└── qwen3/
+    ├── mod.rs                // Qwen3TTS public API
+    ├── model.rs              // Qwen3 LM transformer (28 layers)
+    ├── code_predictor.rs     // MTP module (5 layers, 16 codebooks)
+    ├── speech_tokenizer.rs   // Encoder + Decoder (causal ConvNet + RVQ)
+    ├── config.rs             // Model config from config.json / safetensors
+    └── generate.rs           // Autoregressive generation loop with KV cache
+```
+
+### Modified Files
+
+| File | Change Description |
+|------|-------------------|
+| `makima/src/server/handlers/speak.rs` | Rewrite: direct TTS engine call instead of proxy |
+| `makima/src/server/state.rs` | Remove `tts_client`/`tts_config` fields, add `tts_models: OnceCell<TtsModels>` |
+| `makima/src/server/mod.rs` | Remove `pub mod tts_client;` |
+| `makima/src/server/handlers/mod.rs` | No change (speak already exported) |
+| `makima/Cargo.toml` | Add candle-core, candle-nn, candle-transformers; remove tokio-tungstenite if unused |
+| `makima/src/lib.rs` or `main.rs` | Add `pub mod tts;` |
+
+### Deleted Files
+
+| File | Reason |
+|------|--------|
+| `makima/src/server/tts_client.rs` | No longer needed — no proxy pattern |
+| `tts-service/` (entire directory) | Python service rejected; pure Rust solution |
+
+---
+
+## 4. Phase 1: Candle-Based TTS Module
+
+### 4.1 TTS Trait and Factory (`tts/mod.rs`)
+
+```rust
+use async_trait::async_trait;
+
+/// Audio chunk for streaming output.
+pub struct AudioChunk {
+    pub samples: Vec<f32>,
+    pub sample_rate: u32,
+    pub is_final: bool,
+}
+
+/// TTS engine trait — implemented by Chatterbox and Qwen3.
+#[async_trait]
+pub trait TtsEngine: Send + Sync {
+    /// Generate audio from text.
+    async fn generate(
+        &self,
+        text: &str,
+        voice_id: &str,
+        language: &str,
+    ) -> Result<Vec<AudioChunk>, TtsError>;
+
+    /// Pre-load a voice prompt.
+    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;
+
+    /// Check if the engine is ready.
+    fn is_ready(&self) -> bool;
+}
+```
+
+### 4.2 Qwen3 LM Backbone (`tts/qwen3/model.rs`)
+
+Extend candle-transformers' Qwen2 model implementation:
+
+- **28 transformer layers** with RoPE, GQA (16 heads, 8 KV heads), head dim 128
+- **Hidden size:** 1024, **intermediate size:** 3072
+- **Input:** text tokens + reference audio codes (concatenated)
+- **Output:** zeroth codebook token logits
+
+**Key implementation detail:** The existing `candle_transformers::models::qwen2` module provides the base attention and MLP layers. We extend this with:
+- TTS-specific input embedding (text + audio token embeddings)
+- Speaker encoder concatenation
+- Code predictor output head (instead of standard LM head)
+
+### 4.3 Code Predictor (`tts/qwen3/code_predictor.rs`)
+
+- **5-layer** transformer module
+- **Input:** hidden states from the main LM
+- **Output:** 16 codebook predictions (vocab size 2048 each)
+- After the main LM predicts the zeroth codebook token, this module predicts the remaining 15 codebook layers in parallel
+
+### 4.4 Speech Tokenizer (`tts/qwen3/speech_tokenizer.rs`)
+
+Two sub-components:
+
+**Encoder** (used for voice cloning):
+- Causal 1D ConvNet converting reference audio waveform → discrete multi-codebook tokens
+- 16-layer RVQ (Residual Vector Quantization)
+- First codebook = semantic (WavLM-guided), remaining 15 = acoustic
+
+**Decoder** (used for audio output):
+- Causal 1D ConvNet reconstructing waveforms from discrete codes
+- Input: 16 codebook indices → lookup embeddings → ConvNet → waveform
+- Output: 24kHz mono audio
+
+**candle implementation notes:**
+- `candle_nn::Conv1d` for all convolution layers
+- `candle_nn::Embedding` for codebook lookups
+- Weight normalization handled manually
+
+### 4.5 Autoregressive Generation (`tts/qwen3/generate.rs`)
+
+```rust
+pub async fn generate(
+    model: &Qwen3Model,
+    code_predictor: &CodePredictor,
+    speech_tokenizer: &SpeechTokenizer,
+    text_tokens: &[u32],
+    voice_prompt: &VoicePrompt,
+) -> Result<Vec<AudioChunk>, TtsError> {
+    // 1. Encode reference audio → speaker embedding + audio codes
+    let speaker_emb = speech_tokenizer.encode(&voice_prompt.audio)?;
+
+    // 2. Prepare input: [text_tokens, audio_codes]
+    let input = prepare_input(text_tokens, &speaker_emb)?;
+
+    // 3. Autoregressive loop with KV cache
+    let mut kv_cache = KvCache::new(model.num_layers());
+    let mut generated_codes = Vec::new();
+
+    loop {
+        let logits = model.forward(&input, &mut kv_cache)?;
+        let next_token = sample_token(&logits);
+
+        if next_token == EOS_TOKEN { break; }
+        generated_codes.push(next_token);
+
+        // 4. Code predictor: predict remaining 15 codebooks
+        let all_codes = code_predictor.predict(&model.last_hidden_state(), next_token)?;
+
+        // 5. Decode to audio (can be done incrementally for streaming)
+        let chunk = speech_tokenizer.decode(&all_codes)?;
+        // yield chunk for streaming
+    }
+}
+```
+
+---
+
+## 5. Phase 2: WebSocket Handler + Voice Assets
+
+### 5.1 Speak Handler (Rewritten)
+
+```rust
+// makima/src/server/handlers/speak.rs
+//
+// Direct TTS handler — no proxy, no external service.
+
+pub async fn websocket_handler(
+    ws: WebSocketUpgrade,
+    State(state): State<SharedState>,
+) -> Response {
+    ws.on_upgrade(|socket| handle_speak_socket(socket, state))
+}
+
+async fn handle_speak_socket(socket: WebSocket, state: SharedState) {
+    let session_id = Uuid::new_v4().to_string();
+
+    // Lazy-load TTS models (like listen.rs does for STT)
+    let tts = match state.get_tts_models().await {
+        Ok(tts) => tts,
+        Err(e) => {
+            send_error(&mut socket, "MODEL_LOADING", &e.to_string()).await;
+            return;
+        }
+    };
+
+    // Session loop: parse JSON messages, dispatch to TTS engine
+    let (mut sender, mut receiver) = socket.split();
+
+    while let Some(msg) = receiver.next().await {
+        match parse_client_message(msg) {
+            ClientMessage::Start(config) => {
+                // Load voice, send Ready
+            }
+            ClientMessage::Speak(text) => {
+                // Run inference, stream audio chunks
+                for chunk in tts.engine.generate(&text, voice_id, language).await? {
+                    sender.send(Message::Binary(chunk.to_pcm16())).await?;
+                }
+                sender.send(complete_message()).await?;
+            }
+            ClientMessage::Stop => break,
+            ClientMessage::Cancel => { /* abort current generation */ }
+        }
+    }
+}
+```
+
+### 5.2 State Changes
+
+Remove from `AppState`:
+- `tts_client: Option<Arc<TtsProxyClient>>`
+- `tts_config: Option<TtsConfig>`
+- `with_tts()` method
+- `is_tts_healthy()` method
+
+Add to `AppState`:
+- `tts_models: OnceCell<TtsModels>` — lazily loaded TTS engine
+- `get_tts_models()` method (async, like `get_ml_models()`)
+
+### 5.3 Voice Assets
+
+```
+models/voices/makima/
+├── manifest.json     # Voice metadata
+├── reference.wav     # 5-15 second reference audio
+└── transcript.txt    # Exact transcript of reference audio
+```
+
+---
+
+## 6. Phase 3: Optimization + Integration
+
+### 6.1 Streaming Generation
+
+The 12Hz model's causal architecture enables token-by-token waveform generation:
+- Each token = ~80ms of audio (12.5 Hz)
+- After generating each token, decode immediately and send audio chunk
+- Client receives audio before full generation completes
+
+### 6.2 GPU Memory Optimization
+
+- Load weights in bf16/f16 (candle supports both)
+- Implement KV cache with fixed maximum size
+- Clear cache between sessions
+- CPU fallback when GPU is unavailable
+
+### 6.3 Listen Page Integration
+
+Following the pattern in `listen.rs`:
+- TTS model protected behind `tokio::sync::Mutex`
+- WebSocket endpoint emits audio chunks as tokens are generated
+- Bidirectional: STT (listen) → process → TTS (speak) loop
+
+---
+
+## 7. Testing Plan
+
+### 7.1 Unit Tests
+
+| Test Area | Coverage | Key Tests |
+|-----------|----------|-----------|
+| Config parsing | 100% | Load config from JSON, validate fields |
+| Model layers | 80% | Attention, MLP, Conv1d shapes |
+| Code predictor | 85% | Multi-codebook output shapes |
+| Speech tokenizer | 80% | Encode/decode round-trip |
+| Voice cache | 95% | LRU eviction, TTL expiration |
+| Message parsing | 100% | All client/server message types |
+
+### 7.2 Integration Tests
+
+| Test | Description |
+|------|-------------|
+| WebSocket flow | Connect → Start → Speak → Audio chunks → Complete → Stop |
+| Error handling | Invalid text, unknown voice, model loading failure |
+| Cancellation | Cancel mid-generation |
+| Voice cloning | Generate with custom reference audio |
+
+### 7.3 Latency Benchmarks
+
+| Metric | Target | Acceptable | Warning |
+|--------|--------|------------|---------|
+| First Audio (short text) | < 150ms | < 200ms | > 300ms |
+| First Audio (medium text) | < 200ms | < 300ms | > 500ms |
+| First Audio (long text) | < 300ms | < 500ms | > 800ms |
+| Inter-chunk latency | < 30ms | < 50ms | > 100ms |
+| GPU memory | < 4GB | < 6GB | > 8GB |
+
+---
+
+## 8. Risk Assessment
+
+### 8.1 Technical Risks
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|------------|--------|------------|
+| **Candle implementation takes longer** | Medium | Medium | Reference Crane's Spark-TTS; use qwen3-rs as LM reference |
+| **Speech tokenizer ConvNet is complex** | Medium | High | Study PyTorch source; ConvNet layers are simpler than transformers |
+| **Model quality differs from PyTorch** | Low | High | Validate with reference audio; ensure bf16 precision |
+| **Crane ships Qwen3-TTS first** | Medium | Positive | Adopt their implementation or use as reference |
+| **GPU memory issues** | Low | Medium | 0.6B model is small (~2.5GB); fits in 4GB VRAM |
+
+### 8.2 Contingency Plans
+
+| Scenario | Response |
+|----------|----------|
+| Candle implementation blocked | Use Crane crate as dependency if they ship Qwen3-TTS |
+| ConvNet decoder too complex | Implement simplified decoder; optimize later |
+| Latency exceeds targets | Start with batch mode + chunked delivery (acceptable UX) |
+| No GPU available | CPU fallback with candle's MKL support (degraded performance) |
+
+---
+
+## 9. Dependencies & Prerequisites
+
+### 9.1 Rust Dependencies
+
+Add to `Cargo.toml`:
+
+```toml
+[dependencies]
+candle-core = "0.8"
+candle-nn = "0.8"
+candle-transformers = "0.8"
+# Keep existing: tokenizers, hf-hub, safetensors, ndarray
+```
+
+### 9.2 Hardware Requirements
+
+| Component | Minimum | Recommended |
+|-----------|---------|-------------|
+| GPU | CUDA 4GB VRAM / Metal (macOS) | NVIDIA RTX 3060+ (8GB+) |
+| RAM | 8GB | 16GB |
+| Storage | 5GB (model weights) | 10GB |
+
+### 9.3 Voice Asset Prerequisites
+
+Before Phase 2 voice testing:
+1. Makima voice reference audio (5-15 seconds, clean speech)
+2. Accurate transcript of reference audio
+3. Format: WAV 24kHz mono, 16-bit PCM
+
+---
+
+## 10. Success Criteria
+
+### 10.1 Phase 1 Completion
+
+- [ ] candle-based Qwen3 model loads safetensors weights
+- [ ] Forward pass produces valid logits
+- [ ] Speech tokenizer encodes/decodes audio
+- [ ] Code predictor generates 16 codebook layers
+- [ ] Unit tests pass with > 80% coverage
+
+### 10.2 Phase 2 Completion
+
+- [ ] `/api/v1/speak` endpoint produces audio from text
+- [ ] No Python service required
+- [ ] Voice cloning works with reference audio
+- [ ] Error handling returns appropriate codes
+- [ ] speak.rs calls TTS engine directly (no proxy)
+
+### 10.3 Phase 3 Completion
+
+- [ ] Streaming generation with < 200ms TTFA
+- [ ] GPU memory usage < 6GB
+- [ ] Integration tests pass
+- [ ] Listen page bidirectional speech works
+- [ ] Latency benchmarks documented
+
+### 10.4 Final Acceptance Criteria
+
+1. **Functional:** End-to-end TTS streaming via WebSocket, pure Rust, no Python
+2. **Performance:** TTFA < 200ms, subsequent chunks < 100ms
+3. **Quality:** Synthesized speech is intelligible and recognizable as Makima
+4. **Reliability:** Error handling is robust; graceful degradation on GPU failure
+5. **Architecture:** Clean `tts/` module with trait-based engine selection
+
+---
+
+## Appendix A: Quick Start Commands
+
+### Development
+
+```bash
+# Build with candle GPU support
+cd makima
+cargo build --features cuda  # or --features metal for macOS
+
+# Run server with TTS enabled
+TTS_ENGINE=qwen3 TTS_DEVICE=cuda:0 cargo run
+
+# Run TTS-specific tests
+cargo test tts
+```
+
+### Benchmarks
+
+```bash
+# Run latency benchmarks (requires GPU)
+cargo bench --bench tts_latency
+```
+
+## Appendix B: Reference Implementations
+
+- [candle-transformers qwen2 model](https://docs.rs/candle-transformers/latest/candle_transformers/models/qwen2/index.html) — base attention/MLP layers
+- [qwen3-rs](https://github.com/reinterpretcat/qwen3-rs) — educational Qwen3 in Rust
+- [Crane](https://github.com/lucasjinreal/Crane) — Rust TTS engine (Qwen3-TTS on roadmap)
+- [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md) — full feasibility analysis
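
Reviewer note: the plan's speak handler calls `chunk.to_pcm16()` and section 6.1 states that each token covers about 80ms of audio at 12.5 Hz, but the diff defines neither the helper nor the arithmetic. The sketch below is one possible shape, not the project's implementation. It copies the `AudioChunk` struct from section 4.1 and assumes little-endian signed 16-bit output with out-of-range samples clamped. `to_pcm16`, `duration_ms`, and `samples_per_frame` are hypothetical names introduced here for illustration.

```rust
/// Audio chunk mirroring the `AudioChunk` struct in section 4.1 of the plan.
pub struct AudioChunk {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub is_final: bool,
}

impl AudioChunk {
    /// Hypothetical helper (assumption, not shown in the diff): convert f32
    /// samples in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.
    /// Samples outside that range are clamped before scaling.
    pub fn to_pcm16(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.samples.len() * 2);
        for &s in &self.samples {
            let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16;
            out.extend_from_slice(&v.to_le_bytes());
        }
        out
    }

    /// Duration of this chunk in milliseconds (hypothetical helper).
    pub fn duration_ms(&self) -> f64 {
        self.samples.len() as f64 * 1000.0 / self.sample_rate as f64
    }
}

/// Samples produced per codec frame, assuming one frame per generated token.
/// For 24 kHz output at 12.5 frames per second this is 24000 / 12.5 = 1920.
pub fn samples_per_frame(sample_rate: u32, frame_rate_hz: f64) -> usize {
    (sample_rate as f64 / frame_rate_hz).round() as usize
}
```

Under these assumptions, one decoded frame at 24kHz holds 1920 samples, lasts 80ms, and occupies 3840 bytes as PCM16, which is the unit the handler would send per `Message::Binary` frame if it streams token by token.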
