# Qwen3-TTS Implementation Plan — Pure Rust (Candle)

**Version:** 2.0
**Created:** 2026-01-27
**Status:** Final
**Authors:** makima development team
**Spec Reference:** [docs/specs/qwen3-tts-spec.md](../specs/qwen3-tts-spec.md)
**Research:** [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md)

---

## Table of Contents

1. [Overview](#1-overview)
2. [Task Breakdown](#2-task-breakdown)
3. [File Changes](#3-file-changes)
4. [Phase 1: Candle-Based TTS Module](#4-phase-1-candle-based-tts-module)
5. [Phase 2: WebSocket Handler + Voice Assets](#5-phase-2-websocket-handler--voice-assets)
6. [Phase 3: Optimization + Integration](#6-phase-3-optimization--integration)
7. [Testing Plan](#7-testing-plan)
8. [Risk Assessment](#8-risk-assessment)
9. [Dependencies & Prerequisites](#9-dependencies--prerequisites)
10. [Success Criteria](#10-success-criteria)

---

## 1. Overview

This plan details the implementation of Qwen3-TTS integration for the makima system as a **pure Rust** solution using the **candle** ML framework. There is no Python microservice and no proxy pattern — the TTS model runs directly inside the main makima process, loading safetensors weights via candle.

### Key Objectives

1. Implement Qwen3-TTS model inference natively in Rust using candle
2. Create a `makima/src/tts/` module with TTS trait, Chatterbox adapter, and Qwen3 submodule
3. Update the `/api/v1/speak` WebSocket handler to call the TTS engine directly
4. Enable streaming audio delivery with <200ms time-to-first-audio (TTFA)
5. Support voice cloning with default Makima voice

### Architecture Summary

```
Client Browser
     │
     │ WebSocket: /api/v1/speak
     ▼
Makima Server (Rust/Axum)
     │
     │ speak.rs handler → TTS Engine (in-process)
     │
     │ candle-based Qwen3-TTS inference
     │ (safetensors weights loaded directly)
     ▼
Audio Stream back to client
```

**Key architectural decisions:**
- **No Python.** All inference runs in Rust via candle.
- **No microservice.** TTS runs in-process, no separate service to deploy.
- **No proxy.** The speak handler calls the TTS engine directly.
- **Lazy loading.** Models are loaded on the first TTS request, following the `listen.rs` pattern.
- **SafeTensors.** Weights loaded directly — no ONNX conversion needed.

---

## 2. Task Breakdown

### Phase 1: Candle-Based TTS Module (Priority: Critical)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P1.1 | Create `makima/src/tts/mod.rs` — TTS trait + factory + types | - | 3 |
| P1.2 | Move existing `tts.rs` to `makima/src/tts/chatterbox.rs` | P1.1 | 2 |
| P1.3 | Create `makima/src/tts/qwen3/config.rs` — Model config parsing | P1.1 | 2 |
| P1.4 | Implement `makima/src/tts/qwen3/model.rs` — 28-layer LM backbone | P1.3 | 12 |
| P1.5 | Implement `makima/src/tts/qwen3/code_predictor.rs` — MTP module | P1.4 | 8 |
| P1.6 | Implement `makima/src/tts/qwen3/speech_tokenizer.rs` — ConvNet codec | P1.3 | 10 |
| P1.7 | Implement `makima/src/tts/qwen3/generate.rs` — Autoregressive generation | P1.4, P1.5, P1.6 | 8 |
| P1.8 | Create `makima/src/tts/qwen3/mod.rs` — Public API | P1.7 | 3 |
| P1.9 | Add candle dependencies to `Cargo.toml` | - | 1 |
| P1.10 | Unit tests for config, model layers, tokenizer | P1.4-P1.6 | 6 |

**Phase 1 Total: ~55 hours**

### Phase 2: WebSocket Handler + Voice Assets (Priority: High)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P2.1 | Rewrite `speak.rs` — Direct TTS handler (remove proxy) | P1.8 | 6 |
| P2.2 | Add TTS models to `SharedState` (lazy loading via `OnceCell`) | P1.8 | 3 |
| P2.3 | Implement voice prompt caching (LRU) | P1.8 | 3 |
| P2.4 | Remove `tts_client.rs` (no longer needed) | P2.1 | 1 |
| P2.5 | Update `state.rs` — Remove TTS proxy fields, add TTS model fields | P2.2 | 2 |
| P2.6 | Update `mod.rs` — Remove `tts_client` module | P2.4 | 0.5 |
| P2.7 | Create voice manifest structure (`models/voices/makima/`) | - | 1 |
| P2.8 | Acquire Makima voice reference audio | - | 2 |
| P2.9 | Test voice cloning quality | P1.8, P2.8 | 2 |

**Phase 2 Total: ~20.5 hours**

### Phase 3: Optimization + Integration (Priority: Medium)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P3.1 | Implement streaming generation (token-by-token waveform decode) | P2.1 | 6 |
| P3.2 | GPU memory optimization (bf16, cache management) | P3.1 | 4 |
| P3.3 | Listen page integration for bidirectional speech | P2.1 | 4 |
| P3.4 | Latency benchmarks | P3.1 | 3 |
| P3.5 | Integration tests (WebSocket end-to-end) | P2.1 | 4 |
| P3.6 | Documentation | P3.5 | 2 |

**Phase 3 Total: ~23 hours**

---

## 3. File Changes

### New Files

```
makima/src/tts/
├── mod.rs                      // TTS trait, factory, shared types
├── chatterbox.rs               // Existing ONNX-based Chatterbox (moved from tts.rs)
└── qwen3/
    ├── mod.rs                  // Qwen3TTS public API
    ├── model.rs                // Qwen3 LM transformer (28 layers)
    ├── code_predictor.rs       // MTP module (5 layers, 16 codebooks)
    ├── speech_tokenizer.rs     // Encoder + Decoder (causal ConvNet + RVQ)
    ├── config.rs               // Model config from config.json / safetensors
    └── generate.rs             // Autoregressive generation loop with KV cache
```

### Modified Files

| File | Change Description |
|------|-------------------|
| `makima/src/server/handlers/speak.rs` | Rewrite: direct TTS engine call instead of proxy |
| `makima/src/server/state.rs` | Remove `tts_client`/`tts_config` fields, add `tts_models: OnceCell<TtsModels>` |
| `makima/src/server/mod.rs` | Remove `pub mod tts_client;` |
| `makima/src/server/handlers/mod.rs` | No change (speak already exported) |
| `makima/Cargo.toml` | Add candle-core, candle-nn, candle-transformers; remove tokio-tungstenite if unused |
| `makima/src/lib.rs` or `main.rs` | Add `pub mod tts;` |

### Deleted Files

| File | Reason |
|------|--------|
| `makima/src/server/tts_client.rs` | No longer needed — no proxy pattern |
| `tts-service/` (entire directory) | Python service rejected; pure Rust solution |

---

## 4. Phase 1: Candle-Based TTS Module

### 4.1 TTS Trait and Factory (`tts/mod.rs`)

```rust
use async_trait::async_trait;

/// Audio chunk for streaming output.
pub struct AudioChunk {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub is_final: bool,
}

/// TTS engine trait — implemented by Chatterbox and Qwen3.
#[async_trait]
pub trait TtsEngine: Send + Sync {
    /// Generate audio from text.
    async fn generate(
        &self,
        text: &str,
        voice_id: &str,
        language: &str,
    ) -> Result<Vec<AudioChunk>, TtsError>;

    /// Pre-load a voice prompt.
    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;

    /// Check if the engine is ready.
    fn is_ready(&self) -> bool;
}
```
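The factory half of this module is not shown above. A minimal sketch of engine selection follows, assuming the engine is chosen from a string setting such as the `TTS_ENGINE` variable used in Appendix A. `EngineKind`, `resolve_engine`, the `"chatterbox"` value, and the Qwen3 default are illustrative assumptions, not part of the spec.

```rust
use std::str::FromStr;

/// Which TTS backend to construct. Hypothetical name, not from the spec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngineKind {
    Chatterbox,
    Qwen3,
}

impl FromStr for EngineKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Accept surrounding whitespace and any letter case.
        match s.trim().to_ascii_lowercase().as_str() {
            "chatterbox" => Ok(EngineKind::Chatterbox),
            "qwen3" => Ok(EngineKind::Qwen3),
            other => Err(format!("unknown TTS engine: {other}")),
        }
    }
}

/// Resolve the engine from an optional setting (for example the value of
/// `TTS_ENGINE`). Falling back to Qwen3 when unset is an assumption.
pub fn resolve_engine(setting: Option<&str>) -> Result<EngineKind, String> {
    match setting {
        Some(value) => value.parse(),
        None => Ok(EngineKind::Qwen3),
    }
}
```

A real factory would then map each `EngineKind` to a boxed `TtsEngine` implementation; that step is omitted because it depends on the concrete engine constructors.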

### 4.2 Qwen3 LM Backbone (`tts/qwen3/model.rs`)

Extend candle-transformers' Qwen2 model implementation:

- **28 transformer layers** with RoPE, GQA (16 heads, 8 KV heads), head dim 128
- **Hidden size:** 1024, **intermediate size:** 3072
- **Input:** text tokens + reference audio codes (concatenated)
- **Output:** zeroth codebook token logits

**Key implementation detail:** The existing `candle_transformers::models::qwen2` module provides the base attention and MLP layers. We extend this with:
- TTS-specific input embedding (text + audio token embeddings)
- Speaker encoder concatenation
- Code predictor output head (instead of standard LM head)
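The projection widths implied by the hyperparameters above can be checked with a small config sketch. The struct and field names are hypothetical (the real `config.json` keys may differ); only the numbers come from this section.

```rust
/// Backbone hyperparameters as listed in this section.
/// Field names are illustrative, not the actual config.json keys.
#[derive(Debug, Clone, PartialEq)]
pub struct BackboneConfig {
    pub num_layers: usize,
    pub hidden_size: usize,
    pub intermediate_size: usize,
    pub num_heads: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
}

impl BackboneConfig {
    /// Values copied from the plan.
    pub fn from_plan() -> Self {
        Self {
            num_layers: 28,
            hidden_size: 1024,
            intermediate_size: 3072,
            num_heads: 16,
            num_kv_heads: 8,
            head_dim: 128,
        }
    }

    /// Output width of the query projection: heads x head_dim.
    pub fn q_proj_out(&self) -> usize {
        self.num_heads * self.head_dim
    }

    /// Output width of each key/value projection: KV heads x head_dim.
    pub fn kv_proj_out(&self) -> usize {
        self.num_kv_heads * self.head_dim
    }

    /// Number of query heads that share one KV head under GQA.
    pub fn kv_groups(&self) -> Result<usize, String> {
        if self.num_kv_heads == 0 || self.num_heads % self.num_kv_heads != 0 {
            return Err("num_heads must be a positive multiple of num_kv_heads".into());
        }
        Ok(self.num_heads / self.num_kv_heads)
    }
}
```

With these values the query projection is 2048 wide, twice the hidden size, so `head_dim` cannot be derived as `hidden_size / num_heads` (that would give 64). If the reused attention layers derive it that way, they would need an explicit `head_dim` instead.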

### 4.3 Code Predictor (`tts/qwen3/code_predictor.rs`)

- **5-layer** transformer module
- **Input:** hidden states from the main LM
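- **Output:** predictions for the 15 remaining codebooks (vocab size 2048 each); together with the zeroth codebook token from the main LM, this gives 16 codes per frame
- After the main LM predicts the zeroth codebook token, this module predicts the remaining 15 codebook layers in parallel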
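How the two outputs combine into one frame can be sketched as below. `assemble_frame` is a hypothetical helper, and applying the 2048-entry vocabulary bound to the zeroth codebook as well is an assumption.

```rust
pub const NUM_CODEBOOKS: usize = 16;
pub const CODEBOOK_VOCAB: u32 = 2048;

/// Combine the zeroth code (from the main LM) with the 15 codes from the
/// code predictor into one 16-entry frame, validating lengths and ranges.
pub fn assemble_frame(zeroth: u32, rest: &[u32]) -> Result<[u32; NUM_CODEBOOKS], String> {
    if rest.len() != NUM_CODEBOOKS - 1 {
        return Err(format!(
            "expected {} predicted codes, got {}",
            NUM_CODEBOOKS - 1,
            rest.len()
        ));
    }
    let mut frame = [0u32; NUM_CODEBOOKS];
    frame[0] = zeroth;
    frame[1..].copy_from_slice(rest);

    // Every code must index into its codebook.
    if let Some(bad) = frame.iter().find(|&&code| code >= CODEBOOK_VOCAB) {
        return Err(format!("code {bad} is outside the codebook vocabulary"));
    }
    Ok(frame)
}
```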

### 4.4 Speech Tokenizer (`tts/qwen3/speech_tokenizer.rs`)

Two sub-components:

**Encoder** (used for voice cloning):
- Causal 1D ConvNet converting reference audio waveform → discrete multi-codebook tokens
- 16-layer RVQ (Residual Vector Quantization)
- First codebook = semantic (WavLM-guided), remaining 15 = acoustic

**Decoder** (used for audio output):
- Causal 1D ConvNet reconstructing waveforms from discrete codes
- Input: 16 codebook indices → lookup embeddings → ConvNet → waveform
- Output: 24kHz mono audio
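The "lookup embeddings" step of the decoder can be illustrated in plain Rust. In residual vector quantization the latent for a frame is the sum of one embedding per codebook layer. This sketch uses nested `Vec`s rather than candle tensors, and `rvq_dequantize` is a hypothetical name.

```rust
/// Sum one embedding per RVQ layer to recover the latent for a frame.
/// `codebooks[layer][index]` is the embedding vector for that code.
pub fn rvq_dequantize(
    codebooks: &[Vec<Vec<f32>>],
    codes: &[usize],
) -> Result<Vec<f32>, String> {
    if codebooks.len() != codes.len() {
        return Err(format!(
            "expected {} codes, got {}",
            codebooks.len(),
            codes.len()
        ));
    }
    let dim = codebooks
        .first()
        .and_then(|codebook| codebook.first())
        .map(|entry| entry.len())
        .ok_or("empty codebooks")?;

    let mut latent = vec![0.0f32; dim];
    for (layer, (codebook, &code)) in codebooks.iter().zip(codes).enumerate() {
        let entry = codebook
            .get(code)
            .ok_or_else(|| format!("code {code} out of range for layer {layer}"))?;
        if entry.len() != dim {
            return Err(format!("layer {layer} has a different embedding width"));
        }
        // Residual layers refine the estimate additively.
        for (acc, value) in latent.iter_mut().zip(entry) {
            *acc += *value;
        }
    }
    Ok(latent)
}
```

In the candle version each layer would be a `candle_nn::Embedding` lookup followed by a tensor add; the summed latent is what the ConvNet consumes.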

**candle implementation notes:**
- `candle_nn::Conv1d` for all convolution layers
- `candle_nn::Embedding` for codebook lookups
- Weight normalization handled manually
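One way to handle weight normalization manually is to fold it into a plain weight tensor at load time, so inference uses an ordinary convolution. The sketch below assumes the checkpoint stores a direction tensor `v` and a per-output-channel gain `g` (the usual weight-norm parameterization, `w = g * v / ||v||`); whether the Qwen3-TTS checkpoint is laid out this way is not confirmed here. `fold_weight_norm` is a hypothetical helper operating on flattened slices.

```rust
/// Fold weight normalization into a plain weight: w = g * v / ||v||,
/// with the norm taken per output channel.
/// `v` is [out_channels, in_channels * kernel] flattened row-major.
pub fn fold_weight_norm(
    v: &[f32],
    g: &[f32],
    out_channels: usize,
) -> Result<Vec<f32>, String> {
    if out_channels == 0 || v.is_empty() {
        return Err("weight must be non-empty".into());
    }
    if g.len() != out_channels || v.len() % out_channels != 0 {
        return Err("shape mismatch between v, g, and out_channels".into());
    }
    let row_len = v.len() / out_channels;
    let mut w = Vec::with_capacity(v.len());

    for (channel, &gain) in v.chunks(row_len).zip(g) {
        // L2 norm over everything belonging to this output channel.
        let norm = channel.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm == 0.0 {
            return Err("zero-norm channel cannot be normalized".into());
        }
        w.extend(channel.iter().map(|x| gain * x / norm));
    }
    Ok(w)
}
```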

### 4.5 Autoregressive Generation (`tts/qwen3/generate.rs`)

```rust
// Sketch: `prepare_input`, `next_step_input`, `sample_token`, `KvCache`,
// `EOS_TOKEN`, and `MAX_STEPS` are placeholders to be defined in this module.
pub async fn generate(
    model: &Qwen3Model,
    code_predictor: &CodePredictor,
    speech_tokenizer: &SpeechTokenizer,
    text_tokens: &[u32],
    voice_prompt: &VoicePrompt,
) -> Result<Vec<AudioChunk>, TtsError> {
    // 1. Encode reference audio → speaker embedding + audio codes
    let speaker_emb = speech_tokenizer.encode(&voice_prompt.audio)?;

    // 2. Prepare input: [text_tokens, audio_codes]
    let mut input = prepare_input(text_tokens, &speaker_emb)?;

    // 3. Autoregressive loop with KV cache, bounded so that a missing
    //    EOS token cannot loop forever
    let mut kv_cache = KvCache::new(model.num_layers());
    let mut chunks = Vec::new();

    for _ in 0..MAX_STEPS {
        let logits = model.forward(&input, &mut kv_cache)?;
        let next_token = sample_token(&logits);

        if next_token == EOS_TOKEN { break; }

        // 4. Code predictor: predict remaining 15 codebooks
        let all_codes = code_predictor.predict(&model.last_hidden_state(), next_token)?;

        // 5. Decode to audio (can be done incrementally for streaming;
        //    this version collects the chunks and returns them together)
        chunks.push(speech_tokenizer.decode(&all_codes)?);

        // 6. The KV cache already holds the prefix, so the next step only
        //    feeds the newly generated frame
        input = next_step_input(&all_codes)?;
    }

    // Mark the last chunk so the caller knows generation finished
    if let Some(last) = chunks.last_mut() {
        last.is_final = true;
    }
    Ok(chunks)
}
```
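The generation loop relies on a `sample_token` step that is not specified. As a baseline for validating the pipeline, greedy decoding (argmax over the logits) is the simplest choice; production TTS decoding typically uses temperature or top-k sampling instead, so this is a stand-in rather than the planned sampler. `sample_greedy` is a hypothetical name and works on a plain slice rather than a candle tensor.

```rust
/// Greedy decoding: return the index of the largest logit.
/// NaN entries are skipped; an empty or all-NaN slice yields `None`.
pub fn sample_greedy(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, value)| !value.is_nan())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(index, _)| index as u32)
}
```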

---

## 5. Phase 2: WebSocket Handler + Voice Assets

### 5.1 Speak Handler (Rewritten)

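```rust
// makima/src/server/handlers/speak.rs
//
// Direct TTS handler: no proxy, no external service.
// Sketch: `ClientMessage`, `parse_client_message`, `send_error`, and
// `complete_message` are helpers expected to live in this module.

use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use futures_util::{SinkExt, StreamExt}; // assumed; the `futures` crate also works
use uuid::Uuid;

pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(|socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(mut socket: WebSocket, state: SharedState) {
    // Session identifier for logging and tracing
    let _session_id = Uuid::new_v4().to_string();

    // Lazy-load TTS models (like listen.rs does for STT)
    let tts = match state.get_tts_models().await {
        Ok(tts) => tts,
        Err(e) => {
            send_error(&mut socket, "MODEL_LOADING", &e.to_string()).await;
            return;
        }
    };

    // Session loop: parse JSON messages, dispatch to TTS engine
    let (mut sender, mut receiver) = socket.split();

    // Set by Start; the defaults and the `config` field names are assumptions
    let mut voice_id = String::from("makima");
    let mut language = String::from("en");

    // Stop on disconnect or on the first transport error
    while let Some(Ok(msg)) = receiver.next().await {
        match parse_client_message(msg) {
            ClientMessage::Start(config) => {
                // Load voice, send Ready
                voice_id = config.voice_id;
                language = config.language;
            }
            ClientMessage::Speak(text) => {
                // Run inference, stream audio chunks
                let chunks = match tts.engine.generate(&text, &voice_id, &language).await {
                    Ok(chunks) => chunks,
                    Err(_) => {
                        // Report the failure to the client here; the session stays open
                        continue;
                    }
                };
                for chunk in chunks {
                    if sender.send(Message::Binary(chunk.to_pcm16())).await.is_err() {
                        return; // client went away
                    }
                }
                if sender.send(complete_message()).await.is_err() {
                    return;
                }
            }
            ClientMessage::Stop => break,
            ClientMessage::Cancel => { /* abort current generation */ }
        }
    }
}
```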
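The handler sends each chunk as 16-bit PCM via `chunk.to_pcm16()`, which is not defined in this plan. A minimal sketch of that conversion, written as a free function over the `samples` field of `AudioChunk`, is shown below. The little-endian byte order and the symmetric scaling by `i16::MAX` are assumptions; the wire format should follow the spec.

```rust
/// Convert f32 samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.
/// Out-of-range samples are clamped and NaN is treated as silence.
pub fn to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(samples.len() * 2);
    for &sample in samples {
        let clamped = if sample.is_nan() {
            0.0
        } else {
            sample.clamp(-1.0, 1.0)
        };
        // Symmetric scaling: +1.0 maps to 32767 and -1.0 to -32767.
        let value = (clamped * i16::MAX as f32).round() as i16;
        bytes.extend_from_slice(&value.to_le_bytes());
    }
    bytes
}
```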

### 5.2 State Changes

Remove from `AppState`:
- `tts_client: Option<Arc<TtsProxyClient>>`
- `tts_config: Option<TtsConfig>`
- `with_tts()` method
- `is_tts_healthy()` method

Add to `AppState`:
- `tts_models: OnceCell<TtsModels>` — lazily loaded TTS engine
- `get_tts_models()` method (async, like `get_ml_models()`)
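The load-once behavior can be sketched with the standard library's `OnceLock`, which is a synchronous stand-in for the async `OnceCell` named above. The `TtsModels` contents and the `load_count` counter are illustrative only; the real `get_tts_models()` is async and returns a `Result`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Placeholder for the loaded engine; the real struct would hold the
/// model, code predictor, and speech tokenizer.
pub struct TtsModels {
    pub engine_name: String,
}

pub struct AppState {
    tts_models: OnceLock<TtsModels>,
    /// Counts how often the loader actually ran (for demonstration only).
    load_count: AtomicUsize,
}

impl AppState {
    pub fn new() -> Self {
        Self {
            tts_models: OnceLock::new(),
            load_count: AtomicUsize::new(0),
        }
    }

    /// The first call runs the loader; later calls return the cached value.
    pub fn get_tts_models(&self) -> &TtsModels {
        self.tts_models.get_or_init(|| {
            self.load_count.fetch_add(1, Ordering::SeqCst);
            // Weight loading from safetensors would happen here.
            TtsModels {
                engine_name: "qwen3".to_string(),
            }
        })
    }

    pub fn load_count(&self) -> usize {
        self.load_count.load(Ordering::SeqCst)
    }
}
```

With this shape, server startup stays fast and the cost of loading weights is paid by the first `/api/v1/speak` session only.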

### 5.3 Voice Assets

```
models/voices/makima/
├── manifest.json       # Voice metadata
├── reference.wav       # 5-15 second reference audio
└── transcript.txt      # Exact transcript of reference audio
```
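Voice prompts built from these assets are meant to be cached with LRU eviction (task P2.3). A minimal sketch of such a cache follows, keyed by voice ID. `VoiceCache` is a hypothetical type, and the `VecDeque` recency list is chosen for brevity over speed, which should be adequate for a handful of voices.

```rust
use std::collections::{HashMap, VecDeque};

/// Small LRU cache keyed by voice ID.
pub struct VoiceCache<V> {
    capacity: usize,
    entries: HashMap<String, V>,
    /// Recency order: front is least recently used.
    order: VecDeque<String>,
}

impl<V> VoiceCache<V> {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1), // a zero-capacity cache is not useful
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Move `id` to the most-recently-used position.
    fn touch(&mut self, id: &str) {
        self.order.retain(|key| key.as_str() != id);
        self.order.push_back(id.to_string());
    }

    pub fn get(&mut self, id: &str) -> Option<&V> {
        if self.entries.contains_key(id) {
            self.touch(id);
        }
        self.entries.get(id)
    }

    pub fn insert(&mut self, id: &str, value: V) {
        // Evict the least recently used entry when adding a new key to a full cache.
        if !self.entries.contains_key(id) && self.entries.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(id.to_string(), value);
        self.touch(id);
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```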

---

## 6. Phase 3: Optimization + Integration

### 6.1 Streaming Generation

The 12Hz model's causal architecture enables token-by-token waveform generation:
- Each token = ~80ms of audio (12.5 Hz)
- After generating each token, decode immediately and send audio chunk
- Client receives audio before full generation completes
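The timing budget follows directly from these numbers and the 24kHz decoder output in section 4.4. A small worked example, with hypothetical constant and function names:

```rust
/// Decoder output rate from section 4.4.
pub const SAMPLE_RATE_HZ: u32 = 24_000;
/// Token (frame) rate of the 12Hz model.
pub const FRAME_RATE_HZ: f64 = 12.5;

/// PCM samples produced per generated token.
pub fn samples_per_token() -> u32 {
    (SAMPLE_RATE_HZ as f64 / FRAME_RATE_HZ).round() as u32
}

/// Audio duration covered by one token, in milliseconds.
pub fn token_duration_ms() -> f64 {
    1000.0 / FRAME_RATE_HZ
}

/// Streaming keeps up with playback only if producing one token
/// (LM step, code predictor, decode) takes less time than the audio it yields.
pub fn keeps_up_with_playback(per_token_compute_ms: f64) -> bool {
    per_token_compute_ms < token_duration_ms()
}
```

Each token therefore yields 1920 samples covering 80ms, which is the per-token compute ceiling for gap-free playback once the first chunk has been sent.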

### 6.2 GPU Memory Optimization

- Load weights in bf16/f16 (candle supports both)
- Implement KV cache with fixed maximum size
- Clear cache between sessions
- CPU fallback when GPU is unavailable
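The fixed-size KV cache and the per-session reset can be sketched as follows. This is a single-layer, plain-`Vec` illustration; `BoundedKvCache` is a hypothetical type, and the candle version would store tensors per layer and preallocate on the device.

```rust
/// KV cache for one layer with a hard cap on the number of positions.
pub struct BoundedKvCache {
    max_len: usize,
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl BoundedKvCache {
    pub fn new(max_len: usize) -> Self {
        Self {
            max_len,
            keys: Vec::new(),
            values: Vec::new(),
        }
    }

    /// Append the key/value for one new position; fails once the cap is hit
    /// so memory use stays bounded for runaway generations.
    pub fn append(&mut self, key: Vec<f32>, value: Vec<f32>) -> Result<(), String> {
        if self.keys.len() >= self.max_len {
            return Err(format!("KV cache full ({} positions)", self.max_len));
        }
        self.keys.push(key);
        self.values.push(value);
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.keys.len()
    }

    pub fn is_empty(&self) -> bool {
        self.keys.is_empty()
    }

    /// Reset between sessions so no state leaks across requests.
    pub fn clear(&mut self) {
        self.keys.clear();
        self.values.clear();
    }
}
```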

### 6.3 Listen Page Integration

Following the pattern in `listen.rs`:
- TTS model protected behind `tokio::sync::Mutex`
- WebSocket endpoint emits audio chunks as tokens are generated
- Bidirectional: STT (listen) → process → TTS (speak) loop

---

## 7. Testing Plan

### 7.1 Unit Tests

| Test Area | Coverage | Key Tests |
|-----------|----------|-----------|
| Config parsing | 100% | Load config from JSON, validate fields |
| Model layers | 80% | Attention, MLP, Conv1d shapes |
| Code predictor | 85% | Multi-codebook output shapes |
| Speech tokenizer | 80% | Encode/decode round-trip |
| Voice cache | 95% | LRU eviction, TTL expiration |
| Message parsing | 100% | All client/server message types |

### 7.2 Integration Tests

| Test | Description |
|------|-------------|
| WebSocket flow | Connect → Start → Speak → Audio chunks → Complete → Stop |
| Error handling | Invalid text, unknown voice, model loading failure |
| Cancellation | Cancel mid-generation |
| Voice cloning | Generate with custom reference audio |

### 7.3 Latency Benchmarks

| Metric | Target | Acceptable | Warning |
|--------|--------|------------|---------|
| First Audio (short text) | < 150ms | < 200ms | > 300ms |
| First Audio (medium text) | < 200ms | < 300ms | > 500ms |
| First Audio (long text) | < 300ms | < 500ms | > 800ms |
| Inter-chunk latency | < 30ms | < 50ms | > 100ms |
| GPU memory | < 4GB | < 6GB | > 8GB |

---

## 8. Risk Assessment

### 8.1 Technical Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Candle implementation takes longer than estimated** | Medium | Medium | Reference Crane's Spark-TTS; use qwen3-rs as LM reference |
| **Speech tokenizer ConvNet is complex** | Medium | High | Study PyTorch source; ConvNet layers are simpler than transformers |
| **Model quality differs from PyTorch** | Low | High | Validate with reference audio; ensure bf16 precision |
| **Crane ships Qwen3-TTS first** | Medium | Positive | Adopt their implementation or use as reference |
| **GPU memory issues** | Low | Medium | 0.6B model is small (~2.5GB); fits in 4GB VRAM |

### 8.2 Contingency Plans

| Scenario | Response |
|----------|----------|
| Candle implementation blocked | Use Crane crate as dependency if they ship Qwen3-TTS |
| ConvNet decoder too complex | Implement simplified decoder; optimize later |
| Latency exceeds targets | Start with batch mode + chunked delivery (acceptable UX) |
| No GPU available | CPU fallback with candle's MKL support (degraded performance) |

---

## 9. Dependencies & Prerequisites

### 9.1 Rust Dependencies

Add to `Cargo.toml`:

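```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
candle-transformers = "0.8"
# Keep existing: tokenizers, hf-hub, safetensors, ndarray

# Assumed: makima forwards the GPU backends to candle so that
# `cargo build --features cuda` (Appendix A) works.
[features]
cuda = ["candle-core/cuda", "candle-nn/cuda", "candle-transformers/cuda"]
metal = ["candle-core/metal", "candle-nn/metal", "candle-transformers/metal"]
```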

### 9.2 Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | CUDA 4GB VRAM / Metal (macOS) | NVIDIA RTX 3060+ (8GB+) |
| RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB |

### 9.3 Voice Asset Prerequisites

Before Phase 2 voice testing:
1. Makima voice reference audio (5-15 seconds, clean speech)
2. Accurate transcript of reference audio
3. Format: WAV 24kHz mono, 16-bit PCM
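The format requirement can be checked before a reference file is accepted. The sketch below inspects the canonical RIFF/WAVE header, assuming the `fmt ` chunk immediately follows the `WAVE` tag (files with extra chunks before `fmt ` would need a real parser). `check_reference_wav` is a hypothetical helper.

```rust
/// Validate that a WAV header describes 24kHz mono 16-bit PCM.
/// Assumes the canonical layout where "fmt " directly follows "WAVE".
pub fn check_reference_wav(header: &[u8]) -> Result<(), String> {
    if header.len() < 36 {
        return Err("header too short".into());
    }
    if &header[0..4] != b"RIFF" || &header[8..12] != b"WAVE" || &header[12..16] != b"fmt " {
        return Err("not a canonical RIFF/WAVE file".into());
    }
    let u16_at = |i: usize| u16::from_le_bytes([header[i], header[i + 1]]);
    let sample_rate =
        u32::from_le_bytes([header[24], header[25], header[26], header[27]]);

    if u16_at(20) != 1 {
        return Err("audio format is not PCM".into());
    }
    if u16_at(22) != 1 {
        return Err("audio is not mono".into());
    }
    if sample_rate != 24_000 {
        return Err(format!("sample rate is {sample_rate} Hz, expected 24000"));
    }
    if u16_at(34) != 16 {
        return Err("bit depth is not 16".into());
    }
    Ok(())
}
```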

---

## 10. Success Criteria

### 10.1 Phase 1 Completion

- [ ] candle-based Qwen3 model loads safetensors weights
- [ ] Forward pass produces valid logits
- [ ] Speech tokenizer encodes/decodes audio
- [ ] Code predictor produces the 15 remaining codebook layers (16 per frame with the zeroth from the LM)
- [ ] Unit tests pass with > 80% coverage

### 10.2 Phase 2 Completion

- [ ] `/api/v1/speak` endpoint produces audio from text
- [ ] No Python service required
- [ ] Voice cloning works with reference audio
- [ ] Error handling returns appropriate codes
- [ ] speak.rs calls TTS engine directly (no proxy)

### 10.3 Phase 3 Completion

- [ ] Streaming generation with < 200ms TTFA
- [ ] GPU memory usage < 6GB
- [ ] Integration tests pass
- [ ] Listen page bidirectional speech works
- [ ] Latency benchmarks documented

### 10.4 Final Acceptance Criteria

1. **Functional:** End-to-end TTS streaming via WebSocket, pure Rust, no Python
2. **Performance:** TTFA < 200ms, inter-chunk latency < 50ms (the "Acceptable" threshold in section 7.3, and below the ~80ms of audio each chunk carries)
3. **Quality:** Synthesized speech is intelligible and recognizable as Makima
4. **Reliability:** Error handling is robust; graceful degradation on GPU failure
5. **Architecture:** Clean `tts/` module with trait-based engine selection

---

## Appendix A: Quick Start Commands

### Development

```bash
# Build with candle GPU support
cd makima
cargo build --features cuda   # or --features metal for macOS

# Run server with TTS enabled
TTS_ENGINE=qwen3 TTS_DEVICE=cuda:0 cargo run

# Run TTS-specific tests
cargo test tts
```

### Benchmarks

```bash
# Run latency benchmarks (requires GPU)
cargo bench --bench tts_latency
```

## Appendix B: Reference Implementations

- [candle-transformers qwen2 model](https://docs.rs/candle-transformers/latest/candle_transformers/models/qwen2/index.html) — base attention/MLP layers
- [qwen3-rs](https://github.com/reinterpretcat/qwen3-rs) — educational Qwen3 in Rust
- [Crane](https://github.com/lucasjinreal/Crane) — Rust TTS engine (Qwen3-TTS on roadmap)
- [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md) — full feasibility analysis