# Qwen3-TTS Implementation Plan — Pure Rust (Candle)

**Version:** 2.0
**Created:** 2026-01-27
**Status:** Final
**Authors:** makima development team
**Spec Reference:** [docs/specs/qwen3-tts-spec.md](../specs/qwen3-tts-spec.md)
**Research:** [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md)

---

## Table of Contents

1. [Overview](#1-overview)
2. [Task Breakdown](#2-task-breakdown)
3. [File Changes](#3-file-changes)
4. [Phase 1: Candle-Based TTS Module](#4-phase-1-candle-based-tts-module)
5. [Phase 2: WebSocket Handler + Voice Assets](#5-phase-2-websocket-handler--voice-assets)
6. [Phase 3: Optimization + Integration](#6-phase-3-optimization--integration)
7. [Testing Plan](#7-testing-plan)
8. [Risk Assessment](#8-risk-assessment)
9. [Dependencies & Prerequisites](#9-dependencies--prerequisites)
10. [Success Criteria](#10-success-criteria)

---

## 1. Overview

This plan details the implementation of Qwen3-TTS integration for the makima system as a **pure Rust** solution using the **candle** ML framework. There is no Python microservice and no proxy pattern — the TTS model runs directly inside the main makima process, loading safetensors weights via candle.

### Key Objectives

1. Implement Qwen3-TTS model inference natively in Rust using candle
2. Create a `makima/src/tts/` module with TTS trait, Chatterbox adapter, and Qwen3 submodule
3. Update the `/api/v1/speak` WebSocket handler to call the TTS engine directly
4. Enable streaming audio delivery with <200ms time-to-first-audio (TTFA)
5. Support voice cloning with default Makima voice

### Architecture Summary

```
Client Browser
     │
     │ WebSocket: /api/v1/speak
     ▼
Makima Server (Rust/Axum)
     │
     │ speak.rs handler → TTS Engine (in-process)
     │
     │ candle-based Qwen3-TTS inference
     │ (safetensors weights loaded directly)
     ▼
Audio Stream back to client
```

**Key architectural decisions:**
- **No Python.** All inference runs in Rust via candle.
- **No microservice.** TTS runs in-process, no separate service to deploy.
- **No proxy.** The speak handler calls the TTS engine directly.
- **Lazy loading.** Models are loaded on the first TTS request, following the pattern already used in `listen.rs`.
- **SafeTensors.** Weights loaded directly — no ONNX conversion needed.

---

## 2. Task Breakdown

### Phase 1: Candle-Based TTS Module (Priority: Critical)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P1.1 | Create `makima/src/tts/mod.rs` — TTS trait + factory + types | - | 3 |
| P1.2 | Move existing `tts.rs` to `makima/src/tts/chatterbox.rs` | P1.1 | 2 |
| P1.3 | Create `makima/src/tts/qwen3/config.rs` — Model config parsing | P1.1 | 2 |
| P1.4 | Implement `makima/src/tts/qwen3/model.rs` — 28-layer LM backbone | P1.3 | 12 |
| P1.5 | Implement `makima/src/tts/qwen3/code_predictor.rs` — MTP module | P1.4 | 8 |
| P1.6 | Implement `makima/src/tts/qwen3/speech_tokenizer.rs` — ConvNet codec | P1.3 | 10 |
| P1.7 | Implement `makima/src/tts/qwen3/generate.rs` — Autoregressive generation | P1.4, P1.5, P1.6 | 8 |
| P1.8 | Create `makima/src/tts/qwen3/mod.rs` — Public API | P1.7 | 3 |
| P1.9 | Add candle dependencies to `Cargo.toml` | - | 1 |
| P1.10 | Unit tests for config, model layers, tokenizer | P1.4-P1.6 | 6 |

**Phase 1 Total: ~55 hours**

### Phase 2: WebSocket Handler + Voice Assets (Priority: High)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P2.1 | Rewrite `speak.rs` — Direct TTS handler (remove proxy) | P1.8 | 6 |
| P2.2 | Add TTS models to `SharedState` (lazy loading via `OnceCell`) | P1.8 | 3 |
| P2.3 | Implement voice prompt caching (LRU) | P1.8 | 3 |
| P2.4 | Remove `tts_client.rs` (no longer needed) | P2.1 | 1 |
| P2.5 | Update `state.rs` — Remove TTS proxy fields, add TTS model fields | P2.2 | 2 |
| P2.6 | Update `mod.rs` — Remove `tts_client` module | P2.4 | 0.5 |
| P2.7 | Create voice manifest structure (`models/voices/makima/`) | - | 1 |
| P2.8 | Acquire Makima voice reference audio | - | 2 |
| P2.9 | Test voice cloning quality | P1.8, P2.8 | 2 |

**Phase 2 Total: ~20.5 hours**

### Phase 3: Optimization + Integration (Priority: Medium)

| ID | Task | Depends On | Estimated Hours |
|----|------|------------|-----------------|
| P3.1 | Implement streaming generation (token-by-token waveform decode) | P2.1 | 6 |
| P3.2 | GPU memory optimization (bf16, cache management) | P3.1 | 4 |
| P3.3 | Listen page integration for bidirectional speech | P2.1 | 4 |
| P3.4 | Latency benchmarks | P3.1 | 3 |
| P3.5 | Integration tests (WebSocket end-to-end) | P2.1 | 4 |
| P3.6 | Documentation | P3.5 | 2 |

**Phase 3 Total: ~23 hours**

---

## 3. File Changes

### New Files

```
makima/src/tts/
├── mod.rs                      // TTS trait, factory, shared types
├── chatterbox.rs               // Existing ONNX-based Chatterbox (moved from tts.rs)
└── qwen3/
    ├── mod.rs                  // Qwen3TTS public API
    ├── model.rs                // Qwen3 LM transformer (28 layers)
    ├── code_predictor.rs       // MTP module (5 layers, 16 codebooks)
    ├── speech_tokenizer.rs     // Encoder + Decoder (causal ConvNet + RVQ)
    ├── config.rs               // Model config from config.json / safetensors
    └── generate.rs             // Autoregressive generation loop with KV cache
```

### Modified Files

| File | Change Description |
|------|-------------------|
| `makima/src/server/handlers/speak.rs` | Rewrite: direct TTS engine call instead of proxy |
| `makima/src/server/state.rs` | Remove `tts_client`/`tts_config` fields, add `tts_models: OnceCell<TtsModels>` |
| `makima/src/server/mod.rs` | Remove `pub mod tts_client;` |
| `makima/src/server/handlers/mod.rs` | No change (speak already exported) |
| `makima/Cargo.toml` | Add candle-core, candle-nn, candle-transformers; remove tokio-tungstenite if unused |
| `makima/src/lib.rs` or `main.rs` | Add `pub mod tts;` |

### Deleted Files

| File | Reason |
|------|--------|
| `makima/src/server/tts_client.rs` | No longer needed — no proxy pattern |
| `tts-service/` (entire directory) | Python service rejected; pure Rust solution |

---

## 4. Phase 1: Candle-Based TTS Module

### 4.1 TTS Trait and Factory (`tts/mod.rs`)

```rust
use async_trait::async_trait;

/// Audio chunk for streaming output.
pub struct AudioChunk {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub is_final: bool,
}

/// TTS engine trait — implemented by Chatterbox and Qwen3.
#[async_trait]
pub trait TtsEngine: Send + Sync {
    /// Generate audio from text.
    async fn generate(
        &self,
        text: &str,
        voice_id: &str,
        language: &str,
    ) -> Result<Vec<AudioChunk>, TtsError>;

    /// Pre-load a voice prompt.
    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;

    /// Check if the engine is ready.
    fn is_ready(&self) -> bool;
}
```
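The factory named in task P1.1 is not shown above. A minimal sketch of engine selection, assuming the `TTS_ENGINE` value from Appendix A maps onto an enum; `EngineKind` and its variants are hypothetical names, not existing code:

```rust
/// Hypothetical engine selector; the plan only names the two engines.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngineKind {
    Chatterbox,
    Qwen3,
}

impl std::str::FromStr for EngineKind {
    type Err = String;

    /// Parse a case-insensitive engine name such as the `TTS_ENGINE` value.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "chatterbox" => Ok(EngineKind::Chatterbox),
            "qwen3" => Ok(EngineKind::Qwen3),
            other => Err(format!("unknown TTS engine: {other}")),
        }
    }
}
```

A factory function could then match on `EngineKind` and return a `Box<dyn TtsEngine>` for the chosen backend.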

### 4.2 Qwen3 LM Backbone (`tts/qwen3/model.rs`)

Extend candle-transformers' Qwen2 model implementation:

- **28 transformer layers** with RoPE, GQA (16 heads, 8 KV heads), head dim 128
- **Hidden size:** 1024, **intermediate size:** 3072
- **Input:** text tokens + reference audio codes (concatenated)
- **Output:** zeroth codebook token logits

**Key implementation detail:** The existing `candle_transformers::models::qwen2` module provides the base attention and MLP layers. We extend this with:
- TTS-specific input embedding (text + audio token embeddings)
- Speaker encoder concatenation
- Code predictor output head (instead of standard LM head)
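The dimensions above imply projection widths that are easy to get wrong, because `num_heads * head_dim` does not equal the hidden size. The sketch below only does the arithmetic on the figures quoted in this section; `Qwen3LmDims` is an illustrative name, and the real values should be read from `config.json`:

```rust
/// Backbone dimensions quoted in this section (illustrative struct).
#[derive(Debug, Clone, Copy)]
pub struct Qwen3LmDims {
    pub num_layers: usize,
    pub hidden_size: usize,
    pub intermediate_size: usize,
    pub num_heads: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
}

impl Qwen3LmDims {
    /// Values as stated in the plan; not read from a checkpoint.
    pub const PLAN: Self = Self {
        num_layers: 28,
        hidden_size: 1024,
        intermediate_size: 3072,
        num_heads: 16,
        num_kv_heads: 8,
        head_dim: 128,
    };

    /// Output width of the query projection.
    pub fn q_proj_out(&self) -> usize {
        self.num_heads * self.head_dim
    }

    /// Output width of each of the key and value projections.
    pub fn kv_proj_out(&self) -> usize {
        self.num_kv_heads * self.head_dim
    }

    /// Number of query heads that share one KV head under GQA.
    pub fn kv_groups(&self) -> usize {
        self.num_heads / self.num_kv_heads
    }
}
```

With these figures the query projection maps 1024 to 2048 features, so `head_dim` has to be taken from the config rather than derived as `hidden_size / num_heads`.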

### 4.3 Code Predictor (`tts/qwen3/code_predictor.rs`)

- **5-layer** transformer module
- **Input:** hidden states from the main LM
- **Output:** predictions for the 15 residual codebooks (vocab size 2048 each); together with the zeroth codebook token from the main LM this gives 16 codebooks per frame
- After the main LM predicts the zeroth codebook token, this module predicts the remaining 15 codebook layers in parallel

### 4.4 Speech Tokenizer (`tts/qwen3/speech_tokenizer.rs`)

Two sub-components:

**Encoder** (used for voice cloning):
- Causal 1D ConvNet converting reference audio waveform → discrete multi-codebook tokens
- 16-layer RVQ (Residual Vector Quantization)
- First codebook = semantic (WavLM-guided), remaining 15 = acoustic

**Decoder** (used for audio output):
- Causal 1D ConvNet reconstructing waveforms from discrete codes
- Input: 16 codebook indices → lookup embeddings → ConvNet → waveform
- Output: 24kHz mono audio
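The "lookup embeddings" step can be sketched as follows, assuming the decoder uses the standard RVQ reconstruction (sum of one embedding per codebook layer). The `Codebook` type and `rvq_dequantize` function are hypothetical, and plain `Vec<f32>` stands in for candle tensors:

```rust
/// One RVQ codebook: `entries[i]` is the embedding vector for code index `i`.
pub struct Codebook {
    pub entries: Vec<Vec<f32>>,
}

/// Sum the selected embedding from each codebook layer.
/// Returns `None` if the code count does not match the codebook count,
/// an index is out of range, or embedding widths disagree.
pub fn rvq_dequantize(codebooks: &[Codebook], codes: &[u32]) -> Option<Vec<f32>> {
    if codebooks.is_empty() || codebooks.len() != codes.len() {
        return None;
    }
    let dim = codebooks[0].entries.first()?.len();
    let mut out = vec![0.0f32; dim];
    for (book, &code) in codebooks.iter().zip(codes) {
        let entry = book.entries.get(code as usize)?;
        if entry.len() != dim {
            return None;
        }
        for (acc, v) in out.iter_mut().zip(entry) {
            *acc += *v;
        }
    }
    Some(out)
}
```

In the real decoder this would run once per frame with 16 codebooks, and the summed vector would feed the ConvNet.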

**candle implementation notes:**
- `candle_nn::Conv1d` for all convolution layers
- `candle_nn::Embedding` for codebook lookups
- Weight normalization handled manually
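Handling weight normalization manually usually means fusing the stored gain and direction tensors into a plain weight at load time, `w[o] = g[o] * v[o] / ||v[o]||`. The sketch below assumes the checkpoint stores such a pair per convolution with the norm taken per output channel; `fuse_weight_norm` is a hypothetical helper operating on plain vectors rather than candle tensors:

```rust
/// Recombine weight-normalized parameters: w[o] = g[o] * v[o] / ||v[o]||.
/// `v` holds one flattened (in_channels * kernel_size) row per output channel.
pub fn fuse_weight_norm(g: &[f32], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    assert_eq!(g.len(), v.len(), "one gain per output channel");
    g.iter()
        .zip(v)
        .map(|(&gain, row)| {
            let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            // Guard against an all-zero direction vector.
            let scale = if norm > 0.0 { gain / norm } else { 0.0 };
            row.iter().map(|x| x * scale).collect()
        })
        .collect()
}
```

Fusing once at load time keeps the inference path a plain `Conv1d` with no per-step normalization cost.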

### 4.5 Autoregressive Generation (`tts/qwen3/generate.rs`)

```rust
pub async fn generate(
    model: &Qwen3Model,
    code_predictor: &CodePredictor,
    speech_tokenizer: &SpeechTokenizer,
    text_tokens: &[u32],
    voice_prompt: &VoicePrompt,
) -> Result<Vec<AudioChunk>, TtsError> {
    // 1. Encode reference audio → speaker embedding + audio codes
    let ref_codes = speech_tokenizer.encode(&voice_prompt.audio)?;

    // 2. Prepare input: [text_tokens, audio_codes]
    let mut input = prepare_input(text_tokens, &ref_codes)?;

    // 3. Autoregressive loop with KV cache.
    //    MAX_GENERATED_TOKENS is a hypothetical bound so a missing EOS cannot hang.
    let mut kv_cache = KvCache::new(model.num_layers());
    let mut chunks = Vec::new();

    for _ in 0..MAX_GENERATED_TOKENS {
        let logits = model.forward(&input, &mut kv_cache)?;
        let next_token = sample_token(&logits);

        if next_token == EOS_TOKEN { break; }

        // 4. Code predictor: predict remaining 15 codebooks
        let all_codes = code_predictor.predict(&model.last_hidden_state(), next_token)?;

        // 5. Decode to audio (can be done incrementally for streaming);
        //    `decode` is assumed to return an AudioChunk
        chunks.push(speech_tokenizer.decode(&all_codes)?);

        // 6. The KV cache holds the prefix, so only the new frame is fed next.
        //    `next_step_input` is a hypothetical helper.
        input = next_step_input(&all_codes)?;
    }

    // Mark the last chunk so the caller knows generation has finished
    if let Some(last) = chunks.last_mut() {
        last.is_final = true;
    }
    Ok(chunks)
}
```
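The plan does not specify how `sample_token` picks a token. As a baseline, a greedy (argmax) sketch over a plain logit slice is shown below; `sample_greedy` is a hypothetical name, and a real implementation would likely add temperature or top-k sampling and operate on a candle tensor:

```rust
/// Greedy decoding: index of the largest logit, or `None` for an empty slice.
/// NaN logits are skipped; ties resolve to the lowest index.
pub fn sample_greedy(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        if x.is_nan() {
            continue;
        }
        match best {
            // Keep the current best when it is at least as large.
            Some((_, b)) if b >= x => {}
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i as u32)
}
```

Greedy decoding is deterministic, which makes it convenient for comparing outputs against a reference implementation during validation.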

---

## 5. Phase 2: WebSocket Handler + Voice Assets

### 5.1 Speak Handler (Rewritten)

```rust
// makima/src/server/handlers/speak.rs
//
// Direct TTS handler — no proxy, no external service.
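
use axum::extract::ws::{Message, WebSocket, WebSocketUpgrade};
use axum::extract::State;
use axum::response::Response;
use futures::{SinkExt, StreamExt}; // assumes the `futures` crate is a dependency
use uuid::Uuid;

pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(move |socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(mut socket: WebSocket, state: SharedState) {
    let session_id = Uuid::new_v4().to_string();

    // Lazy-load TTS models (like listen.rs does for STT)
    let tts = match state.get_tts_models().await {
        Ok(tts) => tts,
        Err(e) => {
            send_error(&mut socket, "MODEL_LOADING", &e.to_string()).await;
            return;
        }
    };

    // Session loop: parse JSON messages, dispatch to TTS engine
    let (mut sender, mut receiver) = socket.split();
    // Set by Start; the `voice_id` and `language` config fields are assumed names
    let mut session: Option<(String, String)> = None;

    // The loop ends on disconnect or on a transport error
    while let Some(Ok(msg)) = receiver.next().await {
        match parse_client_message(msg) {
            ClientMessage::Start(config) => {
                // Load voice, send Ready
                session = Some((config.voice_id, config.language));
            }
            ClientMessage::Speak(text) => {
                // Ignore Speak until a Start message has configured the session
                let Some((voice_id, language)) = session.as_ref() else { continue };
                // Run inference, stream audio chunks
                let chunks = match tts.engine.generate(&text, voice_id, language).await {
                    Ok(chunks) => chunks,
                    Err(_) => break, // report the error to the client before closing
                };
                for chunk in chunks {
                    // `.into()` accepts either a Vec<u8> or a Bytes payload type
                    if sender.send(Message::Binary(chunk.to_pcm16().into())).await.is_err() {
                        return;
                    }
                }
                if sender.send(complete_message()).await.is_err() {
                    return;
                }
            }
            ClientMessage::Stop => break,
            ClientMessage::Cancel => { /* abort current generation */ }
        }
    }
}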
```
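The handler calls `chunk.to_pcm16()`, which is not defined in this plan. A minimal sketch of the conversion, assuming the wire format is little-endian signed 16-bit PCM (the byte order is an assumption, and `f32_to_pcm16_le` is a hypothetical free function):

```rust
/// Convert f32 samples in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.
/// Out-of-range samples are clamped; NaN is written as silence.
pub fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &s in samples {
        let s = if s.is_nan() { 0.0 } else { s.clamp(-1.0, 1.0) };
        // Scale symmetrically by i16::MAX so +1.0 and -1.0 map to +/-32767.
        let v = (s * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

`AudioChunk::to_pcm16` could simply delegate to this function with `&self.samples`.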

### 5.2 State Changes

Remove from `AppState`:
- `tts_client: Option<Arc<TtsProxyClient>>`
- `tts_config: Option<TtsConfig>`
- `with_tts()` method
- `is_tts_healthy()` method

Add to `AppState`:
- `tts_models: OnceCell<TtsModels>` — lazily loaded TTS engine
- `get_tts_models()` method (async, like `get_ml_models()`)
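The load-once behaviour can be illustrated with the standard library's `OnceLock`, a synchronous analogue of the `OnceCell` field above. The real `get_tts_models()` is async and fallible, so this is only a sketch of the pattern; `TtsSlot`, the `TtsModels` placeholder, and the load counter are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Placeholder for the real model bundle.
pub struct TtsModels {
    pub engine_name: &'static str,
}

/// Hypothetical slice of `AppState` holding only the lazily loaded TTS models.
pub struct TtsSlot {
    models: OnceLock<TtsModels>,
    loads: AtomicUsize, // counts loader invocations, for illustration only
}

impl TtsSlot {
    pub fn new() -> Self {
        Self { models: OnceLock::new(), loads: AtomicUsize::new(0) }
    }

    /// Run the loader on the first call, then return the cached value.
    pub fn get_tts_models(&self) -> &TtsModels {
        self.models.get_or_init(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            TtsModels { engine_name: "qwen3" }
        })
    }

    pub fn load_count(&self) -> usize {
        self.loads.load(Ordering::SeqCst)
    }
}
```

Repeated calls return the same instance and the loader runs once, which is the property the speak handler relies on.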

### 5.3 Voice Assets

```
models/voices/makima/
├── manifest.json       # Voice metadata
├── reference.wav       # 5-15 second reference audio
└── transcript.txt      # Exact transcript of reference audio
```
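Task P2.3 calls for LRU caching of voice prompts loaded from this directory. A minimal std-only sketch is shown below; `VoiceCache` and its methods are hypothetical names, `V` stands in for an encoded voice prompt, and the TTL expiration listed in the testing plan is omitted:

```rust
use std::collections::{HashMap, VecDeque};

/// Tiny LRU cache keyed by voice ID.
pub struct VoiceCache<V> {
    capacity: usize,
    entries: HashMap<String, V>,
    order: VecDeque<String>, // front = least recently used
}

impl<V> VoiceCache<V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `voice_id` to the most-recently-used position.
    fn touch(&mut self, voice_id: &str) {
        self.order.retain(|k| k.as_str() != voice_id);
        self.order.push_back(voice_id.to_string());
    }

    /// Look up a prompt and mark it as recently used.
    pub fn get(&mut self, voice_id: &str) -> Option<&V> {
        if self.entries.contains_key(voice_id) {
            self.touch(voice_id);
        }
        self.entries.get(voice_id)
    }

    /// Insert or replace an entry, evicting the least recently used one when full.
    pub fn insert(&mut self, voice_id: &str, prompt: V) {
        if !self.entries.contains_key(voice_id) && self.entries.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(voice_id.to_string(), prompt);
        self.touch(voice_id);
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The linear scan in `touch` is acceptable for a handful of voices; a larger cache would want a linked-hash-map style structure.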

---

## 6. Phase 3: Optimization + Integration

### 6.1 Streaming Generation

The 12Hz model's causal architecture enables token-by-token waveform generation:
- Each token = ~80ms of audio (12.5 Hz)
- After generating each token, decode immediately and send audio chunk
- Client receives audio before full generation completes
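The per-token figures follow directly from the 12.5 Hz token rate and the 24kHz output rate given in Section 4.4. The helpers below are a hypothetical sketch of that arithmetic, including the real-time condition that each token must be generated in no more time than the audio it produces:

```rust
/// Audio duration covered by one codec token, in milliseconds.
pub fn token_duration_ms(token_rate_hz: f64) -> f64 {
    1000.0 / token_rate_hz
}

/// PCM samples produced per codec token at the given output sample rate.
pub fn samples_per_token(sample_rate_hz: u32, token_rate_hz: f64) -> f64 {
    sample_rate_hz as f64 / token_rate_hz
}

/// True if generating one token takes no longer than the audio it yields,
/// i.e. streaming playback will not underrun.
pub fn keeps_up_with_playback(per_token_compute_ms: f64, token_rate_hz: f64) -> bool {
    per_token_compute_ms <= token_duration_ms(token_rate_hz)
}
```

At 12.5 Hz and 24kHz this gives 80 ms and 1920 samples per token, so sustained per-token compute above 80 ms would cause gaps in playback.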

### 6.2 GPU Memory Optimization

- Load weights in bf16/f16 (candle supports both)
- Implement KV cache with fixed maximum size
- Clear cache between sessions
- CPU fallback when GPU is unavailable
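To choose the fixed maximum cache size, it helps to estimate the KV cache footprint from the backbone dimensions in Section 4.2. The function below is a hypothetical sketch of that estimate for the main LM only (the code predictor's cache is not counted), and the 4096-position limit used in the note is an illustrative value, not one from the spec:

```rust
/// Bytes of KV cache for the main LM: one K and one V tensor
/// for every layer, KV head, and cached position.
pub fn kv_cache_bytes(
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    max_positions: usize,
    bytes_per_element: usize,
) -> usize {
    2 * num_layers * num_kv_heads * head_dim * max_positions * bytes_per_element
}
```

With 28 layers, 8 KV heads, head dim 128, and bf16 (2 bytes), each cached position costs 114,688 bytes (112 KiB), so a 4096-position cache would take 448 MiB per session.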

### 6.3 Listen Page Integration

Following the pattern in `listen.rs`:
- TTS model protected behind `tokio::sync::Mutex`
- WebSocket endpoint emits audio chunks as tokens are generated
- Bidirectional: STT (listen) → process → TTS (speak) loop

---

## 7. Testing Plan

### 7.1 Unit Tests

| Test Area | Coverage | Key Tests |
|-----------|----------|-----------|
| Config parsing | 100% | Load config from JSON, validate fields |
| Model layers | 80% | Attention, MLP, Conv1d shapes |
| Code predictor | 85% | Multi-codebook output shapes |
| Speech tokenizer | 80% | Encode/decode round-trip |
| Voice cache | 95% | LRU eviction, TTL expiration |
| Message parsing | 100% | All client/server message types |

### 7.2 Integration Tests

| Test | Description |
|------|-------------|
| WebSocket flow | Connect → Start → Speak → Audio chunks → Complete → Stop |
| Error handling | Invalid text, unknown voice, model loading failure |
| Cancellation | Cancel mid-generation |
| Voice cloning | Generate with custom reference audio |

### 7.3 Latency Benchmarks

| Metric | Target | Acceptable | Warning |
|--------|--------|------------|---------|
| First Audio (short text) | < 150ms | < 200ms | > 300ms |
| First Audio (medium text) | < 200ms | < 300ms | > 500ms |
| First Audio (long text) | < 300ms | < 500ms | > 800ms |
| Inter-chunk latency | < 30ms | < 50ms | > 100ms |
| GPU memory | < 4GB | < 6GB | > 8GB |

---

## 8. Risk Assessment

### 8.1 Technical Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Candle implementation takes longer** | Medium | Medium | Reference Crane's Spark-TTS; use qwen3-rs as LM reference |
| **Speech tokenizer ConvNet is complex** | Medium | High | Study PyTorch source; ConvNet layers are simpler than transformers |
| **Model quality differs from PyTorch** | Low | High | Validate with reference audio; ensure bf16 precision |
| **Crane ships Qwen3-TTS first** | Medium | Positive | Adopt their implementation or use as reference |
| **GPU memory issues** | Low | Medium | 0.6B model is small (~2.5GB); fits in 4GB VRAM |

### 8.2 Contingency Plans

| Scenario | Response |
|----------|----------|
| Candle implementation blocked | Use Crane crate as dependency if they ship Qwen3-TTS |
| ConvNet decoder too complex | Implement simplified decoder; optimize later |
| Latency exceeds targets | Start with batch mode + chunked delivery (acceptable UX) |
| No GPU available | CPU fallback with candle's MKL support (degraded performance) |

---

## 9. Dependencies & Prerequisites

### 9.1 Rust Dependencies

Add to `Cargo.toml`:

```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
candle-transformers = "0.8"
# Keep existing: tokenizers, hf-hub, safetensors, ndarray
```

### 9.2 Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | CUDA 4GB VRAM / Metal (macOS) | NVIDIA RTX 3060+ (8GB+) |
| RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB |

### 9.3 Voice Asset Prerequisites

Before Phase 2 voice testing:
1. Makima voice reference audio (5-15 seconds, clean speech)
2. Accurate transcript of reference audio
3. Format: WAV 24kHz mono, 16-bit PCM
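These prerequisites can be checked mechanically once the WAV header has been read. The sketch below assumes the sample rate, channel count, and frame count are already available from whatever WAV reader the project uses; `check_reference_audio` and `ReferenceAudioError` are hypothetical names:

```rust
/// Reasons a reference clip fails the prerequisites above.
#[derive(Debug, PartialEq)]
pub enum ReferenceAudioError {
    WrongSampleRate(u32),
    NotMono(u16),
    TooShort(f64),
    TooLong(f64),
}

/// Validate header fields and return the clip duration in seconds.
pub fn check_reference_audio(
    sample_rate: u32,
    channels: u16,
    num_frames: u64,
) -> Result<f64, ReferenceAudioError> {
    if sample_rate != 24_000 {
        return Err(ReferenceAudioError::WrongSampleRate(sample_rate));
    }
    if channels != 1 {
        return Err(ReferenceAudioError::NotMono(channels));
    }
    let seconds = num_frames as f64 / sample_rate as f64;
    if seconds < 5.0 {
        Err(ReferenceAudioError::TooShort(seconds))
    } else if seconds > 15.0 {
        Err(ReferenceAudioError::TooLong(seconds))
    } else {
        Ok(seconds)
    }
}
```

Bit depth and speech cleanliness are not covered here and would need a separate check.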

---

## 10. Success Criteria

### 10.1 Phase 1 Completion

- [ ] candle-based Qwen3 model loads safetensors weights
- [ ] Forward pass produces valid logits
- [ ] Speech tokenizer encodes/decodes audio
- [ ] Code predictor generates 16 codebook layers
- [ ] Unit tests pass with > 80% coverage

### 10.2 Phase 2 Completion

- [ ] `/api/v1/speak` endpoint produces audio from text
- [ ] No Python service required
- [ ] Voice cloning works with reference audio
- [ ] Error handling returns appropriate codes
- [ ] speak.rs calls TTS engine directly (no proxy)

### 10.3 Phase 3 Completion

- [ ] Streaming generation with < 200ms TTFA
- [ ] GPU memory usage < 6GB
- [ ] Integration tests pass
- [ ] Listen page bidirectional speech works
- [ ] Latency benchmarks documented

### 10.4 Final Acceptance Criteria

1. **Functional:** End-to-end TTS streaming via WebSocket, pure Rust, no Python
2. **Performance:** TTFA < 200ms, subsequent chunks < 100ms
3. **Quality:** Synthesized speech is intelligible and recognizable as Makima
4. **Reliability:** Error handling is robust; graceful degradation on GPU failure
5. **Architecture:** Clean `tts/` module with trait-based engine selection

---

## Appendix A: Quick Start Commands

### Development

```bash
# Build with candle GPU support
cd makima
cargo build --features cuda   # or --features metal for macOS

# Run server with TTS enabled
TTS_ENGINE=qwen3 TTS_DEVICE=cuda:0 cargo run

# Run TTS-specific tests
cargo test tts
```

### Benchmarks

```bash
# Run latency benchmarks (requires GPU)
cargo bench --bench tts_latency
```

## Appendix B: Reference Implementations

- [candle-transformers qwen2 model](https://docs.rs/candle-transformers/latest/candle_transformers/models/qwen2/index.html) — base attention/MLP layers
- [qwen3-rs](https://github.com/reinterpretcat/qwen3-rs) — educational Qwen3 in Rust
- [Crane](https://github.com/lucasjinreal/Crane) — Rust TTS engine (Qwen3-TTS on roadmap)
- [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md) — full feasibility analysis