# TTS Research Document: Qwen3-TTS Integration for Makima

## Executive Summary

This document summarizes research into replacing the existing ChatterboxTTS implementation with Qwen3-TTS. The goal is live (streaming) TTS in which a voice cloned from Makima's Japanese voice actress speaks English.

---

## 1. Current TTS Implementation Analysis

### 1.1 Architecture Overview

The current makima codebase uses **ChatterboxTTS** (ResembleAI/chatterbox-turbo-ONNX) with the following components:

| Component | File | Purpose |
|-----------|------|---------|
| TTS Module | `makima/src/tts.rs` | Core TTS inference using ONNX Runtime |
| Audio Processing | `makima/src/audio.rs` | Audio decoding, resampling (Symphonia) |
| Library Export | `makima/src/lib.rs` | Exposes `pub mod tts` |

### 1.2 ChatterboxTTS Technical Details

```rust
// Key constants from tts.rs
pub const SAMPLE_RATE: u32 = 24_000;
const MODEL_ID: &str = "ResembleAI/chatterbox-turbo-ONNX";
const DEFAULT_MODEL_DIR: &str = "models/chatterbox-turbo";
```

**ONNX Model Files:**
- `speech_encoder.onnx` - Encodes reference voice audio
- `embed_tokens.onnx` - Text token embedding
- `language_model.onnx` - Autoregressive token generation (24 layers, 16 KV heads)
- `conditional_decoder.onnx` - Decodes speech tokens to waveform
- `tokenizer.json` - Text tokenization

### 1.3 Current API Surface

```rust
// Signature summary only; method bodies are omitted.
impl ChatterboxTTS {
    pub fn from_pretrained(model_dir: Option<&str>) -> Result<Self, TtsError>;
    // Returns a VoiceRequired error: a reference voice is mandatory.
    pub fn generate_tts(&mut self, text: &str) -> Result<Vec<f32>, TtsError>;
    pub fn generate_tts_with_voice(text: &str, sample_audio_path: &Path) -> Result<Vec<f32>, TtsError>;
    pub fn generate_tts_with_samples(text: &str, samples: &[f32], sample_rate: u32) -> Result<Vec<f32>, TtsError>;
}
pub fn save_wav(samples: &[f32], path: &Path) -> Result<(), TtsError>;
```

### 1.4 Voice Cloning Capabilities (Current)

- **Requires** voice reference audio (returns `VoiceRequired` error without it)
- Accepts reference audio via file path or raw samples
- Resamples reference to 24kHz internally
- Uses speaker embeddings + speaker features for voice cloning
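
The resampling step in the list above can be illustrated with a naive linear-interpolation resampler. This is only a sketch: the codebase's own resampling lives in `audio.rs`, and `resample_linear` and `TARGET_RATE` are hypothetical names, not the method the codebase uses. Linear interpolation has no anti-aliasing filter, so a production path would want a band-limited resampler.

```rust
/// Target rate for reference audio (same value as `SAMPLE_RATE` in tts.rs).
pub const TARGET_RATE: u32 = 24_000;

/// Hypothetical helper: resample mono `input` from `from_rate` to `to_rate`
/// by linear interpolation. Illustrative only; no anti-aliasing is applied.
pub fn resample_linear(input: &[f32], from_rate: u32, to_rate: u32) -> Vec<f32> {
    if input.is_empty() || from_rate == 0 || to_rate == 0 {
        return Vec::new();
    }
    if from_rate == to_rate {
        return input.to_vec();
    }
    // Output length preserves duration: len * to / from, rounded down.
    let out_len = (input.len() as u64 * to_rate as u64 / from_rate as u64) as usize;
    // Distance travelled in the input for each output sample.
    let step = from_rate as f64 / to_rate as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp indices so the final samples repeat the last input value.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

Calling `resample_linear(&clip, 48_000, TARGET_RATE)` on a one-second 48 kHz clip yields 24,000 samples.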

### 1.5 Streaming/Live TTS (Current)

**NOT SUPPORTED** - The current implementation:
- Generates entire audio in one pass
- Uses autoregressive token generation with max 1024 tokens
- No chunked/streaming output capability
- Full pipeline must complete before audio is available

### 1.6 Server Integration Status

The TTS module is **not currently exposed via HTTP endpoints**. The server (`makima/src/server/mod.rs`) has:
- `/api/v1/listen` - WebSocket for Speech-to-Text (STT) only
- No TTS endpoints exist

---

## 2. Qwen3-TTS Model Analysis

### 2.1 Model Specifications

| Attribute | Value |
|-----------|-------|
| **Model** | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| **Parameters** | 0.6B (also available: 1.7B version) |
| **Architecture** | Discrete multi-codebook LM (16 codebooks, 2048 entries each) |
| **Tokenizer** | Qwen3-TTS-Tokenizer-12Hz |
| **Token Rate** | 12 Hz (codec frames per second; audio is reconstructed at standard sample rates) |
| **License** | Apache 2.0 |
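
To make the table's numbers concrete, the sketch below works out how many codec frames and discrete codes a clip of a given length implies. It assumes that "12 Hz" means exactly 12 codec frames per second and that every frame carries one code from each of the 16 codebooks; both are readings of the table, not confirmed details. The function names are illustrative.

```rust
/// Tokenizer frame rate, read from the table above (assumed exact).
pub const FRAMES_PER_SECOND: u32 = 12;
/// Codebooks per frame, from the table above.
pub const CODEBOOKS: u32 = 16;
/// Entries per codebook, from the table above.
pub const CODEBOOK_SIZE: u32 = 2048;

/// Codec frames needed to cover `millis` of audio, rounded up.
pub fn frames_for_millis(millis: u64) -> u64 {
    (millis * FRAMES_PER_SECOND as u64 + 999) / 1000
}

/// Total discrete codes for `millis` of audio (one per codebook per frame).
pub fn codes_for_millis(millis: u64) -> u64 {
    frames_for_millis(millis) * CODEBOOKS as u64
}

/// Bits needed to index one codebook entry (log2 of a power-of-two size).
pub fn bits_per_code() -> u32 {
    CODEBOOK_SIZE.trailing_zeros()
}
```

Under these assumptions, ten seconds of speech is 120 frames, or 1,920 codes of 11 bits each.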

### 2.2 Key Capabilities

#### Voice Cloning
- **3-second rapid voice clone** - Minimal reference audio needed
- **Flexible input formats**: local files, URLs, base64, `(numpy_array, sample_rate)` tuples
- **Reusable voice prompts**: Create once, use for multiple generations

```python
# Voice cloning example
from qwen_tts import Qwen3TTSModel  # import path assumed for the qwen-tts package

model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-0.6B-Base")
wavs, sr = model.generate_voice_clone(
    text="Target text",
    language="English",
    ref_audio="reference.wav",
    ref_text="Reference transcript"
)
```

#### Streaming/Live TTS
- **97 ms reported end-to-end latency** - Low enough for conversational streaming
- **Dual-track hybrid streaming architecture** - One model supports both streaming and non-streaming generation
- **First packet after a single character** - Audio output can begin as soon as one character of input text has arrived

#### Multilingual Support
10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

#### Quality Metrics
| Metric | Value |
|--------|-------|
| WER (English) | 1.32 |
| Speaker Similarity (English) | 0.829 |
| PESQ_WB | 3.21 |
| STOI | 0.96 |
| UTMOS | 4.16 |

### 2.3 Requirements

```bash
# Environment (prerequisites, not commands)
#   Python 3.12 (recommended)
#   CUDA-compatible GPU with 8GB+ VRAM

# Installation
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # Optional, reduces GPU memory
```

### 2.4 Current Deployment Options

| Option | Status | Notes |
|--------|--------|-------|
| **Python (qwen-tts)** | Stable | Official package |
| **vLLM-Omni** | Offline only | Online serving coming |
| **ONNX Export** | Not available | No official support |
| **Rust Implementation** | Draft PR #8 | Early development |
| **DashScope API** | Available | Alibaba Cloud hosted |

---

## 3. Makima Voice Audio Clips

### 3.1 Voice Actress Information

| Attribute | Value |
|-----------|-------|
| **Character** | Makima (Chainsaw Man) |
| **Japanese VA** | Tomori Kusunoki (楠木ともり) |
| **English VA** | Suzie Yeung |
| **Agency** | Sony Music Artists |

### 3.2 Audio Clip Sources

1. **Chainsaw Man Anime Episodes** - Primary source for Japanese voice
2. **Behind The Voice Actors** - Character voice samples
3. **YouTube Clips** - Interview compilations, scene clips
4. **Official Media** - Promotional videos, trailers

### 3.3 Audio Requirements for Voice Cloning

#### Qwen3-TTS Requirements
| Parameter | Requirement |
|-----------|-------------|
| **Minimum Duration** | 3 seconds (basic quality) |
| **Recommended Duration** | 10-30 seconds (professional quality) |
| **Format** | WAV, FLAC, MP3, OGG, AIFF, AAC |
| **Sample Rate** | 24kHz or above recommended |
| **Channels** | Mono preferred |
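
The requirements in the table can be turned into a quick pre-flight check on candidate clips. The sketch below assumes the clip's header fields have already been read elsewhere (for example by the existing audio decoding code). `ClipInfo` and `check_clip` are hypothetical names, and the thresholds are copied from the table.

```rust
/// Header facts about a candidate reference clip (hypothetical type).
pub struct ClipInfo {
    pub sample_rate: u32,
    pub channels: u16,
    /// Number of sample frames (samples per channel).
    pub frames: u64,
}

impl ClipInfo {
    pub fn duration_secs(&self) -> f64 {
        self.frames as f64 / self.sample_rate as f64
    }
}

/// Returns `Err` if the clip cannot be used at all, otherwise a list of
/// warnings for every recommendation in the table that it misses.
pub fn check_clip(clip: &ClipInfo) -> Result<Vec<&'static str>, &'static str> {
    if clip.sample_rate == 0 || clip.channels == 0 {
        return Err("invalid header");
    }
    let secs = clip.duration_secs();
    if secs < 3.0 {
        return Err("shorter than the 3-second minimum");
    }
    let mut warnings = Vec::new();
    if !(10.0..=30.0).contains(&secs) {
        warnings.push("outside the recommended 10-30 second range");
    }
    if clip.sample_rate < 24_000 {
        warnings.push("sample rate below 24 kHz");
    }
    if clip.channels != 1 {
        warnings.push("not mono");
    }
    Ok(warnings)
}
```

A 15-second mono clip at 24 kHz passes with no warnings, while a 2-second clip is rejected outright.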

#### Best Practices
- **Clean audio**: No background music/noise
- **Single speaker**: Makima's voice only
- **Consistent tone**: Avoid dramatic variations
- **Include transcript**: Reference text improves quality
- **Varied content**: Mix of sentence types for flexibility

#### Recommended Clip Types
1. Calm, composed dialogue (Makima's signature tone)
2. Commands/instructions (authoritative delivery)
3. Questions (natural intonation)
4. Longer monologues (for voice consistency)

---

## 4. Feasibility Assessment for Live/Streaming TTS

### 4.1 Technical Challenges

| Challenge | Severity | Notes |
|-----------|----------|-------|
| No ONNX export | **High** | Current codebase uses ONNX Runtime |
| Rust implementation | **High** | Only draft PR available |
| Python dependency | Medium | Would require sidecar service |
| GPU memory | Medium | 8GB+ VRAM required |
| Streaming API | Low | Supported in Qwen3-TTS |

### 4.2 Integration Approaches

#### Option A: Python Sidecar Service (Recommended)
**Architecture**: Rust server + Python TTS service via HTTP/gRPC

**Pros:**
- Uses official Qwen3-TTS Python package
- Full streaming support (97ms latency)
- Simpler maintenance

**Cons:**
- Additional deployment complexity
- Inter-process communication overhead

```
┌─────────────────┐     HTTP/gRPC     ┌──────────────────┐
│  Makima Server  │ ◄───────────────► │  Qwen3-TTS       │
│  (Rust/Axum)    │                   │  (Python/FastAPI)│
└─────────────────┘                   └──────────────────┘
```

**Available Implementations:**
- [ValyrianTech/Qwen3-TTS_server](https://github.com/ValyrianTech/Qwen3-TTS_server) - FastAPI server
- [Qwen3-TTS-Openai-Fastapi](https://github.com/twolven/Qwen3-TTS-Openai-Fastapi) - OpenAI-compatible API

#### Option B: Wait for Rust Implementation
**Status**: Draft PR #8 in early development

**Pros:**
- Native Rust integration
- No Python dependency
- Matches current architecture

**Cons:**
- Unknown timeline
- May require significant adaptation

#### Option C: Hybrid (ChatterboxTTS + Qwen3-TTS)
Keep ChatterboxTTS for ONNX compatibility, add Qwen3-TTS for streaming

**Pros:**
- Gradual migration
- Fallback capability

**Cons:**
- Dual model maintenance
- Increased complexity
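
The fallback capability listed under Pros could be wired roughly as follows. This is a sketch under assumptions: `SpeechEngine`, `FallbackTts`, and `StubEngine` are hypothetical names, and neither the real `ChatterboxTTS` type nor a Qwen3-TTS client is shown implementing the trait.

```rust
/// Hypothetical common interface over both engines.
pub trait SpeechEngine {
    fn name(&self) -> &str;
    fn synthesize(&mut self, text: &str) -> Result<Vec<f32>, String>;
}

/// Tries `primary` first and falls back to `secondary` on any error.
pub struct FallbackTts<P: SpeechEngine, S: SpeechEngine> {
    pub primary: P,
    pub secondary: S,
}

impl<P: SpeechEngine, S: SpeechEngine> FallbackTts<P, S> {
    /// Returns the samples plus the name of the engine that produced them.
    pub fn synthesize(&mut self, text: &str) -> Result<(Vec<f32>, String), String> {
        match self.primary.synthesize(text) {
            Ok(samples) => Ok((samples, self.primary.name().to_string())),
            Err(primary_err) => {
                // Report both failures if the fallback also fails.
                let samples = self
                    .secondary
                    .synthesize(text)
                    .map_err(|e| format!("{primary_err}; {e}"))?;
                Ok((samples, self.secondary.name().to_string()))
            }
        }
    }
}

/// Stand-in engine for demonstration: succeeds or fails on demand.
pub struct StubEngine {
    pub label: &'static str,
    pub healthy: bool,
}

impl SpeechEngine for StubEngine {
    fn name(&self) -> &str {
        self.label
    }
    fn synthesize(&mut self, text: &str) -> Result<Vec<f32>, String> {
        if self.healthy {
            // One silent sample per input byte, just to return something.
            Ok(vec![0.0; text.len()])
        } else {
            Err(format!("{} unavailable", self.label))
        }
    }
}
```

Returning the engine name alongside the samples lets the caller log when the fallback path was taken, which helps in judging whether dual maintenance is paying off.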

### 4.3 Recommendation

**Short-term (1-2 weeks)**: Implement **Option A** with Python sidecar
- Deploy ValyrianTech/Qwen3-TTS_server or similar
- Add HTTP client in Rust to call TTS service
- Implement WebSocket endpoint for streaming audio

**Long-term (3-6 months)**: Monitor Rust implementation progress
- Evaluate draft PR #8 stability
- Consider contributing to Rust port
- Migrate to native Rust when mature

---

## 5. Preliminary Technical Approach

### 5.1 Phase 1: Voice Preparation

1. **Collect Makima Audio Clips**
   - Extract 3-5 clean clips from anime (10-30 seconds each)
   - Ensure Japanese voice, clear audio, no BGM
   - Prepare transcripts for each clip

2. **Test Voice Cloning Quality**
   - Use Qwen3-TTS demo to validate clips
   - Iterate on clip selection for best results

### 5.2 Phase 2: TTS Service Setup

1. **Deploy Qwen3-TTS Server**
   ```bash
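   # Using the ValyrianTech server. The image name below is a placeholder;
   # build or pull the image as described in that repository's README.
   docker run --gpus all -p 7860:7860 qwen3-tts-server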
   ```

2. **Configure Voice Clone Profile**
   - Upload Makima reference audio
   - Store voice clone prompt for reuse

### 5.3 Phase 3: Makima Integration

1. **Add TTS Client Module**
   ```rust
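   // New module: makima/src/tts_client.rs
   // Proposed signatures only: bodies, the `Error` type, and the `Stream`
   // import (for example from the `futures` crate) are still to be decided.
   pub struct QwenTTSClient {
       base_url: String,
       voice_profile: String,
   }

   impl QwenTTSClient {
       pub async fn generate_speech(&self, text: &str) -> Result<Vec<u8>, Error>;
       pub async fn generate_speech_streaming(&self, text: &str) -> impl Stream<Item = Vec<u8>>;
   }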
   ```

2. **Add TTS Endpoint**
   ```rust
   // In makima/src/server/mod.rs
   .route("/api/v1/tts", post(tts_handler))
   .route("/api/v1/tts/stream", get(tts_streaming_handler))
   ```

3. **WebSocket Integration for Listen Page**
   - Bidirectional audio: STT input, TTS output
   - Low-latency streaming for conversational flow
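
For the streaming path, audio has to be cut into small frames before it is pushed over the WebSocket. The sketch below assumes the wire format is raw 16-bit little-endian mono PCM. That is a common choice for browser playback, but the document does not fix it. `f32_to_pcm16_le` and `pcm_chunks` are hypothetical helpers.

```rust
/// Convert f32 samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.
/// Out-of-range samples are clamped rather than wrapped.
pub fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &s in samples {
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Split audio into frames of `chunk_ms` milliseconds, each encoded as PCM16.
/// The final frame may be shorter than the others.
pub fn pcm_chunks(samples: &[f32], sample_rate: u32, chunk_ms: u32) -> Vec<Vec<u8>> {
    // Samples per frame; at least 1 so `chunks` never receives zero.
    let per_chunk = ((sample_rate as u64 * chunk_ms as u64) / 1000).max(1) as usize;
    samples.chunks(per_chunk).map(f32_to_pcm16_le).collect()
}
```

At 24 kHz, a 40 ms frame holds 960 samples (1,920 bytes), so one second of audio becomes 25 binary messages.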

### 5.4 Phase 4: Listen Page Integration

1. **Update Frontend**
   - Add TTS playback capability
   - Handle streaming audio chunks
   - UI for voice response indicators

2. **Orchestration Logic**
   - STT → LLM → TTS pipeline
   - Interrupt handling for user speech
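
One piece of the orchestration logic is deciding when streamed LLM text is ready to hand to TTS, and what to discard when the user interrupts. The sketch below buffers text fragments and releases whole sentences. `SentenceBuffer` is a hypothetical type, and the sentence splitting is deliberately naive: it breaks on `.`, `!`, and `?` only, so abbreviations and decimals would be split incorrectly.

```rust
/// Accumulates streamed LLM text and releases complete sentences for TTS.
#[derive(Default)]
pub struct SentenceBuffer {
    pending: String,
}

impl SentenceBuffer {
    /// Append a text fragment and return any sentences it completed.
    pub fn push(&mut self, fragment: &str) -> Vec<String> {
        self.pending.push_str(fragment);
        let mut done = Vec::new();
        while let Some(end) = self.pending.find(|c: char| matches!(c, '.' | '!' | '?')) {
            // The terminators are single-byte, so `end + 1` is a char boundary.
            let rest = self.pending.split_off(end + 1);
            let sentence = std::mem::replace(&mut self.pending, rest);
            let sentence = sentence.trim();
            // Skip punctuation-only pieces such as the tail of "...".
            if sentence.chars().any(char::is_alphanumeric) {
                done.push(sentence.to_string());
            }
        }
        done
    }

    /// Return whatever text is left once the LLM stream ends.
    pub fn flush(&mut self) -> Option<String> {
        let rest = std::mem::take(&mut self.pending);
        let rest = rest.trim();
        if rest.is_empty() {
            None
        } else {
            Some(rest.to_string())
        }
    }

    /// Drop unspoken text, for example when the user starts talking.
    pub fn interrupt(&mut self) {
        self.pending.clear();
    }
}
```

A full interrupt path would also need to cancel any in-flight TTS request and stop playback on the client; this sketch covers only the text that has not yet been sent.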

---

## 6. Open Questions

1. **Voice Rights**: Are there legal considerations for cloning Tomori Kusunoki's voice?
2. **GPU Allocation**: Shared GPU for STT + TTS, or separate?
3. **Latency Budget**: What's acceptable end-to-end latency for Listen page?
4. **Fallback Strategy**: What happens if TTS service is unavailable?
5. **Multi-user**: How to handle concurrent TTS requests?
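
For the latency-budget question, one way to frame the discussion is to sum per-stage estimates and compare the total with a target. The sketch below is only bookkeeping: `Stage`, `total_latency_ms`, and `headroom_ms` are hypothetical names, and any numbers fed in are placeholders until measured.

```rust
/// One stage of the voice pipeline with a latency estimate in milliseconds.
pub struct Stage {
    pub name: &'static str,
    pub millis: u32,
}

/// Sum of stage latencies, i.e. time from end of user speech to first audio.
pub fn total_latency_ms(stages: &[Stage]) -> u32 {
    stages.iter().map(|s| s.millis).sum()
}

/// Remaining headroom under `budget_ms`, or `None` if the pipeline is over.
pub fn headroom_ms(stages: &[Stage], budget_ms: u32) -> Option<u32> {
    budget_ms.checked_sub(total_latency_ms(stages))
}

/// The stage contributing the most latency, if any stages are given.
pub fn slowest(stages: &[Stage]) -> Option<&Stage> {
    stages.iter().max_by_key(|s| s.millis)
}
```

Of the inputs, only the TTS figure (the 97 ms latency reported for Qwen3-TTS) appears in this document; STT, LLM, and network estimates would have to come from measurement.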

---

## 7. References

- [Qwen3-TTS HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report](https://arxiv.org/abs/2601.15621)
- [ValyrianTech Qwen3-TTS Server](https://github.com/ValyrianTech/Qwen3-TTS_server)
- [Qwen3-TTS OpenAI-Compatible FastAPI](https://github.com/twolven/Qwen3-TTS-Openai-Fastapi)
- [Makima Voice Actors](https://www.behindthevoiceactors.com/tv-shows/Chainsaw-Man/Makima/)
- [ChatterboxTTS Audio Guidelines](https://github.com/resemble-ai/chatterbox/issues/39)
- [Voice Cloning Best Practices - Resemble AI](https://www.resemble.ai/script-to-read-for-voice-cloning-guidelines/)
- [Qwen3-rs (Rust LLM implementation)](https://github.com/reinterpretcat/qwen3-rs)

---

*Document created: Research phase for Makima TTS replacement contract*