# TTS Replacement Research Notes

## Executive Summary

This document summarizes research on replacing the existing TTS implementation in makima with Qwen3-TTS-12Hz-0.6B-Base. The goals are voice cloning that uses Makima's Japanese voice to speak English, and near-live (streaming) TTS.

---

## 1. Current TTS Implementation Analysis

### 1.1 Current Model: Chatterbox-Turbo

The existing TTS implementation in `makima/src/tts.rs` uses **ResembleAI/chatterbox-turbo-ONNX**:

- **Architecture**: 4-component ONNX model pipeline
  - `speech_encoder.onnx` - Encodes reference voice audio
  - `embed_tokens.onnx` - Token embedding layer
  - `language_model.onnx` - Autoregressive language model (24 layers, 16 KV heads, 64 head dim)
  - `conditional_decoder.onnx` - Decodes speech tokens to audio waveform

- **Sample Rate**: 24,000 Hz output
- **Voice Cloning**: Required (no default voice support)
- **Special Tokens**:
  - START_SPEECH_TOKEN: 6561
  - STOP_SPEECH_TOKEN: 6562
  - SILENCE_TOKEN: 4299

### 1.2 Current API Surface

**Core TTS Functions:**
```rust
pub fn generate_tts(&mut self, _text: &str) -> Result<Vec<f32>, TtsError>
    // Returns VoiceRequired error - voice cloning is mandatory

pub fn generate_tts_with_voice(&mut self, text: &str, sample_audio_path: &Path) -> Result<Vec<f32>, TtsError>
    // Voice cloning from file path

pub fn generate_tts_with_samples(&mut self, text: &str, samples: &[f32], sample_rate: u32) -> Result<Vec<f32>, TtsError>
    // Voice cloning from raw samples
```

**Audio Processing:**
- Input audio resampled to 24kHz
- Reference voice encoded into:
  - `audio_features` - Acoustic features
  - `prompt_tokens` - Token representation
  - `speaker_embeddings` - Speaker identity
  - `speaker_features` - Voice characteristics

### 1.3 Current Limitations

1. **No Streaming Support**: Current implementation generates complete audio before returning
2. **No Default Voice**: Requires voice reference audio for every call
3. **No HTTP Endpoint**: TTS is only available as a Rust library, not exposed via REST API
4. **Single Language**: Optimized for English; multilingual support is unclear
5. **High Latency**: Full autoregressive generation before any audio output

### 1.4 Related Components

**Audio Processing (`makima/src/audio.rs`):**
- Uses Symphonia for audio decoding (MP3, WAV, FLAC, OGG, etc.)
- Resampling via sinc interpolation
- Stereo to mono mixdown
- Target: 16kHz mono for STT
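
The mixdown and resampling steps listed above can be illustrated with a small sketch. This is not the code in `audio.rs`: it is written in Python for brevity, the function names are hypothetical, and it uses plain linear interpolation in place of sinc interpolation (with no anti-aliasing filter), so it only shows the shape of the transformation.

```python
def mix_to_mono(interleaved: list[float], channels: int) -> list[float]:
    """Average interleaved multi-channel samples into one mono channel."""
    frames = len(interleaved) // channels
    return [
        sum(interleaved[i * channels:(i + 1) * channels]) / channels
        for i in range(frames)
    ]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation (simpler and lower quality than sinc)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = len(samples) * dst_rate // src_rate
    out = []
    for n in range(out_len):
        pos = n * src_rate / dst_rate  # fractional index into the source
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]  # clamp at the last sample
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```

For example, `resample_linear(mix_to_mono(stereo, 2), 48000, 16000)` would turn interleaved 48kHz stereo samples into the 16kHz mono stream that STT expects.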

**Listen Endpoint (`makima/src/server/handlers/listen.rs`):**
- WebSocket-based streaming STT
- Uses Parakeet for transcription
- Sortformer for speaker diarization
- Already has real-time audio streaming infrastructure

---

## 2. Qwen3-TTS-12Hz-0.6B-Base Model Analysis

### 2.1 Model Capabilities

| Feature | Specification |
|---------|---------------|
| **Model Size** | 0.6B parameters (lightweight variant) |
| **Voice Cloning** | Requires as little as 3 seconds of reference audio |
| **Streaming** | Dual-track hybrid architecture |
| **Minimum Latency** | 97ms end-to-end |
| **Languages** | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |
| **Cross-lingual Cloning** | Japanese voice to English speech supported |
| **Speaker Similarity** | 0.95 (near human-level) |
| **Output Sample Rate** | Up to 48kHz (standard 24kHz) |

### 2.2 Voice Cloning Requirements

**Reference Audio:**
- **Minimum Duration**: 3 seconds
- **Recommended Duration**: 5-15 seconds
- **Format**: WAV preferred; also supports URL, base64, numpy array
- **Quality**: Clean, noise-free audio essential
- **Transcript**: Providing `ref_text` significantly improves quality
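
The duration guidance above is easy to check before a clip is used as a reference. The following is a minimal sketch using only the standard-library `wave` module; the function name and the verdict strings are hypothetical, and it assumes the clip has already been converted to an uncompressed PCM WAV file.

```python
import wave

MIN_SECONDS = 3.0           # minimum duration listed above
RECOMMENDED = (5.0, 15.0)   # recommended range listed above


def check_reference_wav(path: str) -> tuple[float, str]:
    """Return (duration_seconds, verdict) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration < MIN_SECONDS:
        return duration, "too_short"
    if RECOMMENDED[0] <= duration <= RECOMMENDED[1]:
        return duration, "recommended"
    return duration, "usable"
```

This only checks length. Noise level and transcript accuracy still need to be reviewed by ear.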

**Cross-Lingual Usage (Japanese to English):**
```python
ref_audio = "makima_japanese.wav"  # Japanese reference
ref_text = "日本語のテキスト"      # Japanese transcription

wavs, sr = model.generate_voice_clone(
    text="This is English text",   # English output
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
```

### 2.3 Technical Requirements

**Python Dependencies:**
```bash
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # For optimal performance
```

**Hardware:**
- CUDA-compatible GPU required
- FlashAttention 2 for optimal memory usage
- Float16/bfloat16 precision support
- On machines with less than 96GB of RAM, set `MAX_JOBS=4` when installing flash-attn to limit parallel compilation jobs

**Model Loading:**
```python
from qwen_tts import Qwen3TTSModel
import torch

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

### 2.4 Streaming Architecture

**Dual-Track Hybrid Design:**
- Single model supports both streaming and non-streaming
- Audio output begins after minimal text input
- 97ms minimum latency achieved through:
  - Proprietary Qwen3-TTS-Tokenizer-12Hz (efficient acoustic compression)
  - Discrete multi-codebook LM (eliminates LM+DiT bottleneck)
  - Lightweight non-DiT vocoder

**Reusable Voice Clone Prompt (Critical for Performance):**
```python
# Pre-compute once
prompt_items = model.create_voice_clone_prompt(
    ref_audio=ref_audio,
    ref_text=ref_text,
    x_vector_only_mode=False
)

# Reuse for multiple generations
wavs, sr = model.generate_voice_clone(
    text=["Line 1", "Line 2"],
    language=["English", "English"],
    voice_clone_prompt=prompt_items,  # Cached prompt
)
```
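
If more than one reference voice is ever needed, the "pre-compute once" idea can be extended to a small cache. The sketch below is an assumption about how that might look, not part of the Qwen3-TTS API: `VoicePromptCache` and the `build_prompt` callable are hypothetical names, and in the service `build_prompt` would wrap `model.create_voice_clone_prompt(...)`.

```python
import hashlib
from typing import Any, Callable


class VoicePromptCache:
    """Caches voice clone prompts keyed by reference audio and transcript."""

    def __init__(self, build_prompt: Callable[[bytes, str], Any]):
        # build_prompt is a hypothetical adapter around the model call.
        self._build_prompt = build_prompt
        self._prompts: dict[str, Any] = {}

    def get(self, ref_audio: bytes, ref_text: str) -> Any:
        # Key on the content itself so a renamed file still hits the cache.
        digest = hashlib.sha256()
        digest.update(ref_audio)
        digest.update(b"\x00")
        digest.update(ref_text.encode("utf-8"))
        key = digest.hexdigest()
        if key not in self._prompts:
            self._prompts[key] = self._build_prompt(ref_audio, ref_text)
        return self._prompts[key]
```

Keying on a content hash keeps the expensive prompt computation to one call per distinct reference clip and transcript pair.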

---

## 3. Makima Voice Audio Sources

### 3.1 Character Information

- **Character**: Makima from Chainsaw Man anime
- **Japanese Voice Actress**: Tomori Kusunoki (楠木ともり)
- **English Voice Actress**: Suzie Yeung

### 3.2 Potential Audio Sources

| Source | Type | Notes |
|--------|------|-------|
| **Voicy Network Soundboard** | Official clips | MP3 download available, 20+ sound effects |
| **101Soundboards** | Fan-curated clips | Various character sounds |
| **Anime Episodes** | Source material | Highest quality, requires extraction |
| **Nikke: Goddess of Victory** | Game voicelines | Same voice actress (Tomori Kusunoki) |
| **Ko-fi (erusha)** | WAV files | 5 character impression audio files |

### 3.3 Recommended Approach

1. **Primary Source**: Extract 5-15 seconds of clean dialogue from Chainsaw Man anime (Japanese audio track)
2. **Selection Criteria**:
   - Clear, isolated dialogue (no background music/effects)
   - Natural speaking tone (not shouting/whispering)
   - Variety of phonemes for better cloning
3. **Transcription**: Provide accurate Japanese transcription for `ref_text`
4. **Processing**: Convert to WAV format, ensure clean audio quality

### 3.4 Legal Considerations

- Voice cloning of real voice actors for commercial use may have legal implications
- Synthetic voice generation based on copyrighted characters may require licenses
- Consider restricting use to internal/personal projects, or adding a disclaimer

---

## 4. Feasibility Assessment

### 4.1 Live/Streaming TTS Feasibility: HIGHLY FEASIBLE

**Evidence:**
- Qwen3-TTS reports a minimum latency of 97ms, well under a 200ms real-time threshold
- Existing WebSocket infrastructure in makima (`/api/v1/listen`) can be adapted
- Streaming architecture designed for interactive scenarios

**Implementation Approach:**
1. Create a new WebSocket endpoint `/api/v1/speak` mirroring the listen endpoint
2. Pre-compute voice clone prompt on connection start
3. Stream audio chunks as they're generated
4. Use chunked audio encoding (similar to listen's binary message handling)
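
Steps 3 and 4 can be sketched as follows, assuming the generated audio is available as float samples in the range [-1.0, 1.0] and that chunks are sent as mono 16-bit PCM. The function names, the 20ms chunk size, and the little-endian byte order are illustrative choices, not something the source specifies.

```python
import struct
from typing import Iterator, Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_chunks(pcm: bytes, sample_rate: int = 24000, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration chunks of mono PCM16; the last one may be shorter."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

At 24kHz, a 20ms chunk is 480 samples, or 960 bytes. Smaller chunks lower the delay before playback can start, at the cost of more WebSocket messages.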

### 4.2 Voice Cloning with Japanese Voice: FULLY SUPPORTED

**Evidence:**
- Qwen3-TTS explicitly supports cross-lingual voice cloning
- Japanese is one of 10 supported languages
- 0.95 speaker similarity maintained across languages

**Implementation Approach:**
1. Pre-process Makima voice clips (5-15 seconds Japanese audio)
2. Include Japanese transcription
3. Generate English speech while preserving voice characteristics

### 4.3 Integration Challenges

| Challenge | Difficulty | Mitigation |
|-----------|-----------|------------|
| **Python to Rust Integration** | Medium | Use Python subprocess or microservice |
| **GPU Memory** | Low | 0.6B model is lightweight |
| **Latency Target** | Low | 97ms base latency is excellent |
| **Audio Format Conversion** | Low | Existing symphonia infrastructure |
| **Default Voice Setup** | Low | One-time voice prompt caching |

### 4.4 Architecture Options

**Option A: Python Microservice**
```
[Makima Rust Server] --HTTP/WebSocket--> [Python TTS Service]
                                                |
                                         [Qwen3-TTS Model]
```
- Pros: Clean separation, easy Python integration
- Cons: Network overhead, deployment complexity

**Option B: PyO3 Rust Bindings**
```
[Makima Rust Server] --FFI--> [pyo3 Python Bindings] --> [Qwen3-TTS]
```
- Pros: Single process, lower latency
- Cons: Complex build, Python GIL issues

**Option C: ONNX Export (Like Current Chatterbox)**
```
[Makima Rust Server] --ort--> [Qwen3-TTS ONNX Models]
```
- Pros: Pure Rust, consistent with existing architecture
- Cons: An ONNX export of Qwen3-TTS may not be available

**Recommended: Option A (Python Microservice)**
- Fastest time to implementation
- Aligns with Qwen3-TTS's native Python API
- Can use WebSocket for streaming audio chunks
- Easy to deploy alongside existing makima server

---

## 5. Preliminary Technical Approach

### 5.1 Phase 1: Python TTS Microservice

```python
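# tts_service.py
import asyncio

import numpy as np
import torch
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from qwen_tts import Qwen3TTSModel

app = FastAPI()
model = None
voice_prompt = None

@app.on_event("startup")
async def load_model():
    global model, voice_prompt
    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
    )
    # Pre-load Makima voice once so every request reuses the cached prompt
    voice_prompt = model.create_voice_clone_prompt(
        ref_audio="makima_voice.wav",
        ref_text="日本語の台詞...",
    )

@app.websocket("/ws/speak")
async def speak(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            text = await websocket.receive_text()
            # Run the blocking generation call off the event loop
            wavs, sr = await asyncio.to_thread(
                model.generate_voice_clone,
                text=text,
                language="English",
                voice_clone_prompt=voice_prompt,
            )
            # Proof of concept: each utterance is sent as one binary message
            # (not yet chunked). Assumes wavs[0] is a float NumPy array in
            # [-1, 1]; it is converted to 16-bit PCM before sending.
            pcm16 = (np.clip(wavs[0], -1.0, 1.0) * 32767).astype(np.int16)
            await websocket.send_bytes(pcm16.tobytes())
    except WebSocketDisconnect:
        pass  # Client closed the connection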
```

### 5.2 Phase 2: Rust Integration

```rust
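// makima/src/server/handlers/speak.rs
// Imports assume axum, as the extractor types suggest; `SharedState` is makima's own type.
use axum::extract::ws::{WebSocket, WebSocketUpgrade};
use axum::{extract::State, response::Response};

pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(|socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(socket: WebSocket, state: SharedState) {
    // Connect to Python TTS service. This function returns `()`, so a failed
    // connection is handled here instead of being propagated with `?`.
    let tts_ws = match tokio_tungstenite::connect_async("ws://localhost:8001/ws/speak").await {
        Ok((stream, _response)) => stream,
        Err(err) => {
            eprintln!("TTS service connection failed: {err}");
            return;
        }
    };

    // Forward text to TTS, stream audio back to client
    // ...
}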
```

### 5.3 API Design

**WebSocket Endpoint: `/api/v1/speak`**

**Client to Server Messages:**
```json
{
    "type": "start",
    "sample_rate": 24000,
    "encoding": "pcm16"
}

{
    "type": "speak",
    "text": "Hello, I am Makima."
}

{
    "type": "stop"
}
```

**Server to Client Messages:**
```json
{
    "type": "ready",
    "session_id": "uuid"
}

{
    "type": "audio_chunk",
    "data": "<base64-encoded-audio>",
    "is_final": false
}

{
    "type": "complete"
}
```
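
A minimal sketch of how the Python service might validate and build these messages is shown below. The helper names are hypothetical, and the validation rules (for example, requiring `text` to be a string on `speak`) are assumptions layered on top of the message shapes above.

```python
import base64
import json

CLIENT_TYPES = {"start", "speak", "stop"}


def parse_client_message(raw: str) -> dict:
    """Parse and validate one client-to-server JSON message."""
    msg = json.loads(raw)
    kind = msg.get("type") if isinstance(msg, dict) else None
    if not isinstance(kind, str) or kind not in CLIENT_TYPES:
        raise ValueError("unknown or missing message type")
    if kind == "speak" and not isinstance(msg.get("text"), str):
        raise ValueError("'speak' requires a string 'text' field")
    return msg


def audio_chunk_message(pcm: bytes, is_final: bool = False) -> str:
    """Build an 'audio_chunk' message with base64-encoded audio."""
    return json.dumps({
        "type": "audio_chunk",
        "data": base64.b64encode(pcm).decode("ascii"),
        "is_final": is_final,
    })
```

Base64 inside JSON adds roughly a third to the payload size. Sending audio as binary WebSocket frames, as the listen endpoint does for incoming audio, would avoid that overhead if it becomes a concern.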

---

## 6. Next Steps

### Immediate Actions
1. [ ] Obtain Makima voice clips (5-15 seconds clean Japanese audio)
2. [ ] Create Japanese transcription of voice clips
3. [ ] Test Qwen3-TTS voice cloning with Makima samples
4. [ ] Benchmark latency on target hardware
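
For the latency benchmark, the number that matters most for streaming is the time until the first audio chunk arrives, not the total generation time. The harness below is a sketch under that assumption; `measure_stream` is a hypothetical helper, and `make_stream` stands in for whatever call ends up producing audio chunks (for example, a WebSocket client loop).

```python
import time
from typing import Callable, Iterable


def measure_stream(make_stream: Callable[[], Iterable[bytes]]) -> dict:
    """Time a chunk stream: first-chunk latency, total time, and chunk count."""
    start = time.perf_counter()
    first = None
    count = 0
    for _chunk in make_stream():
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"first_chunk_s": first, "total_s": total, "chunks": count}
```

Running this several times per input length and comparing `first_chunk_s` against the reported 97ms figure would show how close the target hardware gets in practice.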

### Development Phases
1. **Phase 1**: Python TTS microservice proof-of-concept
2. **Phase 2**: WebSocket streaming integration
3. **Phase 3**: Rust proxy endpoint in makima
4. **Phase 4**: Listen page integration for bidirectional speech

### Hardware Requirements
- CUDA-compatible GPU (minimum)
- Recommended: 8GB+ VRAM for 0.6B model with FlashAttention 2
- Python 3.12+ environment

---

## References

- [Qwen3-TTS-12Hz-0.6B-Base on HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
- [Behind The Voice Actors - Makima](https://www.behindthevoiceactors.com/tv-shows/Chainsaw-Man/Makima/)
- [Voicy Network Chainsaw Man Soundboard](https://www.voicy.network/official-soundboards/anime/chainsaw-man)