# TTS Research Document: Qwen3-TTS Integration for Makima
## Executive Summary
This document summarizes research on replacing the existing ChatterboxTTS implementation with Qwen3-TTS. The goal is live (streaming) TTS in which Makima speaks English using a voice cloned from her Japanese voice actress.
---
## 1. Current TTS Implementation Analysis
### 1.1 Architecture Overview
The current makima codebase uses **ChatterboxTTS** (ResembleAI/chatterbox-turbo-ONNX) with the following components:
| Component | File | Purpose |
|-----------|------|---------|
| TTS Module | `makima/src/tts.rs` | Core TTS inference using ONNX Runtime |
| Audio Processing | `makima/src/audio.rs` | Audio decoding, resampling (Symphonia) |
| Library Export | `makima/src/lib.rs` | Exposes `pub mod tts` |
### 1.2 ChatterboxTTS Technical Details
```rust
// Key constants from tts.rs
pub const SAMPLE_RATE: u32 = 24_000;
const MODEL_ID: &str = "ResembleAI/chatterbox-turbo-ONNX";
const DEFAULT_MODEL_DIR: &str = "models/chatterbox-turbo";
```
**ONNX Model Files:**
- `speech_encoder.onnx` - Encodes reference voice audio
- `embed_tokens.onnx` - Text token embedding
- `language_model.onnx` - Autoregressive token generation (24 layers, 16 KV heads)
- `conditional_decoder.onnx` - Decodes speech tokens to waveform
- `tokenizer.json` - Text tokenization
### 1.3 Current API Surface
```rust
// Signatures only; bodies omitted.
pub struct ChatterboxTTS { /* fields omitted */ }

impl ChatterboxTTS {
    pub fn from_pretrained(model_dir: Option<&str>) -> Result<Self, TtsError>;
    // Returns a VoiceRequired error: a reference voice is mandatory.
    pub fn generate_tts(&mut self, text: &str) -> Result<Vec<f32>, TtsError>;
    pub fn generate_tts_with_voice(text: &str, sample_audio_path: &Path) -> Result<Vec<f32>, TtsError>;
    pub fn generate_tts_with_samples(text: &str, samples: &[f32], sample_rate: u32) -> Result<Vec<f32>, TtsError>;
}

pub fn save_wav(samples: &[f32], path: &Path) -> Result<(), TtsError>;
```
### 1.4 Voice Cloning Capabilities (Current)
- **Requires** voice reference audio (returns `VoiceRequired` error without it)
- Accepts reference audio via file path or raw samples
- Resamples reference to 24 kHz internally
- Uses speaker embeddings + speaker features for voice cloning
### 1.5 Streaming/Live TTS (Current)
**NOT SUPPORTED** - The current implementation:
- Generates entire audio in one pass
- Uses autoregressive token generation with max 1024 tokens
- No chunked/streaming output capability
- Full pipeline must complete before audio is available
### 1.6 Server Integration Status
The TTS module is **not currently exposed via HTTP endpoints**. The server (`makima/src/server/mod.rs`) has:
- `/api/v1/listen` - WebSocket for Speech-to-Text (STT) only
- No TTS endpoints exist
---
## 2. Qwen3-TTS Model Analysis
### 2.1 Model Specifications
| Attribute | Value |
|-----------|-------|
| **Model** | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| **Parameters** | 0.6B (also available: 1.7B version) |
| **Architecture** | Discrete multi-codebook LM (16 codebooks, 2048 entries each) |
| **Tokenizer** | Qwen3-TTS-Tokenizer-12Hz |
| **Codec Frame Rate** | 12 Hz (tokenizer frame rate, not the audio sample rate; the decoder reconstructs audio at standard sample rates) |
| **License** | Apache 2.0 |
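To make the tokenizer figures concrete, the sketch below works through the token arithmetic they imply. It assumes a nominal 12 frames per second (read off the "12Hz" in the tokenizer name) and one code per codebook per frame; `codec_budget` is a hypothetical helper for illustration, not part of the `qwen-tts` package.

```python
import math

FRAME_RATE_HZ = 12      # nominal, from the "12Hz" tokenizer name (assumption)
NUM_CODEBOOKS = 16      # from the specification table
CODEBOOK_SIZE = 2048    # entries per codebook, from the specification table


def codec_budget(duration_s: float) -> dict:
    """Estimate how many codec frames and codes represent `duration_s` of audio."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    # Bits needed to index one codebook entry: 2048 entries -> 11 bits.
    bits_per_code = (CODEBOOK_SIZE - 1).bit_length()
    return {
        "frames": frames,
        "codes": frames * NUM_CODEBOOKS,
        "bits_per_second": FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code,
    }


print(codec_budget(10.0))
# {'frames': 120, 'codes': 1920, 'bits_per_second': 2112}
```

Under these assumptions, ten seconds of speech is only 120 autoregressive steps, which helps explain why a low frame rate matters for latency.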
### 2.2 Key Capabilities
#### Voice Cloning
- **3-second rapid voice clone** - Minimal reference audio needed
- **Flexible input formats**: local files, URLs, base64, (numpy_array, sample_rate) tuples
- **Reusable voice prompts**: Create once, use for multiple generations
```python
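# Voice cloning example
from qwen_tts import Qwen3TTSModel  # provided by the `qwen-tts` package

model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-0.6B-Base")
wavs, sr = model.generate_voice_clone(
    text="Target text",               # what the cloned voice should say
    language="English",
    ref_audio="reference.wav",        # clip of the target speaker
    ref_text="Reference transcript",  # transcript of reference.wav
)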
```
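The call above returns waveforms and a sample rate but does not write anything to disk. A minimal sketch of saving one result with only the standard library follows; it assumes `wavs[0]` is a one-dimensional sequence of floats in the range [-1.0, 1.0], and `save_wav_pcm16` is a hypothetical helper name.

```python
import struct
import wave


def save_wav_pcm16(samples, sample_rate: int, path: str) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16 bits
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(pcm))
```

With the variables from the example above, usage would look like `save_wav_pcm16(wavs[0], sr, "makima_test.wav")`.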
#### Streaming/Live TTS
- **97 ms end-to-end latency (as reported for the model)** - Low enough for conversational streaming
- **Dual-track hybrid streaming architecture** - One model supports both streaming and non-streaming generation
- **First packet after a single character** - Audio output can begin as soon as one character of input has arrived
#### Multilingual Support
10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
#### Quality Metrics
| Metric | Value |
|--------|-------|
| WER (English) | 1.32 |
| Speaker Similarity (English) | 0.829 |
| PESQ_WB | 3.21 |
| STOI | 0.96 |
| UTMOS | 4.16 |
### 2.3 Requirements
```bash
# Environment (prerequisites, not commands)
#   Python 3.12 (recommended)
#   CUDA-compatible GPU with 8GB+ VRAM

# Installation
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # Optional, reduces GPU memory
```
### 2.4 Current Deployment Options
| Option | Status | Notes |
|--------|--------|-------|
| **Python (qwen-tts)** | Stable | Official package |
| **vLLM-Omni** | Offline only | Online serving coming |
| **ONNX Export** | Not available | No official support |
| **Rust Implementation** | Draft PR #8 | Early development |
| **DashScope API** | Available | Alibaba Cloud hosted |
---
## 3. Makima Voice Audio Clips
### 3.1 Voice Actress Information
| Attribute | Value |
|-----------|-------|
| **Character** | Makima (Chainsaw Man) |
| **Japanese VA** | Tomori Kusunoki (楠木ともり) |
| **English VA** | Suzie Yeung |
| **Agency** | Sony Music Artists |
### 3.2 Audio Clip Sources
1. **Chainsaw Man Anime Episodes** - Primary source for Japanese voice
2. **Behind The Voice Actors** - Character voice samples
3. **YouTube Clips** - Interview compilations, scene clips
4. **Official Media** - Promotional videos, trailers
### 3.3 Audio Requirements for Voice Cloning
#### Qwen3-TTS Requirements
| Parameter | Requirement |
|-----------|-------------|
| **Minimum Duration** | 3 seconds (basic quality) |
| **Recommended Duration** | 10-30 seconds (professional quality) |
| **Format** | WAV, FLAC, MP3, OGG, AIFF, AAC |
| **Sample Rate** | 24 kHz or above recommended |
| **Channels** | Mono preferred |
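These requirements can be checked automatically before a clip is used as a reference. The sketch below is one possible pre-flight check, limited to WAV files because it relies on the standard-library `wave` module; the function name and the exact thresholds as constants are illustrative assumptions.

```python
import wave

MIN_DURATION_S = 3.0          # minimum duration from the requirements table
MIN_SAMPLE_RATE_HZ = 24_000   # recommended floor from the requirements table


def check_reference_clip(path: str) -> list[str]:
    """Return human-readable problems; an empty list means the clip passes."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    problems = []
    if duration < MIN_DURATION_S:
        problems.append(f"too short: {duration:.2f}s < {MIN_DURATION_S}s")
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if channels != 1:
        problems.append(f"{channels} channels; mono is preferred")
    return problems
```

Clips in other formats (FLAC, MP3, and so on) would need to be decoded first; this check says nothing about background noise or speaker purity, which still require listening.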
#### Best Practices
- **Clean audio**: No background music/noise
- **Single speaker**: Makima's voice only
- **Consistent tone**: Avoid dramatic variations
- **Include transcript**: Reference text improves quality
- **Varied content**: Mix of sentence types for flexibility
#### Recommended Clip Types
1. Calm, composed dialogue (Makima's signature tone)
2. Commands/instructions (authoritative delivery)
3. Questions (natural intonation)
4. Longer monologues (for voice consistency)
---
## 4. Feasibility Assessment for Live/Streaming TTS
### 4.1 Technical Challenges
| Challenge | Severity | Notes |
|-----------|----------|-------|
| No ONNX export | **High** | Current codebase uses ONNX Runtime |
| Rust implementation | **High** | Only draft PR available |
| Python dependency | Medium | Would require sidecar service |
| GPU memory | Medium | 8GB+ VRAM required |
| Streaming API | Low | Supported in Qwen3-TTS |
### 4.2 Integration Approaches
#### Option A: Python Sidecar Service (Recommended)
**Architecture**: Rust server + Python TTS service via HTTP/gRPC
**Pros:**
- Uses official Qwen3-TTS Python package
- Full streaming support (97ms latency)
- Simpler maintenance
**Cons:**
- Additional deployment complexity
- Inter-process communication overhead
```
┌─────────────────┐    HTTP/gRPC     ┌─────────────────┐
│  Makima Server  │ ◄──────────────► │    Qwen3-TTS    │
│   (Rust/Axum)   │                  │ (Python/FastAPI)│
└─────────────────┘                  └─────────────────┘
```
**Available Implementations:**
- [ValyrianTech/Qwen3-TTS_server](https://github.com/ValyrianTech/Qwen3-TTS_server) - FastAPI server
- [Qwen3-TTS-Openai-Fastapi](https://github.com/twolven/Qwen3-TTS-Openai-Fastapi) - OpenAI-compatible API
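Whichever transport is chosen, streamed audio has to be split into chunks that the Rust side can reassemble. The sketch below shows one possible length-prefixed framing. The header layout (sequence number, sample rate, payload length) is entirely hypothetical and is not the wire format of either server listed above.

```python
import struct

# Hypothetical frame header: sequence number, sample rate, payload length,
# each an unsigned 32-bit big-endian integer, followed by raw PCM bytes.
HEADER = struct.Struct(">III")


def encode_chunk(seq: int, sample_rate: int, pcm: bytes) -> bytes:
    """Prefix one audio chunk with its header."""
    return HEADER.pack(seq, sample_rate, len(pcm)) + pcm


def decode_chunks(buffer: bytes) -> tuple[list[tuple[int, int, bytes]], bytes]:
    """Split `buffer` into complete frames plus any trailing partial frame."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        seq, rate, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((seq, rate, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

The receiver keeps the returned remainder and prepends it to the next read, so frames that straddle network packets are handled without loss.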
#### Option B: Wait for Rust Implementation
**Status**: Draft PR #8 in early development
**Pros:**
- Native Rust integration
- No Python dependency
- Matches current architecture
**Cons:**
- Unknown timeline
- May require significant adaptation
#### Option C: Hybrid (ChatterboxTTS + Qwen3-TTS)
**Architecture**: Keep ChatterboxTTS as the existing ONNX-based path and add Qwen3-TTS for streaming.
**Pros:**
- Gradual migration
- Fallback capability
**Cons:**
- Dual model maintenance
- Increased complexity
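The fallback capability in Option C amounts to trying backends in priority order. A minimal sketch of that control flow follows; the `Synthesizer` type and `synthesize_with_fallback` are hypothetical names, and the backends are plain callables standing in for a Qwen3-TTS client and a ChatterboxTTS wrapper.

```python
from typing import Callable, Sequence

Synthesizer = Callable[[str], bytes]  # hypothetical: text in, encoded audio out


def synthesize_with_fallback(
    text: str, backends: Sequence[tuple[str, Synthesizer]]
) -> tuple[str, bytes]:
    """Try each backend in order; return (backend_name, audio) from the first success."""
    errors = []
    for name, synth in backends:
        try:
            return name, synth(text)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))
```

Returning the backend name alongside the audio lets the caller log how often the fallback path is taken, which is useful evidence when deciding whether dual maintenance is worth it.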
### 4.3 Recommendation
**Short-term (1-2 weeks)**: Implement **Option A** with Python sidecar
- Deploy ValyrianTech/Qwen3-TTS_server or similar
- Add HTTP client in Rust to call TTS service
- Implement WebSocket endpoint for streaming audio
**Long-term (3-6 months)**: Monitor Rust implementation progress
- Evaluate draft PR #8 stability
- Consider contributing to Rust port
- Migrate to native Rust when mature
---
## 5. Preliminary Technical Approach
### 5.1 Phase 1: Voice Preparation
1. **Collect Makima Audio Clips**
- Extract 3-5 clean clips from anime (10-30 seconds each)
- Ensure Japanese voice, clear audio, no BGM
- Prepare transcripts for each clip
2. **Test Voice Cloning Quality**
- Use Qwen3-TTS demo to validate clips
- Iterate on clip selection for best results
### 5.2 Phase 2: TTS Service Setup
1. **Deploy Qwen3-TTS Server**
```bash
# Using ValyrianTech server
docker run --gpus all -p 7860:7860 qwen3-tts-server
```
2. **Configure Voice Clone Profile**
- Upload Makima reference audio
- Store voice clone prompt for reuse
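One simple way to make a voice profile reusable is to persist the inputs that define it. The sketch below stores the reference clip path, its transcript, and the target language as JSON. It is an assumption-laden illustration: `VoiceProfile`, `save_profile`, and `load_profile` are hypothetical, and this stores only the inputs, not any prompt object computed by the model.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class VoiceProfile:
    name: str
    ref_audio: str   # path to the reference clip
    ref_text: str    # transcript of the reference clip
    language: str    # language the cloned voice should speak


def save_profile(profile: VoiceProfile, directory: Path) -> Path:
    """Write the profile as <directory>/<name>.json and return the file path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{profile.name}.json"
    text = json.dumps(asdict(profile), ensure_ascii=False, indent=2)
    path.write_text(text, encoding="utf-8")
    return path


def load_profile(name: str, directory: Path) -> VoiceProfile:
    """Read a profile previously written by save_profile."""
    data = json.loads((directory / f"{name}.json").read_text(encoding="utf-8"))
    return VoiceProfile(**data)
```

`ensure_ascii=False` keeps Japanese transcripts readable in the stored file.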
### 5.3 Phase 3: Makima Integration
1. **Add TTS Client Module**
```rust
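// New module: makima/src/tts_client.rs (interface sketch; bodies are placeholders)
use futures::Stream; // assumes the `futures` crate is a dependency
type Error = Box<dyn std::error::Error + Send + Sync>; // placeholder error type

pub struct QwenTTSClient {
    base_url: String,
    voice_profile: String,
}

impl QwenTTSClient {
    /// Returns the complete utterance as encoded audio bytes.
    pub async fn generate_speech(&self, _text: &str) -> Result<Vec<u8>, Error> {
        todo!("POST the text and voice profile to the sidecar at base_url")
    }
    /// Yields audio chunks as the sidecar produces them.
    pub async fn generate_speech_streaming(&self, _text: &str) -> impl Stream<Item = Vec<u8>> {
        futures::stream::empty() // placeholder until the sidecar protocol is decided
    }
}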
```
2. **Add TTS Endpoint**
```rust
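// In makima/src/server/mod.rs, added to the existing router
// (`tts_handler` and `tts_streaming_handler` are proposed handlers, not yet written)
.route("/api/v1/tts", post(tts_handler))
.route("/api/v1/tts/stream", get(tts_streaming_handler))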
```
3. **WebSocket Integration for Listen Page**
- Bidirectional audio: STT input, TTS output
- Low-latency streaming for conversational flow
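Low latency in a conversational loop depends on starting synthesis before the LLM has finished its reply. One common approach, sketched below under simple assumptions, is to cut the incoming text into sentences and hand each one to TTS as soon as it is complete. The splitting rule here is deliberately naive (it would split after abbreviations such as "Dr."), and `sentences_from_stream` is a hypothetical helper.

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation followed by whitespace; a deliberately simple rule.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(pieces: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text pieces."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


print(list(sentences_from_stream(["Hello", " there. How", " are you? I am", " fine"])))
# ['Hello there.', 'How are you?', 'I am fine']
```

The first sentence becomes available for synthesis after the second piece arrives, rather than after the whole reply.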
### 5.4 Phase 4: Listen Page Integration
1. **Update Frontend**
- Add TTS playback capability
- Handle streaming audio chunks
- UI for voice response indicators
2. **Orchestration Logic**
- STT → LLM → TTS pipeline
- Interrupt handling for user speech
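Interrupt handling (often called barge-in) can be sketched as a playback loop that checks a shared flag between chunks. The example below uses `asyncio.Event` as that flag; `speak`, `play`, and the simulated interruption are illustrative assumptions rather than a description of how the Listen page currently works.

```python
import asyncio


async def speak(chunks, play, interrupted: asyncio.Event) -> int:
    """Play chunks in order, stopping once `interrupted` is set. Returns chunks played."""
    played = 0
    for chunk in chunks:
        if interrupted.is_set():
            break  # the user started talking: drop the rest of the utterance
        await play(chunk)
        played += 1
    return played


async def demo() -> int:
    interrupted = asyncio.Event()

    async def play(chunk: bytes) -> None:
        await asyncio.sleep(0)       # stand-in for sending audio to the client
        if chunk == b"2":
            interrupted.set()        # simulate STT detecting user speech here

    return await speak([b"1", b"2", b"3", b"4"], play, interrupted)


print(asyncio.run(demo()))
# 2
```

In a real pipeline the STT task would set the event, and the same signal would also need to cancel any in-flight LLM and TTS requests so they stop consuming GPU time.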
---
## 6. Open Questions
1. **Voice Rights**: Are there legal considerations for cloning Tomori Kusunoki's voice?
2. **GPU Allocation**: Shared GPU for STT + TTS, or separate?
3. **Latency Budget**: What's acceptable end-to-end latency for Listen page?
4. **Fallback Strategy**: What happens if TTS service is unavailable?
5. **Multi-user**: How to handle concurrent TTS requests?
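For the multi-user question, one option worth evaluating is to cap the number of simultaneous synthesis calls so a single GPU is not oversubscribed. The sketch below wraps an async synthesize function with a semaphore; `BoundedTTS` is a hypothetical name and the right limit would have to be measured, not assumed.

```python
import asyncio


class BoundedTTS:
    """Wrap an async synthesize function so at most `limit` calls run at once."""

    def __init__(self, synthesize, limit: int):
        self._synthesize = synthesize
        self._limit = limit
        self._semaphore = None  # created lazily inside the running event loop

    async def __call__(self, text: str) -> bytes:
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._limit)
        async with self._semaphore:
            # Callers beyond the limit wait here until a slot frees up.
            return await self._synthesize(text)
```

Requests beyond the limit queue rather than fail, so this trades throughput protection for added latency under load; a production design might also bound the queue length or time out waiting callers.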
---
## 7. References
- [Qwen3-TTS HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report](https://arxiv.org/abs/2601.15621)
- [ValyrianTech Qwen3-TTS Server](https://github.com/ValyrianTech/Qwen3-TTS_server)
- [Qwen3-TTS OpenAI-Compatible FastAPI](https://github.com/twolven/Qwen3-TTS-Openai-Fastapi)
- [Makima Voice Actors](https://www.behindthevoiceactors.com/tv-shows/Chainsaw-Man/Makima/)
- [ChatterboxTTS Audio Guidelines](https://github.com/resemble-ai/chatterbox/issues/39)
- [Voice Cloning Best Practices - Resemble AI](https://www.resemble.ai/script-to-read-for-voice-cloning-guidelines/)
- [Qwen3-rs (Rust LLM implementation)](https://github.com/reinterpretcat/qwen3-rs)
---
*Document created: Research phase for Makima TTS replacement contract*