# TTS Research Document: Qwen3-TTS Integration for Makima

## Executive Summary

This document summarizes research on replacing the existing ChatterboxTTS implementation with Qwen3-TTS. The goal is live (streaming) TTS in which a voice cloned from Makima's Japanese voice actress speaks English.

---

## 1. Current TTS Implementation Analysis

### 1.1 Architecture Overview

The current makima codebase uses **ChatterboxTTS** (ResembleAI/chatterbox-turbo-ONNX) with the following components:

| Component | File | Purpose |
|-----------|------|---------|
| TTS Module | `makima/src/tts.rs` | Core TTS inference using ONNX Runtime |
| Audio Processing | `makima/src/audio.rs` | Audio decoding, resampling (Symphonia) |
| Library Export | `makima/src/lib.rs` | Exposes `pub mod tts` |

### 1.2 ChatterboxTTS Technical Details

```rust
// Key constants from tts.rs
pub const SAMPLE_RATE: u32 = 24_000;
const MODEL_ID: &str = "ResembleAI/chatterbox-turbo-ONNX";
const DEFAULT_MODEL_DIR: &str = "models/chatterbox-turbo";
```

**ONNX Model Files:**
- `speech_encoder.onnx` - Encodes reference voice audio
- `embed_tokens.onnx` - Text token embedding
- `language_model.onnx` - Autoregressive token generation (24 layers, 16 KV heads)
- `conditional_decoder.onnx` - Decodes speech tokens to waveform
- `tokenizer.json` - Text tokenization

### 1.3 Current API Surface

```rust
// API summary: signatures only, bodies omitted.
impl ChatterboxTTS {
    pub fn from_pretrained(model_dir: Option<&str>) -> Result<Self, TtsError>;
    pub fn generate_tts(&mut self, text: &str) -> Result<Vec<f32>, TtsError>;  // Returns VoiceRequired error
    pub fn generate_tts_with_voice(text: &str, sample_audio_path: &Path) -> Result<Vec<f32>, TtsError>;
    pub fn generate_tts_with_samples(text: &str, samples: &[f32], sample_rate: u32) -> Result<Vec<f32>, TtsError>;
}
pub fn save_wav(samples: &[f32], path: &Path) -> Result<(), TtsError>;
```

### 1.4 Voice Cloning Capabilities (Current)

- **Requires** voice reference audio (returns `VoiceRequired` error without it)
- Accepts reference audio via file path or raw samples
- Resamples reference to 24kHz internally
- Uses speaker embeddings + speaker features for voice cloning
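
The resampling step can be illustrated with a naive linear-interpolation resampler. This is a sketch in Python for readability only: it is not the code in `audio.rs`, `resample_linear` is a hypothetical helper, and a production resampler would normally low-pass filter before downsampling to avoid aliasing.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int = 24_000) -> list[float]:
    """Resample mono audio by linear interpolation (illustrative only)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)

    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring source samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For example, one second of 48 kHz audio (48,000 samples) comes back as 24,000 samples at the default target rate.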

### 1.5 Streaming/Live TTS (Current)

**NOT SUPPORTED** - The current implementation:
- Generates entire audio in one pass
- Uses autoregressive token generation with max 1024 tokens
- No chunked/streaming output capability
- Full pipeline must complete before audio is available

### 1.6 Server Integration Status

The TTS module is **not currently exposed via HTTP endpoints**. The server (`makima/src/server/mod.rs`) has:
- `/api/v1/listen` - WebSocket for Speech-to-Text (STT) only
- No TTS endpoints exist

---

## 2. Qwen3-TTS Model Analysis

### 2.1 Model Specifications

| Attribute | Value |
|-----------|-------|
| **Model** | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| **Parameters** | 0.6B (also available: 1.7B version) |
| **Architecture** | Discrete multi-codebook LM (16 codebooks, 2048 entries each) |
| **Tokenizer** | Qwen3-TTS-Tokenizer-12Hz |
| **Token Frame Rate** | 12 Hz (codec frames per second; the decoder reconstructs audio at standard sample rates) |
| **License** | Apache 2.0 |
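
To make the table's numbers concrete, the arithmetic below works out what a 12 Hz tokenizer with 16 codebooks of 2048 entries implies per frame and per second. It takes the nominal 12 Hz from the model name at face value and assumes a 24 kHz output rate (the rate used by the current pipeline, not a confirmed Qwen3-TTS value); adjust the constants if either differs.

```python
import math

FRAME_RATE_HZ = 12           # codec frames per second, from the model name
NUM_CODEBOOKS = 16           # from the table above
CODEBOOK_SIZE = 2048         # entries per codebook, from the table above
OUTPUT_SAMPLE_RATE = 24_000  # assumption: same rate as the current pipeline

frame_ms = 1000 / FRAME_RATE_HZ                       # audio covered by one frame
codes_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS      # discrete codes generated per second
bits_per_code = math.log2(CODEBOOK_SIZE)              # information per code
bitrate_bps = codes_per_second * bits_per_code        # nominal token bitrate
samples_per_frame = OUTPUT_SAMPLE_RATE // FRAME_RATE_HZ

print(f"{frame_ms:.1f} ms of audio per frame")
print(f"{codes_per_second} codes/s at {bits_per_code:.0f} bits each = {bitrate_bps:.0f} bit/s")
print(f"{samples_per_frame} output samples per frame")
```

Under these assumptions each frame covers about 83.3 ms of audio (2000 samples), and the model emits 192 codes per second, or roughly 2.1 kbit/s of token data.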

### 2.2 Key Capabilities

#### Voice Cloning
- **3-second rapid voice clone** - Minimal reference audio needed
- **Flexible input formats**: local files, URLs, base64, (numpy_array, sample_rate) tuples
- **Reusable voice prompts**: Create once, use for multiple generations

```python
# Voice cloning example
model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS-12Hz-0.6B-Base")
wavs, sr = model.generate_voice_clone(
    text="Target text",
    language="English",
    ref_audio="reference.wav",
    ref_text="Reference transcript"
)
```

#### Streaming/Live TTS
- **97ms end-to-end latency** - Ultra-low latency streaming
- **Dual-track hybrid streaming architecture** - Supports both streaming and non-streaming
- **First audio packet after a single input character** - Synthesis can start before the full text is available

#### Multilingual Support
10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

#### Quality Metrics
| Metric | Value |
|--------|-------|
| WER (English, %) | 1.32 |
| Speaker Similarity (English) | 0.829 |
| PESQ_WB | 3.21 |
| STOI | 0.96 |
| UTMOS | 4.16 |

### 2.3 Requirements

```bash
# Environment (requirements, not commands):
#   Python 3.12 (recommended)
#   CUDA-compatible GPU with 8GB+ VRAM

# Installation
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # Optional, reduces GPU memory
```

### 2.4 Current Deployment Options

| Option | Status | Notes |
|--------|--------|-------|
| **Python (qwen-tts)** | Stable | Official package |
| **vLLM-Omni** | Offline only | Online serving coming |
| **ONNX Export** | Not available | No official support |
| **Rust Implementation** | Draft PR #8 | Early development |
| **DashScope API** | Available | Alibaba Cloud hosted |

---

## 3. Makima Voice Audio Clips

### 3.1 Voice Actress Information

| Attribute | Value |
|-----------|-------|
| **Character** | Makima (Chainsaw Man) |
| **Japanese VA** | Tomori Kusunoki (楠木ともり) |
| **English VA** | Suzie Yeung |
| **Agency** | Sony Music Artists |

### 3.2 Audio Clip Sources

1. **Chainsaw Man Anime Episodes** - Primary source for Japanese voice
2. **Behind The Voice Actors** - Character voice samples
3. **YouTube Clips** - Interview compilations, scene clips
4. **Official Media** - Promotional videos, trailers

### 3.3 Audio Requirements for Voice Cloning

#### Qwen3-TTS Requirements
| Parameter | Requirement |
|-----------|-------------|
| **Minimum Duration** | 3 seconds (basic quality) |
| **Recommended Duration** | 10-30 seconds (professional quality) |
| **Format** | WAV, FLAC, MP3, OGG, AIFF, AAC |
| **Sample Rate** | 24kHz or above recommended |
| **Channels** | Mono preferred |
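
The requirements above can be checked automatically before a clip is used as a reference. The sketch below uses only the standard-library `wave` module, so it handles WAV files only; `check_reference_clip` and `ClipReport` are hypothetical names, and treating everything except the 3-second minimum as a warning is a choice made here, not a rule from Qwen3-TTS.

```python
import wave
from dataclasses import dataclass, field

MIN_SECONDS = 3.0                    # minimum duration from the table above
RECOMMENDED_SECONDS = (10.0, 30.0)   # recommended duration range
RECOMMENDED_RATE_HZ = 24_000         # recommended minimum sample rate


@dataclass
class ClipReport:
    duration_s: float
    sample_rate: int
    channels: int
    problems: list = field(default_factory=list)   # hard failures
    warnings: list = field(default_factory=list)   # outside recommendations


def check_reference_clip(path: str) -> ClipReport:
    """Inspect a WAV file against the requirements listed above."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    report = ClipReport(duration, rate, channels)
    if duration < MIN_SECONDS:
        report.problems.append(f"too short: {duration:.1f}s < {MIN_SECONDS}s")
    elif not RECOMMENDED_SECONDS[0] <= duration <= RECOMMENDED_SECONDS[1]:
        report.warnings.append("duration outside the recommended 10-30s range")
    if rate < RECOMMENDED_RATE_HZ:
        report.warnings.append(f"sample rate {rate} Hz is below 24 kHz")
    if channels != 1:
        report.warnings.append("not mono; consider downmixing")
    return report
```

A clip with an empty `problems` list is usable; `warnings` flag clips that may still work but fall outside the recommended ranges. Background noise and speaker consistency still need a listening check.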

#### Best Practices
- **Clean audio**: No background music/noise
- **Single speaker**: Makima's voice only
- **Consistent tone**: Avoid dramatic variations
- **Include transcript**: Reference text improves quality
- **Varied content**: Mix of sentence types for flexibility

#### Recommended Clip Types
1. Calm, composed dialogue (Makima's signature tone)
2. Commands/instructions (authoritative delivery)
3. Questions (natural intonation)
4. Longer monologues (for voice consistency)

---

## 4. Feasibility Assessment for Live/Streaming TTS

### 4.1 Technical Challenges

| Challenge | Severity | Notes |
|-----------|----------|-------|
| No ONNX export | **High** | Current codebase uses ONNX Runtime |
| Rust implementation | **High** | Only draft PR available |
| Python dependency | Medium | Would require sidecar service |
| GPU memory | Medium | 8GB+ VRAM required |
| Streaming API | Low | Supported in Qwen3-TTS |

### 4.2 Integration Approaches

#### Option A: Python Sidecar Service (Recommended)
**Architecture**: Rust server + Python TTS service via HTTP/gRPC

**Pros:**
- Uses official Qwen3-TTS Python package
- Full streaming support (97ms latency)
- Simpler maintenance

**Cons:**
- Additional deployment complexity
- Inter-process communication overhead

```
┌─────────────────┐     HTTP/gRPC     ┌───────────────────┐
│  Makima Server  │ ◄───────────────► │  Qwen3-TTS        │
│  (Rust/Axum)    │                   │  (Python/FastAPI) │
└─────────────────┘                   └───────────────────┘
```

**Available Implementations:**
- [ValyrianTech/Qwen3-TTS_server](https://github.com/ValyrianTech/Qwen3-TTS_server) - FastAPI server
- [Qwen3-TTS-Openai-Fastapi](https://github.com/twolven/Qwen3-TTS-Openai-Fastapi) - OpenAI-compatible API

#### Option B: Wait for Rust Implementation
**Status**: Draft PR #8 in early development

**Pros:**
- Native Rust integration
- No Python dependency
- Matches current architecture

**Cons:**
- Unknown timeline
- May require significant adaptation

#### Option C: Hybrid (ChatterboxTTS + Qwen3-TTS)
**Architecture**: Keep ChatterboxTTS for ONNX compatibility and add Qwen3-TTS for streaming.

**Pros:**
- Gradual migration
- Fallback capability

**Cons:**
- Dual model maintenance
- Increased complexity
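
The fallback idea in Option C can be sketched as an ordered list of engines tried in turn. This is a minimal illustration, assuming each engine can be wrapped as a plain callable from text to encoded audio; `synthesize_with_fallback` and the `Engine` alias are hypothetical, and in Makima the equivalent logic would live on the Rust side.

```python
from typing import Callable, Sequence

Engine = Callable[[str], bytes]  # hypothetical: text in, encoded audio out


def synthesize_with_fallback(
    text: str, engines: Sequence[tuple[str, Engine]]
) -> tuple[str, bytes]:
    """Try each (name, engine) in order; return the first successful result."""
    errors = []
    for name, engine in engines:
        try:
            return name, engine(text)
        except Exception as exc:  # a real service would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS engines failed: " + "; ".join(errors))
```

Returning the engine name alongside the audio lets the caller log which backend served the request, which helps when measuring how often the fallback is hit.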

### 4.3 Recommendation

**Short-term (1-2 weeks)**: Implement **Option A** with Python sidecar
- Deploy ValyrianTech/Qwen3-TTS_server or similar
- Add HTTP client in Rust to call TTS service
- Implement WebSocket endpoint for streaming audio

**Long-term (3-6 months)**: Monitor Rust implementation progress
- Evaluate draft PR #8 stability
- Consider contributing to Rust port
- Migrate to native Rust when mature

---

## 5. Preliminary Technical Approach

### 5.1 Phase 1: Voice Preparation

1. **Collect Makima Audio Clips**
   - Extract 3-5 clean clips from anime (10-30 seconds each)
   - Ensure each clip uses the Japanese voice track, with clear audio and no background music (BGM)
   - Prepare transcripts for each clip

2. **Test Voice Cloning Quality**
   - Use Qwen3-TTS demo to validate clips
   - Iterate on clip selection for best results

### 5.2 Phase 2: TTS Service Setup

1. **Deploy Qwen3-TTS Server**
   ```bash
   # Using ValyrianTech server
   docker run --gpus all -p 7860:7860 qwen3-tts-server
   ```

2. **Configure Voice Clone Profile**
   - Upload Makima reference audio
   - Store voice clone prompt for reuse
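
Storing the voice clone prompt for reuse amounts to caching an expensive computation keyed by its inputs. The sketch below is one way to do that inside the sidecar, assuming the backend exposes some function that turns reference audio plus transcript into a reusable prompt; `VoicePromptCache` and `build_prompt` are hypothetical names, not part of the `qwen-tts` package.

```python
import hashlib
from typing import Any, Callable


class VoicePromptCache:
    """Cache voice-clone prompts, keyed by reference audio and transcript."""

    def __init__(self, build_prompt: Callable[[bytes, str], Any]):
        # build_prompt stands in for whatever the TTS backend uses to turn
        # reference audio + transcript into a reusable prompt (an assumption).
        self._build_prompt = build_prompt
        self._prompts: dict[str, Any] = {}

    @staticmethod
    def _key(ref_audio: bytes, ref_text: str) -> str:
        digest = hashlib.sha256()
        digest.update(ref_audio)
        digest.update(b"\x00")  # separator so audio and text cannot run together
        digest.update(ref_text.encode("utf-8"))
        return digest.hexdigest()

    def get(self, ref_audio: bytes, ref_text: str) -> Any:
        """Return the cached prompt, building it on first use."""
        key = self._key(ref_audio, ref_text)
        if key not in self._prompts:
            self._prompts[key] = self._build_prompt(ref_audio, ref_text)
        return self._prompts[key]
```

Keying on a content hash rather than a file name means a replaced reference clip automatically yields a fresh prompt.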

### 5.3 Phase 3: Makima Integration

1. **Add TTS Client Module**
   ```rust
   // New module: makima/src/tts_client.rs
   // Proposed API: signatures only, method bodies still to be written.
   pub struct QwenTTSClient {
       base_url: String,
       voice_profile: String,
   }

   impl QwenTTSClient {
       pub async fn generate_speech(&self, text: &str) -> Result<Vec<u8>, Error>
       pub async fn generate_speech_streaming(&self, text: &str) -> impl Stream<Item = Vec<u8>>
   }
   ```

2. **Add TTS Endpoint**
   ```rust
   // In makima/src/server/mod.rs
   .route("/api/v1/tts", post(tts_handler))
   .route("/api/v1/tts/stream", get(tts_streaming_handler))
   ```

3. **WebSocket Integration for Listen Page**
   - Bidirectional audio: STT input, TTS output
   - Low-latency streaming for conversational flow
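
For streaming output, synthesized audio has to be cut into small binary frames before it goes over the WebSocket. The sketch below shows one common approach: convert float samples to 16-bit little-endian PCM and emit fixed-duration chunks. The wire format and the 40 ms chunk size are assumptions for illustration, since this document does not fix either, and `pcm16_chunks` is a hypothetical helper.

```python
import struct
from typing import Iterable, Iterator


def pcm16_chunks(
    samples: Iterable[float], sample_rate: int = 24_000, chunk_ms: int = 40
) -> Iterator[bytes]:
    """Yield little-endian 16-bit PCM chunks of roughly chunk_ms each."""
    samples_per_chunk = max(1, sample_rate * chunk_ms // 1000)
    buf = bytearray()
    count = 0
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against overshoot
        buf += struct.pack("<h", int(clipped * 32767))
        count += 1
        if count == samples_per_chunk:
            yield bytes(buf)
            buf.clear()
            count = 0
    if buf:  # flush the final partial chunk
        yield bytes(buf)
```

At 24 kHz, a 40 ms chunk holds 960 samples (1920 bytes). Smaller chunks lower latency at the cost of more messages; the right size depends on the latency budget for the Listen page.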

### 5.4 Phase 4: Listen Page Integration

1. **Update Frontend**
   - Add TTS playback capability
   - Handle streaming audio chunks
   - UI for voice response indicators

2. **Orchestration Logic**
   - STT → LLM → TTS pipeline
   - Interrupt handling for user speech

---

## 6. Open Questions

1. **Voice Rights**: Are there legal considerations for cloning Tomori Kusunoki's voice?
2. **GPU Allocation**: Shared GPU for STT + TTS, or separate?
3. **Latency Budget**: What's acceptable end-to-end latency for Listen page?
4. **Fallback Strategy**: What happens if TTS service is unavailable?
5. **Multi-user**: How to handle concurrent TTS requests?
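
For question 5, one possible starting point is to bound concurrency inside the sidecar so that requests beyond the GPU's capacity wait rather than fail. The sketch below wraps an async synthesis function with a semaphore; `BoundedSynthesizer` and the `synthesize` callable are hypothetical, and the limit of 2 is a placeholder to be tuned against real GPU measurements.

```python
import asyncio
from typing import Awaitable, Callable


class BoundedSynthesizer:
    """Limit how many synthesis calls run at once; extra callers wait their turn."""

    def __init__(
        self,
        synthesize: Callable[[str], Awaitable[bytes]],  # hypothetical async TTS call
        max_concurrent: int = 2,
    ):
        self._synthesize = synthesize
        self._slots = asyncio.Semaphore(max_concurrent)

    async def __call__(self, text: str) -> bytes:
        async with self._slots:  # blocks here while all slots are in use
            return await self._synthesize(text)
```

This keeps ordering fair among waiting callers but adds queueing delay under load, so it interacts directly with the latency budget in question 3.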

---

## 7. References

- [Qwen3-TTS HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report](https://arxiv.org/abs/2601.15621)
- [ValyrianTech Qwen3-TTS Server](https://github.com/ValyrianTech/Qwen3-TTS_server)
- [Qwen3-TTS OpenAI-Compatible FastAPI](https://github.com/twolven/Qwen3-TTS-Openai-Fastapi)
- [Makima Voice Actors](https://www.behindthevoiceactors.com/tv-shows/Chainsaw-Man/Makima/)
- [ChatterboxTTS Audio Guidelines](https://github.com/resemble-ai/chatterbox/issues/39)
- [Voice Cloning Best Practices - Resemble AI](https://www.resemble.ai/script-to-read-for-voice-cloning-guidelines/)
- [Qwen3-rs (Rust LLM implementation)](https://github.com/reinterpretcat/qwen3-rs)

---

*Document created: Research phase for Makima TTS replacement contract*