# TTS Research: Qwen3-TTS-12Hz-0.6B-Base Integration

## Executive Summary

This document evaluates replacing the current Chatterbox TTS implementation with Qwen3-TTS-12Hz-0.6B-Base for the makima system. The goal is to enable near-real-time voice synthesis with voice cloning capabilities, defaulting to Makima's Japanese voice (Tomori Kusunoki) speaking English.

**Key Findings:**
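- Qwen3-TTS advertises streaming synthesis with ~97ms latency, whereas the current Chatterbox implementation is batch-only; official streaming inference code is not yet released (see Section 3.1)
- Voice cloning needs as little as 3 seconds of reference audio, plus a transcript of that audio for full quality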
- No official ONNX export exists; Python/PyTorch inference required
- The 0.6B model is optimized for resource-constrained environments

---

## 1. Current TTS Implementation Analysis

### 1.1 Architecture Overview

The current implementation uses **Chatterbox-Turbo-ONNX** from ResembleAI:

```
Location: makima/src/tts.rs
Model ID: ResembleAI/chatterbox-turbo-ONNX
Sample Rate: 24,000 Hz
```

**Components:**
| File | Size | Purpose |
|-----------|------|---------|
| `speech_encoder.onnx` | ~XX MB | Encodes reference audio to speaker embeddings |
| `embed_tokens.onnx` | ~XX MB | Token embedding layer |
| `language_model.onnx` | ~XX MB | Autoregressive text-to-speech token generation |
| `conditional_decoder.onnx` | ~XX MB | Converts speech tokens to waveform |
| `tokenizer.json` | ~KB | Text tokenization |

### 1.2 Current API Surface

```rust
pub struct ChatterboxTTS {
    speech_encoder: Session,
    embed_tokens: Session,
    language_model: Session,
    conditional_decoder: Session,
    tokenizer: Tokenizer,
}

impl ChatterboxTTS {
    // Load from pretrained models
    pub fn from_pretrained(model_dir: Option<&str>) -> Result<Self, TtsError>;

    // Generate speech (requires voice reference)
    pub fn generate_tts(&mut self, _text: &str) -> Result<Vec<f32>, TtsError>;

    // Voice cloning from file path
    pub fn generate_tts_with_voice(
        &mut self,
        text: &str,
        sample_audio_path: &Path,
    ) -> Result<Vec<f32>, TtsError>;

    // Voice cloning from raw samples
    pub fn generate_tts_with_samples(
        &mut self,
        text: &str,
        samples: &[f32],
        sample_rate: u32,
    ) -> Result<Vec<f32>, TtsError>;
}
```

### 1.3 Current Capabilities

| Feature | Supported | Notes |
|---------|-----------|-------|
| Voice Cloning | **Yes** | Required for all synthesis |
| Streaming | **No** | Batch generation only |
| Languages | Limited | English-focused |
| ONNX Runtime | **Yes** | CPU inference via `ort` crate |
| GPU Acceleration | Partial | ONNX supports CUDA EP |
| Real-time Factor | Unknown | Not benchmarked |
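
The real-time factor (RTF) row is the one unmeasured entry in this table. As a rough sketch of how it could be benchmarked: RTF is wall-clock synthesis time divided by the duration of the audio produced, so a value below 1.0 means faster than real time. The `synthesize` callable here is a hypothetical stand-in for whatever wraps the engine (for example a binding around `generate_tts_with_voice`); it is assumed to return mono samples at 24 kHz.

```python
import time
from typing import Callable, Sequence

SAMPLE_RATE_HZ = 24_000  # Chatterbox output rate from Section 1.1


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical engine wrapper
    text: str,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
) -> float:
    """Return synthesis wall-clock time divided by generated audio duration."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed_s = time.perf_counter() - start

    if len(samples) == 0:
        raise ValueError("synthesize returned no audio")

    audio_s = len(samples) / sample_rate_hz
    return elapsed_s / audio_s
```

Averaging over several sentences of different lengths would give a more stable figure than a single call, since short utterances are dominated by fixed startup cost.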

### 1.4 Integration Points

The TTS module is currently:
- Exposed as `pub mod tts` in `lib.rs`
- Used in `main.rs` for testing
- **Not integrated with the web server** (no `/api/v1/tts` endpoint)

The audio processing infrastructure is shared with the Listen (STT) module:
- `audio.rs` provides format conversion utilities
- `symphonia` for decoding various audio formats
- Resampling to target sample rates (16kHz for STT, 24kHz for TTS)

---

## 2. Qwen3-TTS-12Hz-0.6B-Base Analysis

### 2.1 Model Overview

**Source:** [Hugging Face - Qwen/Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)

| Specification | Value |
|---------------|-------|
| Parameters | 0.6B |
| Release Date | January 22, 2026 |
| Architecture | Dual-Track hybrid streaming LM |
| Tokenizer | Qwen3-TTS-Tokenizer-12Hz |
| Frame Rate | 12.5 Hz |
| Output Sample Rate | 24 kHz |
| Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |
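
The frame rate and output sample rate together determine how much audio each codec frame represents, which bounds the granularity of any streamed chunk. A small worked calculation, assuming every frame decodes to an equal-length span of audio (the table implies this on average but does not state it):

```python
import math

FRAME_RATE_HZ = 12.5      # codec frames per second (table above)
SAMPLE_RATE_HZ = 24_000   # output samples per second (table above)

# Audio covered by one codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # 1920.0 samples
frame_duration_ms = 1000 / FRAME_RATE_HZ            # 80.0 ms


def frames_for_duration(seconds: float) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)


# A 3-second reference clip spans 38 frames (37.5 rounded up).
reference_frames = frames_for_duration(3.0)
```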

### 2.2 Key Features

| Feature | Status | Details |
|---------|--------|---------|
| **Voice Cloning** | Yes | 3-second minimum reference audio |
| **Streaming** | Yes | 97ms end-to-end latency |
| **Real-time** | Yes | First audio packet after single character |
| **Multilingual** | Yes | 10 languages supported |
| **Instruction Control** | No | Base model limitation |

### 2.3 Streaming Architecture

The Dual-Track architecture enables:
1. **Streaming text input** - Processes text incrementally
2. **Streaming audio output** - Emits audio packets as generated
3. **Multi-Token Prediction (MTP)** - Enables immediate speech decoding from first codec frame

**Latency Benchmarks:**
- Advertised end-to-end latency: ~97ms
- Optimized TTFT: ~170ms on RTX 5090 (community fork)
- Initial implementations: ~800ms TTFT (before optimization)
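
To compare these figures against a local setup, time-to-first-chunk can be measured around any iterator of audio chunks. This is a minimal sketch, not tied to a particular streaming API: `chunks` is assumed to be whatever generator the chosen backend exposes (official or community fork), and the helper names are illustrative.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_chunks(chunks: Iterable[bytes]) -> Iterator[Tuple[float, bytes]]:
    """Yield (seconds since iteration began, chunk) for each chunk."""
    start = time.perf_counter()
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


def time_to_first_chunk(chunks: Iterable[bytes]) -> float:
    """Return the delay in seconds before the first chunk arrives."""
    for elapsed, _ in timed_chunks(chunks):
        return elapsed
    raise ValueError("stream produced no chunks")
```

The clock starts when iteration begins, so the measurement includes model prefill but excludes any network hop between the Rust server and the Python service.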

### 2.4 Voice Cloning Requirements

| Requirement | Specification |
|-------------|---------------|
| Reference Audio Length | **3 seconds minimum** |
| Audio Format | WAV, MP3, or common formats |
| Input Methods | File path, URL, base64, numpy array |
| Reference Text | **Required** (transcript of reference audio) |
| X-Vector Only Mode | Optional (speaker embedding only, lower quality) |
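
These requirements can be checked before a clip is handed to the model. The sketch below uses only the standard-library `wave` module, so it covers WAV files only; the function names and the 3-second constant's name are illustrative, not part of `qwen-tts`.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum from the table above


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed from frame count and frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(
    path: str,
    transcript: str,
    min_seconds: float = MIN_REFERENCE_SECONDS,
) -> list[str]:
    """Return a list of problems; an empty list means the clip looks usable."""
    problems = []
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        problems.append(
            f"reference is {duration:.2f}s, need at least {min_seconds:.0f}s"
        )
    if not transcript.strip():
        problems.append("transcript is empty; only x-vector mode would apply")
    return problems
```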

### 2.5 Python API

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Voice cloning
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test.",
    language="English",
    ref_audio="reference.wav",
    ref_text="Original speaker text from reference",
)

# Reusable prompt (efficient for multiple generations)
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Reference transcript",
)

wavs, sr = model.generate_voice_clone(
    text="New text",
    language="English",
    voice_clone_prompt=prompt,
)
```

### 2.6 Dependencies

```
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # Optional, recommended
```

**Requirements:**
- Python 3.12 recommended
- CUDA-capable GPU (for optimal performance)
- FlashAttention 2 compatible hardware (only when the optional `flash-attn` package is used)
- PyTorch with bfloat16 support

---

## 3. Feasibility Assessment

### 3.1 Streaming/Live TTS Feasibility

**Assessment: FEASIBLE with caveats**

| Factor | Current State | Path Forward |
|--------|---------------|--------------|
| Streaming API | Experimental (community fork) | Use [dffdeeq/Qwen3-TTS-streaming](https://github.com/dffdeeq/Qwen3-TTS-streaming) or wait for official support |
| Latency Target | 97ms advertised | Achievable with optimization |
| First Token | ~170ms optimized | Acceptable for conversational use |
| Audio Stability | First 1-2s may have timbre issues | Known limitation, may need buffering |

**Streaming Implementation Status:**
- Official repository: Streaming documented but **not released**
- Community fork: Working implementation with ~170ms TTFT
- vLLM-Omni: Offline inference only (online serving planned)

### 3.2 Voice Cloning for Makima

**Assessment: FULLY FEASIBLE**

Requirements for Makima voice cloning:
1. **3+ seconds of clean audio** - Tomori Kusunoki (Japanese VA) speaking
2. **Transcript of the audio** - Required for full quality
3. **Audio format** - WAV/MP3 acceptable

**Audio Sources:**
- Chainsaw Man anime clips
- Voice actress promotional material
- Behind The Voice Actors database

**Considerations:**
- Japanese VA speaking English may work with explicit `language="English"` setting
- May need English-speaking Makima clips (Suzie Yeung, English dub VA) as fallback
- X-vector mode available if transcript is difficult to obtain

### 3.3 Integration Complexity

| Component | Complexity | Notes |
|-----------|------------|-------|
| Model Loading | Medium | Python subprocess or PyO3 bridge required |
| Streaming API | High | WebSocket integration needed |
| Voice Caching | Low | `create_voice_clone_prompt()` supports this |
| Audio Format | Low | Both use 24kHz output |
| ONNX Migration | N/A | **No ONNX export available** |

### 3.4 ONNX vs Python Inference

**Current approach (Chatterbox):** Rust + ONNX Runtime
- Pros: Native Rust, low latency, CPU-friendly
- Cons: Limited model ecosystem, no streaming

**Required approach (Qwen3-TTS):** Python + PyTorch
- Pros: Full model access, streaming support, GPU acceleration
- Cons: Python subprocess overhead, dependency management

**Integration Options:**

1. **Python Subprocess/Service**
   - Run `qwen-tts` as separate Python service
   - Communicate via HTTP/WebSocket
   - Cleanest separation, easiest to implement

2. **PyO3 Bridge**
   - Embed Python in Rust binary
   - Higher complexity, tighter integration
   - May have GIL contention issues

3. **Custom ONNX Export** (Future)
   - Not currently available
   - Would require community effort
   - No timeline from Qwen team

**Recommendation:** Python service with WebSocket streaming

---

## 4. Audio Clip Requirements

### 4.1 For Voice Cloning Setup

| Requirement | Specification |
|-------------|---------------|
| Minimum Duration | 3 seconds |
| Recommended Duration | 5-10 seconds |
| Format | WAV (preferred), MP3 |
| Sample Rate | Any (will be resampled) |
| Content | Clear speech, minimal background noise |
| Transcript | Required for full quality |

### 4.2 Makima Voice Sources

**Priority 1: Japanese VA (Tomori Kusunoki) speaking Japanese**
- Source: Chainsaw Man anime
- Challenge: Need clear dialogue without music/SFX
- Risk: May not transfer well to English output

**Priority 2: English VA (Suzie Yeung)**
- Source: Chainsaw Man English dub
- Advantage: Native English speaker for English output
- Trade-off: Different vocal characteristics from Japanese VA

**Recommended Approach:**
1. Extract 5-10 second clips of both VAs
2. Test voice cloning quality with each
3. Select based on English speech naturalness
4. Store reference audio + transcript in `models/voices/makima/`

### 4.3 Transcript Requirements

For optimal voice cloning:
```
ref_audio: "models/voices/makima/makima-reference.wav"
ref_text: "The exact words spoken in the reference audio"
```

X-vector fallback (lower quality, no transcript needed):
```python
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    x_vector_only_mode=True,  # No transcript required
)
```

---

## 5. Preliminary Technical Approach

### 5.1 Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                    Makima Server (Rust)                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────────────┐│
│  │ Listen (STT)│  │ TTS Proxy   │  │ Chat/Contract APIs   ││
│  │ /api/v1/    │  │ /api/v1/tts │  │ /api/v1/...          ││
│  │ listen      │  │             │  │                      ││
│  └──────┬──────┘  └──────┬──────┘  └──────────────────────┘│
│         │                │                                  │
│         │         ┌──────▼──────┐                          │
│         │         │  WebSocket  │                          │
│         │         │   Bridge    │                          │
│         │         └──────┬──────┘                          │
└─────────┼────────────────┼──────────────────────────────────┘
          │                │
          │         ┌──────▼──────┐
          │         │ Python TTS  │
          │         │   Service   │
          │         │ (Qwen3-TTS) │
          │         └─────────────┘
          │
   ┌──────▼──────┐
   │  ML Models  │
   │ (Parakeet,  │
   │ Sortformer) │
   └─────────────┘
```

### 5.2 Python TTS Service

**Proposed Architecture:**

```python
# tts_service.py
import asyncio

import torch
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel
from qwen_tts import Qwen3TTSModel

app = FastAPI()
model = None
voice_prompts = {}


class TTSRequest(BaseModel):
    # Request body for the batch endpoint; mirrors the WebSocket fields
    text: str
    voice: str = "makima"
    language: str = "English"


@app.on_event("startup")
async def load_model():
    global model
    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    # Pre-load Makima voice prompt
    voice_prompts["makima"] = model.create_voice_clone_prompt(
        ref_audio="models/voices/makima/reference.wav",
        ref_text="[Makima reference transcript]",
    )

@app.websocket("/ws/tts")
async def tts_stream(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            text = data["text"]
            voice = data.get("voice", "makima")
            language = data.get("language", "English")

            # Batch generation for now; swap in streaming when available.
            # Run in a worker thread so inference does not block the event loop.
            prompt = voice_prompts.get(voice)
            wavs, sr = await asyncio.to_thread(
                model.generate_voice_clone,
                text=text,
                language=language,
                voice_clone_prompt=prompt,
            )

            # Send the whole utterance as one binary message of raw samples
            await websocket.send_bytes(wavs[0].tobytes())
    except WebSocketDisconnect:
        pass  # Client closed the connection

@app.post("/api/tts")
async def tts_batch(request: TTSRequest):
    # Batch fallback for non-streaming use cases
    ...
```
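
Until a streaming generator is available, the service can still approximate incremental delivery by slicing the finished waveform into fixed-duration chunks before sending. A minimal sketch, assuming the output of `wavs[0].tobytes()` is mono float32 PCM (4 bytes per sample), which should be verified against the actual array dtype; `pcm_chunks` is an illustrative helper, not part of `qwen-tts`.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 4  # assumes float32 samples


def pcm_chunks(
    pcm: bytes,
    sample_rate_hz: int = 24_000,
    chunk_ms: int = 80,
) -> Iterator[bytes]:
    """Split raw PCM bytes into chunks of roughly `chunk_ms` milliseconds."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    step = samples_per_chunk * BYTES_PER_SAMPLE
    if step <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")

    # The final chunk may be shorter than the rest.
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]
```

Inside the WebSocket handler this would replace the single `send_bytes` call with a loop over `pcm_chunks(...)`. It does not reduce time to first audio, since synthesis still completes first, but it lets the client begin playback without buffering one large message.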

### 5.3 Rust Integration

**New Endpoint: `/api/v1/tts`**

```rust
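// server/handlers/tts.rs
use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use futures_util::{SinkExt, StreamExt};

pub async fn tts_websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(move |socket| handle_tts_socket(socket, state))
}

async fn handle_tts_socket(socket: WebSocket, state: SharedState) {
    // Proxy WebSocket to Python TTS service
    let (mut sender, mut receiver) = socket.split();

    // Stop on the first receive error; the connection is no longer usable
    while let Some(Ok(msg)) = receiver.next().await {
        if let Message::Text(text) = msg {
            // Forward to Python service, holding the client lock per request
            // rather than for the lifetime of the connection
            let response = {
                let tts_client = state.tts_client.lock().await;
                tts_client.generate(text).await
            };

            // Stream audio back; stop if the client has gone away
            for chunk in response.audio_chunks {
                if sender.send(Message::Binary(chunk)).await.is_err() {
                    return;
                }
            }
        }
    }
}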
```

### 5.4 Voice Prompt Caching

```rust
// Pre-computed voice prompts stored in state
pub struct TtsConfig {
    pub default_voice: String,
    pub voices: HashMap<String, VoicePrompt>,
}

pub struct VoicePrompt {
    pub name: String,
    pub ref_audio_path: PathBuf,
    pub ref_text: String,
    pub language: String,
    // Cached prompt from Python service
    pub cached_prompt_id: Option<String>,
}
```

### 5.5 Message Protocol

**Client -> Server:**
```json
{
  "type": "synthesize",
  "text": "Hello, I am Makima.",
  "voice": "makima",
  "language": "English",
  "stream": true
}
```

**Server -> Client:**
```json
// Audio chunk
{"type": "audio", "data": "<base64 PCM>", "sample_rate": 24000, "final": false}

// Completion
{"type": "complete", "duration_ms": 1234}

// Error
{"type": "error", "code": "SYNTHESIS_ERROR", "message": "..."}
```
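
The server-to-client messages above can be produced and consumed with the standard library alone. The sketch below is one possible encoding, assuming the `data` field carries standard base64 of the raw PCM bytes; the helper names are illustrative.

```python
import base64
import json


def audio_message(pcm: bytes, sample_rate: int = 24_000, final: bool = False) -> str:
    """Serialize one audio chunk as a JSON text message."""
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
        "final": final,
    })


def parse_server_message(raw: str) -> dict:
    """Parse a server message, decoding audio payloads back to bytes."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "audio":
        msg["data"] = base64.b64decode(msg["data"])
    elif kind not in ("complete", "error"):
        raise ValueError(f"unknown message type: {kind!r}")
    return msg
```

Base64 inflates the payload by about a third compared with raw bytes, so binary WebSocket frames with a small JSON header message may be worth considering if bandwidth matters.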

---

## 6. Implementation Phases

### Phase 1: Research & Setup (Current)
- [x] Analyze current TTS implementation
- [x] Research Qwen3-TTS capabilities
- [x] Document feasibility and approach
- [ ] Acquire Makima voice reference clips
- [ ] Test voice cloning quality

### Phase 2: Python Service
- [ ] Create Python TTS service with FastAPI
- [ ] Implement batch TTS endpoint
- [ ] Implement WebSocket streaming (when available)
- [ ] Add voice prompt management
- [ ] GPU optimization with FlashAttention 2

### Phase 3: Rust Integration
- [ ] Add TTS proxy endpoints to makima server
- [ ] WebSocket bridge implementation
- [ ] Voice configuration management
- [ ] Error handling and fallbacks

### Phase 4: Production Ready
- [ ] Health checks for Python service
- [ ] Voice prompt caching optimization
- [ ] Latency benchmarking
- [ ] Integration with Listen page

---

## 7. Open Questions

1. **Streaming API Availability**: When will official streaming support be released?
   - Fallback: Use community fork or batch with chunked responses

2. **Voice Quality**: How well does Japanese VA voice clone to English speech?
   - Action: Test with both Japanese and English VA samples

3. **GPU Requirements**: What's the minimum VRAM for 0.6B model?
   - Estimate: ~2-4GB at bf16 precision

4. **Latency Target**: What's acceptable for "close to live" TTS?
   - Proposal: <500ms first audio, <100ms subsequent chunks

5. **Transcript Acquisition**: How to obtain accurate transcripts for voice clips?
   - Options: Manual transcription, Whisper ASR, community resources
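
For question 3, the weight footprint alone can be bounded with simple arithmetic. This is a back-of-the-envelope sketch, assuming the nominal 0.6B parameter count and 2 bytes per bf16 parameter; activations, the KV cache, the codec decoder, and CUDA context overhead come on top and are not modelled here.

```python
PARAMS = 0.6e9            # nominal parameter count from Section 2.1
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 16 bits

weights_bytes = PARAMS * BYTES_PER_PARAM_BF16
weights_gib = weights_bytes / 2**30  # about 1.12 GiB for weights only
```

The gap between this figure and the 2-4GB estimate is the runtime overhead, which is best confirmed by measuring peak memory during an actual synthesis run.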

---

## 8. References

- [Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report (arXiv)](https://arxiv.org/abs/2601.15621)
- [Streaming Inference Issue #77](https://github.com/QwenLM/Qwen3-TTS/issues/77)
- [Community Streaming Fork](https://github.com/dffdeeq/Qwen3-TTS-streaming)
- [Makima Voice Actors](https://www.behindthevoiceactors.com/characters/Chainsaw-Man/Makima/)
- [Chatterbox-Turbo-ONNX (Current Model)](https://huggingface.co/ResembleAI/chatterbox-turbo-ONNX)