# TTS Research: Qwen3-TTS-12Hz-0.6B-Base Integration
## Executive Summary
This document evaluates replacing the current Chatterbox TTS implementation with Qwen3-TTS-12Hz-0.6B-Base for the makima system. The goal is to enable near-real-time voice synthesis with voice cloning capabilities, defaulting to Makima's Japanese voice (Tomori Kusunoki) speaking English.
**Key Findings:**
- Qwen3-TTS is designed for streaming synthesis (~97ms advertised latency), whereas the current Chatterbox implementation is batch-only; official streaming inference code is not yet released (see Section 3.1)
- Voice cloning requires only 3 seconds of reference audio
- No official ONNX export exists; Python/PyTorch inference required
- The 0.6B model is optimized for resource-constrained environments
---
## 1. Current TTS Implementation Analysis
### 1.1 Architecture Overview
The current implementation uses **Chatterbox-Turbo-ONNX** from ResembleAI:
```
Location: makima/src/tts.rs
Model ID: ResembleAI/chatterbox-turbo-ONNX
Sample Rate: 24,000 Hz
```
**Components:**
| File | Size | Purpose |
|-----------|------|---------|
| `speech_encoder.onnx` | ~XX MB | Encodes reference audio to speaker embeddings |
| `embed_tokens.onnx` | ~XX MB | Token embedding layer |
| `language_model.onnx` | ~XX MB | Autoregressive text-to-speech token generation |
| `conditional_decoder.onnx` | ~XX MB | Converts speech tokens to waveform |
| `tokenizer.json` | ~KB | Text tokenization |
### 1.2 Current API Surface
```rust
pub struct ChatterboxTTS {
    speech_encoder: Session,
    embed_tokens: Session,
    language_model: Session,
    conditional_decoder: Session,
    tokenizer: Tokenizer,
}

impl ChatterboxTTS {
    // Load from pretrained models
    pub fn from_pretrained(model_dir: Option<&str>) -> Result<Self, TtsError>;

    // Generate speech (requires voice reference)
    pub fn generate_tts(&mut self, _text: &str) -> Result<Vec<f32>, TtsError>;

    // Voice cloning from file path
    pub fn generate_tts_with_voice(
        &mut self,
        text: &str,
        sample_audio_path: &Path,
    ) -> Result<Vec<f32>, TtsError>;

    // Voice cloning from raw samples
    pub fn generate_tts_with_samples(
        &mut self,
        text: &str,
        samples: &[f32],
        sample_rate: u32,
    ) -> Result<Vec<f32>, TtsError>;
}
```
### 1.3 Current Capabilities
| Feature | Supported | Notes |
|---------|-----------|-------|
| Voice Cloning | **Yes** | Required for all synthesis |
| Streaming | **No** | Batch generation only |
| Languages | Limited | English-focused |
| ONNX Runtime | **Yes** | CPU inference via `ort` crate |
| GPU Acceleration | Partial | ONNX Runtime supports the CUDA execution provider; not enabled in the current build |
| Real-time Factor | Unknown | Not benchmarked |
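
Since the real-time factor is listed as unbenchmarked, it may help to pin down the metric first. RTF is conventionally the synthesis wall-clock time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The helper below is a minimal sketch; `real_time_factor` is a hypothetical name, not part of the makima codebase, and the 24 kHz default comes from the sample rate listed in Section 1.1.

```python
def real_time_factor(
    synthesis_seconds: float,
    num_samples: int,
    sample_rate: int = 24_000,
) -> float:
    """Return synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # 2 seconds of 24 kHz audio generated in 0.5 seconds of wall-clock time
    print(real_time_factor(0.5, 48_000))
```

The same function can be applied to either backend, which would make the Chatterbox and Qwen3-TTS numbers directly comparable.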
### 1.4 Integration Points
The TTS module is currently:
- Exposed as `pub mod tts` in `lib.rs`
- Used in `main.rs` for testing
- **Not integrated with the web server** (no `/api/v1/tts` endpoint)
The audio processing infrastructure is shared with the Listen (STT) module:
- `audio.rs` provides format conversion utilities
- `symphonia` for decoding various audio formats
- Resampling to target sample rates (16kHz for STT, 24kHz for TTS)
---
## 2. Qwen3-TTS-12Hz-0.6B-Base Analysis
### 2.1 Model Overview
**Source:** [Hugging Face - Qwen/Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
| Specification | Value |
|---------------|-------|
| Parameters | 0.6B |
| Release Date | January 22, 2026 |
| Architecture | Dual-Track hybrid streaming LM |
| Tokenizer | Qwen3-TTS-Tokenizer-12Hz |
| Frame Rate | 12.5 Hz |
| Output Sample Rate | 24 kHz |
| Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |
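
The frame rate and output sample rate in the table imply how much audio each codec frame represents. The arithmetic below is a small sketch derived only from those two values; the function names are hypothetical, and whether the decoder actually emits audio one frame at a time is not confirmed here.

```python
def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of audio covered by one codec frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: float) -> int:
    """Number of output samples covered by one codec frame."""
    return round(sample_rate_hz / frame_rate_hz)


if __name__ == "__main__":
    print(frame_duration_ms(12.5))          # milliseconds per frame
    print(samples_per_frame(24_000, 12.5))  # samples per frame at 24 kHz
```

At 12.5 Hz and 24 kHz this works out to 80 ms and 1920 samples per frame, which is a natural unit for sizing streamed audio chunks.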
### 2.2 Key Features
| Feature | Status | Details |
|---------|--------|---------|
| **Voice Cloning** | Yes | 3-second minimum reference audio |
| **Streaming** | Yes | 97ms end-to-end latency |
| **Real-time** | Yes | First audio packet after single character |
| **Multilingual** | Yes | 10 languages supported |
| **Instruction Control** | No | Base model limitation |
### 2.3 Streaming Architecture
The Dual-Track architecture enables:
1. **Streaming text input** - Processes text incrementally
2. **Streaming audio output** - Emits audio packets as generated
3. **Multi-Token Prediction (MTP)** - Enables immediate speech decoding from first codec frame
**Latency Benchmarks:**
- Advertised first-packet latency: ~97ms (end-to-end)
- Optimized time to first token (TTFT): ~170ms on RTX 5090 (community fork)
- Initial implementations: ~800ms TTFT (before optimization)
### 2.4 Voice Cloning Requirements
| Requirement | Specification |
|-------------|---------------|
| Reference Audio Length | **3 seconds minimum** |
| Audio Format | WAV, MP3, or other common formats |
| Input Methods | File path, URL, base64, numpy array |
| Reference Text | **Required** (transcript of reference audio) |
| X-Vector Only Mode | Optional (speaker embedding only, lower quality) |
### 2.5 Python API
```python
import torch
from qwen_tts import Qwen3TTSModel

# Load model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Voice cloning
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test.",
    language="English",
    ref_audio="reference.wav",
    ref_text="Original speaker text from reference",
)

# Reusable prompt (efficient for multiple generations)
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Reference transcript",
)
wavs, sr = model.generate_voice_clone(
    text="New text",
    language="English",
    voice_clone_prompt=prompt,
)
```
### 2.6 Dependencies
```
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation # Optional, recommended
```
**Requirements:**
- Python 3.12 recommended
- CUDA-capable GPU (for optimal performance)
- FlashAttention 2 compatible hardware (only needed when the optional `flash-attn` package is used)
- PyTorch with bfloat16 support
---
## 3. Feasibility Assessment
### 3.1 Streaming/Live TTS Feasibility
**Assessment: FEASIBLE with caveats**
| Factor | Current State | Path Forward |
|--------|---------------|--------------|
| Streaming API | Experimental (community fork) | Use [dffdeeq/Qwen3-TTS-streaming](https://github.com/dffdeeq/Qwen3-TTS-streaming) or wait for official support |
| Latency Target | 97ms advertised | Achievable with optimization |
| First Token | ~170ms optimized | Acceptable for conversational use |
| Audio Stability | First 1-2s may have timbre issues | Known limitation, may need buffering |
**Streaming Implementation Status:**
- Official repository: Streaming documented but **not released**
- Community fork: Working implementation with ~170ms TTFT
- vLLM-Omni: Offline inference only (online serving planned)
### 3.2 Voice Cloning for Makima
**Assessment: FULLY FEASIBLE**
Requirements for Makima voice cloning:
1. **3+ seconds of clean audio** - Tomori Kusunoki (Japanese VA) speaking
2. **Transcript of the audio** - Required for full quality
3. **Audio format** - WAV/MP3 acceptable
**Audio Sources:**
- Chainsaw Man anime clips
- Voice actress promotional material
- Behind The Voice Actors database
**Considerations:**
- Cross-lingual cloning (Japanese reference audio, English output) may work with an explicit `language="English"` setting, but this is untested
- May need English-speaking Makima clips (Suzie Yeung, English dub VA) as fallback
- X-vector mode available if transcript is difficult to obtain
### 3.3 Integration Complexity
| Component | Complexity | Notes |
|-----------|------------|-------|
| Model Loading | Medium | Python subprocess or PyO3 bridge required |
| Streaming API | High | WebSocket integration needed |
| Voice Caching | Low | `create_voice_clone_prompt()` supports this |
| Audio Format | Low | Both use 24kHz output |
| ONNX Migration | N/A | **No ONNX export available** |
### 3.4 ONNX vs Python Inference
**Current approach (Chatterbox):** Rust + ONNX Runtime
- Pros: Native Rust, low latency, CPU-friendly
- Cons: Limited model ecosystem, no streaming
**Required approach (Qwen3-TTS):** Python + PyTorch
- Pros: Full model access, streaming support, GPU acceleration
- Cons: Python subprocess overhead, dependency management
**Integration Options:**
1. **Python Subprocess/Service**
   - Run `qwen-tts` as separate Python service
   - Communicate via HTTP/WebSocket
   - Cleanest separation, easiest to implement
2. **PyO3 Bridge**
   - Embed Python in Rust binary
   - Higher complexity, tighter integration
   - May have GIL contention issues
3. **Custom ONNX Export** (Future)
   - Not currently available
   - Would require community effort
   - No timeline from Qwen team
**Recommendation:** Python service with WebSocket streaming
---
## 4. Audio Clip Requirements
### 4.1 For Voice Cloning Setup
| Requirement | Specification |
|-------------|---------------|
| Minimum Duration | 3 seconds |
| Recommended Duration | 5-10 seconds |
| Format | WAV (preferred), MP3 |
| Sample Rate | Any (will be resampled) |
| Content | Clear speech, minimal background noise |
| Transcript | Required for full quality |
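
The minimum-duration requirement can be checked before a clip is handed to the model. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only (MP3 and other formats would need a decoder). The function names and the `MIN_REF_SECONDS` constant are hypothetical; the 3-second threshold is the minimum from the table above.

```python
import wave

MIN_REF_SECONDS = 3.0  # minimum reference duration from the requirements table


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str, min_seconds: float = MIN_REF_SECONDS) -> float:
    """Return the clip duration, or raise ValueError if the clip is too short."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"reference clip is {duration:.2f}s; need at least {min_seconds:.1f}s"
        )
    return duration
```

This only validates length. Background noise and transcript accuracy still need a manual listen.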
### 4.2 Makima Voice Sources
**Priority 1: Japanese VA (Tomori Kusunoki) speaking Japanese**
- Source: Chainsaw Man anime
- Challenge: Need clear dialogue without music/SFX
- Risk: May not transfer well to English output
**Priority 2: English VA (Suzie Yeung)**
- Source: Chainsaw Man English dub
- Advantage: Native English speaker for English output
- Trade-off: Different vocal characteristics from Japanese VA
**Recommended Approach:**
1. Extract 5-10 second clips of both VAs
2. Test voice cloning quality with each
3. Select based on English speech naturalness
4. Store reference audio + transcript in `models/voices/makima/`
### 4.3 Transcript Requirements
For optimal voice cloning:
```
ref_audio: "models/voices/makima/reference.wav"
ref_text: "The exact words spoken in the reference audio"
```
X-vector fallback (lower quality, no transcript needed):
```python
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    x_vector_only_mode=True,  # No transcript required
)
```
---
## 5. Preliminary Technical Approach
### 5.1 Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                    Makima Server (Rust)                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐   ┌──────────────────────┐│
│  │ Listen (STT)│  │ TTS Proxy   │   │ Chat/Contract APIs   ││
│  │ /api/v1/    │  │ /api/v1/tts │   │ /api/v1/...          ││
│  │ listen      │  │             │   │                      ││
│  └──────┬──────┘  └──────┬──────┘   └──────────────────────┘│
│         │                │                                  │
│         │         ┌──────▼──────┐                           │
│         │         │  WebSocket  │                           │
│         │         │   Bridge    │                           │
│         │         └──────┬──────┘                           │
└─────────┼────────────────┼──────────────────────────────────┘
          │                │
          │         ┌──────▼──────┐
          │         │ Python TTS  │
          │         │  Service    │
          │         │ (Qwen3-TTS) │
          │         └─────────────┘
          │
   ┌──────▼──────┐
   │  ML Models  │
   │ (Parakeet,  │
   │ Sortformer) │
   └─────────────┘
```
### 5.2 Python TTS Service
**Proposed Architecture:**
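```python
# tts_service.py
import asyncio

import torch
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel
from qwen_tts import Qwen3TTSModel

app = FastAPI()
model = None
voice_prompts = {}


class TTSRequest(BaseModel):
    # Request body for the batch endpoint; mirrors the WebSocket message fields
    text: str
    voice: str = "makima"
    language: str = "English"


@app.on_event("startup")
async def load_model():
    global model
    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    # Pre-load Makima voice prompt
    voice_prompts["makima"] = model.create_voice_clone_prompt(
        ref_audio="models/voices/makima/reference.wav",
        ref_text="[Makima reference transcript]",
    )


@app.websocket("/ws/tts")
async def tts_stream(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            text = data["text"]
            voice = data.get("voice", "makima")
            language = data.get("language", "English")

            prompt = voice_prompts.get(voice)
            if prompt is None:
                await websocket.send_json(
                    {"type": "error", "code": "SYNTHESIS_ERROR", "message": f"unknown voice: {voice}"}
                )
                continue

            # Batch generation for now (streaming when available). Run in a worker
            # thread so the blocking call does not stall the event loop.
            wavs, sr = await asyncio.to_thread(
                model.generate_voice_clone,
                text=text,
                language=language,
                voice_clone_prompt=prompt,
            )
            # Send the whole waveform as a single binary message
            await websocket.send_bytes(wavs[0].tobytes())
    except WebSocketDisconnect:
        pass  # Client closed the connection


@app.post("/api/tts")
async def tts_batch(request: TTSRequest):
    # Batch fallback for non-streaming use cases
    ...
```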
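
Until a streaming API is available, the service could approximate streaming by slicing the finished waveform into small chunks before sending, which is the "batch with chunked responses" fallback noted in Section 7. The sketch below is one way to do that. It assumes the waveform has already been serialized with `tobytes()` and that samples are 4-byte mono values (for example float32), which is an assumption about the output of `generate_voice_clone` rather than a documented fact. `iter_pcm_chunks` is a hypothetical helper name.

```python
from typing import Iterator


def iter_pcm_chunks(
    pcm: bytes,
    sample_rate: int = 24_000,
    chunk_ms: int = 80,
    bytes_per_sample: int = 4,  # assumption: mono float32 samples
) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`, each covering about `chunk_ms` of audio."""
    if sample_rate <= 0 or chunk_ms <= 0 or bytes_per_sample <= 0:
        raise ValueError("sample_rate, chunk_ms and bytes_per_sample must be positive")
    samples_per_chunk = max(1, sample_rate * chunk_ms // 1000)
    chunk_bytes = samples_per_chunk * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        # The final slice may be shorter than chunk_bytes
        yield pcm[start:start + chunk_bytes]
```

In the WebSocket handler, each yielded chunk would be passed to `websocket.send_bytes(...)` in turn. This reduces the size of each message but not the time to first audio, since generation still completes before the first chunk is sent.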
### 5.3 Rust Integration
**New Endpoint: `/api/v1/tts`**
```rust
// server/handlers/tts.rs
use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use futures::{SinkExt, StreamExt};

pub async fn tts_websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(|socket| handle_tts_socket(socket, state))
}

async fn handle_tts_socket(socket: WebSocket, state: SharedState) {
    // Proxy WebSocket to Python TTS service
    let (mut sender, mut receiver) = socket.split();
    while let Some(msg) = receiver.next().await {
        match msg {
            Ok(Message::Text(text)) => {
                // Forward to Python service; hold the client lock only for this request
                let response = state.tts_client.lock().await.generate(text).await;
                // Stream audio back, stopping once the client has gone away
                for chunk in response.audio_chunks {
                    if sender.send(Message::Binary(chunk)).await.is_err() {
                        return;
                    }
                }
            }
            Ok(Message::Close(_)) | Err(_) => break,
            _ => {}
        }
    }
}
```
### 5.4 Voice Prompt Caching
```rust
use std::collections::HashMap;
use std::path::PathBuf;

// Pre-computed voice prompts stored in state
pub struct TtsConfig {
    pub default_voice: String,
    pub voices: HashMap<String, VoicePrompt>,
}

pub struct VoicePrompt {
    pub name: String,
    pub ref_audio_path: PathBuf,
    pub ref_text: String,
    pub language: String,
    // Cached prompt from Python service
    pub cached_prompt_id: Option<String>,
}
```
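
On the Python side, the matching cache could be a small wrapper that builds each prompt once and reuses it afterwards. The class below is a sketch under stated assumptions: `VoicePromptCache` and its `build_prompt` callback are hypothetical names, and in the real service the callback would presumably wrap `model.create_voice_clone_prompt(ref_audio=..., ref_text=...)`.

```python
from typing import Any, Callable, Dict, Tuple


class VoicePromptCache:
    """Build each voice prompt once and reuse it for later requests."""

    def __init__(self, build_prompt: Callable[[str, str], Any]) -> None:
        # build_prompt(ref_audio, ref_text) is supplied by the caller
        self._build_prompt = build_prompt
        self._prompts: Dict[Tuple[str, str], Any] = {}

    def get(self, ref_audio: str, ref_text: str) -> Any:
        """Return the cached prompt for this reference, building it on first use."""
        key = (ref_audio, ref_text)
        if key not in self._prompts:
            self._prompts[key] = self._build_prompt(ref_audio, ref_text)
        return self._prompts[key]

    def __len__(self) -> int:
        return len(self._prompts)
```

Keying on both the audio path and the transcript means that editing a transcript produces a fresh prompt instead of silently reusing a stale one.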
### 5.5 Message Protocol
**Client -> Server:**
```json
{
"type": "synthesize",
"text": "Hello, I am Makima.",
"voice": "makima",
"language": "English",
"stream": true
}
```
**Server -> Client** (one JSON object per message; the `//` lines are annotations, not part of the payload):
```json
// Audio chunk
{"type": "audio", "data": "<base64 PCM>", "sample_rate": 24000, "final": false}
// Completion
{"type": "complete", "duration_ms": 1234}
// Error
{"type": "error", "code": "SYNTHESIS_ERROR", "message": "..."}
```
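
To make the `audio` message concrete, the sketch below encodes and decodes one chunk using only the standard library. The protocol above only says "base64 PCM", so the sample format here (16-bit signed little-endian, mono) is an assumption, and both function names are hypothetical.

```python
import base64
import json
import struct
from typing import List


def encode_audio_message(
    samples: List[float], sample_rate: int = 24_000, final: bool = False
) -> str:
    """Pack float samples in [-1, 1] as 16-bit LE PCM inside an `audio` message."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    return json.dumps(
        {
            "type": "audio",
            "data": base64.b64encode(pcm).decode("ascii"),
            "sample_rate": sample_rate,
            "final": final,
        }
    )


def decode_audio_message(message: str) -> List[float]:
    """Return float samples from an `audio` message produced by the encoder above."""
    payload = json.loads(message)
    if payload.get("type") != "audio":
        raise ValueError("not an audio message")
    pcm = base64.b64decode(payload["data"])
    count = len(pcm) // 2
    return [value / 32767 for value in struct.unpack(f"<{count}h", pcm[: count * 2])]
```

Base64 inflates the payload by roughly a third compared with raw binary frames, so binary WebSocket messages for audio plus JSON for control messages may be worth considering.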
---
## 6. Implementation Phases
### Phase 1: Research & Setup (Current)
- [x] Analyze current TTS implementation
- [x] Research Qwen3-TTS capabilities
- [x] Document feasibility and approach
- [ ] Acquire Makima voice reference clips
- [ ] Test voice cloning quality
### Phase 2: Python Service
- [ ] Create Python TTS service with FastAPI
- [ ] Implement batch TTS endpoint
- [ ] Implement WebSocket streaming (when available)
- [ ] Add voice prompt management
- [ ] GPU optimization with FlashAttention 2
### Phase 3: Rust Integration
- [ ] Add TTS proxy endpoints to makima server
- [ ] WebSocket bridge implementation
- [ ] Voice configuration management
- [ ] Error handling and fallbacks
### Phase 4: Production Ready
- [ ] Health checks for Python service
- [ ] Voice prompt caching optimization
- [ ] Latency benchmarking
- [ ] Integration with Listen page
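
For the latency benchmarking item, a small timing helper may be enough to start with. The sketch below measures the time until the first chunk arrives and the largest gap between later chunks, which correspond to the two targets proposed in Section 7 (<500ms first audio, <100ms subsequent chunks). `measure_chunk_latency` is a hypothetical helper; it works with any iterable that yields audio chunks.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_chunk_latency(chunks: Iterable[bytes]) -> Tuple[float, float]:
    """Return (first_chunk_ms, max_gap_ms) while consuming an iterable of audio chunks."""
    start = time.perf_counter()
    previous = start
    first_ms: Optional[float] = None
    max_gap_ms = 0.0
    for _ in chunks:
        now = time.perf_counter()
        if first_ms is None:
            # Time from the start of iteration to the first chunk
            first_ms = (now - start) * 1000.0
        else:
            # Largest pause between two consecutive chunks
            max_gap_ms = max(max_gap_ms, (now - previous) * 1000.0)
        previous = now
    if first_ms is None:
        raise ValueError("no chunks were produced")
    return first_ms, max_gap_ms
```

Because the clock starts when iteration begins, the request should be issued inside the iterable (for example, in a generator) so that network and model time are both included.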
---
## 7. Open Questions
1. **Streaming API Availability**: When will official streaming support be released?
   - Fallback: Use community fork or batch with chunked responses
2. **Voice Quality**: How well does a Japanese VA voice clone carry over to English speech?
   - Action: Test with both Japanese and English VA samples
3. **GPU Requirements**: What's the minimum VRAM for the 0.6B model?
   - Estimate: ~2-4GB with bf16 weights (unmeasured)
4. **Latency Target**: What's acceptable for "close to live" TTS?
   - Proposal: <500ms first audio, <100ms subsequent chunks
5. **Transcript Acquisition**: How to obtain accurate transcripts for voice clips?
   - Options: Manual transcription, Whisper ASR, community resources
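
For the GPU requirements question, a weights-only lower bound is easy to compute: 0.6B parameters at 2 bytes each (bf16) is about 1.2 GB. The sketch below does only that arithmetic; `weight_memory_gib` is a hypothetical helper. Activations, the KV cache, the speech tokenizer/decoder, and CUDA context overhead all come on top, so the 2-4GB estimate remains a guess until measured.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for model weights alone, in GiB (bf16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    # 0.6B parameters in bf16
    print(f"{weight_memory_gib(0.6e9):.2f} GiB")
```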
---
## 8. References
- [Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report (arXiv)](https://arxiv.org/abs/2601.15621)
- [Streaming Inference Issue #77](https://github.com/QwenLM/Qwen3-TTS/issues/77)
- [Community Streaming Fork](https://github.com/dffdeeq/Qwen3-TTS-streaming)
- [Makima Voice Actors](https://www.behindthevoiceactors.com/characters/Chainsaw-Man/Makima/)
- [Chatterbox-Turbo-ONNX (Current Model)](https://huggingface.co/ResembleAI/chatterbox-turbo-ONNX)