# TTS Research: Qwen3-TTS-12Hz-0.6B-Base Integration

## Executive Summary

This document evaluates replacing the current Chatterbox TTS implementation with Qwen3-TTS-12Hz-0.6B-Base for the makima system. The goal is to enable near-real-time voice synthesis with voice cloning, defaulting to Makima's Japanese voice (Tomori Kusunoki) speaking English.

**Key Findings:**
- Qwen3-TTS is designed for streaming output (advertised first-packet latency of ~97ms), whereas the current Chatterbox implementation is batch-only
- Voice cloning needs as little as 3 seconds of reference audio, plus a transcript for full quality
- No official ONNX export exists; Python/PyTorch inference is required
- The 0.6B model is optimized for resource-constrained environments

---

## 1. Current TTS Implementation Analysis

### 1.1 Architecture Overview

The current implementation uses **Chatterbox-Turbo-ONNX** from ResembleAI:

```
Location: makima/src/tts.rs
Model ID: ResembleAI/chatterbox-turbo-ONNX
Sample Rate: 24,000 Hz
```

**Components:**

| Component | Size | Purpose |
|-----------|------|---------|
| `speech_encoder.onnx` | ~XX MB | Encodes reference audio to speaker embeddings |
| `embed_tokens.onnx` | ~XX MB | Token embedding layer |
| `language_model.onnx` | ~XX MB | Autoregressive text-to-speech token generation |
| `conditional_decoder.onnx` | ~XX MB | Converts speech tokens to waveform |
| `tokenizer.json` | ~KB | Text tokenization |

### 1.2 Current API Surface

```rust
// Signatures only (bodies omitted).
// Return element type is assumed to be f32 PCM samples, matching the `samples` input.
pub struct ChatterboxTTS {
    speech_encoder: Session,
    embed_tokens: Session,
    language_model: Session,
    conditional_decoder: Session,
    tokenizer: Tokenizer,
}

impl ChatterboxTTS {
    // Load from pretrained models
    pub fn from_pretrained(model_dir: Option<&str>) -> Result<Self, TtsError>;

    // Generate speech (requires voice reference)
    pub fn generate_tts(&mut self, _text: &str) -> Result<Vec<f32>, TtsError>;

    // Voice cloning from file path
    pub fn generate_tts_with_voice(
        &mut self,
        text: &str,
        sample_audio_path: &Path,
    ) -> Result<Vec<f32>, TtsError>;

    // Voice cloning from raw samples
    pub fn generate_tts_with_samples(
        &mut self,
        text: &str,
        samples: &[f32],
        sample_rate: u32,
    ) -> Result<Vec<f32>, TtsError>;
}
```

### 1.3 Current Capabilities

| Feature | Supported | Notes |
|---------|-----------|-------|
| Voice Cloning | **Yes** | Required for all synthesis |
| Streaming | **No** | Batch generation only |
| Languages | Limited | English-focused |
| ONNX Runtime | **Yes** | CPU inference via `ort` crate |
| GPU Acceleration | Partial | ONNX supports CUDA EP |
| Real-time Factor | Unknown | Not benchmarked |

### 1.4 Integration Points

The TTS module is currently:
- Exposed as `pub mod tts` in `lib.rs`
- Used in `main.rs` for testing
- **Not integrated with the web server** (no `/api/v1/tts` endpoint)

The audio processing infrastructure is shared with the Listen (STT) module:
- `audio.rs` provides format conversion utilities
- `symphonia` decodes various audio formats
- Audio is resampled to the target sample rate (16kHz for STT, 24kHz for TTS)

---

## 2. Qwen3-TTS-12Hz-0.6B-Base Analysis

### 2.1 Model Overview

**Source:** [Hugging Face - Qwen/Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)

| Specification | Value |
|---------------|-------|
| Parameters | 0.6B |
| Release Date | January 22, 2026 |
| Architecture | Dual-Track hybrid streaming LM |
| Tokenizer | Qwen3-TTS-Tokenizer-12Hz |
| Frame Rate | 12.5 Hz |
| Output Sample Rate | 24 kHz |
| Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |

### 2.2 Key Features

| Feature | Status | Details |
|---------|--------|---------|
| **Voice Cloning** | Yes | 3-second minimum reference audio |
| **Streaming** | Yes | 97ms end-to-end latency (advertised) |
| **Real-time** | Yes | First audio packet after a single input character |
| **Multilingual** | Yes | 10 languages supported |
| **Instruction Control** | No | Base model limitation |

### 2.3 Streaming Architecture

The Dual-Track architecture enables:
1. **Streaming text input** - Processes text incrementally
2. **Streaming audio output** - Emits audio packets as they are generated
3. **Multi-Token Prediction (MTP)** - Enables immediate speech decoding from the first codec frame

**Latency Benchmarks:**
- First-packet latency: ~97ms end-to-end (advertised)
- Optimized time to first token (TTFT): ~170ms on RTX 5090 (community fork)
- Initial implementations: ~800ms TTFT (before optimization)

### 2.4 Voice Cloning Requirements

| Requirement | Specification |
|-------------|---------------|
| Reference Audio Length | **3 seconds minimum** |
| Audio Format | WAV, MP3, or other common formats |
| Input Methods | File path, URL, base64, numpy array |
| Reference Text | **Required** (transcript of reference audio) |
| X-Vector Only Mode | Optional (speaker embedding only, lower quality) |

### 2.5 Python API

```python
import torch
from qwen_tts import Qwen3TTSModel

# Load model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Voice cloning
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test.",
    language="English",
    ref_audio="reference.wav",
    ref_text="Original speaker text from reference",
)

# Reusable prompt (efficient for multiple generations)
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Reference transcript",
)
wavs, sr = model.generate_voice_clone(
    text="New text",
    language="English",
    voice_clone_prompt=prompt,
)
```

### 2.6 Dependencies

```
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # Optional, recommended
```

**Requirements:**
- Python 3.12 recommended
- CUDA-capable GPU (for optimal performance)
- FlashAttention 2 compatible hardware
- PyTorch with bfloat16 support

---

## 3. Feasibility Assessment

### 3.1 Streaming/Live TTS Feasibility

**Assessment: FEASIBLE with caveats**

| Factor | Current State | Path Forward |
|--------|---------------|--------------|
| Streaming API | Experimental (community fork) | Use [dffdeeq/Qwen3-TTS-streaming](https://github.com/dffdeeq/Qwen3-TTS-streaming) or wait for official support |
| Latency Target | 97ms advertised | Achievable with optimization |
| First Token | ~170ms optimized | Acceptable for conversational use |
| Audio Stability | First 1-2s may have timbre issues | Known limitation, may need buffering |

**Streaming Implementation Status:**
- Official repository: Streaming is documented but **not released**
- Community fork: Working implementation with ~170ms TTFT
- vLLM-Omni: Offline inference only (online serving planned)

### 3.2 Voice Cloning for Makima

**Assessment: FULLY FEASIBLE**

Requirements for Makima voice cloning:
1. **3+ seconds of clean audio** - Tomori Kusunoki (Japanese VA) speaking
2. **Transcript of the audio** - Required for full quality
3. **Audio format** - WAV/MP3 acceptable

**Audio Sources:**
- Chainsaw Man anime clips
- Voice actress promotional material
- Behind The Voice Actors database

**Considerations:**
- A Japanese-language reference may still produce usable English speech when `language="English"` is set explicitly; this needs testing
- English-speaking Makima clips (Suzie Yeung, English dub VA) may be needed as a fallback
- X-vector mode is available if a transcript is difficult to obtain

### 3.3 Integration Complexity

| Component | Complexity | Notes |
|-----------|------------|-------|
| Model Loading | Medium | Python subprocess or PyO3 bridge required |
| Streaming API | High | WebSocket integration needed |
| Voice Caching | Low | `create_voice_clone_prompt()` supports this |
| Audio Format | Low | Both models output 24kHz audio |
| ONNX Migration | N/A | **No ONNX export available** |

### 3.4 ONNX vs Python Inference

**Current approach (Chatterbox):** Rust + ONNX Runtime
- Pros: Native Rust, low overhead, CPU-friendly
- Cons: Limited model ecosystem, no streaming

**Required approach (Qwen3-TTS):** Python + PyTorch
- Pros: Full model access, streaming support, GPU acceleration
- Cons: Python subprocess overhead, dependency management

**Integration Options:**

1. **Python Subprocess/Service**
   - Run `qwen-tts` as a separate Python service
   - Communicate via HTTP/WebSocket
   - Cleanest separation, easiest to implement

2. **PyO3 Bridge**
   - Embed Python in the Rust binary
   - Higher complexity, tighter integration
   - May have GIL contention issues

3. **Custom ONNX Export** (Future)
   - Not currently available
   - Would require community effort
   - No timeline from the Qwen team

**Recommendation:** Python service with WebSocket streaming

---

## 4. Audio Clip Requirements

### 4.1 For Voice Cloning Setup

| Requirement | Specification |
|-------------|---------------|
| Minimum Duration | 3 seconds |
| Recommended Duration | 5-10 seconds |
| Format | WAV (preferred), MP3 |
| Sample Rate | Any (will be resampled) |
| Content | Clear speech, minimal background noise |
| Transcript | Required for full quality |

### 4.2 Makima Voice Sources

**Priority 1: Japanese VA (Tomori Kusunoki) speaking Japanese**
- Source: Chainsaw Man anime
- Challenge: Need clear dialogue without music/SFX
- Risk: May not transfer well to English output

**Priority 2: English VA (Suzie Yeung)**
- Source: Chainsaw Man English dub
- Advantage: Native English speaker for English output
- Trade-off: Different vocal characteristics from the Japanese VA

**Recommended Approach:**
1. Extract 5-10 second clips of both VAs
2. Test voice cloning quality with each
3. Select based on English speech naturalness
4. Store reference audio + transcript in `models/voices/makima/`

### 4.3 Transcript Requirements

For optimal voice cloning:

```
ref_audio: "models/voices/makima/makima-reference.wav"
ref_text: "The exact words spoken in the reference audio"
```

X-vector fallback (lower quality, no transcript needed):

```python
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    x_vector_only_mode=True,  # No transcript required
)
```

---

## 5. Preliminary Technical Approach

### 5.1 Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                    Makima Server (Rust)                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐   ┌──────────────────────┐│
│  │ Listen (STT)│  │  TTS Proxy  │   │  Chat/Contract APIs  ││
│  │  /api/v1/   │  │ /api/v1/tts │   │  /api/v1/...         ││
│  │  listen     │  │             │   │                      ││
│  └──────┬──────┘  └──────┬──────┘   └──────────────────────┘│
│         │                │                                  │
│         │         ┌──────▼──────┐                           │
│         │         │  WebSocket  │                           │
│         │         │   Bridge    │                           │
│         │         └──────┬──────┘                           │
└─────────┼────────────────┼──────────────────────────────────┘
          │                │
          │         ┌──────▼──────┐
          │         │ Python TTS  │
          │         │  Service    │
          │         │ (Qwen3-TTS) │
          │         └─────────────┘
          │
   ┌──────▼──────┐
   │  ML Models  │
   │ (Parakeet,  │
   │ Sortformer) │
   └─────────────┘
```

### 5.2 Python TTS Service

**Proposed Architecture:**

```python
# tts_service.py
import asyncio

import torch
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from pydantic import BaseModel
from qwen_tts import Qwen3TTSModel

app = FastAPI()
model = None
voice_prompts = {}


class TTSRequest(BaseModel):
    # Request body for the batch endpoint (fields mirror the WebSocket message)
    text: str
    voice: str = "makima"
    language: str = "English"


@app.on_event("startup")
async def load_model():
    global model
    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    # Pre-load Makima voice prompt
    voice_prompts["makima"] = model.create_voice_clone_prompt(
        ref_audio="models/voices/makima/reference.wav",
        ref_text="[Makima reference transcript]",
    )


@app.websocket("/ws/tts")
async def tts_stream(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            text = data["text"]
            voice = data.get("voice", "makima")
            language = data.get("language", "English")

            prompt = voice_prompts.get(voice)
            if prompt is None:
                await websocket.send_json(
                    {"type": "error", "code": "SYNTHESIS_ERROR", "message": f"unknown voice: {voice}"}
                )
                continue

            # Batch generation for now; replace with streaming generation when available.
            # Run in a worker thread so inference does not block the event loop.
            wavs, sr = await asyncio.to_thread(
                model.generate_voice_clone,
                text=text,
                language=language,
                voice_clone_prompt=prompt,
            )

            # Send the utterance as one binary frame (assumes wavs[0] is a NumPy array)
            await websocket.send_bytes(wavs[0].tobytes())
    except WebSocketDisconnect:
        pass


@app.post("/api/tts")
async def tts_batch(request: TTSRequest):
    # Batch fallback for non-streaming use cases
    ...
```

### 5.3 Rust Integration

**New Endpoint: `/api/v1/tts`**

```rust
// server/handlers/tts.rs
// Imports assume axum and the futures crate.
use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use futures::{SinkExt, StreamExt};

pub async fn tts_websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(move |socket| handle_tts_socket(socket, state))
}

async fn handle_tts_socket(socket: WebSocket, state: SharedState) {
    // Proxy WebSocket to Python TTS service
    let (mut sender, mut receiver) = socket.split();

    while let Some(msg) = receiver.next().await {
        match msg {
            Ok(Message::Text(text)) => {
                // Forward to Python service; the lock is held per request, not per connection
                let tts_client = state.tts_client.lock().await;
                let response = tts_client.generate(text).await;

                // Stream audio back
                for chunk in response.audio_chunks {
                    if sender.send(Message::Binary(chunk)).await.is_err() {
                        return; // Client disconnected
                    }
                }
            }
            Ok(Message::Close(_)) | Err(_) => break,
            _ => {}
        }
    }
}
```

### 5.4 Voice Prompt Caching

```rust
use std::{collections::HashMap, path::PathBuf};

// Pre-computed voice prompts stored in state
pub struct TtsConfig {
    pub default_voice: String,
    // Keyed by voice name (key and ID types assumed to be String)
    pub voices: HashMap<String, VoicePrompt>,
}

pub struct VoicePrompt {
    pub name: String,
    pub ref_audio_path: PathBuf,
    pub ref_text: String,
    pub language: String,
    // Cached prompt from Python service
    pub cached_prompt_id: Option<String>,
}
```

### 5.5 Message Protocol

**Client -> Server:**

```json
{
  "type": "synthesize",
  "text": "Hello, I am Makima.",
  "voice": "makima",
  "language": "English",
  "stream": true
}
```

**Server -> Client:**

```json
// Audio chunk
{"type": "audio", "data": "", "sample_rate": 24000, "final": false}

// Completion
{"type": "complete", "duration_ms": 1234}

// Error
{"type": "error", "code": "SYNTHESIS_ERROR", "message": "..."}
```

---

## 6. Implementation Phases

### Phase 1: Research & Setup (Current)
- [x] Analyze current TTS implementation
- [x] Research Qwen3-TTS capabilities
- [x] Document feasibility and approach
- [ ] Acquire Makima voice reference clips
- [ ] Test voice cloning quality

### Phase 2: Python Service
- [ ] Create Python TTS service with FastAPI
- [ ] Implement batch TTS endpoint
- [ ] Implement WebSocket streaming (when available)
- [ ] Add voice prompt management
- [ ] GPU optimization with FlashAttention 2

### Phase 3: Rust Integration
- [ ] Add TTS proxy endpoints to makima server
- [ ] WebSocket bridge implementation
- [ ] Voice configuration management
- [ ] Error handling and fallbacks

### Phase 4: Production Ready
- [ ] Health checks for Python service
- [ ] Voice prompt caching optimization
- [ ] Latency benchmarking
- [ ] Integration with Listen page

---

## 7. Open Questions

1. **Streaming API Availability**: When will official streaming support be released?
   - Fallback: Use the community fork, or batch generation with chunked responses

2. **Voice Quality**: How well does a clone of the Japanese VA's voice carry over to English speech?
   - Action: Test with both Japanese and English VA samples

3. **GPU Requirements**: What is the minimum VRAM for the 0.6B model?
   - Estimate: ~2-4GB in bf16 (0.6B parameters at 2 bytes each is ~1.2GB for weights alone, before the codec, activations, and KV cache)

4. **Latency Target**: What is acceptable for "close to live" TTS?
   - Proposal: <500ms to first audio, <100ms between subsequent chunks

5. **Transcript Acquisition**: How can accurate transcripts be obtained for voice clips?
   - Options: Manual transcription, Whisper ASR, community resources

---

## 8. References

- [Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report (arXiv)](https://arxiv.org/abs/2601.15621)
- [Streaming Inference Issue #77](https://github.com/QwenLM/Qwen3-TTS/issues/77)
- [Community Streaming Fork](https://github.com/dffdeeq/Qwen3-TTS-streaming)
- [Makima Voice Actors](https://www.behindthevoiceactors.com/characters/Chainsaw-Man/Makima/)
- [Chatterbox-Turbo-ONNX (Current Model)](https://huggingface.co/ResembleAI/chatterbox-turbo-ONNX)
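
## 9. Appendix: Message Framing Sketch

The server-to-client messages in Section 5.5 can be sketched as follows. This is a minimal illustration, not the planned implementation: the helper names `pcm_to_messages` and `decode_audio_message` are hypothetical, and the payload encoding (base64 of little-endian float32 PCM) is an assumption, since the protocol does not yet specify how `data` is encoded. Chunk sizes are aligned to codec frames using the 12.5 Hz frame rate and 24 kHz output rate from Section 2.1, which gives 1920 samples (80 ms) per frame.

```python
import base64
import json
import struct

SAMPLE_RATE = 24_000   # output sample rate (Section 2.1)
FRAME_RATE_HZ = 12.5   # codec frame rate (Section 2.1)
SAMPLES_PER_FRAME = int(SAMPLE_RATE / FRAME_RATE_HZ)  # 1920 samples = 80 ms


def pcm_to_messages(samples, frames_per_chunk=4):
    """Yield Section 5.5 style JSON messages for a list of float samples.

    Hypothetical helper: emits one "audio" message per chunk, then "complete".
    """
    chunk_len = SAMPLES_PER_FRAME * frames_per_chunk
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        # Assumption: little-endian float32 PCM, base64-encoded into "data".
        payload = struct.pack(f"<{len(chunk)}f", *chunk)
        yield json.dumps({
            "type": "audio",
            "data": base64.b64encode(payload).decode("ascii"),
            "sample_rate": SAMPLE_RATE,
            "final": start + chunk_len >= len(samples),
        })
    duration_ms = round(1000 * len(samples) / SAMPLE_RATE)
    yield json.dumps({"type": "complete", "duration_ms": duration_ms})


def decode_audio_message(message):
    """Inverse of the encoding above: return one chunk as a list of floats."""
    body = json.loads(message)
    raw = base64.b64decode(body["data"])
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

With the default of four frames per chunk, each audio message carries 320 ms of audio, so one second of audio produces four audio messages followed by the completion message. Smaller chunks lower the delay before playback can start at the cost of more per-message overhead; sending raw binary WebSocket frames, as the service sketch in Section 5.2 does, would avoid the base64 overhead entirely.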