# TTS Replacement Research Notes

## Executive Summary

This document summarizes research on replacing the existing TTS endpoint in makima with Qwen3-TTS-12Hz-0.6B-Base. The goals are voice cloning that uses Makima's Japanese voice to speak English, and near-live (streaming) TTS.

---

## 1. Current TTS Implementation Analysis

### 1.1 Current Model: Chatterbox-Turbo

The existing TTS implementation in `makima/src/tts.rs` uses **ResembleAI/chatterbox-turbo-ONNX**:

- **Architecture**: 4-component ONNX model pipeline
  - `speech_encoder.onnx` - Encodes reference voice audio
  - `embed_tokens.onnx` - Token embedding layer
  - `language_model.onnx` - Autoregressive language model (24 layers, 16 KV heads, 64 head dim)
  - `conditional_decoder.onnx` - Decodes speech tokens to audio waveform
- **Sample Rate**: 24,000 Hz output
- **Voice Cloning**: Required (no default voice support)
- **Special Tokens**:
  - START_SPEECH_TOKEN: 6561
  - STOP_SPEECH_TOKEN: 6562
  - SILENCE_TOKEN: 4299

### 1.2 Current API Surface

**Core TTS Functions:**

```rust
// The success type is written as Vec<f32> (audio samples) by assumption; see tts.rs.

pub fn generate_tts(&mut self, _text: &str) -> Result<Vec<f32>, TtsError>
// Returns VoiceRequired error - voice cloning is mandatory

pub fn generate_tts_with_voice(&mut self, text: &str, sample_audio_path: &Path) -> Result<Vec<f32>, TtsError>
// Voice cloning from file path

pub fn generate_tts_with_samples(&mut self, text: &str, samples: &[f32], sample_rate: u32) -> Result<Vec<f32>, TtsError>
// Voice cloning from raw samples
```

**Audio Processing:**

- Input audio resampled to 24kHz
- Reference voice encoded into:
  - `audio_features` - Acoustic features
  - `prompt_tokens` - Token representation
  - `speaker_embeddings` - Speaker identity
  - `speaker_features` - Voice characteristics

### 1.3 Current Limitations

1. **No Streaming Support**: The current implementation generates the complete audio before returning
2. **No Default Voice**: Requires voice reference audio for every call
3. **No HTTP Endpoint**: TTS is only available as a Rust library, not exposed via a REST API
4. **Single Language**: Optimized for English; multilingual support is unclear
5. **High Latency**: Full autoregressive generation completes before any audio is output

### 1.4 Related Components

**Audio Processing (`makima/src/audio.rs`):**

- Uses Symphonia for audio decoding (MP3, WAV, FLAC, OGG, etc.)
- Resampling via sinc interpolation
- Stereo to mono mixdown
- Target: 16kHz mono for STT

**Listen Endpoint (`makima/src/server/handlers/listen.rs`):**

- WebSocket-based streaming STT
- Uses Parakeet for transcription
- Sortformer for speaker diarization
- Already has real-time audio streaming infrastructure

---

## 2. Qwen3-TTS-12Hz-0.6B-Base Model Analysis

### 2.1 Model Capabilities

| Feature | Specification |
|---------|---------------|
| **Model Size** | 0.6B parameters (lightweight variant) |
| **Voice Cloning** | Works from as little as 3 seconds of reference audio |
| **Streaming** | Dual-track hybrid architecture |
| **Minimum Latency** | 97ms end-to-end |
| **Languages** | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |
| **Cross-lingual Cloning** | Japanese voice to English speech supported |
| **Speaker Similarity** | 0.95 (near human-level) |
| **Output Sample Rate** | Up to 48kHz (standard 24kHz) |

### 2.2 Voice Cloning Requirements

**Reference Audio:**

- **Minimum Duration**: 3 seconds
- **Recommended Duration**: 5-15 seconds
- **Format**: WAV preferred; also supports URL, base64, numpy array
- **Quality**: Clean, noise-free audio is essential
- **Transcript**: Providing `ref_text` significantly improves quality

**Cross-Lingual Usage (Japanese to English):**

```python
ref_audio = "makima_japanese.wav"  # Japanese reference
ref_text = "日本語のテキスト"  # Japanese transcription

wavs, sr = model.generate_voice_clone(
    text="This is English text",  # English output
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
```

### 2.3 Technical Requirements

**Python Dependencies:**

```bash
pip install -U qwen-tts
pip install -U flash-attn --no-build-isolation  # For optimal performance
```

**Hardware:**

- CUDA-compatible GPU required
- FlashAttention 2 for optimal memory usage
- Float16/bfloat16 precision support
- On machines with less than 96GB RAM, set `MAX_JOBS=4` when installing flash-attn

**Model Loading:**

```python
from qwen_tts import Qwen3TTSModel
import torch

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

### 2.4 Streaming Architecture

**Dual-Track Hybrid Design:**

- Single model supports both streaming and non-streaming
- Audio output begins after minimal text input
- 97ms minimum latency achieved through:
  - Proprietary Qwen3-TTS-Tokenizer-12Hz (efficient acoustic compression)
  - Discrete multi-codebook LM (eliminates LM+DiT bottleneck)
  - Lightweight non-DiT vocoder

**Reusable Voice Clone Prompt (Critical for Performance):**

```python
# Pre-compute once
prompt_items = model.create_voice_clone_prompt(
    ref_audio=ref_audio,
    ref_text=ref_text,
    x_vector_only_mode=False
)

# Reuse for multiple generations
wavs, sr = model.generate_voice_clone(
    text=["Line 1", "Line 2"],
    language=["English", "English"],
    voice_clone_prompt=prompt_items,  # Cached prompt
)
```

---

## 3. Makima Voice Audio Sources

### 3.1 Character Information

- **Character**: Makima from the Chainsaw Man anime
- **Japanese Voice Actress**: Tomori Kusunoki (楠木ともり)
- **English Voice Actress**: Suzie Yeung

### 3.2 Potential Audio Sources

| Source | Type | Notes |
|--------|------|-------|
| **Voicy Network Soundboard** | Official clips | MP3 download available, 20+ sound effects |
| **101Soundboards** | Fan-curated clips | Various character sounds |
| **Anime Episodes** | Source material | Highest quality, requires extraction |
| **Nikke: Goddess of Victory** | Game voicelines | Same voice actress (Tomori Kusunoki) |
| **Ko-fi (erusha)** | WAV files | 5 character-impression audio files |

### 3.3 Recommended Approach

1. **Primary Source**: Extract 5-15 seconds of clean dialogue from the Chainsaw Man anime (Japanese audio track)
2. **Selection Criteria**:
   - Clear, isolated dialogue (no background music/effects)
   - Natural speaking tone (not shouting/whispering)
   - Variety of phonemes for better cloning
3. **Transcription**: Provide an accurate Japanese transcription for `ref_text`
4. **Processing**: Convert to WAV format and verify the audio is clean

### 3.4 Legal Considerations

- Voice cloning of real voice actors for commercial use may have legal implications
- Synthetic voice generation based on copyrighted characters may require licenses
- Consider restricting use to internal/personal projects, or adding a disclaimer

---

## 4. Feasibility Assessment

### 4.1 Live/Streaming TTS Feasibility: HIGHLY FEASIBLE

**Evidence:**

- Qwen3-TTS achieves 97ms latency (well under a 200ms real-time threshold)
- Existing WebSocket infrastructure in makima (`/api/v1/listen`) can be adapted
- Streaming architecture designed for interactive scenarios

**Implementation Approach:**

1. Create a new WebSocket endpoint `/api/v1/speak` mirroring the listen endpoint
2. Pre-compute the voice clone prompt on connection start
3. Stream audio chunks as they're generated
4. Use chunked audio encoding (similar to listen's binary message handling)

### 4.2 Voice Cloning with Japanese Voice: FULLY SUPPORTED

**Evidence:**

- Qwen3-TTS explicitly supports cross-lingual voice cloning
- Japanese is one of 10 supported languages
- 0.95 speaker similarity maintained across languages

**Implementation Approach:**

1. Pre-process Makima voice clips (5-15 seconds of Japanese audio)
2. Include the Japanese transcription
3. Generate English speech while preserving voice characteristics

### 4.3 Integration Challenges

| Challenge | Difficulty | Mitigation |
|-----------|-----------|------------|
| **Python to Rust Integration** | Medium | Use Python subprocess or microservice |
| **GPU Memory** | Low | 0.6B model is lightweight |
| **Latency Target** | Low | 97ms base latency is excellent |
| **Audio Format Conversion** | Low | Existing symphonia infrastructure |
| **Default Voice Setup** | Low | One-time voice prompt caching |

### 4.4 Architecture Options

**Option A: Python Microservice**

```
[Makima Rust Server] --HTTP/WebSocket--> [Python TTS Service]
                                                  |
                                          [Qwen3-TTS Model]
```

Pros: Clean separation, easy Python integration
Cons: Network overhead, deployment complexity

**Option B: PyO3 Rust Bindings**

```
[Makima Rust Server] --FFI--> [pyo3 Python Bindings] --> [Qwen3-TTS]
```

Pros: Single process, lower latency
Cons: Complex build, Python GIL issues

**Option C: ONNX Export (Like Current Chatterbox)**

```
[Makima Rust Server] --ort--> [Qwen3-TTS ONNX Models]
```

Pros: Pure Rust, consistent with existing architecture
Cons: An ONNX export may not be available for Qwen3-TTS

**Recommended: Option A (Python Microservice)**

- Fastest time to implementation
- Aligns with Qwen3-TTS's native Python API
- Can use WebSocket for streaming audio chunks
- Easy to deploy alongside the existing makima server

---

## 5. Preliminary Technical Approach

### 5.1 Phase 1: Python TTS Microservice

```python
# tts_service.py
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from qwen_tts import Qwen3TTSModel
import torch

app = FastAPI()
model = None
voice_prompt = None


@app.on_event("startup")
async def load_model():
    global model, voice_prompt
    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
    )
    # Pre-load Makima voice
    voice_prompt = model.create_voice_clone_prompt(
        ref_audio="makima_voice.wav",
        ref_text="日本語の台詞...",
    )


@app.websocket("/ws/speak")
async def speak(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            text = await websocket.receive_text()
            wavs, sr = model.generate_voice_clone(
                text=text,
                language="English",
                voice_clone_prompt=voice_prompt,
            )
            # Proof of concept: send the whole utterance as one binary message
            # (raw sample bytes); chunked streaming is left for Phase 2
            audio_bytes = wavs[0].tobytes()
            await websocket.send_bytes(audio_bytes)
    except WebSocketDisconnect:
        pass  # Client closed the connection
```

### 5.2 Phase 2: Rust Integration

```rust
// makima/src/server/handlers/speak.rs
// Imports assume axum; SharedState is makima's existing shared-state type.
use axum::extract::ws::{WebSocket, WebSocketUpgrade};
use axum::{extract::State, response::Response};

pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(|socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(socket: WebSocket, state: SharedState) {
    // Connect to Python TTS service (this function returns (), so `?` cannot be used)
    let Ok((tts_ws, _)) = tokio_tungstenite::connect_async("ws://localhost:8001/ws/speak").await else {
        return; // TTS service unreachable; dropping `socket` closes the client connection
    };
    // Forward text to TTS, stream audio back to client
    // ...
}
```

### 5.3 API Design

**WebSocket Endpoint: `/api/v1/speak`**

**Client to Server Messages:**

```json
{ "type": "start", "sample_rate": 24000, "encoding": "pcm16" }
{ "type": "speak", "text": "Hello, I am Makima." }
{ "type": "stop" }
```

**Server to Client Messages:**

```json
{ "type": "ready", "session_id": "uuid" }
{ "type": "audio_chunk", "data": "", "is_final": false }
{ "type": "complete" }
```

---

## 6. Next Steps

### Immediate Actions

1. [ ] Obtain Makima voice clips (5-15 seconds of clean Japanese audio)
2. [ ] Create Japanese transcription of voice clips
3. [ ] Test Qwen3-TTS voice cloning with Makima samples
4. [ ] Benchmark latency on target hardware

### Development Phases

1. **Phase 1**: Python TTS microservice proof-of-concept
2. **Phase 2**: WebSocket streaming integration
3. **Phase 3**: Rust proxy endpoint in makima
4. **Phase 4**: Listen page integration for bidirectional speech

### Hardware Requirements

- CUDA-compatible GPU (minimum)
- Recommended: 8GB+ VRAM for 0.6B model with FlashAttention 2
- Python 3.12+ environment

---

## References

- [Qwen3-TTS-12Hz-0.6B-Base on HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
- [Behind The Voice Actors - Makima](https://www.behindthevoiceactors.com/tv-shows/Chainsaw-Man/Makima/)
- [Voicy Network Chainsaw Man Soundboard](https://www.voicy.network/official-soundboards/anime/chainsaw-man)
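
---

## Appendix: Audio Chunk Framing Sketch

The `audio_chunk` and `complete` server messages from Section 5.3 could be produced roughly as follows. This is a minimal sketch, not the planned implementation: it assumes the `data` field carries base64-encoded little-endian PCM16 (the notes do not fix the encoding), and the helper names `float_to_pcm16` and `audio_chunk_messages` and the 100 ms default chunk size are hypothetical.

```python
import base64
import json
import struct
from typing import Iterator, Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    # Clip out-of-range samples so they cannot overflow the int16 range.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def audio_chunk_messages(pcm: bytes, chunk_bytes: int = 4800) -> Iterator[str]:
    """Yield `audio_chunk` JSON messages, then one `complete` message.

    The default of 4800 bytes is 100 ms of 24 kHz mono PCM16
    (24000 samples/s * 2 bytes * 0.1 s); this value is an assumption.
    """
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "audio_chunk",
            # Assumed encoding: base64 text so the chunk fits in a JSON message.
            "data": base64.b64encode(chunk).decode("ascii"),
            "is_final": start + chunk_bytes >= len(pcm),
        })
    yield json.dumps({"type": "complete"})
```

In the Phase 1 service, each yielded string could be sent with `websocket.send_text(...)`. Sending raw binary frames instead would avoid the roughly one-third size overhead of base64, at the cost of moving `is_final` into a separate control message.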