|
|
Research findings for replacing Chatterbox TTS with Qwen3-TTS-12Hz-0.6B-Base:
- Current TTS: Chatterbox-Turbo-ONNX with batch-only generation, no streaming
- Qwen3-TTS: 97ms end-to-end latency, streaming support, 3-second voice cloning
- Voice cloning: Requires 3s reference audio + transcript (Makima voice planned)
- Integration: Python service with WebSocket bridge (no ONNX export available)
- Languages: 10 supported including English and Japanese
Document includes:
- Current architecture analysis (makima/src/tts.rs)
- Qwen3-TTS capabilities and requirements
- Feasibility assessment for live/streaming TTS
- Audio clip requirements for voice cloning
- Preliminary technical approach with architecture diagrams
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|