# Qwen3-TTS Integration Specification
**Version:** 2.0
**Date:** 2026-01-27
**Status:** Draft
**Author:** Makima Engineering
## Table of Contents
1. [Overview](#1-overview)
2. [Functional Requirements](#2-functional-requirements)
3. [Non-Functional Requirements](#3-non-functional-requirements)
4. [Architecture Specification](#4-architecture-specification)
5. [API Contract](#5-api-contract)
6. [Voice Asset Requirements](#6-voice-asset-requirements)
7. [Testing Strategy](#7-testing-strategy)
8. [Implementation Phases](#8-implementation-phases)
9. [Appendix](#appendix)
---
## 1. Overview
### 1.1 Purpose
This specification defines the integration of Qwen3-TTS-12Hz-0.6B-Base as a replacement for the existing Chatterbox-Turbo TTS implementation in the makima system. The new implementation is a **pure Rust** solution using the **candle** ML framework — no Python, no separate microservice. The TTS model runs directly inside the main makima process. The implementation will provide:
- **Streaming TTS** with near-real-time audio synthesis
- **Voice cloning** with default Makima voice (Japanese voice actress speaking English)
- **Bidirectional speech integration** for the Listen page
- **WebSocket-based streaming API** for low-latency delivery
### 1.2 Background
The current TTS implementation (Chatterbox-Turbo-ONNX) has limitations:
- No streaming support (batch-only generation)
- No HTTP/WebSocket endpoint exposed
- High latency for interactive use cases
Qwen3-TTS offers significant improvements:
- **97ms** reported end-to-end latency with streaming output (the current engine only generates in batch)
- **10 languages** supported including Japanese cross-lingual cloning
- **3-second** reference audio for voice cloning
- **Dual-track streaming architecture**
### 1.3 Scope
This specification covers:
- WebSocket endpoint `/api/v1/speak` for streaming TTS
- Pure Rust candle-based model inference running in-process
- Voice asset management
- Testing and benchmarking
Out of scope:
- ONNX export of Qwen3-TTS (not available)
- Instruction-following TTS features (base model only)
- Full replacement of STT/Listen functionality
---
## 2. Functional Requirements
### 2.1 WebSocket Endpoint: `/api/v1/speak`
The TTS service SHALL be exposed via a WebSocket endpoint at `/api/v1/speak` for streaming audio synthesis.
#### 2.1.1 Connection Flow
```
Client                             Server (Rust/Axum)
|                                  |
|--- WS Connect ------------------>|
|                                  | [Load TTS model lazily if needed]
|<-- Ready (session_id) -----------|
|                                  |
|--- Start (config) -------------->|
|<-- Started ----------------------| [Load voice prompt]
|                                  |
|--- Speak (text) ---------------->| [Direct candle model inference]
|                                  |
|<-- AudioChunk (binary) ----------|
|<-- AudioChunk (binary) ----------|
|<-- Complete ---------------------|
|                                  |
|--- Stop ------------------------>|
|<-- Stopped ----------------------|
|                                  |
```
#### 2.1.2 Voice Cloning
The system SHALL support voice cloning with the following modes:
| Mode | Description | Requirements |
|------|-------------|--------------|
| Default (Makima) | Pre-loaded Makima voice | None (auto-selected) |
| Custom Voice | User-provided reference | Audio file + transcript |
| X-Vector Only | Speaker embedding only | Audio file (no transcript) |
**Default Voice Behavior:**
- If no voice is specified, use the pre-loaded Makima voice prompt
- Makima voice SHALL be a Japanese voice actress (Tomori Kusunoki) speaking English
- Voice prompt is pre-computed at model load time for zero-latency switching
#### 2.1.3 Message Protocol
##### Client-to-Server Messages
All messages use JSON format with a `type` field for routing.
**Start Message** - Initialize TTS session
```json
{
  "type": "start",
  "sampleRate": 24000,
  "encoding": "pcm16",
  "voice": "makima",
  "language": "English",
  "authToken": "optional-jwt-token",
  "contractId": "optional-contract-uuid"
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Must be "start" |
| `sampleRate` | number | No | Output sample rate (default: 24000) |
| `encoding` | string | No | Audio encoding: "pcm16", "pcm32f" (default: "pcm16") |
| `voice` | string | No | Voice ID or "makima" (default: "makima") |
| `language` | string | No | Output language (default: "English") |
| `authToken` | string | No | JWT for authenticated sessions |
| `contractId` | string | No | Contract ID for context |
**Speak Message** - Request speech synthesis
```json
{
  "type": "speak",
  "text": "Hello, I am Makima.",
  "priority": "normal"
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Must be "speak" |
| `text` | string | Yes | Text to synthesize |
| `priority` | string | No | "high" or "normal" (default: "normal") |
**Stop Message** - End session
```json
{
  "type": "stop",
  "reason": "user_requested"
}
```
**Cancel Message** - Cancel current synthesis
```json
{
  "type": "cancel"
}
```
##### Server-to-Client Messages
**Ready Message** - Session established
```json
{
  "type": "ready",
  "sessionId": "uuid-string",
  "voiceLoaded": "makima",
  "capabilities": {
    "streaming": true,
    "languages": ["English", "Japanese", "Chinese", "Korean", "German", "French", "Russian", "Portuguese", "Spanish", "Italian"]
  }
}
```
**Started Message** - TTS session configured
```json
{
  "type": "started",
  "sampleRate": 24000,
  "encoding": "pcm16",
  "voice": "makima"
}
```
**AudioChunk Message** - Streaming audio data
```json
{
  "type": "audioChunk",
  "data": "<base64-encoded-audio>",
  "sequenceNumber": 1,
  "isFinal": false,
  "timestampMs": 1234567890
}
```
For binary transport (recommended for performance):
- Server MAY send raw binary WebSocket frames
- Binary frames contain PCM audio data directly
- JSON control messages indicate start/end of audio stream
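As an illustration of what a binary `pcm16` frame could carry, the sketch below converts normalized `f32` samples into 16-bit PCM bytes. The function names are hypothetical, and little-endian byte order is an assumption: this specification does not state a byte order, so client and server would need to agree on one. The `pcm32f` encoding is not covered here.

```rust
/// Convert normalized f32 samples in [-1.0, 1.0] to 16-bit PCM bytes.
/// ASSUMPTION: little-endian byte order; the spec does not define one.
pub fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &sample in samples {
        // Clamp first so out-of-range model output saturates instead of wrapping.
        let clamped = sample.clamp(-1.0, 1.0);
        let value = (clamped * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

/// Playback duration in milliseconds of a mono PCM16 payload.
pub fn pcm16_duration_ms(byte_len: usize, sample_rate: u32) -> u64 {
    let samples = (byte_len / 2) as u64; // 2 bytes per mono sample
    samples * 1000 / sample_rate as u64
}
```

At the default 24,000 Hz mono output, one second of audio is 48,000 bytes, which gives a rough way to size chunks against the inter-chunk latency targets.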
**Complete Message** - Synthesis finished
```json
{
  "type": "complete",
  "durationMs": 1500,
  "charactersProcessed": 25,
  "audioLengthMs": 2100
}
```
**Error Message** - Error occurred
```json
{
  "type": "error",
  "code": "SYNTHESIS_ERROR",
  "message": "Failed to generate audio",
  "recoverable": true
}
```
**Stopped Message** - Session ended
```json
{
  "type": "stopped",
  "reason": "user_requested"
}
```
#### 2.1.4 Integration with Listen Page
The TTS endpoint SHALL integrate with the existing Listen page (`/api/v1/listen`) to enable bidirectional speech:
**Bidirectional Flow:**
```
User Speech -> /api/v1/listen (STT) -> Transcription
                                            |
                                            v
                             LLM Processing / Task Creation
                                            |
                                            v
Response Text -> /api/v1/speak (TTS) -> Audio -> User
```
**Implementation Requirements:**
1. Both endpoints SHALL support the same `contractId` for context sharing
2. TTS SHALL support interruption when new STT input is detected
3. Session management SHALL coordinate between STT and TTS
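Requirement 2 (interruption on new STT input) could be met with a shared cancellation flag that the generation loop checks between chunks. The following is a minimal sketch under that assumption; `CancelFlag` and `stream_chunks` are hypothetical names, and the chunk loop stands in for the real autoregressive generation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared flag that the STT side sets when new user speech is detected.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }

    /// Clear the flag before the next `speak` request.
    pub fn reset(&self) {
        self.0.store(false, Ordering::SeqCst);
    }
}

/// Emit up to `total_chunks` chunks, checking the flag before each one.
/// Returns how many chunks were actually emitted.
pub fn stream_chunks<F: FnMut(usize)>(
    total_chunks: usize,
    cancel: &CancelFlag,
    mut emit: F,
) -> usize {
    let mut sent = 0;
    for index in 0..total_chunks {
        if cancel.is_cancelled() {
            break; // stop promptly; the caller can report CANCELLED
        }
        emit(index);
        sent += 1;
    }
    sent
}
```

Checking once per chunk bounds the interruption delay by roughly one inter-chunk interval, which is why the flag is polled inside the loop rather than only at request boundaries.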
### 2.2 Voice Configuration API
#### 2.2.1 List Available Voices
```
GET /api/v1/voices
```
Response:
```json
{
  "voices": [
    {
      "id": "makima",
      "name": "Makima (Default)",
      "language": "Japanese",
      "description": "Default Makima voice (Tomori Kusunoki)",
      "isDefault": true
    }
  ]
}
```
#### 2.2.2 Upload Custom Voice (Future)
```
POST /api/v1/voices
Content-Type: multipart/form-data
audio: <audio-file>
transcript: "Text spoken in the audio"
name: "Custom Voice"
```
---
## 3. Non-Functional Requirements
### 3.1 Latency Requirements
| Metric | Target | Maximum | Notes |
|--------|--------|---------|-------|
| First Audio Byte | < 200ms | 500ms | From text submission to first audio chunk |
| Subsequent Chunks | < 50ms | 100ms | Inter-chunk latency |
| End-to-End Latency | < 300ms | 800ms | Total time for short phrases |
| Voice Prompt Loading | < 500ms | 2000ms | One-time at session start |
**Measurement Points:**
- T0: Client sends "speak" message
- T1: First audio chunk received by client
- T2: Last audio chunk received ("complete" message)
- First Audio Latency = T1 - T0
- Total Latency = T2 - T0
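The measurement points can be turned into a small benchmark helper. This is a sketch only: `SpeakTimings`, `Verdict`, and `classify` are hypothetical names, and the thresholds passed in are the Target and Maximum columns from the table in 3.1.

```rust
use std::time::{Duration, Instant};

/// Timestamps captured on the client for one `speak` request.
#[derive(Debug, Clone, Copy)]
pub struct SpeakTimings {
    pub t0: Instant, // client sends "speak"
    pub t1: Instant, // first audio chunk received
    pub t2: Instant, // "complete" message received
}

impl SpeakTimings {
    /// First Audio Latency = T1 - T0
    pub fn first_audio_latency(&self) -> Duration {
        self.t1.duration_since(self.t0)
    }

    /// Total Latency = T2 - T0
    pub fn total_latency(&self) -> Duration {
        self.t2.duration_since(self.t0)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    WithinTarget,
    WithinMaximum,
    Exceeded,
}

/// Compare a measurement against a (target, maximum) pair from section 3.1.
pub fn classify(measured: Duration, target: Duration, maximum: Duration) -> Verdict {
    if measured < target {
        Verdict::WithinTarget
    } else if measured <= maximum {
        Verdict::WithinMaximum
    } else {
        Verdict::Exceeded
    }
}
```

For example, a first-audio latency of 180ms checked against the 200ms target and 500ms maximum yields `Verdict::WithinTarget`.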
### 3.2 Audio Quality Requirements
| Specification | Value |
|---------------|-------|
| Output Sample Rate | 24,000 Hz |
| Bit Depth | 16-bit (PCM16) or 32-bit float |
| Channels | Mono (1 channel) |
| Audio Codec | Raw PCM (WebSocket), WAV (download) |
| Voice Similarity | > 0.90 speaker similarity score |
**Quality Metrics:**
- MOS (Mean Opinion Score): Target > 4.0
- Speaker similarity to reference: Target > 0.90
- No audible artifacts or glitches in streaming mode
### 3.3 Hardware Requirements
The TTS model runs directly in the makima process using candle.
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | CUDA-capable, 4GB VRAM (or Metal on macOS) | RTX 3060+ with 8GB+ VRAM |
| GPU Memory | 4GB | 8GB |
| System RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB |
**GPU Memory Breakdown:**
- Model weights (bf16): ~1.2GB
- Speech tokenizer: ~682MB
- KV cache during inference: ~1-2GB
- Safety margin: ~1GB
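The breakdown can be checked with simple arithmetic. The sketch below sums the figures listed above, treating 1GB as 1000MB for simplicity (an assumption) and taking the parameter count of roughly 0.6 billion from the model name; the type and function names are hypothetical.

```rust
/// Bytes needed for model weights: parameter count times bytes per parameter.
pub fn weight_bytes(params: u64, bytes_per_param: u64) -> u64 {
    params * bytes_per_param
}

/// GPU memory budget in megabytes, using the estimates from this section.
pub struct GpuBudgetMb {
    pub weights: u32,
    pub speech_tokenizer: u32,
    pub kv_cache_min: u32,
    pub kv_cache_max: u32,
    pub safety_margin: u32,
}

impl GpuBudgetMb {
    /// Figures copied from the breakdown above (1GB treated as 1000MB).
    pub fn from_spec() -> Self {
        Self {
            weights: 1200,
            speech_tokenizer: 682,
            kv_cache_min: 1000,
            kv_cache_max: 2000,
            safety_margin: 1000,
        }
    }

    /// (low, high) total depending on KV cache usage.
    pub fn total_range(&self) -> (u32, u32) {
        let fixed = self.weights + self.speech_tokenizer + self.safety_margin;
        (fixed + self.kv_cache_min, fixed + self.kv_cache_max)
    }
}
```

With these inputs the total lands between about 3.9GB and 4.9GB, so the 4GB minimum appears to hold only when the KV cache stays near the low end of its estimate or the safety margin is reduced.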
**CPU Fallback:**
- candle supports CPU with MKL for systems without GPU
- Latency will be higher but functional
### 3.4 Scalability Requirements
| Metric | Target |
|--------|--------|
| Concurrent Sessions | 10 per GPU |
| Requests per Second | 50 text-to-speech requests |
| Audio Throughput | 10 hours of audio generated per wall-clock hour (10x real time) |
### 3.5 Availability Requirements
| Metric | Target |
|--------|--------|
| Service Uptime | 99.5% |
| Recovery Time | < 30 seconds |
| Graceful Degradation | Fall back to batch mode if streaming fails |
---
## 4. Architecture Specification
### 4.1 System Architecture
```
+-------------------------------------------------------------------------+
| Client Application |
| +-------------+ +-------------+ +------------------------------+ |
| | Listen | | Speak | | UI Components | |
| | (STT UI) | | (TTS UI) | | (Audio Player, Controls) | |
| +------+------+ +------+------+ +------------------------------+ |
+---------|--------------------|------------------------------------------+
| WebSocket | WebSocket
| /api/v1/listen | /api/v1/speak
| |
+---------|--------------------|------------------------------------------+
| | Makima Server (Rust/Axum) |
| +------v--------------------v------+ |
| | WebSocket Router | |
| | (axum WebSocket handlers) | |
| +------+--------------------+------+ |
| | | |
| +------v------+ +------v------+ +-----------------------------+ |
| | Listen | | Speak | | Shared State | |
| | Handler | | Handler | | - ML Models (STT) | |
| | (STT/ML) | | (TTS/ML) | | - TTS Model (candle) | |
| +-------------+ +------+------+ | - Voice Prompt Cache | |
| | | - Session Manager | |
| | +-----------------------------+ |
| +------v------+ |
| | TTS Module | |
| | (candle) | |
| +------+------+ |
| | |
| +------v------+ |
| | Qwen3-TTS | |
| | Components | |
| | - LM (28L) | |
| | - Code Pred | |
| | - Speech Tok| |
| +-------------+ |
+--------------------------------------------------------------------------+
```
### 4.2 TTS Module Structure
```
makima/src/tts/
├── mod.rs                      // TTS trait + factory (select Chatterbox vs Qwen3)
├── chatterbox.rs               // Existing ONNX-based Chatterbox (moved from tts.rs)
└── qwen3/
    ├── mod.rs                  // Qwen3TTS public API
    ├── model.rs                // Qwen3 LM transformer (28 layers)
    ├── code_predictor.rs       // MTP module (5 layers, 16 codebooks)
    ├── speech_tokenizer.rs     // Encoder + Decoder (causal ConvNet)
    ├── config.rs               // Model config from config.json
    └── generate.rs             // Autoregressive generation loop with KV cache
```
#### 4.2.1 TTS Trait
```rust
// makima/src/tts/mod.rs
use async_trait::async_trait;

pub mod chatterbox;
pub mod qwen3;

/// Trait for text-to-speech implementations.
#[async_trait]
pub trait TtsEngine: Send + Sync {
    /// Generate audio from text with a given voice prompt.
    async fn generate(
        &self,
        text: &str,
        voice_id: &str,
        language: &str,
    ) -> Result<Vec<AudioChunk>, TtsError>;

    /// Load and cache a voice prompt from reference audio.
    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;

    /// Check if the engine is ready for inference.
    fn is_ready(&self) -> bool;
}

/// Select the appropriate TTS engine based on configuration.
pub fn create_engine(config: &TtsConfig) -> Box<dyn TtsEngine> {
    match config.engine {
        TtsEngineType::Qwen3 => Box::new(qwen3::Qwen3Tts::new(config)),
        TtsEngineType::Chatterbox => Box::new(chatterbox::ChatterboxTts::new(config)),
    }
}
```
#### 4.2.2 Qwen3 Candle Implementation
The Qwen3 module implements the three core model components using candle:
1. **Language Model** — 28-layer transformer using candle-transformers' Qwen2 attention with TTS-specific modifications
2. **Code Predictor** — 5-layer MTP module predicting 16 codebook layers
3. **Speech Tokenizer** — GAN-based codec with Conv1d encoder/decoder
**Key candle features used:**
- `candle_core::Tensor` for all tensor operations
- `candle_nn::Module` for model layers
- `candle_nn::VarBuilder` for loading safetensors weights
- `candle_core::Device` for GPU/CPU selection
#### 4.2.3 Model Loading
Models are loaded lazily on first TTS request, following the pattern established by `listen.rs`:
```rust
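// Models are held in SharedState behind a once-initialized cell
// (`tts_models` is assumed to be a `tokio::sync::OnceCell<TtsModels>`),
// so the first caller loads them and later callers reuse the same instance.
pub struct TtsModels {
    pub engine: Box<dyn TtsEngine>,
    pub voice_cache: VoicePromptCache,
}

impl AppState {
    pub async fn get_tts_models(&self) -> Result<&TtsModels, TtsError> {
        self.tts_models
            .get_or_try_init(|| async {
                // Load safetensors weights via candle, then initialize the
                // voice cache with the default Makima voice.
                // `TtsModels::load` and `self.config` are placeholders for
                // that loading routine and its configuration.
                TtsModels::load(&self.config).await
            })
            .await
    }
}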
```
### 4.3 Speak Handler
```rust
// makima/src/server/handlers/speak.rs
use axum::{
    extract::{
        ws::{WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use uuid::Uuid;

/// WebSocket handler for TTS streaming.
/// Calls the TTS engine directly: no proxy, no external service.
pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(move |socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(socket: WebSocket, state: SharedState) {
    let session_id = Uuid::new_v4().to_string();

    // Get or lazily load TTS models
    let tts = match state.get_tts_models().await {
        Ok(tts) => tts,
        Err(_e) => {
            // Send an `error` message (for example MODEL_LOADING) and close
            return;
        }
    };

    // Handle WebSocket messages directly:
    // send `ready` with `session_id`, parse JSON commands,
    // run inference with `tts`, and stream audio chunks back over `socket`.
}
```
### 4.4 Voice Prompt Caching
Voice prompts are cached in-memory using an LRU cache:
```rust
// makima/src/tts/mod.rs
pub struct VoicePromptCache {
    // `LruCache::get` needs `&mut self` to update recency, hence the mutex.
    cache: tokio::sync::Mutex<lru::LruCache<String, VoicePrompt>>,
}

impl VoicePromptCache {
    pub fn new(max_size: usize) -> Self { /* ... */ }
    pub async fn get(&self, voice_id: &str) -> Option<VoicePrompt> { /* ... */ }
    pub async fn insert(&self, voice_id: String, prompt: VoicePrompt) { /* ... */ }
}
```
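To make the eviction behavior concrete, here is a dependency-free sketch of LRU semantics. It is not the `lru` crate used above: `SimpleLru` is a hypothetical stand-in built from standard collections, with O(n) recency updates, intended only to show which entry is dropped when the cache is full.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal LRU cache keyed by voice ID. Front of `order` = least recently used.
pub struct SimpleLru<V> {
    capacity: usize,
    map: HashMap<String, V>,
    order: VecDeque<String>,
}

impl<V: Clone> SimpleLru<V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }

    /// Return a clone of the value and mark the entry as recently used.
    pub fn get(&mut self, key: &str) -> Option<V> {
        let value = self.map.get(key).cloned()?;
        self.touch(key);
        Some(value)
    }

    /// Insert or update an entry, evicting the least recently used if full.
    pub fn insert(&mut self, key: String, value: V) {
        if self.map.insert(key.clone(), value).is_some() {
            self.touch(&key); // existing key: refresh recency only
            return;
        }
        self.order.push_back(key);
        if self.map.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`, since `a` was used more recently.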
### 4.5 Error Handling and Recovery
#### 4.5.1 Error Categories
| Error Code | Category | Recoverable | Action |
|------------|----------|-------------|--------|
| `MODEL_LOADING` | Initialization | Yes | Wait and retry |
| `SYNTHESIS_ERROR` | Generation | Yes | Retry with same input |
| `INVALID_TEXT` | Input | No | Return error to client |
| `VOICE_NOT_FOUND` | Configuration | No | Fall back to default voice |
| `GPU_OUT_OF_MEMORY` | Resource | Yes | Clear cache, retry on CPU |
| `TIMEOUT` | Inference | Yes | Retry with backoff |
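One way to keep the wire codes and the recoverable flag in a single place is an enum that mirrors the table above. This is a sketch: `TtsErrorCode` is a hypothetical name, distinct from the `TtsError` type used elsewhere in this document, and it covers only the six codes listed in this section.

```rust
/// Error codes from the table in 4.5.1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TtsErrorCode {
    ModelLoading,
    SynthesisError,
    InvalidText,
    VoiceNotFound,
    GpuOutOfMemory,
    Timeout,
}

impl TtsErrorCode {
    /// String sent in the `code` field of the error message.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::ModelLoading => "MODEL_LOADING",
            Self::SynthesisError => "SYNTHESIS_ERROR",
            Self::InvalidText => "INVALID_TEXT",
            Self::VoiceNotFound => "VOICE_NOT_FOUND",
            Self::GpuOutOfMemory => "GPU_OUT_OF_MEMORY",
            Self::Timeout => "TIMEOUT",
        }
    }

    /// Value for the `recoverable` field, per the Recoverable column.
    pub fn recoverable(self) -> bool {
        match self {
            Self::InvalidText | Self::VoiceNotFound => false,
            Self::ModelLoading
            | Self::SynthesisError
            | Self::GpuOutOfMemory
            | Self::Timeout => true,
        }
    }
}
```

Because both `match` expressions are exhaustive, adding a new code forces a decision about its wire string and recoverability at compile time.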
---
## 5. API Contract
### 5.1 WebSocket Message Formats
#### 5.1.1 Client-to-Server Messages
```typescript
// TypeScript type definitions for client implementation
interface StartMessage {
  type: "start";
  sampleRate?: number;            // Default: 24000
  encoding?: "pcm16" | "pcm32f";  // Default: "pcm16"
  voice?: string;                 // Default: "makima"
  language?: string;              // Default: "English"
  authToken?: string;             // JWT for authenticated sessions
  contractId?: string;            // Contract context
}

interface SpeakMessage {
  type: "speak";
  text: string;                   // Required: text to synthesize
  priority?: "normal" | "high";   // Default: "normal"
}

interface CancelMessage {
  type: "cancel";
}

interface StopMessage {
  type: "stop";
  reason?: string;
}

type ClientMessage = StartMessage | SpeakMessage | CancelMessage | StopMessage;
```
#### 5.1.2 Server-to-Client Messages
```typescript
interface ReadyMessage {
  type: "ready";
  sessionId: string;
  voiceLoaded: string;
  capabilities: {
    streaming: boolean;
    languages: string[];
  };
}

interface StartedMessage {
  type: "started";
  sampleRate: number;
  encoding: string;
  voice: string;
}

interface AudioChunkMessage {
  type: "audioChunk";
  data: string;  // Base64-encoded PCM audio
  sequenceNumber: number;
  isFinal: boolean;
  timestampMs: number;
}

interface CompleteMessage {
  type: "complete";
  durationMs: number;
  charactersProcessed: number;
  audioLengthMs: number;
}

interface ErrorMessage {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

interface StoppedMessage {
  type: "stopped";
  reason: string;
}

type ServerMessage =
  | ReadyMessage
  | StartedMessage
  | AudioChunkMessage
  | CompleteMessage
  | ErrorMessage
  | StoppedMessage;
```
### 5.2 Error Codes
| Code | HTTP-like | Description | Recovery |
|------|-----------|-------------|----------|
| `MODEL_LOADING` | 503 | Model still loading | Wait and retry |
| `SYNTHESIS_ERROR` | 500 | Failed to generate audio | Retry |
| `INVALID_TEXT` | 400 | Text is empty or invalid | Fix input |
| `VOICE_NOT_FOUND` | 404 | Requested voice doesn't exist | Use default |
| `UNAUTHORIZED` | 401 | Invalid or missing auth token | Re-authenticate |
| `RATE_LIMITED` | 429 | Too many requests | Back off |
| `TIMEOUT` | 408 | Operation timed out | Retry |
| `CANCELLED` | 499 | Client cancelled request | N/A |
### 5.3 Session Management
#### 5.3.1 Session Lifecycle
```
DISCONNECTED -> CONNECTING -> READY -> STARTED -> SPEAKING -> READY -> ... -> STOPPED
| | | |
v v v v
ERROR ERROR ERROR STOPPED
```
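The lifecycle can be modeled as a small transition function. The sketch below is a literal reading of the happy path in the diagram (including the return from SPEAKING to READY); the event names are hypothetical, and allowing stop and failure from every connected, non-terminal state is an assumption that may be broader than the arrows drawn above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionState {
    Disconnected,
    Connecting,
    Ready,
    Started,
    Speaking,
    Stopped,
    Error,
}

/// Hypothetical events that drive the lifecycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionEvent {
    Connect,
    ReadySent,
    Start,
    Speak,
    Complete,
    Stop,
    Fail,
}

/// Returns the next state, or `None` if the event is not valid in `state`.
pub fn next_state(state: SessionState, event: SessionEvent) -> Option<SessionState> {
    use SessionEvent::*;
    use SessionState::*;
    match (state, event) {
        // Terminal states accept nothing further.
        (Stopped, _) | (Error, _) => None,
        // Happy path, as drawn in the diagram.
        (Disconnected, Connect) => Some(Connecting),
        (Connecting, ReadySent) => Some(Ready),
        (Ready, Start) => Some(Started),
        (Started, Speak) => Some(Speaking),
        (Speaking, Complete) => Some(Ready),
        // Nothing else is valid before a connection exists.
        (Disconnected, _) => None,
        // ASSUMPTION: stop and failure are allowed from any connected state.
        (_, Stop) => Some(Stopped),
        (_, Fail) => Some(Error),
        _ => None,
    }
}
```

Returning `Option` lets the handler reject out-of-order messages, such as a `speak` before `start`, without panicking.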
---
## 6. Voice Asset Requirements
### 6.1 Makima Voice Clip Specifications
#### 6.1.1 Audio Requirements
| Specification | Requirement |
|---------------|-------------|
| Duration | 5-10 seconds (minimum 3s) |
| Format | WAV (PCM) |
| Sample Rate | 24,000 Hz or higher |
| Bit Depth | 16-bit or higher |
| Channels | Mono (preferred) or Stereo |
| Content | Clear speech, natural tone |
| Background | Minimal noise/music |
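The numeric rows of this table can be checked automatically once a clip's header has been read. The sketch below assumes the WAV header fields are already parsed into a hypothetical `ClipInfo` struct; it does not read files, and the content and background-noise rows still need a human listener.

```rust
/// Header fields of a reference clip (assumed to be parsed elsewhere).
#[derive(Debug, Clone, Copy)]
pub struct ClipInfo {
    pub sample_rate: u32,
    pub bits_per_sample: u16,
    pub channels: u16,
    pub num_frames: u64, // samples per channel
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClipIssue {
    TooShort,                 // under the 3s minimum
    OutsidePreferredDuration, // valid, but not within 5-10s
    SampleRateTooLow,         // below 24,000 Hz
    BitDepthTooLow,           // below 16-bit
    UnsupportedChannels,      // neither mono nor stereo
}

/// Duration in seconds: frames divided by sample rate.
pub fn clip_duration_secs(info: &ClipInfo) -> f64 {
    if info.sample_rate == 0 {
        return 0.0;
    }
    info.num_frames as f64 / info.sample_rate as f64
}

/// Collect every requirement from 6.1.1 that the clip fails.
pub fn check_clip(info: &ClipInfo) -> Vec<ClipIssue> {
    let mut issues = Vec::new();
    let secs = clip_duration_secs(info);
    if secs < 3.0 {
        issues.push(ClipIssue::TooShort);
    } else if !(5.0..=10.0).contains(&secs) {
        issues.push(ClipIssue::OutsidePreferredDuration);
    }
    if info.sample_rate < 24_000 {
        issues.push(ClipIssue::SampleRateTooLow);
    }
    if info.bits_per_sample < 16 {
        issues.push(ClipIssue::BitDepthTooLow);
    }
    if !(1..=2).contains(&info.channels) {
        issues.push(ClipIssue::UnsupportedChannels);
    }
    issues
}
```

A 6-second, 24 kHz, 16-bit mono clip passes with no issues, while a 4-second clip is accepted but flagged as outside the preferred 5-10 second range.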
#### 6.1.2 Content Guidelines
**DO:**
- Use dialogue with varied intonation
- Include multiple phonemes
- Capture natural speaking rhythm
- Extract from clean audio scenes
**DON'T:**
- Include background music
- Use shouting or whispering
- Include sound effects
- Use heavily processed audio
#### 6.1.3 Transcript Requirements
| Specification | Requirement |
|---------------|-------------|
| Format | Plain text (.txt) or JSON |
| Encoding | UTF-8 |
| Content | Exact transcription of audio |
| Language | Japanese (for Japanese reference) |
### 6.2 Storage Location and Management
#### 6.2.1 Directory Structure
```
models/
└── voices/
    ├── makima/
    │   ├── reference.wav       # Primary reference audio
    │   ├── transcript.txt      # Plain text transcript
    │   ├── transcript.json     # Structured transcript (optional)
    │   └── metadata.json       # Voice metadata
    ├── makima-alt/             # Alternative Makima clips (future)
    │   └── ...
    └── custom/                 # User-uploaded voices (future)
        └── {voice_id}/
            ├── reference.wav
            ├── transcript.txt
            └── metadata.json
```
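Resolving a voice ID to files in this layout might look like the sketch below. The function and struct names are hypothetical, and the restriction of voice IDs to ASCII letters, digits, `-`, and `_` is an assumption added so that a client-supplied ID cannot escape the voices directory.

```rust
use std::path::{Path, PathBuf};

/// Paths to the required files for one voice.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VoiceAssetPaths {
    pub reference: PathBuf,
    pub transcript: PathBuf,
    pub metadata: PathBuf,
}

/// ASSUMPTION: voice IDs are limited to ASCII alphanumerics, '-' and '_'.
/// This rejects IDs such as "../secrets" before any path is built.
pub fn is_safe_voice_id(id: &str) -> bool {
    !id.is_empty()
        && id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Built-in voices live in `<voices_dir>/<id>/`,
/// user uploads in `<voices_dir>/custom/<id>/`.
pub fn voice_asset_paths(
    voices_dir: &Path,
    voice_id: &str,
    custom: bool,
) -> Option<VoiceAssetPaths> {
    if !is_safe_voice_id(voice_id) {
        return None;
    }
    let dir = if custom {
        voices_dir.join("custom").join(voice_id)
    } else {
        voices_dir.join(voice_id)
    };
    Some(VoiceAssetPaths {
        reference: dir.join("reference.wav"),
        transcript: dir.join("transcript.txt"),
        metadata: dir.join("metadata.json"),
    })
}
```

For instance, `voice_asset_paths(Path::new("models/voices"), "makima", false)` resolves the reference clip to `models/voices/makima/reference.wav`.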
---
## 7. Testing Strategy
### 7.1 Unit Tests
#### 7.1.1 Rust TTS Module Tests
```rust
// makima/src/tts/qwen3/tests.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_config_loading() {
        let config = Qwen3Config::from_json("test_config.json").unwrap();
        assert_eq!(config.hidden_size, 1024);
        assert_eq!(config.num_layers, 28);
    }

    // The cache API is async, so this test runs on the tokio runtime.
    // `test_prompt` is a placeholder for a helper that builds a dummy VoicePrompt.
    #[tokio::test]
    async fn test_voice_prompt_cache_lru() {
        let cache = VoicePromptCache::new(2);
        cache.insert("a".to_string(), test_prompt("a")).await;
        cache.insert("b".to_string(), test_prompt("b")).await;
        cache.get("a").await; // touch "a" so "b" becomes least recently used
        cache.insert("c".to_string(), test_prompt("c")).await; // should evict "b"
        assert!(cache.get("a").await.is_some());
        assert!(cache.get("b").await.is_none());
        assert!(cache.get("c").await.is_some());
    }

    // Parsing is synchronous, so a plain #[test] is sufficient.
    #[test]
    fn test_speak_handler_message_parsing() {
        let json = r#"{"type": "start", "voice": "makima"}"#;
        let msg: SpeakClientMessage = serde_json::from_str(json).unwrap();
        match msg {
            SpeakClientMessage::Start(start) => {
                assert_eq!(start.voice, Some("makima".to_string()));
            }
            _ => panic!("Expected Start message"),
        }
    }
}
```
### 7.2 Integration Tests
```rust
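// tests/tts_integration.rs
use serde_json::{json, Value};

// `create_test_state_with_tts`, `connect_ws`, and `WsMsg` are test-helper
// placeholders; `connect_ws` is assumed to serve `app` and open a client socket.
#[tokio::test]
async fn test_speak_websocket_flow() {
    // Start test server with TTS enabled
    let state = create_test_state_with_tts().await;
    let app = make_router(state);

    // Connect WebSocket; the server sends `ready` as soon as the socket opens
    let mut ws = connect_ws(app, "/api/v1/speak").await;
    let ready = ws.recv_json().await;
    assert_eq!(ready["type"], "ready");

    // Send start and wait for the `started` acknowledgement
    ws.send_json(json!({"type": "start", "voice": "makima"})).await;
    let started = ws.recv_json().await;
    assert_eq!(started["type"], "started");

    // Send speak
    ws.send_json(json!({"type": "speak", "text": "Hello."})).await;

    // Collect audio chunks until `complete` arrives
    let mut chunks = vec![];
    loop {
        let msg = ws.recv().await;
        match msg {
            WsMsg::Binary(data) => chunks.push(data),
            WsMsg::Text(json) => {
                let data: Value = serde_json::from_str(&json).unwrap();
                if data["type"] == "complete" { break; }
            }
            // Ignore ping/pong and other frame types
            _ => {}
        }
    }
    assert!(!chunks.is_empty());
}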
```
### 7.3 Performance Targets
| Metric | Target | Acceptable | Warning |
|--------|--------|------------|---------|
| First Audio (short) | < 150ms | < 200ms | > 300ms |
| First Audio (medium) | < 200ms | < 300ms | > 500ms |
| First Audio (long) | < 300ms | < 500ms | > 800ms |
| Inter-chunk | < 30ms | < 50ms | > 100ms |
| Memory (GPU) | < 4GB | < 6GB | > 8GB |
| Memory (CPU) | < 2GB | < 4GB | > 8GB |
---
## 8. Implementation Phases
### Phase 1: Candle-Based Qwen3-TTS Module (Week 1-2)
**Deliverables:**
- [ ] `makima/src/tts/mod.rs` - TTS trait + factory
- [ ] `makima/src/tts/chatterbox.rs` - Move existing code from tts.rs
- [ ] `makima/src/tts/qwen3/model.rs` - 28-layer LM backbone (extend candle Qwen2)
- [ ] `makima/src/tts/qwen3/code_predictor.rs` - MTP module (5 layers, 16 codebooks)
- [ ] `makima/src/tts/qwen3/speech_tokenizer.rs` - ConvNet encoder/decoder + RVQ
- [ ] `makima/src/tts/qwen3/config.rs` - Model config from config.json
- [ ] `makima/src/tts/qwen3/generate.rs` - Autoregressive generation with KV cache
- [ ] Add `candle-core`, `candle-nn`, `candle-transformers` to Cargo.toml
**Success Criteria:**
- Model loads safetensors weights successfully
- Can generate audio from text via direct inference
- First audio latency < 500ms (initial, unoptimized)
### Phase 2: WebSocket Handler + Voice Assets (Week 2-3)
**Deliverables:**
- [ ] Update `makima/src/server/handlers/speak.rs` — Direct TTS handler (no proxy)
- [ ] Lazy model loading via `SharedState`
- [ ] Voice prompt caching
- [ ] Makima voice asset acquisition and processing
- [ ] Basic error handling and session management
**Success Criteria:**
- `/api/v1/speak` endpoint produces streaming audio
- Default Makima voice works
- Error handling matches specification
### Phase 3: Optimization + Integration (Week 3-4)
**Deliverables:**
- [ ] Streaming audio generation (token-by-token decoding)
- [ ] GPU memory optimization
- [ ] Listen page integration for bidirectional speech
- [ ] Session coordination between STT and TTS
- [ ] Full test suite (unit, integration)
- [ ] Latency benchmarks
**Success Criteria:**
- First audio latency < 200ms
- GPU memory usage < 6GB
- All tests passing
- Documentation complete
---
## Appendix
### A. Dependencies
#### Rust (Cargo.toml additions)
```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
candle-transformers = "0.8"
# Keep existing: tokenizers, hf-hub, ndarray (for compatibility)
```
### B. Environment Variables
```bash
# TTS Configuration
TTS_ENGINE=qwen3 # "qwen3" or "chatterbox"
TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
TTS_DEVICE=cuda:0 # "cuda:0", "metal", or "cpu"
TTS_VOICES_DIR=models/voices
TTS_DEFAULT_VOICE=makima
```
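A loader for these variables could look like the sketch below. The struct and function names are hypothetical, and treating the values shown above as defaults is an assumption; a deployment without a GPU would likely want a different default for `TTS_DEVICE`. The lookup is passed in as a closure so the logic can be tested without touching the process environment.

```rust
/// TTS settings read from the environment (hypothetical struct).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TtsEnvConfig {
    pub engine: String,
    pub model_id: String,
    pub device: String,
    pub voices_dir: String,
    pub default_voice: String,
}

/// Build the config from a lookup function.
/// ASSUMPTION: the example values in this appendix act as defaults.
pub fn load_tts_config<F>(lookup: F) -> Result<TtsEnvConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());

    let engine = get("TTS_ENGINE", "qwen3");
    // Only the two engines named in this spec are accepted.
    if engine != "qwen3" && engine != "chatterbox" {
        return Err(format!("unsupported TTS_ENGINE: {}", engine));
    }

    Ok(TtsEnvConfig {
        engine,
        model_id: get("TTS_MODEL_ID", "Qwen/Qwen3-TTS-12Hz-0.6B-Base"),
        device: get("TTS_DEVICE", "cuda:0"),
        voices_dir: get("TTS_VOICES_DIR", "models/voices"),
        default_voice: get("TTS_DEFAULT_VOICE", "makima"),
    })
}
```

In the server, the real environment would be supplied with `load_tts_config(|key| std::env::var(key).ok())`.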
### C. References
1. [Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
2. [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
3. [Qwen3-TTS Technical Report (arXiv:2601.15621)](https://arxiv.org/abs/2601.15621)
4. [Candle — HuggingFace Rust ML Framework](https://github.com/huggingface/candle)
5. [axum WebSocket Documentation](https://docs.rs/axum/latest/axum/extract/ws/index.html)
6. [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md) — Detailed feasibility analysis
### D. Glossary
| Term | Definition |
|------|------------|
| **TTS** | Text-to-Speech: Converting text input to audio output |
| **STT** | Speech-to-Text: Converting audio input to text output |
| **Voice Cloning** | Creating synthetic speech that mimics a specific speaker |
| **Voice Prompt** | Pre-computed speaker embedding for voice cloning |
| **Candle** | HuggingFace's minimalist Rust ML framework |
| **SafeTensors** | Efficient, safe model weight serialization format |
| **RVQ** | Residual Vector Quantization — multi-codebook audio tokenization |
| **MTP** | Multi-Token Prediction — code predictor generating 16 codebook layers |
| **bf16** | Brain floating-point 16-bit format for GPU computation |