# Qwen3-TTS Integration Specification

**Version:** 2.0
**Date:** 2026-01-27
**Status:** Draft
**Author:** Makima Engineering

## Table of Contents

1. [Overview](#1-overview)
2. [Functional Requirements](#2-functional-requirements)
3. [Non-Functional Requirements](#3-non-functional-requirements)
4. [Architecture Specification](#4-architecture-specification)
5. [API Contract](#5-api-contract)
6. [Voice Asset Requirements](#6-voice-asset-requirements)
7. [Testing Strategy](#7-testing-strategy)
8. [Implementation Phases](#8-implementation-phases)
9. [Appendix](#appendix)

---

## 1. Overview

### 1.1 Purpose

This specification defines the integration of Qwen3-TTS-12Hz-0.6B-Base as a replacement for the existing Chatterbox-Turbo TTS implementation in the makima system. The new implementation is a **pure Rust** solution using the **candle** ML framework — no Python, no separate microservice. The TTS model runs directly inside the main makima process. The implementation will provide:

- **Streaming TTS** with near-real-time audio synthesis
- **Voice cloning** with default Makima voice (Japanese voice actress speaking English)
- **Bidirectional speech integration** for the Listen page
- **WebSocket-based streaming API** for low-latency delivery

### 1.2 Background

The current TTS implementation (Chatterbox-Turbo-ONNX) has limitations:
- No streaming support (batch-only generation)
- No HTTP/WebSocket endpoint exposed
- High latency for interactive use cases

Qwen3-TTS offers significant improvements:
- Reported end-to-end synthesis latency as low as **97ms** (vs. batch-only generation)
- **10 languages** supported, including Japanese, with cross-lingual cloning
- Voice cloning from as little as **3 seconds** of reference audio
- **Dual-track streaming architecture**

### 1.3 Scope

This specification covers:
- WebSocket endpoint `/api/v1/speak` for streaming TTS
- Pure Rust candle-based model inference running in-process
- Voice asset management
- Testing and benchmarking

Out of scope:
- ONNX export of Qwen3-TTS (not available)
- Instruction-following TTS features (base model only)
- Full replacement of STT/Listen functionality

---

## 2. Functional Requirements

### 2.1 WebSocket Endpoint: `/api/v1/speak`

The TTS service SHALL be exposed via a WebSocket endpoint at `/api/v1/speak` for streaming audio synthesis.

#### 2.1.1 Connection Flow

```
Client                           Server (Rust/Axum)
   |                                  |
   |--- WS Connect ------------------>|
   |                                  | [Load TTS model lazily if needed]
   |<-- Ready (session_id) -----------|
   |                                  |
   |--- Start (config) -------------->|
   |<-- Started ----------------------| [Load voice prompt]
   |                                  |
   |--- Speak (text) ---------------->| [Direct candle model inference]
   |                                  |
   |<-- AudioChunk (binary) ----------|
   |<-- AudioChunk (binary) ----------|
   |<-- Complete ---------------------|
   |                                  |
   |--- Stop ------------------------>|
   |<-- Stopped ----------------------|
   |                                  |
```

#### 2.1.2 Voice Cloning

The system SHALL support voice cloning with the following modes:

| Mode | Description | Requirements |
|------|-------------|--------------|
| Default (Makima) | Pre-loaded Makima voice | None (auto-selected) |
| Custom Voice | User-provided reference | Audio file + transcript |
| X-Vector Only | Speaker embedding only | Audio file (no transcript) |

**Default Voice Behavior:**
- If no voice is specified, use the pre-loaded Makima voice prompt
- Makima voice SHALL be a Japanese voice actress (Tomori Kusunoki) speaking English
- Voice prompt is pre-computed at model load time for zero-latency switching

#### 2.1.3 Message Protocol

##### Client-to-Server Messages

All messages use JSON format with a `type` field for routing.

**Start Message** - Initialize TTS session
```json
{
  "type": "start",
  "sampleRate": 24000,
  "encoding": "pcm16",
  "voice": "makima",
  "language": "English",
  "authToken": "optional-jwt-token",
  "contractId": "optional-contract-uuid"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Must be "start" |
| `sampleRate` | number | No | Output sample rate (default: 24000) |
| `encoding` | string | No | Audio encoding: "pcm16", "pcm32f" (default: "pcm16") |
| `voice` | string | No | Voice ID or "makima" (default: "makima") |
| `language` | string | No | Output language (default: "English") |
| `authToken` | string | No | JWT for authenticated sessions |
| `contractId` | string | No | Contract ID for context |
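
The defaults in the table above could be applied on the server roughly as follows. This is a minimal sketch: `SessionConfig`, `Encoding`, and `resolve_start` are hypothetical names, and rejecting unknown encodings with an error is an assumption rather than specified behavior.

```rust
/// Audio encodings named in the Start message table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Pcm16,
    Pcm32f,
}

/// Resolved session settings after defaults are applied (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionConfig {
    pub sample_rate: u32,
    pub encoding: Encoding,
    pub voice: String,
    pub language: String,
}

/// Fill in the documented default for every optional field that was omitted.
pub fn resolve_start(
    sample_rate: Option<u32>,
    encoding: Option<&str>,
    voice: Option<&str>,
    language: Option<&str>,
) -> Result<SessionConfig, String> {
    // Assumption: an unknown encoding is an error instead of silently falling back.
    let encoding = match encoding.unwrap_or("pcm16") {
        "pcm16" => Encoding::Pcm16,
        "pcm32f" => Encoding::Pcm32f,
        other => return Err(format!("unsupported encoding: {other}")),
    };
    Ok(SessionConfig {
        sample_rate: sample_rate.unwrap_or(24_000),
        encoding,
        voice: voice.unwrap_or("makima").to_string(),
        language: language.unwrap_or("English").to_string(),
    })
}
```

A Start message that carries only `"type": "start"` would resolve to 24,000 Hz, `pcm16`, the `makima` voice, and English output.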

**Speak Message** - Request speech synthesis
```json
{
  "type": "speak",
  "text": "Hello, I am Makima.",
  "priority": "normal"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Must be "speak" |
| `text` | string | Yes | Text to synthesize |
| `priority` | string | No | "high" or "normal" (default: "normal") |

**Stop Message** - End session
```json
{
  "type": "stop",
  "reason": "user_requested"
}
```

**Cancel Message** - Cancel current synthesis
```json
{
  "type": "cancel"
}
```

##### Server-to-Client Messages

**Ready Message** - Session established
```json
{
  "type": "ready",
  "sessionId": "uuid-string",
  "voiceLoaded": "makima",
  "capabilities": {
    "streaming": true,
    "languages": ["English", "Japanese", "Chinese", "Korean", "German", "French", "Russian", "Portuguese", "Spanish", "Italian"]
  }
}
```

**Started Message** - TTS session configured
```json
{
  "type": "started",
  "sampleRate": 24000,
  "encoding": "pcm16",
  "voice": "makima"
}
```

**AudioChunk Message** - Streaming audio data
```json
{
  "type": "audioChunk",
  "data": "<base64-encoded-audio>",
  "sequenceNumber": 1,
  "isFinal": false,
  "timestampMs": 1234567890
}
```

For binary transport (recommended for performance):
- Server MAY send raw binary WebSocket frames
- Binary frames contain PCM audio data directly
- JSON control messages indicate start/end of audio stream
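
To illustrate what a binary frame might carry, the sketch below converts floating-point samples to `pcm16` bytes and derives a chunk's duration from its size. The function names are hypothetical, and little-endian byte order is an assumption because the spec does not state one.

```rust
/// Convert f32 samples in [-1.0, 1.0] to 16-bit PCM bytes.
/// Assumption: little-endian byte order (not stated in the spec).
pub fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &s in samples {
        // Clamp first so out-of-range model output cannot wrap around.
        let clamped = s.clamp(-1.0, 1.0);
        let value = (clamped * i16::MAX as f32).round() as i16;
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

/// Duration in milliseconds of a mono pcm16 chunk of `byte_len` bytes.
pub fn pcm16_duration_ms(byte_len: usize, sample_rate: u32) -> u64 {
    let samples = (byte_len / 2) as u64; // two bytes per mono sample
    samples * 1000 / sample_rate as u64
}
```

At the default 24,000 Hz mono `pcm16` format, one second of audio occupies 48,000 bytes, so a client can size its playback buffer from frame lengths alone.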

**Complete Message** - Synthesis finished
```json
{
  "type": "complete",
  "durationMs": 1500,
  "charactersProcessed": 25,
  "audioLengthMs": 2100
}
```

**Error Message** - Error occurred
```json
{
  "type": "error",
  "code": "SYNTHESIS_ERROR",
  "message": "Failed to generate audio",
  "recoverable": true
}
```

**Stopped Message** - Session ended
```json
{
  "type": "stopped",
  "reason": "user_requested"
}
```

#### 2.1.4 Integration with Listen Page

The TTS endpoint SHALL integrate with the existing Listen page (`/api/v1/listen`) to enable bidirectional speech:

**Bidirectional Flow:**
```
User Speech -> /api/v1/listen (STT) -> Transcription
                                          |
                                          v
                              LLM Processing / Task Creation
                                          |
                                          v
Response Text -> /api/v1/speak (TTS) -> Audio -> User
```

**Implementation Requirements:**
1. Both endpoints SHALL support the same `contractId` for context sharing
2. TTS SHALL support interruption when new STT input is detected
3. Session management SHALL coordinate between STT and TTS
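
Requirement 2 (interruption) could be met with a shared flag that the STT side sets and the TTS streaming loop checks between chunks. The sketch below is one possible shape under that assumption; `InterruptFlag` and `stream_chunks` are hypothetical names, and the real handler would be async and send over the WebSocket instead of calling a closure.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared flag that the STT side sets when new user speech is detected.
#[derive(Clone, Default)]
pub struct InterruptFlag(Arc<AtomicBool>);

impl InterruptFlag {
    pub fn interrupt(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_interrupted(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Send chunks in order until all are sent or the flag is raised.
/// Returns the number of chunks actually sent.
pub fn stream_chunks<F>(chunks: &[Vec<u8>], flag: &InterruptFlag, mut send: F) -> usize
where
    F: FnMut(&[u8]),
{
    let mut sent = 0;
    for chunk in chunks {
        // Check before every chunk so an interruption stops playback quickly.
        if flag.is_interrupted() {
            break;
        }
        send(chunk.as_slice());
        sent += 1;
    }
    sent
}
```

Checking the flag once per chunk bounds the interruption delay by the duration of a single chunk, which is why smaller chunks make barge-in feel more responsive.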

### 2.2 Voice Configuration API

#### 2.2.1 List Available Voices

```
GET /api/v1/voices
```

Response:
```json
{
  "voices": [
    {
      "id": "makima",
      "name": "Makima (Default)",
      "language": "Japanese",
      "description": "Default Makima voice (Tomori Kusunoki)",
      "isDefault": true
    }
  ]
}
```

#### 2.2.2 Upload Custom Voice (Future)

```
POST /api/v1/voices
Content-Type: multipart/form-data

audio: <audio-file>
transcript: "Text spoken in the audio"
name: "Custom Voice"
```

---

## 3. Non-Functional Requirements

### 3.1 Latency Requirements

| Metric | Target | Maximum | Notes |
|--------|--------|---------|-------|
| First Audio Byte | < 200ms | 500ms | From text submission to first audio chunk |
| Subsequent Chunks | < 50ms | 100ms | Inter-chunk latency |
| End-to-End Latency | < 300ms | 800ms | Total time for short phrases |
| Voice Prompt Loading | < 500ms | 2000ms | One-time at session start |

**Measurement Points:**
- T0: Client sends "speak" message
- T1: First audio chunk received by client
- T2: Last audio chunk received ("complete" message)
- First Audio Latency = T1 - T0
- Total Latency = T2 - T0
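
The two formulas above translate directly into code. The following sketch assumes the client records all three timestamps against the same clock; `SpeakTimings` is a hypothetical type, and the 500 ms bound is the "Maximum" value for First Audio Byte from the latency table.

```rust
use std::time::Duration;

/// Client-side timestamps for one "speak" request, measured on a common clock.
pub struct SpeakTimings {
    pub t0_speak_sent: Duration,
    pub t1_first_chunk: Duration,
    pub t2_complete: Duration,
}

impl SpeakTimings {
    /// First Audio Latency = T1 - T0.
    pub fn first_audio_latency(&self) -> Duration {
        self.t1_first_chunk.saturating_sub(self.t0_speak_sent)
    }

    /// Total Latency = T2 - T0.
    pub fn total_latency(&self) -> Duration {
        self.t2_complete.saturating_sub(self.t0_speak_sent)
    }

    /// True when first-audio latency is within the 500 ms maximum.
    pub fn within_first_audio_max(&self) -> bool {
        self.first_audio_latency() <= Duration::from_millis(500)
    }
}
```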

### 3.2 Audio Quality Requirements

| Specification | Value |
|---------------|-------|
| Output Sample Rate | 24,000 Hz |
| Bit Depth | 16-bit (PCM16) or 32-bit float |
| Channels | Mono (1 channel) |
| Audio Codec | Raw PCM (WebSocket), WAV (download) |
| Voice Similarity | > 0.90 speaker similarity score |

**Quality Metrics:**
- MOS (Mean Opinion Score): Target > 4.0
- Speaker similarity to reference: Target > 0.90
- No audible artifacts or glitches in streaming mode

### 3.3 Hardware Requirements

The TTS model runs directly in the makima process using candle.

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | CUDA-capable, 4GB VRAM (or Metal on macOS) | RTX 3060+ with 8GB+ VRAM |
| GPU Memory | 4GB | 8GB |
| System RAM | 8GB | 16GB |
| Storage | 5GB (model weights) | 10GB |

**GPU Memory Breakdown:**
- Model weights (bf16): ~1.2GB
- Speech tokenizer: ~682MB
- KV cache during inference: ~1-2GB
- Safety margin: ~1GB
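
Summing the breakdown gives a rough VRAM budget. The sketch below does that arithmetic, treating 1 GB as 1000 MB for simplicity; the constant and function names are hypothetical.

```rust
/// GPU memory estimates from the breakdown above, in megabytes (1 GB taken as 1000 MB).
pub const WEIGHTS_MB: u32 = 1200;
pub const SPEECH_TOKENIZER_MB: u32 = 682;
pub const KV_CACHE_MIN_MB: u32 = 1000;
pub const KV_CACHE_MAX_MB: u32 = 2000;
pub const SAFETY_MARGIN_MB: u32 = 1000;

/// Returns the (low, high) estimate of total VRAM use in megabytes.
pub fn vram_budget_mb() -> (u32, u32) {
    let fixed = WEIGHTS_MB + SPEECH_TOKENIZER_MB + SAFETY_MARGIN_MB;
    (fixed + KV_CACHE_MIN_MB, fixed + KV_CACHE_MAX_MB)
}
```

The totals come to roughly 3.9 GB and 4.9 GB. With the larger KV-cache estimate the budget exceeds the 4GB minimum, so a 4GB card would depend on the lower end of the KV-cache range or a reduced safety margin.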

**CPU Fallback:**
- candle can run on CPU (optionally MKL-accelerated) on systems without a GPU
- CPU inference remains functional, but latency will be higher and the targets in section 3.1 may not be met

### 3.4 Scalability Requirements

| Metric | Target |
|--------|--------|
| Concurrent Sessions | 10 per GPU |
| Requests per Second | 50 text-to-speech requests |
| Audio Throughput | 10 hours of synthesized audio per wall-clock hour (10x real time, aggregate) |

### 3.5 Availability Requirements

| Metric | Target |
|--------|--------|
| Service Uptime | 99.5% |
| Recovery Time | < 30 seconds |
| Graceful Degradation | Fall back to batch mode if streaming fails |

---

## 4. Architecture Specification

### 4.1 System Architecture

```
+-------------------------------------------------------------------------+
|                         Client Application                               |
|  +-------------+    +-------------+    +------------------------------+ |
|  |   Listen    |    |   Speak     |    |        UI Components         | |
|  |  (STT UI)   |    |  (TTS UI)   |    |  (Audio Player, Controls)    | |
|  +------+------+    +------+------+    +------------------------------+ |
+---------|--------------------|------------------------------------------+
          | WebSocket          | WebSocket
          | /api/v1/listen     | /api/v1/speak
          |                    |
+---------|--------------------|------------------------------------------+
|         |    Makima Server (Rust/Axum)                                   |
|  +------v--------------------v------+                                    |
|  |        WebSocket Router          |                                    |
|  |   (axum WebSocket handlers)      |                                    |
|  +------+--------------------+------+                                    |
|         |                    |                                           |
|  +------v------+      +------v------+    +-----------------------------+ |
|  |   Listen    |      |   Speak     |    |     Shared State            | |
|  |   Handler   |      |   Handler   |    |  - ML Models (STT)          | |
|  |  (STT/ML)   |      |  (TTS/ML)   |    |  - TTS Model (candle)      | |
|  +-------------+      +------+------+    |  - Voice Prompt Cache       | |
|                              |           |  - Session Manager          | |
|                              |           +-----------------------------+ |
|                       +------v------+                                    |
|                       |  TTS Module |                                    |
|                       |  (candle)   |                                    |
|                       +------+------+                                    |
|                              |                                           |
|                       +------v------+                                    |
|                       | Qwen3-TTS   |                                    |
|                       | Components  |                                    |
|                       | - LM (28L)  |                                    |
|                       | - Code Pred |                                    |
|                       | - Speech Tok|                                    |
|                       +-------------+                                    |
+--------------------------------------------------------------------------+
```

### 4.2 TTS Module Structure

```
makima/src/tts/
├── mod.rs                  // TTS trait + factory (select Chatterbox vs Qwen3)
├── chatterbox.rs           // Existing ONNX-based Chatterbox (moved from tts.rs)
├── qwen3/
│   ├── mod.rs              // Qwen3TTS public API
│   ├── model.rs            // Qwen3 LM transformer (28 layers)
│   ├── code_predictor.rs   // MTP module (5 layers, 16 codebooks)
│   ├── speech_tokenizer.rs // Encoder + Decoder (causal ConvNet)
│   ├── config.rs           // Model config from config.json
│   └── generate.rs         // Autoregressive generation loop with KV cache
```

#### 4.2.1 TTS Trait

```rust
// makima/src/tts/mod.rs

use async_trait::async_trait;

/// Trait for text-to-speech implementations.
#[async_trait]
pub trait TtsEngine: Send + Sync {
    /// Generate audio from text with a given voice prompt.
    async fn generate(
        &self,
        text: &str,
        voice_id: &str,
        language: &str,
    ) -> Result<Vec<AudioChunk>, TtsError>;

    /// Load and cache a voice prompt from reference audio.
    async fn load_voice(&self, voice_id: &str) -> Result<(), TtsError>;

    /// Check if the engine is ready for inference.
    fn is_ready(&self) -> bool;
}

/// Select the appropriate TTS engine based on configuration.
pub fn create_engine(config: &TtsConfig) -> Box<dyn TtsEngine> {
    match config.engine {
        TtsEngineType::Qwen3 => Box::new(qwen3::Qwen3Tts::new(config)),
        TtsEngineType::Chatterbox => Box::new(chatterbox::ChatterboxTts::new(config)),
    }
}
```

#### 4.2.2 Qwen3 Candle Implementation

The Qwen3 module implements the three core model components using candle:

1. **Language Model** — 28-layer transformer using candle-transformers' Qwen2 attention with TTS-specific modifications
2. **Code Predictor** — 5-layer MTP module predicting 16 codebook layers
3. **Speech Tokenizer** — GAN-based codec with Conv1d encoder/decoder

**Key candle features used:**
- `candle_core::Tensor` for all tensor operations
- `candle_nn::Module` for model layers
- `candle_nn::VarBuilder` for loading safetensors weights
- `candle_core::Device` for GPU/CPU selection

#### 4.2.3 Model Loading

Models are loaded lazily on first TTS request, following the pattern established by `listen.rs`:

```rust
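// Models are held in SharedState behind an async once-cell (e.g. `tokio::sync::OnceCell`),
// so they are initialized on first use and shared afterwards.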
pub struct TtsModels {
    pub engine: Box<dyn TtsEngine>,
    pub voice_cache: VoicePromptCache,
}

impl AppState {
    pub async fn get_tts_models(&self) -> Result<&TtsModels, TtsError> {
        self.tts_models.get_or_try_init(|| async {
            // Load safetensors weights via candle
            // Initialize voice cache with default Makima voice
        }).await
    }
}
```

### 4.3 Speak Handler

```rust
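// makima/src/server/handlers/speak.rs

use axum::{
    extract::{
        ws::{WebSocket, WebSocketUpgrade},
        State,
    },
    response::Response,
};
use uuid::Uuid;
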
/// WebSocket handler for TTS streaming.
/// Calls the TTS engine directly — no proxy, no external service.
pub async fn websocket_handler(
    ws: WebSocketUpgrade,
    State(state): State<SharedState>,
) -> Response {
    ws.on_upgrade(|socket| handle_speak_socket(socket, state))
}

async fn handle_speak_socket(socket: WebSocket, state: SharedState) {
    let session_id = Uuid::new_v4().to_string();

    // Get or lazily load TTS models
    let tts = match state.get_tts_models().await {
        Ok(tts) => tts,
        Err(e) => {
            // Send error and close
            return;
        }
    };

    // Handle WebSocket messages directly
    // Parse JSON commands, run inference, stream audio chunks back
}
```

### 4.4 Voice Prompt Caching

Voice prompts are cached in-memory using an LRU cache:

```rust
// makima/src/tts/mod.rs

pub struct VoicePromptCache {
    cache: tokio::sync::Mutex<lru::LruCache<String, VoicePrompt>>,
}

impl VoicePromptCache {
    pub fn new(max_size: usize) -> Self { /* ... */ }
    pub async fn get(&self, voice_id: &str) -> Option<VoicePrompt> { /* ... */ }
    pub async fn insert(&self, voice_id: String, prompt: VoicePrompt) { /* ... */ }
}
```
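
The eviction order the cache relies on can be shown without any dependencies. The sketch below is a std-only stand-in for illustration, not the `lru` crate the struct above wraps; `TinyLru` is a hypothetical name, and its linear scans are only reasonable for the handful of voices expected here.

```rust
use std::collections::VecDeque;

/// Minimal LRU keyed by voice ID; the front of the deque is least recently used.
pub struct TinyLru<V> {
    capacity: usize,
    entries: VecDeque<(String, V)>,
}

impl<V: Clone> TinyLru<V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, entries: VecDeque::with_capacity(capacity) }
    }

    /// Return a clone of the value and mark the entry as most recently used.
    pub fn get(&mut self, key: &str) -> Option<V> {
        let idx = self.entries.iter().position(|(k, _)| k == key)?;
        let entry = self.entries.remove(idx)?;
        let value = entry.1.clone();
        self.entries.push_back(entry);
        Some(value)
    }

    /// Insert or replace a value, evicting the least recently used entry when full.
    pub fn insert(&mut self, key: String, value: V) {
        if let Some(idx) = self.entries.iter().position(|(k, _)| *k == key) {
            self.entries.remove(idx);
        } else if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back((key, value));
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because the read moved `a` to the most recently used position.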

### 4.5 Error Handling and Recovery

#### 4.5.1 Error Categories

| Error Code | Category | Recoverable | Action |
|------------|----------|-------------|--------|
| `MODEL_LOADING` | Initialization | Yes | Wait and retry |
| `SYNTHESIS_ERROR` | Generation | Yes | Retry with same input |
| `INVALID_TEXT` | Input | No | Return error to client |
| `VOICE_NOT_FOUND` | Configuration | No | Fall back to default voice |
| `GPU_OUT_OF_MEMORY` | Resource | Yes | Clear cache, retry on CPU |
| `TIMEOUT` | Inference | Yes | Retry with backoff |
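
One way to keep the `code` and `recoverable` fields of the error message consistent with this table is to derive both from a single enum. The sketch below assumes a hypothetical `TtsErrorCode` type; it is separate from the `TtsError` type used elsewhere in this document.

```rust
/// Error codes from the table above (hypothetical type name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TtsErrorCode {
    ModelLoading,
    SynthesisError,
    InvalidText,
    VoiceNotFound,
    GpuOutOfMemory,
    Timeout,
}

impl TtsErrorCode {
    /// Wire value for the `code` field of the error message.
    pub fn as_str(self) -> &'static str {
        match self {
            TtsErrorCode::ModelLoading => "MODEL_LOADING",
            TtsErrorCode::SynthesisError => "SYNTHESIS_ERROR",
            TtsErrorCode::InvalidText => "INVALID_TEXT",
            TtsErrorCode::VoiceNotFound => "VOICE_NOT_FOUND",
            TtsErrorCode::GpuOutOfMemory => "GPU_OUT_OF_MEMORY",
            TtsErrorCode::Timeout => "TIMEOUT",
        }
    }

    /// Value for the `recoverable` field, following the "Recoverable" column.
    pub fn recoverable(self) -> bool {
        !matches!(
            self,
            TtsErrorCode::InvalidText | TtsErrorCode::VoiceNotFound
        )
    }
}
```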

---

## 5. API Contract

### 5.1 WebSocket Message Formats

#### 5.1.1 Client-to-Server Messages

```typescript
// TypeScript type definitions for client implementation

interface StartMessage {
  type: "start";
  sampleRate?: number;      // Default: 24000
  encoding?: "pcm16" | "pcm32f";  // Default: "pcm16"
  voice?: string;           // Default: "makima"
  language?: string;        // Default: "English"
  authToken?: string;       // JWT for authenticated sessions
  contractId?: string;      // Contract context
}

interface SpeakMessage {
  type: "speak";
  text: string;             // Required: text to synthesize
  priority?: "normal" | "high";  // Default: "normal"
}

interface CancelMessage {
  type: "cancel";
}

interface StopMessage {
  type: "stop";
  reason?: string;
}

type ClientMessage = StartMessage | SpeakMessage | CancelMessage | StopMessage;
```

#### 5.1.2 Server-to-Client Messages

```typescript
interface ReadyMessage {
  type: "ready";
  sessionId: string;
  voiceLoaded: string;
  capabilities: {
    streaming: boolean;
    languages: string[];
  };
}

interface StartedMessage {
  type: "started";
  sampleRate: number;
  encoding: string;
  voice: string;
}

interface AudioChunkMessage {
  type: "audioChunk";
  data: string;           // Base64-encoded PCM audio
  sequenceNumber: number;
  isFinal: boolean;
  timestampMs: number;
}

interface CompleteMessage {
  type: "complete";
  durationMs: number;
  charactersProcessed: number;
  audioLengthMs: number;
}

interface ErrorMessage {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

interface StoppedMessage {
  type: "stopped";
  reason: string;
}

type ServerMessage =
  | ReadyMessage
  | StartedMessage
  | AudioChunkMessage
  | CompleteMessage
  | ErrorMessage
  | StoppedMessage;
```

### 5.2 Error Codes

| Code | HTTP-like | Description | Recovery |
|------|-----------|-------------|----------|
| `MODEL_LOADING` | 503 | Model still loading | Wait and retry |
| `SYNTHESIS_ERROR` | 500 | Failed to generate audio | Retry |
| `INVALID_TEXT` | 400 | Text is empty or invalid | Fix input |
| `VOICE_NOT_FOUND` | 404 | Requested voice doesn't exist | Use default |
| `UNAUTHORIZED` | 401 | Invalid or missing auth token | Re-authenticate |
| `RATE_LIMITED` | 429 | Too many requests | Back off |
| `TIMEOUT` | 408 | Operation timed out | Retry |
| `CANCELLED` | 499 | Client cancelled request | N/A |

### 5.3 Session Management

#### 5.3.1 Session Lifecycle

```
DISCONNECTED -> CONNECTING -> READY -> STARTED -> SPEAKING -> READY -> ... -> STOPPED
                    |            |         |                      |
                    v            v         v                      v
                  ERROR       ERROR     ERROR                  STOPPED
```
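
The lifecycle can be encoded as a small transition function that rejects out-of-order messages. This is a sketch under stated assumptions: the `Event` names are hypothetical, a session is assumed to return to its configured (started) state once synthesis completes or is cancelled, and accepting `Stop` while speaking is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionState {
    Connecting,
    Ready,
    Started,
    Speaking,
    Stopped,
    Error,
}

/// Hypothetical events derived from the protocol messages.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Connected,
    Start,
    Speak,
    Complete,
    Cancel,
    Stop,
    Fail,
}

/// Returns the next state, or `None` when the event is not valid in `state`.
pub fn next_state(state: SessionState, event: Event) -> Option<SessionState> {
    match (state, event) {
        (SessionState::Connecting, Event::Connected) => Some(SessionState::Ready),
        (SessionState::Ready, Event::Start) => Some(SessionState::Started),
        (SessionState::Started, Event::Speak) => Some(SessionState::Speaking),
        // Assumption: the session stays configured after each utterance.
        (SessionState::Speaking, Event::Complete | Event::Cancel) => Some(SessionState::Started),
        (
            SessionState::Ready | SessionState::Started | SessionState::Speaking,
            Event::Stop,
        ) => Some(SessionState::Stopped),
        (
            SessionState::Connecting | SessionState::Ready | SessionState::Started,
            Event::Fail,
        ) => Some(SessionState::Error),
        _ => None,
    }
}
```

A handler built this way can answer any `None` result with an error message instead of silently ignoring the request, for example when a client sends `speak` before `start`.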

---

## 6. Voice Asset Requirements

### 6.1 Makima Voice Clip Specifications

#### 6.1.1 Audio Requirements

| Specification | Requirement |
|---------------|-------------|
| Duration | 5-10 seconds (minimum 3s) |
| Format | WAV (PCM) |
| Sample Rate | 24,000 Hz or higher |
| Bit Depth | 16-bit or higher |
| Channels | Mono (preferred) or Stereo |
| Content | Clear speech, natural tone |
| Background | Minimal noise/music |
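
The measurable rows of this table can be checked automatically before a clip is accepted. The sketch below assumes the WAV header has already been parsed into a hypothetical `ClipInfo` struct; content and background-noise criteria still need manual review.

```rust
/// Properties read from a reference clip's WAV header (hypothetical type).
pub struct ClipInfo {
    pub duration_secs: f32,
    pub sample_rate_hz: u32,
    pub bits_per_sample: u16,
    pub channels: u16,
}

/// Returns every hard requirement the clip violates; empty means acceptable.
pub fn clip_problems(clip: &ClipInfo) -> Vec<&'static str> {
    let mut problems = Vec::new();
    if clip.duration_secs < 3.0 {
        problems.push("duration below the 3 s minimum");
    }
    if clip.sample_rate_hz < 24_000 {
        problems.push("sample rate below 24,000 Hz");
    }
    if clip.bits_per_sample < 16 {
        problems.push("bit depth below 16-bit");
    }
    if clip.channels == 0 || clip.channels > 2 {
        problems.push("expected mono or stereo");
    }
    problems
}

/// True when the clip falls in the recommended 5-10 second range.
pub fn in_recommended_duration(clip: &ClipInfo) -> bool {
    (5.0..=10.0).contains(&clip.duration_secs)
}
```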

#### 6.1.2 Content Guidelines

**DO:**
- Use dialogue with varied intonation
- Include multiple phonemes
- Capture natural speaking rhythm
- Extract from clean audio scenes

**DON'T:**
- Include background music
- Use shouting or whispering
- Include sound effects
- Use heavily processed audio

#### 6.1.3 Transcript Requirements

| Specification | Requirement |
|---------------|-------------|
| Format | Plain text (.txt) or JSON |
| Encoding | UTF-8 |
| Content | Exact transcription of audio |
| Language | Must match the language spoken in the reference audio (e.g. Japanese for a Japanese reference) |

### 6.2 Storage Location and Management

#### 6.2.1 Directory Structure

```
models/
└── voices/
    ├── makima/
    │   ├── reference.wav       # Primary reference audio
    │   ├── transcript.txt      # Plain text transcript
    │   ├── transcript.json     # Structured transcript (optional)
    │   └── metadata.json       # Voice metadata
    ├── makima-alt/             # Alternative Makima clips (future)
    │   └── ...
    └── custom/                 # User-uploaded voices (future)
        └── {voice_id}/
            ├── reference.wav
            ├── transcript.txt
            └── metadata.json
```
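
Resolving a voice ID to its files under this layout might look like the sketch below. `VoiceAssetPaths` and `voice_asset_paths` are hypothetical names; the sketch covers only top-level voices such as `makima` (not the future `custom/` subtree), and restricting IDs to ASCII letters, digits, `-`, and `_` is an assumption made to keep lookups inside the voices directory.

```rust
use std::path::{Path, PathBuf};

/// File locations for one voice (hypothetical type).
pub struct VoiceAssetPaths {
    pub reference: PathBuf,
    pub transcript: PathBuf,
    pub metadata: PathBuf,
}

/// Build the asset paths for `voice_id`, or `None` if the ID is not acceptable.
pub fn voice_asset_paths(voices_dir: &Path, voice_id: &str) -> Option<VoiceAssetPaths> {
    // Assumption: a conservative ID alphabet, so "../" and separators are rejected.
    let valid = !voice_id.is_empty()
        && voice_id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    if !valid {
        return None;
    }
    let dir = voices_dir.join(voice_id);
    Some(VoiceAssetPaths {
        reference: dir.join("reference.wav"),
        transcript: dir.join("transcript.txt"),
        metadata: dir.join("metadata.json"),
    })
}
```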

---

## 7. Testing Strategy

### 7.1 Unit Tests

#### 7.1.1 Rust TTS Module Tests

```rust
// makima/src/tts/qwen3/tests.rs

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_config_loading() {
        let config = Qwen3Config::from_json("test_config.json").unwrap();
        assert_eq!(config.hidden_size, 1024);
        assert_eq!(config.num_layers, 28);
    }

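    #[tokio::test]
    async fn test_voice_prompt_cache_lru() {
        // prompt_a, prompt_b, prompt_c: VoicePrompt fixtures provided by the test setup
        let cache = VoicePromptCache::new(2);
        cache.insert("a".to_string(), prompt_a).await;
        cache.insert("b".to_string(), prompt_b).await;
        cache.get("a").await; // access a so that b becomes least recently used
        cache.insert("c".to_string(), prompt_c).await; // should evict b

        assert!(cache.get("a").await.is_some());
        assert!(cache.get("b").await.is_none());
        assert!(cache.get("c").await.is_some());
    }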

    #[tokio::test]
    async fn test_speak_handler_message_parsing() {
        let json = r#"{"type": "start", "voice": "makima"}"#;
        let msg: SpeakClientMessage = serde_json::from_str(json).unwrap();

        match msg {
            SpeakClientMessage::Start(start) => {
                assert_eq!(start.voice, Some("makima".to_string()));
            }
            _ => panic!("Expected Start message"),
        }
    }
}
```

### 7.2 Integration Tests

```rust
// tests/tts_integration.rs

#[tokio::test]
async fn test_speak_websocket_flow() {
    // Start test server with TTS enabled
    let state = create_test_state_with_tts().await;
    let app = make_router(state);

    // Connect WebSocket
    let ws = connect_ws("/api/v1/speak").await;

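    // The server sends "ready" as soon as the connection is established
    let ready = ws.recv_json().await;
    assert_eq!(ready["type"], "ready");

    // Send start and wait for the session to be configured
    ws.send_json(json!({"type": "start", "voice": "makima"})).await;
    let started = ws.recv_json().await;
    assert_eq!(started["type"], "started");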

    // Send speak
    ws.send_json(json!({"type": "speak", "text": "Hello."})).await;

    // Collect audio chunks
    let mut chunks = vec![];
    loop {
        let msg = ws.recv().await;
        match msg {
            WsMsg::Binary(data) => chunks.push(data),
            WsMsg::Text(json) => {
                let data: Value = serde_json::from_str(&json).unwrap();
                if data["type"] == "complete" { break; }
            }
        }
    }
    assert!(!chunks.is_empty());
}
```

### 7.3 Performance Targets

| Metric | Target | Acceptable | Warning |
|--------|--------|------------|---------|
| First Audio (short) | < 150ms | < 200ms | > 300ms |
| First Audio (medium) | < 200ms | < 300ms | > 500ms |
| First Audio (long) | < 300ms | < 500ms | > 800ms |
| Inter-chunk | < 30ms | < 50ms | > 100ms |
| Memory (GPU) | < 4GB | < 6GB | > 8GB |
| Memory (CPU) | < 2GB | < 4GB | > 8GB |

---

## 8. Implementation Phases

### Phase 1: Candle-Based Qwen3-TTS Module (Week 1-2)

**Deliverables:**
- [ ] `makima/src/tts/mod.rs` — TTS trait + factory
- [ ] `makima/src/tts/chatterbox.rs` — Move existing code from tts.rs
- [ ] `makima/src/tts/qwen3/model.rs` — 28-layer LM backbone (extend candle Qwen2)
- [ ] `makima/src/tts/qwen3/code_predictor.rs` — MTP module (5 layers, 16 codebooks)
- [ ] `makima/src/tts/qwen3/speech_tokenizer.rs` — ConvNet encoder/decoder + RVQ
- [ ] `makima/src/tts/qwen3/config.rs` — Model config from config.json
- [ ] `makima/src/tts/qwen3/generate.rs` — Autoregressive generation with KV cache
- [ ] Add `candle-core`, `candle-nn`, `candle-transformers` to Cargo.toml

**Success Criteria:**
- Model loads safetensors weights successfully
- Can generate audio from text via direct inference
- First audio latency < 500ms (initial, unoptimized)

### Phase 2: WebSocket Handler + Voice Assets (Week 2-3)

**Deliverables:**
- [ ] Update `makima/src/server/handlers/speak.rs` — Direct TTS handler (no proxy)
- [ ] Lazy model loading via `SharedState`
- [ ] Voice prompt caching
- [ ] Makima voice asset acquisition and processing
- [ ] Basic error handling and session management

**Success Criteria:**
- `/api/v1/speak` endpoint produces streaming audio
- Default Makima voice works
- Error handling matches specification

### Phase 3: Optimization + Integration (Week 3-4)

**Deliverables:**
- [ ] Streaming audio generation (token-by-token decoding)
- [ ] GPU memory optimization
- [ ] Listen page integration for bidirectional speech
- [ ] Session coordination between STT and TTS
- [ ] Full test suite (unit, integration)
- [ ] Latency benchmarks

**Success Criteria:**
- First audio latency < 200ms
- Memory usage < 6GB
- All tests passing
- Documentation complete

---

## Appendix

### A. Dependencies

#### Rust (Cargo.toml additions)

```toml
[dependencies]
candle-core = "0.8"
candle-nn = "0.8"
candle-transformers = "0.8"
# Keep existing: tokenizers, hf-hub, ndarray (for compatibility)
```

### B. Environment Variables

```bash
# TTS Configuration
TTS_ENGINE=qwen3                    # "qwen3" or "chatterbox"
TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
TTS_DEVICE=cuda:0                   # "cuda:0", "metal", or "cpu"
TTS_VOICES_DIR=models/voices
TTS_DEFAULT_VOICE=makima
```
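
Reading these variables at startup could be sketched as follows. `TtsEnv` and `tts_env_from` are hypothetical names, and using the example values above as fallbacks when a variable is unset is an assumption; the spec does not define defaults. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// Resolved TTS settings (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TtsEnv {
    pub engine: String,
    pub model_id: String,
    pub device: String,
    pub voices_dir: String,
    pub default_voice: String,
}

/// Build the settings from a variable lookup such as `|k| std::env::var(k).ok()`.
pub fn tts_env_from<F>(lookup: F) -> Result<TtsEnv, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: unset variables fall back to the example values shown above.
    let get = |key: &str, fallback: &str| lookup(key).unwrap_or_else(|| fallback.to_string());

    let engine = get("TTS_ENGINE", "qwen3");
    if engine != "qwen3" && engine != "chatterbox" {
        return Err(format!("unknown TTS_ENGINE: {engine}"));
    }
    Ok(TtsEnv {
        engine,
        model_id: get("TTS_MODEL_ID", "Qwen/Qwen3-TTS-12Hz-0.6B-Base"),
        device: get("TTS_DEVICE", "cuda:0"),
        voices_dir: get("TTS_VOICES_DIR", "models/voices"),
        default_voice: get("TTS_DEFAULT_VOICE", "makima"),
    })
}
```

In the server this would be called once as `tts_env_from(|k| std::env::var(k).ok())`, with an `Err` result treated as a startup failure.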

### C. References

1. [Qwen3-TTS-12Hz-0.6B-Base (Hugging Face)](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
2. [Qwen3-TTS GitHub Repository](https://github.com/QwenLM/Qwen3-TTS)
3. [Qwen3-TTS Technical Report (arXiv:2601.15621)](https://arxiv.org/abs/2601.15621)
4. [Candle — HuggingFace Rust ML Framework](https://github.com/huggingface/candle)
5. [axum WebSocket Documentation](https://docs.rs/axum/latest/axum/extract/ws/index.html)
6. [docs/research/rust-native-tts-research.md](../research/rust-native-tts-research.md) — Detailed feasibility analysis

### D. Glossary

| Term | Definition |
|------|------------|
| **TTS** | Text-to-Speech: Converting text input to audio output |
| **STT** | Speech-to-Text: Converting audio input to text output |
| **Voice Cloning** | Creating synthetic speech that mimics a specific speaker |
| **Voice Prompt** | Pre-computed speaker embedding for voice cloning |
| **Candle** | Hugging Face's minimalist Rust ML framework |
| **SafeTensors** | Efficient, safe model weight serialization format |
| **RVQ** | Residual Vector Quantization — multi-codebook audio tokenization |
| **MTP** | Multi-Token Prediction — code predictor generating 16 codebook layers |
| **bf16** | Brain floating-point 16-bit format for GPU computation |