# Makima Voice Reference Audio

This directory contains the voice profile for **Makima**, the default TTS voice that the makima system uses for voice cloning.

## What You Need

A **reference audio clip** (`reference.wav`) of Makima's Japanese voice actress (Tomori Kusunoki) speaking English.

### Requirements

| Property       | Value                                    |
|----------------|------------------------------------------|
| **Filename**   | `reference.wav`                          |
| **Duration**   | 5–15 seconds (10s ideal)                 |
| **Format**     | WAV (PCM)                                |
| **Sample Rate**| 24 kHz (will be resampled if different)  |
| **Channels**   | Mono (1 channel)                         |
| **Bit Depth**  | 16-bit or 32-bit float                   |

### Why These Parameters?

- **5–15 seconds**: Long enough for the TTS model to capture the voice's characteristics, short enough to keep memory use low.
- **24 kHz mono**: Native sample rate of the Qwen3-TTS model. Audio at other rates is resampled automatically, but supplying 24 kHz audio avoids an extra resampling pass and any quality loss it might introduce.
- **Clear speech**: Minimal background noise, no music overlay, and a single speaker only.
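
The requirements above can be checked mechanically. The sketch below is one way to do it, assuming `ffprobe` is installed. `check_reference` is a hypothetical helper, not part of the makima system. It reads the `key=value` lines that `ffprobe -of default=noprint_wrappers=1` prints and compares them with the table.

```bash
# Hypothetical helper (not part of the makima system): reads ffprobe
# "key=value" lines on stdin and checks them against the requirements table.
check_reference() {
  local rate="" channels="" duration="" key value rc=0
  while IFS='=' read -r key value; do
    case "$key" in
      sample_rate) rate="$value" ;;
      channels)    channels="$value" ;;
      duration)    duration="$value" ;;
    esac
  done
  if [ "$rate" != "24000" ]; then
    # Not fatal: other sample rates are resampled automatically.
    echo "warn: sample rate is ${rate} Hz, expect resampling to 24000 Hz"
  fi
  if [ "$channels" != "1" ]; then
    echo "fail: expected 1 channel (mono), got ${channels}"
    rc=1
  fi
  # Duration is fractional, so compare with awk rather than shell integers.
  if ! awk -v d="$duration" 'BEGIN { exit !(d + 0 >= 5 && d + 0 <= 15) }'; then
    echo "fail: duration ${duration}s is outside 5-15 seconds"
    rc=1
  fi
  if [ "$rc" -eq 0 ]; then echo "ok"; fi
  return "$rc"
}

# Runs only when ffprobe and the clip are both present.
if command -v ffprobe >/dev/null 2>&1 && [ -f voices/makima/reference.wav ]; then
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate,channels,duration \
    -of default=noprint_wrappers=1 voices/makima/reference.wav | check_reference
fi
```

A sample rate other than 24 kHz only produces a warning, because the table allows resampling. A wrong channel count or an out-of-range duration makes the helper return a non-zero status.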

## How to Obtain Reference Audio

### Option 1: Record or Find English Speech

The best reference audio is a clean clip of the target voice speaking English. Sources:

- **Anime convention panels or interviews** where the VA speaks English
- **Behind-the-scenes clips** from Chainsaw Man production
- **Fan events or promotional videos** with English speech segments

### Option 2: Extract from YouTube

You can extract audio from YouTube clips. To find candidate clips:

1. Search YouTube for: `"Tomori Kusunoki" english` or `"楠木ともり" english`
2. Look for interview clips, event recordings, or promotional content

**Extraction steps using `yt-dlp` and `ffmpeg`** (run from the directory that contains `voices/`, since the output paths are relative):

```bash
# 1. Download audio from a YouTube clip
yt-dlp -x --audio-format wav -o "raw_audio.%(ext)s" "YOUTUBE_URL_HERE"

# 2. Convert to 24kHz mono WAV, trimming to a 10-second segment
#    Adjust -ss (start time) and -t (duration) as needed
ffmpeg -i raw_audio.wav \
  -ss 00:00:05 -t 00:00:10 \
  -ar 24000 -ac 1 \
  -acodec pcm_s16le \
  voices/makima/reference.wav

# 3. Verify the output
ffprobe -v error -show_entries stream=sample_rate,channels,duration \
  -of default=noprint_wrappers=1 voices/makima/reference.wav
```

### Option 3: Use Any Japanese-Accented English Voice

If you cannot find clips of the specific VA, any clear recording of a Japanese woman speaking English works as a starting point. The cloned voice follows the characteristics of whatever reference audio you supply.

```bash
# Example: record 10 seconds from the default input device with sox (if installed)
sox -d -r 24000 -c 1 -b 16 voices/makima/reference.wav trim 0 10
```

## Tips for Best Quality

1. **Clean audio**: Remove any background music or noise. Use a noise gate or audio editor if needed.
2. **Natural speech**: A conversational tone works better than read-aloud speech, because the model picks up prosody and rhythm from the clip.
3. **Consistent volume**: Normalize the audio to avoid clipping or very quiet segments.
4. **Single speaker**: Only the target voice should be present in the clip.

```bash
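# Normalize loudness with ffmpeg's loudnorm filter:
#   I=-16 (target integrated loudness, LUFS), TP=-1.5 (max true peak, dBTP),
#   LRA=11 (target loudness range, LU)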
ffmpeg -i reference_raw.wav \
  -af "loudnorm=I=-16:TP=-1.5:LRA=11" \
  -ar 24000 -ac 1 -acodec pcm_s16le \
  voices/makima/reference.wav
```
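
To check tip 3 after normalizing, ffmpeg's `volumedetect` filter reports the peak level of a file. The sketch below is an optional check, not part of the makima system. `max_volume_db` and `peak_is_safe` are hypothetical helper names, and treating any peak below 0 dB as safe is an assumed threshold.

```bash
# Hypothetical helper: extracts the peak level (in dB) from ffmpeg
# volumedetect output supplied on stdin.
max_volume_db() {
  sed -n 's/.*max_volume: \(-\{0,1\}[0-9.]*\) dB.*/\1/p' | head -n 1
}

# Hypothetical helper: succeeds when the peak is below 0 dB (assumed
# threshold), meaning no sample reaches full scale.
peak_is_safe() {
  awk -v p="$1" 'BEGIN { exit !(p + 0 < 0) }'
}

# Runs only when ffmpeg and the clip are both present.
if command -v ffmpeg >/dev/null 2>&1 && [ -f voices/makima/reference.wav ]; then
  peak=$(ffmpeg -hide_banner -i voices/makima/reference.wav \
    -af volumedetect -f null - 2>&1 | max_volume_db)
  if [ -z "$peak" ]; then
    echo "could not read a peak level"
  elif peak_is_safe "$peak"; then
    echo "peak ${peak} dB: ok"
  else
    echo "peak ${peak} dB: possible clipping"
  fi
fi
```

`volumedetect` writes its report to stderr, which is why the pipeline redirects `2>&1` before parsing.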

## File Structure

```
voices/makima/
├── manifest.json     # Voice metadata (name, sample rate, backend)
├── reference.wav     # Reference audio clip (YOU PROVIDE THIS)
└── README.md         # This file
```

## Verification

After placing `reference.wav`, verify that the system loads it by checking the server log:

```bash
# The TTS handler will log voice loading on first speak request:
# INFO makima::server::handlers::voice: Loaded voice reference audio
#   voice_id="makima" voice_name="Makima" samples_len=240000 duration_secs=10.0
```
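
The `samples_len` and `duration_secs` fields in that log line are related by the sample rate: 24,000 samples per second for 10 seconds gives 240,000 samples. A minimal sketch of the arithmetic follows, with `expected_samples` as a hypothetical helper name:

```bash
# Hypothetical helper: sample count for a clip of a given rate and length.
expected_samples() {
  # $1 = sample rate in Hz, $2 = duration in seconds (may be fractional)
  awk -v r="$1" -v d="$2" 'BEGIN { printf "%d\n", r * d + 0.5 }'
}

expected_samples 24000 10     # prints 240000, matching samples_len in the log
```

If the logged `samples_len` is far from the rate multiplied by the duration you intended, the clip was probably trimmed or resampled differently than expected.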

If the reference audio is missing, the TTS system still works, but without voice cloning: it falls back to the model's default voice.
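
Because a missing clip does not stop the TTS system, it can go unnoticed. The pre-flight check below is a sketch under stated assumptions, not something the makima system runs itself. `has_wav_header` is a hypothetical helper that only confirms the file exists and begins with the standard `RIFF` and `WAVE` markers.

```bash
# Hypothetical helper: succeeds when the file exists and starts with a
# RIFF/WAVE header (bytes 0-3 are "RIFF", bytes 8-11 are "WAVE").
has_wav_header() {
  [ -f "$1" ] || return 1
  [ "$(head -c 4 "$1")" = "RIFF" ] &&
    [ "$(head -c 12 "$1" | tail -c 4)" = "WAVE" ]
}

if has_wav_header voices/makima/reference.wav; then
  echo "reference.wav found: voice cloning input is in place"
else
  echo "reference.wav missing or not a WAV file: expect the model's default voice"
fi
```

This check does not validate the sample rate, channel count, or duration. It only catches a missing or misnamed file before the first speak request.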