# Makima Voice Reference Audio

This directory contains the voice profile for **Makima**, the default TTS voice that the makima system uses for voice cloning.

## What You Need

A **reference audio clip** (`reference.wav`) of Makima's Japanese voice actress (Tomori Kusunoki) speaking English.

### Requirements

| Property        | Value                                   |
|-----------------|-----------------------------------------|
| **Filename**    | `reference.wav`                         |
| **Duration**    | 5–15 seconds (10s ideal)                |
| **Format**      | WAV (PCM)                               |
| **Sample Rate** | 24 kHz (will be resampled if different) |
| **Channels**    | Mono (1 channel)                        |
| **Bit Depth**   | 16-bit or 32-bit float                  |

### Why These Parameters?

- **5–15 seconds**: Long enough for the TTS model to capture the voice's characteristics, short enough to keep memory use low.
- **24 kHz mono**: Native sample rate of the Qwen3-TTS model. Audio at other rates is resampled automatically, but starting at 24 kHz avoids the quality loss that resampling can introduce.
- **Clear speech**: Minimal background noise, no music overlay. A single speaker only.

## How to Obtain Reference Audio

### Option 1: Record or Find English Speech

The best reference audio is a clean clip of the target voice speaking English. Sources:

- **Anime convention panels or interviews** where the VA speaks English
- **Behind-the-scenes clips** from Chainsaw Man production
- **Fan events or promotional videos** with English speech segments

### Option 2: Extract from YouTube

You can extract audio from YouTube clips. To find potential sources:

1. Search YouTube for: `"Tomori Kusunoki" english` or `"楠木ともり" english`
2. Look for interview clips, event recordings, or promotional content

**Extraction steps using `yt-dlp` and `ffmpeg`:**

```bash
# 1. Download audio from a YouTube clip
yt-dlp -x --audio-format wav -o "raw_audio.%(ext)s" "YOUTUBE_URL_HERE"

# 2. Convert to 24kHz mono WAV, trimming to a 10-second segment
#    Adjust -ss (start time) and -t (duration) as needed
ffmpeg -i raw_audio.wav \
  -ss 00:00:05 -t 00:00:10 \
  -ar 24000 -ac 1 \
  -acodec pcm_s16le \
  voices/makima/reference.wav

# 3. Verify the output
ffprobe -v error -show_entries stream=sample_rate,channels,duration \
  -of default=noprint_wrappers=1 voices/makima/reference.wav
```

### Option 3: Use Any Japanese-Accented English Voice

If you cannot find clips of the specific VA, any clear recording of a female Japanese speaker speaking English will work as a starting point. The cloned voice follows the characteristics of whatever reference audio you provide.

```bash
# Example: record your own reference using sox (if available)
sox -d -r 24000 -c 1 -b 16 voices/makima/reference.wav trim 0 10
```

## Tips for Best Quality

1. **Clean audio**: Remove any background music or noise. Use a noise gate or audio editor if needed.
2. **Natural speech**: A conversational tone works better than read-aloud text. The model captures prosody and rhythm.
3. **Consistent volume**: Normalize the audio to avoid clipping or very quiet segments.
4. **Single speaker**: Only the target voice should be present in the clip.
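To judge whether a clip is clipped or too quiet before normalizing it, its levels can be measured first. The sketch below relies on ffmpeg's `volumedetect` filter, which logs `mean_volume` and `max_volume` in dB. The helper names `parse_max_volume` and `peak_db` are hypothetical and not part of the makima tooling, and the sketch assumes `ffmpeg` and `awk` are on the `PATH`.

```bash
# Hypothetical helpers for inspecting levels; not part of makima itself.

# Read volumedetect log lines on stdin and print the max_volume value in dB.
parse_max_volume() {
  awk '/max_volume:/ { print $(NF-1) }'
}

# Print the peak level of an audio file in dB (0 dB is full scale).
# ffmpeg writes the volumedetect report to stderr, hence the 2>&1.
peak_db() {
  ffmpeg -hide_banner -nostats -i "$1" -af volumedetect -f null - 2>&1 \
    | parse_max_volume
}

# Usage:
#   peak_db reference_raw.wav
```

A peak at or very near 0 dB suggests the recording may be clipped, while a peak far below 0 dB suggests it was recorded quietly and would likely benefit from normalization.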
```bash
# Normalize audio volume with ffmpeg
ffmpeg -i reference_raw.wav \
  -af "loudnorm=I=-16:TP=-1.5:LRA=11" \
  -ar 24000 -ac 1 -acodec pcm_s16le \
  voices/makima/reference.wav
```

## File Structure

```
voices/makima/
├── manifest.json    # Voice metadata (name, sample rate, backend)
├── reference.wav    # Reference audio clip (YOU PROVIDE THIS)
└── README.md        # This file
```

## Verification

After placing `reference.wav`, you can verify the system loads it correctly:

```bash
# The TTS handler will log voice loading on first speak request:
# INFO makima::server::handlers::voice: Loaded voice reference audio
#   voice_id="makima" voice_name="Makima" samples_len=240000 duration_secs=10.0
```

If the reference audio is missing, the TTS system will still work but without voice cloning: it will use the model's default voice instead.
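
The clip can also be checked against the requirements table before the first speak request. The following is a minimal sketch, not part of the makima tooling: the function names `expected_samples`, `duration_ok`, `probe`, and `check_reference` are hypothetical, and it assumes `ffprobe` and `awk` are on the `PATH`. The expected sample count is duration × sample rate, which is consistent with the `samples_len=240000` logged for a 10-second clip at 24 kHz.

```bash
# Hypothetical helpers for checking reference.wav; not part of makima itself.

# Print the sample count for a clip: duration (seconds) x sample rate (Hz).
expected_samples() {
  awk -v d="$1" -v r="$2" 'BEGIN { printf "%d\n", d * r }'
}

# Succeed if a duration in seconds lies within the 5-15 second requirement.
duration_ok() {
  awk -v d="$1" 'BEGIN { exit !(d >= 5 && d <= 15) }'
}

# Print one property (e.g. sample_rate) of the first audio stream of a file.
probe() {
  ffprobe -v error -select_streams a:0 -show_entries "stream=$2" \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Compare a clip with the requirements table; return non-zero on a mismatch.
check_reference() {
  local file="${1:-voices/makima/reference.wav}"
  local rate channels duration
  rate=$(probe "$file" sample_rate) || return 1
  channels=$(probe "$file" channels) || return 1
  duration=$(probe "$file" duration) || return 1

  # A different sample rate is allowed, since the audio will be resampled.
  [ "$rate" = "24000" ] || echo "note: sample rate is ${rate} Hz, not 24000"
  [ "$channels" = "1" ] || { echo "error: expected mono, got ${channels} channels"; return 1; }
  duration_ok "$duration" || { echo "error: duration ${duration}s is outside 5-15s"; return 1; }
  echo "ok: about $(expected_samples "$duration" 24000) samples at 24 kHz"
}

# Usage:
#   check_reference voices/makima/reference.wav
```

The sample-rate check only prints a note rather than failing, because the requirements state that audio at other rates is resampled.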