# Makima Voice Reference Audio
This directory contains the voice profile for **Makima** — the default TTS voice used by the makima system for voice cloning.
## What You Need
A **reference audio clip** (`reference.wav`) of Makima's Japanese voice actress (Tomori Kusunoki) speaking English.
### Requirements
| Property | Value |
|----------------|------------------------------------------|
| **Filename** | `reference.wav` |
| **Duration** | 5–15 seconds (10s ideal) |
| **Format** | WAV (PCM) |
| **Sample Rate**| 24 kHz (will be resampled if different) |
| **Channels** | Mono (1 channel) |
| **Bit Depth** | 16-bit or 32-bit float |
### Why These Parameters?
- **5–15 seconds**: Long enough for the TTS model to capture the voice's characteristics, short enough to keep memory use low.
- **24 kHz mono**: 24 kHz is the native sample rate of the Qwen3-TTS model. Audio at other rates is resampled automatically, but starting at 24 kHz avoids the quality loss an extra resampling pass can introduce.
- **Clear speech**: Minimal background noise, no music overlay. A single speaker only.
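The requirements above can be checked mechanically before a clip is dropped into place. The following is a minimal sketch, not part of the makima tooling: `probe_value`, `check_reference_props`, and `check_reference_file` are hypothetical helper names, and the sketch assumes `ffprobe` is installed. It checks sample rate, channel count, and duration only; bit depth is not checked.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of makima): check a clip against the table above.

# Print one property of the first audio stream, e.g. probe_value sample_rate file.wav
probe_value() {
  ffprobe -v error -select_streams a:0 -show_entries "stream=$1" \
    -of default=noprint_wrappers=1:nokey=1 "$2"
}

# Arguments: sample rate in Hz, channel count, duration in seconds.
check_reference_props() {
  local rate="$1" channels="$2" duration="$3"
  local status=0

  # A different sample rate is allowed (it gets resampled), so only note it.
  if [ "$rate" -ne 24000 ]; then
    echo "note: sample rate is ${rate} Hz; audio will be resampled to 24000 Hz"
  fi
  if [ "$channels" -ne 1 ]; then
    echo "error: expected mono, got ${channels} channels"
    status=1
  fi
  # Duration may be fractional, so compare with awk rather than shell integers.
  if ! awk -v d="$duration" 'BEGIN { exit !(d >= 5 && d <= 15) }'; then
    echo "error: duration ${duration}s is outside 5-15 seconds"
    status=1
  fi
  return "$status"
}

# Read the three properties from a file and validate them.
check_reference_file() {
  local file="$1" rate channels duration
  rate=$(probe_value sample_rate "$file")
  channels=$(probe_value channels "$file")
  duration=$(probe_value duration "$file")
  check_reference_props "$rate" "$channels" "$duration"
}
```

Calling `check_reference_file voices/makima/reference.wav` returns a non-zero status if the clip is not mono or falls outside the 5–15 second window.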
## How to Obtain Reference Audio
### Option 1: Record or Find English Speech
The best reference audio is a clean clip of the target voice speaking English. Sources:
- **Anime convention panels or interviews** where the VA speaks English
- **Behind-the-scenes clips** from Chainsaw Man production
- **Fan events or promotional videos** with English speech segments
### Option 2: Extract from YouTube
You can extract audio from YouTube clips. To find candidate clips:
1. Search YouTube for: `"Tomori Kusunoki" english` or `"楠木ともり" english`
2. Look for interview clips, event recordings, or promotional content
**Extraction steps using `yt-dlp` and `ffmpeg`:**
```bash
# 1. Download audio from a YouTube clip
yt-dlp -x --audio-format wav -o "raw_audio.%(ext)s" "YOUTUBE_URL_HERE"
# 2. Convert to 24kHz mono WAV, trimming to a 10-second segment
# Adjust -ss (start time) and -t (duration) as needed
ffmpeg -i raw_audio.wav \
-ss 00:00:05 -t 00:00:10 \
-ar 24000 -ac 1 \
-acodec pcm_s16le \
voices/makima/reference.wav
# 3. Verify the output
ffprobe -v error -show_entries stream=sample_rate,channels,duration \
-of default=noprint_wrappers=1 voices/makima/reference.wav
```
### Option 3: Use Any Japanese-Accented English Voice
If you cannot find clips of the specific VA, any clear recording of a female Japanese speaker speaking English will work as a starting point. The cloned voice follows the characteristics of whatever reference audio you supply, so the result will resemble that speaker.
```bash
# Example: record 10 seconds from the default input device using sox (if available)
sox -d -r 24000 -c 1 -b 16 voices/makima/reference.wav trim 0 10
```
## Tips for Best Quality
1. **Clean audio**: Remove any background music or noise. Use a noise gate or audio editor if needed.
2. **Natural speech**: A conversational tone works better than read-aloud text, because the model captures prosody and rhythm.
3. **Consistent volume**: Normalize the audio to avoid clipping or very quiet segments.
4. **Single speaker**: Only the target voice should be present in the clip.
```bash
# Normalize audio volume with ffmpeg
ffmpeg -i reference_raw.wav \
-af "loudnorm=I=-16:TP=-1.5:LRA=11" \
-ar 24000 -ac 1 -acodec pcm_s16le \
voices/makima/reference.wav
```
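To confirm that a normalized clip is not clipping, one option is to read the peak level that ffmpeg's `volumedetect` filter reports. This is a sketch under assumptions: `max_volume_db` is a hypothetical helper name, and it relies on the filter printing a `max_volume: <value> dB` line in its log output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pull the peak level (in dB) out of volumedetect's log output.
# Reads ffmpeg's log text on stdin and prints the max_volume value, e.g. "-1.5".
max_volume_db() {
  sed -n 's/.*max_volume: \(-\{0,1\}[0-9.]*\) dB.*/\1/p'
}

# Usage (volumedetect writes to stderr, hence the 2>&1):
#   ffmpeg -hide_banner -i voices/makima/reference.wav \
#     -af volumedetect -f null - 2>&1 | max_volume_db
```

A peak at or very near 0 dB suggests the clip may be clipping. A peak far below 0 dB suggests it is quiet and could benefit from normalization.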
## File Structure
```
voices/makima/
├── manifest.json # Voice metadata (name, sample rate, backend)
├── reference.wav # Reference audio clip (YOU PROVIDE THIS)
└── README.md # This file
```
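A quick way to see which of these files is still missing is a small shell check. This is a sketch only. `missing_voice_files` is a hypothetical helper name, and the file list is taken from the tree above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the expected files that are absent from a voice directory.
missing_voice_files() {
  local dir="$1" name
  for name in manifest.json reference.wav README.md; do
    [ -f "$dir/$name" ] || echo "$name"
  done
}

# Usage: missing_voice_files voices/makima
# Prints nothing when all three files are present.
```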
## Verification
After placing `reference.wav`, you can verify the system loads it correctly:
```bash
# The TTS handler will log voice loading on first speak request:
# INFO makima::server::handlers::voice: Loaded voice reference audio
# voice_id="makima" voice_name="Makima" samples_len=240000 duration_secs=10.0
```
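The `samples_len` value in the example log line is consistent with duration multiplied by sample rate: 10.0 s × 24,000 Hz = 240,000 samples. The sketch below computes the value to expect for a given clip. It assumes `samples_len` counts mono samples at 24 kHz, which matches the example but is not confirmed elsewhere in this README. `expected_samples` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expected sample count for a mono clip.
# Usage: expected_samples DURATION_SECS [SAMPLE_RATE]   (rate defaults to 24000)
expected_samples() {
  # awk handles fractional durations; adding 0.5 rounds to the nearest sample.
  awk -v d="$1" -v r="${2:-24000}" 'BEGIN { printf "%d\n", d * r + 0.5 }'
}

# Usage: expected_samples 10.0
```

If the logged `samples_len` differs noticeably from this figure, the clip was probably trimmed or resampled differently than intended.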
If the reference audio is missing, the TTS system will still work but without voice cloning — it will use the model's default voice instead.