| author | soryu <soryu@soryu.co> | 2026-01-28 03:50:45 +0000 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2026-01-28 03:50:45 +0000 |
| commit | 9b53f6c6b01da85ef73bd5960b32ec319df0b947 (patch) | |
| tree | 8c5e9983e1a5e75afab4a7d7a18ba22b75211628 /voices/makima | |
| parent | c14192cc8b0e82369c93c1aee615fcc9cfad5911 (diff) | |
Replace TTS endpoint with Rust-native Qwen3-TTS (#41)
* chore: fix unused import warnings in qwen3-tts module
- Remove unused import 'IndexOp' in model.rs
- Remove unused import 'DType' in speech_tokenizer.rs
- Add #[allow(dead_code)] to codebook_dim field in RvqCodebook
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat: add voice loading and selection for TTS cloning
Add voice reference audio loading so the TTS speak handler can perform
voice cloning using reference WAV files from the voices/ directory.
- Add voice.rs module: loads manifest.json and reference.wav for a given
voice_id, decodes via symphonia, resamples to 24kHz for the TTS engine
- Update speak.rs: resolve voice_id from the speak request (default
"makima"), load reference audio, pass it to engine.generate()
- Add voices/makima/README.md with instructions for obtaining reference
audio (extraction from YouTube, recording, ffmpeg conversion)
- Graceful fallback: if reference audio is missing, TTS proceeds without
voice cloning using the model's default voice
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* [WIP] Heartbeat checkpoint - 2026-01-28 03:49:13 UTC
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Diffstat (limited to 'voices/makima')
| -rw-r--r-- | voices/makima/README.md | 105 |
1 files changed, 105 insertions, 0 deletions
diff --git a/voices/makima/README.md b/voices/makima/README.md
new file mode 100644
index 0000000..8553daf
--- /dev/null
+++ b/voices/makima/README.md
@@ -0,0 +1,105 @@
+# Makima Voice Reference Audio
+
+This directory contains the voice profile for **Makima** — the default TTS voice used by the makima system for voice cloning.
+
+## What You Need
+
+A **reference audio clip** (`reference.wav`) of Makima's Japanese voice actress (Tomori Kusunoki) speaking English.
+
+### Requirements
+
+| Property | Value |
+|----------------|------------------------------------------|
+| **Filename** | `reference.wav` |
+| **Duration** | 5–15 seconds (10s ideal) |
+| **Format** | WAV (PCM) |
+| **Sample Rate**| 24 kHz (will be resampled if different) |
+| **Channels** | Mono (1 channel) |
+| **Bit Depth** | 16-bit or 32-bit float |
+
+### Why These Parameters?
+
+- **5–15 seconds**: Enough for the TTS model to capture voice characteristics without being too long for memory.
+- **24 kHz mono**: Native sample rate of the Qwen3-TTS model. Audio at other rates will be automatically resampled, but starting at 24 kHz avoids quality loss.
+- **Clear speech**: Minimal background noise, no music overlay. A single speaker only.
+
+## How to Obtain Reference Audio
+
+### Option 1: Record or Find English Speech
+
+The best reference audio is a clean clip of the target voice speaking English. Sources:
+
+- **Anime convention panels or interviews** where the VA speaks English
+- **Behind-the-scenes clips** from Chainsaw Man production
+- **Fan events or promotional videos** with English speech segments
+
+### Option 2: Extract from YouTube
+
+You can extract audio from YouTube clips. Here are some potential sources:
+
+1. Search YouTube for: `"Tomori Kusunoki" english` or `"楠木ともり" english`
+2. Look for interview clips, event recordings, or promotional content
+
+**Extraction steps using `yt-dlp` and `ffmpeg`:**
+
+```bash
+# 1. Download audio from a YouTube clip
+yt-dlp -x --audio-format wav -o "raw_audio.%(ext)s" "YOUTUBE_URL_HERE"
+
+# 2. Convert to 24kHz mono WAV, trimming to a 10-second segment
+# Adjust -ss (start time) and -t (duration) as needed
+ffmpeg -i raw_audio.wav \
+  -ss 00:00:05 -t 00:00:10 \
+  -ar 24000 -ac 1 \
+  -acodec pcm_s16le \
+  voices/makima/reference.wav
+
+# 3. Verify the output
+ffprobe -v error -show_entries stream=sample_rate,channels,duration \
+  -of default=noprint_wrappers=1 voices/makima/reference.wav
+```
+
+### Option 3: Use Any Japanese-Accented English Voice
+
+If you cannot find clips of the specific VA, any clear recording of a female Japanese speaker speaking English will work as a starting point. The voice cloning will adapt to the reference audio's characteristics.
+
+```bash
+# Example: record your own reference using sox (if available)
+sox -d -r 24000 -c 1 -b 16 voices/makima/reference.wav trim 0 10
+```
+
+## Tips for Best Quality
+
+1. **Clean audio**: Remove any background music or noise. Use a noise gate or audio editor if needed.
+2. **Natural speech**: Conversational tone works better than reading. The model captures prosody and rhythm.
+3. **Consistent volume**: Normalize the audio to avoid clipping or very quiet segments.
+4. **Single speaker**: Only the target voice should be present in the clip.
+
+```bash
+# Normalize audio volume with ffmpeg
+ffmpeg -i reference_raw.wav \
+  -af "loudnorm=I=-16:TP=-1.5:LRA=11" \
+  -ar 24000 -ac 1 -acodec pcm_s16le \
+  voices/makima/reference.wav
+```
+
+## File Structure
+
+```
+voices/makima/
+├── manifest.json # Voice metadata (name, sample rate, backend)
+├── reference.wav # Reference audio clip (YOU PROVIDE THIS)
+└── README.md # This file
+```
+
+## Verification
+
+After placing `reference.wav`, you can verify the system loads it correctly:
+
+```bash
+# The TTS handler will log voice loading on first speak request:
+# INFO makima::server::handlers::voice: Loaded voice reference audio
+# voice_id="makima" voice_name="Makima" samples_len=240000 duration_secs=10.0
+```
+
+If the reference audio is missing, the TTS system will still work but without voice cloning — it will use the model's default voice instead.
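The requirements table in the added README (5-15 seconds, mono, 16-bit PCM or 32-bit float) lends itself to a mechanical check before a clip is dropped into `voices/makima/`. The following is a minimal sketch in Rust and is not code from this commit: `WavInfo`, `parse_wav_header`, and `check_reference` are hypothetical names, and the commit itself decodes audio via symphonia rather than reading headers by hand. The sketch assumes a plain RIFF/WAVE file whose `fmt ` chunk precedes its `data` chunk; files using the extensible format tag would be reported as an unsupported depth.

```rust
use std::convert::TryInto;

/// Header fields relevant to the README's requirements table (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct WavInfo {
    pub audio_format: u16,    // 1 = integer PCM, 3 = IEEE float
    pub channels: u16,
    pub sample_rate: u32,
    pub bits_per_sample: u16,
    pub data_len: u32,        // size of the "data" chunk in bytes
}

impl WavInfo {
    /// Clip length in seconds, derived from the data chunk size.
    pub fn duration_secs(&self) -> f64 {
        let bytes_per_frame = self.channels as f64 * (self.bits_per_sample as f64 / 8.0);
        self.data_len as f64 / (self.sample_rate as f64 * bytes_per_frame)
    }
}

fn u16_at(b: &[u8], i: usize) -> Option<u16> {
    Some(u16::from_le_bytes(b.get(i..i + 2)?.try_into().ok()?))
}

fn u32_at(b: &[u8], i: usize) -> Option<u32> {
    Some(u32::from_le_bytes(b.get(i..i + 4)?.try_into().ok()?))
}

/// Walk the RIFF chunks and collect the "fmt " fields and the "data" size.
/// Returns None for anything that is not a well-formed WAV header.
pub fn parse_wav_header(bytes: &[u8]) -> Option<WavInfo> {
    if bytes.get(0..4)? != b"RIFF" || bytes.get(8..12)? != b"WAVE" {
        return None;
    }
    let mut pos = 12;
    let mut fmt: Option<(u16, u16, u32, u16)> = None;
    while pos + 8 <= bytes.len() {
        let id = &bytes[pos..pos + 4];
        let size = u32_at(bytes, pos + 4)?;
        let body = pos + 8;
        if id == b"fmt " {
            fmt = Some((
                u16_at(bytes, body)?,      // audio format tag
                u16_at(bytes, body + 2)?,  // channel count
                u32_at(bytes, body + 4)?,  // sample rate
                u16_at(bytes, body + 14)?, // bits per sample
            ));
        } else if id == b"data" {
            let (audio_format, channels, sample_rate, bits_per_sample) = fmt?;
            return Some(WavInfo {
                audio_format,
                channels,
                sample_rate,
                bits_per_sample,
                data_len: size,
            });
        }
        // Chunks are padded to an even number of bytes.
        pos = body + size as usize + (size as usize & 1);
    }
    None
}

/// Compare a header against the README's table and list any mismatches.
/// The sample rate is not checked because the README says other rates are resampled.
pub fn check_reference(info: &WavInfo) -> Vec<String> {
    let mut problems = Vec::new();
    let secs = info.duration_secs();
    if !(5.0..=15.0).contains(&secs) {
        problems.push(format!("duration {:.1}s is outside 5-15s", secs));
    }
    if info.channels != 1 {
        problems.push(format!("expected mono, found {} channels", info.channels));
    }
    if !matches!((info.audio_format, info.bits_per_sample), (1, 16) | (3, 32)) {
        problems.push(format!(
            "expected 16-bit PCM or 32-bit float, found format {} at {} bits",
            info.audio_format, info.bits_per_sample
        ));
    }
    problems
}
```

For a 10-second, 24 kHz, 16-bit mono clip the `data` chunk holds 480,000 bytes (240,000 samples at 2 bytes each), so `duration_secs()` yields 10.0 and `check_reference` returns an empty list. That agrees with the `samples_len=240000 duration_secs=10.0` values in the log line the README describes.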
