author: soryu <soryu@soryu.co>, 2026-01-28 03:50:45 +0000
committer: GitHub <noreply@github.com>, 2026-01-28 03:50:45 +0000
commit: 9b53f6c6b01da85ef73bd5960b32ec319df0b947
tree: 8c5e9983e1a5e75afab4a7d7a18ba22b75211628 (/voices/makima)
parent: c14192cc8b0e82369c93c1aee615fcc9cfad5911
Replace TTS endpoint with Rust-native Qwen3-TTS (#41)
* chore: fix unused import warnings in qwen3-tts module

- Remove unused import 'IndexOp' in model.rs
- Remove unused import 'DType' in speech_tokenizer.rs
- Add #[allow(dead_code)] to codebook_dim field in RvqCodebook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: add voice loading and selection for TTS cloning

Add voice reference audio loading so the TTS speak handler can perform voice cloning using reference WAV files from the voices/ directory.

- Add voice.rs module: loads manifest.json and reference.wav for a given voice_id, decodes via symphonia, resamples to 24kHz for the TTS engine
- Update speak.rs: resolve voice_id from the speak request (default "makima"), load reference audio, pass it to engine.generate()
- Add voices/makima/README.md with instructions for obtaining reference audio (extraction from YouTube, recording, ffmpeg conversion)
- Graceful fallback: if reference audio is missing, TTS proceeds without voice cloning using the model's default voice

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* [WIP] Heartbeat checkpoint - 2026-01-28 03:49:13 UTC

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Diffstat (limited to 'voices/makima')
-rw-r--r-- voices/makima/README.md | 105
1 files changed, 105 insertions, 0 deletions
diff --git a/voices/makima/README.md b/voices/makima/README.md
new file mode 100644
index 0000000..8553daf
--- /dev/null
+++ b/voices/makima/README.md
@@ -0,0 +1,105 @@
+# Makima Voice Reference Audio
+
+This directory contains the voice profile for **Makima** — the default TTS voice used by the makima system for voice cloning.
+
+## What You Need
+
+A **reference audio clip** (`reference.wav`) of Makima's Japanese voice actress (Tomori Kusunoki) speaking English.
+
+### Requirements
+
+| Property | Value |
+|----------------|------------------------------------------|
+| **Filename** | `reference.wav` |
+| **Duration** | 5–15 seconds (10s ideal) |
+| **Format** | WAV (PCM) |
+| **Sample Rate**| 24 kHz (will be resampled if different) |
+| **Channels** | Mono (1 channel) |
+| **Bit Depth** | 16-bit integer PCM or 32-bit float |
+
+### Why These Parameters?
+
+- **5–15 seconds**: Long enough for the TTS model to capture voice characteristics, short enough to keep memory use in check.
+- **24 kHz mono**: Native sample rate of the Qwen3-TTS model. Audio at other rates will be automatically resampled, but starting at 24 kHz avoids quality loss.
+- **Clear speech**: Minimal background noise, no music overlay. A single speaker only.
+
+## How to Obtain Reference Audio
+
+### Option 1: Record or Find English Speech
+
+The best reference audio is a clean clip of the target voice speaking English. Sources:
+
+- **Anime convention panels or interviews** where the VA speaks English
+- **Behind-the-scenes clips** from Chainsaw Man production
+- **Fan events or promotional videos** with English speech segments
+
+### Option 2: Extract from YouTube
+
+You can extract audio from YouTube clips. To find candidate clips:
+
+1. Search YouTube for: `"Tomori Kusunoki" english` or `"楠木ともり" english`
+2. Look for interview clips, event recordings, or promotional content
+
+**Extraction steps using `yt-dlp` and `ffmpeg`:**
+
+```bash
+# 1. Download audio from a YouTube clip
+yt-dlp -x --audio-format wav -o "raw_audio.%(ext)s" "YOUTUBE_URL_HERE"
+
+# 2. Convert to 24kHz mono WAV, trimming to a 10-second segment
+# Adjust -ss (start time) and -t (duration) as needed
+ffmpeg -i raw_audio.wav \
+ -ss 00:00:05 -t 00:00:10 \
+ -ar 24000 -ac 1 \
+ -acodec pcm_s16le \
+ voices/makima/reference.wav
+
+# 3. Verify the output
+ffprobe -v error -show_entries stream=sample_rate,channels,duration \
+ -of default=noprint_wrappers=1 voices/makima/reference.wav
+```
+
+### Option 3: Use Any Japanese-Accented English Voice
+
+If you cannot find clips of the specific VA, any clear recording of a female Japanese speaker speaking English will work as a starting point. The voice cloning will adapt to the reference audio's characteristics.
+
+```bash
+# Example: record your own reference using sox (if available)
+sox -d -r 24000 -c 1 -b 16 voices/makima/reference.wav trim 0 10
+```
+
+## Tips for Best Quality
+
+1. **Clean audio**: Remove any background music or noise. Use a noise gate or audio editor if needed.
+2. **Natural speech**: A conversational tone works better than scripted reading, because the model captures prosody and rhythm.
+3. **Consistent volume**: Normalize the audio to avoid clipping or very quiet segments.
+4. **Single speaker**: Only the target voice should be present in the clip.
+
+```bash
+# Normalize audio volume with ffmpeg
+ffmpeg -i reference_raw.wav \
+ -af "loudnorm=I=-16:TP=-1.5:LRA=11" \
+ -ar 24000 -ac 1 -acodec pcm_s16le \
+ voices/makima/reference.wav
+```
+
+## File Structure
+
+```
+voices/makima/
+├── manifest.json # Voice metadata (name, sample rate, backend)
+├── reference.wav # Reference audio clip (YOU PROVIDE THIS)
+└── README.md # This file
+```
+
+## Verification
+
+After placing `reference.wav`, send a speak request and check the server log to confirm the system loads it correctly:
+
+```bash
+# The TTS handler will log voice loading on first speak request:
+# INFO makima::server::handlers::voice: Loaded voice reference audio
+# voice_id="makima" voice_name="Makima" samples_len=240000 duration_secs=10.0
+```
+
+If the reference audio is missing, the TTS system will still work but without voice cloning — it will use the model's default voice instead.
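
The requirements table in the README above can be checked before starting the server. The following is a minimal sketch, not part of this commit: the function names `expected_samples`, `check_reference_props`, and `check_reference_file` are hypothetical. It assumes that a stereo clip should be treated as a failure, because the table asks for mono; the actual loader in `voice.rs` may be more lenient. Only `check_reference_file` needs `ffprobe` installed.

```bash
#!/usr/bin/env bash
# Sketch: check a clip against the README's requirements table.
# All function names are hypothetical; they are not part of the makima codebase.

# Number of samples in a clip of whole-second duration at a given rate (default 24 kHz).
expected_samples() {
  local duration_secs="$1" sample_rate="${2:-24000}"
  echo $(( duration_secs * sample_rate ))
}

# Validate sample rate (Hz), channel count, and duration (seconds, may be fractional).
# Prints one line per finding; returns 0 only if the clip is mono and 5-15 s long.
check_reference_props() {
  local rate="$1" channels="$2" duration="$3" status=0
  if [ "$rate" -ne 24000 ]; then
    # Not a failure: the README says other rates are resampled on load.
    echo "note: sample rate is ${rate} Hz; it will be resampled to 24000 Hz"
  fi
  if [ "$channels" -ne 1 ]; then
    # Assumption: stereo is rejected here because the table asks for mono.
    echo "error: expected mono, got ${channels} channels"
    status=1
  fi
  # awk is used because the shell cannot compare fractional numbers.
  if ! awk -v d="$duration" 'BEGIN { exit !(d >= 5 && d <= 15) }'; then
    echo "error: duration ${duration}s is outside the 5-15 s range"
    status=1
  fi
  return "$status"
}

# Read the properties of a real file with ffprobe, then validate them.
check_reference_file() {
  local file="$1" rate channels duration
  local fmt="default=noprint_wrappers=1:nokey=1"
  rate=$(ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate -of "$fmt" "$file")
  channels=$(ffprobe -v error -select_streams a:0 -show_entries stream=channels -of "$fmt" "$file")
  duration=$(ffprobe -v error -select_streams a:0 -show_entries stream=duration -of "$fmt" "$file")
  check_reference_props "$rate" "$channels" "$duration"
}
```

After sourcing the script, `check_reference_file voices/makima/reference.wav` reports any mismatch. `expected_samples 10` prints `240000`, which is the `samples_len` shown in the log example for a 10-second clip at 24 kHz.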