Learn-skills.dev qwen-voice
Use Qwen (DashScope / Bailian) for speech tasks: (1) ASR: speech-to-text transcription of user audio/voice messages (Telegram .ogg Opus, WAV, MP3) using qwen3-asr-flash, optionally with coarse timestamps via chunking; (2) TTS: text-to-speech voice replies using qwen3-tts-flash with a selectable voice (default: Cherry), output as an .ogg voice note for Telegram.
install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/ada20204/qwen-voice/qwen-voice" ~/.claude/skills/neversight-learn-skills-dev-qwen-voice && rm -rf "$T"
manifest:
data/skills-md/ada20204/qwen-voice/qwen-voice/SKILL.md
Qwen Voice (ASR + TTS)
Use the bundled scripts. Prefer the DASHSCOPE_API_KEY environment variable; if it is unset, the scripts attempt to read it from ~/.bashrc.
ASR (speech → text)
Non-timestamp (default)
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg
With timestamps (chunk-based)
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg --timestamps --chunk-sec 3
Notes:
- Timestamps are generated by fixed-length chunking (not word-level alignment).
- Input audio is converted to mono 16kHz WAV before sending.
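The two notes above can be illustrated with a minimal sketch. The internals of qwen_asr.py are not shown here, so this is an assumption about how the preprocessing and coarse timestamping plausibly work; `ffmpeg_cmd` and `chunk_spans` are hypothetical names:

```python
def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    # Normalize any input (e.g. a Telegram .ogg Opus file) to mono 16 kHz WAV.
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]

def chunk_spans(duration_sec: float, chunk_sec: float = 3.0) -> list[tuple[float, float]]:
    # Fixed-length chunk boundaries: each chunk is transcribed separately and
    # its transcript labeled with this coarse [start, end] span -- hence
    # "timestamps", but not word-level alignment.
    spans, t = [], 0.0
    while t < duration_sec:
        spans.append((t, min(t + chunk_sec, duration_sec)))
        t += chunk_sec
    return spans
```

With `--chunk-sec 3`, a 7.5 s file would yield spans (0, 3), (3, 6), (6, 7.5), the last chunk being shorter.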
TTS (text → speech)
Preset voice (default: Cherry)
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好,我是 Pi。' --voice Cherry --out /tmp/out.ogg
Clone voice (create once, reuse)
- Create a voice profile from a sample audio:
python3 skills/qwen-voice/scripts/qwen_voice_clone.py --in ./voice_sample.ogg --name george --out work/qwen-voice/george.voice.json
- Use the cloned voice to synthesize:
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好,我是 George。' --voice-profile work/qwen-voice/george.voice.json --out /tmp/out.ogg
Notes:
- Output .ogg is Opus, suitable for Telegram voice messages.
- Voice cloning uses the DashScope customization endpoint plus the Qwen realtime TTS model.
- Scripts use a local venv at work/venv-dashscope (auto-created on first run).
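The venv auto-creation step might look roughly like this; a sketch assuming a POSIX layout, with `ensure_venv` as a hypothetical name (the bundled scripts presumably also install the DashScope SDK into the venv, which is omitted here):

```python
import venv
from pathlib import Path

def ensure_venv(path: str = "work/venv-dashscope", with_pip: bool = True) -> Path:
    # Create the local venv on first run, then reuse it on later runs.
    p = Path(path)
    py = p / "bin" / "python"  # POSIX layout
    if not py.exists():
        venv.create(p, with_pip=with_pip)
        # The real scripts would then install dependencies into it, e.g.
        # <venv python> -m pip install dashscope
    return py
```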
Typical chat workflow
- When the user sends a voice message or audio file: run ASR and reply with the transcribed text.
- When the user explicitly asks for a voice reply: run TTS and send the generated .ogg as a voice note.
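The two rules above amount to a small dispatcher. In this sketch the keyword check is only an illustrative stand-in for whatever intent detection the assistant actually uses, and `choose_action` is a hypothetical name:

```python
def choose_action(has_audio: bool, text: str = "") -> str:
    # Voice/audio input -> transcribe it (ASR) and reply in text.
    if has_audio:
        return "asr"
    # Explicit request for a spoken reply -> synthesize a .ogg voice note (TTS).
    if "voice" in text.lower():
        return "tts"
    # Everything else -> a normal text reply.
    return "text"
```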