install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/araa47/gemini-stt" ~/.claude/skills/clawdbot-skills-gemini-stt && rm -rf "$T"
manifest:
skills/araa47/gemini-stt/SKILL.mdsource content
Gemini Speech-to-Text Skill
Transcribe audio files using Google's Gemini API or Vertex AI. Default model is
gemini-2.0-flash-lite for fastest transcription.
Authentication (choose one)
Option 1: Vertex AI with Application Default Credentials (Recommended)
gcloud auth application-default login gcloud config set project YOUR_PROJECT_ID
The script will automatically detect and use ADC when available.
Option 2: Direct Gemini API Key
Set
GEMINI_API_KEY in environment (e.g., ~/.env or ~/.clawdbot/.env)
Requirements
- Python 3.10+ (no external dependencies)
- Either GEMINI_API_KEY or gcloud CLI with ADC configured
Supported Formats
/.ogg
(Telegram voice messages).opus.mp3.wav.m4a
Usage
# Auto-detect auth (tries ADC first, then GEMINI_API_KEY) python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg # Force Vertex AI python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex # With a specific model python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro # Vertex AI with specific project and region python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1 # With Clawdbot media python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg
Options
| Option | Description |
|---|---|
| Path to the audio file (required) |
, | Gemini model to use (default: ) |
, | Force use of Vertex AI with ADC |
, | GCP project ID (for Vertex, defaults to gcloud config) |
, | GCP region (for Vertex, default: ) |
Supported Models
Any Gemini model that supports audio input can be used. Recommended models:
| Model | Notes |
|---|---|
| Default. Fastest transcription speed. |
| Fast and cost-effective. |
| Lightweight 2.5 model. |
| Balanced speed and quality. |
| Higher quality, slower. |
| Latest flash model. |
| Latest pro model, best quality. |
See Gemini API Models for the latest list.
How It Works
- Reads the audio file and base64 encodes it
- Auto-detects authentication:
- If ADC is available (gcloud), uses Vertex AI endpoint
- Otherwise, uses GEMINI_API_KEY with direct Gemini API
- Sends to the selected Gemini model with transcription prompt
- Returns the transcribed text
Example Integration
For Clawdbot voice message handling:
# Transcribe incoming voice message TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH") echo "User said: $TRANSCRIPT"
Error Handling
The script exits with code 1 and prints to stderr on:
- No authentication available (neither ADC nor GEMINI_API_KEY)
- File not found
- API errors
- Missing GCP project (when using Vertex)
Notes
- Uses Gemini 2.0 Flash Lite by default for fastest transcription
- No external Python dependencies (uses stdlib only)
- Automatically detects MIME type from file extension
- Prefers Vertex AI with ADC when available (no API key management needed)