Skills gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/araa47/gemini-stt" ~/.claude/skills/clawdbot-skills-gemini-stt && rm -rf "$T"
manifest: skills/araa47/gemini-stt/SKILL.md
source content

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is

gemini-2.0-flash-lite
for fastest transcription.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.

Option 2: Direct Gemini API Key

Set

GEMINI_API_KEY
in environment (e.g.,
~/.env
or
~/.clawdbot/.env
)

Requirements

  • Python 3.10+ (no external dependencies)
  • Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

  • .ogg
    /
    .opus
    (Telegram voice messages)
  • .mp3
  • .wav
  • .m4a

Usage

# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

OptionDescription
<audio_file>
Path to the audio file (required)
--model
,
-m
Gemini model to use (default:
gemini-2.0-flash-lite
)
--vertex
,
-v
Force use of Vertex AI with ADC
--project
,
-p
GCP project ID (for Vertex, defaults to gcloud config)
--region
,
-r
GCP region (for Vertex, default:
us-central1
)

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

ModelNotes
gemini-2.0-flash-lite
Default. Fastest transcription speed.
gemini-2.0-flash
Fast and cost-effective.
gemini-2.5-flash-lite
Lightweight 2.5 model.
gemini-2.5-flash
Balanced speed and quality.
gemini-2.5-pro
Higher quality, slower.
gemini-3-flash-preview
Latest flash model.
gemini-3-pro-preview
Latest pro model, best quality.

See Gemini API Models for the latest list.

How It Works

  1. Reads the audio file and base64 encodes it
  2. Auto-detects authentication:
    • If ADC is available (gcloud), uses Vertex AI endpoint
    • Otherwise, uses GEMINI_API_KEY with direct Gemini API
  3. Sends to the selected Gemini model with transcription prompt
  4. Returns the transcribed text

Example Integration

For Clawdbot voice message handling:

# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"

Error Handling

The script exits with code 1 and prints to stderr on:

  • No authentication available (neither ADC nor GEMINI_API_KEY)
  • File not found
  • API errors
  • Missing GCP project (when using Vertex)

Notes

  • Uses Gemini 2.0 Flash Lite by default for fastest transcription
  • No external Python dependencies (uses stdlib only)
  • Automatically detects MIME type from file extension
  • Prefers Vertex AI with ADC when available (no API key management needed)