Claude-skill-registry faion-multimodal-ai

Multimodal AI: vision, image/video generation, speech-to-text, text-to-speech, voice synthesis.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/faion-multimodal-ai" ~/.claude/skills/majiayu000-claude-skill-registry-faion-multimodal-ai && rm -rf "$T"
manifest: skills/data/faion-multimodal-ai/SKILL.md
source content

Entry point: `/faion-net` — invoke this skill for automatic routing to the appropriate domain.

Multimodal AI Skill

Communication: User's language. Code: English.

Purpose

Handles multimodal AI applications. Covers vision, image generation, video generation, speech, and voice synthesis.

Context Discovery

Auto-Investigation

Check these project signals before asking questions:

| Signal | Where to Check | What to Look For |
|---|---|---|
| Dependencies | package.json, requirements.txt | openai, PIL/pillow, ffmpeg-python, elevenlabs |
| Media files | /images, /audio, /video | Input files to process |
| API usage | Grep for "images.generate", "audio.transcriptions" | Existing multimodal APIs |
| Output dirs | /generated, /output | Where generated content goes |
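The signals above can be checked mechanically before asking the user anything; a minimal sketch, where the `SIGNALS` mapping mirrors the table and the names are illustrative rather than exhaustive:

```python
from pathlib import Path

# Mirrors the signal table above; names are illustrative, not exhaustive
SIGNALS = {
    "deps": ["package.json", "requirements.txt"],
    "media": ["images", "audio", "video"],
    "output": ["generated", "output"],
}

def discover(root: str = ".") -> dict:
    """Return which signal files/dirs actually exist under the project root."""
    base = Path(root)
    return {key: [n for n in names if (base / n).exists()]
            for key, names in SIGNALS.items()}
```

Any signal bucket that comes back non-empty can pre-answer the corresponding discovery question below.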

Discovery Questions

question: "Which modality are you working with?"
header: "Modality"
multiSelect: true
options:
  - label: "Vision (image understanding)"
    description: "GPT-4o Vision, Gemini Vision for OCR/analysis"
  - label: "Image generation"
    description: "DALL-E 3, Midjourney, Stable Diffusion"
  - label: "Video generation/understanding"
    description: "Sora, Runway, or video analysis"
  - label: "Speech-to-text"
    description: "Whisper, Deepgram for transcription"
  - label: "Text-to-speech"
    description: "OpenAI TTS, ElevenLabs for voice synthesis"

question: "What's your primary use case?"
header: "Use Case"
multiSelect: false
options:
  - label: "Document/receipt OCR and analysis"
    description: "Extract structured data from images"
  - label: "Content generation (images/videos)"
    description: "Create marketing/creative assets"
  - label: "Accessibility (vision/speech conversion)"
    description: "Convert between modalities for a11y"
  - label: "Voice assistant/bot"
    description: "Speech → Text → LLM → TTS pipeline"

question: "Volume and latency requirements?"
header: "Scale"
multiSelect: false
options:
  - label: "Low volume, quality over speed"
    description: "Use premium models (HD TTS, GPT-4o Vision)"
  - label: "High volume, optimize for cost"
    description: "Batch APIs, smaller models"
  - label: "Real-time required"
    description: "Streaming APIs (Deepgram, OpenAI TTS)"
  - label: "Async processing OK"
    description: "Queue-based approach"

Scope

| Area | Coverage |
|---|---|
| Vision | GPT-4o Vision, Gemini Vision, image understanding |
| Image Generation | DALL-E 3, Midjourney, Stable Diffusion |
| Video Generation | Sora, Runway, Pika |
| Speech-to-Text | Whisper, Deepgram, AssemblyAI |
| Text-to-Speech | OpenAI TTS, ElevenLabs, Google TTS |
| Voice | Real-time voice, voice cloning |

Quick Start

| Task | Files |
|---|---|
| Vision API | vision-basics.md → vision-applications.md |
| Image generation | img-gen-basics.md → img-gen-tools.md |
| Video generation | video-gen-basics.md → video-gen-tools.md |
| Speech-to-text | speech-to-text-basics.md → speech-to-text-advanced.md |
| Text-to-speech | tts-basics.md → tts-implementation.md |
| Voice synthesis | voice-basics.md → voice-implementation.md |

Methodologies (12)

Vision (2):

  • vision-basics: Image understanding, OCR, scene analysis
  • vision-applications: Use cases, production patterns

Image Generation (2):

  • img-gen-basics: Prompt engineering, models
  • img-gen-tools: DALL-E 3, Midjourney, Stable Diffusion

Video Generation (2):

  • video-gen-basics: Fundamentals, prompting
  • video-gen-tools: Sora, Runway, Pika, Luma

Speech-to-Text (2):

  • speech-to-text-basics: Whisper API, real-time
  • speech-to-text-advanced: Diarization, timestamps

Text-to-Speech (2):

  • tts-basics: Voice selection, SSML
  • tts-implementation: Production patterns, streaming

Voice (2):

  • voice-basics: Real-time voice, cloning
  • voice-implementation: Integration patterns

Code Examples

GPT-4o Vision

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

print(response.choices[0].message.content)
```
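The example above passes a public URL; the same `image_url` field also accepts a base64 data URL for local files. A minimal helper (the default MIME type is an assumption, adjust per file):

```python
import base64

def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a data URL for the image_url field."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

Usage: `{"type": "image_url", "image_url": {"url": to_data_url("receipt.jpg")}}`.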

DALL-E 3 Image Generation

```python
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city with flying cars",
    size="1024x1024",
    quality="hd",
    n=1
)

image_url = response.data[0].url
```
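Note that DALL-E result URLs expire after a short window; requesting `response_format="b64_json"` on `images.generate` and decoding locally avoids a second fetch. A sketch of the decode step:

```python
import base64

def save_b64_image(b64_json: str, dest: str) -> None:
    """Decode a b64_json payload from images.generate into a local file."""
    with open(dest, "wb") as f:
        f.write(base64.b64decode(b64_json))
```

With `response_format="b64_json"`, the payload is read from `response.data[0].b64_json` instead of `.url`.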

Whisper Speech-to-Text

```python
from openai import OpenAI

client = OpenAI()

# Use a context manager so the file handle is closed after upload
with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

print(transcription.text)
```
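With `response_format="verbose_json"` and word granularity, `transcription.words` carries per-word start/end times in seconds. A small formatter, assuming that field shape:

```python
def format_words(words) -> str:
    """Render word-level timestamps as '[start-end] word' lines.

    Accepts dicts or objects exposing word/start/end fields.
    """
    get = lambda w, k: w[k] if isinstance(w, dict) else getattr(w, k)
    return "\n".join(
        f"[{get(w, 'start'):6.2f}-{get(w, 'end'):6.2f}] {get(w, 'word')}"
        for w in words
    )
```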

OpenAI TTS

```python
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Hello, this is a test of text to speech."
)

response.stream_to_file("speech.mp3")
```

Gemini Vision

```python
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-pro-vision")

image = PIL.Image.open("image.jpg")
response = model.generate_content([
    "Describe this image in detail",
    image
])

print(response.text)
```

Model Comparison

Vision Models

| Model | Best For | Max Image Size |
|---|---|---|
| GPT-4o | General vision, OCR | 20MB |
| Gemini Pro Vision | High-res images | 20MB |
| Claude Sonnet 4 | Document analysis | 5MB |

Image Generation

| Model | Best For | Cost |
|---|---|---|
| DALL-E 3 | Photorealistic, text | $$$ |
| Midjourney | Artistic, creative | $$ |
| Stable Diffusion | Custom, open-source | Free/$ |

Speech-to-Text

| Service | Best For | Languages |
|---|---|---|
| Whisper | General, multilingual | 99 |
| Deepgram | Real-time, low latency | 30+ |
| AssemblyAI | Features, diarization | 10+ |

Text-to-Speech

| Service | Best For | Voices |
|---|---|---|
| OpenAI TTS | Quality, variety | 6 |
| ElevenLabs | Cloning, realism | Custom |
| Google TTS | Languages, SSML | 400+ |

Use Cases

| Use Case | Modalities |
|---|---|
| Document analysis | Vision → Text |
| Video narration | Video → Speech → TTS |
| Voice assistant | Speech → LLM → TTS |
| Content generation | Text → Images/Video |
| Accessibility | Vision → TTS, Speech → Text |
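The voice-assistant row is a three-stage pipeline. A sketch with the stages injected as callables, so the API-specific pieces (e.g. Whisper for transcription, GPT-4o for the reply, TTS for audio) can be swapped or stubbed in tests; all names here are illustrative:

```python
from typing import Callable

def voice_turn(
    transcribe: Callable[[str], str],   # audio path -> user text (speech-to-text)
    respond: Callable[[str], str],      # user text -> reply text (LLM)
    speak: Callable[[str], bytes],      # reply text -> audio bytes (TTS)
    audio_path: str,
) -> tuple[str, bytes]:
    """Run one Speech -> LLM -> TTS round trip and return (reply text, audio)."""
    user_text = transcribe(audio_path)
    reply_text = respond(user_text)
    return reply_text, speak(reply_text)
```

Wiring in the real APIs amounts to passing thin wrappers around the Whisper, chat-completions, and TTS calls shown in the code examples above.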

Related Skills

| Skill | Relationship |
|---|---|
| faion-llm-integration | Provides vision APIs |
| faion-ai-agents | Multimodal agents |

Multimodal AI v1.0 | 12 methodologies