Higgsfield-ai-prompt-skill higgsfield-audio
git clone https://github.com/OSideMedia/higgsfield-ai-prompt-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/OSideMedia/higgsfield-ai-prompt-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/higgsfield-audio" ~/.claude/skills/osidemedia-higgsfield-ai-prompt-skill-higgsfield-audio && rm -rf "$T"
skills/higgsfield-audio/SKILL.md
Higgsfield Audio Prompting Guide
Which Models Support Audio?
| Model | Audio type | Dialogue | SFX | Ambient | BGM | Lip-sync |
|---|---|---|---|---|---|---|
| Kling 3.0 / Omni | Native joint | ✅ | ✅ | ✅ | ✅ | ✅ Multi-language |
| Seedance 2.0 | Native joint | ✅ | ✅ | ✅ | ✅ | ✅ Multi-language |
| Seedance 1.5 Pro | Native joint | ✅ | ✅ | ✅ | ✅ | ✅ Best lip-sync |
| Veo 3 / 3.1 | Native joint | ✅ | ✅ | ✅ | ✅ | ✅ English best |
| Grok Imagine Video | Native joint | ✅ | ✅ | ✅ | ✅ | ✅ |
| All other models | ❌ | — | — | — | — | — |
"Native joint" means audio and video are generated simultaneously in one pass — not layered on after. This produces natural synchronization without post-production.
Models without native audio: add audio in post with Lipsync Studio or external tools.
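The support table can be kept as a small lookup so a script can decide whether to write audio cues at all. This is a minimal sketch in Python: the dictionary keys and the `supports_native_audio` and `audio_plan` helpers are illustrative names, not part of any Higgsfield API, and the entries simply mirror the table above.

```python
# Illustrative lookup mirroring the support table above. The model keys and
# helper names are assumptions for this sketch, not a Higgsfield API.
NATIVE_AUDIO_MODELS = {
    "kling 3.0": "multi-language lip-sync",
    "kling 3.0 omni": "multi-language lip-sync",
    "seedance 2.0": "multi-language lip-sync",
    "seedance 1.5 pro": "best lip-sync",
    "veo 3": "English best",
    "veo 3.1": "English best",
    "grok imagine video": "lip-sync supported",
}


def supports_native_audio(model: str) -> bool:
    """Return True if the table lists the model as generating audio natively."""
    return model.strip().lower() in NATIVE_AUDIO_MODELS


def audio_plan(model: str) -> str:
    """Say whether to write audio cues in the prompt or add audio in post."""
    if supports_native_audio(model):
        return "write audio cues in the prompt"
    return "add audio in post"
```

Any model name that is not in the dictionary falls through to the "add audio in post" branch, which matches the "All other models" row.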
The Four Audio Layers
Every audio-capable prompt should consider four layers. You don't need all four in every prompt, but knowing which to include gives the model clear direction.
1. Dialogue — What characters say
Put dialogue in quotes. Be explicit about who speaks, their tone, and language.
She says: "We need to leave. Now." He whispers: "Not yet."
Best practices:
- Keep dialogue short — 1-2 sentences per character per shot
- Specify emotional tone: "says urgently", "whispers", "shouts across the room"
- For non-English: specify language and dialect → She speaks in Cantonese: "走啦"
- For Seedance 1.5 Pro: supports English, Chinese (incl. Sichuanese, Cantonese, Taiwanese Mandarin, Shanghainese), Japanese, Korean, Spanish, Indonesian
2. SFX — Specific sound events tied to action
Describe SFX at the point they happen. Tie them to visible actions.
The glass shatters on the floor: sharp crack, then settling tinkle.
Footsteps on wet concrete: splashing, rhythmic.
A door slams shut: heavy metal, echoing.
Best practices:
- One SFX description per action beat
- Use onomatopoeia sparingly — descriptive phrases work better than "BANG" or "CRASH"
- Tie timing to action: "as she sets the cup down" not "cup sound at 4 seconds"
3. Ambient — Background soundscape
Set the acoustic environment. This is the continuous sound bed.
Ambient: quiet café murmur, espresso machine, rain against windows.
Ambient: forest at night; crickets, distant owl, gentle wind through leaves.
Ambient: busy intersection; traffic, horns, construction in the distance.
Best practices:
- 2-3 ambient elements maximum — more gets muddy
- Describe the space acoustics: "reverberant church hall", "tight car interior"
- Contrast silence with sound for impact: "Dead silence. Then — a single footstep."
4. BGM — Background music mood
Don't name songs or artists (content filter). Describe the musical texture.
BGM: slow piano, minor key, melancholic.
BGM: tense orchestral build; low strings, rising.
BGM: lo-fi hip-hop beat, warm vinyl crackle, relaxed.
Best practices:
- Describe instrumentation, tempo, mood — not genre labels alone
- "Tense strings, building" works better than "suspenseful music"
- Specify when music enters/exits: "Piano enters at the midpoint, builds to the end"
- For beat-sync content: "Cuts match the downbeat" or "Movement peaks on the drop"
Audio Prompt Structure
Add audio cues naturally within your prompt or as a dedicated block at the end.
Inline method (preferred for short prompts):
A woman walks into a quiet library. Her heels click on the marble floor — each step echoing. She whispers to the librarian: "Do you have the Collected Letters?" Distant page turns. A clock ticks somewhere above.
Dedicated block method (better for complex audio):
[Scene description: visual content, action, camera]
Audio:
Dialogue: She says "We leave at dawn." He replies: "I'll be ready."
SFX: coffee cup set down, chair scraping back
Ambient: early morning kitchen, birds outside, kettle just boiled
BGM: none; silence emphasizes the tension
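When prompts are generated by a script, the dedicated block can be assembled from the four layers. The sketch below is one possible way to do that; `build_prompt` and its parameter names are hypothetical, and the layer labels are taken from the example above.

```python
from typing import Optional, Sequence


def build_prompt(
    scene: str,
    dialogue: Sequence[str] = (),
    sfx: Sequence[str] = (),
    ambient: Optional[str] = None,
    bgm: Optional[str] = None,
) -> str:
    """Assemble a scene description followed by a dedicated Audio block.

    Empty layers are omitted, since not every prompt needs all four.
    """
    lines = [scene.strip(), "Audio:"]
    if dialogue:
        lines.append("Dialogue: " + " ".join(dialogue))
    if sfx:
        lines.append("SFX: " + ", ".join(sfx))
    if ambient:
        lines.append("Ambient: " + ambient)
    if bgm:
        lines.append("BGM: " + bgm)
    # With no audio layers at all, return the scene without an empty header.
    if len(lines) == 2:
        return lines[0]
    return "\n".join(lines)
```

Calling `build_prompt("A landscape at dusk.")` returns the scene unchanged, which fits purely visual prompts where audio direction is skipped.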
Lip-Sync Rules
Lip-sync is the most failure-prone audio feature. Follow these rules strictly:
Do:
- Keep dialogue clips 3–8 seconds (sweet spot for accuracy)
- Use medium close-up or closer framing — model needs to see the mouth clearly
- One speaking face per shot — multiple faces break audio routing
- Lock the camera: "locked-off static camera" or "slow Dolly In" only
- Remove all head/face motion tokens: "nodding", "turning head", "looking around" compete with the lip engine and cause desync
Don't:
- Don't combine dialogue with vigorous head movement in the same prompt
- Don't use 15s clips for lip-sync — technical max but accuracy degrades past 8s
- Don't include ambient or music tokens if lip-sync is the priority — they invite the generative audio engine to override your dialogue
- Don't use non-MP3 audio for Seedance 2.0 (when available) — WAV/AAC/OGG fail silently
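Several of these rules are mechanical enough to check before submitting a prompt. The following is a rough sketch of such a check, not a feature of any model. The function name, the token list, and the simple keyword match for ambient/music cues are assumptions drawn from the rules above.

```python
import re

# Token list and thresholds come from the rules above; the names are illustrative.
HEAD_MOTION_TOKENS = ("nodding", "turning head", "looking around")
AUDIO_BED_PATTERN = re.compile(r"\b(ambient|bgm|music)\b", re.IGNORECASE)


def lint_lipsync(prompt: str, duration_s: float) -> list[str]:
    """Return warnings for a prompt where lip-sync is the priority."""
    warnings = []
    lowered = prompt.lower()
    if not 3 <= duration_s <= 8:
        warnings.append(f"duration {duration_s}s is outside the 3-8s sweet spot")
    for token in HEAD_MOTION_TOKENS:
        if token in lowered:
            warnings.append(f"head motion token present: {token!r}")
    if AUDIO_BED_PATTERN.search(prompt):
        warnings.append("ambient or music cue present; may override dialogue")
    return warnings
```

An empty list means none of the checked rules were tripped. Framing and the one-face-per-shot rule still need a human review.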
Multi-character dialogue workaround:
Multi-person lip-sync matching remains unreliable on most models (Kling 3.0 / Omni is listed below as supporting per-character lip-sync). The production workaround:
- Generate each character separately with their own audio segment
- Composite in CapCut/Premiere using picture-in-picture + linear mask (15% feather)
- Static image for the listening character; generated video for the speaking character
Audio by Model — What Works Best Where
Kling 3.0 (V3) / 3.0 Omni (O3)
- Best overall audio-visual integration
- Multi-language dialogue (English, Chinese, Japanese, Korean, Spanish + regional accents: American English, British English, Indian English)
- Multi-character dialogue: 3+ characters with correct speaker attribution and lip-sync per character
- Voice Binding: lock specific voice profiles to specific characters across shots
- O3 adds Voice Extraction from static images: upload audio clip (min 3s) + image to build a voice profile
- O3 adds Performance Cloning: act out a scene on camera → AI re-renders preserving likeness and voice
- Include dialogue, ambient, and SFX naturally in the prompt
- Prompt like a script: action + camera + mood + dialogue cues together
Audio Speaker Attribution Format (V3/O3):
[Speaker: Character Name] "dialogue" in a [warm/confident/excited] [male/female] voice with [accent]. Add [sound: footsteps / rain / door closing] when [action]. Background ambient: [environment description].
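The dialogue part of this format can be filled in programmatically. The sketch below assumes the `[Speaker: ...]` tag is written literally while the other bracketed items are placeholders to substitute; that reading of the template, and the `speaker_line` helper itself, are assumptions rather than documented Kling behavior.

```python
def speaker_line(
    name: str,
    dialogue: str,
    tone: str,
    gender: str,
    accent: str = "",
) -> str:
    """Fill the speaker attribution template for one line of dialogue.

    `accent` is inserted verbatim after "with", e.g. "a British English accent".
    """
    line = f'[Speaker: {name}] "{dialogue}" in a {tone} {gender} voice'
    if accent:
        line += f" with {accent}"
    return line + "."
```

For example, `speaker_line("Maya", "We leave at dawn.", "confident", "female", "a British English accent")` yields one attribution sentence that can be followed by the sound and ambient sentences from the template.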
Seedance 1.5 Pro
- Best lip-sync accuracy of all models
- Class-leading multilingual support including Chinese dialects
- Most stable emotional tone control
- Use for: professional dialogue scenes, multilingual content
Seedance 2.0
- Upload MP3 audio as @Audio reference (part of Rule of 12)
- MP3 only — WAV/AAC/OGG/FLAC fail silently with no error
- Max 15s per clip, 3 audio files, 10MB each
- Timestamp anchoring: "Audio @Audio1 plays exactly as uploaded from 0s to end. Do not modify." Then remove all ambient/SFX/music tokens to prevent the generative engine from overriding the uploaded audio.
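The anchoring step can be sketched as a small text transform: drop the generated-audio cue lines, then append the anchoring phrase. This assumes cues are written one per line with `Ambient:`, `SFX:`, or `BGM:` prefixes, as in the dedicated block method. The function name is hypothetical, and inline cues inside prose would not be caught.

```python
ANCHOR = "Audio @Audio1 plays exactly as uploaded from 0s to end. Do not modify."
# Line prefixes treated as generated-audio cues. This assumes one cue per line.
GENERATED_CUE_PREFIXES = ("ambient:", "sfx:", "bgm:")


def anchor_uploaded_audio(prompt: str) -> str:
    """Drop Ambient/SFX/BGM lines and append the anchoring phrase."""
    kept = [
        line
        for line in prompt.splitlines()
        if not line.strip().lower().startswith(GENERATED_CUE_PREFIXES)
    ]
    body = "\n".join(kept).strip()
    return f"{body}\n{ANCHOR}" if body else ANCHOR
```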
Veo 3 / 3.1
- Strong native audio for English dialogue and environmental sounds
- Dialogue in quotes: "This must be it," he murmured.
- SFX explicitly: tires screeching loudly
- Ambient as environment soundscape descriptions
Grok Imagine Video
- Improved audio as of Video Imagine 1.0 (Feb 2026)
- Include audio intent directly in prompt — same inline style as other models
- Best for: social clips where audio adds polish but isn't the hero
Common Audio Failures and Fixes
| Problem | Cause | Fix |
|---|---|---|
| Lip-sync completely off | Audio > 8s, or head motion tokens present | Trim to 5s, remove nodding/turning tokens |
| Model replaces uploaded audio | Ambient/music tokens in prompt invite generative override | Add timestamp anchoring phrase, remove all ambient/music tokens |
| Dialogue missing entirely | Non-MP3 format used (Seedance 2.0) | Convert to MP3 128-320kbps |
| SFX drowns out dialogue | Too many SFX cues competing | Reduce to 1-2 SFX per shot, prioritize dialogue |
| Audio sounds robotic | Flat emotional cues | Add emotional direction: "says warmly", "whispers with urgency" |
| Background music too loud | BGM description too prominent in prompt | Move BGM to end of prompt, reduce detail, or say "subtle BGM" |
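The MP3 conversion named in the table can be scripted with ffmpeg. The sketch below builds the command separately from running it, so it can be inspected first. It assumes ffmpeg is installed with the libmp3lame encoder; the helper names and the 192 kbps default are illustrative choices within the 128-320 kbps range given above.

```python
import subprocess
from pathlib import Path


def mp3_command(src: str, bitrate_kbps: int = 192) -> list[str]:
    """Build an ffmpeg command that re-encodes src to MP3 next to the original."""
    if not 128 <= bitrate_kbps <= 320:
        raise ValueError("bitrate should be within 128-320 kbps")
    if Path(src).suffix.lower() == ".mp3":
        raise ValueError("source is already an MP3")
    dst = str(Path(src).with_suffix(".mp3"))
    return [
        "ffmpeg", "-y",            # overwrite the output if it exists
        "-i", src,                 # input file (WAV, AAC, OGG, ...)
        "-vn",                     # ignore any video or cover-art stream
        "-codec:a", "libmp3lame",  # MP3 encoder
        "-b:a", f"{bitrate_kbps}k",
        dst,
    ]


def convert_to_mp3(src: str, bitrate_kbps: int = 192) -> None:
    """Run the conversion; requires ffmpeg with libmp3lame on PATH."""
    subprocess.run(mp3_command(src, bitrate_kbps), check=True)
```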
When to Skip Audio
Not every prompt needs audio direction. Skip audio cues when:
- Using a model without native audio (Kling 2.6, Wan 2.6, Seedance Pro, Minimax Hailuo 2.3/02)
- The content is purely visual (product beauty shots, abstract motion, landscape)
- Audio will be added entirely in post-production
- The prompt is already at the 200-word limit and visual direction is more important
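The word-limit case can be handled by appending audio direction only when it fits. This is a minimal sketch that counts whitespace-separated words, which is an assumption about how the 200-word limit is measured; `add_audio_if_room` is a hypothetical helper.

```python
WORD_LIMIT = 200  # the prompt limit cited above


def add_audio_if_room(visual_prompt: str, audio_block: str, limit: int = WORD_LIMIT) -> str:
    """Append the audio block only when the combined prompt stays within the limit."""
    visual = visual_prompt.strip()
    combined = f"{visual}\n{audio_block.strip()}"
    if len(combined.split()) <= limit:
        return combined
    # Over budget: keep the visual direction and drop the audio cues.
    return visual
```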
Negative constraints: For audio-specific artifacts (lip-sync desync, background music overriding dialogue, SFX drowning dialogue) and their prevention phrases, see the Temporal/Consistency Artifacts section of ../shared/negative-constraints.md.
Cinema Studio 3.0 Audio (Business/Team Plan)
Cinema Studio 3.0 introduces native audio-video joint generation — a fundamental shift from models that treat audio as a post-processing step.
Native Audio-Video Joint Generation
Audio is generated simultaneously with video via a unified multimodal architecture. This means:
- Audio and video are temporally aligned by default — no manual sync needed
- Dual-channel stereo output
- Sound design prompts directly influence both audio AND visual generation
- Audio is not "added on" — it's part of the same generation pass
Audio as Prompt Element (SCELA)
Always describe audio as a separate section in your prompts. The generation engine handles three parallel audio tracks:
- BGM — background music, score
- Ambient SFX — environmental sounds, foley
- Dialogue — character speech, voiceover
A chef slices vegetables rapidly on a wooden cutting board.
Camera: tight close-up tracking the knife.
Style: warm kitchen lighting, shallow depth of field.
Audio: rhythmic chopping on wood, oil sizzling in a nearby pan, soft clinking of ceramic bowls. Light acoustic guitar BGM.
Input Constraints
| Parameter | Limit |
|---|---|
| Accepted formats | MP3, WAV |
| Max audio clips | 3 per generation |
| Combined duration | ≤15s total |
| Single file size | <15MB |
| MP3 bitrate | 128–320 kbps |
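These limits can be checked before upload. The sketch below assumes the caller already knows each clip's format, duration, size, and bitrate (for example from a media probe tool); the `AudioClip` type and `check_clips` function are hypothetical and only encode the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioClip:
    name: str
    fmt: str                            # "mp3" or "wav"
    duration_s: float
    size_mb: float
    bitrate_kbps: Optional[int] = None  # only checked for MP3


def check_clips(clips: list[AudioClip]) -> list[str]:
    """Return human-readable violations of the input limits in the table."""
    problems = []
    if len(clips) > 3:
        problems.append(f"{len(clips)} clips supplied; maximum is 3")
    total = sum(c.duration_s for c in clips)
    if total > 15:
        problems.append(f"combined duration {total:.1f}s exceeds 15s")
    for c in clips:
        fmt = c.fmt.lower()
        if fmt not in ("mp3", "wav"):
            problems.append(f"{c.name}: format {c.fmt} not accepted")
        if c.size_mb >= 15:
            problems.append(f"{c.name}: {c.size_mb}MB is not under 15MB")
        if fmt == "mp3" and c.bitrate_kbps is not None and not 128 <= c.bitrate_kbps <= 320:
            problems.append(f"{c.name}: bitrate {c.bitrate_kbps}kbps outside 128-320")
    return problems
```

An empty result means the clips fit the listed limits. It does not guarantee the generation itself will accept or preserve the audio.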
Lip-Sync
Available but experimental in Cinema Studio 3.0:
- Focus on emotion and general mouth movement, not perfect phoneme sync
- Single face per generation only — multi-face lip-sync is not supported
- For best results, keep dialogue segments under 8 seconds
- Pair with clear frontal or 3/4 face angle reference images
Tone / Voice Cloning via @Reference
Control speaking style, accent, and language by referencing a video with the desired voice:
Voiceover tone references @Video1. The narrator describes the product in a warm, conversational tone. "This changes everything."
Audio @Audio1 plays exactly as uploaded from 0s to end. Do not modify or replace the audio content.
Dialect Support
Dialects written directly in the prompt work — the model understands regional speech patterns. Write dialogue in the target dialect for authentic delivery.
Timestamp Anchoring
When uploading reference audio that must play unmodified:
Audio @Audio1 plays exactly as uploaded from 0s to end. Do not modify or replace the audio content.
Then remove all ambient/SFX/music tokens from the prompt to prevent the generation engine from overriding the uploaded audio with generated sound.
Sound Design Specificity
Describe specific foley, not generic moods:
Wrong:
nice ambient sounds, pleasant background noise
Right:
the scratch of frosted glass, rustling of plush fabric, gentle tapping on acrylic, popping of bubble wrap, wooden floor creaking under bare feet
Specific sound descriptions directly influence the generated audio output. The more precise the foley description, the more accurate the result.
Related skills
higgsfield-models: Which models support native audio
higgsfield-troubleshoot: Audio failure diagnosis
higgsfield-cinema: Cinema Studio audio workflow with Kling 3.0
higgsfield-vibe-motion: Motion graphics with audio (different from AI-generated audio)