Ai-video-generator-claude seedance-podcast-visual

Generate podcast clip visualization video prompts for Seedance 2.0 on Higgsfield. Use for podcast clip videos, audio-to-visual content, audiogram alternatives, podcast highlight reels, interview clip visuals, or any video that transforms audio content into engaging visual format. Triggers on podcast, audio clip, audiogram, interview clip, sound bite, audio visual, podcast video, episode highlight, podcast clip.

install
source · Clone the upstream repo
git clone https://github.com/rediumvex/ai-video-generator-claude
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/rediumvex/ai-video-generator-claude "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/10-podcast-visual" ~/.claude/skills/rediumvex-ai-video-generator-claude-seedance-podcast-visual && rm -rf "$T"
manifest: skills/10-podcast-visual/SKILL.md
source content

Podcast Visual — Audio-to-Video Transformation Prompts

Transform podcast audio into cinematic visual content using Seedance 2.0 on Higgsfield. This skill produces video prompts that replace static audiograms with storytelling-driven visual experiences built entirely from constructed imagery.


Input Specifications

Primary inputs:

  • Up to 3 audio files (podcast clips, interview excerpts, sound bites, episode highlights)
  • Transcript or key quote text from the audio
  • Speaker name(s) and brief context (topic, show name, tone)
  • Desired visual style (abstract, cinematic, interview reconstruction, kinetic)
  • Target platform (Instagram Reels, YouTube Shorts, LinkedIn, TikTok)
  • Aspect ratio: 9:16 (vertical/mobile-first), 16:9 (widescreen), or 1:1 (square)

Audio file handling:

  • File 1: Primary clip — the main sound bite or key quote being visualized
  • File 2 (optional): Intro or context clip — sets up the narrative before the hook
  • File 3 (optional): Reaction or follow-up clip — speaker response, co-host moment, audience reaction
  • Duration guidance: each clip should be 15–90 seconds; total sequence up to 3 minutes

What you extract from audio before writing prompts:

  • The single most quotable sentence (becomes the visual anchor)
  • The emotional register: contemplative, fired-up, vulnerable, instructive, funny
  • Pacing: fast and punchy vs. slow and deliberate delivery
  • Natural pauses: where silence lives (these become visual breath moments)
  • Speaker energy level: seated calm, animated gesturing, emotional peak

Philosophy

Old model (audiogram)New model (podcast visual)
Show the waveformShow what the words feel like
Static background imageConstructed cinematic environment
Speaker photo as thumbnailSpeaker reconstructed in scene
Generic brand colorsLighting and atmosphere matched to tone
Passive viewingActive emotional engagement
Optimized for "audio on"Compelling even on mute

2-Second Hook Patterns

The hook is the opening frame that stops the scroll. It must communicate emotion, intrigue, or tension before a single word is heard. Four proven structures:

The Quote Impact

Display the most provocative line from the clip as large kinetic text before audio begins. The text arrives with weight — not a gentle fade, but a hard cut or a push-in. The visual behind it is blurred or dark, forcing the text into full focus.

When to use: clips with a single devastating sentence, contrarian takes, counterintuitive statistics, direct challenges to conventional wisdom.

Visual execution in prompt: specify "bold white sans-serif typography slams onto dark background, camera holds for 1.5 seconds, then cuts to speaker close-up, shallow depth of field, background softly bokeh'd."

The Reaction Shot

Open on the speaker's face at the moment of peak emotional expression — surprise, laughter, conviction, vulnerability — before any context is given. This creates a curiosity gap: the viewer needs to hear what caused that expression.

When to use: interview moments where a genuine reaction occurs, storytelling clips where the speaker relives something visceral, moments of realization or revelation.

Visual execution in prompt: specify "extreme close-up on speaker's face, caught mid-expression, eyes slightly wide, ambient room sound implied by environment, camera slowly eases back over 3 seconds to reveal setting."

The Visual Metaphor

Instead of showing the speaker at all, open with an environmental or abstract image that represents the core concept of the clip. A podcast about burnout opens on dying embers. A clip about compounding returns opens on a single drop rippling outward. The metaphor does expository work so the audio can focus on depth.

When to use: concept-heavy clips, philosophical discussions, any clip where the idea is more powerful than the person delivering it.

Visual execution in prompt: specify the metaphor object explicitly, its lighting, its motion quality, and a precise camera behavior (slow push, orbital, static hold with foreground element drifting through).

The Sound Wave Art

Not a functional audiogram waveform — instead, an artistic rendering of sound as visual sculpture. Particles forming and dissolving in rhythm with imagined speech cadence. Light bending through air as if vibrated by voice. Sound made beautiful, not informational.

When to use: music-adjacent podcasts, high-production brand content, moments where you want to foreground the craft of the medium itself.

Visual execution in prompt: specify particle behavior, color palette tied to the emotional register of the clip, and whether motion is rhythmic/predictable or fluid/organic. Avoid the word "waveform" — describe it as "acoustic particle field" or "resonant light diffusion."


Visual Formats

Abstract Visualization

The audio inspires a visual world that does not contain the speaker at all. Instead, abstract imagery — light, texture, particle systems, color gradients, fluid dynamics — evolves in response to the imagined emotional arc of the audio.

Core parameters:

  • Color temperature must match emotional tone (cool/blue for analytical, warm/amber for intimate, high-contrast for confrontational)
  • Motion should breathe with speech rhythm — slowing during pauses, accelerating during emphasis
  • Avoid literal representation; the visual is interpretive, not illustrative
  • Works best at 9:16 for mobile, full-bleed composition

Prompt elements to always include: dominant color palette, motion behavior (fluid, particle, crystalline, liquid, smoke), camera behavior (static, slow push, orbital), and whether the environment is finite (a room implied by light edges) or infinite (void space)

Cinematic B-Roll Narrative

Construct a series of visuals that would, in a traditional documentary, accompany the audio as b-roll. Except here every frame is generated — no stock footage, no compromises. The b-roll tells the story of the words.

Core parameters:

  • Each visual beat corresponds to a sentence or phrase in the clip
  • Environments are specific: not "a city" but "a rain-slicked street at 11 PM, single sodium-vapor streetlight, no pedestrians"
  • Objects carry symbolic weight: a speaker discussing scarcity shows empty shelves; one discussing abundance shows an overflowing market
  • Camera movement is motivated — zoom-in when tension builds, cut to wide when perspective expands

Prompt elements to always include: specific environment (time of day, weather, geography implied), one or two key objects in frame, camera move, lighting source, color grade direction (film noir, golden hour, overcast flat light, neon-saturated).

Split-Screen Interview Reconstruction

Reconstruct the podcast conversation as if it were a filmed interview, split-screen between two constructed environments. Each speaker occupies a distinct visual space — differentiated by lighting color temperature, depth of field, and environmental detail — while remaining in visual dialogue with each other.

Core parameters:

  • Left panel and right panel are visually asymmetric by design, not just mirrored
  • Lighting on each speaker communicates their role: warmer for the guest/storyteller, cooler-neutral for the host/interrogator
  • Camera behavior between panels should differ: one speaker gets a slow push-in, the other a static hold
  • Invisible edit: both panels feel like they belong to the same moment even though they are compositionally separate

Prompt elements to always include: panel ratio (50/50, 60/40, or dynamic shift), description of each environment, lighting scheme for each, camera behavior for each, whether there is any visual bleed or hard line between panels.

Kinetic Typography

The words themselves become the visual. The transcript animates — letters forming, words scaling, phrases colliding, key terms expanding to fill frame. The background is subordinate to type.

Core parameters:

  • Font weight must be heavy — no light or thin weights for podcast visuals; impact requires mass
  • Motion choreography follows the rhythm of speech, not just the meaning
  • Color contrast must survive both light and dark mode viewing
  • Key words get visual differentiation: scale, color pop, brief hold, or shake-on-impact
  • Secondary text (speaker name, show name, timestamp) should not compete with primary quote text

Prompt elements to always include: primary font style (condensed sans, geometric sans, slab serif, display), animation behavior per word type (verbs get thrust/motion, nouns get scale, punctuation gets pause), background treatment (solid, gradient, subtle texture, barely-there environmental photo), and transition behavior between lines.


Audio-Visual Sync Techniques

Seedance 2.0 generates video from text, not from audio input directly. Syncing visual rhythm to audio content is a craft challenge solved at the prompting stage.

Beat-Matching Visuals to Speech Rhythm

Map the cadence of the clip before writing the prompt. Fast-talking speakers with high information density need visuals with shorter hold times and more cuts implied. Slower, deliberate speakers — the kind who pause for effect — need visuals that can breathe with them.

Practical approach:

  • Count syllables per second in the clip (rough: slow = under 3/sec, normal = 3-5/sec, fast = 5+/sec)
  • For slow speakers: specify long camera holds, environmental drift, single sustained composition
  • For fast speakers: specify implied cut rhythm, multiple visual planes in frame, foreground elements that create natural layering

Emphasis Moments

Identify the 2-3 words or phrases in the clip that carry the most weight. These become visual events — a push-in, a light shift, a particle burst, a text scale event. The prompt should describe these precisely.

Example language for prompts:

  • "At the word 'everything,' camera completes its slow push-in to extreme close-up"
  • "The phrase 'no one told me' appears in oversized type, centered, holds for 1.5 seconds before dissolving"
  • "Warm backlight intensifies for 0.8 seconds aligned with the speaker's raised hand gesture"

Pause Visualization

Silence in a podcast clip is intentional. Great speakers use pauses like punctuation. These moments are visual opportunities: a held frame, a slow-motion exhale, a breath of empty space in the composition. Do not cut away during a pause — hold, and let the tension live.

Prompt language for pauses: "camera holds completely static," "particle motion slows to near-still," "background light dims 15% and recovers," "single dust mote drifts through foreground."


Camera and Composition for Non-Live Content

Everything in a Seedance 2.0 podcast visual is constructed, not captured. This is not a limitation — it is the entire point. Camera behavior must be specified deliberately.

Core camera vocabulary for podcast visuals:

  • Slow push-in (dolly in): builds intimacy and urgency over time; use for building conviction in the speaker's words
  • Static hold: signals stability and authority; the speaker's words are certain enough that the camera need not move
  • Slow rack focus: transitions attention from background to foreground or vice versa; use during pivots in the narrative
  • Orbital (360 slow arc): creates dimensionality around a subject or object; use for abstract or metaphor sequences
  • Floating handheld: simulates presence and documentary intimacy; specify "subtle organic camera drift, not mechanical" to distinguish from tripod hold
  • Extreme close-up (ECU): a detail shot — eye, hand, mouth, object — that signals emotional intensity without requiring full frame context

Composition principles:

  • For vertical (9:16): place the subject at eye-height in the upper third; leave space in the lower third for typography overlay
  • For widescreen (16:9): observe the rule of thirds strictly; avoid center-frame static compositions for interview reconstruction
  • For square (1:1): treat it as a poster — every frame should be self-contained and graphically bold
  • Always specify negative space intentionality: "empty space to the right of the speaker implies room for text overlay"

Lighting

Lighting is the single most powerful emotional signal in a constructed podcast visual. Specify it precisely.

Studio Interview Recreation

A clean, professional environment that reads as controlled but not sterile. Two-point or three-point lighting setup. The subject is clearly lit; the background is separated but not overlit.

Prompt language: "soft key light from camera-left, warm 4800K, catchlight visible in eyes; subtle fill from camera-right at 1/3 power; dark grey background with faint rim separation from practical studio light; no hard shadows on face."

Abstract Void

The subject or object exists in pure darkness, lit by a single motivated source. Extreme contrast. The background gives nothing — all attention is on the subject. This style works for any clip where the words are the only thing that matters.

Prompt language: "single motivated light source from above-left, 3200K tungsten warmth, subject is lit in isolation against black void, no ambient fill, deep shadow on right side of face, high contrast ratio."

Warm Conversation

Intimate, inviting, lived-in. Suggests a real conversation happening in a real space — a kitchen, a living room, a café booth. Light sources are practical: a window, a lamp, candles. Color temperature is warm (2700K–3200K). This is the lighting of trust.

Prompt language: "warm practical lamp at frame-right, 2700K, soft pool of light; secondary fill from large window implied off-frame left, cool daylight, low intensity; environment reads as domestic interior, bookshelves soft in background, overall low-contrast warmth."

Dramatic Single Source

One hard light. Everything else is darkness or very deep shadow. This is the lighting of revelation, of stakes, of honesty forced into the open. Use for clips where the speaker says something they'd normally not say publicly.

Prompt language: "single hard light source from above, 5500K, 45-degree angle to subject, no fill, deep shadows across lower face and neck, light falls off sharply, background completely unlit, high-drama chiaroscuro treatment."


Typography

When text appears in the video — quotes, speaker IDs, timestamps — it must be designed with the same care as the visual.

Quote Display Techniques

Full-frame typographic hold: the quote appears as the primary visual. Large, bold, white or cream text on dark background. The text itself is the frame. Used for the most powerful single sentence.

Lower-third persistent text: the quote runs as a lower-third during the speaker visual. Not subtitles — the quote is selected for impact, not transcription accuracy. Use a font weight heavier than body text. Maximum 10-12 words.

Word-by-word reveal: each word appears individually, timed to the rhythm of delivery. Prompt must specify animation behavior: "words appear with a hard cut at each syllable boundary, no fades, duration of each word matches speech pacing."

Floating quote with depth: the text appears to exist in three-dimensional space within the scene, as if the words are physically present. Prompt: "quote text appears to float in mid-air within the environment, slight parallax offset from background movement, letters cast soft shadow onto the space behind them."

Speaker Identification

Speaker name and show/platform title appear within the first 5 seconds and again at the end. Keep it subordinate to the quote unless brand-building is the primary goal.

Placement options:

  • Lower-left corner, smaller type, contrasting color to background
  • Upper-right corner with show logo treatment
  • End frame only — full-screen speaker name, episode title, call-to-action

Typography rule: speaker ID font should be same family as quote text, lower weight (regular vs. bold). Never two competing display fonts.

Running Captions

For podcast clips on TikTok/Reels, burned-in subtitles are expected:

  • Position: center-bottom, above platform UI (20% from bottom edge)
  • Style: bold sans-serif, white with black outline or dark background bar
  • Timing: word-by-word or phrase-by-phrase sync to audio
  • Prompt phrasing: "Burned-in subtitle text appears word-by-word synced to speech rhythm. Bold white sans-serif, center-bottom positioning above platform UI. Dark semi-transparent bar behind text for legibility."

Timestamp Markers

For longer clips with multiple audio files, timestamp markers orient the viewer within the episode. These are subtle — a small badge in the upper-right, episode time in small sans-serif. They add credibility (this is a real episode worth finding) without competing with the primary visual.


Complete Example Prompts

Example 1: The Conviction Statement (Abstract Void, 9:16)

Context: A founder describing the moment she decided to quit her corporate job. Single audio clip, 45 seconds. Quote: "I wasn't jumping. I was finally standing still."


Seedance 2.0 video prompt, 9:16 vertical, 45 seconds:

Opening frame: pure black void, single tungsten-warm spotlight descends from directly above, illuminating only the empty center of frame. Dust particles drift slowly through the beam. Camera is completely static. Hold 2 seconds.

At second 2: bold white condensed sans-serif text appears with a hard cut — "I wasn't jumping." — centered at mid-frame, massive scale, filling 60% of horizontal width. Text holds for 1.8 seconds with absolute stillness. No animation, no fade. The words have weight.

Hard cut to black. 0.5 second black.

Text reappears: "I was finally standing still." — same font, same scale, same position. Camera begins an imperceptibly slow push-in toward the text, barely perceptible over 3 seconds. Text dissolves.

At second 8: transition to extreme close-up on a woman's hands resting palms-down on a dark wooden desk, ring light catchlight visible on skin, warm 3200K key light from camera-left. Hands are still, completely still. Camera holds.

Slow pull-back over 12 seconds reveals the desk, a single sheet of paper face-down on the surface, pen beside it. The environment implies a home office — dark bookshelf edge visible, one small practical lamp providing ambient fill. The scene is intimate and decisive.

At second 28: the hands turn the paper over. Slow motion, 50% speed. We do not see what is written. Camera slowly rises from hands to the window behind the desk, soft late-afternoon light, city skyline slightly out of focus.

Final 8 seconds: frame holds on the window with golden-hour light. Speaker name appears lower-left in clean white regular-weight sans-serif: "— [Speaker Name]." below it in smaller type: "@showname | Episode 47." Fade to black.

Color grade: high contrast, warm shadows, teal-pulled highlights. Cinematic aspect, no vignette.


Example 2: The Expert Take (Split-Screen Interview Reconstruction, 16:9)

Context: Two-person podcast. Host and guest discussing why most startup pivots fail. Two audio clips — host question and guest response. Combined 75 seconds.


Seedance 2.0 video prompt, 16:9 widescreen, 75 seconds:

Opening: split-screen, hard vertical line at 50% horizontal. Left panel: host environment — modern home office, cool-toned ambient light, bookshelves visible in background out of focus, camera at eye level, static hold, subject framing leaves slight negative space on their right for visual breathing room. Right panel: guest environment — warmer, slightly softer ambience, practical desk lamp at frame-right, 2800K warmth, camera position slightly lower than eye-level suggesting the guest is more relaxed.

Both panels are simultaneously visible. The subjects are constructed, not real footage — described as follows:

Left panel subject: man, early 40s, lean posture, dark shirt, slightly leaning forward toward camera, expression attentive, one hand resting on desk, neutral background with soft depth separation. Light: key light from camera-left 4500K, minimal fill, clean catchlight.

Right panel subject: woman, mid-30s, natural posture, light-colored top, relaxed but engaged expression, hands occasionally gesture just below frame. Light: warm key from camera-right 3000K, softer shadow wrap, environment feels more inhabited.

At second 8: left panel pushes to 60% of screen width as host speaks. Right panel reduces to 40%, subject slightly defocused. No title card yet.

At second 22: guest begins speaking. Panels return to 50/50. Left panel camera begins barely perceptible slow zoom-in on host's listening expression. Right panel camera holds static on guest.

At second 35: key quote from guest appears as lower-third overlay on the right panel, white bold condensed sans-serif on semi-transparent dark strip: "Most pivots fail because the founders pivot away from pain instead of toward insight." Text persists for 6 seconds, then dissolves.

Final 15 seconds: both panels hold in conversation mode, cameras hold static. Full-screen end frame fades in from black: episode title centered in clean display type, both names and handles below in smaller weight, podcast logo upper-right corner. Dark background, minimal, professional.

Color grade: both panels intentionally slightly different — left cooler, right warmer — reinforcing their distinct perspectives. No filter effect, cinematic naturalism.


Example 3: The Vulnerable Moment (Cinematic B-Roll Narrative, 9:16)

Context: Solo podcast host reflecting on burnout. Single audio clip, 60 seconds. Emotional, introspective delivery. Key line: "I kept shipping things I didn't believe in anymore."


Seedance 2.0 video prompt, 9:16 vertical, 60 seconds:

Opening 3 seconds: extreme close-up on a laptop keyboard, single finger resting on the Enter key, not pressing. Shallow depth of field, bokeh'd background implies a dim room. Camera holds static. The finger does not move. No other motion in frame.

Cut at second 3: interior of a home office at night. Blue-grey ambient light from a monitor out of frame to the left. Desk covered in post-it notes, cups, cables — organized chaos of overwork. Camera positioned low, looking up at the desk from below-counter angle. Slow dolly-left movement, 5-second duration, revealing more of the desk's surface and a wall of notes behind. One note reads "SHIP IT" in visible handwriting but slightly out of focus.

Cut at second 14: back to the keyboard close-up, now slightly wider — we see both hands, neither moving. Camera begins an extremely slow push-in toward the hands over 8 seconds.

Cut at second 22: new environment. A coffee cup, half-full, cold. Steam: none. Morning light implied from window light on the cup's surface — but the light is pale, low-energy overcast. Camera static. Cup is centered in frame. 4 second hold.

Cut at second 26: corridor of a generic office building, late at night, fluorescent lights, empty. Camera at end of corridor looking down its length. Camera does not move for 5 seconds, then begins an extremely slow push forward down the corridor.

At second 31: kinetic text overlaid on the corridor shot: words appear one at a time, white heavy-weight sans-serif, centered: "things" — "I" — "didn't" — "believe" — "in" — "anymore." Each word appears at the rhythm of the imagined speech delivery (slow, deliberate). Final word "anymore." holds for 2 seconds, then the entire corridor shot and text fades to a held black.

From second 45: return to laptop close-up, tightest framing yet — just the trackpad, one thumb resting on it. Stillness. Over 8 seconds, a single small light — implied notification — pulses twice on the screen reflection in the trackpad surface. The thumb does not move.

Final 7 seconds: fade to deep navy. Speaker handle appears in small white regular-weight type, center-frame. Below it: episode name. Below that: "full episode linked in bio." All text fades in slowly, no animation.

Color grade: desaturated, cool blue-green shadows, low contrast in highlights. The visual language of exhaustion. No warmth until there is something to be warm about — and in this clip, there isn't.


Prompt Rules for Podcast Visualization

  1. Every visual element must serve the emotional content of the audio — no decorative shots without emotional purpose.

  2. Never use the word "waveform" in a prompt. A waveform is a technical representation of audio data. A Seedance 2.0 podcast visual is a cinematic translation of meaning. Use language like "acoustic particle field," "resonant light behavior," "sound-reactive atmospheric texture" — or, better, describe what the emotion looks like rather than what the sound looks like.

  3. Specify timing in seconds. Seedance 2.0 prompts that include "at second X" guidance produce more coherent visual sequences than prompts that describe only static states. Map every visual event to a time anchor.

  4. One dominant visual per scene. Avoid prompts that describe three competing visual elements in a single shot. The viewer can hold one thing at a time. Complexity comes from sequencing, not from stuffing each frame.

  5. Write the pause. Every description of motion must be balanced by a description of stillness. "Camera holds completely static for 4 seconds" is as important as any camera movement. Stillness communicates confidence.

  6. Match color temperature to emotional register. This is not a suggestion — it is a requirement. Cool temperatures (5000K+) for analytical, distanced, or revelatory content. Warm temperatures (under 3200K) for intimate, vulnerable, nostalgic content. Mixed temperatures for conflict or unresolved tension.

  7. Typography is visual, not verbal. When including quote text in prompts, describe its physical appearance (scale, weight, position, animation behavior) as precisely as you describe a person's expression. The font is a costume; the animation is a performance.

  8. Audio files inform the prompt; they do not dictate it. The visual narrative should be coherent on mute. If the video requires the audio to make sense, the prompt has not done its job. Test by asking: "would a viewer who cannot hear this clip understand its emotional content within 5 seconds?"

  9. End on identity. The last 5-10 seconds of every podcast visual should establish the creator's identity (name, handle, show, episode). This is not vanity — it is the conversion moment. The viewer has just felt something; tell them where to find more.

  10. Multi-clip sequencing. When using 2 or 3 audio files, each clip gets its own visual arc (opening hook, development, resolution), but all arcs share a unified color grade and typographic system. Consistency across clips signals craft; variation within clips signals dynamism.