Skillsbench gemini-count-in-video
Analyze and count objects in videos using Google Gemini API (object counting, pedestrian detection, vehicle tracking, and surveillance video analysis).
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/pedestrian-traffic-counting/environment/skills/gemini-count-in-video" ~/.claude/skills/benchflow-ai-skillsbench-gemini-count-in-video && rm -rf "$T"
manifest: tasks/pedestrian-traffic-counting/environment/skills/gemini-count-in-video/SKILL.md
source content
Gemini Video Understanding Skill
Purpose
This skill enables video analysis and object counting using the Google Gemini API, with a focus on counting pedestrians, detecting objects, tracking movement, and analyzing surveillance footage. It supports precise prompting for differentiated counting (e.g., pedestrians vs cyclists vs vehicles).
When to Use
- Counting pedestrians, vehicles, or other objects in surveillance videos
- Distinguishing between different types of objects (walkers vs cyclists, cars vs trucks)
- Analyzing traffic patterns and movement through a scene
- Processing multiple videos for batch object counting
- Extracting structured count data from video footage
Required Libraries
The following Python libraries are required:
    from google import genai
    from google.genai import types
    import os
    import time
Input Requirements
- File formats: MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Size constraints:
  - Use inline bytes for small files (rule of thumb: <20MB).
  - Use the File API upload flow for larger videos (most surveillance footage).
  - Always wait for processing to complete before analysis.
- Video quality: Higher resolution provides better counting accuracy for distant objects
- Duration: Longer videos may require longer processing times; consider the full video length for accurate counting
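The size rule above can be sketched as a small helper. This is a minimal sketch, not part of the skill itself: `choose_upload_mode` and `INLINE_LIMIT_BYTES` are hypothetical names, and the 20 MB threshold is only the rule of thumb from the list, read here as 20 * 1024 * 1024 bytes.

```python
import os

# Assumption: the "<20MB" rule of thumb is interpreted as 20 * 1024 * 1024 bytes.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def choose_upload_mode(path: str, limit: int = INLINE_LIMIT_BYTES) -> str:
    """Return "inline" for files under the limit, "file_api" otherwise."""
    size = os.path.getsize(path)
    return "inline" if size < limit else "file_api"
```

For "inline", the video bytes would travel in the request itself; for "file_api", the file would go through `client.files.upload` and the processing wait described above.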
Output Schema
For object counting tasks, structure results as JSON:
    {
      "success": true,
      "video_file": "surveillance_001.mp4",
      "model": "gemini-2.0-flash-exp",
      "counts": {
        "pedestrians": 12,
        "cyclists": 3,
        "vehicles": 5
      },
      "notes": "Optional observations about the counting process or edge cases"
    }
Field Descriptions
- success: Whether the analysis completed successfully
- video_file: Name of the analyzed video file
- model: Gemini model used for the request
- counts: Object counts by category
- notes: Any clarifications or warnings about the count
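One way to assemble a record in the schema above is sketched here. `build_result` is a hypothetical helper, not something the skill defines, and the non-negative-integer check on counts is an added assumption about what a valid count looks like.

```python
import json


def build_result(video_file, model, counts, notes="", success=True):
    """Assemble a result record with the five fields of the output schema."""
    for category, value in counts.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"count for {category!r} must be a non-negative integer")
    return {
        "success": success,
        "video_file": video_file,
        "model": model,
        "counts": dict(counts),
        "notes": notes,
    }


result = build_result(
    "surveillance_001.mp4",
    "gemini-2.0-flash-exp",
    {"pedestrians": 12, "cyclists": 3, "vehicles": 5},
)
print(json.dumps(result, indent=2))
```

Validating counts before serializing keeps a parsing failure from silently ending up in the saved JSON.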
Code Examples
Basic Pedestrian Counting (File API Upload)
    from google import genai
    import os
    import time
    import re

    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

    # Upload video (File API for >20MB)
    myfile = client.files.upload(file="surveillance.mp4")

    # Wait for processing
    while myfile.state.name == "PROCESSING":
        time.sleep(5)
        myfile = client.files.get(name=myfile.name)

    if myfile.state.name == "FAILED":
        raise ValueError("Video processing failed")

    # Prompt for counting pedestrians with clear exclusion criteria
    prompt = """Count the total number of pedestrians who are WALKING through the scene in this surveillance video.

    IMPORTANT RULES:
    - ONLY count people who are walking on foot
    - DO NOT count people riding bicycles
    - DO NOT count people driving cars or other vehicles
    - Count each unique pedestrian only once, even if they appear in multiple frames

    Provide your answer as a single integer number representing the total count of pedestrians.
    Answer with just the number, nothing else.
    Your answer should be enclosed in <answer> and </answer> tags, such as <answer>5</answer>.
    """

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[prompt, myfile],
    )

    # Parse the response
    response_text = response.text.strip()
    match = re.search(r"<answer>(\d+)</answer>", response_text)
    if match:
        count = int(match.group(1))
        print(f"Pedestrian count: {count}")
    else:
        print("Could not parse count from response")
Batch Processing Multiple Videos
    from google import genai
    import os
    import time
    import re


    def upload_and_wait(client, file_path: str, max_wait_s: int = 300):
        """Upload video and wait for processing."""
        myfile = client.files.upload(file=file_path)
        waited = 0

        while myfile.state.name == "PROCESSING" and waited < max_wait_s:
            time.sleep(5)
            waited += 5
            myfile = client.files.get(name=myfile.name)

        if myfile.state.name == "FAILED":
            raise ValueError(f"Video processing failed: {myfile.state.name}")
        if myfile.state.name == "PROCESSING":
            raise TimeoutError(f"Processing timeout after {max_wait_s}s")

        return myfile


    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

    # Process all videos in directory
    video_dir = "/app/video"
    video_extensions = {".mp4", ".mkv", ".avi", ".mov"}
    results = {}

    for filename in os.listdir(video_dir):
        if any(filename.lower().endswith(ext) for ext in video_extensions):
            video_path = os.path.join(video_dir, filename)
            print(f"Processing {filename}...")

            # Upload and analyze
            myfile = upload_and_wait(client, video_path)
            response = client.models.generate_content(
                model="gemini-2.0-flash-exp",
                contents=["Count pedestrians walking through the scene. Answer with just the number.", myfile],
            )

            # Extract count; skip the video if the response contains no number
            match = re.search(r"\d+", response.text or "")
            if match is None:
                print("  Could not parse count, skipping")
                continue
            count = int(match.group())
            results[filename] = count
            print(f"  Count: {count}")

    print(f"\nProcessed {len(results)} videos")
    # Results dictionary can now be used for further processing or saving
Differentiating Object Types
    # Count different categories separately
    # (assumes `client`, `myfile`, and `import re` from the basic example)
    prompt = """Analyze this surveillance video and count:
    1. Pedestrians (people walking on foot)
    2. Cyclists (people riding bicycles)
    3. Vehicles (cars, trucks, motorcycles)

    RULES:
    - Count each unique individual/vehicle only once
    - If someone switches from walking to cycling, count them in their primary mode
    - Provide counts as three separate numbers

    Format your answer as:
    Pedestrians: <number>
    Cyclists: <number>
    Vehicles: <number>
    """

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[prompt, myfile],
    )

    # Parse multiple counts (each line raises AttributeError if its label is missing)
    text = response.text
    pedestrians = int(re.search(r'Pedestrians:\s*(\d+)', text).group(1))
    cyclists = int(re.search(r'Cyclists:\s*(\d+)', text).group(1))
    vehicles = int(re.search(r'Vehicles:\s*(\d+)', text).group(1))
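The per-category parsing can be made tolerant of missing lines. The following is a minimal sketch under that assumption: `parse_counts` and `CATEGORIES` are hypothetical names, and mapping a missing label to `None` is one possible design choice, not the skill's documented behavior.

```python
import re

CATEGORIES = ("Pedestrians", "Cyclists", "Vehicles")


def parse_counts(text: str) -> dict:
    """Parse 'Label: <number>' pairs; a label that is absent maps to None."""
    counts = {}
    for label in CATEGORIES:
        # \b keeps "Cyclists:" from matching inside "Motorcyclists:".
        match = re.search(rf"\b{label}:\s*(\d+)", text, flags=re.IGNORECASE)
        counts[label.lower()] = int(match.group(1)) if match else None
    return counts
```

Returning `None` rather than 0 for a missing category lets the caller tell "the model reported zero" apart from "the model did not answer".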
Using Answer Tags for Reliable Parsing
    # Request structured output with XML-like tags
    prompt = """Count the total number of pedestrians walking through the scene.

    You should reason and think step by step. Provide your answer as a single integer.
    Your answer should be enclosed in <answer> and </answer> tags, such as <answer>5</answer>.
    """

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[prompt, myfile],
    )

    # Robust extraction
    match = re.search(r"<answer>(\d+)</answer>", response.text)
    if match:
        count = int(match.group(1))
    else:
        # Fallback: try to find any number in response
        numbers = re.findall(r'\d+', response.text)
        count = int(numbers[0]) if numbers else 0
Best Practices
- Use the File API for all surveillance videos (typically >20MB) and always wait for processing to complete.
- Be specific in prompts: Clearly define what to count and what to exclude (e.g., "walking pedestrians only, not cyclists").
- Use structured output formats: Request answers in specific formats (like <answer>N</answer>) for reliable parsing.
- Ask for reasoning: Include "think step by step" to improve counting accuracy.
- Handle edge cases: Specify rules for partial appearances, people entering/exiting frame, and mode changes.
- Use gemini-2.0-flash-exp or gemini-2.5-flash: These models provide a good balance of speed and accuracy for object counting.
- Test with sample videos: Verify prompt effectiveness on representative samples before batch processing.
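Several of these practices (explicit inclusions and exclusions, answer tags, step-by-step reasoning) can be combined in a small prompt builder. This is an illustrative sketch only: `build_count_prompt` is a hypothetical helper, and the wording is adapted from the prompts in the code examples rather than taken from any official template.

```python
from typing import Sequence


def build_count_prompt(target: str, exclude: Sequence[str]) -> str:
    """Build a counting prompt with explicit exclusions and <answer> tags."""
    lines = [
        f"Count the total number of {target} in this video.",
        "Think step by step.",
        "RULES:",
        f"- ONLY count {target}",
    ]
    lines += [f"- DO NOT count {item}" for item in exclude]
    lines += [
        "- Count each unique individual only once, even if they appear in multiple frames",
        "Your answer should be enclosed in <answer> and </answer> tags, such as <answer>5</answer>.",
    ]
    return "\n".join(lines)


prompt = build_count_prompt(
    "pedestrians walking on foot",
    ["people riding bicycles", "people driving cars or other vehicles"],
)
```

Keeping the rules in one function makes it easier to run the same prompt on a few sample videos and adjust the exclusions before a batch run.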
Error Handling
    import re
    import time


    def upload_and_wait(client, file_path: str, max_wait_s: int = 300):
        """Upload video and wait for processing with timeout."""
        myfile = client.files.upload(file=file_path)
        waited = 0

        while myfile.state.name == "PROCESSING" and waited < max_wait_s:
            time.sleep(5)
            waited += 5
            myfile = client.files.get(name=myfile.name)

        if myfile.state.name == "FAILED":
            raise ValueError(f"Video processing failed: {myfile.state.name}")
        if myfile.state.name == "PROCESSING":
            raise TimeoutError(f"Processing timeout after {max_wait_s}s")

        return myfile


    def count_with_fallback(client, video_path):
        """Count pedestrians with error handling and fallback."""
        try:
            myfile = upload_and_wait(client, video_path)

            prompt = """Count pedestrians walking through the scene.
    Answer with just the number in <answer></answer> tags."""

            response = client.models.generate_content(
                model="gemini-2.0-flash-exp",
                contents=[prompt, myfile],
            )

            # Try structured parsing first
            match = re.search(r"<answer>(\d+)</answer>", response.text)
            if match:
                return int(match.group(1))

            # Fallback to any number found
            numbers = re.findall(r'\d+', response.text)
            if numbers:
                return int(numbers[0])

            print("Warning: Could not parse count, defaulting to 0")
            return 0

        except Exception as e:
            print(f"Error processing video: {e}")
            return 0
Common issues:
- Upload processing stuck: Use timeout logic and fail gracefully after max wait time
- Ambiguous responses: Use structured output tags like <answer></answer> for reliable parsing
- Rate limits: Add retry logic with exponential backoff for batch processing
- Inconsistent counts: Be very explicit in prompts about counting rules and exclusions
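The retry advice for rate limits could be sketched as follows. `retry_with_backoff` is a hypothetical helper, and it retries on any exception for simplicity; in practice you would likely narrow the `except` clause to whatever rate-limit error your SDK version raises, which is not specified here.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay_s=1.0,
                       max_delay_s=60.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay_s * 2**attempt (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Assumption: every exception is treated as retryable.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A batch loop could then wrap each request, for example `retry_with_backoff(lambda: count_with_fallback(client, video_path))`, although `count_with_fallback` already swallows exceptions and returns 0, so wrapping the raw `generate_content` call is the more useful spot.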
Limitations
- Counting accuracy depends on video quality, camera angle, and object size/distance
- Very crowded scenes may have higher counting variance
- Occlusion (objects blocking each other) can affect accuracy
- Long videos require longer processing times (typically 5-30 seconds per video)
- The model may occasionally misclassify similar objects (e.g., motorcyclist as cyclist)
- For highest accuracy, use clear prompts with explicit inclusion/exclusion criteria
Version History
- 1.0.0 (2026-01-21): Tailored for pedestrian traffic counting with focus on object counting, differentiation, and batch processing