prompt-engineer
Expert prompt engineering for building AI agents via Claude, GPT, and Gemini APIs. Triggers when: writing or editing system prompts, tool descriptions, agent instructions, function calling schemas, tool response design, context engineering, agentic system design, or discussing prompt quality for any LLM API. Also triggers on: prompt optimization, tool-use accuracy, cross-provider compatibility, or prompt review. Examples: 'improve the system prompt', 'write a tool description', 'the agent keeps calling the wrong tool', 'make this work on Gemini too', 'the tool returns too much data', 'design the agent context flow'.
git clone https://github.com/shaharsha/prompt-engineer-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/shaharsha/prompt-engineer-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/prompt-engineer" ~/.claude/skills/shaharsha-prompt-engineer-skill-prompt-engineer && rm -rf "$T"
skills/prompt-engineer/SKILL.mdPrompt Engineering for AI Agent APIs
Guidelines for writing system prompts, tool descriptions, and agent instructions for building AI agents via the Claude, GPT, and Gemini APIs.
Process
When editing an existing prompt, follow this order:
- Read the full existing prompt. Understand its intent, structure, and target provider.
- Identify the specific failure mode or improvement needed. Don't rewrite what isn't broken.
- Draft the minimal change that addresses the issue, following the guidelines below.
- Re-read the full prompt after editing to check for contradictions or broken flow.
- Test the prompt on the target provider with representative inputs.
For new prompts: start minimal (role + constraints + examples + output format), test, then add instructions only when you observe failure modes.
Before submitting any prompt edit: check for contradictions (if two rules conflict, the model picks arbitrarily — remove one), and verify clarity (could a colleague with no context follow this prompt unambiguously?).
A. Universal Best Practices
These apply to all three providers and cover ~70% of prompt engineering work.
System Prompt Structure
- Set a clear role in one sentence at the top. Every provider respects persona framing.
- Use XML tags (
,<role>
,<constraints>
,<examples>
) to separate sections. All three providers parse XML — it is the safest cross-provider delimiter.<output_format> - Be explicit and specific. Explain why a behavior matters, not just what to do. Models generalize better from explanations than from rigid rules.
- Tell the model what TO do, not what NOT to do. Positive framing ("Write in prose paragraphs") beats negative ("Don't use markdown").
- Start minimal, then iterate. Add instructions only when you observe failure modes. Over-engineered prompts cause over-analysis on Gemini and overtriggering on Claude.
Few-Shot Examples
- Include 3-5 diverse examples — one of the most reliable steering mechanisms across all providers.
- Cover typical cases, edge cases, and at least one "what not to do" example.
- Wrap in
tags with<example>
and<input>
sub-tags.<output> - Budget models need MORE examples (5-6 minimum) with simpler patterns.
Output Format
- Define output format explicitly — JSON schema, markdown template, or structured text.
- Provide a concrete example of the expected output, not just a description.
- Use enum arrays for valid values rather than prose descriptions — works on all providers and improves accuracy.
Context Engineering
- Context window is working memory. Every token competes for attention — more is not always better. Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.
- Put long documents at the top, queries and instructions at the bottom. This ordering improves response quality by up to 30% on complex multi-document inputs.
- Prefer progressive discovery — let agents find context through tools rather than front-loading everything. Maintain lightweight identifiers (paths, names, links) and load full content dynamically. Combine upfront retrieval (fast, known-relevant) with autonomous exploration (discovers unknowns).
- Use semantic names over technical IDs — replace UUIDs, mime_types, internal codes with human-readable names (file names, descriptive labels). LLMs reason more accurately over natural language than opaque identifiers.
- When context grows large, ask the model to quote relevant sections first before reasoning — this cuts through noise and grounds the response.
- For multi-turn agents: compact and summarize conversation history at context limits, preserving key decisions and findings. Use a cheap model (Flash, Haiku) for summarization.
- For long-horizon tasks: maintain structured notes (scratchpad, state JSON) outside the conversation for information that must survive compaction.
- Irrelevant context actively degrades performance — ruthlessly trim what the current step does not need. Performance degrades gradually (not a cliff) as context grows.
Task Decomposition
- Break complex tasks into phases with clear handoff points.
- Define success criteria for each phase so the model can self-check.
- Use separate prompt templates for distinct subtasks (analysis, extraction, generation).
- For complex workflows, break into sequential chained prompts — output from one becomes input for the next. Both GPT and Gemini perform better on focused prompts than bundled mega-prompts.
Agentic Systems
When designing multi-step or multi-agent workflows:
- Subagent design — give each subagent a focused prompt and a minimal tool set. Choose whether to pass parent context based on the task: isolated subagents (no inherited context) prevent context pollution; context-aware subagents (with injected state) enable continuity. Either way, return a condensed summary (1-2K tokens) to the orchestrator, not raw exploration.
- State management — use structured JSON for trackable state (phase progress, scores, pass/fail) and free text for qualitative notes (findings, reasoning). Emit state updates the orchestrator can act on.
- Context across windows — when a task spans multiple context windows, start the new window with a structured summary of prior findings, not raw conversation history.
- Autonomy calibration — be explicit about what the agent MAY do autonomously vs. what requires confirmation. Default: read operations are autonomous, write/delete operations require confirmation. Claude 4.6 is highly proactive — dial back aggressive prompting or it will over-act.
- Delegation discipline — Claude 4.6 over-spawns subagents for simple tasks. Add: "Only delegate when the task requires specialized tools or a clean context. For simple lookups, do it yourself."
B. Tool Descriptions
Tool descriptions are the single most impactful quality factor for tool-use accuracy across all three providers.
Minimum Requirements
- Full models: 3-4 sentences minimum. Budget models: 5-6 sentences.
- Every tool description must cover:
- What it does — one clear sentence
- When to use it — specific triggers and conditions
- When NOT to use it — common mistakes and overlapping tools
- Parameters — type, constraints, valid values, format for each
- Return value — what comes back and what does NOT
- Caveats — rate limits, data freshness, error conditions
Architecture
- Limit active tools to 10-20 for best accuracy on all providers. Use tool search or dynamic loading for larger sets — both Anthropic and OpenAI recommend deferring infrequently-used tools.
- Curate a minimal set — if a human cannot definitively choose between two tools, the model cannot either.
- Build tools around workflows, not API wrappers.
(finds availability AND books) beats separateschedule_event
+list_users
+list_events
. Fewer, higher-value tools outperform many narrow ones.create_event - Use meaningful namespaced names in snake_case (
, notgithub_list_prs
). Group related tools with prefixes. Some providers reject names with spaces or special characters.list - Design tool responses carefully — see Tool Response Design below.
- All three providers support parallel tool calling — design tools to be independently callable.
- Iterate via evaluation: small description improvements yield dramatic gains. Create realistic multi-tool test cases, run programmatically, analyze transcripts. When a tool is misused, fix the description first — it's the highest-leverage fix.
Provider-Specific Tool Guidance
| Aspect | Claude | GPT | Gemini |
|---|---|---|---|
| Description style | Detailed narrative (3-4 sentences) | CTCO: Context, Task, Constraints, Output | Short and direct; use enum arrays heavily |
| Strict schema | Supported | for 100% schema adherence | Up to 512 declarations; 10-20 active recommended |
| Tool preambles | Not needed | "Before calling a tool, explain why" boosts accuracy | Not needed |
| Error recovery | Handles well natively | Handles well natively | Add: "Don't repeat failed calls with identical arguments" |
| Scope creep risk | Low | High — add "Do ONLY what is requested" | Moderate |
Effective Patterns from Production Systems
Patterns from Claude Code's tool descriptions and other production agents:
- Decision-tree routing: "ALWAYS use X for Y. NEVER use Z — use W instead." Direct, unambiguous tool selection.
- When to use / When NOT to use as first-class sections — put these at the top of the description, not buried after parameter docs. Include specific triggers, not just "when relevant."
- Name alternatives explicitly: "Do NOT use for searching employees — use search_employees instead." The model needs to know what to call instead.
- Safety constraints separated — mark irreversible or dangerous operations distinctly from functional description.
- Concrete thresholds — "3+ steps" not "complex tasks"; "at least 2 characters" not "enough text." Numeric where possible.
Input Examples
For complex tools with nested objects, format-sensitive parameters, or tools that are easily confused with each other — add 1-5 input examples showing realistic parameter values. This improves parameter accuracy from 72% to 90%.
See
${CLAUDE_SKILL_DIR}/templates/tool-description-template.md for the full structured format.
Tool Response Design
Tool responses are context — bloated responses waste tokens and degrade reasoning on subsequent steps.
- Return only high-signal information. Strip internal IDs, metadata, and fields the model will not use.
- Use semantic values over technical ones — return
notfile_type: "spreadsheet"
. Returnmime_type: "application/vnd.openxmlformats..."
notstatus: "approved"
.status_code: 3 - Paginate large results with sensible defaults. Include total count and a guidance message:
— steer the model toward targeted searches."Showing 20 of 487 results. Narrow your query or use offset for next page." - Truncate long text with a structural outline — return the first N lines plus an outline (headers, sections) so the model can request specific parts.
- Explicit absence — state what is NOT included:
This prevents hallucinated fields."Does NOT include financial data — use get_project_details." - Actionable errors — return specific, actionable error messages, not opaque codes.
beats"No results for 'XYZ'. Try broader terms or check spelling."
."Error 404" - Response format parameter — expose a
parameter (format
vs"detailed"
) letting agents choose output verbosity. A concise response can be 3x fewer tokens than detailed, saving context for reasoning."concise"
C. Provider Differences
System Prompt Role and Placement
| Aspect | Claude | GPT | Gemini |
|---|---|---|---|
| System role name | | (prioritized over user) | |
| Instruction placement | Top works best | BOTH start AND end (recency bias) | Critical constraints go LAST |
| Multi-turn drift | System prompt persists well | Reappend key instructions every 3-5 messages | System instruction persists well |
| Default verbosity | Verbose — request conciseness if needed | Moderate — use param | Concise — request elaboration if needed |
Prompting Style
| Aspect | Claude | GPT | Gemini |
|---|---|---|---|
| Aggressive language | Avoid — 4.6 is proactive by default; aggressive prompting causes overtriggering and over-action | Unnecessary on GPT-5 (most steerable model); contradictions actively harmful | Avoid — causes over-analysis |
| Chain-of-thought | Not needed — use adaptive thinking | Harmful on reasoning models — degrades performance | Not needed — use thinkingLevel |
| Prompt length | Medium-length prompts work well | Longer prompts tolerated | Short, direct prompts work best |
| Structuring | XML tags (strongest support) | XML or markdown (JSON wrapping degrades perf) | XML, markdown, or plain text — be consistent |
| Persona handling | Follows but maintains guardrails | Follows reasonably | Takes personas VERY seriously — may override other instructions |
| Non-English output | Follows prompt language naturally | Needs mild nudging | Requires aggressive: "RESPOND IN {LANGUAGE}. YOU MUST RESPOND UNMISTAKABLY IN {LANGUAGE}." |
Reasoning and Thinking
| Aspect | Claude | GPT | Gemini |
|---|---|---|---|
| Mechanism | Adaptive by default; use param (low/medium/high) for control | : minimal/medium/high; peaks across multiple agent turns | : minimal/low/medium/high |
| Default state | Adaptive (model decides) — no config needed | OFF — must enable explicitly | Cannot disable on latest Pro models |
| Temperature | 0.0-1.0 typical range | 0.0-1.0 typical range | Keep at 1.0 — lowering causes looping/degradation |
| Trace reuse | Not supported | saves tokens in multi-turn | Not supported |
Caching
| Aspect | Claude | GPT | Gemini |
|---|---|---|---|
| Mechanism | Developer-controlled cache breakpoints () | Automatic prefix-based (≥1024 tokens, 128-token granularity) | Implicit (automatic) + explicit ( objects) |
| Cost savings | Cache reads at 10% of input cost (90% savings) | Cache reads ~80% cheaper than standard input | Model-dependent; explicit caching reduces per-request cost |
| Minimum tokens | 1,024-4,096 depending on model | 1,024 tokens | Varies by model |
| TTL | 5 min (default) or 1 hour; reads reset TTL | ~5-10 min automatic; 24h with extended caching | 1 hour default; configurable per CachedContent |
Cache-aware prompt design (applies to all providers, saves 80-90% on input costs):
- Structure static → dynamic: place system prompt, tool definitions, and examples first (stable prefix). Put user input and conversation history last. The prefix must be byte-identical across requests.
- Never reorder: changing tool order, image order, or message order between requests breaks the cache on all providers. Append new messages — never modify earlier ones.
- Multi-turn: append assistant response + new user message at the end. The stable prefix (system + tools + earlier messages) hits cache automatically.
- Monitor: check
(Claude),cache_read_input_tokens
(GPT), or cache metadata (Gemini) in responses to verify hits.cached_tokens
Unique Features
- Claude: Prefill removed in 4.6. Adaptive thinking is default — no config needed. Context-aware (can track remaining context window). Parallel tool calling ~100% success with explicit instruction. Use
param (noteffort
) for thinking control. Tool search available for deferring large tool sets.budget_tokens - GPT:
role prioritized overdeveloper
— security-sensitive instructions go in developer message.user
guarantees 100% JSON schema adherence. Independentstrict: true
param (low/medium/high) controls output length separately from reasoning depth.text.verbosity
preserves reasoning across turns (73.9% → 78.2% on benchmarks). Contradictions in prompts are more damaging in GPT-5 — it spends tokens reconciling conflicts instead of ignoring them.previous_response_id - Gemini: Native Google Search grounding (exclusive). Media resolution control for multimodal (image/video/PDF token budgets). Cannot combine built-in tools with custom function calling simultaneously. Few-shot examples are critical — "prompts without few-shot examples are likely to be less effective." Default verbosity is terse — request elaboration explicitly. Add temporal grounding ("Remember it is {YEAR} this year") for time-sensitive tasks. Completion priming (start the response, let model continue) is more reliable than describing format preferences. Thought signatures (
) preserve reasoning chains across multi-turn calls — capture and return them to maintain coherence.thoughtSignatures
D. Budget Models
Budget models (Claude Haiku 4.5, GPT-5 Mini, Gemini 3 Flash / 3.1 Flash Lite) share common patterns:
What Changes
- More explicit instructions — less capable at inferring intent from context.
- More examples — 5-6 minimum, simpler patterns, covering more edge cases.
- Simpler tool sets — fewer tools with clearer boundaries. Consolidate where possible.
- Shorter system prompts — trim context aggressively; budget models lose more from noise.
- Longer tool descriptions — 5-6 sentences minimum instead of 3-4.
Best Uses by Tier
| Task Type | Haiku 4.5 | GPT-5 Mini | Flash / Flash Lite |
|---|---|---|---|
| Classification / routing | Excellent | Good | Excellent |
| Structured extraction | Good | Good | Good |
| Simple tool use | Good | Good | Good |
| Complex reasoning | Use full model | Use full model | Use full model |
| Multimodal (image/PDF) | Adequate | Adequate | Flash excellent |
| High-volume batch | Good value | Good value | Flash Lite best value |
Practical Pattern
Use budget models for fast/cheap phases (extraction, classification, routing) and full models for complex phases (analysis, recommendation, generation). This is the standard multi-tier agent pipeline.
Warning: tools designed for weaker models can actively harm stronger ones. Detailed workarounds and hand-holding that help Haiku may cause Opus to overtrigger or over-act. When supporting multiple model tiers, test tool descriptions on each tier — or use model-conditional tool descriptions.
E. Cross-Provider Compatibility
When writing prompts that must work across providers or when provider-switching is likely:
Safe Everywhere
- XML tags for structure
- Markdown headers and lists
- Few-shot examples in
tags — also work as implicit constraint enforcers (can sometimes replace explicit instructions if examples are clear enough)<example> - Enum arrays for valid values
- Role/persona in the first sentence
- Positive framing ("do X" not "don't do Y")
- Explicit output format with concrete examples
Must Abstract Per Provider (in your agent framework, not in the prompt)
- Thinking/reasoning configuration (API parameter, not prompt content)
- Temperature settings (especially Gemini's 1.0 requirement)
- System message role name (
vssystem
vsdeveloper
)system_instruction - Caching strategy (manual breakpoints vs automatic prefix vs automatic)
- Tool schema strictness (
is GPT-specific)strict: true - Reasoning effort levels (different scales per provider)
Cross-Provider Prompt Pattern
Write the core prompt once using XML structure, then wrap provider-specific adjustments in your agent framework:
- Core prompt: role + constraints + tools + examples + output format (XML tags)
- Provider adapter: instruction placement, language enforcement, anti-scope-creep guardrails
- Model adapter: thinking config, temperature, caching, max tokens
F. Examples
Good vs Bad System Prompt Opening
<example> <label>Good — clear role, explains why</label> <content> You are a senior auditor analyzing government tender documents. Your goal is to identify eligibility requirements and scoring criteria so the firm can decide whether to bid.When requirements are ambiguous, flag them explicitly rather than guessing. The cost of missing a requirement is much higher than the cost of flagging a false positive. </content> <reasoning> Sets a clear role in one sentence. Explains the goal. Provides a decision-making principle with the WHY behind it (cost asymmetry). The model can now generalize this principle to novel situations. </reasoning> </example>
<example> <label>Bad — vague, no reasoning</label> <content> You are a helpful assistant. Be thorough and accurate. Don't make mistakes. Always double-check your work. </content> <reasoning> No specific role. "Be thorough" and "don't make mistakes" are meaningless — every model already tries to be accurate. No explanation of WHY or HOW to prioritize. The model has nothing to generalize from. </reasoning> </example>Good vs Bad Tool Description
<example> <label>Good — covers all 6 required elements</label> <content> Search for projects in the database using full-text search. Use when the user asks about past work, specific projects, or experience in a domain. Do NOT use for searching employees or clients — use search_employees or search_clients instead.Args: query: str. Search terms (Hebrew or English). For Hebrew prefix matching, use at least 2 characters. Example: "ביקורת רשויות" limit: int, optional. Max results to return. Default 10.
Returns a list of records with: project_id, name, client_name, year, scope, team_members. Does NOT include financial data — use get_project_details for billing. </content> <reasoning> Covers: what it does, when to use it, when NOT to use it (with alternatives), parameter details with example, return value with explicit exclusions. 6 sentences. A model reading this cannot misuse the tool. </reasoning> </example>
<example> <label>Bad — one-liner</label> <content> Searches projects. </content> <reasoning> Missing: when to use, when not to use, parameters, return value, caveats. The model will guess — and guess wrong. This is the #1 cause of poor tool-use accuracy. </reasoning> </example>G. Common Anti-Patterns
Recognize these urges and resist them:
- "Let me add more detail to be safe" — Over-engineered prompts cause over-analysis (Gemini) and overtriggering (Claude). Start minimal, add only what fixes observed failures.
- "CRITICAL: YOU MUST ALWAYS..." — Aggressive language overtriggers on Claude 4.6 and causes over-analysis on Gemini. Use calm, direct instructions.
- "Think step by step" — Harmful on GPT reasoning models. Unnecessary on Claude/Gemini (use their thinking APIs instead). Remove manual chain-of-thought from prompts targeting reasoning-capable models.
- "Don't hallucinate" — Negative framing is less effective. Instead: "Ground all claims in provided documents. Quote the relevant section before answering." (See Context Engineering.)
- Adding 20+ tools — Too many overlapping tools is the #1 failure mode across all providers. If you can't choose between two tools as a human, the model can't either.
- Wrapping APIs as tools — building 1:1 API-to-tool mappings instead of workflow-oriented tools.
(finds availability AND books) beats separateschedule_event
+list_users
+list_events
.create_event - Over-delegating to subagents — Claude 4.6 spawns subagents for tasks it could handle directly. If the task needs no specialized tools or clean context, do it in-line.
- Rewriting the whole prompt — When editing, preserve existing structure and tone. Make the minimal change that fixes the issue. Re-read the full prompt after editing.
- Prompt archaeology neglect — instructions effective in GPT-4/Claude 3.5 may backfire in newer models. When upgrading models, audit prompts for obsolete aggressive encouragement — native capabilities make external prodding redundant.
- Assuming all providers behave the same — They don't. Check Section C for differences in instruction placement, verbosity defaults, persona handling, and temperature.
H. Testing Prompts
Don't iterate blindly. Build simple evaluations:
- Identify failures — run the agent on representative tasks without changes. Note specific failures.
- Fix one thing at a time — make a single change, re-test, measure. Don't bundle multiple changes.
- Test on the target model tier — what works on Opus/GPT-5.4/Gemini Pro may need more detail for Haiku/Mini/Flash.
- Use adversarial inputs — empty strings, unexpected languages, edge cases, tools that shouldn't be called.
- After 2 failed correction attempts — stop iterating. Start fresh with a better initial prompt incorporating lessons learned.
Templates
Reference templates in
${CLAUDE_SKILL_DIR}/templates/:
— Contract-style system prompt structuresystem-prompt-template.md
— Structured tool description formattool-description-template.md