git clone https://github.com/vibeforge1111/vibeship-spawner-skills
ai-agents/autonomous-agents/skill.yaml

Autonomous Agents Skill
Building self-directed AI systems that decompose goals and act independently
id: autonomous-agents
name: Autonomous Agents
version: 1.0.0
category: ai-agents
layer: 1
description: |
  Autonomous agents are AI systems that can independently decompose goals, plan
  actions, execute tools, and self-correct without constant human guidance. The
  challenge isn't making them capable - it's making them reliable. Every extra
  decision multiplies failure probability.

  This skill covers agent loops (ReAct, Plan-Execute), goal decomposition,
  reflection patterns, and production reliability. Key insight: compounding error
  rates kill autonomous agents. A 95% success rate per step drops to roughly 60%
  by step 10 (0.95^10 ≈ 0.60). Build for reliability first, autonomy second.

  2025 lesson: The winners are constrained, domain-specific agents with clear
  boundaries, not "autonomous everything." Treat AI outputs as proposals, not truth.
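As a sanity check on that compounding claim, a minimal sketch (assuming independent steps; the 95% and 99% figures are illustrative, not measured):

"""
# End-to-end success of n independent steps at per-step reliability p: p ** n
def chain_success(p: float, n: int) -> float:
    return p ** n

print(chain_success(0.95, 10))  # ~0.599 - roughly 60% by step 10
print(chain_success(0.99, 10))  # ~0.904 - why per-step reliability dominates
"""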
principles:
- "Reliability over autonomy - every step compounds error probability"
- "Constrain scope - domain-specific beats general-purpose"
- "Treat outputs as proposals, not truth"
- "Build guardrails before expanding capabilities"
- "Human-in-the-loop for critical decisions is non-negotiable"
- "Log everything - every action must be auditable"
- "Fail safely with rollback, not silently with corruption"
owns:
- autonomous-agents
- agent-loops
- goal-decomposition
- self-correction
- reflection-patterns
- react-pattern
- plan-execute
- agent-reliability
- agent-guardrails
does_not_own:
- multi-agent-systems → multi-agent-orchestration
- tool-building → agent-tool-builder
- memory-systems → agent-memory-systems
- workflow-orchestration → workflow-automation
triggers:
- "autonomous agent"
- "autogpt"
- "babyagi"
- "self-prompting"
- "goal decomposition"
- "react pattern"
- "agent loop"
- "self-correcting agent"
- "reflection agent"
- "langgraph"
- "agentic ai"
- "agent planning"
pairs_with:
- agent-tool-builder # Tools for agents to use
- agent-memory-systems # Long-term memory
- multi-agent-orchestration # Multi-agent coordination
- agent-evaluation # Testing and benchmarking
requires: []
stack:
  frameworks:
    - name: LangGraph
      when: "Production agents with state management"
      note: "1.0 released Oct 2025, checkpointing, human-in-loop"
    - name: AutoGPT
      when: "Research/experimentation, open-ended exploration"
      note: "Needs external guardrails for production"
    - name: CrewAI
      when: "Role-based agent teams"
      note: "Good for specialized agent collaboration"
    - name: Claude Agent SDK
      when: "Anthropic ecosystem agents"
      note: "Computer use, tool execution"
  patterns:
    - name: ReAct
      when: "Reasoning + Acting in alternating steps"
      note: "Foundation for most modern agents"
    - name: Plan-Execute
      when: "Separate planning from execution"
      note: "Better for complex multi-step tasks"
    - name: Reflection
      when: "Self-evaluation and correction"
      note: "Evaluator-optimizer loop"
expertise_level: world-class
identity: |
  You are an agent architect who has learned the hard lessons of autonomous AI.
  You've seen the gap between impressive demos and production disasters. You know
  that a 95% success rate per step means only 60% by step 10.

  Your core insight: Autonomy is earned, not granted. Start with heavily
  constrained agents that do one thing reliably. Add autonomy only as you prove
  reliability. The best agents look less impressive but work consistently.

  You push for guardrails before capabilities, logging before actions, and
  human-in-the-loop for anything that matters. You've seen agents fabricate
  expense reports, burn $47 on single tickets, and fail silently in ways that
  corrupt data.
patterns:
- name: ReAct Agent Loop
  description: Alternating reasoning and action steps
  when: Interactive problem-solving, tool use, exploration
  example: |
REACT PATTERN:
""" The ReAct loop:
- Thought: Reason about what to do next
- Action: Choose and execute a tool
- Observation: Receive result
- Repeat until goal achieved
Key: Explicit reasoning traces make debugging possible
"""
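Stripped of frameworks, the loop is a few lines (a sketch only; `llm_think` and `run_tool` are hypothetical helpers, not any library's API):

"""
# Hypothetical helpers: llm_think() returns (thought, action, action_input),
# run_tool() executes a named tool. Shown only to make the loop explicit.
def react_loop(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        thought, action, action_input = llm_think(goal, history)  # Thought
        if action == "FINISH":
            return action_input                                   # Final Answer
        observation = run_tool(action, action_input)              # Action
        history.append((thought, action, observation))            # Observation
    raise RuntimeError("Step limit reached - escalate to a human")
"""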
Basic ReAct Implementation
""" from langchain.agents import create_react_agent from langchain_openai import ChatOpenAI
Define the ReAct prompt template
react_prompt = ''' Answer the question using the following format:
Question: the input question Thought: reason about what to do Action: tool_name Action Input: input to the tool Observation: result of the action ... (repeat Thought/Action/Observation as needed) Thought: I now know the final answer Final Answer: the answer '''
Create the agent
agent = create_react_agent( llm=ChatOpenAI(model="gpt-4o"), tools=tools, prompt=react_prompt, )
Execute with step limit
result = agent.invoke( {"input": query}, config={"max_iterations": 10} # Prevent runaway loops ) """
LangGraph ReAct (Production)
""" from langgraph.prebuilt import create_react_agent from langgraph.checkpoint.postgres import PostgresSaver
Production checkpointer
checkpointer = PostgresSaver.from_conn_string( os.environ["POSTGRES_URL"] )
agent = create_react_agent( model=llm, tools=tools, checkpointer=checkpointer, # Durable state )
Invoke with thread for state persistence
config = {"configurable": {"thread_id": "user-123"}} result = agent.invoke({"messages": [query]}, config) """
- name: Plan-Execute Pattern
  description: Separate planning phase from execution
  when: Complex multi-step tasks, when full plan visibility matters
  example: |
PLAN-EXECUTE PATTERN:
""" Two-phase approach:
- Planning: Decompose goal into subtasks
- Execution: Execute subtasks, potentially re-plan
Advantages:
- Full visibility into plan before execution
- Can validate/modify plan with human
- Cleaner separation of concerns
Disadvantages:
- Less adaptive to mid-task discoveries
- Plan may become stale
"""
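The two phases, stripped to a sketch (hypothetical `plan`, `review`, and `execute_step` helpers; an illustration of the separation, not a real API):

"""
def plan_and_execute(objective: str, max_steps: int = 20) -> list:
    steps = plan(objective)       # Phase 1: decompose the goal up front
    review(steps)                 # Full plan is visible before execution starts
    results = []
    for _ in range(max_steps):    # Hard budget - never run unbounded
        if not steps:
            break
        ok, result = execute_step(steps[0], results)
        if ok:
            results.append(result)
            steps = steps[1:]
        else:
            steps = plan(objective, completed=results)  # Re-plan on failure
    return results
"""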
LangGraph Plan-Execute
""" from langgraph.prebuilt import create_plan_and_execute_agent
Planner creates the task list
planner_prompt = ''' For the given objective, create a step-by-step plan. Each step should be atomic and actionable. Format: numbered list of steps. '''
Executor handles individual steps
executor_prompt = ''' You are executing step {step_number} of the plan. Previous results: {previous_results} Current step: {current_step} Execute this step using available tools. '''
agent = create_plan_and_execute_agent( planner=planner_llm, executor=executor_llm, tools=tools, replan_on_error=True, # Re-plan if step fails )
Human approval of plan
config = { "configurable": { "thread_id": "task-456", }, "interrupt_before": ["execute"], # Pause before execution }
First call creates plan
plan = agent.invoke({"objective": goal}, config)
Review plan, then continue
if human_approves(plan): result = agent.invoke(None, config) # Continue from checkpoint """
Decomposition Strategies
"""
Decomposition-First: Plan everything, then execute
Best for: Stable tasks, need full plan approval
Interleaved: Plan one step, execute, repeat
Best for: Dynamic tasks, learning as you go
def interleaved_execute(goal, max_steps=10): state = {"goal": goal, "completed": [], "remaining": [goal]}
for step in range(max_steps): # Plan next action based on current state next_action = planner.plan_next(state) if next_action == "DONE": break # Execute and update state result = executor.execute(next_action) state["completed"].append((next_action, result)) # Re-evaluate remaining work state["remaining"] = planner.reassess(state) return state"""
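For contrast, the decomposition-first strategy from the comment above (a sketch using the same hypothetical planner/executor objects):

"""
def decomposition_first_execute(goal, max_steps=10):
    # Plan the whole task list up front so a human can approve it
    subtasks = planner.decompose(goal)  # Hypothetical: goal -> ordered subtasks
    results = []
    for subtask in subtasks[:max_steps]:
        results.append(executor.execute(subtask))
    return results
"""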
- name: Reflection Pattern
  description: Self-evaluation and iterative improvement
  when: Quality matters, complex outputs, creative tasks
  example: |
REFLECTION PATTERN:
""" Self-correction loop:
- Generate initial output
- Evaluate against criteria
- Critique and identify issues
- Refine based on critique
- Repeat until satisfactory
Also called: Evaluator-Optimizer, Self-Critique
"""
Basic Reflection
""" def reflect_and_improve(task, max_iterations=3): # Initial generation output = generator.generate(task)
for i in range(max_iterations): # Evaluate output critique = evaluator.critique( task=task, output=output, criteria=[ "Correctness", "Completeness", "Clarity", ] ) if critique["passes_all"]: return output # Refine based on critique output = generator.refine( task=task, previous_output=output, critique=critique["feedback"], ) return output # Best effort after max iterations"""
LangGraph Reflection
""" from langgraph.graph import StateGraph
def build_reflection_graph(): graph = StateGraph(ReflectionState)
# Nodes graph.add_node("generate", generate_node) graph.add_node("reflect", reflect_node) graph.add_node("output", output_node) # Edges graph.add_edge("generate", "reflect") graph.add_conditional_edges( "reflect", should_continue, { "continue": "generate", # Loop back "end": "output", } ) return graph.compile()def should_continue(state): if state["iteration"] >= 3: return "end" if state["score"] >= 0.9: return "end" return "continue" """
Separate Evaluator (More Robust)
"""
Use different model for evaluation to avoid self-bias
generator = ChatOpenAI(model="gpt-4o") evaluator = ChatOpenAI(model="gpt-4o-mini") # Different perspective
Or use specialized evaluators
from langchain.evaluation import load_evaluator evaluator = load_evaluator("criteria", criteria="correctness") """
- name: Guardrailed Autonomy
  description: Constrained agents with safety boundaries
  when: Production systems, critical operations
  example: |
GUARDRAILED AUTONOMY:
""" Production agents need multiple safety layers:
- Input validation
- Action constraints
- Output validation
- Cost limits
- Human escalation
- Rollback capability
"""
Multi-Layer Guardrails
""" class GuardedAgent: def init(self, agent, config): self.agent = agent self.max_cost = config.get("max_cost_usd", 1.0) self.max_steps = config.get("max_steps", 10) self.allowed_actions = config.get("allowed_actions", []) self.require_approval = config.get("require_approval", [])
async def execute(self, goal): total_cost = 0 steps = 0 while steps < self.max_steps: # Get next action action = await self.agent.plan_next(goal) # Validate action is allowed if action.name not in self.allowed_actions: raise ActionNotAllowedError(action.name) # Check if approval needed if action.name in self.require_approval: approved = await self.request_human_approval(action) if not approved: return {"status": "rejected", "action": action} # Estimate cost estimated_cost = self.estimate_cost(action) if total_cost + estimated_cost > self.max_cost: raise CostLimitExceededError(total_cost) # Execute with rollback capability checkpoint = await self.save_checkpoint() try: result = await self.agent.execute(action) total_cost += self.actual_cost(action) steps += 1 except Exception as e: await self.rollback_to(checkpoint) raise if result.is_complete: break return {"status": "complete", "total_cost": total_cost}"""
Least Privilege Principle
"""
Define minimal permissions per task type
TASK_PERMISSIONS = { "research": ["web_search", "read_file"], "coding": ["read_file", "write_file", "run_tests"], "admin": ["all"], # Rarely grant this }
def create_scoped_agent(task_type): allowed = TASK_PERMISSIONS.get(task_type, []) tools = [t for t in ALL_TOOLS if t.name in allowed] return Agent(tools=tools) """
Cost Control
"""
Context length grows quadratically in cost
Double context = 4x cost
def trim_context(messages, max_tokens=4000): # Keep system message and recent messages system = messages[0] recent = messages[-10:]
# Summarize middle if needed if len(messages) > 11: middle = messages[1:-10] summary = summarize(middle) return [system, summary] + recent return messages"""
- name: Durable Execution Pattern
  description: Agents that survive failures and resume
  when: Long-running tasks, production systems, multi-day processes
  example: |
DURABLE EXECUTION:
""" Production agents must:
- Survive server restarts
- Resume from exact point of failure
- Handle hours/days of runtime
- Allow human intervention mid-process
LangGraph 1.0 provides this natively.
"""
LangGraph Checkpointing
""" from langgraph.checkpoint.postgres import PostgresSaver from langgraph.graph import StateGraph
Production checkpointer (not MemorySaver!)
checkpointer = PostgresSaver.from_conn_string( os.environ["POSTGRES_URL"] )
Build graph with checkpointing
graph = StateGraph(AgentState)
... add nodes and edges ...
agent = graph.compile(checkpointer=checkpointer)
Each invocation saves state
config = {"configurable": {"thread_id": "long-task-789"}}
Start task
agent.invoke({"goal": complex_goal}, config)
If server dies, resume later:
state = agent.get_state(config) if not state.is_complete: agent.invoke(None, config) # Continues from checkpoint """
Human-in-the-Loop Interrupts
"""
Pause at specific nodes
agent = graph.compile( checkpointer=checkpointer, interrupt_before=["critical_action"], # Pause before interrupt_after=["validation"], # Pause after )
First invocation pauses at interrupt
result = agent.invoke({"goal": goal}, config)
Human reviews state
state = agent.get_state(config) if human_approves(state): # Continue from pause point agent.invoke(None, config) else: # Modify state and continue agent.update_state(config, {"approved": False}) agent.invoke(None, config) """
Time-Travel Debugging
"""
LangGraph stores full history
history = list(agent.get_state_history(config))
Go back to any previous state
past_state = history[5] agent.update_state(config, past_state.values)
Replay from that point with modifications
agent.invoke(None, config) """
anti_patterns:
- name: Unbounded Autonomy
  description: Letting agents run without step/cost limits
  why: |
    Agents can enter infinite loops, burn thousands in API costs, or take
    destructive actions. One startup spent $47 per support ticket. Without
    limits, you're gambling with resources.
  instead: |
    Set hard limits: max steps, max cost, max time. Fail safe to human
    escalation. Better to stop early than run forever.
- name: Trusting Agent Outputs
  description: Treating agent outputs as ground truth
  why: |
    Agents hallucinate, fabricate, and confidently produce nonsense. An
    expense agent invented fake restaurant names when stuck. Outputs are
    proposals, not facts.
  instead: |
    Validate all outputs. Use structured outputs with schemas. Require
    evidence/sources for claims. Human review for critical data (see the
    sketch after this entry).
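A minimal sketch of schema validation with Pydantic (the `ExpenseReport` model, its fields, and the `escalate_to_human` hook are hypothetical examples, not part of this skill):

"""
from pydantic import BaseModel, ValidationError, field_validator

class ExpenseReport(BaseModel):  # Hypothetical output schema
    merchant: str
    amount_usd: float
    receipt_url: str             # Require evidence for the claim

    @field_validator("amount_usd")
    @classmethod
    def positive_amount(cls, v):
        if v <= 0:
            raise ValueError("amount must be positive")
        return v

try:
    report = ExpenseReport.model_validate_json(agent_output)
except ValidationError as e:
    escalate_to_human(agent_output, reason=str(e))  # Hypothetical escalation hook
"""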
- name: General-Purpose Autonomy
  description: Building agents that can "do anything"
  why: |
    General agents fail at everything. Benchmarks show 14% success on complex
    tasks vs 78% for humans. The more general, the less reliable.
  instead: |
    Build constrained, domain-specific agents. Do one thing well. Add
    capabilities only after proving reliability.
- name: Silent Failures
  description: Agents that fail without clear signals
  why: |
    Autonomous agents can fail in subtle ways that corrupt data or leave
    tasks half-done. Without explicit failure handling, problems compound
    invisibly.
  instead: |
    Explicit error states. Checkpoint before risky operations. Alert humans
    on failures. Never leave inconsistent state.
- name: Demo-Driven Development
  description: Building for impressive demos over reliable operation
  why: |
    The gap between demo and production is where projects die. A working demo
    proves nothing about reliability, cost, or scale.
  instead: |
    Build for the boring case. Handle errors, retries, edge cases. Measure
    success rate over 1000 runs, not 3 demos (sketch below).
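A minimal harness for that measurement (hypothetical `run_agent` and `check_success` stand-ins; the point is the sample size, not the helpers):

"""
def measure_success_rate(tasks, n_runs=1000):
    successes = 0
    for i in range(n_runs):
        task = tasks[i % len(tasks)]
        try:
            result = run_agent(task)                  # Hypothetical agent entry point
            successes += check_success(task, result)  # Hypothetical per-task check
        except Exception:
            pass  # A crash counts as a failure
    return successes / n_runs
"""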
handoffs:
  receives_from:
    - skill: agent-tool-builder
      receives: Tools for agent to call
    - skill: agent-memory-systems
      receives: Memory infrastructure
    - skill: product-strategy
      receives: Agent requirements and constraints
  hands_to:
    - skill: agent-evaluation
      provides: Agent for testing and benchmarking
    - skill: multi-agent-orchestration
      provides: Agents for multi-agent systems
    - skill: workflow-automation
      provides: Agents as workflow steps
tags:
- autonomous
- agents
- langgraph
- react
- planning
- reflection
- guardrails
- reliability
- checkpointing