Vibeship-spawner-skills computer-use-agents

Computer Use Agents Skill

install
source · Clone the upstream repo
git clone https://github.com/vibeforge1111/vibeship-spawner-skills
manifest: ai-agents/computer-use-agents/skill.yaml
source content

Computer Use Agents Skill

AI agents that control computers through visual perception and actions

id: computer-use-agents name: Computer Use Agents category: ai-agents description: | Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on sandboxing, security, and handling the unique challenges of vision-based control.

version: 1.0.0

triggers:

  • "computer use"
  • "desktop automation agent"
  • "screen control AI"
  • "vision-based agent"
  • "GUI automation"
  • "Claude computer"
  • "OpenAI Operator"
  • "browser agent"
  • "visual agent"
  • "RPA with AI"

============================================================================

CORE PATTERNS

============================================================================

patterns:

  • id: perception-reasoning-action-loop name: Perception-Reasoning-Action Loop description: | The fundamental architecture of computer use agents: observe screen, reason about next action, execute action, repeat. This loop integrates vision models with action execution through an iterative pipeline.

    Key components:

    1. PERCEPTION: Screenshot captures current screen state
    2. REASONING: Vision-language model analyzes and plans
    3. ACTION: Execute mouse/keyboard operations
    4. FEEDBACK: Observe result, continue or correct

    Critical insight: Vision agents are completely still during "thinking" phase (1-5 seconds), creating a detectable pause pattern. when_to_use:

    • "Building any computer use agent from scratch"
    • "Integrating vision models with desktop control"
    • "Understanding agent behavior patterns" implementation: | from anthropic import Anthropic from PIL import Image import base64 import pyautogui import time

    class ComputerUseAgent: """ Perception-Reasoning-Action loop implementation. Based on Anthropic Computer Use patterns. """

      def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
          self.client = client
          self.model = model
          self.max_steps = 50  # Prevent runaway loops
          self.action_delay = 0.5  # Seconds between actions
    
      def capture_screenshot(self) -> str:
          """Capture screen and return base64 encoded image."""
          screenshot = pyautogui.screenshot()
          # Resize for token efficiency (1280x800 is good balance)
          screenshot = screenshot.resize((1280, 800), Image.LANCZOS)
    
          import io
          buffer = io.BytesIO()
          screenshot.save(buffer, format="PNG")
          return base64.b64encode(buffer.getvalue()).decode()
    
      def execute_action(self, action: dict) -> dict:
          """Execute mouse/keyboard action on the computer."""
          action_type = action.get("type")
    
          if action_type == "click":
              x, y = action["x"], action["y"]
              button = action.get("button", "left")
              pyautogui.click(x, y, button=button)
              return {"success": True, "action": f"clicked at ({x}, {y})"}
    
          elif action_type == "type":
              text = action["text"]
              pyautogui.typewrite(text, interval=0.02)
              return {"success": True, "action": f"typed {len(text)} chars"}
    
          elif action_type == "key":
              key = action["key"]
              pyautogui.press(key)
              return {"success": True, "action": f"pressed {key}"}
    
          elif action_type == "scroll":
              direction = action.get("direction", "down")
              amount = action.get("amount", 3)
              scroll = -amount if direction == "down" else amount
              pyautogui.scroll(scroll)
              return {"success": True, "action": f"scrolled {direction}"}
    
          elif action_type == "move":
              x, y = action["x"], action["y"]
              pyautogui.moveTo(x, y)
              return {"success": True, "action": f"moved to ({x}, {y})"}
    
          else:
              return {"success": False, "error": f"Unknown action: {action_type}"}
    
      def run(self, task: str) -> dict:
          """
          Run perception-reasoning-action loop until task complete.
    
          The loop:
          1. Screenshot current state
          2. Send to vision model with task context
          3. Parse action from response
          4. Execute action
          5. Repeat until done or max steps
          """
          messages = []
          step_count = 0
    
          system_prompt = """You are a computer use agent. You can see the screen
          and control mouse/keyboard.
    
          Available actions (respond with JSON):
          - {"type": "click", "x": 100, "y": 200, "button": "left"}
          - {"type": "type", "text": "hello world"}
          - {"type": "key", "key": "enter"}
          - {"type": "scroll", "direction": "down", "amount": 3}
          - {"type": "done", "result": "task completed successfully"}
    
          Always respond with ONLY a JSON action object.
          Be precise with coordinates - click exactly where needed.
          If you see an error, try to recover.
          """
    
          while step_count < self.max_steps:
              step_count += 1
    
              # 1. PERCEPTION: Capture current screen
              screenshot_b64 = self.capture_screenshot()
    
              # 2. REASONING: Send to vision model
              user_content = [
                  {"type": "text", "text": f"Task: {task}\n\nStep {step_count}. What action should I take?"},
                  {"type": "image", "source": {
                      "type": "base64",
                      "media_type": "image/png",
                      "data": screenshot_b64
                  }}
              ]
    
              messages.append({"role": "user", "content": user_content})
    
              response = self.client.messages.create(
                  model=self.model,
                  max_tokens=1024,
                  system=system_prompt,
                  messages=messages
              )
    
              assistant_message = response.content[0].text
              messages.append({"role": "assistant", "content": assistant_message})
    
              # 3. Parse action from response
              import json
              try:
                  action = json.loads(assistant_message)
              except json.JSONDecodeError:
                  # Try to extract JSON from response
                  import re
                  match = re.search(r'\{[^}]+\}', assistant_message)
                  if match:
                      action = json.loads(match.group())
                  else:
                      continue
    
              # Check if done
              if action.get("type") == "done":
                  return {
                      "success": True,
                      "result": action.get("result"),
                      "steps": step_count
                  }
    
              # 4. ACTION: Execute
              result = self.execute_action(action)
    
              # Small delay for UI to update
              time.sleep(self.action_delay)
    
          return {
              "success": False,
              "error": "Max steps reached",
              "steps": step_count
          }
    

    Usage

    agent = ComputerUseAgent(Anthropic()) result = agent.run("Open Chrome and search for 'weather today'") anti_patterns:

    • "Running without step limits (infinite loops)"
    • "No delay between actions (UI can't keep up)"
    • "Screenshots at full resolution (token explosion)"
    • "Ignoring action failures (no recovery)"
  • id: sandboxed-environment-pattern name: Sandboxed Environment Pattern description: | Computer use agents MUST run in isolated, sandboxed environments. Never give agents direct access to your main system - the security risks are too high. Use Docker containers with virtual desktops.

    Key isolation requirements:

    1. NETWORK: Restrict to necessary endpoints only
    2. FILESYSTEM: Read-only or scoped to temp directories
    3. CREDENTIALS: No access to host credentials
    4. SYSCALLS: Filter dangerous system calls
    5. RESOURCES: Limit CPU, memory, time

    The goal is "blast radius minimization" - if the agent goes wrong, damage is contained to the sandbox. when_to_use:

    • "Deploying any computer use agent"
    • "Testing agent behavior safely"
    • "Running untrusted automation tasks" implementation: |

    Dockerfile for sandboxed computer use environment

    Based on Anthropic's reference implementation pattern

    FROM ubuntu:22.04

    Install desktop environment

    RUN apt-get update && apt-get install -y
    xvfb
    x11vnc
    fluxbox
    xterm
    firefox
    python3
    python3-pip
    supervisor

    Security: Create non-root user

    RUN useradd -m -s /bin/bash agent &&
    mkdir -p /home/agent/.vnc

    Install Python dependencies

    COPY requirements.txt /tmp/ RUN pip3 install -r /tmp/requirements.txt

    Security: Drop capabilities

    RUN apt-get install -y --no-install-recommends libcap2-bin &&
    setcap -r /usr/bin/python3 || true

    Copy agent code

    COPY --chown=agent:agent . /app WORKDIR /app

    Supervisor config for virtual display + VNC

    COPY supervisord.conf /etc/supervisor/conf.d/

    Expose VNC port only (not desktop directly)

    EXPOSE 5900

    Run as non-root

    USER agent

    CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]


    docker-compose.yml with security constraints

    version: '3.8'

    services: computer-use-agent: build: . ports: - "5900:5900" # VNC for observation - "8080:8080" # API for control

      # Security constraints
      security_opt:
        - no-new-privileges:true
        - seccomp:seccomp-profile.json
    
      # Resource limits
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 4G
          reservations:
            cpus: '0.5'
            memory: 1G
    
      # Network isolation
      networks:
        - agent-network
    
      # No access to host filesystem
      volumes:
        - agent-tmp:/tmp
    
      # Read-only root filesystem
      read_only: true
      tmpfs:
        - /run
        - /var/run
    
      # Environment
      environment:
        - DISPLAY=:99
        - NO_PROXY=localhost
    

    networks: agent-network: driver: bridge internal: true # No internet by default

    volumes: agent-tmp:


    Python wrapper with additional runtime sandboxing

    import subprocess import os from dataclasses import dataclass from typing import Optional

    @dataclass class SandboxConfig: """Configuration for agent sandbox.""" network_allowed: list[str] = None # Allowed domains max_runtime_seconds: int = 300 max_memory_mb: int = 2048 allow_downloads: bool = False allow_clipboard: bool = False

    class SandboxedAgent: """ Run computer use agent in Docker sandbox. """

      def __init__(self, config: SandboxConfig):
          self.config = config
          self.container_id: Optional[str] = None
    
      def start(self):
          """Start sandboxed environment."""
          # Build network rules
          network_rules = ""
          if self.config.network_allowed:
              for domain in self.config.network_allowed:
                  network_rules += f"--add-host={domain}:$(dig +short {domain}) "
          else:
              network_rules = "--network=none"
    
          cmd = f"""
          docker run -d \
              --name computer-use-sandbox-$$ \
              --security-opt no-new-privileges \
              --cap-drop ALL \
              --memory {self.config.max_memory_mb}m \
              --cpus 2 \
              --read-only \
              --tmpfs /tmp \
              {network_rules} \
              computer-use-agent:latest
          """
    
          result = subprocess.run(cmd, shell=True, capture_output=True)
          self.container_id = result.stdout.decode().strip()
    
          # Set up kill timer
          subprocess.Popen([
              "sh", "-c",
              f"sleep {self.config.max_runtime_seconds} && docker kill {self.container_id}"
          ])
    
          return self.container_id
    
      def execute_task(self, task: str) -> dict:
          """Execute task in sandbox."""
          if not self.container_id:
              self.start()
    
          # Send task to agent via API
          import requests
          response = requests.post(
              f"http://localhost:8080/task",
              json={"task": task},
              timeout=self.config.max_runtime_seconds
          )
    
          return response.json()
    
      def stop(self):
          """Stop and remove sandbox."""
          if self.container_id:
              subprocess.run(f"docker rm -f {self.container_id}", shell=True)
              self.container_id = None
    

    anti_patterns:

    • "Running agents on host system directly"
    • "Giving sandbox full network access"
    • "Running as root in container"
    • "No resource limits (denial of service)"
    • "Persistent storage (data can leak between runs)"
  • id: anthropic-computer-use name: Anthropic Computer Use Implementation description: | Official implementation pattern using Claude's computer use capability. Claude 3.5 Sonnet was the first frontier model to offer computer use. Claude Opus 4.5 is now the "best model in the world for computer use."

    Key capabilities:

    • screenshot: Capture current screen state
    • mouse: Click, move, drag operations
    • keyboard: Type text, press keys
    • bash: Run shell commands
    • text_editor: View and edit files

    Tool versions:

    • computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
    • computer_20250124 (All other models): Standard capabilities

    Critical limitation: "Some UI elements (like dropdowns and scrollbars) might be tricky for Claude to manipulate" - Anthropic docs when_to_use:

    • "Building production computer use agents"
    • "Need highest quality vision understanding"
    • "Full desktop control (not just browser)" implementation: | from anthropic import Anthropic from anthropic.types.beta import ( BetaToolComputerUse20241022, BetaToolBash20241022, BetaToolTextEditor20241022, ) import subprocess import base64 from PIL import Image import io

    class AnthropicComputerUse: """ Official Anthropic Computer Use implementation.

      Requires:
      - Docker container with virtual display
      - VNC for viewing agent actions
      - Proper tool implementations
      """
    
      def __init__(self):
          self.client = Anthropic()
          self.model = "claude-sonnet-4-20250514"  # Best for computer use
          self.screen_size = (1280, 800)
    
      def get_tools(self) -> list:
          """Define computer use tools."""
          return [
              BetaToolComputerUse20241022(
                  type="computer_20241022",
                  name="computer",
                  display_width_px=self.screen_size[0],
                  display_height_px=self.screen_size[1],
              ),
              BetaToolBash20241022(
                  type="bash_20241022",
                  name="bash",
              ),
              BetaToolTextEditor20241022(
                  type="text_editor_20241022",
                  name="str_replace_editor",
              ),
          ]
    
      def execute_tool(self, name: str, input: dict) -> dict:
          """Execute a tool and return result."""
    
          if name == "computer":
              return self._handle_computer_action(input)
          elif name == "bash":
              return self._handle_bash(input)
          elif name == "str_replace_editor":
              return self._handle_editor(input)
          else:
              return {"error": f"Unknown tool: {name}"}
    
      def _handle_computer_action(self, input: dict) -> dict:
          """Handle computer control actions."""
          action = input.get("action")
    
          if action == "screenshot":
              # Capture via xdotool/scrot
              subprocess.run(["scrot", "/tmp/screenshot.png"])
    
              with open("/tmp/screenshot.png", "rb") as f:
                  img_data = f.read()
    
              # Resize for efficiency
              img = Image.open(io.BytesIO(img_data))
              img = img.resize(self.screen_size, Image.LANCZOS)
    
              buffer = io.BytesIO()
              img.save(buffer, format="PNG")
    
              return {
                  "type": "image",
                  "source": {
                      "type": "base64",
                      "media_type": "image/png",
                      "data": base64.b64encode(buffer.getvalue()).decode()
                  }
              }
    
          elif action == "mouse_move":
              x, y = input.get("coordinate", [0, 0])
              subprocess.run(["xdotool", "mousemove", str(x), str(y)])
              return {"success": True}
    
          elif action == "left_click":
              subprocess.run(["xdotool", "click", "1"])
              return {"success": True}
    
          elif action == "right_click":
              subprocess.run(["xdotool", "click", "3"])
              return {"success": True}
    
          elif action == "double_click":
              subprocess.run(["xdotool", "click", "--repeat", "2", "1"])
              return {"success": True}
    
          elif action == "type":
              text = input.get("text", "")
              # Use xdotool type with delay for reliability
              subprocess.run(["xdotool", "type", "--delay", "50", text])
              return {"success": True}
    
          elif action == "key":
              key = input.get("key", "")
              # Map common key names
              key_map = {
                  "return": "Return",
                  "enter": "Return",
                  "tab": "Tab",
                  "escape": "Escape",
                  "backspace": "BackSpace",
              }
              xdotool_key = key_map.get(key.lower(), key)
              subprocess.run(["xdotool", "key", xdotool_key])
              return {"success": True}
    
          elif action == "scroll":
              direction = input.get("direction", "down")
              amount = input.get("amount", 3)
              button = "5" if direction == "down" else "4"
              for _ in range(amount):
                  subprocess.run(["xdotool", "click", button])
              return {"success": True}
    
          return {"error": f"Unknown action: {action}"}
    
      def _handle_bash(self, input: dict) -> dict:
          """Execute bash command."""
          command = input.get("command", "")
    
          # Security: Sanitize and limit commands
          dangerous_patterns = ["rm -rf", "mkfs", "dd if=", "> /dev/"]
          for pattern in dangerous_patterns:
              if pattern in command:
                  return {"error": "Dangerous command blocked"}
    
          try:
              result = subprocess.run(
                  command,
                  shell=True,
                  capture_output=True,
                  text=True,
                  timeout=30
              )
              return {
                  "stdout": result.stdout[:10000],  # Limit output
                  "stderr": result.stderr[:1000],
                  "returncode": result.returncode
              }
          except subprocess.TimeoutExpired:
              return {"error": "Command timed out"}
    
      def _handle_editor(self, input: dict) -> dict:
          """Handle text editor operations."""
          command = input.get("command")
          path = input.get("path")
    
          if command == "view":
              try:
                  with open(path, "r") as f:
                      content = f.read()
                  return {"content": content[:50000]}  # Limit size
              except Exception as e:
                  return {"error": str(e)}
    
          elif command == "str_replace":
              old_str = input.get("old_str")
              new_str = input.get("new_str")
              try:
                  with open(path, "r") as f:
                      content = f.read()
                  if old_str not in content:
                      return {"error": "old_str not found in file"}
                  content = content.replace(old_str, new_str, 1)
                  with open(path, "w") as f:
                      f.write(content)
                  return {"success": True}
              except Exception as e:
                  return {"error": str(e)}
    
          return {"error": f"Unknown editor command: {command}"}
    
      def run_task(self, task: str, max_steps: int = 50) -> dict:
          """Run computer use task with agentic loop."""
          messages = [{"role": "user", "content": task}]
          tools = self.get_tools()
    
          for step in range(max_steps):
              response = self.client.beta.messages.create(
                  model=self.model,
                  max_tokens=4096,
                  tools=tools,
                  messages=messages,
                  betas=["computer-use-2024-10-22"]
              )
    
              # Check for completion
              if response.stop_reason == "end_turn":
                  return {
                      "success": True,
                      "result": response.content[0].text if response.content else "",
                      "steps": step + 1
                  }
    
              # Handle tool use
              if response.stop_reason == "tool_use":
                  messages.append({"role": "assistant", "content": response.content})
    
                  tool_results = []
                  for block in response.content:
                      if block.type == "tool_use":
                          result = self.execute_tool(block.name, block.input)
                          tool_results.append({
                              "type": "tool_result",
                              "tool_use_id": block.id,
                              "content": result
                          })
    
                  messages.append({"role": "user", "content": tool_results})
    
          return {"success": False, "error": "Max steps reached"}
    

    anti_patterns:

    • "Not using betas=['computer-use-2024-10-22'] flag"
    • "Full resolution screenshots (wasteful)"
    • "No command sanitization for bash tool"
    • "Unbounded execution time"
  • id: browser-use-pattern name: Browser-Use Pattern (Playwright-based) description: | For browser-only automation, using structured DOM access is more efficient than pixel-based computer use. Playwright MCP allows LLMs to control browsers using accessibility snapshots rather than screenshots.

    Advantages over vision-based:

    • Faster: No image processing required
    • Cheaper: Text tokens vs image tokens
    • More precise: Direct element targeting
    • More reliable: No coordinate drift

    When to use vision vs structured:

    • Vision: Desktop apps, complex UIs, visual verification
    • Structured: Web automation, form filling, data extraction when_to_use:
    • "Browser-only automation tasks"
    • "Form filling and web interactions"
    • "When speed and cost matter more than visual understanding" implementation: | from playwright.async_api import async_playwright from dataclasses import dataclass from typing import Optional import asyncio

    @dataclass class BrowserAction: """Structured browser action.""" action: str # click, type, navigate, scroll, extract selector: Optional[str] = None text: Optional[str] = None url: Optional[str] = None

    class BrowserUseAgent: """ Browser automation using Playwright with structured commands. More efficient than pixel-based for web tasks. """

      def __init__(self):
          self.browser = None
          self.page = None
    
      async def start(self, headless: bool = True):
          """Start browser session."""
          self.playwright = await async_playwright().start()
          self.browser = await self.playwright.chromium.launch(headless=headless)
          self.page = await self.browser.new_page()
    
      async def get_page_snapshot(self) -> dict:
          """
          Get structured snapshot of page for LLM.
          Uses accessibility tree for efficiency.
          """
          # Get accessibility tree
          snapshot = await self.page.accessibility.snapshot()
    
          # Get simplified DOM info
          elements = await self.page.evaluate('''() => {
              const interactable = [];
              const selector = 'a, button, input, select, textarea, [role="button"]';
              document.querySelectorAll(selector).forEach((el, i) => {
                  const rect = el.getBoundingClientRect();
                  if (rect.width > 0 && rect.height > 0) {
                      interactable.push({
                          index: i,
                          tag: el.tagName.toLowerCase(),
                          text: el.textContent?.trim().slice(0, 100),
                          type: el.type,
                          placeholder: el.placeholder,
                          name: el.name,
                          id: el.id,
                          class: el.className
                      });
                  }
              });
              return interactable;
          }''')
    
          return {
              "url": self.page.url,
              "title": await self.page.title(),
              "accessibility_tree": snapshot,
              "interactable_elements": elements[:50]  # Limit for token efficiency
          }
    
      async def execute_action(self, action: BrowserAction) -> dict:
          """Execute structured browser action."""
    
          try:
              if action.action == "navigate":
                  await self.page.goto(action.url, wait_until="domcontentloaded")
                  return {"success": True, "url": self.page.url}
    
              elif action.action == "click":
                  await self.page.click(action.selector, timeout=5000)
                  await self.page.wait_for_load_state("networkidle", timeout=5000)
                  return {"success": True}
    
              elif action.action == "type":
                  await self.page.fill(action.selector, action.text)
                  return {"success": True}
    
              elif action.action == "scroll":
                  direction = action.text or "down"
                  distance = 500 if direction == "down" else -500
                  await self.page.evaluate(f"window.scrollBy(0, {distance})")
                  return {"success": True}
    
              elif action.action == "extract":
                  # Extract text content
                  if action.selector:
                      text = await self.page.text_content(action.selector)
                  else:
                      text = await self.page.text_content("body")
                  return {"success": True, "text": text[:5000]}
    
              elif action.action == "screenshot":
                  # Fall back to vision when needed
                  screenshot = await self.page.screenshot(type="png")
                  import base64
                  return {
                      "success": True,
                      "image": base64.b64encode(screenshot).decode()
                  }
    
          except Exception as e:
              return {"success": False, "error": str(e)}
    
          return {"success": False, "error": f"Unknown action: {action.action}"}
    
      async def run_with_llm(self, task: str, llm_client, max_steps: int = 20):
          """
          Run browser task with LLM decision making.
          Uses structured DOM instead of screenshots.
          """
    
          system_prompt = """You are a browser automation agent. You receive
          page snapshots with interactable elements and decide actions.
    
          Respond with JSON action:
          - {"action": "navigate", "url": "https://..."}
          - {"action": "click", "selector": "button.submit"}
          - {"action": "type", "selector": "input[name='email']", "text": "..."}
          - {"action": "scroll", "text": "down"}
          - {"action": "extract", "selector": ".results"}
          - {"action": "done", "result": "task completed"}
    
          Use CSS selectors based on the element info provided.
          Prefer id > name > class > text content for selectors.
          """
    
          messages = []
    
          for step in range(max_steps):
              # Get current page state
              snapshot = await self.get_page_snapshot()
    
              user_message = f"""Task: {task}
    
              Current page:
              URL: {snapshot['url']}
              Title: {snapshot['title']}
    
              Interactable elements:
              {snapshot['interactable_elements']}
    
              What action should I take?"""
    
              messages.append({"role": "user", "content": user_message})
    
              # Get LLM decision
              response = llm_client.messages.create(
                  model="claude-sonnet-4-20250514",
                  max_tokens=1024,
                  system=system_prompt,
                  messages=messages
              )
    
              assistant_text = response.content[0].text
              messages.append({"role": "assistant", "content": assistant_text})
    
              # Parse and execute
              import json
              action_dict = json.loads(assistant_text)
    
              if action_dict.get("action") == "done":
                  return {"success": True, "result": action_dict.get("result")}
    
              action = BrowserAction(**action_dict)
              result = await self.execute_action(action)
    
              if not result.get("success"):
                  messages.append({
                      "role": "user",
                      "content": f"Action failed: {result.get('error')}"
                  })
    
              await asyncio.sleep(0.5)  # Rate limit
    
          return {"success": False, "error": "Max steps reached"}
    
      async def close(self):
          """Clean up browser."""
          if self.browser:
              await self.browser.close()
          if hasattr(self, 'playwright'):
              await self.playwright.stop()
    

    Usage

    async def main(): agent = BrowserUseAgent() await agent.start(headless=False)

      from anthropic import Anthropic
      result = await agent.run_with_llm(
          "Go to weather.com and find the weather for New York",
          Anthropic()
      )
    
      print(result)
      await agent.close()
    

    asyncio.run(main()) anti_patterns:

    • "Using screenshots when DOM access works"
    • "Not waiting for page loads"
    • "Hardcoded selectors that break"
    • "No error recovery for stale elements"
  • id: user-confirmation-pattern name: User Confirmation Pattern description: | For sensitive actions, agents should pause and ask for human confirmation. "ChatGPT agent also pauses and asks for confirmation prior to taking sensitive steps such as completing a purchase."

    Sensitivity levels:

    1. LOW: Navigation, reading (auto-approve)
    2. MEDIUM: Form filling, clicking (log, maybe confirm)
    3. HIGH: Purchases, authentication, file operations (always confirm)
    4. CRITICAL: Credential entry, financial transactions (confirm + review) when_to_use:
    • "Actions with real-world consequences"
    • "Financial transactions"
    • "Authentication flows"
    • "File modifications" implementation: | from enum import Enum from dataclasses import dataclass from typing import Callable, Optional import asyncio

    class ActionSeverity(Enum): LOW = "low" # Auto-approve MEDIUM = "medium" # Log, optional confirm HIGH = "high" # Always confirm CRITICAL = "critical" # Confirm + review details

    @dataclass class SensitiveAction: """Action that may need user confirmation.""" action_type: str description: str severity: ActionSeverity details: dict

    class ConfirmationGate: """ Gate sensitive actions through user confirmation. """

      # Action type -> severity mapping
      ACTION_SEVERITY = {
          # LOW - auto-approve
          "navigate": ActionSeverity.LOW,
          "scroll": ActionSeverity.LOW,
          "read": ActionSeverity.LOW,
          "screenshot": ActionSeverity.LOW,
    
          # MEDIUM - log and maybe confirm
          "click": ActionSeverity.MEDIUM,
          "type": ActionSeverity.MEDIUM,
          "search": ActionSeverity.MEDIUM,
    
          # HIGH - always confirm
          "download": ActionSeverity.HIGH,
          "submit_form": ActionSeverity.HIGH,
          "login": ActionSeverity.HIGH,
          "file_write": ActionSeverity.HIGH,
    
          # CRITICAL - confirm with full review
          "purchase": ActionSeverity.CRITICAL,
          "enter_password": ActionSeverity.CRITICAL,
          "enter_credit_card": ActionSeverity.CRITICAL,
          "send_money": ActionSeverity.CRITICAL,
          "delete": ActionSeverity.CRITICAL,
      }
    
      def __init__(
          self,
          confirm_callback: Callable[[SensitiveAction], bool] = None,
          auto_confirm_low: bool = True,
          auto_confirm_medium: bool = False
      ):
          self.confirm_callback = confirm_callback or self._default_confirm
          self.auto_confirm_low = auto_confirm_low
          self.auto_confirm_medium = auto_confirm_medium
          self.action_log = []
    
      def _default_confirm(self, action: SensitiveAction) -> bool:
          """Default confirmation via CLI prompt."""
          print(f"\n{'='*60}")
          print(f"ACTION CONFIRMATION REQUIRED")
          print(f"{'='*60}")
          print(f"Type: {action.action_type}")
          print(f"Severity: {action.severity.value.upper()}")
          print(f"Description: {action.description}")
          print(f"Details: {action.details}")
          print(f"{'='*60}")
    
          while True:
              response = input("Allow this action? [y/n]: ").lower().strip()
              if response in ['y', 'yes']:
                  return True
              elif response in ['n', 'no']:
                  return False
    
      def classify_action(self, action_type: str, context: dict) -> ActionSeverity:
          """Classify action severity, considering context."""
          base_severity = self.ACTION_SEVERITY.get(action_type, ActionSeverity.MEDIUM)
    
          # Escalate based on context
          if context.get("involves_credentials"):
              return ActionSeverity.CRITICAL
          if context.get("involves_money"):
              return ActionSeverity.CRITICAL
          if context.get("irreversible"):
              return max(base_severity, ActionSeverity.HIGH, key=lambda x: x.value)
    
          return base_severity
    
      def check_action(
          self,
          action_type: str,
          description: str,
          details: dict = None
      ) -> tuple[bool, str]:
          """
          Check if action should proceed.
          Returns (approved, reason).
          """
          details = details or {}
          severity = self.classify_action(action_type, details)
    
          action = SensitiveAction(
              action_type=action_type,
              description=description,
              severity=severity,
              details=details
          )
    
          # Log all actions
          self.action_log.append({
              "action": action,
              "timestamp": __import__('datetime').datetime.now().isoformat()
          })
    
          # Auto-approve low severity
          if severity == ActionSeverity.LOW and self.auto_confirm_low:
              return True, "auto-approved (low severity)"
    
          # Maybe auto-approve medium
          if severity == ActionSeverity.MEDIUM and self.auto_confirm_medium:
              return True, "auto-approved (medium severity)"
    
          # Request confirmation
          approved = self.confirm_callback(action)
    
          if approved:
              return True, "user approved"
          else:
              return False, "user rejected"
    

    class ConfirmedComputerUseAgent: """ Computer use agent with confirmation gates. """

      def __init__(self, base_agent, confirmation_gate: ConfirmationGate):
          self.agent = base_agent
          self.gate = confirmation_gate
    
      def execute_action(self, action: dict) -> dict:
          """Execute action with confirmation check."""
          action_type = action.get("type", "unknown")
    
          # Build description
          if action_type == "click":
              desc = f"Click at ({action.get('x')}, {action.get('y')})"
          elif action_type == "type":
              text = action.get('text', '')
              # Mask if looks like password
              if self._looks_sensitive(text):
                  desc = f"Type sensitive text ({len(text)} chars)"
              else:
                  desc = f"Type: {text[:50]}..."
          else:
              desc = f"Execute: {action_type}"
    
          # Context for severity classification
          context = {
              "involves_credentials": self._looks_sensitive(action.get("text", "")),
              "involves_money": self._mentions_money(action),
          }
    
          # Check with gate
          approved, reason = self.gate.check_action(
              action_type, desc, context
          )
    
          if not approved:
              return {
                  "success": False,
                  "error": f"Action blocked: {reason}",
                  "action": action_type
              }
    
          # Execute if approved
          return self.agent.execute_action(action)
    
      def _looks_sensitive(self, text: str) -> bool:
          """Check if text looks like sensitive data."""
          if not text:
              return False
          # Common patterns
          patterns = [
              r'\b\d{16}\b',  # Credit card
              r'\b\d{3,4}\b.*\b\d{3,4}\b',  # CVV-like
              r'password',
              r'secret',
              r'api.?key',
              r'token'
          ]
          import re
          return any(re.search(p, text.lower()) for p in patterns)
    
      def _mentions_money(self, action: dict) -> bool:
          """Check if action involves money."""
          text = str(action)
          money_patterns = [
              r'\$\d+', r'pay', r'purchase', r'buy', r'checkout',
              r'credit', r'debit', r'invoice', r'payment'
          ]
          import re
          return any(re.search(p, text.lower()) for p in money_patterns)
    

    Usage

    gate = ConfirmationGate( auto_confirm_low=True, auto_confirm_medium=False # Confirm clicks, typing )

    agent = ConfirmedComputerUseAgent(base_agent, gate) result = agent.execute_action({"type": "click", "x": 500, "y": 300}) anti_patterns:

    • "Auto-approving all actions"
    • "Not logging rejected actions"
    • "Showing full passwords in confirmation"
    • "No timeout on confirmation (hangs forever)"
  • id: action-logging-pattern name: Action Logging Pattern description: | All computer use agent actions should be logged for:

    1. Debugging failed automations
    2. Security auditing
    3. Reproducibility
    4. Compliance requirements

    Log format should capture:

    • Timestamp
    • Action type and parameters
    • Screenshot before/after
    • Success/failure status
    • Model reasoning (if available) when_to_use:
    • "Production computer use deployments"
    • "Debugging automation failures"
    • "Security-sensitive environments" implementation: | from dataclasses import dataclass, field from datetime import datetime from typing import Optional, Any import json import os

    @dataclass class ActionLogEntry: """Single action log entry.""" timestamp: datetime action_type: str parameters: dict success: bool error: Optional[str] = None screenshot_before: Optional[str] = None # Path to screenshot screenshot_after: Optional[str] = None model_reasoning: Optional[str] = None duration_ms: Optional[int] = None

      def to_dict(self) -> dict:
          return {
              "timestamp": self.timestamp.isoformat(),
              "action_type": self.action_type,
              "parameters": self._sanitize_params(self.parameters),
              "success": self.success,
              "error": self.error,
              "screenshot_before": self.screenshot_before,
              "screenshot_after": self.screenshot_after,
              "model_reasoning": self.model_reasoning,
              "duration_ms": self.duration_ms
          }
    
      def _sanitize_params(self, params: dict) -> dict:
          """Remove sensitive data from params."""
          sanitized = {}
          sensitive_keys = ['password', 'secret', 'token', 'key', 'credit_card']
    
          for k, v in params.items():
              if any(s in k.lower() for s in sensitive_keys):
                  sanitized[k] = "[REDACTED]"
              elif isinstance(v, str) and len(v) > 100:
                  sanitized[k] = v[:100] + "...[truncated]"
              else:
                  sanitized[k] = v
    
          return sanitized
    

    @dataclass class TaskSession: """A complete task execution session.""" session_id: str task: str start_time: datetime end_time: Optional[datetime] = None actions: list[ActionLogEntry] = field(default_factory=list) success: bool = False final_result: Optional[str] = None

    class ActionLogger: """ Comprehensive action logging for computer use agents. """

      def __init__(self, log_dir: str = "./agent_logs"):
          self.log_dir = log_dir
          self.screenshot_dir = os.path.join(log_dir, "screenshots")
          os.makedirs(self.screenshot_dir, exist_ok=True)
    
          self.current_session: Optional[TaskSession] = None
    
      def start_session(self, task: str) -> str:
          """Start a new task session."""
          import uuid
          session_id = str(uuid.uuid4())[:8]
    
          self.current_session = TaskSession(
              session_id=session_id,
              task=task,
              start_time=datetime.now()
          )
    
          return session_id
    
      def log_action(
          self,
          action_type: str,
          parameters: dict,
          success: bool,
          error: Optional[str] = None,
          screenshot_before: bytes = None,
          screenshot_after: bytes = None,
          model_reasoning: str = None,
          duration_ms: int = None
      ):
          """Log a single action."""
          if not self.current_session:
              raise RuntimeError("No active session")
    
          # Save screenshots if provided
          screenshot_paths = {}
          timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    
          if screenshot_before:
              path = os.path.join(
                  self.screenshot_dir,
                  f"{self.current_session.session_id}_{timestamp_str}_before.png"
              )
              with open(path, "wb") as f:
                  f.write(screenshot_before)
              screenshot_paths["before"] = path
    
          if screenshot_after:
              path = os.path.join(
                  self.screenshot_dir,
                  f"{self.current_session.session_id}_{timestamp_str}_after.png"
              )
              with open(path, "wb") as f:
                  f.write(screenshot_after)
              screenshot_paths["after"] = path
    
          # Create log entry
          entry = ActionLogEntry(
              timestamp=datetime.now(),
              action_type=action_type,
              parameters=parameters,
              success=success,
              error=error,
              screenshot_before=screenshot_paths.get("before"),
              screenshot_after=screenshot_paths.get("after"),
              model_reasoning=model_reasoning,
              duration_ms=duration_ms
          )
    
          self.current_session.actions.append(entry)
    
          # Also append to running log file
          self._append_to_log(entry)
    
      def _append_to_log(self, entry: ActionLogEntry):
          """Append entry to JSONL log file."""
          log_file = os.path.join(
              self.log_dir,
              f"session_{self.current_session.session_id}.jsonl"
          )
    
          with open(log_file, "a") as f:
              f.write(json.dumps(entry.to_dict()) + "\n")
    
      def end_session(self, success: bool, result: str = None):
          """End current session."""
          if not self.current_session:
              return
    
          self.current_session.end_time = datetime.now()
          self.current_session.success = success
          self.current_session.final_result = result
    
          # Write session summary
          summary_file = os.path.join(
              self.log_dir,
              f"session_{self.current_session.session_id}_summary.json"
          )
    
          summary = {
              "session_id": self.current_session.session_id,
              "task": self.current_session.task,
              "start_time": self.current_session.start_time.isoformat(),
              "end_time": self.current_session.end_time.isoformat(),
              "duration_seconds": (
                  self.current_session.end_time -
                  self.current_session.start_time
              ).total_seconds(),
              "total_actions": len(self.current_session.actions),
              "successful_actions": sum(
                  1 for a in self.current_session.actions if a.success
              ),
              "failed_actions": sum(
                  1 for a in self.current_session.actions if not a.success
              ),
              "success": success,
              "final_result": result
          }
    
          with open(summary_file, "w") as f:
              json.dump(summary, f, indent=2)
    
          self.current_session = None
    
      def get_session_replay(self, session_id: str) -> list[dict]:
          """Get all actions from a session for replay/debugging."""
          log_file = os.path.join(self.log_dir, f"session_{session_id}.jsonl")
    
          actions = []
          with open(log_file, "r") as f:
              for line in f:
                  actions.append(json.loads(line))
    
          return actions
    

    Integration with agent

    class LoggedComputerUseAgent: """Computer use agent with comprehensive logging."""

      def __init__(self, base_agent, logger: ActionLogger):
          self.agent = base_agent
          self.logger = logger
    
      def run_task(self, task: str) -> dict:
          """Run task with full logging."""
          session_id = self.logger.start_session(task)
    
          try:
              result = self._run_with_logging(task)
              self.logger.end_session(
                  success=result.get("success", False),
                  result=result.get("result")
              )
              return result
          except Exception as e:
              self.logger.end_session(success=False, result=str(e))
              raise
    
      def _run_with_logging(self, task: str) -> dict:
          """Internal run with action logging."""
          # This would wrap the base agent's run method
          # and log each action
          pass
    

    anti_patterns:

    • "Not sanitizing sensitive data in logs"
    • "Storing screenshots indefinitely (storage costs)"
    • "Not rotating log files"
    • "Logging synchronously (blocks agent)"

============================================================================

DECISION FRAMEWORK

============================================================================

decision_framework: computer_use_type: question: "What type of computer use do you need?" options: full_desktop: when: "Need to control any desktop application" use: "Anthropic Computer Use or ScreenAgent" cost: "High (vision tokens)" reliability: "Medium (UI variance)"

  browser_only:
    when: "Only need web automation"
    use: "Browser-Use/Playwright MCP"
    cost: "Low (text tokens)"
    reliability: "High (structured DOM)"

  hybrid:
    when: "Mostly web, occasional desktop"
    use: "Browser-Use with Computer Use fallback"
    cost: "Medium"
    reliability: "Medium-High"

security_level: question: "What security level is required?" options: development: sandboxing: "Docker with loose restrictions" confirmation: "Low actions auto-approved" logging: "Minimal"

  production:
    sandboxing: "Docker with seccomp, capability drops"
    confirmation: "All actions logged, high confirmed"
    logging: "Full with screenshots"

  high_security:
    sandboxing: "Isolated VM, no network"
    confirmation: "All actions human-confirmed"
    logging: "Full audit trail with retention"

============================================================================

HANDOFFS

============================================================================

handoffs:

  • to: security-specialist when: "Deploying computer use to production" context: "Need security review of sandboxing, prompt injection defenses"

  • to: devops when: "Scaling computer use infrastructure" context: "Container orchestration, resource management"

  • to: browser-automation when: "Primarily web automation" context: "Playwright/Selenium patterns more appropriate"

  • to: autonomous-agents when: "Complex multi-step agent workflows" context: "Agent architecture patterns, state management"

============================================================================

QUICK REFERENCE

============================================================================

quick_reference: tools_comparison: anthropic_computer_use: type: "Full desktop" model: "Claude Sonnet/Opus" strengths: "Best vision understanding, full OS access" weaknesses: "Slow, expensive, tricky UI elements"

openai_operator:
  type: "Browser only"
  model: "GPT-4"
  strengths: "Good for web tasks"
  weaknesses: "$200/month tier, limited to browser"

playwright_mcp:
  type: "Browser structured"
  model: "Any LLM"
  strengths: "Fast, cheap, reliable"
  weaknesses: "Web only, no visual understanding"

screenagent:
  type: "Full desktop"
  model: "Open source VLMs"
  strengths: "Self-hostable, customizable"
  weaknesses: "Lower quality than commercial"

security_checklist: - "Agent runs in Docker container" - "Network restricted to necessary domains" - "Filesystem is read-only or scoped" - "No host credentials accessible" - "Resource limits (CPU, memory, time)" - "Confirmation required for sensitive actions" - "All actions logged with timestamps" - "Regular security audits"

cost_estimation: anthropic_per_task: "$0.50-$2.00 (screenshots in context)" browser_use_per_task: "$0.05-$0.20 (text only)" infrastructure: "$50-200/month (cloud containers)"

performance_expectations: simple_task: "30 seconds - 2 minutes" complex_task: "5-15 minutes" reliability: "70-90% success rate" human_compared: "2-5x slower than human"