Claude-skill-registry agent-safety

Ensure agent safety - guardrails, content filtering, monitoring, and compliance

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/agent-safety" ~/.claude/skills/majiayu000-claude-skill-registry-agent-safety && rm -rf "$T"
manifest: skills/data/agent-safety/SKILL.md

Agent Safety

Implement safety systems for responsible AI agent deployment.

When to Use This Skill

Invoke this skill when:

  • Adding input/output guardrails
  • Implementing content filtering
  • Setting up rate limiting
  • Ensuring compliance (GDPR, SOC 2)

Parameter Schema

Parameter   Type    Required  Description                   Default
task        string  Yes       Safety goal                   -
risk_level  enum    No        strict, moderate, permissive  strict
filters     list    No        Filter types to enable        ["injection", "pii", "toxicity"]
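The parameters above could be validated before the skill runs. The following is a minimal sketch, not part of the skill itself: the `SafetyParams` class name is hypothetical, and it assumes the three default filter names are the only valid ones.

```python
from dataclasses import dataclass, field

RISK_LEVELS = ("strict", "moderate", "permissive")
KNOWN_FILTERS = ("injection", "pii", "toxicity")  # assumption: the full set


@dataclass
class SafetyParams:
    """Hypothetical container mirroring the parameter schema."""
    task: str                      # required: the safety goal
    risk_level: str = "strict"     # default taken from the schema
    filters: list = field(
        default_factory=lambda: ["injection", "pii", "toxicity"]
    )

    def __post_init__(self):
        # Reject bad values early instead of silently running with no protection.
        if not self.task:
            raise ValueError("task is required")
        if self.risk_level not in RISK_LEVELS:
            raise ValueError(f"unknown risk_level: {self.risk_level!r}")
        unknown = [f for f in self.filters if f not in KNOWN_FILTERS]
        if unknown:
            raise ValueError(f"unknown filters: {unknown}")
```

Constructing `SafetyParams(task="block prompt injection")` picks up the documented defaults, while an unrecognized `risk_level` raises `ValueError`.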

Quick Start

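# Assumes a recent Guardrails AI release with the ToxicLanguage and DetectPII
# validators installed from the Guardrails Hub; names may differ by version.
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage

guard = Guard().use_many(
    ToxicLanguage(threshold=0.8, on_fail="exception"),
    # The entity list is an illustrative choice, not part of the original skill.
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
)

# Validate output (llm_response is the text returned by your model)
outcome = guard.validate(llm_response)
validated = outcome.validated_output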

Guardrail Types

Input Guardrails

# Prompt injection detection
INJECTION_PATTERNS = [
    r"ignore (previous|all) instructions",
    r"you are now",
    r"forget everything"
]
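The pattern list above can be applied with the standard `re` module. This is a minimal sketch; the `find_injection` helper is a hypothetical name, and case-insensitive matching is an assumption about the intended behavior.

```python
import re

# Patterns copied from the list above; a real deployment would need many more.
INJECTION_PATTERNS = [
    r"ignore (previous|all) instructions",
    r"you are now",
    r"forget everything",
]

# Compile once, case-insensitively, so "IGNORE ALL INSTRUCTIONS" also matches.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in INJECTION_PATTERNS]


def find_injection(text: str) -> list[str]:
    """Return the patterns that match `text` (an empty list means no match)."""
    return [p.pattern for p in _COMPILED if p.search(text)]
```

Regex matching of this kind is easy to evade: the input "ignore all previous instructions" matches none of the three patterns as written, which is one reason to pair it with a second detection layer.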

Output Guardrails

# Content filtering
filters = [
    ToxicityFilter(),
    PIIRedactor(),
    HallucinationDetector()
]
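The filter classes above are not defined in this skill. As a rough sketch of how such a chain might work, the following uses two hypothetical stand-ins: a regex-based `PIIRedactor` and a word-list `BlocklistFilter` in place of a real toxicity model. The `apply` method and `run_filters` helper are assumptions made for illustration.

```python
import re


class PIIRedactor:
    """Hypothetical filter: masks email addresses and US-style phone numbers."""
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
    PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

    def apply(self, text: str) -> str:
        text = self.EMAIL.sub("[EMAIL]", text)
        return self.PHONE.sub("[PHONE]", text)


class BlocklistFilter:
    """Hypothetical stand-in for a toxicity model: rejects listed words."""

    def __init__(self, blocked):
        self.blocked = {w.lower() for w in blocked}

    def apply(self, text: str) -> str:
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & self.blocked:
            raise ValueError("output blocked by content filter")
        return text


def run_filters(text: str, filters) -> str:
    """Pass text through each filter in order; any filter may raise."""
    for f in filters:
        text = f.apply(text)
    return text
```

Running the blocking filter before the redactor means rejected output is never processed further; filters that rewrite text pass their result on to the next one.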

Rate Limiting

class RateLimiter:
    def __init__(self, rpm=60, tpm=100000):
        self.rpm = rpm
        self.tpm = tpm

    def check(self, user_id, tokens):
        # Token bucket algorithm
        pass
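One way to fill in the `check` stub above is sketched below. It keeps two buckets per user (requests and tokens), each refilled continuously at its per-minute rate. The class name `TokenBucketLimiter`, the `clock` parameter, and the in-memory `state` dict are assumptions for illustration; a multi-process deployment would need shared storage.

```python
import time


class TokenBucketLimiter:
    """Per-user limiter with two buckets: requests/minute and tokens/minute."""

    def __init__(self, rpm=60, tpm=100000, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock   # injectable so tests can control time
        self.state = {}      # user_id -> (requests_left, tokens_left, last_seen)

    def check(self, user_id, tokens):
        """Return True and consume capacity if the request fits, else False."""
        now = self.clock()
        req, tok, last = self.state.get(
            user_id, (float(self.rpm), float(self.tpm), now)
        )
        elapsed = now - last
        # Refill both buckets in proportion to elapsed time, capped at capacity.
        req = min(self.rpm, req + elapsed * self.rpm / 60.0)
        tok = min(self.tpm, tok + elapsed * self.tpm / 60.0)
        allowed = req >= 1 and tok >= tokens
        if allowed:
            req -= 1
            tok -= tokens
        self.state[user_id] = (req, tok, now)
        return allowed
```

A rejected request consumes nothing, so a caller that backs off and retries later is not penalized for the failed attempt.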

Troubleshooting

Issue             Solution
False positives   Tune thresholds
Injection bypass  Add LLM-based detection
PII leakage       Add secondary validation
Performance hit   Cache filter results
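For the "cache filter results" suggestion, a minimal sketch using the standard library's `functools.lru_cache` is shown here. The `is_suspicious` function is a hypothetical stand-in for a slow filter, and the `CALLS` counter exists only to show that repeated inputs skip re-evaluation.

```python
import re
from functools import lru_cache

CALLS = {"count": 0}  # counts real evaluations, for demonstration only


@lru_cache(maxsize=4096)
def is_suspicious(text: str) -> bool:
    """Hypothetical expensive check; results are memoized per exact input."""
    CALLS["count"] += 1
    pattern = r"ignore (previous|all) instructions"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

This only helps when identical text recurs, and the cache keeps raw inputs in memory, which may matter if those inputs can contain PII.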

Best Practices

  • Defense in depth (multiple layers)
  • Fail-safe defaults (deny by default)
  • Audit everything
  • Regular red team testing
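The first three practices can be combined in one small wrapper. This is a sketch under stated assumptions: `guarded_call`, the logger name `agent.audit`, and the convention that each check returns a truthy value on success are all hypothetical.

```python
import json
import logging

audit_log = logging.getLogger("agent.audit")  # hypothetical logger name


def guarded_call(user_input, checks, handler):
    """Run every check before calling handler; deny on any failure or error."""
    for check in checks:
        try:
            ok = check(user_input)
        except Exception as exc:  # a crashing check counts as a denial
            ok = False
            audit_log.warning(
                json.dumps({"check": check.__name__, "error": str(exc)})
            )
        # Record the outcome of every check, pass or fail.
        audit_log.info(json.dumps({"check": check.__name__, "passed": bool(ok)}))
        if not ok:
            return None  # fail-safe default: nothing gets through
    return handler(user_input)
```

The audit records contain only check names and outcomes, not the input itself, so the log does not become a second place where sensitive text is stored.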

Compliance Checklist

  • Input validation active
  • Output filtering enabled
  • Audit logging configured
  • Rate limits set
  • PII handling compliant

Related Skills

  • tool-calling - Input validation
  • llm-integration - API security
  • multi-agent - Per-agent permissions
