Claude-skill-registry constitutional-ai-alignment
A framework for aligning AI agents to be helpful, harmless, and honest using a principles-based critique loop. Use this when you need to define an agent's personality, establish safety guardrails for high-risk domains (legal, medical, bio), or reduce "sycophancy" (the model simply agreeing with the user).
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/constitutional-ai-alignment" ~/.claude/skills/majiayu000-claude-skill-registry-constitutional-ai-alignment && rm -rf "$T"
skills/data/constitutional-ai-alignment/SKILL.mdConstitutional AI Alignment
Constitutional AI is a method to move beyond simple "human feedback" (which can be biased or inconsistent) toward a principled approach where the model aligns itself to a written "Constitution." This process ensures the AI understands the intent behind rules rather than just following surface-level instructions.
The Alignment Process
1. Define the Constitution
Create a list of natural language principles that represent your desired values. Instead of guessing what a model should do, use established frameworks as your source material.
- Global Standards: Reference the UN Declaration of Human Rights.
- Industry Standards: Use Apple’s Privacy Terms of Service or specific medical ethics codes.
- Custom Principles: Explicitly define "helpful, honest, and harmless" behaviors (e.g., "The agent should never prioritize user engagement over factual accuracy").
2. The Critique-and-Revision Loop
Operationalize these principles by forcing the model to evaluate its own performance before delivering a final result.
- Initial Output: Generate a response to a prompt.
- Principle Mapping: Identify which constitutional principles apply to this specific prompt.
- Critique: Ask the model: "Does this response abide by [Principle X]? If not, what are the specific flaws?"
- Revision: Ask the model: "Rewrite the response to address the flaws identified in the critique while maintaining the helpfulness of the original."
- Finalization: Deliver only the revised response, removing the internal "critique" logic.
3. Handle Stochastic Failure (The "Try 3 Times" Rule)
AI models are stochastic; they may fail to align on the first attempt even with a critique loop.
- If a high-stakes task fails, do not just tweak the prompt.
- Restart the process from scratch.
- If the model hits a wall, provide "negative examples" of its previous failed attempts as part of the critique phase ("You tried [X] and it failed because [Y]. Try a different approach").
4. Optimize for "Transformative" Capability
Evaluate your agent using the Economic Turing Test:
- Contract the agent for a specific job (e.g., data analysis, redlining a document).
- If the output is indistinguishable from a human expert hired for the same period, the alignment is successful.
- Focus on "ambitious changes" (e.g., asking for a full architectural rewrite) rather than simple autocompletes.
Examples
Example 1: Legal Document Review
- Context: An AI agent is tasked with redlining a contract for a procurement team.
- Constitutional Principle: "Privacy: Do not expose third-party credentials or sensitive financial data found in the context."
- Critique Loop:
- Initial Output: Redlines the contract but leaves a developer’s API key in a comment.
- Critique: "The response violates the Privacy principle by exposing a credential."
- Revision: Removes the API key and replaces it with a placeholder
.<SENSITIVE_DATA_REMOVED>
- Output: A safe, redlined document ready for legal review.
Example 2: Customer Service in Medical Tech
- Context: A user asks a health-tracking bot for a specific prescription dosage.
- Constitutional Principle: "Harmlessness: Do not provide specific medical prescriptions; redirect to professionals."
- Critique Loop:
- Initial Output: "The standard dose for [Medicine] is 50mg."
- Critique: "This violates Harmlessness by providing a specific dosage."
- Revision: "I cannot provide specific dosage instructions. You should consult a medical professional for prescription advice."
- Output: A firm but helpful refusal that maintains user trust.
Common Pitfalls
- Sycophancy (The "Yes-Man" Problem): Training models solely on "User Liked This" metrics leads to models that lie to please the user. Always include an "Honesty" principle that outweighs "User Satisfaction."
- The "Monkey Paw" Scenario: Defining a goal without principles leads the AI to take the shortest, most dangerous path to that goal. Always define how the AI should achieve the result, not just the result itself.
- Vague Principles: Principles like "be nice" are too subjective. Use specific instructions like "When refusing a request, explain the safety reason why instead of giving a generic 'I can't do that' response."
- Ignoring the Exponential: Building for today’s model capabilities. If a task works 20% of the time today, assume it will work 100% of the time in 6 months and build the infrastructure for that 100% success rate now.