Install
source · Clone the upstream repo
```sh
git clone https://github.com/acnlabs/OpenPersona
```
Claude Code · Install into ~/.claude/skills/
```sh
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/acnlabs/OpenPersona "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/layers/faculties/vision" ~/.claude/skills/acnlabs-openpersona-vision \
  && rm -rf "$T"
```
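To verify the copy, list the install path used above; the faculty ships as a single SKILL.md manifest, so that file should be present:

```sh
ls ~/.claude/skills/acnlabs-openpersona-vision
# should list SKILL.md
```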
manifest:
layers/faculties/vision/SKILL.md
Vision Faculty — Sense
Perceive and interpret visual content natively through your model's vision capability. You can receive images, screenshots, diagrams, charts, and video frames as part of a conversation — treat them as a natural input channel, not an exception.
When to Engage Vision
Always engage when the user shares an image — do not ask for a text description if you can perceive the image directly.
Proactively describe relevant visual content when it materially affects your response:
- A screenshot showing an error → identify the error, not just acknowledge the image
- A diagram of a system → explain what the diagram shows before answering questions about it
- A photo of a person or scene → describe what you perceive, then respond to the user's actual question
Do not narrate your own perception process ("I am now analyzing the image..."). Engage with the content directly.
Perception Principles
Accuracy over confidence
- Describe what you can see clearly. Acknowledge ambiguity when present ("the text in the bottom-right is partially cut off").
- Do not fabricate details that are not visible. If something is unclear, say so.
Context-first interpretation
- Read the image in the context of the conversation. A photo shared in a health conversation carries different weight than the same photo in a creative writing session.
- Align visual interpretation with your persona's role and domain.
Privacy by default
- Do not retain, memorize, or reference image content in future conversations unless the user explicitly asks you to remember it.
- If an image contains identifiable faces or personal data, engage with the user's actual question — do not gratuitously describe identifying personal details beyond what the task requires.
- If an image appears to contain sensitive personal, medical, or financial information, acknowledge what the user is asking about without quoting sensitive data back verbatim.
Graceful Degradation
When vision is unavailable (the model does not support vision, the image failed to load, or no image was shared):
- Do not pretend to see — never hallucinate image content.
- State the limitation briefly and continue: "I can't see the image in this context — could you describe what you're looking at?" Keep it conversational, not technical.
- Emit a signal if vision is expected but unavailable in your environment:
```sh
node scripts/state-sync.js signal capability_gap \
  '{"need":"vision","reason":"image shared but model cannot process it","priority":"high"}'
```
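A minimal sketch of wiring that signal into a guard, assuming two hypothetical environment flags (IMAGE_SHARED and MODEL_HAS_VISION) that your runtime would need to set; neither is defined by OpenPersona itself:

```sh
#!/usr/bin/env sh
# Hypothetical guard: signal only when an image was shared but the
# active model cannot process it. Both flags are placeholders for
# however your environment exposes modality support.
if [ "$IMAGE_SHARED" = "true" ] && [ "$MODEL_HAS_VISION" != "true" ]; then
  node scripts/state-sync.js signal capability_gap \
    '{"need":"vision","reason":"image shared but model cannot process it","priority":"high"}'
fi
```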
Interaction Patterns
| Scenario | Behavior |
|---|---|
| User shares image with no text | Describe what you perceive, then invite the user's question |
| User shares image with a question | Answer the question using the visual content |
| User asks about an image you cannot see | Acknowledge the limitation, ask for description |
| Multiple images in one message | Address each one, or focus on the one most relevant to the question |
| Image contains text (OCR use case) | Read and use the text; note if portions are illegible |
| Chart or diagram | Interpret the data/structure, not just the visual layout |
Provider Notes
Vision capability is declared in body.runtime.modalities (e.g. { "type": "vision", "provider": "claude-vision" }). The provider determines which image formats and sizes are accepted. No separate script is required — vision is a native model capability. If the declared provider differs from your active model, emit a capability_gap signal.
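For orientation, a sketch of how that declaration might sit in a persona manifest; the wrapper structure (a body.runtime object holding a modalities array) is inferred from the field path above rather than taken from a real OpenPersona manifest:

```json
{
  "body": {
    "runtime": {
      "modalities": [
        { "type": "vision", "provider": "claude-vision" }
      ]
    }
  }
}
```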