OpenPersona persona-knowledge
Persistent, incremental, searchable persona knowledge base. Ingests data from Obsidian vaults, chat exports, X/Twitter archives, and more into a MemPalace-backed store with a Karpathy LLM Wiki knowledge layer. Exports training/ directories for persona-model-trainer.
git clone https://github.com/acnlabs/OpenPersona
T=$(mktemp -d) && git clone --depth=1 https://github.com/acnlabs/OpenPersona "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/persona-knowledge" ~/.claude/skills/acnlabs-openpersona-persona-knowledge && rm -rf "$T"
skills/persona-knowledge/SKILL.mdpersona-knowledge
Persistent, incremental, searchable persona knowledge base — the data layer between raw sources and persona training.
Architecture: MemPalace (storage + search) + Knowledge Graph (relationships + timeline) + Karpathy LLM Wiki (knowledge accumulation)
Dependency chain:
data sources → persona-knowledge → anyone-skill / persona-model-trainer
When to use this skill
Trigger phrases:
- "create a dataset for this persona"
- "add data to the dataset"
- "import my Obsidian vault"
- "import my Twitter archive"
- "build a knowledge base for X"
- "export training data"
- "search the persona dataset"
Not suitable when:
- User wants a quick one-shot distillation without persistent storage (use anyone-skill alone)
- User only has < 50 messages of data (too little to warrant a dataset)
Phase 1: Init
Create a new persona dataset:
python scripts/init_knowledge.py --slug {slug} --name "Display Name"
This creates
~/.openpersona/knowledge/{slug}/ with:
~/.openpersona/knowledge/{slug}/ dataset.json # metadata: slug, name, created_at, stats .mempalace/ # MemPalace local data (per-dataset isolation via palace_path) palace/ # MemPalace internal store (ChromaDB + KG) sources/ # immutable source file backups .source-index.json # per-file metadata: hash, import time, line count, PII flags wiki/ # Karpathy wiki (LLM-maintained derived artifact) _schema.md # wiki maintenance rules identity.md voice.md values.md thinking.md relationships.md # generated from Knowledge Graph timeline.md # generated from Knowledge Graph _contradictions.md _changelog.md _evidence.md
MemPalace palace structure:
- One Wing per persona (named by slug)
- Halls mapped to 5 persona dimensions:
— Identity (background, career, education)hall_facts
— Memory (key events, turning points)hall_events
— Personality (values, preferences, boundaries)hall_preferences
— Procedure (mental models, decision heuristics)hall_discoveries
— Interaction (vocabulary, rhythm, humor, emotional temperature)hall_voice
Gate: Confirm slug and display name with the user before proceeding.
Phase 2: Ingest
Import data sources into the dataset. Can be called multiple times for incremental ingestion.
python scripts/ingest.py --slug {slug} --source <path> [--adapter <name>] [--since <date>]
Three adapters cover all supported formats:
| Source | Adapter | Detection |
|---|---|---|
| Obsidian vault | | Directory containing or files |
| GBrain export dir | | Markdown directory with sidecar dirs |
/ / / | | File extension |
/ | | File extension |
| GBrain JSON export | | with key or flag |
WhatsApp export | | Matches WhatsApp timestamp pattern |
Telegram | | JSON with key |
| Signal export | | JSON with Signal message format |
iMessage | | SQLite with + tables |
| X (Twitter) archive | | Directory containing |
| Instagram archive | | Directory containing |
Ingest pipeline (per source):
- Parse — adapter converts source to unified
[{role, content, timestamp, source_file, source_type}] - PII scan — flag SSN, credit card, email, password patterns
- Hash dedup — SHA-256 content hash, skip already-ingested entries
- Write sources/ — save parsed data as JSONL backup (immutable, one file per source)
- Store in MemPalace — verbatim text into ChromaDB via palace wing/hall structure
- Extract KG triples — detect entities and relationships, write to Knowledge Graph with temporal validity
- Report — print source name, message count, assistant turns, PII flags, new KG entities
After each source is ingested, report:
✅ whatsapp-2024.txt → 1,247 messages (892 assistant turns) PII: none detected KG: +3 entities, +7 relationships → sources/whatsapp-2024.jsonl
Phase 3: Wiki Build (agent task — not a script)
After ingesting new data, the agent reads MemPalace content and Knowledge Graph relationships, then builds or updates the wiki pages following the Karpathy LLM Wiki pattern.
This phase is driven by agent intelligence (SKILL.md instructions), not by automated scripts. The LLM decides which pages to update, how to phrase entries, and how to tag evidence.
Ingest operation (after each Phase 2 run)
- Read new data from MemPalace (search the wing for recently added entries)
- For each relevant wiki page, check if the new data adds, contradicts, or refines existing content
- Update 5-15 wiki pages with new information, using evidence tags:
— direct quote, traceable[L1:source]
— reported/paraphrased, verifiable[L2]
— reasonably inferred from multiple signals[L3:inferred]
— impression-based[L4:inspired]
- Add backlinks between related pages using
wikilink syntax[[page]] - Record contradictions in
with both sides cited_contradictions.md - Append entry to
_changelog.md - Update counts in
_evidence.md
Query operation
When the user asks a question about the persona:
- Search MemPalace semantically for relevant memories
- Navigate wiki pages for structured knowledge
- Synthesize an answer
- If the query reveals new insights, write them back to the appropriate wiki page
Lint operation
Run periodically or before export:
python scripts/lint_wiki.py --slug {slug}
Checks:
- Broken
(referenced page doesn't exist)[[links]] - Empty pages (created but never populated)
- Contradictions without resolution notes
- Evidence coverage (pages with < 2 evidence tags)
- Source coverage (MemPalace entries not reflected in any wiki page)
Wiki page structure (see references/wiki-schema.md
for full spec)
references/wiki-schema.mdEach page follows this template:
# {Page Title} > One-sentence summary of this page's scope. ## Content {Structured content with [L?:source] evidence tags and [[backlinks]]} ## Sources - {source_file}: {what was extracted} [L?] ## See also - [[related_page]]
Knowledge Graph–driven pages
relationships.md and timeline.md are generated from the Knowledge Graph, not written freehand:
from mempalace.knowledge_graph import KnowledgeGraph kg = KnowledgeGraph(palace_path) kg.timeline(slug) # → chronological event list for timeline.md kg.query_entity(slug) # → current relationships for relationships.md
After generating, the agent may annotate with evidence tags and additional context.
Phase 4: Export
Generate a
training/ directory compatible with persona-model-trainer:
python scripts/export_training.py --slug {slug} --output training/
Each export is automatically versioned (
v1, v2, …). Override with --version:
python scripts/export_training.py --slug {slug} --output training/ --version v3
List export history:
python scripts/export_training.py --slug {slug} --list # v1 2026-04-01 10:00 142 turns sha256:a3f9c2d1 3 sources # v2 2026-04-10 14:22 198 turns sha256:c7d2e1f3 4 sources
Output:
training/ raw/ # copied from sources/ (authentic voice, unmodified) conversations.jsonl # generated from wiki pages (structured Q-A pairs) profile.md # summarized from wiki identity/voice/values metadata.json # slug, source count, turn count, export version + hash
How each file is built:
— direct copy oftraining/raw/
andsources/*.jsonl
filessources/*.txt
— the agent reads wiki pages and generates distilled user/assistant turn pairs representing the persona's voice, knowledge, and valuestraining/conversations.jsonl
— 300-500 word character sheet derived fromtraining/profile.md
,identity.md
,voice.mdvalues.md
—training/metadata.json
,slug
,name
,subject_type
,created_at
,source_count
,total_words
,distilled_turns
+ versioning fields:raw_files
— version tag (e.g.export_version
)"v2"
— SHA-256 ofexport_hash
(e.g.conversations.jsonl
)"sha256:c7d2e1f3..."
—source_snapshot
dict of all source files at export time{filename: sha256_hash}
Export history is appended to
dataset.json → export_history[] after each run.
Downstream traceability:
persona-model-trainer's pipeline.sh reads export_version and export_hash from metadata.json and injects them as dataset_version / dataset_export_hash into training_summary.json, forming a complete provenance chain from source data to trained model adapter.
This output is directly consumable by
persona-model-trainer's prepare_data.py — no changes needed downstream.
→ Next step — train a local persona model:
bash skills/persona-model-trainer/scripts/pipeline.sh \ --slug {slug} \ --model google/gemma-4-E4B-it \ --source ./training \ --method mlx \ # or: unsloth (NVIDIA GPU) / colab (no GPU) --preset gemma4 \ --probes ./training/probes.json
Full guide:
persona-model-trainer/references/pipeline-guide.md
Phase 5: Search
Query the dataset using MemPalace's semantic search and Knowledge Graph:
# Semantic search across all stored memories mempalace search "how does this person handle conflict" --wing {slug} # Knowledge Graph: look up an entity's relationships python scripts/query_kg.py --slug {slug} --entity "Tom" # Knowledge Graph: shortest path between two entities python scripts/query_kg.py --slug {slug} --path "Tom" "Alice" # Knowledge Graph: overall statistics python scripts/query_kg.py --slug {slug} --stats # Knowledge Graph: JSON output (for programmatic use) python scripts/query_kg.py --slug {slug} --entity "Tom" --json # Wake-up summary (~170 tokens) mempalace wake-up --wing {slug}
The agent can also search programmatically during wiki build or distillation:
from mempalace.searcher import search_memories results = search_memories("vocabulary patterns", palace_path="~/.openpersona/knowledge/{slug}/.mempalace/palace")
Phase 6: Maintain
Ongoing dataset management:
- Add new source: run Phase 2 (Ingest) again with new files → triggers wiki update
- Remove source: delete from
+ re-index → run wiki lint to flag orphaned contentsources/ - Wiki lint:
— health checkpython scripts/lint_wiki.py --slug {slug} - Dataset stats:
— show current statspython scripts/init_knowledge.py --slug {slug} --stats - List datasets:
— all available datasetsls ~/.openpersona/knowledge/
Tools
| Tool | Purpose |
|---|---|
| Run init, ingest, export, lint scripts; MemPalace CLI commands |
| Load source files, wiki pages, dataset.json |
| Update wiki pages, write training exports |
| Fetch public figure data for ingestion |
Scripts
| Script | Purpose |
|---|---|
| Initialize knowledge directory + MemPalace wing + KG |
| Unified ingestion: adapter dispatch + PII scan + dedup + MemPalace + KG |
| Export sources/ + wiki → training/ directory |
| Wiki health check: broken links, contradictions, coverage gaps |
| Knowledge Graph query: entity lookup, shortest path, statistics |
Adapters
| Adapter | Sources | Format |
|---|---|---|
| Obsidian vault, GBrain export, .md, .txt, .csv, .pdf, .jsonl, .json | All pure file reading |
| WhatsApp / Telegram / Signal / iMessage | .txt / JSON / SQLite (special parsing) |
| X (Twitter) / Instagram archive | JS wrapper stripping + archive dirs |
References
— Karpathy wiki structure specification and maintenance rulesreferences/wiki-schema.md
— supported data source formats and adapter detailsreferences/source-formats.md