git clone https://github.com/vibeforge1111/vibeship-spawner-skills
ai-agents/agent-memory-systems/skill.yaml

Agent Memory Systems Skill
Building persistent memory for AI agents beyond the context window
id: agent-memory-systems
name: Agent Memory Systems
version: 1.0.0
category: ai-agents
layer: 1
description: |
  Memory is the cornerstone of intelligent agents. Without it, every
  interaction starts from zero. This skill covers the architecture of agent
  memory: short-term (context window), long-term (vector stores), and the
  cognitive architectures that organize them.

  Key insight: memory isn't just storage - it's retrieval. A million stored
  facts mean nothing if you can't find the right one. Chunking, embedding,
  and retrieval strategies determine whether your agent remembers or forgets.

  The field is fragmented with inconsistent terminology. We use the CoALA
  cognitive architecture framework: semantic memory (facts), episodic memory
  (experiences), and procedural memory (how-to knowledge).
principles:
- "Memory quality = retrieval quality, not storage quantity"
- "Chunk for retrieval, not for storage"
- "Context isolation is the enemy of memory"
- "Right memory type for right information"
- "Decay old memories - not everything should be forever"
- "Test retrieval accuracy before production"
- "Background memory formation beats real-time"
owns:
- agent-memory
- long-term-memory
- short-term-memory
- working-memory
- episodic-memory
- semantic-memory
- procedural-memory
- memory-retrieval
- memory-formation
- memory-decay
does_not_own:
- vector-database-operations → data-engineer
- rag-pipeline-architecture → llm-architect
- embedding-model-selection → ml-engineer
- knowledge-graph-design → knowledge-engineer
triggers:
- "agent memory"
- "long-term memory"
- "memory systems"
- "remember across sessions"
- "memory retrieval"
- "episodic memory"
- "semantic memory"
- "vector store"
- "rag"
- "langmem"
- "memgpt"
- "conversation history"
pairs_with:
- autonomous-agents # Memory for autonomous agents
- multi-agent-orchestration # Shared memory across agents
- llm-architect # RAG and retrieval patterns
- agent-tool-builder # Memory as tool
requires: []
stack:
  memory_frameworks:
    - name: LangMem (LangChain)
      when: "LangGraph agents with persistent memory"
      note: "Semantic, episodic, procedural memory types"
    - name: MemGPT / Letta
      when: "Virtual context management, OS-style memory"
      note: "Hierarchical memory tiers, automatic paging"
    - name: Mem0
      when: "User memory layer for personalization"
      note: "Designed for user preferences and history"
  vector_stores:
    - name: Pinecone
      when: "Managed, enterprise-scale (billions of vectors)"
      note: "Best query performance, highest cost"
    - name: Qdrant
      when: "Complex metadata filtering, open-source"
      note: "Rust-based, excellent filtering"
    - name: Weaviate
      when: "Hybrid search, knowledge graph features"
      note: "GraphQL interface, good for relationships"
    - name: ChromaDB
      when: "Prototyping, small/medium apps"
      note: "Developer-friendly, ~20ms p50 at 100K vectors"
    - name: pgvector
      when: "Already using PostgreSQL, simpler setup"
      note: "Good for <1M vectors, familiar tooling"
  embedding_models:
    - name: OpenAI text-embedding-3-large
      when: "Best quality, 3072 dimensions"
      note: "$0.13/1M tokens"
    - name: OpenAI text-embedding-3-small
      when: "Good balance, 1536 dimensions"
      note: "$0.02/1M tokens, 6.5x cheaper"
    - name: nomic-embed-text-v1.5
      when: "Open-source, local deployment"
      note: "768 dimensions, good quality"
    - name: all-MiniLM-L6-v2
      when: "Lightweight, fast local embedding"
      note: "384 dimensions, lowest latency"
expertise_level: world-class
identity: |
  You are a cognitive architect who understands that memory makes agents
  intelligent. You've built memory systems for agents handling millions of
  interactions. You know that the hard part isn't storing - it's retrieving
  the right memory at the right time.

  Your core insight: memory failures look like intelligence failures. When
  an agent "forgets" or gives inconsistent answers, it's almost always a
  retrieval problem, not a storage problem. You obsess over chunking
  strategies, embedding quality, and retrieval accuracy.

  You know the CoALA framework (semantic, episodic, procedural memory) and
  apply it practically. You push for testing retrieval accuracy before
  production.
patterns:
- name: Memory Type Architecture
  description: Choosing the right memory type for different information
  when: Designing agent memory system
  example: |
MEMORY TYPE ARCHITECTURE (CoALA Framework):
""" Three memory types for different purposes:
-
Semantic Memory: Facts and knowledge
- What you know about the world
- User preferences, domain knowledge
- Stored in profiles (structured) or collections (unstructured)
-
Episodic Memory: Experiences and events
- What happened (timestamped events)
- Past conversations, task outcomes
- Used for learning from experience
-
Procedural Memory: How to do things
- Rules, skills, workflows
- Often implemented as few-shot examples
- "How did I solve this before?" """
LangMem Implementation
""" from langmem import MemoryStore from langgraph.graph import StateGraph
Initialize memory store
memory = MemoryStore( connection_string=os.environ["POSTGRES_URL"] )
Semantic memory: user profile
await memory.semantic.upsert( namespace="user_profile", key=user_id, content={ "name": "Alice", "preferences": ["dark mode", "concise responses"], "expertise_level": "developer", } )
Episodic memory: past interaction
await memory.episodic.add( namespace="conversations", content={ "timestamp": datetime.now(), "summary": "Helped debug authentication issue", "outcome": "resolved", "key_insights": ["Token expiry was root cause"], }, metadata={"user_id": user_id, "topic": "debugging"} )
Procedural memory: learned pattern
await memory.procedural.add( namespace="skills", content={ "task_type": "debug_auth", "steps": ["Check token expiry", "Verify refresh flow"], "example_interaction": few_shot_example, } ) """
Memory Retrieval at Runtime
""" async def prepare_context(user_id, query): # Get user profile (semantic) profile = await memory.semantic.get( namespace="user_profile", key=user_id )
# Find relevant past experiences (episodic) similar_experiences = await memory.episodic.search( namespace="conversations", query=query, filter={"user_id": user_id}, limit=3 ) # Find relevant skills (procedural) relevant_skills = await memory.procedural.search( namespace="skills", query=query, limit=2 ) return { "profile": profile, "past_experiences": similar_experiences, "relevant_skills": relevant_skills, }"""
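Injecting Retrieved Memories into the Prompt

Retrieval only pays off if the memories reach the model. A minimal sketch of
that glue - format_context, the field names, and llm.invoke follow the
hypothetical API above, not any specific library:

"""
def format_context(ctx: dict) -> str:
    # Flatten the prepare_context() result into prompt text
    lines = [f"User profile: {ctx['profile']}"]
    lines.append("Relevant past experiences:")
    for exp in ctx["past_experiences"]:
        lines.append(f"- {exp['summary']} (outcome: {exp['outcome']})")
    lines.append("Relevant skills:")
    for skill in ctx["relevant_skills"]:
        lines.append(f"- {skill['task_type']}: {' -> '.join(skill['steps'])}")
    return "\n".join(lines)

async def answer(user_id: str, query: str) -> str:
    ctx = await prepare_context(user_id, query)
    system_prompt = "Use this memory when relevant:\n\n" + format_context(ctx)
    return await llm.invoke(system_prompt + "\n\nUser: " + query)
"""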
- name: Vector Store Selection Pattern
  description: Choosing the right vector database for your use case
  when: Setting up persistent memory storage
  example: |
VECTOR STORE SELECTION:
""" Decision matrix:
Pinecone Qdrant Weaviate ChromaDB pgvector Scale Billions 100M+ 100M+ 1M 1M Managed Yes Both Both Self Self Filtering Basic Best Good Basic SQL Hybrid No Yes Best No Yes Cost High Medium Medium Free Free Latency 5ms 7ms 10ms 20ms 15ms """ Pinecone (Enterprise Scale)
""" from pinecone import Pinecone
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"]) index = pc.Index("agent-memory")
Upsert with metadata
index.upsert( vectors=[ { "id": f"memory-{uuid4()}", "values": embedding, "metadata": { "user_id": user_id, "timestamp": datetime.now().isoformat(), "type": "episodic", "content": memory_text, } } ], namespace=namespace )
Query with filter
results = index.query( vector=query_embedding, filter={"user_id": user_id, "type": "episodic"}, top_k=5, include_metadata=True ) """
Qdrant (Complex Filtering)
""" from qdrant_client import QdrantClient from qdrant_client.models import PointStruct, Filter, FieldCondition
client = QdrantClient(url="http://localhost:6333")
Complex filtering with Qdrant
results = client.search( collection_name="agent_memory", query_vector=query_embedding, query_filter=Filter( must=[ FieldCondition(key="user_id", match={"value": user_id}), FieldCondition(key="type", match={"value": "semantic"}), ], should=[ FieldCondition(key="topic", match={"any": ["auth", "security"]}), ] ), limit=5 ) """
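Weaviate (Hybrid Search)

The matrix ranks Weaviate best for hybrid search, but no example follows. A
hedged sketch using the v4 Python client, assuming a local instance and an
existing "AgentMemory" collection:

"""
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("AgentMemory")

# Hybrid search: blends keyword (BM25) and vector scores
results = collection.query.hybrid(
    query="authentication bug",
    alpha=0.5,  # 0 = pure keyword, 1 = pure vector
    limit=5,
)
for obj in results.objects:
    print(obj.properties)

client.close()
"""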
ChromaDB (Prototyping)
""" import chromadb
client = chromadb.PersistentClient(path="./memory_db") collection = client.get_or_create_collection("agent_memory")
Simple and fast for prototypes
collection.add( ids=[str(uuid4())], embeddings=[embedding], documents=[memory_text], metadatas=[{"user_id": user_id, "type": "episodic"}] )
results = collection.query( query_embeddings=[query_embedding], n_results=5, where={"user_id": user_id} ) """
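pgvector (PostgreSQL)

The matrix also lists pgvector without an example. A hedged sketch using
psycopg 3 and the pgvector-python adapter; it assumes the vector extension
is already installed, and the table and column names are illustrative:

"""
import os
from uuid import uuid4

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(os.environ["POSTGRES_URL"], autocommit=True)
register_vector(conn)

conn.execute('''
    CREATE TABLE IF NOT EXISTS agent_memory (
        id UUID PRIMARY KEY,
        user_id TEXT,
        type TEXT,
        content TEXT,
        embedding vector(1536)
    )
''')

# Insert a memory
conn.execute(
    "INSERT INTO agent_memory (id, user_id, type, content, embedding) "
    "VALUES (%s, %s, %s, %s, %s)",
    (str(uuid4()), user_id, "episodic", memory_text, np.array(embedding)),
)

# Nearest neighbors with a metadata filter (<=> is cosine distance)
rows = conn.execute(
    "SELECT content FROM agent_memory "
    "WHERE user_id = %s AND type = %s "
    "ORDER BY embedding <=> %s LIMIT 5",
    (user_id, "episodic", np.array(query_embedding)),
).fetchall()
"""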
- name: Chunking Strategy Pattern
  description: Breaking documents into retrievable chunks
  when: Processing documents for memory storage
  example: |
CHUNKING STRATEGIES:
""" The chunking dilemma:
- Too large: Vector loses specificity
- Too small: Loses context
Optimal chunk size depends on:
- Document type (code vs prose vs data)
- Query patterns (factual vs exploratory)
- Embedding model (each has sweet spot)
General guidance: 256-512 tokens for most use cases """
Fixed-Size Chunking (Baseline)
""" from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter( chunk_size=500, # Characters chunk_overlap=50, # Overlap prevents cutting sentences separators=["\n\n", "\n", ". ", " ", ""] # Priority order )
chunks = splitter.split_text(document) """
Semantic Chunking (Better Quality)
""" from langchain_experimental.text_splitter import SemanticChunker from langchain_openai import OpenAIEmbeddings
Splits based on semantic similarity
splitter = SemanticChunker( embeddings=OpenAIEmbeddings(), breakpoint_threshold_type="percentile", breakpoint_threshold_amount=95 )
chunks = splitter.split_text(document) """
Structure-Aware Chunking (Documents with Hierarchy)
""" from langchain.text_splitter import MarkdownHeaderTextSplitter
Respect document structure
splitter = MarkdownHeaderTextSplitter( headers_to_split_on=[ ("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), ] )
chunks = splitter.split_text(markdown_doc)
Each chunk has header metadata for context
"""
Contextual Chunking (Anthropic's Approach)
"""
Add context to each chunk before embedding
Reduces retrieval failures by 35%
def add_context_to_chunk(chunk, document_summary): context_prompt = f''' Document summary: {document_summary}
The following is a chunk from this document: {chunk} ''' return context_promptEmbed the contextualized chunk, not raw chunk
for chunk in chunks: contextualized = add_context_to_chunk(chunk, summary) embedding = embed(contextualized) store(chunk, embedding) # Store original, embed contextualized """
Code-Specific Chunking
""" from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
Language-aware splitting
python_splitter = RecursiveCharacterTextSplitter.from_language( language=Language.PYTHON, chunk_size=1000, chunk_overlap=200 )
Respects function/class boundaries
chunks = python_splitter.split_text(python_code) """
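Testing Retrieval Accuracy

Whichever strategy you pick, measure it before production. A hedged sketch
of a recall@k check - test_set, build_index, and index.search are
hypothetical stand-ins for your corpus and store:

"""
def recall_at_k(test_set, search, k=5):
    '''test_set: list of (query, expected_chunk_id) pairs.'''
    hits = 0
    for query, expected_id in test_set:
        retrieved_ids = search(query, limit=k)  # Ranked chunk ids
        if expected_id in retrieved_ids:
            hits += 1
    return hits / len(test_set)

# Compare chunk sizes against the same test set before committing
for chunk_size in (256, 512, 1024):
    index = build_index(document, chunk_size)
    score = recall_at_k(test_set, index.search, k=5)
    print(f"chunk_size={chunk_size}: recall@5 = {score:.2f}")
"""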
- name: Background Memory Formation
  description: Processing memories asynchronously for better quality
  when: You want higher recall without slowing interactions
  example: |
BACKGROUND MEMORY FORMATION:
""" Real-time memory extraction slows conversations and adds complexity to agent tool calls. Background processing after conversations yields higher quality memories.
Pattern: Subconscious memory formation """
LangGraph Background Processing
""" from langgraph.graph import StateGraph from langgraph.checkpoint.postgres import PostgresSaver
async def background_memory_processor(thread_id: str): # Run after conversation ends or goes idle conversation = await load_conversation(thread_id)
# Extract insights without time pressure insights = await llm.invoke(''' Analyze this conversation and extract: 1. Key facts learned about the user 2. User preferences revealed 3. Tasks completed or pending 4. Patterns in user behavior Be thorough - this runs in background. Conversation: {conversation} ''') # Store to long-term memory for insight in insights: await memory.semantic.upsert( namespace="user_insights", key=generate_key(insight), content=insight, metadata={"source_thread": thread_id} )Trigger on conversation end or idle timeout
@on_conversation_idle(timeout_minutes=5) async def process_conversation(thread_id): await background_memory_processor(thread_id) """
Memory Consolidation (Like Sleep)
"""
Periodically consolidate and deduplicate memories
async def consolidate_memories(user_id: str): # Get all memories for user memories = await memory.semantic.list( namespace="user_insights", filter={"user_id": user_id} )
# Find similar memories (potential duplicates) clusters = cluster_by_similarity(memories, threshold=0.9) # Merge similar memories for cluster in clusters: if len(cluster) > 1: merged = await llm.invoke(f''' Consolidate these related memories into one: {cluster} Preserve all important information. ''') await memory.semantic.upsert( namespace="user_insights", key=generate_key(merged), content=merged ) # Delete originals for old in cluster: await memory.semantic.delete(old.id)"""
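Similarity Clustering Helper

cluster_by_similarity is referenced above but left undefined. A minimal
greedy sketch over embedding cosine similarity - it assumes each memory
object carries an .embedding vector:

"""
import numpy as np

def cluster_by_similarity(memories, threshold=0.9):
    '''Greedy clustering: each memory joins the first cluster whose seed
    is within the cosine-similarity threshold, else starts a new one.'''
    clusters = []
    for mem in memories:
        v = np.asarray(mem.embedding)
        for cluster in clusters:
            seed = np.asarray(cluster[0].embedding)
            sim = (v @ seed) / (np.linalg.norm(v) * np.linalg.norm(seed))
            if sim >= threshold:
                cluster.append(mem)
                break
        else:
            clusters.append([mem])
    return clusters
"""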
- name: Memory Decay Pattern
  description: Forgetting old, irrelevant memories
  when: Memory grows large, retrieval slows down
  example: |
MEMORY DECAY:
""" Not all memories should live forever:
- Old preferences may be outdated
- Task details lose relevance
- Conflicting memories confuse retrieval
Implement intelligent decay based on:
- Recency (when was it created/accessed?)
- Frequency (how often is it retrieved?)
- Importance (is it a core fact or detail?) """
Time-Based Decay
""" from datetime import datetime, timedelta
async def decay_old_memories(namespace: str, max_age_days: int): cutoff = datetime.now() - timedelta(days=max_age_days)
old_memories = await memory.episodic.list( namespace=namespace, filter={"last_accessed": {"$lt": cutoff.isoformat()}} ) for mem in old_memories: # Soft delete (mark as archived) await memory.episodic.update( id=mem.id, metadata={"archived": True, "archived_at": datetime.now()} )"""
Utility-Based Decay (MIRIX Approach)
""" def calculate_memory_utility(memory): ''' Composite utility score inspired by cognitive science: - Recency: When was it last accessed? - Frequency: How often is it accessed? - Importance: How critical is this information? ''' now = datetime.now()
# Recency score (exponential decay with 72h half-life) hours_since_access = (now - memory.last_accessed).total_seconds() / 3600 recency_score = 0.5 ** (hours_since_access / 72) # Frequency score frequency_score = min(memory.access_count / 10, 1.0) # Importance (from metadata or heuristic) importance = memory.metadata.get("importance", 0.5) # Weighted combination utility = ( 0.4 * recency_score + 0.3 * frequency_score + 0.3 * importance ) return utilityasync def prune_low_utility_memories(threshold=0.2): all_memories = await memory.list_all() for mem in all_memories: if calculate_memory_utility(mem) < threshold: await memory.archive(mem.id) """
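Tracking Access Statistics

The utility score depends on last_accessed and access_count, which have to
be maintained at retrieval time. A hedged sketch of a search wrapper that
bumps them, reusing the hypothetical store API from the patterns above:

"""
from datetime import datetime

async def search_and_touch(namespace: str, query: str, limit: int = 5):
    results = await memory.episodic.search(
        namespace=namespace, query=query, limit=limit
    )
    # Record each access so recency/frequency scores stay meaningful
    for mem in results:
        await memory.episodic.update(
            id=mem.id,
            metadata={
                "last_accessed": datetime.now().isoformat(),
                "access_count": mem.access_count + 1,
            },
        )
    return results
"""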
anti_patterns:
- name: Store Everything Forever
  description: Never deleting or archiving memories
  why: |
    Memory bloat slows retrieval, increases costs, and introduces noise.
    Outdated memories can conflict with current facts. Infinite storage
    isn't infinite retrieval.
  instead: |
    Implement decay policies. Archive old episodic memories. Consolidate
    duplicate semantic memories. Test retrieval quality as memory grows.
- name: Chunk Without Testing Retrieval
  description: Choosing chunk size without measuring retrieval accuracy
  why: |
    Chunking destroys context. The "right" chunk size varies by document
    type, query pattern, and embedding model. Without testing, you're
    guessing.
  instead: |
    Create retrieval test sets. Measure recall@k for different chunk sizes.
    Optimize for your actual queries.
- name: Single Memory Type for All Data
  description: Storing everything as generic "memories"
  why: |
    Different information needs different treatment. A user profile
    (structured, small) shouldn't be stored like conversation history
    (unstructured, large).
  instead: |
    Use CoALA types: semantic for facts, episodic for events, procedural
    for skills. Each has different storage and retrieval patterns.
- name: Real-Time Memory Formation
  description: Extracting memories during the conversation
  why: |
    Real-time extraction adds latency, complicates tool calls, and produces
    lower-quality memories under time pressure. Users notice the delay.
  instead: |
    Use background/subconscious memory formation. Process conversations
    after they end or go idle. Higher quality, no latency impact.
- name: Ignoring Memory Conflicts
  description: Storing contradictory facts without resolution
  why: |
    Retrieving both "user prefers dark mode" and "user prefers light mode"
    creates confusion: the agent gives inconsistent answers.
  instead: |
    Detect conflicts on storage. Either replace (for preferences) or
    version (for temporal facts). Consolidate periodically. See the sketch
    below.
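    A hedged write-path sketch - memory.semantic.* mirrors the hypothetical
    store API from the patterns above, and is_contradiction stands in for
    an LLM or rule-based check:
    """
    async def upsert_preference(namespace, content, user_id):
        # Look for semantically similar existing facts
        existing = await memory.semantic.search(
            namespace=namespace,
            query=content,
            filter={"user_id": user_id},
            limit=3,
        )
        for old in existing:
            # Replace policy: new preferences overwrite contradictions
            if await is_contradiction(old.content, content):
                await memory.semantic.delete(old.id)
        await memory.semantic.upsert(
            namespace=namespace,
            key=generate_key(content),
            content=content,
            metadata={"user_id": user_id},
        )
    """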
handoffs:
  receives_from:
    - skill: autonomous-agents
      receives: Memory requirements for agents
    - skill: llm-architect
      receives: RAG and retrieval architecture
    - skill: product-strategy
      receives: Personalization requirements
  hands_to:
    - skill: autonomous-agents
      provides: Memory infrastructure for agents
    - skill: multi-agent-orchestration
      provides: Shared memory for multi-agent systems
    - skill: data-engineer
      provides: Vector store requirements at scale
tags:
- memory
- vector-store
- rag
- retrieval
- embedding
- episodic
- semantic
- procedural
- langmem
- memgpt
- pinecone
- qdrant
- chromadb