# Awesome-omni-skill qdrant-memory
Intelligent token optimization through Qdrant-powered semantic caching and long-term memory. Use for (1) Semantic Cache - avoid LLM calls entirely for semantically similar queries with 100% token savings, (2) Long-Term Memory - retrieve only relevant context chunks instead of full conversation history with 80-95% context reduction, (3) Hybrid Search - combine vector similarity with keyword filtering for technical queries, (4) Memory Management - store and retrieve conversation memories, decisions, and code patterns with metadata filtering. Triggers when needing to cache responses, remember past interactions, optimize context windows, or implement RAG patterns.
```bash
# Clone the full repository
git clone https://github.com/diegosouzapw/awesome-omni-skill

# Or copy only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ai-agents/qdrant-memory" ~/.claude/skills/diegosouzapw-awesome-omni-skill-qdrant-memory && rm -rf "$T"
```
**Source:** `skills/ai-agents/qdrant-memory/SKILL.md`

**Requires:**
- pip install
- makes HTTP requests (curl)
# Qdrant Memory Skill
Token optimization engine using Qdrant vector database for semantic caching and intelligent memory retrieval.
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────┐
│                          USER QUERY                          │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  1. SEMANTIC CACHE CHECK (Cache Hit = 100% Token Savings)    │
│   ┌─────────────────┐    ┌─────────────────────────────────┐ │
│   │   Embed Query   │───▶│ Search Qdrant (similarity>0.9)  │ │
│   └─────────────────┘    └─────────────────────────────────┘ │
│                       │                                      │
│          ┌────────────┴──────────────┐                       │
│          ▼                           ▼                       │
│     [CACHE HIT]                [CACHE MISS]                  │
│    Return cached                Continue to                  │
│      response                       LLM                      │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  2. CONTEXT RETRIEVAL (RAG - 80-95% Context Reduction)       │
│   ┌─────────────────┐    ┌─────────────────────────────────┐ │
│   │  Identify Need  │───▶│ Retrieve Top-K Relevant Chunks  │ │
│   └─────────────────┘    └─────────────────────────────────┘ │
│     Instead of 20K tokens ───▶ Only 500-1000 tokens          │
└──────────────────────────────────────────────────────────────┘
```
## Prerequisites

### Qdrant (Vector Database)

```bash
# Option 1: Docker (recommended)
docker run -d -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant

# Option 2: Docker Compose (persistent)
# See references/complete_guide.md for docker-compose.yml
```
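To confirm the container is reachable before moving on, a quick check against Qdrant's REST API (shown here with Python's `requests`; any HTTP client works):

```python
import requests

# Lists existing collections; an empty list is fine on a fresh install
resp = requests.get("http://localhost:6333/collections", timeout=5)
resp.raise_for_status()
print(resp.json())  # {"result": {"collections": [...]}, "status": "ok", ...}
```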
### Embeddings Provider
Choose based on your needs:
| Provider | Privacy | Cost | Speed | Setup |
|---|---|---|---|---|
| Ollama (recommended) | ✅ Fully Local | Free | Fast (Metal) | `brew install ollama` |
| Bedrock (AWS/Kiro) | ⚡ AWS Cloud | ~$0.02/1M tokens | Fast | Uses AWS profile (no key) |
| OpenAI | ❌ Cloud | ~$0.02/1M tokens | Fast | API key required |
### Ollama Setup (M3 Mac Optimized)

```bash
# 1. Install Ollama (if not already installed)
brew install ollama

# 2. Start server (choose one option)
ollama serve                  # Foreground (Ctrl+C to stop)
ollama serve &                # Background (current terminal)
nohup ollama serve &          # Background (survives terminal close)

# 3. Pull embedding model (768 dimensions, excellent quality)
ollama pull nomic-embed-text

# 4. Verify server is running
curl http://localhost:11434/api/tags

# 5. Test embedding generation
curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"hello"}'
```
**Tip:** To auto-start Ollama on login, add `ollama serve &` to your `~/.zshrc`, or use `brew services start ollama`.
**Note:** For Ollama, use `--dimension 768` when creating collections.
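For scripts that need vectors programmatically, the curl test above translates to a minimal Python sketch:

```python
import requests

# Same call as the curl test above; Ollama returns {"embedding": [...]}
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "hello"},
    timeout=30,
)
embedding = resp.json()["embedding"]
print(len(embedding))  # 768 for nomic-embed-text
```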
### Amazon Bedrock Setup (AWS/Kiro Subscription)
Uses your existing AWS credentials - no secrets stored in code.
```bash
# 1. Ensure AWS CLI is configured (uses ~/.aws/credentials)
aws configure                     # Or set AWS_PROFILE for a specific profile

# 2. Install boto3 if not present
pip install boto3

# 3. Set environment variables
export EMBEDDING_PROVIDER=bedrock
export AWS_REGION=eu-west-1       # Default region

# 4. Test authentication
python3 skills/qdrant-memory/scripts/embedding_utils.py
```
Models Available (cheapest first):
| Model | Dimensions | Pricing |
|---|---|---|
| Titan V2 (`amazon.titan-embed-text-v2:0`) | 1024 | ~$0.02/1M tokens |
| | 1536 | ~$0.02/1M tokens |
| | 1024 | ~$0.10/1M tokens |
**Note:** For Bedrock Titan V2, use `--dimension 1024` when creating collections.
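A minimal boto3 sketch of the same embedding call (the model ID is the standard Titan Text Embeddings V2 identifier; availability varies by region):

```python
import json
import boto3

# Uses the ambient AWS credentials/profile; no keys in code
client = boto3.client("bedrock-runtime", region_name="eu-west-1")

resp = client.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    contentType="application/json",
    body=json.dumps({"inputText": "hello"}),
)
embedding = json.loads(resp["body"].read())["embedding"]
print(len(embedding))  # 1024 for Titan Text Embeddings V2
```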
### OpenAI Setup (Cloud)

```bash
export OPENAI_API_KEY="sk-..."
```
## Quick Start

### MCP Server Configuration

```json
{
  "qdrant-mcp": {
    "command": "npx",
    "args": ["-y", "@qdrant/mcp-server-qdrant"],
    "env": {
      "QDRANT_URL": "http://localhost:6333",
      "QDRANT_API_KEY": "${QDRANT_API_KEY}",
      "COLLECTION_NAME": "agent_memory"
    }
  }
}
```
### Initialize Memory Collection

Run `scripts/init_collection.py` to create the optimized collection:

```bash
# For Ollama (nomic-embed-text - 768 dimensions)
python3 scripts/init_collection.py --collection agent_memory --dimension 768

# For OpenAI (text-embedding-3-small - 1536 dimensions)
python3 scripts/init_collection.py --collection agent_memory --dimension 1536
```
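If you prefer to create the collection directly rather than via the script, the equivalent qdrant-client call looks roughly like this (a sketch; the script's actual internals may differ):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Cosine distance matches the schemas used throughout this skill
client.create_collection(
    collection_name="agent_memory",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```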
## Core Capabilities

### 1. Semantic Cache (Maximum Token Savings)
Purpose: Avoid LLM calls entirely for semantically similar queries.
Flow:
- Embed incoming query
- Search Qdrant for similar past queries (threshold > 0.9)
- If match found → return cached response (100% token savings)
- If no match → proceed to LLM, then cache result
Implementation:
```python
from scripts.semantic_cache import check_cache, store_response

def cached_generate(query: str, llm) -> str:
    # Cache check before LLM call
    cached = check_cache(query, similarity_threshold=0.92)
    if cached:
        return cached["response"]  # 100% token savings

    # Generate response with LLM
    response = llm.generate(query)

    # Store for future cache hits
    store_response(query, response, metadata={
        "type": "cache",
        "model": "gpt-4",
        "tokens_saved": len(response.split()),
    })
    return response
```
Collection Schema:
{ "collection": "semantic_cache", "vectors": { "size": 1536, "distance": "Cosine" }, "payload_schema": { "query": "keyword", "response": "text", "timestamp": "datetime", "model": "keyword", "token_count": "integer" } }
### 2. Long-Term Memory (Context Optimization)
Purpose: Retrieve only relevant context instead of full conversation history.
**Problem:** 20,000-token conversation history → expensive + confuses the model
**Solution:** Query Qdrant → return only the top 3-5 relevant chunks (500-1000 tokens)
Memory Types:
| Type | Payload Filter | Use Case |
|---|---|---|
| Decision | `{"type": "decision"}` | Past architectural/design decisions |
| Code | `{"type": "code"}` | Previously written code patterns |
| Error | `{"type": "error"}` | How past errors were resolved |
| Conversation | `{"type": "conversation"}` | Key conversation points |
| Technical | `{"type": "technical"}` | Technical knowledge/docs |
Implementation:
```python
from scripts.memory_retrieval import retrieve_context

# Instead of passing 20K tokens of history:
relevant_chunks = retrieve_context(
    query="What did we decide about the database architecture?",
    filters={"type": "decision"},
    top_k=5,
    score_threshold=0.7,
)

# Build optimized prompt with only relevant context
prompt = f"""
Relevant Context:
{relevant_chunks}

User Question:
{user_query}
"""

# Now only ~1000 tokens instead of 20,000
```
### 3. Hybrid Search (Vector + Keyword)
Purpose: Combine semantic similarity with exact keyword matching for technical queries.
When to use: Error codes, variable names, specific identifiers
```python
from scripts.hybrid_search import hybrid_query

results = hybrid_query(
    text_query="kubernetes deployment failed",
    keyword_filters={
        "error_code": "ImagePullBackOff",
        "namespace": "production",
    },
    fusion_weights={"text": 0.7, "keyword": 0.3},
)
```
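Under the hood this maps naturally onto Qdrant's filtered vector search. A self-contained sketch, assuming the Ollama setup above (note that it applies the keywords as hard filters rather than the weighted fusion that `fusion_weights` implies):

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

# Embed the query text (Ollama, as configured in Prerequisites)
query_vector = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "kubernetes deployment failed"},
    timeout=30,
).json()["embedding"]

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="agent_memory",
    query_vector=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="error_code", match=MatchValue(value="ImagePullBackOff")),
        FieldCondition(key="namespace", match=MatchValue(value="production")),
    ]),
    limit=5,
)
```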
## MCP Tools Reference

| Tool | Purpose |
|---|---|
| `qdrant_store_memory` | Store embeddings with metadata |
| `qdrant_search_memory` | Semantic search with filters |
| | Remove memories by ID or filter |
| | View available collections |
| | Collection stats and config |
### Store Memory

```json
{
  "tool": "qdrant_store_memory",
  "arguments": {
    "content": "We decided to use PostgreSQL for user data due to ACID compliance requirements",
    "metadata": {
      "type": "decision",
      "project": "api-catalogue",
      "date": "2026-01-22",
      "tags": ["database", "architecture"]
    }
  }
}
```
### Search Memory

```json
{
  "tool": "qdrant_search_memory",
  "arguments": {
    "query": "database architecture decisions",
    "filter": {
      "must": [{ "key": "type", "match": { "value": "decision" } }]
    },
    "limit": 5,
    "score_threshold": 0.7
  }
}
```
## Payload Filtering Patterns

### Filter by Type

```json
{
  "filter": {
    "must": [{ "key": "type", "match": { "value": "technical" } }]
  }
}
```
### Filter by Project + Date Range

```json
{
  "filter": {
    "must": [
      { "key": "project", "match": { "value": "api-catalogue" } },
      { "key": "timestamp", "range": { "gte": "2026-01-01" } }
    ]
  }
}
```
### Exclude Certain Tags

```json
{
  "filter": {
    "must_not": [
      { "key": "tags", "match": { "any": ["deprecated", "archived"] } }
    ]
  }
}
```
## Collection Design Patterns

### Single Collection (Simple)

```
agent_memory/
├── type: "cache" | "decision" | "code" | "error" | "conversation"
├── project: "<project_name>"
├── timestamp: "<ISO8601>"
└── content: "<text>"
```
### Multi-Collection (Advanced)

| Collection | Purpose | Retention |
|---|---|---|
| `semantic_cache` | Query-response cache | 7 days |
| | Architectural decisions | Permanent |
| | Reusable code snippets | 90 days |
| | Key conversation points | 30 days |
| | Error solutions | 60 days |
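Retention like this can be enforced with a periodic delete-by-filter. A sketch assuming a recent qdrant-client (with `DatetimeRange`) and timestamps stored in a datetime-compatible payload field:

```python
from datetime import datetime, timedelta, timezone
from qdrant_client import QdrantClient
from qdrant_client.models import (
    DatetimeRange, FieldCondition, Filter, FilterSelector,
)

client = QdrantClient(url="http://localhost:6333")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

# Delete every cache point whose timestamp predates the retention window
client.delete(
    collection_name="semantic_cache",
    points_selector=FilterSelector(
        filter=Filter(must=[
            FieldCondition(key="timestamp", range=DatetimeRange(lt=cutoff)),
        ])
    ),
)
```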
## Token Savings Metrics
Track savings with metadata:
{ "tokens_input_saved": 15000, "tokens_output_saved": 2000, "cost_saved_usd": 0.27, "cache_hit": True, "retrieval_latency_ms": 45 }
Expected Savings:
| Scenario | Without Qdrant | With Qdrant | Savings |
|---|---|---|---|
| Repeated question | 8K tokens | 0 tokens | 100% |
| Context retrieval | 20K tokens | 1K tokens | 95% |
| Hybrid lookup | 15K tokens | 2K tokens | 87% |
## Best Practices

### Embedding Model Selection
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| `text-embedding-3-small` | 1536 | Fast | Good | General use |
| `text-embedding-3-large` | 3072 | Medium | Excellent | High accuracy |
| `all-MiniLM-L6-v2` | 384 | Fastest | Good | Local/private |
### Cache Invalidation
- Time-based: Expire cache entries after N days
- Manual: Clear cache when underlying data changes
- Version-based: Include model version in metadata
### Memory Hygiene
- Deduplicate: Check similarity before storing (see the sketch after this list)
- Prune: Remove low-value memories periodically
- Compress: Summarize long conversations before storing
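A minimal sketch of the deduplication check, assuming qdrant-client and the single-collection layout above (`store_if_new` is a hypothetical helper, not one of the skill's scripts):

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")

def store_if_new(vector: list[float], payload: dict,
                 collection: str = "agent_memory",
                 dedup_threshold: float = 0.95) -> bool:
    # Skip the write if a near-duplicate memory already exists
    dup = client.search(
        collection_name=collection,
        query_vector=vector,
        limit=1,
        score_threshold=dedup_threshold,
    )
    if dup:
        return False
    client.upsert(
        collection_name=collection,
        points=[PointStruct(id=str(uuid.uuid4()), vector=vector, payload=payload)],
    )
    return True
```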
## References

- See `references/complete_guide.md` for full setup, testing, and troubleshooting
- See `references/collection_schemas.md` for complete schema definitions
- See `references/embedding_models.md` for model comparisons
- See `references/advanced_patterns.md` for RAG optimization patterns
## AGI Framework Integration

### Qdrant Memory Integration

Before executing complex tasks with this skill:

```bash
python3 execution/memory_manager.py auto --query "<task summary>"
```
Decision Tree:
- Cache hit? Use the cached response directly; no need to re-process.
- Memory match? Inject `context_chunks` into your reasoning.
- No match? Proceed normally, then store results:

```bash
python3 execution/memory_manager.py store \
  --content "Description of what was decided/solved" \
  --type decision \
  --tags qdrant-memory <relevant-tags>
```
Note: Storing automatically updates both Vector (Qdrant) and Keyword (BM25) indices.
### Agent Team Collaboration

- Strategy: This skill communicates via the shared memory system.
- Orchestration: Invoked by the `orchestrator` via intelligent routing.
- Context Sharing: Always read previous agent outputs from memory before starting.
### Local LLM Support
When available, use local Ollama models for embedding and lightweight inference:
- Embeddings: `nomic-embed-text` via the Qdrant memory system
- Lightweight analysis: Local models reduce API costs for repetitive patterns