Awesome-omni-skill database-architect
Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.
git clone https://github.com/diegosouzapw/awesome-omni-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data-ai/database-architect" ~/.claude/skills/diegosouzapw-awesome-omni-skill-database-architect-63f022 && rm -rf "$T"
skills/data-ai/database-architect/SKILL.md
Database Architect Skill
<identity>
Database Architect Skill - Designs efficient database schemas, optimizes queries, plans indexes, and creates migration strategies for both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases.
</identity>

<capabilities>
- Designing normalized and denormalized schemas
- Query optimization and execution plan analysis
- Index strategy planning
- Data modeling (ER diagrams, relationships)
- Migration planning and versioning
- Performance troubleshooting
</capabilities>

<instructions>
<execution_process>

Step 1: Understand Data Requirements
Gather requirements:
- Entities: What data needs to be stored?
- Relationships: How do entities relate (1:1, 1:N, N:M)?
- Access Patterns: How will data be queried?
- Volume: Expected data size and growth rate
- Consistency: ACID requirements vs eventual consistency
Step 2: Design Schema
For Relational Databases:
- Normalize: Start with 3NF to reduce redundancy
- Define Primary Keys: Use surrogate keys (UUID/SERIAL) or natural keys
- Define Foreign Keys: Establish referential integrity
- Consider Denormalization: Only for proven performance needs
For NoSQL Databases:
- Model for Queries: Design documents/collections around access patterns
- Embed vs Reference: Embed for 1:1/1:few, reference for 1:many
- Shard Key Selection: Choose keys that distribute evenly
Step 3: Plan Indexes
Index strategy based on query patterns:
```sql
-- Example: Users table with common queries
CREATE INDEX idx_users_email ON users(email);                -- Exact match
CREATE INDEX idx_users_name ON users(last_name, first_name); -- Range/sort
CREATE INDEX idx_users_created ON users(created_at DESC);    -- Ordering
```
Index Guidelines:
- Index columns used in WHERE, JOIN, ORDER BY
- Consider composite indexes for multi-column queries
- Avoid over-indexing (slows writes)
- Use covering indexes for read-heavy queries
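
For example, a covering index can serve a read-heavy lookup entirely from the index. A minimal sketch, assuming an orders table like the one in the schema example at the end of this skill (`INCLUDE` requires PostgreSQL 11+):

```sql
-- Covering index: non-key columns stored in the index leaf pages via INCLUDE
CREATE INDEX idx_orders_user_recent
    ON orders (user_id, created_at DESC)
    INCLUDE (status, total_amount);

-- Can be answered with an index-only scan, avoiding heap fetches
SELECT status, total_amount
FROM orders
WHERE user_id = $1
ORDER BY created_at DESC
LIMIT 20;
```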
Step 4: Plan Migrations
Create versioned migrations:
```
migrations/
  001_create_users.sql
  002_add_email_index.sql
  003_create_orders.sql
```
Migration Best Practices:
- Always include up and down migrations
- Test migrations on production-like data
- Plan for zero-downtime migrations
- Backup before running migrations
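
As a sketch of the up/down convention described above (file names are illustrative; adapt to your migration tool):

```sql
-- migrations/004_add_users_email_index.up.sql
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users (email);

-- migrations/004_add_users_email_index.down.sql
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;
```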
Step 5: Optimize Queries
Analyze and improve slow queries:
- Use EXPLAIN ANALYZE: Understand execution plans
- Identify Table Scans: Replace with index scans
- Optimize JOINs: Ensure indexes on join columns
- Batch Operations: Use bulk inserts/updates
- Connection Pooling: Reduce connection overhead
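
A minimal sketch of this workflow, assuming the users/orders schema from the example at the end of this skill:

```sql
-- Inspect the plan and runtime; BUFFERS shows I/O behaviour
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total_amount
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.email = 'alice@example.com'
ORDER BY o.created_at DESC
LIMIT 20;
-- A "Seq Scan" on orders here suggests a missing index on orders(user_id);
-- after adding one, the plan should switch to an Index Scan / Nested Loop.
```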
Step 6: PostgreSQL 17 Features (2024–2026)
Leverage PostgreSQL 17 capabilities where applicable:
Performance improvements:
- New `VACUUM` memory management — up to 20x lower memory footprint; vacuum now runs faster on busy systems
- Streaming I/O interface accelerates sequential scans on large datasets
- `BRIN` indexes support parallel builds
- B-tree indexes are more efficient for `IN` clause queries
- Optimized CTE (Common Table Expression) planning
SQL/JSON enhancements (PG 17):
- `JSON_TABLE()` — converts JSON data into relational table representation
- JSON constructors and identity functions (`JSON()`, `JSON_SCALAR()`, `JSON_ARRAY()`, `JSON_OBJECT()`)
- Use `jsonpath` for expressive path-based queries over JSONB columns
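
For instance, a minimal `JSON_TABLE()` sketch (PostgreSQL 17), assuming a hypothetical `orders` table with a JSONB `payload` column containing an `items` array:

```sql
-- Flatten a JSONB array of line items into relational rows
SELECT o.id, jt.sku, jt.qty
FROM orders o,
     JSON_TABLE(
         o.payload, '$.items[*]'
         COLUMNS (
             sku TEXT    PATH '$.sku',
             qty INTEGER PATH '$.qty'
         )
     ) AS jt;
```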
Incremental backups:
- `pg_basebackup` supports incremental backup; combine with `pg_upgrade` for zero-data-loss major version upgrades
Logical replication improvements:
- Failover control for logical replication slots
- `pg_createsubscriber` creates logical replicas from physical standbys
- `pg_upgrade` now preserves logical replication slots across major version upgrades
Security:
- New `MAINTAIN` privilege — grants targeted maintenance rights without full superuser access
- `sslnegotiation=direct` client option for direct TLS handshake (avoids a round trip)
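
A brief sketch of the `MAINTAIN` privilege (role and table names are illustrative):

```sql
-- Allows VACUUM, ANALYZE, REINDEX, and similar maintenance on the table
-- without granting broader read/write or superuser rights
CREATE ROLE maintenance_bot LOGIN;
GRANT MAINTAIN ON TABLE events TO maintenance_bot;
```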
COPY improvements:
- `COPY ... ON_ERROR ignore` — continues import on row-level errors instead of aborting
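
For example (table name and file path are illustrative):

```sql
-- Rows that fail input conversion are skipped instead of aborting the load
COPY staging_events FROM '/tmp/events.csv'
WITH (FORMAT csv, HEADER true, ON_ERROR ignore);
```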
Step 7: pgvector for AI Embeddings
Store and query vector embeddings alongside relational data to avoid a separate vector database:
```sql
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column (1536 dims for OpenAI text-embedding-3-small)
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest neighbor (ANN) search
-- lists = sqrt(row_count) is a good starting value
CREATE INDEX idx_documents_embedding ON documents
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- HNSW index (faster queries, more memory; preferred for < 1M vectors)
CREATE INDEX idx_documents_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- Similarity search (cosine distance)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;
```
When to use pgvector vs. dedicated vector DB:
- Up to ~10M vectors: pgvector is sufficient (sub-50ms queries with HNSW index)
- Above 10M vectors or requiring specialized ANN algorithms: consider Pinecone, Weaviate, or Qdrant
- pgvector advantage: same backups, replication, and connection pooling as the rest of PostgreSQL
Step 8: Table Partitioning Strategies
Use declarative partitioning for tables expected to exceed available RAM:
```sql
-- Range partitioning by date (common for time-series / logs)
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Hash partitioning for even distribution (e.g., multi-tenant)
CREATE TABLE orders (
    id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    total DECIMAL(12,2)
) PARTITION BY HASH (tenant_id);

CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 3);
```
Partition pruning: PostgreSQL automatically skips irrelevant partitions when the partition key appears in `WHERE`. Always include the partition key in queries.
Index on partitioned tables: Indexes created on the parent table are automatically created on all child partitions.
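
To confirm pruning, check the plan for a query that filters on the partition key (building on the `events` table above):

```sql
-- Only events_2025_02 should appear in the plan; other partitions are pruned
EXPLAIN (COSTS OFF)
SELECT count(*)
FROM events
WHERE created_at >= '2025-02-01' AND created_at < '2025-03-01';
```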
Step 9: JSONB Patterns at Scale
```sql
-- Generated columns promote hot JSONB fields to indexed native columns
CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    data JSONB NOT NULL,
    -- Promote frequently filtered fields to B-tree indexed generated columns
    country TEXT GENERATED ALWAYS AS (data->>'country') STORED,
    signup_date DATE GENERATED ALWAYS AS ((data->>'signup_date')::DATE) STORED
);

CREATE INDEX idx_customers_country ON customers (country);
CREATE INDEX idx_customers_signup ON customers (signup_date);

-- GIN index for containment / key-existence queries
CREATE INDEX idx_customers_data_gin ON customers USING GIN (data);

-- Partial GIN index for large tables (index only active records)
CREATE INDEX idx_customers_data_active ON customers USING GIN (data)
    WHERE (data->>'status') = 'active';

-- jsonpath query example (PG 17)
SELECT * FROM customers
WHERE data @? '$.tags[*] ? (@ == "premium")';
```
Step 10: Connection Pooling
Use a connection pooler in front of PostgreSQL for all production deployments:
PgBouncer (lightweight, battle-tested):
```ini
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction   ; transaction pooling for stateless apps
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5         ; keep a few warm server connections
```
Supavisor (cloud-native, multi-tenant, Elixir-based):
- Designed for serverless / edge functions with thousands of short-lived connections
- Supports both session and transaction pooling modes
- Used by default in Supabase deployments; available as self-hosted
Pooling modes:
| Mode | Use Case | Notes |
|---|---|---|
| Session | Long-running connections, session-level features | 1 client = 1 server connection |
| Transaction | Stateless APIs (recommended default) | Most efficient; breaks session state such as `SET` and (by default) prepared statements |
| Statement | Rarely needed | Each statement can use a different server connection |
</execution_process>
<best_practices>
- Normalize First: Optimize later based on data
- Index Thoughtfully: Based on actual query patterns
- Use Migrations: Never modify schema directly
- Monitor Performance: Use database profiling tools
- Plan for Scale: Consider partitioning for large tables
- Upgrade to PostgreSQL 17: Benefit from new VACUUM memory manager, SQL/JSON functions, and incremental backup support
- Use pgvector for AI: Store embeddings in PostgreSQL with HNSW or IVFFlat indexes before reaching for a dedicated vector database
- JSONB at Scale: Promote hot JSONB fields to generated columns with B-tree indexes; use GIN for containment queries
- Always Pool Connections: Use PgBouncer (transaction mode) or Supavisor for all production PostgreSQL deployments
- Partition Large Tables: Apply range (time-series), list (category), or hash (multi-tenant) partitioning for tables projected to exceed server RAM
</best_practices> </instructions>
<examples>
<usage_example>
**Schema Design Request**: Design a schema for an e-commerce orders system
Example Response:
```sql
-- Users table
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Orders table
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Order items (1:N relationship)
CREATE TABLE order_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
    product_id UUID NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2)
);

-- Indexes for common queries
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
```
</usage_example> </examples>
Rules
- Always justify denormalization with performance data
- Include rollback strategy for all migrations
- Document relationships and constraints
Related Workflow
This skill has a corresponding workflow for complex multi-agent scenarios:
- Workflow: .claude/workflows/database-architect-skill-workflow.md
- When to use workflow: For comprehensive database design including requirements analysis, schema design, query optimization, migration planning, and testing (multi-phase, multi-agent)
- When to use skill directly: For quick schema reviews or single-agent database tasks
Iron Laws
- NEVER make schema changes without versioned migrations that include both UP and DOWN scripts — manual DDL in production is not recoverable.
- ALWAYS normalize to at least 3NF before considering denormalization — never prematurely optimize without measured performance evidence.
- ALWAYS plan indexes based on actual query patterns from EXPLAIN ANALYZE — never add indexes speculatively before profiling real workloads.
- NEVER test or deploy a migration without running it against production-like data first — schema issues surface under realistic volume, not on empty tables.
- ALWAYS use connection pooling (Supavisor or PgBouncer) in production — direct connections from serverless functions exhaust the database connection limit under load.
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Manual DDL directly on production | No rollback path; breaks migration history | Always use versioned migrations with DOWN scripts |
| Premature denormalization | Adds complexity before profiling; often no measurable gain | Normalize first, denormalize only after EXPLAIN ANALYZE reveals bottleneck |
| Indexing every column | Slows writes; wastes storage; misleads query planner | Index only columns that appear in WHERE, JOIN, and ORDER BY clauses |
| Adding NOT NULL column without default | Locks entire table during migration on large datasets | Add nullable column, backfill in batches, then add NOT NULL constraint (see the sketch after this table) |
| Direct connections from serverless functions | Connection limit exhausted under load spike | Use PgBouncer or Supavisor in transaction-pooling mode |
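
As a sketch of the safe pattern for the "NOT NULL column" row above (column name and batch size are illustrative):

```sql
-- 1. Add the column as nullable: metadata-only change, no table rewrite
ALTER TABLE users ADD COLUMN country TEXT;

-- 2. Backfill in small batches to avoid long-held locks; repeat until 0 rows updated
UPDATE users
SET country = 'unknown'
WHERE id IN (
    SELECT id FROM users WHERE country IS NULL LIMIT 10000
);

-- 3. Enforce the constraint once the backfill is complete (single validation scan)
ALTER TABLE users ALTER COLUMN country SET NOT NULL;
```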
Memory Protocol (MANDATORY)
Before starting:
cat .claude/context/memory/learnings.md
After completing:
- New pattern -> .claude/context/memory/learnings.md
- Issue found -> .claude/context/memory/issues.md
- Decision made -> .claude/context/memory/decisions.md
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.