Claude-skill-registry dean-large-scale-systems

Design large-scale systems using Jeff Dean's pragmatic approach. Emphasizes performance intuition, building reliable systems from unreliable components, and solving problems at Google scale. Use when building infrastructure that must handle millions of QPS, petabytes of data, or global distribution.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/dean" ~/.claude/skills/majiayu000-claude-skill-registry-dean-large-scale-systems && rm -rf "$T"
manifest: skills/data/dean/SKILL.md
source content

Jeff Dean Style Guide

Overview

Jeff Dean is the architect behind much of Google's infrastructure: MapReduce, BigTable, Spanner, TensorFlow, and more. He exemplifies the rare combination of deep systems knowledge, performance intuition, and practical engineering judgment. His work defines how modern internet-scale systems are built.

Core Philosophy

"Design for 10x the current load, but plan to rewrite before 100x."

"Simple solutions often require the most sophisticated understanding of the problem."

"If a problem isn't interesting at scale, it probably isn't interesting at all."

Design Principles

  1. Embrace Failure: At scale, everything fails. Design systems that degrade gracefully, not catastrophically.

  2. Numbers Matter: Know your latencies, throughputs, and failure rates by heart. Performance intuition comes from data.

  3. Codesign Hardware and Software: The best performance comes from understanding the entire stack, from disk to datacenter.

  4. Simplicity at Scale: Complex systems break in complex ways. The simplest solution that scales is usually the best.

  5. Measure, Then Optimize: Never optimize without profiling. Intuition fails; data doesn't.

Numbers Every Engineer Should Know

L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns
Compress 1K bytes with Zippy             3,000   ns
Send 1K bytes over 1 Gbps network       10,000   ns
Read 4K randomly from SSD              150,000   ns
Read 1 MB sequentially from memory     250,000   ns
Round trip within same datacenter      500,000   ns
Read 1 MB sequentially from SSD      1,000,000   ns
Disk seek                           10,000,000   ns
Read 1 MB sequentially from disk    20,000,000   ns
Send packet CA→Netherlands→CA      150,000,000   ns

These numbers should guide every design decision.
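
As a sanity check, a few lines of Python (the constants simply restate the table above; `time_to_read_gb` is an illustrative helper, not part of any real API) show why the storage hierarchy dominates design decisions:

```python
# Approximate latencies from the table above, in nanoseconds per MB.
READ_1MB_MEMORY_NS = 250_000
READ_1MB_SSD_NS = 1_000_000
READ_1MB_DISK_NS = 20_000_000

def time_to_read_gb(per_mb_ns: int, gigabytes: float = 1.0) -> float:
    """Seconds to read `gigabytes` sequentially at the given per-MB cost."""
    return gigabytes * 1024 * per_mb_ns / 1e9

# Reading 1 GB sequentially:
#   memory ~0.26 s, SSD ~1 s, disk ~20 s
```

An 80x gap between memory and disk is why caching, batching, and sequential layout dominate large-system design.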

When Designing Systems

Always

  • Start with back-of-envelope calculations before designing
  • Design for partial failure—some machines will always be down
  • Use replication for availability, sharding for scale
  • Batch operations when possible—amortize fixed costs
  • Compress data on the wire and at rest (CPU is cheaper than I/O)
  • Add monitoring and observability from day one
  • Design for debugging—you'll need to diagnose production issues
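
The batching bullet can be quantified: with a fixed per-request overhead, cost per item falls roughly as 1/batch_size (the numbers below are illustrative, not measurements):

```python
def effective_cost_per_item(
    fixed_overhead_us: float, per_item_us: float, batch_size: int
) -> float:
    """Average cost per item when a fixed per-request overhead
    (RPC setup, syscall, disk seek) is amortized over a batch."""
    return fixed_overhead_us / batch_size + per_item_us

# e.g. 500 µs RPC overhead, 5 µs of work per item:
#   batch of 1   → 505 µs/item
#   batch of 100 →  10 µs/item
```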

Never

  • Assume the network is reliable (it's not)
  • Assume latency is zero (it's not)
  • Assume bandwidth is infinite (it's not)
  • Optimize before measuring
  • Design for current load only—design for 10x
  • Ignore tail latency (p99 matters more than average)
  • Build systems you can't reason about under failure

Prefer

  • Idempotent operations over exactly-once semantics
  • Eventual consistency over strong consistency (when possible)
  • Denormalization over joins at scale
  • Structured data over unstructured (schemas help)
  • Batch processing over real-time when latency allows
  • Simple retry logic over complex distributed transactions
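
A minimal sketch of the idempotency preference: deduplicate by key so a retry replays the stored result instead of repeating the side effect (all names here are illustrative):

```python
class PaymentService:
    """Illustrative sketch: retries of the same request are safe because
    the result of the first successful attempt is replayed, not re-run."""

    def __init__(self):
        self._processed: dict[str, str] = {}  # idempotency_key -> result

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._processed:
            # Duplicate (e.g. a client retry): replay the recorded result.
            return self._processed[idempotency_key]
        result = f"charged {amount_cents}"  # stand-in for the real side effect
        self._processed[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.charge("req-123", 500)
retry = svc.charge("req-123", 500)  # safe: no double charge
assert first == retry
```

Combined with simple retries, this gives at-least-once delivery with effectively-once processing, without a distributed transaction.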

Architectural Patterns

MapReduce Mental Model

Problem: Process petabytes of data
Solution: 
  1. Map: Transform input into (key, value) pairs in parallel
  2. Shuffle: Group all values by key
  3. Reduce: Aggregate values for each key

Why it works:
  - Embarrassingly parallel map phase
  - Fault tolerance via re-execution
  - Simple programming model hides distribution
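
The three phases can be sketched in a single process (`mapreduce`, `map_fn`, and `reduce_fn` are illustrative names, not the real Google API; the real system parallelizes the map phase across machines):

```python
from collections import defaultdict
from itertools import chain

def mapreduce(inputs, map_fn, reduce_fn):
    """Single-process sketch of the three MapReduce phases."""
    # Map: each input independently yields (key, value) pairs.
    pairs = chain.from_iterable(map_fn(x) for x in inputs)
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce: aggregate each key's values.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Classic word count:
docs = ["the quick fox", "the lazy dog"]
counts = mapreduce(
    docs,
    map_fn=lambda doc: [(w, 1) for w in doc.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
# counts["the"] == 2
```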

BigTable Design

Problem: Structured storage at massive scale
Solution:
  - Sparse, distributed, multi-dimensional sorted map
  - (row, column, timestamp) → value
  - Rows sorted lexicographically (enables range scans)
  - Column families for locality
  - Tablets (row ranges) as unit of distribution

Key insight: One data model, flexible enough for many use cases.
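
A toy sketch of that data model (`TinyTable` is an illustrative name; real BigTable's tablets, SSTables, and compactions are elided):

```python
import bisect

class TinyTable:
    """Toy sketch of the BigTable data model: a sorted map from
    (row, column, timestamp) to value, supporting row-range scans."""

    def __init__(self):
        self._keys: list[tuple[str, str, int]] = []  # kept sorted
        self._values: dict[tuple[str, str, int], bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_rows(self, start_row: str, end_row: str):
        """Range scan: lexicographic row order makes this a contiguous slice."""
        lo = bisect.bisect_left(self._keys, (start_row, "", 0))
        for key in self._keys[lo:]:
            if key[0] >= end_row:
                break
            yield key, self._values[key]

t = TinyTable()
t.put("com.example/a", "contents:html", 1, b"<a>")
t.put("com.example/b", "contents:html", 1, b"<b>")
t.put("org.example/c", "contents:html", 1, b"<c>")
rows = list(t.scan_rows("com.", "com.z"))  # only the two com.* rows
```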

Spanner's TrueTime

Problem: Global consistency requires synchronized clocks
Solution:
  - GPS + atomic clocks in every datacenter
  - API returns interval [earliest, latest] not a point
  - Wait out uncertainty before committing
  
TrueTime.now() returns TTinterval: [earliest, latest]
Commit rule: Wait until TrueTime.now().earliest > commit_timestamp
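
A minimal sketch of the commit-wait rule, assuming a fixed uncertainty bound (real TrueTime's epsilon comes from GPS and atomic-clock hardware and varies over time):

```python
import time
from dataclasses import dataclass

EPSILON_MS = 7.0  # illustrative clock-uncertainty bound

@dataclass
class TTInterval:
    earliest: float  # ms since epoch
    latest: float

def truetime_now() -> TTInterval:
    """Sketch: an interval guaranteed to contain the true time."""
    now_ms = time.time() * 1000
    return TTInterval(now_ms - EPSILON_MS, now_ms + EPSILON_MS)

def commit_wait(commit_timestamp: float):
    """Block until commit_timestamp is definitely in the past on
    every clock in the system, making the timestamp externally valid."""
    while truetime_now().earliest <= commit_timestamp:
        time.sleep(0.001)
```

The cost of global consistency is thus roughly 2x epsilon of added commit latency, which is why Spanner invests in tight clock bounds.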

Code Patterns

Back-of-Envelope Capacity Planning

def estimate_storage_needs(
    daily_active_users: int,
    actions_per_user_per_day: int,
    bytes_per_action: int,
    retention_days: int,
    replication_factor: int = 3
) -> dict:
    """Jeff Dean-style capacity estimation."""
    
    daily_bytes = daily_active_users * actions_per_user_per_day * bytes_per_action
    total_bytes = daily_bytes * retention_days * replication_factor
    
    return {
        "daily_raw_gb": daily_bytes / (1024**3),
        "total_storage_tb": total_bytes / (1024**4),
        # Bandwidth is estimated on raw (unreplicated) bytes.
        "monthly_bandwidth_tb": (daily_bytes * 30) / (1024**4),
        # Assumes 1 TiB of usable storage per machine.
        "estimated_machines_1tb_each": total_bytes / (1024**4),
    }

# Example: 100M DAU, 10 actions/day, 1KB each, 90 day retention
# = 270 TB storage, ~300 machines (with replication)

Sharding Strategy

import bisect
import hashlib

class ConsistentHashRing:
    """Distribute data across nodes with minimal reshuffling."""
    
    def __init__(self, nodes: list[str], virtual_nodes: int = 150):
        self.ring: dict[int, str] = {}
        
        # Virtual nodes smooth out the distribution: each physical node
        # owns many small arcs of the ring instead of one large one.
        for node in nodes:
            for i in range(virtual_nodes):
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        
        self.sorted_keys: list[int] = sorted(self.ring)
    
    def get_node(self, key: str) -> str:
        """Find the node responsible for this key."""
        if not self.ring:
            raise ValueError("Empty ring")
        
        # Binary search for the first ring position >= hash(key),
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self.sorted_keys, self._hash(key))
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]
    
    def _hash(self, key: str) -> int:
        # MD5 is used for speed and uniformity, not security.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

Retry with Exponential Backoff

import random
import time
from typing import TypeVar, Callable

T = TypeVar('T')

def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay_ms: int = 100,
    max_delay_ms: int = 10000,
) -> T:
    """
    Retry with exponential backoff and jitter.
    
    At Google scale, thundering herds kill systems.
    Jitter prevents synchronized retries.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            
            delay = min(base_delay_ms * (2 ** attempt), max_delay_ms)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep((delay + jitter) / 1000)
    
    raise RuntimeError("Unreachable")

Mental Model

Jeff Dean approaches problems with:

  1. Quantify first: How much data? How many QPS? What latency budget?
  2. Identify bottlenecks: Where will the system break first?
  3. Design for failure: What happens when (not if) components fail?
  4. Simplify ruthlessly: Can this be simpler while still meeting requirements?
  5. Plan for evolution: Today's solution should be replaceable in 3 years
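
Step 1, "quantify first", is usually a few lines of arithmetic; the peak factor of 3 below is an assumption, not a universal constant:

```python
def estimate_qps(
    daily_active_users: int,
    requests_per_user_per_day: int,
    peak_factor: float = 3.0,  # assumed peak-to-average ratio
) -> dict:
    """Rough QPS estimate: total daily requests spread over 86,400 seconds,
    with a multiplier for peak traffic."""
    average = daily_active_users * requests_per_user_per_day / 86_400
    return {"average_qps": average, "peak_qps": average * peak_factor}

# 100M DAU x 10 requests/day → ~11.6k average QPS, ~35k peak
```

With that number in hand, the bottleneck question (step 2) becomes concrete: can one shard serve 35k QPS, or do we need to partition?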

The Google Design Doc

1. Context & Scope
   - What problem are we solving? Why now?
   
2. Goals and Non-Goals
   - What this system WILL do
   - What this system explicitly WON'T do
   
3. Design
   - System architecture
   - Data model
   - API
   
4. Alternatives Considered
   - What else could we do? Why not?
   
5. Cross-cutting Concerns
   - Security, privacy, monitoring, rollout
   
6. Open Questions
   - What don't we know yet?

Warning Signs

You're violating Dean's principles if:

  • You don't know your system's p50, p99, and p999 latencies
  • You haven't done back-of-envelope capacity planning
  • Your system has no strategy for partial failure
  • You're optimizing without profiling data
  • You designed for current load, not 10x growth
  • You can't explain where every millisecond goes
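
The first warning sign is cheap to fix: a nearest-rank percentile helper (illustrative, not a substitute for real monitoring) shows why tail latency and the median tell different stories:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; p is in [0, 100]."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(len(ranked) * p / 100))
    return ranked[idx]

latencies_ms = [12, 15, 11, 14, 250, 13, 12, 16, 11, 900]
p50 = percentile(latencies_ms, 50)  # 14 ms: looks healthy
p99 = percentile(latencies_ms, 99)  # 900 ms: the tail users actually feel
```

At scale, a request fanning out to 100 backends hits each backend's tail almost every time, which is why p99 and p999 matter more than the average.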

Additional Resources