Claude-skill-registry distribution-search

Guidance for finding probability distributions that satisfy specific statistical constraints such as KL divergence targets, entropy requirements, or moment conditions. This skill should be used when tasks involve constructing discrete or continuous probability distributions with specified divergence measures, entropy values, or other distributional properties through numerical optimization.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/distribution-search" ~/.claude/skills/majiayu000-claude-skill-registry-distribution-search && rm -rf "$T"

manifest: skills/data/distribution-search/SKILL.md

Distribution Search

Overview

This skill provides systematic approaches for finding probability distributions that meet specific statistical constraints. Common tasks include constructing distributions with target KL divergence values (forward or backward), specified entropy, moment constraints, or combinations thereof. The approach emphasizes mathematical analysis before implementation, efficient parameterization, modular code structure, and rigorous verification.

When to Use This Skill

Finding distributions with specific KL divergence values (forward or backward)
Constructing distributions with target entropy
Searching for distributions satisfying moment constraints
Optimization problems involving probability mass/density functions
Any task requiring numerical search over distribution parameters

Methodology

Phase 1: Mathematical Analysis Before Coding

Before writing any code, thoroughly analyze the mathematical constraints:

1. Constraint Feasibility

Determine if a solution exists given the constraints
Calculate bounds on achievable values (e.g., max entropy for given support)
Identify necessary conditions for solution existence
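
As a minimal sketch of such a feasibility check, assume the reference distribution Q is uniform over V elements. Then D_KL(P || Q) = log(V) - H(P) lies in [0, log(V)], D_KL(Q || P) is non-negative with no finite upper bound, and either divergence is zero only when P = Q. The function names below are hypothetical, and the conditions are necessary rather than sufficient.

```python
import math

def forward_kl_range(vocab_size):
    """Achievable range of D_KL(P || Q), assuming Q is uniform."""
    return 0.0, math.log(vocab_size)

def is_feasible(target_forward_kl, target_backward_kl, vocab_size):
    """Necessary (not sufficient) conditions for a solution to exist."""
    lo, hi = forward_kl_range(vocab_size)
    forward_ok = lo <= target_forward_kl <= hi
    backward_ok = target_backward_kl >= 0.0  # unbounded above for uniform Q
    # Both divergences vanish only when P == Q, so they are zero together.
    zero_together = (target_forward_kl == 0.0) == (target_backward_kl == 0.0)
    return forward_ok and backward_ok and zero_together

print(is_feasible(10.0, 10.0, 150_000))  # log(150000) is about 11.92
print(is_feasible(12.5, 10.0, 150_000))  # forward target exceeds log(V)
```

Running a check like this first rules out targets that no parameterization can reach.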

2. Degrees of Freedom Analysis

Count the number of free parameters needed
Determine if simple parameterizations (e.g., two-group distributions) have sufficient flexibility
Plan for more complex parameterizations if needed

3. Analytical Derivations

Derive any closed-form relationships that constrain the search
For KL divergence: when Q is uniform over a vocabulary of size V, H(P) = log(V) - D_KL(P || Q), so a forward KL target fixes the entropy of P
Use analytical results to narrow the search space
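
The entropy identity above can be checked numerically on a small example. This is only an illustrative sketch using the standard library; the helper names are hypothetical and Q is assumed uniform.

```python
import math

def entropy(p):
    """H(P) = -sum_i p_i * log(p_i), skipping zero entries."""
    return -sum(x * math.log(x) for x in p if x > 0)

def forward_kl_uniform(p):
    """D_KL(P || Q) for Q uniform over len(p) elements."""
    v = len(p)
    return sum(x * math.log(x * v) for x in p if x > 0)

p = [0.5, 0.25, 0.125, 0.125]
lhs = entropy(p)
rhs = math.log(len(p)) - forward_kl_uniform(p)
print(lhs, rhs)  # both equal 1.75 * log(2) up to rounding
```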

Phase 2: Efficient Parameterization

Start Simple, Plan for Complexity

Two-group distributions: Divide elements into high-probability and low-probability groups
- Parameters: k (number of high-prob elements), p_high, p_low
- Constraint: k * p_high + (V - k) * p_low = 1
Multi-group distributions: If two groups are insufficient, add more groups
- More degrees of freedom allow satisfying more constraints
Continuous parameterizations: For smooth optimization landscapes
- Softmax over logits
- Exponential family parameterizations
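
One possible way to parameterize a two-group distribution is by k and the log of the ratio p_high / p_low, which keeps both probabilities positive and enforces the normalization constraint by construction. The sketch below is an assumption about how a helper like `compute_probs(k, log_ratio, vocab_size)` could be written; the function name and the choice of log-ratio are illustrative, not taken from the skill itself.

```python
import math

def two_group_probs(k, log_ratio, vocab_size):
    """Return (p_high, p_low) with p_high / p_low = exp(log_ratio).

    Substituting p_high = r * p_low into k * p_high + (V - k) * p_low = 1
    gives p_low = 1 / (k * r + V - k).
    """
    ratio = math.exp(log_ratio)  # overflows for log_ratio above roughly 709
    p_low = 1.0 / (k * ratio + (vocab_size - k))
    p_high = ratio * p_low
    return p_high, p_low

p_high, p_low = two_group_probs(k=10, log_ratio=5.0, vocab_size=150_000)
total = 10 * p_high + (150_000 - 10) * p_low
print(p_high, p_low, total)
```

With this choice the only continuous free parameter for a fixed k is `log_ratio`, which makes the degrees-of-freedom count explicit.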

Computational Efficiency for Large Vocabularies

For large vocabulary sizes (e.g., V = 150,000):

Avoid creating full arrays when closed-form calculations exist

Use analytical formulas for group-based distributions:

Forward KL = k * p_high * log(p_high * V) + (V - k) * p_low * log(p_low * V)

Only create full arrays for final verification
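
The following sketch compares the closed-form two-group formulas with a brute-force sum over every element, on a vocabulary small enough to enumerate. It assumes Q is uniform; the function names are hypothetical, and only the standard library is used so the check runs without NumPy.

```python
import math

def forward_kl_two_group(k, p_high, p_low, vocab_size):
    """Closed-form D_KL(P || Q) for two-group P and uniform Q."""
    return (k * p_high * math.log(p_high * vocab_size)
            + (vocab_size - k) * p_low * math.log(p_low * vocab_size))

def backward_kl_two_group(k, p_high, p_low, vocab_size):
    """Closed-form D_KL(Q || P) for two-group P and uniform Q."""
    return -(k * math.log(vocab_size * p_high)
             + (vocab_size - k) * math.log(vocab_size * p_low)) / vocab_size

V, k = 1000, 10
p_high = 0.05
p_low = (1.0 - k * p_high) / (V - k)  # from the normalization constraint

# Brute-force reference: build the full distribution once, for checking only.
p = [p_high] * k + [p_low] * (V - k)
q = 1.0 / V
fwd_full = sum(x * math.log(x / q) for x in p)
bwd_full = sum(q * math.log(q / x) for x in p)

fwd_closed = forward_kl_two_group(k, p_high, p_low, V)
bwd_closed = backward_kl_two_group(k, p_high, p_low, V)
print(abs(fwd_full - fwd_closed), abs(bwd_full - bwd_closed))
```

The closed forms cost a constant number of operations regardless of V, which is why they suit the inner loop of a search.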

Phase 3: Optimization Strategy

Choose Appropriate Methods

Direct analytical solution: When constraints reduce to solvable equations
Root-finding (fsolve): When the constraints can be written as a system F(x) = 0 with as many equations as unknowns
Least squares (least_squares): When minimizing squared constraint violations
Gradient-free optimization (Nelder-Mead): When derivatives are unavailable or noisy
Grid search over discrete parameters: For parameters like k (number of elements in a group)

Implementation Pattern

def objective(params, target_forward_kl, target_backward_kl, vocab_size):
    # Extract parameters
    k, log_ratio = params
    k = int(round(k))

    # Compute probabilities
    p_high, p_low = compute_probs(k, log_ratio, vocab_size)

    # Validate probabilities
    if p_high <= 0 or p_low <= 0 or p_high > 1 or p_low > 1:
        return [1e10, 1e10]  # Infeasible

    # Compute KL divergences using closed-form formulas
    forward_kl = compute_forward_kl(k, p_high, p_low, vocab_size)
    backward_kl = compute_backward_kl(k, p_high, p_low, vocab_size)

    return [forward_kl - target_forward_kl, backward_kl - target_backward_kl]

Grid Search for Discrete Parameters

best_solution = None
best_error = float('inf')

for k in range(1, vocab_size):
    # Optimize continuous parameters for this k
    result = optimize_continuous_params(k, targets, vocab_size)

    if result.error < best_error:
        best_error = result.error
        best_solution = result
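
A self-contained sketch of this pattern for two-group distributions against a uniform Q follows. For each k it matches the forward KL by bisection on the log-ratio (forward KL grows monotonically with the log-ratio when it is non-negative), then scores k by the remaining backward KL error. The hand-written bisection stands in for a library root finder such as SciPy's `brentq`; all names, the search bracket `hi=50.0`, and the iteration count are assumptions made for illustration.

```python
import math

def two_group_kls(k, log_ratio, vocab_size):
    """Closed-form (forward, backward) KL of a two-group P against uniform Q."""
    ratio = math.exp(log_ratio)
    p_low = 1.0 / (k * ratio + (vocab_size - k))
    p_high = ratio * p_low
    fwd = (k * p_high * math.log(p_high * vocab_size)
           + (vocab_size - k) * p_low * math.log(p_low * vocab_size))
    bwd = -(k * math.log(p_high * vocab_size)
            + (vocab_size - k) * math.log(p_low * vocab_size)) / vocab_size
    return fwd, bwd

def solve_log_ratio(k, target_forward, vocab_size, hi=50.0, iters=200):
    """Bisection on log_ratio in [0, hi] so that forward KL hits the target.

    Returns None when this k cannot reach the target within the bracket.
    """
    if two_group_kls(k, hi, vocab_size)[0] < target_forward:
        return None
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if two_group_kls(k, mid, vocab_size)[0] < target_forward:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def grid_search(target_forward, target_backward, vocab_size):
    """Try every k; keep the one with the smallest backward KL error."""
    best = None
    for k in range(1, vocab_size):
        log_ratio = solve_log_ratio(k, target_forward, vocab_size)
        if log_ratio is None:
            continue
        fwd, bwd = two_group_kls(k, log_ratio, vocab_size)
        error = abs(bwd - target_backward)
        if best is None or error < best["error"]:
            best = {"k": k, "log_ratio": log_ratio,
                    "forward_kl": fwd, "backward_kl": bwd, "error": error}
    return best

# Targets generated from a known two-group distribution, so a solution exists.
target_fwd, target_bwd = two_group_kls(10, 3.0, 1000)
best = grid_search(target_fwd, target_bwd, 1000)
print(best)
```

Because k is an integer, arbitrary target pairs are generally matched only approximately; the remaining `error` indicates whether a two-group family is flexible enough or more groups are needed.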

Phase 4: Code Organization

Modular Structure to Prevent Inconsistencies

Create separate, reusable functions for core computations:

# Core computation functions - define ONCE, use everywhere
import numpy as np

def forward_kl(p, q, mask=None):
    """Compute D_KL(P || Q) = sum_i p_i * log(p_i / q_i)"""
    if mask is None:
        mask = p > 1e-30  # terms with p_i = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def backward_kl(p, q, mask=None):
    """Compute D_KL(Q || P) = sum_i q_i * log(q_i / p_i)"""
    if mask is None:
        # Caution: this skips entries where p_i is (near) zero. Those terms are
        # mathematically infinite when q_i > 0, so the result is only valid
        # when p is strictly positive wherever q is.
        mask = p > 1e-30
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

def entropy(p, mask=None):
    """Compute H(P) = -sum_i p_i * log(p_i)"""
    if mask is None:
        mask = p > 1e-30  # terms with p_i = 0 contribute 0
    return -np.sum(p[mask] * np.log(p[mask]))

Import in All Scripts

# In optimization script
from kl_utils import forward_kl, backward_kl

# In verification script - use SAME functions
from kl_utils import forward_kl, backward_kl

Phase 5: Verification

Verification Checklist

For the final solution, verify:

Distribution Properties:
[ ] All probabilities are positive
[ ] All probabilities are <= 1
[ ] Sum of probabilities equals 1.0 (within floating-point tolerance)
[ ] No NaN or Inf values

Constraint Satisfaction:
[ ] Forward KL divergence within tolerance
[ ] Backward KL divergence within tolerance
[ ] Other constraints (entropy, moments) within tolerance

Numerical Precision:
[ ] Tolerance requirements are met (e.g., |error| < 1e-6)
[ ] Floating-point sum is acceptably close to 1.0

Verification Script Structure

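import numpy as np
from kl_utils import forward_kl, backward_kl  # same shared functions as the optimizer

def verify_distribution(p, q, target_forward, target_backward, tol=1e-6):
    # Distribution properties from the checklist. Reusing tol for the sum
    # check is an assumption; tighten it if the task requires.
    finite_ok = bool(np.all(np.isfinite(p)))
    range_ok = bool(np.all(p > 0) and np.all(p <= 1))
    sum_ok = bool(abs(np.sum(p) - 1.0) < tol)

    print(f"Sum of probabilities: {np.sum(p)}")
    print(f"Min probability: {np.min(p)}")
    print(f"Max probability: {np.max(p)}")
    print(f"Any NaN: {np.any(np.isnan(p))}")
    print(f"Any Inf: {np.any(np.isinf(p))}")

    fwd = forward_kl(p, q)
    bwd = backward_kl(p, q)

    print(f"\nForward KL: {fwd:.10f} (target: {target_forward}, error: {abs(fwd - target_forward):.2e})")
    print(f"Backward KL: {bwd:.10f} (target: {target_backward}, error: {abs(bwd - target_backward):.2e})")

    fwd_ok = abs(fwd - target_forward) < tol
    bwd_ok = abs(bwd - target_backward) < tol

    print(f"\nDistribution properties valid: {'PASS' if finite_ok and range_ok and sum_ok else 'FAIL'}")
    print(f"Forward KL within tolerance: {'PASS' if fwd_ok else 'FAIL'}")
    print(f"Backward KL within tolerance: {'PASS' if bwd_ok else 'FAIL'}")

    # Fail if any property check or any constraint check fails
    return finite_ok and range_ok and sum_ok and fwd_ok and bwd_ok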

Common Pitfalls

Pitfall 1: Full Array Creation for Large Vocabularies

Problem: Building full arrays of size V (e.g., V = 150,000) on every optimizer evaluation is slow and can lead to memory issues and timeouts
Solution: Use closed-form formulas for group-based distributions; only create full arrays for final verification

Pitfall 2: Inconsistent Formula Implementations

Problem: Different scripts implement KL divergence formulas differently, leading to discrepancies
Solution: Define core computation functions once and import them everywhere

Pitfall 3: Incorrect Masking in KL Divergence

Problem: Masking logic differs between forward and backward KL, or the mask count is used in place of a sum over the masked elements
Solution: Use consistent masking (p > 1e-30) and sum over the masked elements rather than multiplying by the mask count

Pitfall 4: Insufficient Degrees of Freedom

Problem: Simple parameterizations cannot satisfy all constraints simultaneously
Solution: Analyze degrees of freedom before implementation; plan for more flexible parameterizations

Pitfall 5: Syntax Errors from Truncated Writes

Problem: File writes are truncated, leaving incomplete code
Solution: Verify file content after every write by reading it back or attempting to import/execute

Pitfall 6: No Feasibility Analysis

Problem: Attempting optimization without verifying a solution exists
Solution: Mathematically analyze constraints to establish feasibility before coding

Pitfall 7: Convergence to Local Minima

Problem: Optimization finds a local minimum that doesn't satisfy constraints
Solution: Try multiple initializations; use grid search over discrete parameters; verify final solution

Pitfall 8: Floating-Point Precision Issues

Problem: Probability sum not exactly 1.0 due to floating-point arithmetic
Solution: Use appropriate tolerances; normalize probabilities after construction; verify precision is acceptable for the task
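
A small sketch of normalizing after construction and then checking the sum against a tolerance instead of testing for exact equality. The helper name and the tolerance `1e-12` are assumptions; `math.fsum` is used because it accumulates floating-point sums with less rounding error than a naive loop.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = math.fsum(weights)
    return [w / total for w in weights]

raw = [3.0, 1.0, 1.0, 0.5]  # unnormalized weights
p = normalize(raw)

# Compare against a tolerance rather than requiring the sum to be exactly 1.0.
sum_ok = math.isclose(math.fsum(p), 1.0, abs_tol=1e-12)
print(p, sum_ok)
```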

KL Divergence Reference

Definitions

Forward KL (information projection):

D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))

Backward KL (moment projection):

D_KL(Q || P) = sum_i Q(i) * log(Q(i) / P(i))

Properties

KL divergence is non-negative: D_KL >= 0
KL divergence is asymmetric: D_KL(P || Q) != D_KL(Q || P) in general
When Q is uniform over V elements: D_KL(P || Q) = log(V) - H(P)
KL divergence can be infinite if P has support where Q is zero
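
The asymmetry and the infinite case can be seen on a three-element example. This is an illustrative sketch with a hypothetical helper `kl(a, b)` that computes D_KL(A || B) and returns infinity explicitly instead of masking the offending term.

```python
import math

def kl(a, b):
    """D_KL(A || B); returns inf when A has mass where B is zero."""
    total = 0.0
    for x, y in zip(a, b):
        if x == 0.0:
            continue  # 0 * log(0 / y) is taken as 0
        if y == 0.0:
            return math.inf
        total += x * math.log(x / y)
    return total

p = [0.5, 0.5, 0.0]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl(p, q))  # finite: equals log(1.5)
print(kl(q, p))  # infinite: q has mass on the third element, p does not
```

This is also why a search that needs a finite D_KL(Q || P) must keep every p_i strictly positive wherever Q has mass.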

Closed-Form for Two-Group Distributions

For P with k elements at probability p_high and (V-k) elements at probability p_low, with Q uniform:

D_KL(P || Q) = k * p_high * log(p_high * V) + (V - k) * p_low * log(p_low * V)
D_KL(Q || P) = (1/V) * [k * log(1 / (V * p_high)) + (V - k) * log(1 / (V * p_low))]
             = (1/V) * [-k * log(V * p_high) - (V - k) * log(V * p_low)]

Iterative Refinement Pattern

When initial approaches fail:

Diagnose the failure: Understand why constraints aren't satisfied
Check mathematical feasibility: Re-verify that a solution exists
Increase flexibility: Add more parameters or groups
Adjust optimization method: Try different solvers or initialization strategies
Verify incrementally: Test each component in isolation before integration

Avoid rewriting everything from scratch on each attempt; instead, change only the specific component that failed and keep the verified parts intact.