```shell
git clone https://github.com/Intense-Visions/harness-engineering

T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/codex/db-query-statistics" ~/.claude/skills/intense-visions-harness-engineering-db-query-statistics-fab8f2 && rm -rf "$T"
```
agents/skills/codex/db-query-statistics/SKILL.md

Query Statistics and Selectivity
How the planner uses table statistics (pg_stats, histograms, most-common-values) to estimate row counts and choose execution plans.
When to Use
- Diagnosing planner misestimations (estimated rows vs actual rows differ by 10x+)
- Understanding why the planner chose a bad plan despite correct indexes
- Tuning statistics targets for columns with skewed distributions
- After bulk data loads, migrations, or large deletes that change data distribution
- Investigating queries where EXPLAIN shows correct indexes but wrong join strategies
Instructions
Key Concepts
The planner does not look at actual table data at query time. Instead, it uses pre-computed statistics stored in
pg_statistic (accessible via the pg_stats view). These statistics are sampled approximations, not exact counts.
Key statistics columns in pg_stats:
- n_distinct -- estimated number of distinct values. Positive values are absolute counts; negative values are fractions of total rows (e.g., -1.0 means every row is unique)
- most_common_vals -- the N most frequent values in the column
- most_common_freqs -- the frequency of each most-common value (fractions of total rows)
- histogram_bounds -- equal-frequency histogram bucket boundaries for values not in the MCV list
- null_frac -- fraction of rows that are NULL
- correlation -- how well the physical row order matches the logical (sorted) order. Values near 1.0 or -1.0 mean high correlation
The ANALYZE command samples the table and updates these statistics:
```sql
ANALYZE orders;           -- analyze one table
ANALYZE orders (status);  -- analyze one column
ANALYZE;                  -- analyze all tables in the database
```
Autovacuum runs ANALYZE automatically, but it may lag behind large data changes.
Worked Example
Examining statistics for the status column on an orders table:
```sql
SELECT attname, n_distinct, most_common_vals, most_common_freqs,
       null_frac, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
```
```
 attname | n_distinct |       most_common_vals       | most_common_freqs | null_frac | correlation
---------+------------+------------------------------+-------------------+-----------+-------------
 status  |          3 | {completed,active,cancelled} | {0.85,0.12,0.03}  |         0 |        0.15
```
How the planner uses this: For WHERE status = 'active', the estimated rows = total_rows * 0.12. On a 10M-row table, the estimate is 1.2M rows. For WHERE status = 'cancelled', the estimate is 300K rows.
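The arithmetic above can be sketched in Python. This is an illustrative model of the MCV lookup only, not PostgreSQL's actual estimator; the helper name and numbers come from the worked example:

```python
# Illustrative sketch of an MCV-based row estimate (hypothetical helper,
# not PostgreSQL's real estimation code).
def estimate_rows(total_rows, mcv_vals, mcv_freqs, value):
    """Estimated rows for WHERE col = value, using most-common-value stats."""
    freqs = dict(zip(mcv_vals, mcv_freqs))
    if value in freqs:
        return round(total_rows * freqs[value])
    return None  # non-MCV values fall back to histogram / n_distinct logic

# pg_stats data for orders.status from the example above
mcv_vals = ["completed", "active", "cancelled"]
mcv_freqs = [0.85, 0.12, 0.03]

print(estimate_rows(10_000_000, mcv_vals, mcv_freqs, "active"))     # 1200000
print(estimate_rows(10_000_000, mcv_vals, mcv_freqs, "cancelled"))  # 300000
```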
Misestimation scenario: After a data migration that marked 90% of orders as cancelled (previously 3%), the statistics are stale:
```sql
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'cancelled';
```
```
Index Scan using idx_orders_status on orders  (cost=0.43..12345.67 rows=300000 width=52) (actual time=0.031..8923.450 rows=9000000 loops=1)
```
Estimated 300K rows, actual 9M -- a 30x misestimation. The planner chose Index Scan (good for 300K) instead of Seq Scan (better for 9M). Fix:
```sql
ANALYZE orders;
```
After ANALYZE, the planner sees the updated distribution and switches to Seq Scan:
```
Seq Scan on orders  (cost=0.00..223456.00 rows=9010000 width=52) (actual time=0.012..1234.567 rows=9000000 loops=1)
  Filter: (status = 'cancelled')
Execution Time: 1345.678 ms
```
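To see why the row estimate drives the plan switch, here is a toy cost comparison. All constants are invented for illustration (loosely echoing seq_page_cost=1.0, random_page_cost=4.0, cpu_tuple_cost=0.01); PostgreSQL's real cost model is far more detailed:

```python
# Toy cost model -- illustrative constants only, not PostgreSQL's actual math.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01

def seq_scan_cost(table_pages, table_rows):
    # Read every page sequentially, process every row.
    return table_pages * SEQ_PAGE_COST + table_rows * CPU_TUPLE_COST

def index_scan_cost(matching_rows, pages_per_row=0.05):
    # Assume some clustering: a fraction of a random page fetch per matching row.
    return matching_rows * (pages_per_row * RANDOM_PAGE_COST + CPU_TUPLE_COST)

# Hypothetical 10M-row table stored on ~150K pages.
seq = seq_scan_cost(150_000, 10_000_000)   # same cost regardless of the estimate
idx_small = index_scan_cost(300_000)       # stale estimate: index scan looks cheaper
idx_large = index_scan_cost(9_000_000)     # true row count: seq scan is cheaper
print(idx_small < seq, idx_large > seq)    # True True
```

With the stale 300K estimate the index scan wins the cost comparison; with the true 9M rows the sequential scan wins, which is exactly the switch EXPLAIN shows after ANALYZE.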
Increasing statistics granularity for skewed distributions:
```sql
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
```
The default default_statistics_target is 100 (100 histogram buckets and 100 MCV entries). For highly skewed columns, increasing to 500-1000 gives the planner a more accurate picture.
Anti-Patterns
- Never running ANALYZE after bulk operations. After a large INSERT, DELETE, or UPDATE that changes data distribution, statistics are stale. The planner uses the old distribution, producing bad plans. Always run ANALYZE tablename; after bulk changes.
- Setting default_statistics_target too low. The default of 100 works for uniform distributions but fails for skewed data. Columns with thousands of distinct values at varying frequencies need higher targets.
- Ignoring n_distinct misestimates. For high-cardinality columns (e.g., user_id with 50M distinct values), the sampled n_distinct can be significantly off. Override with ALTER TABLE orders ALTER COLUMN user_id SET (n_distinct = -1); (tells the planner every value is unique).
- Disabling autovacuum. Autovacuum also runs auto-ANALYZE. Disabling it means statistics are never refreshed automatically, guaranteeing plan degradation over time.
PostgreSQL Specifics
default_statistics_target controls the number of histogram buckets and MCV entries. Default: 100. Maximum: 10000. Higher values increase ANALYZE time but improve estimation accuracy:
```sql
-- Global setting:
SET default_statistics_target = 200;
-- Per-column override:
ALTER TABLE events ALTER COLUMN event_type SET STATISTICS 500;
```
Extended statistics for correlated columns (PostgreSQL 10+):
```sql
CREATE STATISTICS stat_orders_status_region (dependencies)
  ON status, region FROM orders;
ANALYZE orders;
```
Without extended statistics, the planner assumes columns are independent. If status = 'active' AND region = 'us-east' are correlated (e.g., 80% of us-east orders are active), the default estimation multiplies their individual frequencies, producing a significant underestimate.
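The size of that underestimate is easy to work through numerically. The frequencies below are hypothetical, chosen to match the 80%-correlated scenario:

```python
# Hypothetical column frequencies for a 10M-row orders table.
total_rows = 10_000_000
sel_active = 0.12     # P(status = 'active')
sel_us_east = 0.25    # P(region = 'us-east')
actual_cond = 0.80    # P(status = 'active' | region = 'us-east'): correlated

# Default planner assumption: independence -> multiply selectivities.
independent_est = total_rows * sel_active * sel_us_east

# Reality: condition on region, then apply the conditional frequency.
actual_rows = total_rows * sel_us_east * actual_cond

print(round(independent_est), round(actual_rows))  # 300000 2000000
```

The independence assumption predicts 300K rows where 2M actually match: a ~7x underestimate, enough to push the planner toward the wrong join strategy.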
Monitoring ANALYZE freshness:
```sql
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';
```
If last_autoanalyze is days old and n_dead_tup is high, autovacuum is falling behind.
Details
Advanced Topics
Multivariate statistics evolution:
- PostgreSQL 10: functional dependencies (dependencies)
- PostgreSQL 12: MCV lists for column combinations (mcv)
- PostgreSQL 14: expression statistics (expressions)
```sql
CREATE STATISTICS stat_orders_multi (dependencies, mcv)
  ON status, region, tenant_id FROM orders;
```
Selectivity estimation for complex predicates. AND clauses multiply individual selectivities (assuming independence). OR clauses use inclusion-exclusion. NOT inverts selectivity. These assumptions break down for correlated columns -- extended statistics fix this.
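Those combination rules can be stated directly. This is a simplified sketch of the independence-based defaults, not PostgreSQL's actual clause-selectivity code:

```python
# Simplified selectivity combinators under the independence assumption.
def and_selectivity(s1, s2):
    return s1 * s2            # P(A and B) = P(A) * P(B)

def or_selectivity(s1, s2):
    return s1 + s2 - s1 * s2  # inclusion-exclusion

def not_selectivity(s):
    return 1.0 - s

# WHERE status = 'active' (sel 0.12) AND region = 'us-east' (sel 0.25)
print(and_selectivity(0.12, 0.25))  # ~0.03
# WHERE status = 'active' OR region = 'us-east'
print(or_selectivity(0.12, 0.25))   # ~0.34
# WHERE NOT (status = 'active')
print(not_selectivity(0.12))        # ~0.88
```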
The 1/n_distinct fallback. For values not in the MCV list and not in the histogram range, PostgreSQL estimates selectivity as 1/n_distinct. This is a rough guess and can be significantly wrong for new or rare values.
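A sketch of that fallback, using the simplified 1/n_distinct rule the text describes (PostgreSQL's actual lookup also subtracts the MCV and NULL frequency mass before dividing, so this is an approximation; the event_type figures are hypothetical):

```python
# Simplified equality-selectivity lookup: MCV hit, else 1/n_distinct fallback.
def eq_selectivity(value, mcv_vals, mcv_freqs, n_distinct):
    freqs = dict(zip(mcv_vals, mcv_freqs))
    if value in freqs:
        return freqs[value]       # exact sampled frequency for common values
    return 1.0 / n_distinct       # rough guess for values outside the MCV list

# Hypothetical stats for an event_type column with 500 distinct values.
mcv_vals = ["page_view", "click"]
mcv_freqs = [0.40, 0.25]

print(eq_selectivity("page_view", mcv_vals, mcv_freqs, n_distinct=500))   # 0.4
# A new or rare value not in the MCV list gets the crude fallback:
print(eq_selectivity("rare_event", mcv_vals, mcv_freqs, n_distinct=500))  # 0.002
```

If "rare_event" actually accounts for 10% of rows, the 0.002 fallback is a 50x underestimate: exactly the kind of error that a larger statistics target (more MCV entries) reduces.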
pg_statistic_ext stores extended statistics data. Query it to verify that your extended statistics are being computed:
```sql
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;
```
Engine Differences
MySQL uses ANALYZE TABLE (not just ANALYZE) to update statistics:
```sql
ANALYZE TABLE orders;
```
MySQL stores statistics in mysql.innodb_index_stats and mysql.innodb_table_stats. Key differences:
- Persistent statistics (innodb_stats_persistent = ON by default since 5.6): statistics survive restarts
- Sampling pages (innodb_stats_persistent_sample_pages, default 20): controls the sample size for ANALYZE. Much smaller than PostgreSQL's default sample of 300 * default_statistics_target rows (30,000 at the default target of 100)
- Histogram support added in MySQL 8.0: ANALYZE TABLE t UPDATE HISTOGRAM ON col WITH 100 BUCKETS;
- No extended/multivariate statistics -- MySQL assumes column independence for all multi-column predicates
MySQL's optimizer uses a simpler statistics model overall. For complex queries with correlated columns, PostgreSQL's extended statistics provide significantly better estimates.
Real-World Case Studies
Reporting system with daily batch inserts of 5M rows. After each nightly batch, morning report queries degraded from 2 seconds to 30+ seconds. Investigation showed that autovacuum's ANALYZE had not run since the batch completed -- the statistics still reflected the previous day's distribution. Adding ANALYZE reporting_events; to the batch job's post-load step ensured fresh statistics. Query performance remained consistent at 2 seconds. Selectivity estimation error dropped from 100x (stale stats) to under 2x (fresh stats).
Source
Process
- Read the key concepts to understand how the planner uses pg_stats for selectivity estimation.
- Apply ANALYZE after bulk data changes, and increase statistics targets for columns with skewed distributions.
- Verify by checking pg_stats for your key columns and comparing EXPLAIN estimated rows to actual rows.
Harness Integration
- Type: knowledge -- this skill is a reference document, not a procedural workflow.
- No tools or state -- consumed as context by other skills and agents.
- related_skills: db-explain-reading, db-scan-types, db-query-rewriting, db-btree-index
Success Criteria
- Statistics freshness is verified after bulk operations (ANALYZE runs in post-load steps).
- Misestimations are diagnosed by comparing pg_stats values to actual data distribution.
- Extended statistics are created for correlated columns that cause multi-column estimation errors.