Harness-engineering db-query-statistics

Query Statistics and Selectivity

install

source · Clone the upstream repo

git clone https://github.com/Intense-Visions/harness-engineering

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/codex/db-query-statistics" ~/.claude/skills/intense-visions-harness-engineering-db-query-statistics-fab8f2 && rm -rf "$T"

manifest: agents/skills/codex/db-query-statistics/SKILL.md

source content

Query Statistics and Selectivity

How the planner uses table statistics (pg_stats, histograms, most-common-values) to estimate row counts and choose execution plans.

When to Use

Diagnosing planner misestimations (estimated rows vs actual rows differ by 10x+)
Understanding why the planner chose a bad plan despite correct indexes
Tuning statistics targets for columns with skewed distributions
After bulk data loads, migrations, or large deletes that change data distribution
Investigating queries where EXPLAIN shows correct indexes but wrong join strategies

Instructions

Key Concepts

The planner does not look at actual table data at query time. Instead, it uses pre-computed statistics stored in

pg_statistic

(accessible via the

pg_stats

view). These statistics are sampled approximations, not exact counts.

Key statistics columns in

pg_stats

```
n_distinct
```
-- estimated number of distinct values. Positive values are absolute counts; negative values are fractions of total rows (e.g., -1.0 means every row is unique)
```
most_common_vals
```
-- the N most frequent values in the column
```
most_common_freqs
```
-- the frequency of each most-common value (fractions of total rows)
```
histogram_bounds
```
-- equal-frequency histogram bucket boundaries for values not in the MCV list
```
null_frac
```
-- fraction of rows that are NULL
```
correlation
```
-- how well the physical row order matches the logical (sorted) order. Values near 1.0 or -1.0 mean high correlation

The ANALYZE command samples the table and updates these statistics:

ANALYZE orders;                    -- analyze one table
ANALYZE orders (status);            -- analyze one column
ANALYZE;                            -- analyze all tables in the database

Autovacuum runs ANALYZE automatically, but it may lag behind large data changes.

Worked Example

Examining statistics for the

status

column on an orders table:

SELECT
  attname,
  n_distinct,
  most_common_vals,
  most_common_freqs,
  null_frac,
  correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

 attname | n_distinct | most_common_vals            | most_common_freqs        | null_frac | correlation
---------+------------+-----------------------------+--------------------------+-----------+------------
 status  |          3 | {completed,active,cancelled} | {0.85,0.12,0.03}         |         0 |       0.15

How the planner uses this: For

WHERE status = 'active'

, the estimated rows = total_rows * 0.12. On a 10M-row table, the estimate is 1.2M rows. For

WHERE status = 'cancelled'

, the estimate is 300K rows.

Misestimation scenario: After a data migration that marked 90% of orders as cancelled (previously 3%), the statistics are stale:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'cancelled';

Index Scan using idx_orders_status on orders
  (cost=0.43..12345.67 rows=300000 width=52)
  (actual time=0.031..8923.450 rows=9000000 loops=1)

Estimated 300K rows, actual 9M -- a 30x misestimation. The planner chose Index Scan (good for 300K) instead of Seq Scan (better for 9M). Fix:

ANALYZE orders;

After ANALYZE, the planner sees the updated distribution and switches to Seq Scan:

Seq Scan on orders (cost=0.00..223456.00 rows=9010000 width=52)
  (actual time=0.012..1234.567 rows=9000000 loops=1)
  Filter: (status = 'cancelled')
Execution Time: 1345.678 ms

Increasing statistics granularity for skewed distributions:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

The default

default_statistics_target

is 100 (100 histogram buckets and 100 MCV entries). For highly skewed columns, increasing to 500-1000 gives the planner a more accurate picture.

Anti-Patterns

Never running ANALYZE after bulk operations. After a large INSERT, DELETE, or UPDATE that changes data distribution, statistics are stale. The planner uses the old distribution, producing bad plans. Always
```
ANALYZE tablename;
```
after bulk changes.
Setting default_statistics_target too low. The default of 100 works for uniform distributions but fails for skewed data. Columns with thousands of distinct values at varying frequencies need higher targets.
Ignoring n_distinct misestimates. For high-cardinality columns (e.g., user_id with 50M distinct values), the sampled n_distinct can be significantly off. Override with:
```
ALTER TABLE orders ALTER COLUMN user_id SET (n_distinct = -1);
```
(tells the planner every value is unique).
Disabling autovacuum. Autovacuum also runs auto-ANALYZE. Disabling it means statistics are never refreshed automatically, guaranteeing plan degradation over time.

PostgreSQL Specifics

default_statistics_target controls the number of histogram buckets and MCV entries. Default: 100. Maximum: 10000. Higher values increase ANALYZE time but improve estimation accuracy:

-- Global setting:
SET default_statistics_target = 200;

-- Per-column override:
ALTER TABLE events ALTER COLUMN event_type SET STATISTICS 500;

Extended statistics for correlated columns (PostgreSQL 10+):

CREATE STATISTICS stat_orders_status_region (dependencies)
ON status, region FROM orders;
ANALYZE orders;

Without extended statistics, the planner assumes columns are independent. If

status = 'active'

AND

region = 'us-east'

are correlated (e.g., 80% of us-east orders are active), the default estimation multiplies their individual frequencies, producing a significant underestimate.

Monitoring ANALYZE freshness:

SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';

last_autoanalyze

is days old and

n_dead_tup

is high, autovacuum is falling behind.

Details

Advanced Topics

Multivariate statistics evolution:

PostgreSQL 10: functional dependencies (
```
dependencies
```
)
PostgreSQL 12: MCV lists for column combinations (
```
mcv
```
)
PostgreSQL 14: expression statistics (
```
expressions
```
)

CREATE STATISTICS stat_orders_multi (dependencies, mcv)
ON status, region, tenant_id FROM orders;

Selectivity estimation for complex predicates. AND clauses multiply individual selectivities (assuming independence). OR clauses use inclusion-exclusion. NOT inverts selectivity. These assumptions break down for correlated columns -- extended statistics fix this.

The 1/n_distinct fallback. For values not in the MCV list and not in the histogram range, PostgreSQL estimates selectivity as 1/n_distinct. This is a rough guess and can be significantly wrong for new or rare values.

pg_statistic_ext stores extended statistics data. Query it to verify that your extended statistics are being computed:

SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;

Engine Differences

MySQL uses

ANALYZE TABLE

(not just

ANALYZE

) to update statistics:

ANALYZE TABLE orders;

MySQL stores statistics in

mysql.innodb_index_stats

and

mysql.innodb_table_stats

. Key differences:

Persistent statistics (
```
innodb_stats_persistent = ON
```
by default since 5.6): statistics survive restarts
Sampling pages (
```
innodb_stats_persistent_sample_pages
```
, default 20): controls sample size for ANALYZE. Much smaller than PostgreSQL's default of 30,000 *
```
default_statistics_target
```
rows

Histogram support added in MySQL 8.0:

ANALYZE TABLE t UPDATE HISTOGRAM ON col WITH 100 BUCKETS;

No extended/multivariate statistics -- MySQL assumes column independence for all multi-column predicates

MySQL's optimizer uses a simpler statistics model overall. For complex queries with correlated columns, PostgreSQL's extended statistics provide significantly better estimates.

Real-World Case Studies

Reporting system with daily batch inserts of 5M rows. After each nightly batch, morning report queries degraded from 2 seconds to 30+ seconds. Investigation showed that autovacuum's ANALYZE had not run since the batch completed -- the statistics still reflected the previous day's distribution. Adding

ANALYZE reporting_events;

to the batch job's post-load step ensured fresh statistics. Query performance remained consistent at 2 seconds. Selectivity estimation error dropped from 100x (stale stats) to under 2x (fresh stats).

Source

Process

Read the key concepts to understand how the planner uses pg_stats for selectivity estimation.
Apply ANALYZE after bulk data changes, and increase statistics targets for columns with skewed distributions.
Verify by checking
```
pg_stats
```
for your key columns and comparing EXPLAIN estimated rows to actual rows.

Harness Integration

Type: knowledge -- this skill is a reference document, not a procedural workflow.
No tools or state -- consumed as context by other skills and agents.
related_skills: db-explain-reading, db-scan-types, db-query-rewriting, db-btree-index

Success Criteria

Statistics freshness is verified after bulk operations (
```
ANALYZE
```
runs in post-load steps).
Misestimations are diagnosed by comparing pg_stats values to actual data distribution.
Extended statistics are created for correlated columns that cause multi-column estimation errors.