Gsd-skill-creator descriptive-statistics

Summarizing and visualizing data through measures of center, spread, shape, and position. Covers mean, median, mode, range, IQR, variance, standard deviation, percentiles, z-scores, five-number summaries, and graphical displays (histograms, box plots, stem-and-leaf, dot plots, scatter plots). Emphasizes choosing the right summary for the data's shape and scale. Use when summarizing datasets, choosing visualizations, computing summary statistics, or interpreting distributions.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/statistics/descriptive-statistics" ~/.claude/skills/tibsfox-gsd-skill-creator-descriptive-statistics && rm -rf "$T"
manifest: examples/skills/statistics/descriptive-statistics/SKILL.md
source content

Descriptive Statistics

Descriptive statistics distills raw data into interpretable summaries. Before any inference, modeling, or prediction, a dataset must be described: its center, its spread, its shape, and its notable features. This skill covers the full toolkit of numerical summaries and graphical displays that turn observations into understanding.

Agent affinity: pearson (measures of association), gosset (small-sample summaries), george (pedagogy)

Concept IDs: stat-descriptive-statistics, stat-data-visualization

Measures of Center

The center of a distribution answers "where is the typical value?"

| Measure | Definition | Best for | Sensitive to |
|---|---|---|---|
| Mean | Sum of values divided by count | Symmetric distributions | Outliers, skew |
| Median | Middle value when sorted | Skewed distributions, ordinal data | Nothing -- robust |
| Mode | Most frequent value | Categorical data, multimodal distributions | Ties, bin width |
| Trimmed mean | Mean after removing k% from each tail | Distributions with mild outliers | Choice of trim percentage |

Decision rule. Use the mean when the distribution is roughly symmetric with no extreme outliers. Use the median when the distribution is skewed or contains outliers. Use the mode for categorical data or when identifying the most common category matters more than a numeric center.

When they diverge. If mean and median differ substantially, the distribution is probably skewed. Mean > median usually indicates right skew (the mean is pulled up by high values); mean < median usually indicates left skew. The gap between them, judged relative to the spread, is a rough skew indicator rather than a formal measure.
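As a minimal sketch of the four center measures, the example below uses Python's standard `statistics` module. The `trimmed_mean` helper and the `incomes` sample are hypothetical, introduced only for illustration.

```python
from statistics import mean, median, mode

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # observations dropped per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

# Hypothetical right-skewed sample: nine modest values and one very large one.
incomes = [30, 32, 35, 35, 38, 40, 42, 45, 48, 250]

print(mean(incomes))               # 59.5, pulled up by the single large value
print(median(incomes))             # 39.0
print(mode(incomes))               # 35
print(trimmed_mean(incomes, 0.1))  # 39.375, one value dropped from each tail
```

Here the mean sits well above the median, which is the pattern the text associates with right skew, while the 10% trimmed mean lands close to the median.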

Measures of Spread

Spread answers "how variable are the values?"

| Measure | Definition | Best for | Properties |
|---|---|---|---|
| Range | Max minus min | Quick sense of total spread | Extremely sensitive to outliers |
| IQR | Q3 minus Q1 | Robust measure alongside median | Ignores tails entirely |
| Variance | Average squared deviation from the mean | Theoretical work, ANOVA | In squared units |
| Standard deviation | Square root of variance | General-purpose spread | Same units as data |
| MAD | Median absolute deviation from the median | Robust alternative to SD | Resistant to outliers |
| Coefficient of variation | SD divided by mean, as percentage | Comparing spread across different scales | Undefined if mean = 0 |

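Sample vs. population. Variance uses n in the denominator for a population and n-1 (Bessel's correction) for a sample. The n-1 correction makes the sample variance an unbiased estimator of the population variance. Standard deviation inherits the choice of denominator, although the square root of an unbiased variance is still a slightly biased estimate of the population SD.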
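The two denominators can be compared directly. This sketch computes both by hand and checks them against `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1); the `data` values are made up for the example.

```python
from statistics import mean, pvariance, variance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)
m = mean(data)

# Sum of squared deviations from the mean, shared by both formulas.
ss = sum((x - m) ** 2 for x in data)

population_var = ss / n        # treats data as the whole population
sample_var = ss / (n - 1)      # Bessel's correction for a sample

print(population_var, pvariance(data))  # 4.0 4
print(sample_var, variance(data))       # both are 32/7, about 4.571
print(pstdev(data))                     # 2.0
```

The sample version is always the larger of the two, and the gap shrinks as n grows.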

Choosing a spread measure. Pair the spread measure with the center measure: mean pairs with SD; median pairs with IQR or MAD. Mixing (e.g., reporting median and SD) obscures the summary because the measures respond to different features of the distribution.
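To illustrate why the robust measures pair with the median, the sketch below replaces one value with an extreme one and compares SD, IQR, and MAD before and after. The helper names `iqr` and `mad` and both samples are hypothetical. Quartiles use the "inclusive" method of `statistics.quantiles`, which is one of several conventions.

```python
from statistics import median, quantiles, stdev

def iqr(values):
    """Interquartile range, Q3 minus Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    m = median(values)
    return median([abs(v - m) for v in values])

clean = [10, 12, 13, 14, 15, 16, 18]
contaminated = [10, 12, 13, 14, 15, 16, 180]  # last value replaced by an outlier

for label, sample in (("clean", clean), ("contaminated", contaminated)):
    print(label, round(stdev(sample), 2), iqr(sample), mad(sample))
```

In this example the IQR (3.0) and MAD (2) are identical for both samples, while the SD grows by more than an order of magnitude.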

Shape

Shape describes the overall pattern of the distribution beyond center and spread.

Symmetry and skewness

  • Symmetric: Mean equals median. The distribution is a mirror image around the center. Examples: normal distribution, uniform distribution.
  • Right-skewed (positive skew): Long right tail. Mean > median. Examples: income, home prices, reaction times.
  • Left-skewed (negative skew): Long left tail. Mean < median. Examples: exam scores when most students do well, age at retirement.

Kurtosis

Kurtosis measures the heaviness of the tails relative to the normal distribution.

  • Mesokurtic (kurtosis = 3): Normal-like tails. Baseline.
  • Leptokurtic (kurtosis > 3): Heavier tails than normal. More extreme values. Examples: financial returns, t-distributions with few degrees of freedom.
  • Platykurtic (kurtosis < 3): Lighter tails than normal. Fewer extreme values. Examples: uniform distribution.

Excess kurtosis subtracts 3, so the normal distribution has excess kurtosis 0. Many software packages report excess kurtosis by default.
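Skewness and kurtosis can be sketched from central moments. The functions below are the plain moment-based (population) versions; statistical packages often apply small-sample bias corrections, so their output may differ slightly. All function names and samples here are illustrative assumptions.

```python
from statistics import fmean

def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    m = fmean(values)
    return fmean([(v - m) ** order for v in values])

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def excess_kurtosis(values):
    return kurtosis(values) - 3  # 0 for a normal distribution

flat = [1, 2, 3, 4, 5]               # symmetric, uniform-like
right_skewed = [1, 1, 2, 2, 3, 10]   # long right tail

print(skewness(flat))          # 0.0
print(kurtosis(flat))          # 1.7, below 3, so platykurtic
print(skewness(right_skewed))  # positive
```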

Modality

  • Unimodal: One peak. Most common shape.
  • Bimodal: Two peaks. Often signals two subpopulations (e.g., heights of adults when sex is not separated).
  • Multimodal: Three or more peaks. Investigate subgroups.

Position Measures

Position answers "where does this observation fall within the distribution?"

Percentiles and quartiles

The pth percentile is the value below which p% of the data falls.

  • Q1 (25th percentile): One quarter of data below.
  • Q2 (50th percentile): The median.
  • Q3 (75th percentile): Three quarters of data below.

Five-number summary: Min, Q1, Median, Q3, Max. The backbone of the box plot.
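A five-number summary can be sketched with `statistics.quantiles`. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method, and the `scores` sample is hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

scores = [55, 60, 62, 70, 71, 75, 80, 88, 95]
print(five_number_summary(scores))  # (55, 62.0, 71.0, 80.0, 95)
```

Switching to `method="exclusive"` (the function's default) can give different quartiles for the same data, which is worth stating when reporting results.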

Z-scores

The z-score of an observation x is: z = (x - mean) / SD.

A z-score of +2 means the observation is 2 standard deviations above the mean. Z-scores allow comparison across different scales ("she scored 2.3 SDs above the mean on the verbal section, 1.8 SDs above on the math section").
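The cross-scale comparison can be sketched as follows. The test-section means, SDs, and scores are invented numbers chosen to reproduce the 2.3 and 1.8 in the example above; `z_score` and `z_scores` are hypothetical helper names.

```python
from statistics import mean, stdev

def z_score(x, center, sd):
    """Standard deviations between x and the center."""
    return (x - center) / sd

def z_scores(values):
    """Standardize a whole sample using its own mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical sections measured on different scales.
verbal = z_score(730, center=500, sd=100)
math_section = z_score(29, center=20, sd=5)

print(verbal)        # 2.3
print(math_section)  # 1.8
```

Although the raw scores (730 and 29) are not comparable, the z-scores are: the verbal result is further above its mean in SD units.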

Chebyshev's inequality. For any distribution with a finite mean and variance, and any k > 1, at least (1 - 1/k^2) of observations fall within k standard deviations of the mean. For k=2: at least 75%. For k=3: at least 88.9%.

Empirical rule (68-95-99.7). For approximately normal distributions: 68% within 1 SD, 95% within 2 SD, 99.7% within 3 SD.
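The two rules can be put side by side: Chebyshev's bound holds for any distribution, while the empirical rule is the exact coverage of a normal distribution. This sketch uses `statistics.NormalDist` (Python 3.8+); the function names are illustrative.

```python
from statistics import NormalDist

def chebyshev_bound(k):
    """Minimum fraction within k SDs of the mean, for any distribution."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def normal_coverage(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    std_normal = NormalDist()  # mean 0, SD 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), round(normal_coverage(k), 4))
# 2 0.75 0.9545
# 3 0.889 0.9973
```

The normal coverage is always at least as large as the Chebyshev bound, because the bound must also hold for much heavier-tailed distributions.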

Graphical Displays

Choosing the right graph

| Graph | Data type | Shows | Use when |
|---|---|---|---|
| Histogram | Quantitative | Distribution shape, center, spread | Exploring a single quantitative variable |
| Box plot | Quantitative | Five-number summary, outliers | Comparing distributions across groups |
| Dot plot | Quantitative (small n) | Individual values | Small datasets where every point matters |
| Stem-and-leaf | Quantitative (small n) | Shape + exact values | Quick classroom display |
| Bar chart | Categorical | Frequencies or proportions | Comparing category counts |
| Scatter plot | Two quantitative | Association pattern | Exploring bivariate relationships |
| Time series plot | Quantitative over time | Trends, seasonality, cycles | Temporal data |

Histogram construction

  1. Choose the number of bins or the bin width. Sturges' rule gives a bin count: k = 1 + 3.322 * log10(n). Scott's rule gives a bin width: 3.49 * SD * n^(-1/3). Freedman-Diaconis gives a bin width: 2 * IQR * n^(-1/3).
  2. Bins must be non-overlapping and exhaustive, and are normally equal width.
  3. With equal-width bins, the y-axis is frequency (count) or relative frequency (proportion). If bin widths differ, plot density (relative frequency / bin width) so that bar area, not bar height, represents the proportion.
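The three bin rules can be sketched as small functions. Rounding bin counts up with `math.ceil` and using the "inclusive" quartile method are assumptions of this example, not part of the rules themselves; the function names and the `sample` data are hypothetical.

```python
import math
from statistics import quantiles, stdev

def sturges_bin_count(n):
    return math.ceil(1 + 3.322 * math.log10(n))

def scott_bin_width(values):
    return 3.49 * stdev(values) * len(values) ** (-1 / 3)

def freedman_diaconis_bin_width(values):
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) * len(values) ** (-1 / 3)

def bin_count_from_width(values, width):
    """Convert a bin width into a bin count covering the data range."""
    return math.ceil((max(values) - min(values)) / width)

sample = list(range(1, 101))  # hypothetical data: 1, 2, ..., 100
print(sturges_bin_count(len(sample)))  # 8
print(bin_count_from_width(sample, scott_bin_width(sample)))
print(bin_count_from_width(sample, freedman_diaconis_bin_width(sample)))
```

The rules need not agree. Comparing their suggestions on the same data is a reasonable way to see how much the histogram's appearance depends on the choice.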

Box plot construction

  1. Draw a box from Q1 to Q3.
  2. Draw a line at the median.
  3. Compute fences: lower = Q1 - 1.5 * IQR, upper = Q3 + 1.5 * IQR.
  4. Whiskers extend to the most extreme data points within the fences.
  5. Observations beyond the fences are plotted individually as outliers.

Modified box plots use the same fences. Some implementations also mark "far outliers" beyond Q1 - 3 * IQR and Q3 + 3 * IQR.
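The construction steps translate into a short computation of the box, fences, whisker ends, and outliers. This is a sketch only: `box_plot_stats` is a hypothetical helper, the quartiles assume the "inclusive" method, and plotting libraries may use other quartile conventions.

```python
from statistics import quantiles

def box_plot_stats(values, k=1.5):
    """Quartiles, fences, whisker ends, and outliers for a box plot."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "fences": (lower_fence, upper_fence),
        # Whiskers stop at the most extreme data points inside the fences.
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }

scores = [55, 60, 62, 70, 71, 75, 80, 88, 150]  # hypothetical data
stats = box_plot_stats(scores)
print(stats["fences"])    # (35.0, 107.0)
print(stats["whiskers"])  # (55, 88)
print(stats["outliers"])  # [150]
```

Calling `box_plot_stats(scores, k=3)` applies the wider "far outlier" fences mentioned above.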

Bivariate Description

Scatter plots and association

A scatter plot places two quantitative variables on the x and y axes. The pattern reveals:

  • Direction: Positive (both increase together) or negative (one increases as the other decreases) or none.
  • Form: Linear, curved, clustered, or no clear pattern.
  • Strength: How tightly points cluster around the pattern.
  • Unusual features: Outliers, gaps, subgroups.

Correlation coefficient (Pearson's r)

r = (sum of (x_i - x_bar)(y_i - y_bar)) / ((n-1) * s_x * s_y)

Properties: -1 <= r <= 1. r = 1 means perfect positive linear association. r = -1 means perfect negative linear association. r = 0 means no linear association (but nonlinear association may still exist).

Critical warnings about r:

  • r measures only linear association. A perfect parabola sampled symmetrically about its vertex has r = 0.
  • r is sensitive to outliers. One extreme point can inflate or deflate r dramatically.
  • r does not imply causation.
  • r is unitless -- it does not depend on the scale of measurement.
  • Always plot the data before computing r. Anscombe's Quartet (four datasets with nearly identical r, about 0.816, but wildly different scatter plots) demonstrates why.
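The formula for r and the first warning can be checked together. This sketch implements r exactly as written above, then applies it to a perfect line and to a parabola sampled symmetrically about its vertex; `pearson_r` and the sample data are illustrative. Python 3.10+ also provides `statistics.correlation`, which is not used here so that the formula stays visible.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r: sum of cross-deviations over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

xs = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * x + 1 for x in xs]   # perfect positive linear association
parabola = [x ** 2 for x in xs]  # perfect, but nonlinear, association

print(round(pearson_r(xs, line), 6))  # 1.0
print(pearson_r(xs, parabola))        # 0.0
```

The parabola is fully determined by x, yet r is 0 because the positive and negative cross-deviations cancel. This is why a scatter plot should accompany any reported r.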

Common Mistakes

| Mistake | Why it fails | Fix |
|---|---|---|
| Reporting mean for skewed data | Mean is pulled by the tail, misrepresents typical value | Use median (or report both with a note about skew) |
| Interpreting correlation as causation | Association is not causation | State the association; identify confounders; reserve causal language for experiments |
| Ignoring outliers | Outliers can drive means, SDs, and correlations | Identify outliers, investigate their source, report with and without |
| Using a pie chart for many categories | Unreadable beyond ~5 categories | Use a bar chart |
| Choosing bins to force a shape | Histogram appearance depends heavily on bin count | Use a principled rule (Sturges, Scott, Freedman-Diaconis) |

Cross-References

  • pearson agent: Correlation coefficients, measures of association. Named for Karl Pearson, after whom Pearson's r is named.
  • gosset agent: Small-sample descriptive summaries and the care required when n is small.
  • george agent: Teaching descriptive statistics through simulation and active learning.
  • probability-theory skill: Theoretical distributions that descriptive statistics empirically approximates.
  • regression-modeling skill: Bivariate description extends into regression when prediction is the goal.
  • statistical-computing skill: Computational tools for calculating these summaries at scale.

References

  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  • Anscombe, F. J. (1973). "Graphs in statistical analysis." The American Statistician, 27(1), 17-21.
  • Freedman, D., Pisani, R., & Purves, R. (2007). Statistics. 4th edition. W.W. Norton.
  • Moore, D. S., McCabe, G. P., & Craig, B. A. (2021). Introduction to the Practice of Statistics. 10th edition. W.H. Freeman.