Claude-skill-registry asymptotic-theory

M-estimation, influence functions, and semiparametric efficiency theory for causal inference

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/asymptotic-theory" ~/.claude/skills/majiayu000-claude-skill-registry-asymptotic-theory && rm -rf "$T"
manifest: skills/data/asymptotic-theory/SKILL.md
source content

Asymptotic Theory

Rigorous framework for statistical inference and efficiency in modern methodology

Use this skill when working on: asymptotic properties of estimators, influence functions, semiparametric efficiency, double robustness, variance estimation, confidence intervals, hypothesis testing, M-estimation, or deriving limiting distributions.


Efficiency Bounds

Semiparametric Efficiency Theory

Cramér-Rao Lower Bound: For any unbiased estimator, $$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$

where $I(\theta)$ is the Fisher information.

Semiparametric Efficiency Bound: The variance of the efficient influence function: $$V_{eff} = E[\phi^*(O)^2]$$

where $\phi^*$ is the efficient influence function (EIF).

Influence Function Notation: $IF(O; \theta, P)$ represents the influence of observation $O$ on parameter $\theta$ under distribution $P$: $$IF(O; \theta, P) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_O) - T(P)}{\epsilon}$$

Semiparametric Variance: For regular asymptotically linear (RAL) estimators, $$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, E[IF(O)^2])$$

Estimating Equations: M-estimators solve $\sum_{i=1}^n \psi(O_i; \theta) = 0$, with asymptotic variance: $$V = \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-1} E[\psi(O; \theta)\psi(O; \theta)^T] \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-T}$$

Efficiency for Mediation Estimands

| Estimand | Efficient Influence Function | Efficiency Bound |
|---|---|---|
| ATE | $\phi_{ATE} = \frac{A}{\pi}(Y-\mu_1) - \frac{1-A}{1-\pi}(Y-\mu_0) + \mu_1 - \mu_0 - \psi$ | $V_{ATE} = E[\phi_{ATE}^2]$ |
| NDE | Complex (VanderWeele & Tchetgen, 2014) | Higher than ATE |
| NIE | Complex (VanderWeele & Tchetgen, 2014) | Higher than ATE |

```r
# Compute semiparametric efficiency bound
compute_efficiency_bound <- function(data, estimand = "ATE") {
  n <- nrow(data)

  if (estimand != "ATE") stop("Only estimand = 'ATE' is implemented.")

  # Estimate nuisance functions
  ps_model <- glm(A ~ X, data = data, family = binomial)
  pi_hat <- predict(ps_model, type = "response")

  mu1_model <- lm(Y ~ X, data = subset(data, A == 1))
  mu0_model <- lm(Y ~ X, data = subset(data, A == 0))

  mu1_hat <- predict(mu1_model, newdata = data)
  mu0_hat <- predict(mu0_model, newdata = data)

  # Efficient influence function
  psi_hat <- mean(mu1_hat - mu0_hat)
  phi <- with(data, {
    A / pi_hat * (Y - mu1_hat) -
      (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
      mu1_hat - mu0_hat - psi_hat
  })

  # Efficiency bound = variance of EIF; SE follows from Var(phi) / n
  list(
    efficiency_bound = var(phi),
    standard_error = sqrt(var(phi) / n),
    eif_values = phi
  )
}
```

Empirical Process Theory

Key Concepts

Empirical Process: $\mathbb{G}_n(f) = \sqrt{n}(\mathbb{P}_n - P)f = \frac{1}{\sqrt{n}}\sum_{i=1}^n (f(O_i) - Pf)$

Uniform Convergence: For function class $\mathcal{F}$, $$\sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| \xrightarrow{d} \sup_{f \in \mathcal{F}} |\mathbb{G}(f)|$$

where $\mathbb{G}$ is a Gaussian process.

Complexity Measures

| Measure | Definition | Use |
|---|---|---|
| VC dimension | Max shattered set size | Classification |
| Covering number | $N(\epsilon, \mathcal{F}, \Vert \cdot \Vert)$ | General classes |
| Bracketing number | $N_{[\,]}(\epsilon, \mathcal{F}, L_2)$ | Entropy bounds |
| Rademacher complexity | $\mathcal{R}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_i \epsilon_i f(X_i)\right]$ | Generalization bounds |

```r
# Estimate Rademacher complexity via Monte Carlo
estimate_rademacher <- function(f_class, data, n_reps = 1000) {
  n <- nrow(data)

  sup_values <- replicate(n_reps, {
    # Random Rademacher signs
    epsilon <- sample(c(-1, 1), n, replace = TRUE)

    # Supremum over the (finite) function class
    max(sapply(f_class, function(f) {
      abs(mean(epsilon * f(data)))
    }))
  })

  mean(sup_values)
}
```

Donsker Classes

Definition and Importance

A function class $\mathcal{F}$ is Donsker if $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$, where $\mathbb{G}$ is a tight Gaussian process.

Key Donsker Classes

| Class | Description | Application |
|---|---|---|
| VC classes | Finite VC dimension | Classification functions |
| Smooth functions | Bounded derivatives | Regression estimators |
| Monotone functions | Single crossings | Distribution functions |
| Lipschitz functions | Bounded variation | M-estimators |

Donsker Theorem Applications

For M-estimation: If $\psi(O, \theta)$ belongs to a Donsker class, then $$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)$$

where $V = (\partial_\theta E[\psi])^{-1} \text{Var}(\psi) (\partial_\theta E[\psi])^{-T}$

```r
# Verify Donsker conditions via the bracketing entropy integral.
# `bracketing_fn(psi_class, data, eps)` must estimate N_[](eps, F, L_2);
# it is user-supplied, since no generic estimator exists.
check_donsker_conditions <- function(psi_class, data, bracketing_fn) {
  epsilon_grid <- seq(0.01, 1, by = 0.01)
  bracket_numbers <- sapply(epsilon_grid, function(eps) {
    bracketing_fn(psi_class, data, eps)
  })

  # Donsker sufficient condition: int_0^1 sqrt(log N_[](eps)) d eps < Inf.
  # Numerically, integrate over the grid, extrapolating flat at the ends.
  integrand <- approxfun(epsilon_grid,
                         sqrt(log(pmax(bracket_numbers, 1))), rule = 2)
  entropy_integral <- integrate(integrand,
                                lower = min(epsilon_grid), upper = 1)

  list(
    is_donsker = is.finite(entropy_integral$value),
    entropy_integral = entropy_integral$value,
    bracket_numbers = data.frame(epsilon = epsilon_grid, N = bracket_numbers)
  )
}
```

Core Concepts

Why Asymptotics?

  1. Exact distributions often unavailable for complex estimators
  2. Large-sample approximations provide tractable inference
  3. Efficiency theory guides optimal estimator construction
  4. Robustness properties clarified through asymptotic analysis

Fundamental Sequence

```
Estimator θ̂ₙ → Consistency → Asymptotic Normality → Efficiency → Inference
                    ↓              ↓                     ↓            ↓
               θ̂ₙ →ᵖ θ₀    √n(θ̂ₙ-θ₀) →ᵈ N(0,V)    V = V_eff    CIs, tests
```

Modes of Convergence

Convergence in Probability ($\xrightarrow{p}$)

$X_n \xrightarrow{p} X$ if $\forall \epsilon > 0$: $P(|X_n - X| > \epsilon) \to 0$

Consistency: $\hat{\theta}_n \xrightarrow{p} \theta_0$

Convergence in Distribution ($\xrightarrow{d}$)

$X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at all continuity points

Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$

Almost Sure Convergence ($\xrightarrow{a.s.}$)

$X_n \xrightarrow{a.s.} X$ if $P(\lim_{n\to\infty} X_n = X) = 1$

Relationship: $\xrightarrow{a.s.} \Rightarrow \xrightarrow{p} \Rightarrow \xrightarrow{d}$

Stochastic Order Notation

| Notation | Meaning | Example |
|---|---|---|
| $O_p(1)$ | Bounded in probability | $\hat{\theta}_n = O_p(1)$ |
| $o_p(1)$ | Converges to 0 in probability | $\hat{\theta}_n - \theta_0 = o_p(1)$ |
| $O_p(a_n)$ | $X_n/a_n = O_p(1)$ | $\hat{\theta}_n - \theta_0 = O_p(n^{-1/2})$ |
| $o_p(a_n)$ | $X_n/a_n = o_p(1)$ | Remainder terms |

Key Theorems

Laws of Large Numbers

Weak LLN: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$: $$\bar{X}_n \xrightarrow{p} E[X]$$

Strong LLN: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$: $$\bar{X}_n \xrightarrow{a.s.} E[X]$$

Uniform LLN: For $\sup_{\theta \in \Theta}$ convergence, need additional conditions (compactness, envelope).

Central Limit Theorem

Classical CLT: If $X_1, \ldots, X_n$ iid with $E[X] = \mu$, $Var(X) = \sigma^2 < \infty$: $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$

Lindeberg-Feller CLT: For a triangular array $\{X_{ni}\}$ with $E[X_{ni}] = 0$ and $\sum_{i=1}^n \text{Var}(X_{ni}) \to \sigma^2$, if the Lindeberg condition holds: $$\sum_{i=1}^n E[X_{ni}^2 \mathbf{1}(|X_{ni}| > \epsilon)] \to 0 \quad \forall \epsilon > 0$$ then $\sum_{i=1}^n X_{ni} \xrightarrow{d} N(0, \sigma^2)$.

Multivariate CLT: $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma)$$
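
A quick simulation makes the approximation concrete; a minimal sketch with illustrative Exp(1) data (any distribution with finite variance works):

```r
# CLT check: standardized means of Exp(1) draws should be ~ N(0, 1)
set.seed(1)
n <- 200
z <- replicate(5000, {
  x <- rexp(n, rate = 1)        # E[X] = 1, Var(X) = 1
  sqrt(n) * (mean(x) - 1)       # standardized root-n error (sigma = 1)
})
c(mean = mean(z), var = var(z), tail = mean(abs(z) > 1.96))  # ~0, ~1, ~0.05
```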

Slutsky's Theorem

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (constant):

  • $X_n + Y_n \xrightarrow{d} X + c$
  • $X_n Y_n \xrightarrow{d} cX$
  • $X_n/Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

Continuous Mapping Theorem

If $X_n \xrightarrow{d} X$ and $g$ continuous: $$g(X_n) \xrightarrow{d} g(X)$$

Delta Method

If $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$ and $g$ differentiable at $\theta_0$: $$\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0)) \xrightarrow{d} N(0, g'(\theta_0)^\top V g'(\theta_0))$$

Multivariate: Replace $g'(\theta_0)$ with Jacobian matrix.
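
A minimal worked example for $g(\theta) = \log \theta$ applied to a sample mean; the simulated data are purely illustrative:

```r
# Delta-method CI for log(E[X]): g'(theta) = 1/theta
set.seed(2)
x <- rexp(500, rate = 2)                 # true mean theta_0 = 0.5
theta_hat <- mean(x)
se_theta  <- sd(x) / sqrt(length(x))     # SE of the sample mean
se_g <- se_theta / theta_hat             # |g'(theta_hat)| * SE(theta_hat)
c(estimate = log(theta_hat),
  lower = log(theta_hat) - 1.96 * se_g,
  upper = log(theta_hat) + 1.96 * se_g)
```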


M-Estimation Theory

Setup

Estimator $\hat{\theta}_n$ solves: $$\hat{\theta}_n = \arg\max_{\theta \in \Theta} M_n(\theta)$$

where $M_n(\theta) = n^{-1} \sum_{i=1}^n m(O_i; \theta)$

Consistency Conditions

  1. Uniform convergence: $\sup_\theta |M_n(\theta) - M(\theta)| \xrightarrow{p} 0$
  2. Identification: $M(\theta)$ uniquely maximized at $\theta_0$
  3. Compactness: $\Theta$ compact (or the maximizer well-separated from the boundary)

Result: $\hat{\theta}_n \xrightarrow{p} \theta_0$

Asymptotic Normality Conditions

  1. $\theta_0$ interior point of $\Theta$
  2. $M(\theta)$ twice differentiable at $\theta_0$
  3. $\ddot{M}(\theta_0)$ non-singular
  4. $\sqrt{n} \dot{M}_n(\theta_0) \xrightarrow{d} N(0, V)$

Result: $$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, [-\ddot{M}(\theta_0)]^{-1} V [-\ddot{M}(\theta_0)]^{-1})$$

Standard Errors

Sandwich estimator: $$\hat{V} = \hat{A}^{-1} \hat{B} \hat{A}^{-1}$$

where:

  • $\hat{A} = -n^{-1} \sum_i \ddot{m}(O_i; \hat{\theta}_n)$ (Hessian)
  • $\hat{B} = n^{-1} \sum_i \dot{m}(O_i; \hat{\theta}_n) \dot{m}(O_i; \hat{\theta}_n)^\top$ (outer product)
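
A sketch of the sandwich computation for logistic regression, assembling $\hat{A}$ and $\hat{B}$ directly from the Hessian and score contributions; the data frame and column names in the usage comment are hypothetical:

```r
# Sandwich SEs for logistic regression coefficients
sandwich_se <- function(fit) {
  X <- model.matrix(fit)
  y <- fit$y                                   # 0/1 response
  p <- fitted(fit)                             # fitted probabilities
  n <- nrow(X)

  A <- crossprod(X, X * (p * (1 - p))) / n     # -Hessian / n
  scores <- X * (y - p)                        # psi(O_i; theta_hat) row-wise
  B <- crossprod(scores) / n                   # outer product of scores / n

  V <- solve(A) %*% B %*% solve(A)             # A^{-1} B A^{-1}
  sqrt(diag(V) / n)
}

# Usage (hypothetical data frame `dat` with binary `y` and covariate `x`):
# fit <- glm(y ~ x, data = dat, family = binomial)
# cbind(model_se = summary(fit)$coefficients[, 2], robust_se = sandwich_se(fit))
```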

Influence Functions

Definition

The influence function of a functional $T(P)$ at distribution $P$ is: $$\phi(o) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_o) - T(P)}{\epsilon}$$

where $\delta_o$ is point mass at $o$.

Properties

  1. Mean zero: $E_P[\phi(O)] = 0$
  2. Variance = asymptotic variance: If $\sqrt{n}(\hat{T}_n - T) \xrightarrow{d} N(0, V)$, then $V = E[\phi(O)^2]$
  3. Linearization: $\sqrt{n}(\hat{T}_n - T) = \sqrt{n} \mathbb{P}_n[\phi] + o_p(1)$

Examples

| Functional | Influence Function |
|---|---|
| Mean $E[Y]$ | $\phi(y) = y - E[Y]$ |
| Variance $\text{Var}(Y)$ | $\phi(y) = (y - \mu)^2 - \sigma^2$ |
| Quantile $Q_p$ | $\phi(y) = \frac{p - \mathbf{1}(y \leq Q_p)}{f(Q_p)}$ |
| Regression coefficient | $\phi = E[XX^\top]^{-1} X(Y - X^\top\beta)$ |
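
As a quick check of property 2 above, a minimal simulation for the variance functional: plug in the estimated influence values and compute the standard error as $\sqrt{\hat{E}[\phi^2]/n}$. Simulated data for illustration only:

```r
# IF-based standard error for the variance functional Var(Y)
set.seed(3)
y <- rnorm(1000, mean = 2, sd = 3)
mu_hat <- mean(y)
sigma2_hat <- mean((y - mu_hat)^2)      # plug-in variance estimate

phi <- (y - mu_hat)^2 - sigma2_hat      # estimated influence function values
se_if <- sqrt(mean(phi^2) / length(y))  # SE = sqrt(E-hat[phi^2] / n)
c(sigma2_hat = sigma2_hat, se_if = se_if)
```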

Deriving Influence Functions

Method 1: Gateaux derivative (definition)

Method 2: Estimating equation approach. If $\hat{\theta}$ solves $\mathbb{P}_n[\psi(O; \theta)] = 0$, then: $$\phi(O) = -E[\partial_\theta \psi]^{-1} \psi(O; \theta_0)$$

Method 3: Functional delta method For $\psi = g(T_1, T_2, \ldots)$: $$\phi_\psi = \sum_j \frac{\partial g}{\partial T_j} \phi_{T_j}$$
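
A small illustration of Method 3 for the ratio $\psi = E[Y]/E[X]$, composing the component influence functions with the partials of $g(T_1, T_2) = T_1/T_2$; the simulated data are illustrative:

```r
# Functional delta method: IF of psi = E[Y] / E[X]
set.seed(4)
x <- runif(800, 1, 3)
y <- 2 * x + rnorm(800)
T1 <- mean(y); T2 <- mean(x)
psi_hat <- T1 / T2

# phi_psi = (1/T2) * (y - T1) - (T1/T2^2) * (x - T2)
phi <- (1 / T2) * (y - T1) - (T1 / T2^2) * (x - T2)
se <- sqrt(mean(phi^2) / length(x))
c(psi_hat = psi_hat, se = se)
```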


Semiparametric Efficiency

Semiparametric Models

Model $\mathcal{P}$ contains distributions satisfying: $$\theta = \Psi(P), \quad P \in \mathcal{P}$$

The "nuisance" is infinite-dimensional (e.g., unknown baseline distribution).

Tangent Space

Parametric submodels: One-dimensional smooth paths $\{P_t : t \in \mathbb{R}\}$ through $P_0$.

Score: $S = \partial_t \log p_t \big|_{t=0}$

Tangent space $\mathcal{T}$: Closed linear span of all such scores.

Efficiency Bound

The efficient influence function (EIF) is the projection of any influence function onto the tangent space.

Semiparametric efficiency bound: $$V_{eff} = E[\phi_{eff}(O)^2]$$

No regular estimator can have asymptotic variance smaller than $V_{eff}$.

Achieving Efficiency

An estimator is semiparametrically efficient if its influence function equals the EIF: $$\phi_{\hat{\theta}} = \phi_{eff}$$

Strategies:

  1. Solve efficient score equation
  2. Targeted learning (TMLE)
  3. One-step estimator with EIF-based correction (sketched below)
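
A minimal sketch of strategy 3 for the ATE: add the empirical mean of the estimated EIF to the g-computation plug-in. The nuisance inputs are assumed to come from fits like those in `compute_efficiency_bound()` above; for the ATE this correction reproduces the AIPW estimator of the next section.

```r
# One-step estimator: plug-in + mean of estimated EIF
one_step_ate <- function(Y, A, pi_hat, mu1_hat, mu0_hat) {
  psi_plugin <- mean(mu1_hat - mu0_hat)        # g-computation plug-in
  eif <- A / pi_hat * (Y - mu1_hat) -
    (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
    mu1_hat - mu0_hat - psi_plugin
  psi_plugin + mean(eif)                       # debiased one-step estimate
}
```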

Double Robustness

Concept

An estimator is doubly robust if it is consistent when either:

  • Outcome model correctly specified, OR
  • Treatment model (propensity score) correctly specified

AIPW Estimator

For ATE $\psi = E[Y(1) - Y(0)]$:

$$\hat{\psi}_{DR} = \mathbb{P}_n\left[\frac{A(Y - \hat{\mu}_1(X))}{\hat{\pi}(X)} + \hat{\mu}_1(X)\right] - \mathbb{P}_n\left[\frac{(1-A)(Y - \hat{\mu}_0(X))}{1-\hat{\pi}(X)} + \hat{\mu}_0(X)\right]$$

where:

  • $\hat{\mu}_a(X) = \hat{E}[Y|A=a,X]$ (outcome model)
  • $\hat{\pi}(X) = \hat{P}(A=1|X)$ (propensity score)
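
A minimal implementation with parametric nuisance models, assuming a data frame with columns `A`, `X`, `Y` as in the earlier efficiency-bound sketch; the IF-based standard error comes for free:

```r
# AIPW / doubly robust estimator of the ATE
aipw_ate <- function(data) {
  pi_hat  <- predict(glm(A ~ X, data = data, family = binomial),
                     type = "response")
  mu1_hat <- predict(lm(Y ~ X, data = subset(data, A == 1)), newdata = data)
  mu0_hat <- predict(lm(Y ~ X, data = subset(data, A == 0)), newdata = data)

  phi1 <- with(data, A * (Y - mu1_hat) / pi_hat + mu1_hat)
  phi0 <- with(data, (1 - A) * (Y - mu0_hat) / (1 - pi_hat) + mu0_hat)

  psi_hat <- mean(phi1 - phi0)
  se_hat  <- sd(phi1 - phi0) / sqrt(nrow(data))   # IF-based standard error
  c(ate = psi_hat, se = se_hat,
    lower = psi_hat - 1.96 * se_hat, upper = psi_hat + 1.96 * se_hat)
}
```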

Why It Works

Bias decomposition: $$\hat{\psi}_{DR} - \psi = \text{(outcome error)} \times \text{(propensity error)} + o_p(n^{-1/2})$$

If either error is zero, bias is zero.

Efficiency Under Double Robustness

When both models correct:

  • Achieves semiparametric efficiency bound
  • Asymptotic variance = $E[\phi_{eff}^2]$

When one model wrong:

  • Still consistent
  • But less efficient than when both correct

Variance Estimation

Analytic (Sandwich)

$$\hat{V} = \frac{1}{n} \sum_{i=1}^n \hat{\phi}(O_i)^2$$

where $\hat{\phi}$ is estimated influence function.

Bootstrap

Nonparametric bootstrap:

  1. Resample $n$ observations with replacement
  2. Compute $\hat{\theta}^*_b$ for $b = 1, \ldots, B$
  3. $\hat{V} = \text{Var}(\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B)$

Bootstrap validity: Requires $\sqrt{n}$-consistent, regular estimators.
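
A generic sketch of steps 1-3, assuming `estimator` maps a data frame to a scalar estimate (e.g. `function(d) aipw_ate(d)[["ate"]]` from the sketch above):

```r
# Nonparametric bootstrap SE and percentile CI
bootstrap_se <- function(data, estimator, B = 1000) {
  theta_star <- replicate(B, {
    idx <- sample(nrow(data), replace = TRUE)   # resample rows with replacement
    estimator(data[idx, , drop = FALSE])
  })
  list(se = sd(theta_star),
       ci_percentile = quantile(theta_star, c(0.025, 0.975)))
}
```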

Influence Function-Based Bootstrap

More stable than full recomputation: $$\hat{\theta}^*_b = \hat{\theta} + n^{-1} \sum_{i=1}^n (W_i^* - 1) \hat{\phi}(O_i)$$

where $W_i^*$ are bootstrap weights.
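
A minimal sketch using mean-one exponential weights (one common choice; multinomial or Poisson weights also appear in practice), given estimated influence values `phi_hat`:

```r
# IF-based (multiplier) bootstrap: perturb via weights, no refitting
multiplier_bootstrap <- function(theta_hat, phi_hat, B = 1000) {
  n <- length(phi_hat)
  replicate(B, {
    W <- rexp(n)                          # E[W] = 1, Var(W) = 1
    theta_hat + mean((W - 1) * phi_hat)   # perturbed draw theta*_b
  })
}
# Usage: draws <- multiplier_bootstrap(psi_hat, phi)
#        quantile(draws, c(0.025, 0.975))
```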


Inference

Confidence Intervals

Wald interval: $$\hat{\theta} \pm z_{1-\alpha/2} \cdot \hat{SE}$$

Percentile bootstrap: $$[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]$$

BCa bootstrap (bias-corrected accelerated): Corrects for bias and skewness.

Hypothesis Testing

Wald test: $W = (\hat{\theta} - \theta_0)^2 / \hat{V} \sim \chi^2_1$

Score test: Based on score at null.

Likelihood ratio test: $2(\ell(\hat{\theta}) - \ell(\theta_0)) \sim \chi^2_k$


Product of Coefficients (Mediation)

Setup

Mediation effect = $\alpha \beta$ (or $\alpha_1 \beta_1 \gamma_2$ for sequential)

Distribution of Products

Not normal: Product of normals is NOT normal.

Exact distribution: Complex (involves Bessel functions for two normals).

Approximations:

  1. Sobel test: Normal approximation via delta method
  2. PRODCLIN: Distribution of product method (RMediation)
  3. Monte Carlo: Simulate from joint distribution

Delta Method Variance

For $\psi = \alpha\beta$: $$Var(\hat{\alpha}\hat{\beta}) \approx \beta^2 Var(\hat{\alpha}) + \alpha^2 Var(\hat{\beta}) + Var(\hat{\alpha})Var(\hat{\beta})$$

The last term often omitted (Sobel) but matters when effects are small.
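
A small helper contrasting the Sobel SE, the version with the extra term, and a Monte Carlo interval (approximation 3 above), under the assumption that $\hat{\alpha}$ and $\hat{\beta}$ are independent and approximately normal:

```r
# Sobel vs. full-variance SE vs. Monte Carlo CI for psi = alpha * beta
mediation_product_ci <- function(a, se_a, b, se_b, n_mc = 1e5) {
  se_sobel <- sqrt(b^2 * se_a^2 + a^2 * se_b^2)                  # delta method
  se_full  <- sqrt(b^2 * se_a^2 + a^2 * se_b^2 + se_a^2 * se_b^2)

  # Monte Carlo: simulate the product's sampling distribution
  draws <- rnorm(n_mc, a, se_a) * rnorm(n_mc, b, se_b)
  list(estimate = a * b, se_sobel = se_sobel, se_full = se_full,
       ci_mc = quantile(draws, c(0.025, 0.975)))
}
# Example: mediation_product_ci(a = 0.4, se_a = 0.1, b = 0.3, se_b = 0.12)
```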

Product of Three

For sequential mediation $\psi = \alpha_1 \beta_1 \gamma_2$:

  • Distribution more complex
  • Monte Carlo or specialized methods needed
  • Your "product of three" manuscript addresses this

Regularity Conditions Checklist

For Consistency

  • Parameter space compact (or bounded away from boundary)
  • Objective function continuous in $\theta$
  • Uniform convergence of criterion
  • Unique maximizer at $\theta_0$

For Asymptotic Normality

  • $\theta_0$ interior point
  • Twice differentiable criterion
  • Non-singular Hessian
  • CLT applies to score
  • Lindeberg/Lyapunov conditions if non-iid

For Efficiency

  • Model correctly specified
  • Nuisance parameters consistently estimated
  • Sufficient smoothness for influence function calculation
  • Rate conditions on nuisance estimation (for doubly robust)

Common Pitfalls

1. Ignoring Estimation of Nuisance Parameters

Wrong: Treat $\hat{\eta}$ as known when computing variance. Right: Account for $\hat{\eta}$ uncertainty or use cross-fitting.
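
A minimal cross-fitting sketch for the ATE (same hypothetical `A`, `X`, `Y` columns as above): nuisances are fit out-of-fold, and the EIF is evaluated on the held-out fold:

```r
# Cross-fitted AIPW: avoids Donsker-type restrictions on nuisance estimators
crossfit_ate <- function(data, K = 2) {
  folds <- sample(rep(1:K, length.out = nrow(data)))
  phi <- numeric(nrow(data))

  for (k in 1:K) {
    train <- data[folds != k, ]
    idx <- folds == k
    pi_hat  <- predict(glm(A ~ X, data = train, family = binomial),
                       newdata = data[idx, ], type = "response")
    mu1_hat <- predict(lm(Y ~ X, data = subset(train, A == 1)),
                       newdata = data[idx, ])
    mu0_hat <- predict(lm(Y ~ X, data = subset(train, A == 0)),
                       newdata = data[idx, ])
    phi[idx] <- with(data[idx, ],
      A * (Y - mu1_hat) / pi_hat + mu1_hat -
        ((1 - A) * (Y - mu0_hat) / (1 - pi_hat) + mu0_hat))
  }
  c(ate = mean(phi), se = sd(phi) / sqrt(nrow(data)))
}
```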

2. Slow Nuisance Estimation

For doubly robust estimators, one needs: $$\|\hat{\mu} - \mu_0\| \cdot \|\hat{\pi} - \pi_0\| = o_p(n^{-1/2})$$

If both nuisances converge at rates faster than $n^{-1/4}$, the product is $o_p(n^{-1/2})$.

3. Bootstrap Failure

Bootstrap can fail for:

  • Non-differentiable functionals
  • Super-efficient estimators
  • Boundary parameters

4. Underestimating Variance

Sandwich estimator assumes correct influence function. Model misspecification → wrong variance.


Template: Asymptotic Result

```latex
\begin{theorem}[Asymptotic Distribution]
Under Assumptions \ref{A1}--\ref{An}:
\begin{enumerate}
\item (Consistency) $\hat{\theta}_n \xrightarrow{p} \theta_0$
\item (Asymptotic normality) $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$
\item (Variance) $V = E[\phi(O)^2]$ where $\phi$ is the influence function
\item (Variance estimation) $\hat{V} \xrightarrow{p} V$
\end{enumerate}
\end{theorem}

\begin{proof}
\textbf{Step 1 (Consistency):}
[Apply M-estimation or direct argument]

\textbf{Step 2 (Expansion):}
Taylor expand around $\theta_0$:
\[
0 = \mathbb{P}_n[\psi(O; \hat{\theta})] = \mathbb{P}_n[\psi(O; \theta_0)]
    + \mathbb{P}_n[\dot{\psi}(\tilde{\theta})](\hat{\theta} - \theta_0)
\]

\textbf{Step 3 (Rearrangement):}
\[
\sqrt{n}(\hat{\theta} - \theta_0) = -[\mathbb{P}_n[\dot{\psi}]]^{-1} \sqrt{n}\mathbb{P}_n[\psi(O; \theta_0)]
\]

\textbf{Step 4 (CLT):}
$\sqrt{n}\mathbb{P}_n[\psi(O; \theta_0)] \xrightarrow{d} N(0, E[\psi\psi^\top])$ by CLT.

\textbf{Step 5 (Slutsky):}
$\mathbb{P}_n[\dot{\psi}] \xrightarrow{p} E[\dot{\psi}]$ by WLLN. Apply Slutsky.

\textbf{Step 6 (Identify $V$):}
$V = E[\dot{\psi}]^{-1} E[\psi\psi^\top] E[\dot{\psi}]^{-\top}$.
\end{proof}
```

Integration with Other Skills

This skill works with:

  • proof-architect - For structuring asymptotic proofs
  • identification-theory - Identification precedes estimation/inference
  • simulation-architect - Validate asymptotic approximations
  • methods-paper-writer - Present results in manuscripts

Key References

  • Bickel, P.J., Klaassen, C.A.J., Ritov, Y. & Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models

  • Newey, W.K. (1990). Semiparametric Efficiency Bounds. Journal of Applied Econometrics

  • Robins, J.M., Rotnitzky, A. & Zhao, L.P. (1994). Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. JASA

  • van der Vaart, A.W. (1998). Asymptotic Statistics

  • Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data

  • Kennedy, E.H. (2016). Semiparametric Theory and Empirical Processes

  • van der Laan, M.J. & Rose, S. (2011). Targeted Learning


Version: 1.0 Created: 2025-12-08 Domain: Asymptotic Statistics, Semiparametric Inference