# Walk-Forward Validation

Walk-forward validation framework for trading strategies and ML models with time-series-aware splits, overfit detection, and regime-aware validation.

## Installation

```bash
git clone https://github.com/NeverSight/learn-skills.dev
```

Or install the skill directly into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agiprolabs/claude-trading-skills/walk-forward-validation" ~/.claude/skills/neversight-learn-skills-dev-walk-forward-validation && rm -rf "$T"
```

Skill source: `data/skills-md/agiprolabs/claude-trading-skills/walk-forward-validation/SKILL.md`
Walk-forward validation framework for trading strategies and ML models. Standard cross-validation (k-fold, random splits) fails catastrophically for financial time series because it introduces lookahead bias and ignores autocorrelation. This skill covers proper time-series validation techniques including rolling and expanding windows, purged cross-validation, combinatorial purged cross-validation (CPCV), and overfit detection metrics.
## Why Standard Cross-Validation Fails
Standard k-fold CV assumes data points are independent and identically distributed (IID). Financial time series violate both assumptions:
- Lookahead bias — Random splits let the model train on future data and predict past data, artificially inflating performance.
- Autocorrelation — Adjacent observations are correlated. A random split that puts Monday in test and Tuesday in train leaks information.
- Regime dependence — Markets shift between regimes. A model trained on a bull market and tested on a bull market tells you nothing about bear market performance.
- Label overlap — If labels are computed over windows (e.g., 24h forward return), adjacent train/test samples share label computation periods, leaking information.
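As a quick illustration of the lookahead problem, here is a minimal sketch (assuming scikit-learn is installed): a shuffled k-fold on an ordered series trains on observations that come after the test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 ordered observations; the index doubles as a timestamp.
X = np.arange(100).reshape(-1, 1)

# First fold of a shuffled 5-fold split.
train_idx, test_idx = next(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# The training fold contains bars later than the earliest test bar,
# i.e. the model is fit on the future. This prints True.
print(train_idx.max() > test_idx.min())
```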
## Walk-Forward Framework

### Rolling Window (Fixed Train Size)
The train window has a fixed size and slides forward in time. This is preferred when you believe older data is less relevant (common in crypto).
```
Window 1: [===TRAIN===][=TEST=]
Window 2:        [===TRAIN===][=TEST=]
Window 3:               [===TRAIN===][=TEST=]
```
Parameters:
- `train_size`: Number of bars/days in the training window
- `test_size`: Number of bars/days in the test window
- `step_size`: How far to advance between folds (often equals `test_size`)
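A minimal sketch of how these parameters could drive index generation (`rolling_splits` is a hypothetical helper, not part of this skill's scripts):

```python
def rolling_splits(n_samples: int, train_size: int, test_size: int, step_size: int):
    """Yield (train_indices, test_indices) for a fixed-size rolling window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step_size

# 365 daily bars, 90-day train, 14-day test, advancing one test window per fold.
folds = list(rolling_splits(365, train_size=90, test_size=14, step_size=14))
print(len(folds))  # 19 folds
```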
### Expanding Window (Growing Train)
The train window starts at the beginning and expands forward. This uses all available historical data, which helps when data is scarce.
```
Window 1: [==TRAIN==][=TEST=]
Window 2: [====TRAIN====][=TEST=]
Window 3: [======TRAIN======][=TEST=]
```
Parameters:
- `min_train_size`: Minimum training samples before first fold
- `test_size`: Fixed test window size
- `step_size`: How far to advance between folds
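The expanding variant only changes where training starts, which stays pinned at the first bar. A hypothetical sketch mirroring `rolling_splits` above:

```python
def expanding_splits(n_samples: int, min_train_size: int, test_size: int, step_size: int):
    """Yield (train_indices, test_indices) for an expanding window."""
    train_end = min_train_size
    while train_end + test_size <= n_samples:
        yield list(range(0, train_end)), list(range(train_end, train_end + test_size))
        train_end += step_size
```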
### Choosing Between Them
| Factor | Rolling | Expanding |
|---|---|---|
| Data recency | Prioritizes recent data | Uses all history |
| Regime changes | Better adapts to new regimes | May dilute recent regime |
| Sample size | Fixed, may be small | Grows over time |
| Crypto preference | Preferred for < 6mo horizons | Better for regime-stable models |
## Purging and Embargo

### Purging
Remove training samples whose labels overlap with the test set's time range. If a label is computed as the 24h forward return starting at time `t`, any training sample where `t + 24h` extends into the test period must be purged.
```python
def purge_train_indices(
    train_idx: list[int],
    test_start: int,
    label_horizon: int,
    timestamps: list[int],
) -> list[int]:
    """Remove train samples whose label windows overlap test period."""
    test_start_time = timestamps[test_start]
    return [
        i
        for i in train_idx
        if timestamps[i] + label_horizon < test_start_time
    ]
```
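For example (timestamps and horizon in milliseconds; all values hypothetical):

```python
# Hourly bars, 24h label horizon, test set starting at bar 800.
timestamps = [i * 3_600_000 for i in range(1_000)]  # toy ms timestamps
train_idx = list(range(800))
purged = purge_train_indices(
    train_idx,
    test_start=800,
    label_horizon=24 * 3_600_000,
    timestamps=timestamps,
)
print(len(purged))  # 776 -- the last 24 training bars are purged
```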
### Embargo
Add a buffer gap between the end of training and start of testing to account for serial correlation that purging alone does not eliminate.
```
[===TRAIN===][--EMBARGO--][=TEST=]
```
Typical embargo sizes:
- 1-minute bars: 60–240 bars (1–4 hours)
- 5-minute bars: 12–48 bars (1–4 hours)
- Hourly bars: 6–24 bars (6–24 hours)
- Daily bars: 2–5 bars (2–5 days)
- Crypto rule of thumb: Embargo >= 2x the label computation horizon
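A minimal sketch of applying an embargo on top of purging, assuming integer bar indices (`apply_embargo` is a hypothetical helper):

```python
def apply_embargo(train_idx: list[int], test_start: int, embargo_bars: int) -> list[int]:
    """Drop training bars inside the embargo buffer just before the test window."""
    return [i for i in train_idx if i < test_start - embargo_bars]

# Hourly bars with a 12-bar embargo: training bars 788-799 are dropped.
train_idx = list(range(800))
print(len(apply_embargo(train_idx, test_start=800, embargo_bars=12)))  # 788
```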
## Combinatorial Purged Cross-Validation (CPCV)
CPCV (Lopez de Prado, 2018) generates all possible train/test combinations from `N` contiguous groups while maintaining temporal ordering. This produces far more test paths than standard walk-forward, enabling statistical tests for overfitting.

Key properties:
- Splits data into `N` contiguous groups
- For each combination of `k` test groups, the remaining `N - k` groups form the training set
- Applies purging and embargo at each train/test boundary
- Produces `C(N, k)` train/test splits (N=6, k=2 gives 15), which recombine into `k * C(N, k) / N` complete backtest paths (5 paths for N=6, k=2)
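A minimal sketch of enumerating the group combinations (purging and embargo at the boundaries are omitted here; `cpcv_group_splits` is a hypothetical helper):

```python
from itertools import combinations

def cpcv_group_splits(n_groups: int, k_test: int):
    """Yield (train_groups, test_groups) for every CPCV combination."""
    for test_groups in combinations(range(n_groups), k_test):
        train_groups = [g for g in range(n_groups) if g not in test_groups]
        yield train_groups, list(test_groups)

splits = list(cpcv_group_splits(6, 2))
print(len(splits))  # 15 = C(6, 2)
```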
See `references/methodology.md` for the full CPCV algorithm and formulas.
## Overfit Detection

### Deflated Sharpe Ratio (DSR)
The observed Sharpe ratio must be adjusted for:
- Number of strategies tested (multiple testing)
- Non-normality of returns (skewness, kurtosis)
- Length of the backtest
```python
import numpy as np
from scipy.stats import norm


def deflated_sharpe_ratio(
    observed_sr: float,
    num_trials: int,
    backtest_length: int,
    skewness: float = 0.0,
    kurtosis: float = 3.0,
) -> float:
    """Compute the probability that observed SR > 0 after deflation.

    Args:
        observed_sr: Annualized Sharpe ratio of the selected strategy.
        num_trials: Number of strategies tested (including discarded ones).
        backtest_length: Number of return observations.
        skewness: Skewness of returns.
        kurtosis: Kurtosis of returns (3.0 for normal returns).

    Returns:
        p-value (probability SR is genuinely > 0).
    """
    # Standard error of the Sharpe estimator, adjusted for non-normal returns.
    sr_std = np.sqrt(
        (1 - skewness * observed_sr + (kurtosis - 1) / 4 * observed_sr**2)
        / (backtest_length - 1)
    )
    # Expected max SR under the null (Euler-Mascheroni approximation).
    # Note: this simplification assumes unit variance of SRs across trials;
    # scale by the cross-trial SR standard deviation when it is available.
    euler_mascheroni = 0.5772156649
    expected_max_sr = norm.ppf(1 - 1 / num_trials) * (
        1 - euler_mascheroni
    ) + euler_mascheroni * norm.ppf(1 - 1 / (num_trials * np.e))
    return norm.cdf((observed_sr - expected_max_sr) / sr_std)
```
A DSR below 0.95 means the observed Sharpe ratio cannot be distinguished from zero at 95% confidence once the number of trials is accounted for, i.e. the performance is plausibly an artifact of overfitting.
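As a quick usage check (all numbers hypothetical): suppose the best of 50 tested variants shows an annualized Sharpe of 2.5 over a year of daily returns.

```python
p = deflated_sharpe_ratio(observed_sr=2.5, num_trials=50, backtest_length=252)
print(f"P(true SR > 0) = {p:.2f}")  # ~0.96: just clears the 0.95 bar
```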
### Probability of Backtest Overfitting (PBO)
PBO uses CPCV to measure the fraction of backtest paths where the in-sample optimal strategy underperforms the median out-of-sample. A PBO above 0.50 indicates more-likely-than-not overfitting.
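A minimal sketch of that fraction, assuming matrices of performance scores with one row per CPCV path and one column per candidate strategy (the canonical PBO works with rank logits; `pbo_fraction` is a hypothetical simplification):

```python
import numpy as np

def pbo_fraction(is_perf: np.ndarray, oos_perf: np.ndarray) -> float:
    """Fraction of paths where the in-sample winner falls below the OOS median.

    is_perf, oos_perf: shape (n_paths, n_strategies), e.g. Sharpe per path.
    """
    flops = 0
    for is_row, oos_row in zip(is_perf, oos_perf):
        winner = int(np.argmax(is_row))             # strategy picked in-sample
        flops += oos_row[winner] < np.median(oos_row)  # underperforms OOS median?
    return flops / len(is_perf)
```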
See `references/overfit_detection.md` for complete derivations and implementation details.
## Crypto-Specific Considerations
- Shorter windows: Crypto regimes change faster than equities. A 90-day rolling window may be more appropriate than 252 days.
- 24/7 markets: No weekends or holidays to account for, but funding rate resets (every 8h on perps) create microstructure effects.
- Survivorship bias: Many tokens delist. Validation must include delisted tokens or at minimum acknowledge this limitation.
- Liquidity regime shifts: A token's liquidity profile can change dramatically (new CEX listing, liquidity mining end). Train/test splits should ideally not straddle major liquidity events.
- Data availability: Many tokens have < 1 year of data. Expanding windows with a small `min_train_size` may be necessary.
### Practical Window Sizes for Crypto
| Strategy Timeframe | Train Window | Test Window | Embargo |
|---|---|---|---|
| Scalping (1-5min) | 3-7 days | 1 day | 2-4 hours |
| Intraday (15min-1h) | 14-30 days | 3-7 days | 12-24 hours |
| Swing (4h-daily) | 30-90 days | 7-14 days | 2-5 days |
| Position (daily-weekly) | 90-180 days | 30 days | 5-10 days |
## Quick Start
```python
from walk_forward import WalkForwardValidator, WalkForwardConfig

config = WalkForwardConfig(
    train_size=90,
    test_size=14,
    step_size=14,
    window_type="rolling",
    embargo_size=3,
    purge_horizon=1,
)

validator = WalkForwardValidator(config)

for fold in validator.split(price_data):
    model.fit(fold.train_X, fold.train_y)
    predictions = model.predict(fold.test_X)
    fold.record_performance(predictions, fold.test_y)

results = validator.aggregate_results()
print(f"OOS Sharpe: {results.oos_sharpe:.3f}")
print(f"Train/Test Sharpe ratio: {results.sharpe_ratio_ratio:.2f}")
```
## Files

### References
- `references/methodology.md` — Walk-forward theory, window types, purging, embargo, CPCV algorithm with formulas
- `references/overfit_detection.md` — Deflated Sharpe ratio, probability of backtest overfitting, multiple testing corrections
- `references/practical_guide.md` — Window size selection for crypto, regime considerations, common validation mistakes
### Scripts
- `scripts/walk_forward.py` — Walk-forward validation engine with rolling and expanding windows; `--demo` mode with synthetic data
- `scripts/overfit_detector.py` — Deflated Sharpe ratio and PBO computation; `--demo` mode with synthetic backtest results