Claude-skill-registry Backtesting Analysis

Comprehensive guidance for interpreting backtest results and detecting overfitting (project)

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/backtesting-analysis" ~/.claude/skills/majiayu000-claude-skill-registry-backtesting-analysis && rm -rf "$T"
manifest: skills/data/backtesting-analysis/SKILL.md

Backtesting Analysis Skill

Purpose: Interpret backtest results, understand performance metrics, and detect overfitting or unreliable strategies.

Progressive Disclosure: This primer contains essentials only. Full details available via docs command.


When to Use This Skill

Load when:

  • Evaluating backtest results (Phase 3)
  • Detecting potential overfitting
  • Understanding strategy-specific performance expectations
  • Comparing multiple strategies or explaining results

Quick Reference: Key Metrics

Sharpe Ratio (Primary Metric)

Formula:

(Return - Risk-Free Rate) / Volatility

| Sharpe | Quality | Action |
|---|---|---|
| < 0.5 | Poor | Abandon |
| 0.5 - 0.7 | Marginal | Consider optimization |
| 0.7 - 1.0 | Acceptable | Optimize |
| 1.0 - 1.5 | Good | Production-ready |
| 1.5 - 2.0 | Very Good | Validate thoroughly |
| > 3.0 | SUSPICIOUS | Likely overfitting |

Key Insight: QuantConnect reports annual Sharpe. Sharpe > 1.0 is production-ready for most strategies.
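As a rough illustration of the formula and table above, the sketch below annualizes a Sharpe ratio from per-period returns and maps it to the table's quality labels. The function names, the 252-period annualization factor, the zero default risk-free rate, and the treatment of exact boundary values are assumptions made for this example; QuantConnect's own calculation may differ in detail. The table does not cover 2.0 to 3.0, so the sketch reports that range as unclassified rather than guessing.

```python
import math
import statistics


def annualized_sharpe(returns, annual_risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns (assumed daily)."""
    rf_per_period = annual_risk_free / periods_per_year
    excess = [r - rf_per_period for r in returns]
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("volatility is zero; Sharpe is undefined")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


def sharpe_quality(sharpe):
    """Map an annual Sharpe ratio to the quality labels in the table above."""
    if sharpe > 3.0:
        return "SUSPICIOUS"
    if sharpe >= 2.0:
        return "Unclassified"  # the table leaves 2.0 - 3.0 unspecified
    if sharpe >= 1.5:
        return "Very Good"
    if sharpe >= 1.0:
        return "Good"
    if sharpe >= 0.7:
        return "Acceptable"
    if sharpe >= 0.5:
        return "Marginal"
    return "Poor"
```

For example, `sharpe_quality(0.85)` returns "Acceptable" and `sharpe_quality(4.2)` returns "SUSPICIOUS".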


Maximum Drawdown

Formula:

(Trough - Peak) / Peak, quoted below as a positive magnitude

| Drawdown | Quality | Action |
|---|---|---|
| < 20% | Excellent | Low risk |
| 20% - 30% | Good | Acceptable for live trading |
| 30% - 40% | Concerning | Needs strong Sharpe to justify |
| > 40% | Too High | Unacceptable for most traders |

Key Insight: Drawdowns > 30% are hard to tolerate psychologically. Consider: "Could I stomach this loss in real money?"
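A minimal sketch of how maximum drawdown could be computed from an equity curve, assuming the curve is a simple list of positive portfolio values. The function name is hypothetical. It returns the decline as a positive fraction of the running peak, so the result can be compared directly with the percentages in the table.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that peak
    return worst


# The 120 -> 90 decline (25%) is deeper than the later 130 -> 104 decline (20%).
print(max_drawdown([100, 120, 90, 110, 130, 104]))
```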


Total Trades (Statistical Significance)

| Trade Count | Reliability | Decision Impact |
|---|---|---|
| < 20 | Unreliable | Abandon or escalate |
| 20 - 30 | Low | Minimum viable |
| 30 - 50 | Moderate | Acceptable |
| 50 - 100 | Good | Strong confidence |
| 100+ | Excellent | Highly reliable |

Key Insight: Need 30+ trades for basic significance, 100+ for high confidence. Few trades = unreliable metrics.
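One way to see why few trades give unreliable metrics is the binomial standard error of a win rate. The sketch below assumes independent trades and uses the normal approximation, which is itself rough at small sample sizes. It is offered as an illustration, not as the derivation behind the thresholds in the table, and the function name is hypothetical.

```python
import math


def win_rate_standard_error(win_rate, n_trades):
    """Binomial standard error of an observed win rate (independent trades assumed)."""
    return math.sqrt(win_rate * (1 - win_rate) / n_trades)


for n in (20, 30, 100):
    half_width = 1.96 * win_rate_standard_error(0.5, n)  # approx. 95% interval
    print(f"{n:>3} trades: observed 50% win rate, +/- {half_width:.1%}")
```

Under these assumptions, the uncertainty around a 50% win rate is roughly 22 percentage points at 20 trades and roughly 10 points at 100 trades.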


Win Rate

| Win Rate | Quality | Interpretation |
|---|---|---|
| < 40% | Low | Needs large winners (trend following) |
| 40% - 55% | Average | Typical for most strategies |
| 55% - 65% | Good | Strong edge |
| > 75% | SUSPICIOUS | Likely overfitting |

Key Insight: Win rate alone is misleading. Must consider profit factor and average win/loss ratio.
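To illustrate why win rate alone can mislead, the sketch below computes expectancy (average profit per trade) from win rate and average win/loss size. The function name and the dollar figures are made up for this example.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected profit per trade; avg_loss is given as a positive magnitude."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


# Low win rate with large winners: positive expectancy.
low_win_rate = expectancy(win_rate=0.35, avg_win=300.0, avg_loss=100.0)

# High win rate with small winners and large losers: negative expectancy.
high_win_rate = expectancy(win_rate=0.70, avg_win=50.0, avg_loss=150.0)

print(f"35% win rate: {low_win_rate:+.2f} per trade")
print(f"70% win rate: {high_win_rate:+.2f} per trade")
```

In this made-up case, the 35% win-rate strategy earns about 40 per trade while the 70% win-rate strategy loses about 10 per trade.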


Profit Factor

Formula:

Gross Profit / Gross Loss

| Profit Factor | Quality | Interpretation |
|---|---|---|
| < 1.3 | Marginal | Transaction costs may kill it |
| 1.3 - 1.5 | Acceptable | Decent after costs |
| 1.5 - 2.0 | Good | Strong profitability |
| > 3.0 | Exceptional | Outstanding (verify no overfitting) |

Key Insight: Minimum 1.5 for live trading to cover slippage and commissions.
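A short sketch of the profit factor formula, assuming trade results are available as a list of per-trade profit and loss values. The function name and the choice to return infinity when there are no losing trades are assumptions for this example.

```python
import math


def profit_factor(trade_pnls):
    """Gross profit divided by gross loss over a list of per-trade P&L values."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)  # positive magnitude
    if gross_loss == 0:
        return math.inf  # no losing trades; the ratio is unbounded
    return gross_profit / gross_loss


# Gross profit 300, gross loss 150.
print(profit_factor([100.0, -50.0, 200.0, -100.0]))
```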


Overfitting Detection (Red Flags)

  1. Too Perfect Sharpe (> 3.0) → ESCALATE_TO_HUMAN
  2. Too High Win Rate (> 75%) → Check for look-ahead bias
  3. Too Few Trades (< 20) → Unreliable metrics
  4. Excessive Optimization Improvement (> 30%) → Lucky parameters
  5. Severe Out-of-Sample Degradation (> 40%) → ABANDON_HYPOTHESIS
  6. Equity Curve Too Smooth → Check unrealistic assumptions
  7. Works Only in One Market Regime → Not robust

Remember: If it looks too good to be true, it probably is.
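The quantitative red flags (items 1 to 5) can be sketched as a simple checker. The function and flag names are hypothetical. Measuring optimization improvement as a relative change in Sharpe follows the worked question later in this primer (0.8 to 1.5); measuring out-of-sample degradation the same way is an assumption, since the primer does not say which metric is used. Items 6 and 7 need a look at the equity curve and market regimes, so they are not covered here.

```python
def relative_change(before, after):
    """Fractional change from before to after; expects a positive baseline."""
    if before <= 0:
        raise ValueError("baseline must be positive for a relative change")
    return (after - before) / before


def overfitting_flags(sharpe, win_rate, trades,
                      baseline_sharpe=None, optimized_sharpe=None,
                      in_sample_sharpe=None, out_of_sample_sharpe=None):
    """Return the names of the quantitative red flags that are triggered."""
    flags = []
    if sharpe > 3.0:
        flags.append("sharpe_too_perfect")
    if win_rate > 0.75:
        flags.append("win_rate_too_high")
    if trades < 20:
        flags.append("too_few_trades")
    if baseline_sharpe is not None and optimized_sharpe is not None:
        if relative_change(baseline_sharpe, optimized_sharpe) > 0.30:
            flags.append("excessive_optimization_improvement")
    if in_sample_sharpe is not None and out_of_sample_sharpe is not None:
        # Assumption: degradation is the relative drop in Sharpe.
        if relative_change(in_sample_sharpe, out_of_sample_sharpe) < -0.40:
            flags.append("severe_out_of_sample_degradation")
    return flags


print(overfitting_flags(sharpe=4.2, win_rate=0.88, trades=25))
```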


Strategy-Type Expectations

Momentum

  • Sharpe: 0.8 - 1.5 | Drawdown: 20-35% | Win Rate: 40-55%

Mean Reversion

  • Sharpe: 0.7 - 1.3 | Drawdown: 15-30% | Win Rate: 55-70%

Trend Following

  • Sharpe: 0.5 - 1.0 | Drawdown: 25-40% | Win Rate: 30-50%

Breakout

  • Sharpe: 0.6 - 1.2 | Drawdown: 20-35% | Win Rate: 40-55%

Use these to calibrate expectations - different strategies have different profiles.
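The profiles above can be encoded as a small lookup for calibration checks. This is a sketch only: the dictionary layout, key names, and function name are assumptions, and the ranges are copied from the four profiles above with drawdown and win rate expressed as fractions.

```python
# Expected (low, high) ranges per strategy type, taken from the profiles above.
PROFILES = {
    "momentum":        {"sharpe": (0.8, 1.5), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
    "mean_reversion":  {"sharpe": (0.7, 1.3), "drawdown": (0.15, 0.30), "win_rate": (0.55, 0.70)},
    "trend_following": {"sharpe": (0.5, 1.0), "drawdown": (0.25, 0.40), "win_rate": (0.30, 0.50)},
    "breakout":        {"sharpe": (0.6, 1.2), "drawdown": (0.20, 0.35), "win_rate": (0.40, 0.55)},
}


def outside_profile(strategy_type, **metrics):
    """Return the metrics that fall outside the expected range for this type."""
    profile = PROFILES[strategy_type]
    return {
        name: value
        for name, value in metrics.items()
        if not (profile[name][0] <= value <= profile[name][1])
    }


# A 42% win rate is typical for momentum but low for mean reversion.
print(outside_profile("mean_reversion", sharpe=0.9, drawdown=0.20, win_rate=0.42))
```

A result outside the range is a prompt to look closer, not a verdict on its own.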


Example Decisions

GOOD (Optimization Worthy)

Sharpe: 0.85, Drawdown: 22%, Trades: 67, Win Rate: 42%, PF: 1.8
→ PROCEED_TO_OPTIMIZATION (decent baseline, worth improving)

EXCELLENT (Production Ready)

Sharpe: 1.35, Drawdown: 18%, Trades: 142, Win Rate: 53%, PF: 2.1
→ PROCEED_TO_VALIDATION (already strong, skip optimization)

SUSPICIOUS (Overfitting)

Sharpe: 4.2, Drawdown: 5%, Trades: 25, Win Rate: 88%, PF: 5.8
→ ESCALATE_TO_HUMAN (too perfect, likely look-ahead bias or bug)

POOR (Abandon)

Sharpe: 0.3, Drawdown: 38%, Trades: 89, Win Rate: 35%, PF: 1.1
→ ABANDON_HYPOTHESIS (poor risk-adjusted returns)
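The four examples above are consistent with a simple triage order, sketched below. This is a simplified reading of the examples and tables in this primer, not the actual logic in SCRIPTS/decision_logic.py, which is not shown here. The function name, the check order, and the use of the 40% drawdown and 1.5 profit factor cut-offs are assumptions.

```python
def triage(sharpe, drawdown, trades, win_rate, profit_factor):
    """Map baseline backtest metrics to one of the decisions used above."""
    # Suspicious or statistically weak results go to a human first.
    if sharpe > 3.0 or win_rate > 0.75 or trades < 20:
        return "ESCALATE_TO_HUMAN"
    # Poor risk-adjusted returns or an intolerable drawdown.
    if sharpe < 0.5 or drawdown > 0.40:
        return "ABANDON_HYPOTHESIS"
    # Already strong: skip optimization and validate.
    if sharpe >= 1.0 and profit_factor >= 1.5:
        return "PROCEED_TO_VALIDATION"
    # Decent baseline worth improving.
    return "PROCEED_TO_OPTIMIZATION"


examples = [
    (0.85, 0.22, 67, 0.42, 1.8),
    (1.35, 0.18, 142, 0.53, 2.1),
    (4.2, 0.05, 25, 0.88, 5.8),
    (0.3, 0.38, 89, 0.35, 1.1),
]
for metrics in examples:
    print(metrics, "->", triage(*metrics))
```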

Common Confusion Points

Q: "Strategy made 200% returns, but Sharpe is only 0.6 - is this good?"
A: No. We prioritize risk-adjusted returns (Sharpe), not raw returns. High returns with high volatility = bad Sharpe = risky.

Q: "Sharpe 2.5 with 15 trades - should I proceed?"
A: ESCALATE_TO_HUMAN. Too few trades (<20) for statistical significance. High Sharpe with few trades = luck, not skill.

Q: "Optimization improved Sharpe from 0.8 to 1.5 (87% improvement) - is this good?"
A: ESCALATE_TO_HUMAN. 87% > 30% threshold = likely overfitting to in-sample period.

Q: "Win rate is 78%, Sharpe is 1.2 - why is this flagged?"
A: Win rate > 75% is an overfitting signal. Real strategies rarely achieve such high win rates.


Key Principles

  1. Sharpe ratio is king - Primary metric for risk-adjusted returns
  2. Trade count matters - Need 30+ for reliability, 100+ for confidence
  3. Beware overfitting - Too perfect results are suspicious
  4. Context by strategy type - Different strategies have different expectations
  5. Risk-adjusted, not raw returns - High returns with high volatility = bad

Reference Documentation (Progressive Disclosure)

Need detailed analysis? All reference documentation accessible via --help:

python SCRIPTS/backtesting_analysis.py --help

That's the only way to access complete reference documentation.

Topics covered in --help:

  • Sharpe Ratio Deep Dive
  • Maximum Drawdown Analysis
  • Trade Count Statistical Significance
  • Win Rate Analysis
  • Profit Factor Analysis
  • Complete Overfitting Detection Guide
  • Strategy-Type Profiles (Momentum, Mean Reversion, Trend Following, Breakout)
  • 10+ Annotated Example Backtests
  • Common Confusion Points

The primer above covers 90% of use cases. Use --help for edge cases and detailed analysis.


Integration with Decision Framework

This skill complements the decision-framework skill:

  • decision-framework: Provides thresholds and decision logic
  • backtesting-analysis: Provides interpretation and context

Workflow:

  1. Load decision-framework to apply thresholds
  2. Load backtesting-analysis to understand what metrics mean
  3. Combine insights to make informed decisions

Related Files

  • .claude/skills/decision-framework/skill.md - Decision thresholds
  • SCRIPTS/decision_logic.py - Decision implementation
  • .claude/commands/qc-backtest.md - Backtest execution

Version: 2.0.0 (Progressive Disclosure)
Last Updated: November 13, 2025
Lines: ~200 (was 555)
Context Reduction: 64%