
Anatomy of a Validation: A Momentum Strategy Through Seven Layers

This walkthrough follows a single strategy from upload to Validation Passport. The strategy is realistic: a long-only equity momentum model that a competent independent trader might develop in Python. It's not a toy example. It's also not perfect.

That's the point. A validation engine that only confirms what you already know is useless. The value shows up when the engine finds what you missed.


The Strategy

Name: US Equity Momentum (12-1)

Logic: At the end of each month, rank the Russell 1000 by trailing 12-month return, excluding the most recent month (the standard "12-1" formation period used in the academic momentum literature, Jegadeesh and Titman, 1993). Go long the top decile. Hold for one month. Rebalance.

Backtest period: January 2005 through December 2024, 20 years of daily returns.

Backtest Sharpe ratio: 0.91

Parameters: Formation period (12 months), holding period (1 month), decile threshold (top 10%), universe (Russell 1000). Four tunable parameters.

Format: Python script uploaded via the platform. The engine converts it to a canonical format, extracts the return series, and begins validation.

The strategy looks good on paper. Nearly 20 years of data. A Sharpe close to 1. Minimal parameter tuning. Grounded in a well-documented academic factor.

Here's what validation actually found.


Layer 1: Statistical Core

Purpose: Test whether the 0.91 Sharpe ratio is statistically real or an artifact of overfitting and multiple testing.

CPCV (Combinatorially Purged Cross-Validation)

The engine generates 10 combinatorial train/test splits with embargo periods to prevent information leakage between folds. Each split produces an out-of-sample (OOS) Sharpe ratio.
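
A minimal sketch of how such combinatorial purged splits could be generated is shown below. The group count, embargo length, and helper names are illustrative assumptions, not the engine's actual configuration; with 5 groups and 2 test groups per split, the combination count works out to the 10 splits described above.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def cpcv_splits(returns: pd.Series, n_groups: int = 5, n_test: int = 2,
                embargo: int = 5):
    """Yield (train_idx, test_idx) for every combination of test groups.

    With 5 groups and 2 test groups per split this yields C(5, 2) = 10 splits.
    The embargo drops observations adjacent to each test block from the
    training set so information cannot leak across the boundary.
    """
    groups = np.array_split(np.arange(len(returns)), n_groups)
    for combo in combinations(range(n_groups), n_test):
        test_idx = np.concatenate([groups[g] for g in combo])
        banned = set(test_idx)
        for g in combo:
            lo, hi = groups[g][0], groups[g][-1]
            banned.update(range(max(0, lo - embargo),
                                min(len(returns), hi + embargo + 1)))
        train_idx = np.array(sorted(set(range(len(returns))) - banned))
        yield train_idx, test_idx


def annualized_sharpe(daily_returns: pd.Series) -> float:
    """Annualized Sharpe ratio of a daily return series (risk-free rate ignored)."""
    return np.sqrt(252) * daily_returns.mean() / daily_returns.std()
```

Each split's test indices feed an OOS Sharpe estimate; the spread of those estimates is what the results below summarize.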

Results:

  • In-sample Sharpe range: 0.83 to 1.14
  • OOS Sharpe range: 0.38 to 0.89
  • Mean OOS Sharpe: 0.62
  • OOS degradation: 32%

A 32% degradation from in-sample to out-of-sample is moderate. It tells us the strategy has some genuine signal but the backtest overstates it. The mean OOS Sharpe of 0.62 is the more honest estimate. Still positive, still tradeable, but a meaningfully different proposition from 0.91.

PBO (Probability of Backtest Overfitting)

With only four parameters, the strategy isn't heavily optimized. PBO analysis across the CPCV folds estimates:

PBO: 18%

An 18% probability that the best parameter combination is overfit. For a strategy with this few parameters, that's reasonable. A PBO below 20% doesn't guarantee the strategy works, but it means overfitting isn't the primary concern.
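
A hedged sketch of how such a PBO figure can be computed from the fold results, in the spirit of the Bailey et al. procedure; the array layout and function name are assumptions for illustration:

```python
import numpy as np


def pbo_estimate(is_sharpe: np.ndarray, oos_sharpe: np.ndarray) -> float:
    """Probability of backtest overfitting from (n_splits, n_configs) arrays.

    For each split, find the parameter configuration that looked best
    in-sample; the split counts as overfit if that configuration lands in
    the bottom half of the out-of-sample ranking.
    """
    overfit = 0
    for is_row, oos_row in zip(is_sharpe, oos_sharpe):
        best = int(np.argmax(is_row))
        oos_rank = (oos_row < oos_row[best]).mean()  # fraction it beats OOS
        overfit += oos_rank < 0.5
    return overfit / len(is_sharpe)
```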

Multiple Testing Correction

The developer reports testing three formation periods (6, 9, 12 months) and two holding periods (1, 3 months) before settling on the 12-1 configuration. That's six combinations tested.

Applying the Bonferroni correction for six tests against the 0.62 OOS Sharpe: the adjusted significance level shifts from 5% to 0.83%. The strategy's t-statistic remains significant at the adjusted threshold.
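
Expressed directly, assuming a one-sided test (the library call is just one way to read off the implied hurdle):

```python
from scipy.stats import norm

alpha, n_tests = 0.05, 6
adjusted_alpha = alpha / n_tests           # Bonferroni: 0.05 / 6 ≈ 0.0083 (0.83%)
z_required = norm.ppf(1 - adjusted_alpha)  # one-sided critical value ≈ 2.39
```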

L1 Score: 64/100. The OOS degradation costs points, but the strategy passes the overfitting checks. The score would be higher with a smaller OOS gap.


Layer 2: Causal Inference

Purpose: Determine whether momentum returns reflect a genuine, persistent market mechanism or a statistical coincidence.

Cross-Asset Momentum Check

The engine tests the momentum signal across subsets of the universe and adjacent asset classes. If momentum works in US large-cap equities, does it also appear in mid-caps, international equities, and commodities during the same periods?
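
A minimal sketch of that check, assuming aligned monthly return series for each sleeve are available (the function and argument names are illustrative):

```python
import pandas as pd
from scipy.stats import spearmanr


def cross_asset_rank_corr(primary: pd.Series, other: pd.Series) -> float:
    """Spearman rank correlation between two monthly return series."""
    aligned = pd.concat([primary, other], axis=1, join="inner").dropna()
    rho, _ = spearmanr(aligned.iloc[:, 0], aligned.iloc[:, 1])
    return rho
```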

Results:

  • US mid-cap momentum (Russell 2000 top decile): Spearman rank correlation 0.71 with the primary strategy's monthly returns
  • International momentum (MSCI EAFE): Spearman rank correlation 0.53
  • Commodity momentum: Spearman rank correlation 0.29

The high cross-asset correlation supports a causal mechanism. Momentum is not specific to US large-caps; it operates across markets and asset classes, consistent with behavioral explanations (anchoring bias, herding, slow information diffusion) that have genuine causal backing (Barberis, Shleifer, and Vishny, 1998; Daniel, Hirshleifer, and Subrahmanyam, 1998).

Decay Check

Academic momentum returns have declined over time as the factor became widely known and traded. The engine tests for temporal degradation.

Results:

  • 2005-2014 Sharpe: 0.79
  • 2015-2024 Sharpe: 0.48
  • Degradation rate: 3.1% per year

The signal is decaying. Not dead, but weaker than it was. This is consistent with a real factor being arbitraged down over time, which actually supports the causal thesis. Noise doesn't decay systematically. Real factors do, because capital flows in and compresses the premium.
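
One way the decay rate could be estimated, as a sketch: fit a linear trend to a rolling Sharpe ratio and read off the slope. The three-year window is an assumption, not the engine's setting.

```python
import numpy as np
import pandas as pd


def sharpe_decay(daily_returns: pd.Series, window: int = 756) -> float:
    """Slope of a rolling three-year Sharpe ratio, in Sharpe units per year.

    Dividing the slope by the Sharpe at the start of the sample converts it
    into a percentage-per-year decay figure like the one quoted above.
    """
    roll = daily_returns.rolling(window)
    rolling_sharpe = (np.sqrt(252) * roll.mean() / roll.std()).dropna()
    years = np.arange(len(rolling_sharpe)) / 252.0
    slope, _intercept = np.polyfit(years, rolling_sharpe.values, 1)
    return slope
```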

L2 Score: 68/100. Cross-asset evidence is strong. The temporal decay is consistent with a real but diminishing edge. Points deducted for the degradation trend.


Layer 3: Scenario Lab

Purpose: Test the strategy against synthetic market conditions it hasn't seen in its historical data.

Flash Crash Scenarios

The engine generates synthetic scenarios modeled on liquidity crises: sudden 5-10% market drops over 1-3 days, followed by partial recovery. Momentum strategies are vulnerable to these events because their long portfolio (recent winners) tends to be concentrated in high-beta stocks that fall hardest.
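
A simplified sketch of how such a path could be generated and mapped onto the strategy; the shock size, beta, recovery fraction, and noise level are illustrative assumptions, not the engine's scenario model:

```python
import numpy as np


def flash_crash_path(strategy_beta: float = 1.2, crash_days: int = 2,
                     crash_size: float = -0.08, recovery_fraction: float = 0.5,
                     recovery_days: int = 20, seed: int = 0) -> np.ndarray:
    """Daily strategy returns through a synthetic drop and partial recovery."""
    rng = np.random.default_rng(seed)
    crash = np.full(crash_days, crash_size / crash_days)
    rebound = np.full(recovery_days,
                      -crash_size * recovery_fraction / recovery_days)
    market = np.concatenate([crash, rebound])
    market += rng.normal(0.0, 0.01, market.size)  # idiosyncratic daily noise
    # A high-beta momentum book is assumed to move beta-for-one with the market.
    return strategy_beta * market
```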

Results:

  • Average drawdown in flash-crash scenarios: -14.2%
  • Recovery time: 47 trading days
  • Maximum scenario drawdown: -23.1%

Momentum Crash Scenarios

The engine specifically generates "momentum crash" scenarios, modeled on March 2009 and January 2001, where recent losers (the short side of a long-short momentum portfolio, or equivalently the stocks momentum avoids) dramatically outperform recent winners.

For a long-only implementation like this one, momentum crashes manifest as sharp underperformance versus the benchmark rather than absolute losses. The engine tests both.

Results:

  • Average relative underperformance in momentum crash scenarios: -18.7% over 3 months
  • Absolute drawdown in worst scenario: -31.4%
  • Probability-weighted expected shortfall (at 5%): -22.8%

Extended Bear Market

Synthetic 18-month bear market with 40% cumulative decline:

Results:

  • Strategy return: -38.2% (vs. benchmark -40.0%)

The strategy provides almost no protection in a sustained downturn. Momentum stocks in a bear market are the ones that haven't fallen yet, which means they're the ones that fall last, not the ones that don't fall.

L3 Score: 52/100. Momentum crashes and extended bear markets represent genuine risks. The strategy survives but doesn't protect. The scenario lab won't kill a strategy for underperforming in a crash, but it will penalize the lack of tail-risk protection.


Layer 4: Regime Intelligence

Purpose: Identify market regimes and evaluate strategy performance within each one.

HMM Regime Detection

The engine's Hidden Markov Model identifies three regimes in the benchmark (S&P 500) over the 2005-2024 period:

Regime     Label                  Duration     Fraction
State 0    Bull/Low Volatility    142 months   59%
State 1    Transition/Choppy      63 months    26%
State 2    Bear/Crisis            35 months    15%
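
A minimal sketch of the regime-detection step, using the hmmlearn library as a stand-in for the engine's internal model; the feature choice, state count, and library are assumptions:

```python
import pandas as pd
from hmmlearn.hmm import GaussianHMM


def label_regimes(benchmark_returns: pd.Series, n_states: int = 3) -> pd.Series:
    """Fit a 3-state Gaussian HMM on daily benchmark returns and label each day."""
    X = benchmark_returns.to_numpy().reshape(-1, 1)
    model = GaussianHMM(n_components=n_states, covariance_type="full",
                        n_iter=200, random_state=0)
    model.fit(X)
    return pd.Series(model.predict(X), index=benchmark_returns.index,
                     name="regime")
```

The raw states carry no inherent meaning; they are mapped to the Bull, Transition, and Bear labels above by sorting on fitted volatility or mean return.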

Per-Regime Strategy Performance

Regime          Strategy Sharpe   Benchmark Sharpe
Bull/Low Vol    1.24              0.98
Transition      0.41              0.22
Bear/Crisis     -0.84             -1.12

There it is. In the bull regime, the strategy looks excellent. In the bear regime, it has a Sharpe of -0.84. That single number changes the entire risk picture.

The headline backtest Sharpe of 0.91 is a weighted average across regimes. Since 59% of the sample period was a bull market, the composite is dominated by the bull-regime performance. The strategy doesn't have a Sharpe of 0.91. It has a Sharpe of 1.24 in bull markets and -0.84 in bear markets. Those are two different strategies wearing the same name.

Rolling Hurst Exponent

Multi-scale Hurst analysis on the strategy's return series:

Window      Hurst   Interpretation
63 days     0.58    Mildly trending
126 days    0.54    Near random walk
252 days    0.51    Random walk

The strategy's returns show weak persistence at short horizons and essentially random behavior at longer horizons. This is consistent with a momentum factor that provides a small, persistent edge on average but can't be relied on for directional performance over any given year.
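
A compact sketch of one standard Hurst estimator, based on how the dispersion of lagged differences scales with the lag; the engine's exact multi-scale method isn't specified, so treat this as illustrative:

```python
import numpy as np
import pandas as pd


def hurst_exponent(returns: pd.Series, max_lag: int = 100) -> float:
    """Estimate H from std(x[t+lag] - x[t]) ~ lag**H on cumulative returns."""
    x = returns.cumsum().to_numpy()
    lags = np.arange(2, max_lag)
    tau = [np.std(x[lag:] - x[:-lag]) for lag in lags]
    h, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return h
```

Applied on rolling 63-, 126-, and 252-day windows, this kind of estimator produces the multi-scale readings in the table above.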

Regime-Conditioned Scoring

With a bear fraction of 15% and worst-regime Sharpe of -0.84, L4 applies two adjustments to the composite score:

  1. Threshold shift: min(0.15 * 15, 10) = 2.25 points. Pass boundaries are lowered slightly, recognizing that 15% bear exposure is within normal bounds for a long-equity strategy.
  2. Sharpe discount: Worst-regime Sharpe of -0.84 triggers a 15% discount to the L1 Sharpe score, because the overall Sharpe masks regime-conditional weakness.
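
As a sketch, the two rules reduce to a few lines; the function name is illustrative, while the constants are the ones quoted above:

```python
def regime_adjustments(bear_fraction: float, worst_regime_sharpe: float):
    """Return (threshold_shift, l1_discount) per the two rules above."""
    # 0.15 points of threshold relief per percentage point of bear exposure,
    # capped at 10: min(0.15 * 15, 10) = 2.25 for this strategy.
    threshold_shift = min(0.15 * bear_fraction * 100, 10.0)
    # A negative worst-regime Sharpe triggers a 15% discount on the L1 score.
    l1_discount = 0.15 if worst_regime_sharpe < 0 else 0.0
    return threshold_shift, l1_discount
```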

L4 Score: 55/100. The bear-market performance is the primary issue. The strategy's composite Sharpe conceals meaningful regime dependence. The score would be substantially higher with bear-regime Sharpe above 0.


Layer 5: Execution Realism

Purpose: Model real-world trading frictions and determine whether the backtested returns survive them.

Transaction Cost Modeling

The strategy rebalances monthly, trading roughly 20% of the portfolio (as the top decile changes). For Russell 1000 stocks:

Cost Component                   Estimate
Bid-ask spread (half-spread)     3.2 bps per trade
Market impact (Almgren-Chriss)   1.8 bps per trade
Commission                       0.5 bps per trade
Total per-trade cost             5.5 bps

Annual Cost Impact

With 20% monthly turnover, annual two-way turnover is approximately 240%. At 5.5 bps per trade:

Annual friction cost: 1.32% of portfolio value

Against the strategy's gross annual return of approximately 9.1% (derived from the 0.62 OOS Sharpe, assuming 10% annualized volatility), friction consumes about 14.5% of gross returns.

Regime-Stressed Execution

Under bear-market conditions, the execution layer applies stress multipliers:

Component   Normal    Bear/Crisis
Spread      3.2 bps   9.6 bps (3x)
Impact      1.8 bps   3.6 bps (2x)
ADV         100%      50%

During a crisis rebalance, per-trade costs increase to 14.2 bps. If the strategy rebalances at normal frequency during a crisis, annual friction cost jumps to 3.4%.

Capacity Estimate

At $50 million AUM, market impact is negligible for Russell 1000 stocks. At $500 million, impact costs increase by approximately 40%. The strategy's estimated capacity ceiling before impact costs erode the edge below zero: approximately $2 billion.

L5 Score: 71/100. Transaction costs are manageable at the target AUM. The strategy scores well here because it trades liquid, large-cap equities with moderate turnover. Points deducted for elevated crisis-period friction and the capacity constraint at larger sizes.


Layer 6: Live Monitoring Setup

Purpose: Establish decay detection baselines for ongoing surveillance.

Since this is a pre-deployment validation, L6 sets baseline parameters rather than generating a retrospective score. The monitoring framework would track:

  • Rolling 60-day Sharpe ratio with 2-sigma bands
  • Monthly alpha versus the benchmark
  • Factor exposure drift (momentum loading, beta, sector concentration)
  • Signal decay rate (change in cross-sectional return spread between top and bottom deciles)

The temporal decay detected in L2 (3.1% per year decline in Sharpe) establishes an expected degradation path. If live performance decays faster than 3.1% annually, the monitoring system flags it as potential edge exhaustion beyond normal factor compression.
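
A minimal sketch of the first monitor, a rolling 60-day Sharpe with 2-sigma bands around the validated baseline; the standard-error approximation (Lo, 2002, under an iid assumption) and the band construction are assumptions about how the bands could be built, not the platform's implementation:

```python
import numpy as np
import pandas as pd


def rolling_sharpe_monitor(live_returns: pd.Series,
                           baseline_sharpe: float = 0.62,
                           window: int = 60) -> pd.DataFrame:
    """Rolling annualized Sharpe with 2-sigma bands around the baseline."""
    roll = live_returns.rolling(window)
    sharpe = np.sqrt(252) * roll.mean() / roll.std()
    daily_sr = baseline_sharpe / np.sqrt(252)
    # Approximate sampling error of a Sharpe estimated from `window` daily
    # observations, re-annualized.
    se = np.sqrt((1 + 0.5 * daily_sr ** 2) / window) * np.sqrt(252)
    out = pd.DataFrame({"sharpe": sharpe,
                        "lower": baseline_sharpe - 2 * se,
                        "upper": baseline_sharpe + 2 * se})
    out["breach"] = out["sharpe"] < out["lower"]
    return out
```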

L6: Baseline set. No score assigned pre-deployment.


Layer 7: Integrity

Purpose: Cryptographic audit trail and tamper-evident validation record.

The validation run is hashed and timestamped. The resulting Validation Passport includes:

  • SHA-256 hash of the input strategy code
  • Timestamp of validation run
  • All layer scores with underlying metrics
  • Parameter sensitivity analysis
  • Regime-conditional performance breakdown

The passport can be shared with investors, allocators, or compliance teams. It proves the strategy was validated at a specific point in time with specific results, without revealing the strategy's logic.
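
A minimal sketch of the hashing step; the payload fields are assumptions about what such a record could contain, not the platform's actual schema:

```python
import hashlib
import json
from datetime import datetime, timezone


def build_passport(strategy_code: bytes, layer_scores: dict) -> dict:
    """Hash the strategy source and wrap the validation results in a record."""
    passport = {
        "strategy_sha256": hashlib.sha256(strategy_code).hexdigest(),
        "validated_at": datetime.now(timezone.utc).isoformat(),
        "layer_scores": layer_scores,
    }
    # Hash the record itself so any later edit to the passport is detectable.
    canonical = json.dumps(passport, sort_keys=True).encode()
    passport["passport_sha256"] = hashlib.sha256(canonical).hexdigest()
    return passport
```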

L7: Passport generated.


The Validation Passport

Composite Score: 61/100

Layer                     Score   Weight   Contribution
L1: Statistical Core       64      0.30       19.20
L2: Causal Inference       68      0.10        6.80
L3: Scenario Lab           52      0.10        5.20
L4: Regime Intelligence    55      0.15        8.25
L5: Execution Realism      71      0.35       24.85
Weighted sum                                  64.30

The engine also checks a floor constraint: the composite cannot exceed the minimum component score plus 20 points. With L3 at 52, that cap is 72; the weighted sum of 64.3 sits below it, so the cap does not bind.

What does bind is the regime-conditioned adjustment from L4. Applying the 15% discount to the L1 contribution brings the composite to approximately 61.
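
The arithmetic, reproduced as a sketch; the function and dictionary names are illustrative, while the weights, floor cap, and discount are the ones described above:

```python
def composite_score(scores: dict, weights: dict, l1_discount: float = 0.15) -> float:
    """Weighted sum, L4 discount on the L1 contribution, then the floor cap."""
    weighted = sum(scores[k] * weights[k] for k in scores)       # 64.3
    weighted -= l1_discount * scores["L1"] * weights["L1"]       # -2.88 -> ~61.4
    cap = min(scores.values()) + 20                              # 52 + 20 = 72
    return min(weighted, cap)


scores = {"L1": 64, "L2": 68, "L3": 52, "L4": 55, "L5": 71}
weights = {"L1": 0.30, "L2": 0.10, "L3": 0.10, "L4": 0.15, "L5": 0.35}
print(round(composite_score(scores, weights)))  # 61
```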

Verdict: Weak Pass

A score of 61 falls in the "weak pass" range (40-59 is fail, 60-79 is weak pass, 80+ is pass). The strategy has genuine statistical edge, a credible causal mechanism, and manageable execution costs. It also has meaningful regime dependence and vulnerability to tail events.

What the Developer Learns

The headline finding isn't the score. It's the regime decomposition.

Before validation, the developer believed they had a strategy with a 0.91 Sharpe ratio. After validation, they know they have a strategy with a 1.24 Sharpe in bull markets and a -0.84 Sharpe in bear markets, an OOS Sharpe of 0.62 that's decaying at 3.1% per year, transaction costs consuming 14.5% of gross returns, and vulnerability to momentum crashes.

That's not a rejection. It's a blueprint for improvement. The developer can now ask the right questions. Should they add a regime filter that reduces exposure in bear markets? Should they hedge tail risk with options? Should they accept the regime dependence and size the position accordingly?

The Validation Passport turns "I think this strategy works" into "I know exactly when and how this strategy works, and when it doesn't."


What This Walkthrough Shows

Standard momentum (12-1, long-only, no overlay) isn't a bad strategy. Academic momentum has genuine support from decades of research across markets. The strategy has edge.

It also has blind spots that a backtest won't reveal. Regime dependence. Temporal decay. Tail vulnerability. These aren't opinions. They're measurable properties of the return series that the backtest's composite Sharpe ratio obscures.

The purpose of validation isn't to find reasons to reject every strategy. It's to replace a single number (backtest Sharpe) with a complete picture of when, why, and under what conditions the strategy produces returns.

A developer who deploys this strategy with full knowledge of its regime characteristics makes a fundamentally different risk decision than one who deploys it believing the composite 0.91 Sharpe is the whole story.

The former is making an informed bet. The latter is making a mistake.