Mitigating overfitting with advanced cross-validation

The best techniques for splitting your backtesting data

Introduction

Accurately assessing the performance and generalisation ability of an algorithm is crucial to ensure reliable predictions on unseen data. This is where cross-validation (CV) techniques come into play. The primary objective of CV is to estimate the generalisation error of a model, ultimately preventing overfitting.

When evaluating an algorithm using the same dataset it was trained on, we often achieve impressive results. However, such performance can be misleading, because it says little about the algorithm's ability to make reliable forecasts on new data. CV mitigates this issue by splitting observations into two distinct sets: the training set and the testing set.

Within the realm of backtesting applied to empirical finance, several alternative CV schemes have been developed to validate the efficacy of trading strategies. These include: basic vanilla backtesting; walk-forward backtesting; K-fold CV backtesting; combinatorial purged cross-validation; multiple randomised backtesting.

In this article, we delve into these cross-validation techniques, highlighting their strengths and limitations in assessing the generalisation error of algorithms in the context of trading strategy backtesting. By understanding and implementing these methodologies, traders can make more informed decisions when developing robust trading strategies.

If you would like to get a primer on backtesting first, have a read of our previous article below providing an in-depth introduction to backtesting.

Basic vanilla backtesting

In order to conduct a basic vanilla backtest, the data is divided into different sets. The first set is the in-sample data, which is further divided into training data and cross-validation data. The purpose of the in-sample data is to train the strategy and validate its effectiveness.

The training data is utilised to estimate the model parameters, allowing the strategy to learn from historical patterns and trends. This step helps the strategy to adapt and make informed decisions based on the available information.

The cross-validation data serves a slightly different purpose. It is used to choose a few hyper-parameters, which are settings or configurations that can influence the strategy's performance. By testing the strategy on the cross-validation data, different combinations of hyper-parameters can be evaluated, enabling the selection of the most effective settings.

Finally, the out-of-sample or test data is used to evaluate the performance of the strategy. This data represents new and unseen information, simulating real-world conditions where the strategy encounters unknown market situations. In principle, by assessing how well the strategy performs on this test data, its ability to generalise and make accurate predictions on unseen data can be determined.
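The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `chronological_split` and the 60/20/20 proportions are assumptions made for the example, not something prescribed by the method.

```python
def chronological_split(n_obs, train_frac=0.6, val_frac=0.2):
    """Split bar indices 0..n_obs-1 into train, validation and test ranges.

    The order of the bars is preserved: training comes first, then the
    cross-validation data, then the out-of-sample test data.
    The fractions are illustrative assumptions.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    train_end = int(n_obs * train_frac)
    val_end = int(n_obs * (train_frac + val_frac))
    return range(0, train_end), range(train_end, val_end), range(val_end, n_obs)


# Usage: 1000 daily bars split into in-sample (train + validation) and test data.
train, val, test = chronological_split(1000)
print(len(train), len(val), len(test))
```

The data is never shuffled here, so every test bar comes after every bar used for fitting parameters or choosing hyper-parameters.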

The main disadvantage of this and other basic backtesting methods is that they only make use of a single historical path. Relying on just one realisation of historical data is dangerous - it’s difficult to draw any meaningful conclusions. Worse yet, when parameter tuning is involved, historical outperformance is more often than not merely the result of overfitting.

Figure 1: In-sample vs out-of-sample data split when backtesting [1]

Walk-forward backtesting

The walk-forward (WF) approach is a modified version of the vanilla backtest where the in-sample and out-of-sample windows are continually shifted or rolled forward. It provides a historical simulation of how a trading strategy would have performed in the past. Importantly, each decision made by the strategy is based on observations that occurred before that specific decision.

Figure 2: Data splitting strategy for walk forward analysis [1]

The WF approach offers two significant advantages. Firstly, WF has a clear historical interpretation, meaning its performance can be compared and reconciled with paper trading based on historical data. This allows traders and researchers to assess the real-world viability of a strategy and evaluate its effectiveness.

Secondly, history acts as a filtration mechanism in the walk-forward approach. By using trailing data, the performance evaluation reflects the strategy's ability to adapt and generalise to new, unseen data.

However, it is important to note that WF backtesting also makes use of only one historical path. Additionally, data leakage is a common and often overlooked mistake. Data leakage occurs when information is unintentionally shared between the training set and the test set. For example, consider a trading strategy that utilises a 100-period simple moving average. After each data split, the first 99 bars of the out-of-sample window will reference data from the preceding in-sample period, resulting in data leakage. The more data splits performed, the greater the extent of leakage. Failure to account for data leakage with appropriate purging techniques (more on this later) can lead to full-blown look-ahead bias and highly misleading results.
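To make the rolling windows concrete, here is a minimal sketch of a walk-forward splitter. The function name `walk_forward_splits` and the window sizes are assumptions chosen for illustration; the sketch deliberately applies no purging, so the leakage issue described above still applies to it.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index ranges for a rolling walk-forward scheme.

    Each test window starts right after its training window, and both
    windows are rolled forward by one test window at every step.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


# Usage: 1000 bars, 500-bar in-sample window, 100-bar out-of-sample window.
for train, test in walk_forward_splits(n_obs=1000, train_size=500, test_size=100):
    print(f"train {train[0]}-{train[-1]} -> test {test[0]}-{test[-1]}")
```

An expanding (anchored) window is a common variation: keep the start of the training range fixed at 0 instead of rolling it forward.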

K-fold CV backtesting

K-fold backtesting involves splitting the dataset into K equally sized subsets or "folds." The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, ensuring comprehensive evaluation across the entire dataset. The image below shows an example of a 5-fold CV scheme. Despite its popularity, K-fold CV suffers from the same drawbacks as WF backtesting: it relies on a single historical path and is prone to data leakage.

Figure 3: Data splitting strategy for K-fold CV backtesting [1]
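As a rough sketch of the 5-fold scheme, the splitter below cuts the bars into K contiguous folds without shuffling, so each fold is a block of consecutive bars. The helper name `kfold_splits` is an assumption for this example, and no purging or embargo is applied yet.

```python
import numpy as np


def kfold_splits(n_obs, k=5):
    """Yield (train_idx, test_idx) arrays for K contiguous, unshuffled folds."""
    if k < 2:
        raise ValueError("k must be at least 2")
    folds = np.array_split(np.arange(n_obs), k)
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one currently held out for testing.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


# Usage: each of the 5 folds takes one turn as the test set.
for train_idx, test_idx in kfold_splits(1000, k=5):
    print(f"test {test_idx[0]}-{test_idx[-1]}, train size {len(train_idx)}")
```

Note that, unlike walk-forward, most splits here train on bars that come after the test fold.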

Purging and embargo

To mitigate data leakage, one effective approach is purging. Purging involves removing observations from the training set that have labels overlapping in time with the labels included in the testing set. For example, if we were using K-fold CV to backtest a trading strategy relying on a 100-day moving average, we would remove the first 99 days of data in the test set and the first 99 days of the training fold following the test fold. See the image below for clarity.

Figure 4: Managing data leakage by purging overlapping data [2]

By purging these overlapping observations, the risk of data contamination and bias in the model's training process is reduced. This helps ensure a more accurate evaluation of the strategy's performance on the testing set, as it avoids any influence or leakage from the training data that may artificially enhance the results.

Another issue that must be attended to is the presence of serial correlation in financial features. In simple terms, this refers to instances where the values of a particular financial feature, like stock prices or returns, are not completely independent from each other but are influenced by previous values. This can result in a less overt form of data leakage that purging may fail to prevent, and it can be addressed with a technique called embargo [2]. Much like purging, embargo removes from the training set the observations that immediately follow an observation in the testing set. By eliminating these subsequent observations, which may be influenced by the testing set, the risk of biased training and inflated performance results is minimised. The embargo technique can be used alongside purging and helps keep the training set independent from the testing set, allowing for a more accurate assessment of the strategy's performance on unseen data. A useful rule of thumb is to embargo a number of bars equal to 1% of the bars in the test set.

Figure 5: Managing data leakage due to serial correlation by embargoing data [2]
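The sketch below shows one simplified way to apply both ideas to a single contiguous test block. It is an illustration under stated assumptions rather than the procedure from [2]: the helper name `purge_and_embargo` is hypothetical, the overlap is assumed to be a fixed number of bars (99, matching the 100-day moving average example), and training bars within that distance of the test block are dropped on both sides, with the embargo adding extra bars after the block only.

```python
import numpy as np


def purge_and_embargo(train_idx, test_idx, purge=99, embargo=0):
    """Drop training bars that sit too close to a contiguous test block.

    purge:   bars removed on each side of the test block (assumed fixed overlap)
    embargo: additional bars removed only after the test block
    """
    test_start, test_end = test_idx.min(), test_idx.max()
    keep_before = train_idx < test_start - purge
    keep_after = train_idx > test_end + purge + embargo
    return train_idx[keep_before | keep_after]


# Usage: 1000 bars, test block on bars 400-599.
# The embargo of 2 bars is 1% of the 200 test bars, following the rule of thumb.
all_idx = np.arange(1000)
test_idx = all_idx[400:600]
train_idx = np.setdiff1d(all_idx, test_idx)
kept = purge_and_embargo(train_idx, test_idx, purge=99, embargo=2)
print(len(train_idx), "->", len(kept))
```

In practice the purge length depends on how far each label or feature reaches in time, so a fixed bar count is only a starting point.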

Combinatorial purged cross-validation backtesting

Combinatorial Purged Cross-Validation (CPCV) is a backtesting algorithm introduced by Marcos López de Prado in Advances in Financial Machine Learning. It can be seen as a more advanced version of K-fold CV that addresses its weaknesses - data leakage and reliance on a single historical path.

Firstly, data leakage is addressed using the purging and embargo techniques outlined in the previous section of this article. Secondly, we can obtain multiple historical paths by using more than one fold for the test set. For example, consider splitting the data into 6 folds, with not just 1, but 2 test groups. This would give us 15 possible combinations, as derived from the following formula:

Figure 6: Combination formula (often called the “n choose r” formula): C(n, r) = n! / (r! (n - r)!)

Where:

n is the total number of folds;

r is the number of test groups.

Figure 7: 15 possible combinations of a CPCV scheme with 6 total folds and 2 test groups. The blue portions show the train data, the light red the test data and the dark red the data removed with purging and embargo [3].

With 15 possible combinations each consisting of 2 test groups, we end up with a total of 30 test groups. Furthermore, given that these test groups are uniformly distributed across the entire dataset, we can now construct a total of 5 historical paths (the total number of test groups divided by the number of folds - 30/6) as illustrated in the figure below.

Figure 8: Creating 5 different historical paths using the CPCV technique [3]
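The arithmetic above is easy to check in code. The sketch below enumerates the test-group combinations for 6 folds and 2 test groups and derives the number of paths as (combinations × test groups) / folds. The helper name `cpcv_summary` is an assumption for this example, and it only counts the splits; it does not perform purging or embargo.

```python
from itertools import combinations
from math import comb


def cpcv_summary(n_folds, n_test_groups):
    """Return the test-fold combinations, their count, and the number of paths."""
    splits = list(combinations(range(n_folds), n_test_groups))
    n_splits = comb(n_folds, n_test_groups)  # n! / (r! (n - r)!)
    # Every fold is tested equally often, so the test groups form whole paths.
    n_paths = n_splits * n_test_groups // n_folds
    return splits, n_splits, n_paths


splits, n_splits, n_paths = cpcv_summary(n_folds=6, n_test_groups=2)
print(n_splits, n_paths)  # 15 5
print(splits[:3])         # [(0, 1), (0, 2), (0, 3)]
```

Each fold appears in exactly 5 of the 15 combinations, which is why the 30 test groups can be stitched into 5 complete paths.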

For those interested in implementing CPCV, the following repository and online notebook are useful resources to get started:

Multiple randomised backtesting

An alternative to CPCV that can also be used to avoid relying on a single historical path is multiple randomised backtesting. This technique involves generating multiple datasets from historical market data in a randomised manner. This is done by randomly selecting different periods of time and a subset of stocks from the universe of available options.

For instance, if the original dataset consists of 2000 stocks spanning a 20-year period, one could randomly choose 100 stocks over a consecutive period of 2 years. This process is repeated multiple times to create randomised datasets. The aim is to introduce variability and capture different market conditions experienced over the 20-year timeframe.

Each resampled dataset possesses its own unique randomness and encompasses various market regimes encountered throughout the 20 years. Subsequently, a walk-forward backtesting approach can be applied (with purging to avoid data leakage) to evaluate the performance of each of these resampled datasets.
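The sampling step can be sketched as follows. This is a minimal illustration under assumptions: the function name `random_backtest_samples` is hypothetical, stocks are identified by integer position, and a year is taken to be 252 trading days. Each sample is a random subset of stocks plus a random consecutive window of bars, on which a walk-forward backtest would then be run.

```python
import numpy as np


def random_backtest_samples(n_stocks, n_periods, sample_stocks,
                            sample_periods, n_samples, seed=0):
    """Draw random (stock subset, start bar, end bar) triples from the history."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        # Pick a subset of stocks without repetition.
        stocks = np.sort(rng.choice(n_stocks, size=sample_stocks, replace=False))
        # Pick a consecutive window that fits inside the available history.
        start = int(rng.integers(0, n_periods - sample_periods + 1))
        samples.append((stocks, start, start + sample_periods))
    return samples


# Usage: 2000 stocks over 20 years, resampled as 100 stocks over 2 years.
samples = random_backtest_samples(
    n_stocks=2000, n_periods=20 * 252,
    sample_stocks=100, sample_periods=2 * 252, n_samples=50,
)
stocks, start, end = samples[0]
print(len(samples), len(stocks), end - start)
```

Fixing the seed keeps the resampled datasets reproducible, which makes it easier to compare strategy variants on identical samples.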

Conclusion

Reducing overfitting when backtesting is difficult. The techniques presented above are advanced and require a good grasp of statistics. If you are just starting out, we would recommend checking out the Backtesting 101 article. If you are a seasoned trader, the techniques above will add further strings to your bow and help build the confidence we all seek when developing our algorithms.

References

[1] Backtesting Portfolios, Prof. Daniel P. Palomar, MAFS5310 - Portfolio Optimization with R, MSc in Financial Mathematics, The Hong Kong University of Science and Technology (HKUST), Fall 2020-21, https://palomar.home.ece.ust.hk/MAFS5310_lectures/slides_backtesting.pdf 

[2] Advances in Financial Machine Learning, Marcos López de Prado, 2018

[3] The Combinatorial Purged Cross-Validation method: indexing example on crypto, Code notebook by Berend Gort, https://colab.research.google.com/gist/Burntt/f26e5414205542207949aeb9e9cc1ddb/demo_purgedkfoldcv.ipynb#scrollTo=KFmER7NGbpgZ 
