Backtesting 101

An empirical research protocol for part-time traders to backtest systematic trading strategies

1. Abstract

Financial trading requires careful consideration to avoid the pitfalls of data mining bias. Most seemingly lucrative strategies generate their profits from random noise rather than repeatable patterns. Hence, it is deceptively easy to find strategies with impressive historical returns that deliver lacklustre results when deployed in live trading.

This whitepaper presents an empirical research protocol specifically tailored to the field of quantitative finance. Like the research protocols widely utilised in other scientific disciplines, its objective is to create a resilient framework that supports the discovery of genuine models by minimising the errors that could otherwise lead to misleading findings.

We know this article is long. Don’t worry - not every article in this newsletter will be this length! This post is meant to be a comprehensive, one-stop introduction to backtesting that you can refer back to time and time again.

2. The basics

A backtest serves as a historical simulation, allowing us to gauge the effectiveness of an algorithmic trading strategy. By running the algorithm through a specified time period, we can calculate the sequence of profits and losses it would have generated. To evaluate the performance of the backtested strategy, popular metrics like the Sharpe ratio are employed to measure the risk-adjusted returns. Armed with these backtest statistics, traders can make informed decisions and allocate capital to the most successful approach. Traders also have an opportunity to identify potential flaws in a model and refine it to improve accuracy.

It is important to distinguish between two distinct perspectives when assessing the measured performance of a backtested strategy: in-sample (IS) and out-of-sample (OOS). The IS performance refers to the simulation conducted using the data used to design the strategy - the "training set". On the other hand, the OOS performance is based on data that was not employed in strategy design - the "testing set". A backtest is considered realistic when the IS performance aligns with the OOS performance. [1]

3. The problem

With the help of modern computer systems, thousands, millions, or even billions of variations of a strategy can be explored and the best-performing variant can be chosen as the "optimal" strategy on the IS dataset. However, excessive testing of countless strategy variations greatly increases the risk of finding a model that performs well by capturing the specific idiosyncrasies of a limited dataset rather than a generalisable pattern.

This false discovery of an "optimal" strategy often performs poorly OOS or in live trading as the selected parameters of the strategy fit the noise in the IS data rather than having any useful predictive power when applied to new datasets. This phenomenon is the result of what is known as “data mining bias” and has been studied extensively in recent theoretical research.

3.1 Deconstructing data mining bias

Data mining is the art of uncovering valuable patterns from massive amounts of data. These patterns hold the power to drive insights and help us make informed decisions. However, in the world of empirical finance, the term "data mining" carries a negative connotation as it suggests a haphazard approach that ignores statistical significance and leads to false discoveries.

Fundamentally, bias emerges from three processes: parameter optimisation, model selection and rejection, and data reuse. Optimisation may lead to overfitting to noise, while selection and rejection introduce selection bias, and data reuse introduces data-snooping bias. The amalgamation of these biases is commonly known as data mining bias. [2]

3.2 The perils of selection bias

Out of all of the components of data mining bias, selection bias is perhaps the stealthiest and most insidious. Both academia and industry experts often engage in running tens of thousands of historical backtests to uncover promising investment strategies. But it is a mere triviality to generate a historical backtest with an impressive Sharpe ratio by exploring thousands of alternative model specifications. What is even more disconcerting is that the authors of these models are incentivised to cherry-pick the best performing backtest and present it as if it were the sole trial, leading to its publication or the launch of a new fund.

The omission of the number of backtest runs and the failure to adjust the reported results accordingly is a well-known issue among practitioners and mathematicians. Very few academic papers disclose the number of trials undertaken to arrive at a particular discovery and surprisingly, standard econometrics textbooks appear to overlook this issue entirely. Acknowledging this issue alone could potentially invalidate the majority of the empirical work conducted over the past decades, as most claimed research findings in financial economics are likely to be false. [3]

To illustrate this point more clearly, consider a classic example of multiple testing. Imagine receiving a promotional email from an investment manager, touting a stock and urging you to evaluate their track record in real time. You're sent weekly emails for 8 weeks, each recommending a single stock as either a long or short pick, and each week the manager's recommendation is correct. It seems incredible, as the probability of this happening by chance is very low - just 39 in 10,000 (0.5^8 ≈ 0.0039).

You're impressed and hire the manager, but later discover their strategy: they simply picked a stock at random and sent out 10,000 emails initially, with 50% recommending it as a long pick and 50% as a short pick. As the stock's value increased, the manager trimmed their mailing list to 5,000 (sending only to those who received long recommendations) and continued to halve the list size every week. By the end of the 8th week, only 39 people would have received the supposed "amazing" track record of 8 consecutive correct picks. But in reality, these 39 people were simply the product of chance. If they had realised how the promotion was organised, they would have expected to get such a track record. Thus, there was no real skill involved - it was all random.

This scenario has many practical applications, particularly when evaluating trading strategies. When evaluating a high number of strategies, for instance 10,000, some will inevitably outperform year after year. In fact, even when choosing strategies at random, one would expect at least a few hundred to deliver significant outperformance, purely by chance.
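To make this concrete, here is a minimal simulation sketch - the number of strategies, the return distribution and the horizon are arbitrary assumptions - showing how the best of many completely skill-less strategies can still post an impressive Sharpe ratio purely by chance.

```python
import numpy as np

rng = np.random.default_rng(42)

n_strategies = 10_000   # number of random "strategies" tried (assumption)
n_days = 252 * 3        # three years of daily returns (assumption)

# Each strategy is pure noise: zero-mean daily returns with 1% daily volatility.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualised Sharpe ratio of each strategy (mean / std, scaled by sqrt(252)).
sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"Best Sharpe out of {n_strategies} skill-less strategies: {sharpe.max():.2f}")
print(f"Strategies with Sharpe > 1: {(sharpe > 1).sum()}")
```

Running this typically reports a "best" strategy with a Sharpe ratio above 2 and several hundred strategies with a Sharpe ratio above 1, even though every single one of them is pure noise.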

4. A protocol to mitigate data mining bias

Distinguishing models that truly work from those that are simply the byproduct of data mining is arguably the greatest problem plaguing traders. Despite the abundance of purportedly successful models, the potential for mistakes remains ever-present.

There are a variety of techniques and heuristics practitioners can employ to explore and optimise trading strategies without falling prey to data mining bias. Broadly speaking, these practices fall under 3 categories that are captured by these fundamental questions:

  1. How do we correctly handle backtesting data?

  2. What are the properties of the trading strategies that should be backtested?

  3. How do we evaluate the performance of trading strategies?

In the sections that follow, we propose a research protocol for reliable backtesting to provide answers to the above questions. Included are methods to mitigate data mining bias at every stage of the construction and evaluation of trading strategies.

5. Correct handling of data

5.1 Minimise the number of trials

As a trading system developer explores an increasing number of ideas, whether new or modified, the presence of data-mining bias becomes more pronounced. Eventually, this bias grows to such an extent that the likelihood of uncovering a genuine trading advantage through backtesting approaches zero. Hence, minimising the number of backtest trials is paramount.

Presented in Figure 1 is a compelling visualisation illustrating the minimum backtest length needed to safeguard against the emergence of unskilled strategies, boasting a Sharpe ratio of 1. This metric is intricately linked to the number of trials, as elucidated by researchers in a popular paper. [1]

Figure 1: The minimum backtest length required for the expected maximum Sharpe ratio to remain below 1, as a function of the number of trials

In order to ensure an accurate assessment of strategies, it is imperative to employ diligent tracking methods. For example, merely relying on a t-statistic of 2 or higher as a benchmark is insufficient, as chance alone may easily cause one out of just 20 randomly selected strategies to surpass this threshold. Therefore, when multiple strategies are tested, the t-statistic of 2 loses its significance as a meaningful benchmark.

To address this, we must keep track of the number of strategies attempted and also measure their correlations. Furthermore, it is important to note that strategies with relatively low correlations will face a more stringent threshold penalty. For instance, if the 20 strategies tested exhibit a very strong correlation, the overall process can be likened to testing just one strategy.

There is a rich corpus of research detailing several methods that can be employed to quantify and account for the statistically deleterious effects of multiple testing. These include but are not limited to: a deflated Sharpe ratio; family-wise error rate tests (for example the Bonferroni test); the false discovery rate; and data mining bias adjusted empirical t-statistic values. This is a specific topic that deserves more consideration and will be the dedicated subject of discussion for a future article of this newsletter.
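As a small taste of what such adjustments look like in practice, the sketch below applies a Bonferroni-style correction to the t-statistic threshold as the number of (approximately independent) trials grows; the significance level and trial counts are illustrative assumptions, and correlated trials would reduce the effective number of independent tests, as noted above.

```python
from scipy import stats

alpha = 0.05  # desired family-wise significance level (assumption)

for n_trials in (1, 20, 100, 1000):
    # Bonferroni: each individual test must clear alpha / n_trials,
    # so the required t-statistic (two-sided, normal approximation) rises.
    t_required = stats.norm.ppf(1 - (alpha / n_trials) / 2)
    print(f"{n_trials:>5} trials -> required |t| ≈ {t_required:.2f}")
```

With a single trial the familiar threshold of roughly 2 applies; by 1,000 trials the required t-statistic climbs above 4, which is exactly why untracked multiple testing quietly destroys the meaning of a "significant" backtest.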

5.2 Maximise the sample size

As implied by Figure 1, besides limiting the number of trials to an absolute minimum, another lever traders can use to minimise data mining bias is increasing the number of trades. To accurately determine the expected returns of a trading strategy, it is essential to have a large enough trade sample, owing to the non-stationary nature of price series and the occurrence of fat-tailed distributions. The longer a backtest is, and the more market regimes it spans, the higher the statistical significance of its results and the less likely it is that any observed outperformance results from pure luck.

The statistical power of a hypothesis test refers to the probability of detecting an effect when a true effect exists. In a similar vein, the statistical power of an optimisation is the likelihood of identifying an edge when a genuine edge is there to be discovered. The magnitude of the edge determines how much data is needed: robust yet simple strategies with modest edges will require more data to establish the statistical significance of their outperformance.

Non-stationary: having statistical properties that evolve over time.

Fat-tailed distribution: a probability distribution with a greater-than-expected probability of extreme events. In this context, it implies that extreme events exert a great influence on expected future risk.
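To put rough numbers on this, here is a minimal sketch of the standard power calculation applied to a per-trade edge: how many trades are needed before a true mean edge of a given size (expressed in standard deviations of trade PnL) becomes detectable at conventional significance and power levels. The edge sizes, significance level and power are assumptions for illustration.

```python
from scipy import stats

alpha, power = 0.05, 0.80                  # significance level and power (assumptions)
z_alpha = stats.norm.ppf(1 - alpha / 2)    # two-sided critical value
z_beta = stats.norm.ppf(power)

for per_trade_edge in (0.10, 0.05, 0.02):  # mean PnL per trade / PnL std dev (assumptions)
    n_trades = ((z_alpha + z_beta) / per_trade_edge) ** 2
    print(f"edge of {per_trade_edge:.2f} std devs per trade -> ~{n_trades:,.0f} trades needed")
```

Halving the per-trade edge roughly quadruples the number of trades required, which is why modest but genuine edges demand long, trade-rich backtests.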

5.3 Exclude outlier trades

It is important to exercise caution when evaluating trade outcomes. Trades exhibiting abnormally high profits resulting from unpredictable events should be excluded from the backtest results, as they do not reflect generalisable patterns.
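One simple way to implement this exclusion, shown as a sketch below, is to flag trades whose profit lies many median absolute deviations away from the median trade; the 5-MAD cut-off is an arbitrary illustrative assumption, not a recommendation.

```python
import numpy as np

def exclude_outlier_trades(trade_pnl, n_mads=5.0):
    """Return trade PnLs with extreme outliers removed.

    A trade is flagged as an outlier if it lies more than `n_mads`
    median absolute deviations from the median trade PnL.
    The 5-MAD cut-off is an arbitrary illustrative choice.
    """
    pnl = np.asarray(trade_pnl, dtype=float)
    median = np.median(pnl)
    mad = np.median(np.abs(pnl - median))
    keep = np.abs(pnl - median) <= n_mads * mad
    return pnl[keep]
```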

5.4 Separate testing of strategy components

To effectively evaluate our market hypotheses and ensure the success of our strategy, it is crucial to assess each component separately. Rushing into the creation of a complex strategy with multiple moving parts and fitting it to the data before evaluating individual components can lead to flawed results. Each component, such as filters, indicators, signals, and rules, expresses different aspects of our hypothesis. Therefore, we must confirm that each component is working effectively, adding value, validating assumptions, and improving predictions before moving onto the next.

There are several benefits to testing components separately. Firstly, it guards against overfitting. By quickly rejecting a poorly performing indicator, signal process, or other component, we can avoid wasting time fitting an ineffective strategy to the data. Secondly, specific techniques often have tests that are fine-tuned to that technique, providing better power to detect specific effects in the data. Lastly, testing components individually is more efficient, as it allows us to reject a poorly-formed specification and reuse components with known positive properties to increase the chances of success in a new strategy. Ultimately, testing components separately is a more effective use of your time than going through the entire strategy creation process only to reject it at the end.

5.5 Cross validation with one-shot OOS testing

When it comes to evaluating a trading strategy’s performance on new, untouched data, it’s important to respect the significance of genuine hold-out data. Using OOS test results to begin a fresh run or manually adjust strategy parameters introduces the risk of data-snooping bias. Therefore, the unseen, untouched OOS data set should only be used one time, after all test runs on the IS dataset and strategy optimisations have been completed. Once this final stage in the trading evaluation process is reached, a trading strategy ought to be readily discarded should the final OOS backtest fail to produce results in line with those obtained with the IS data.

While straightforward, one-shot OOS testing is easier said than done. It's hard to imagine abandoning the model and throwing away weeks of effort. Most of us lack the decisive attitude required for this task. Instead, we tend to tweak the model until it performs reasonably well on both IS and OOS data. But in doing so, we inadvertently convert the OOS data into IS data. This approach might seem convenient, but it can lead to serious inaccuracies in our results. [4]
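In practice, the discipline is largely organisational. The sketch below (the split fraction and the structure itself are assumptions) keeps a single chronological hold-out that can be read exactly once, making it harder to quietly turn OOS data into IS data.

```python
import pandas as pd

class OneShotHoldout:
    """Chronological IS/OOS split where the OOS slice can be read only once."""

    def __init__(self, prices: pd.DataFrame, oos_fraction: float = 0.25):
        split = int(len(prices) * (1 - oos_fraction))
        self.insample = prices.iloc[:split]
        self._oos = prices.iloc[split:]
        self._oos_used = False

    def out_of_sample(self) -> pd.DataFrame:
        # Enforce the one-shot rule: a second request means the OOS data
        # is about to be turned into training data.
        if self._oos_used:
            raise RuntimeError("OOS data already used once - discard the strategy or obtain new data")
        self._oos_used = True
        return self._oos
```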

6. The types of trading strategies to backtest

6.1 Formulate a strong ex ante hypothesis

Backtesting should only be utilised by traders when they have a solid, unique idea or hypothesis to test. The primary aim should be to challenge and disprove the concept, rather than bolstering it with additional filters and conditions. The critical error traders make when backtesting is refining the specific rules of their strategy until the underlying idea appears validated, when the opposite approach is required.

This systematic process aligns seamlessly with the principles of the scientific method. A hypothesis is crafted, and empirical tests are designed to unearth evidence that challenges the hypothesis, a concept known as falsifiability.

The formulation of a logical hypothesis serves as a guiding principle, mitigating the risks of succumbing to data snooping bias from excessive adaptation to the available data. The absence of a sound hypothesis amplifies the likelihood that the model’s IS results are a false positive and its performance will falter in live trading scenarios.

Furthermore, it should be noted that crafting a narrative after the completion of data mining exercises is an error frequently made. Such an after-the-fact story often appears frail, especially if the data mining process had yielded contrasting outcomes. Had the data mining pointed towards an opposing result, the subsequent narrative could have easily aligned with that opposite conclusion. Instead, a logical foundation should precede the data analysis, and a series of empirical tests ought to be designed to gauge the resilience of that foundation. [5]

6.2 Keep it stupid simple. It’s simple, stupid!

When it comes to the issue of overfitting, Enrico Fermi once said: “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” [6]

This principle holds true for backtesting trading strategies as well, presenting traders with some intriguing peculiarities. The scarcity of data in empirical finance further underscores the importance of minimising model complexity, as the sample size required to establish statistical significance rises exponentially with a model’s degrees of freedom. Beyond a certain threshold, adding more parameters can actually diminish the model's performance. This phenomenon is commonly known as "the curse of dimensionality" or the Hughes phenomenon.

To exemplify this concept, picture each parameter of a model as a dimension in a parameter space and consider the following scenario: imagine you're selling rulers in your store. If you only consider length - one dimension, you may only need 5 different sizes to test and see what sells. However, if you're selling boxes, you now have three dimensions to consider. To accurately test which sizes people prefer, you might need to test 125 different sizes (5 sizes for each of the 3 dimensions - 5^3 = 125). If your sample size is limited, your results may be skewed or inconclusive.

The curse of dimensionality arises from the exponential reduction in sample density between data points as the dimensionality increases. As we continue to introduce additional features without a proportional increase in training samples, the feature space expands and becomes increasingly sparse (the Euclidean distance between points in this parameter space increases). This sparsity makes it easier to stumble upon a seemingly "perfect" model, often resulting in overfitting.

Figure 2: Visual illustration of how increasing dimensionality reduces sample density
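A quick numerical sketch of this sparsity effect (the sample size and dimensionalities are arbitrary assumptions): with a fixed number of points scattered in a unit hypercube, the average distance to the nearest neighbouring point grows rapidly as dimensions are added.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500  # fixed "sample size" (assumption)

for dims in (1, 2, 5, 10, 20):
    pts = rng.uniform(size=(n_points, dims))
    # Pairwise Euclidean distances; mask the zero self-distances.
    diffs = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    print(f"{dims:>2} dimensions -> mean nearest-neighbour distance {dist.min(axis=1).mean():.3f}")
```

The same 500 samples that densely cover one dimension become vanishingly sparse in twenty, which is precisely the room an overfit model needs to "explain" the data.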

Trading and investing are multifaceted undertakings that demand individuals to be prepared to face financial losses. However, in a bid to dodge such losses, traders frequently resort to introducing additional parameters and more intricate filters etc. Paradoxically, this inclination to add complexity to trading models often amplifies the likelihood of failure in live trading. The simple truth is that acknowledging and accepting losses as an inevitable part of the process ultimately paves the way to long-term profitability.

Indeed, the curse of dimensionality places simplicity in the spotlight, revealing its innate sophistication. This can also be viewed as the mathematical instantiation of Occam’s Razor, a philosophical concept which asserts that when presented with competing hypotheses to explain a phenomenon, you should lend preference to the one predicated on the fewest assumptions. So to safeguard against overfitting, it is prudent to minimise the trading strategy’s number of tunable parameters. A valuable guideline worth considering is to set an absolute maximum limit of two parameters. This also decreases the need for statistical adjustments to account for data mining bias and the number of trials.

7. Evaluating strategy performance

A nuanced yet vital point to note is that the aim of backtesting is not merely to eliminate all false positives. Such a goal would be simple to achieve - we could simply reject every single strategy or run no backtests at all. Instead, traders face the challenge of striking a delicate balance between minimising false discoveries and not imposing excessively restrictive rules that would lead to overlooking promising strategies.

The main limitation of the aforementioned concept of minimum backtest length is the use of the Sharpe ratio as the benchmark against which the effect of multiple testing is measured. The Sharpe ratio is a fundamentally overfitting-agnostic metric: every strategy with a noteworthy edge has a high Sharpe ratio, but so does every overfit strategy. Hence, the Sharpe ratio alone carries no inherent information to help discriminate a genuine discovery from a false one.

In this section we present methods of gauging the resilience of a strategy that can selectively filter out most random strategies with high Sharpe ratios while discarding very few strategies with true edges and high Sharpe ratios. Utilised once and exclusively in the conclusive stage of a strategy’s evaluation process, these tests can significantly reduce the incremental injection of data mining bias or deterioration in statistical significance per trial run. The implied benefit here is that with the number of trials and backtest length remaining equal, these tests bolster the statistical significance of findings from backtesting. Alternatively, with the statistical significance remaining equal, traders can afford lower backtest lengths and a higher number of trials.

7.1 Robustness tests

The examination of a trading strategy for its robustness is often referred to as sensitivity analysis or, in more colloquial terms, stress testing. The underlying concept is simple: observe the outcome when minute alterations are introduced to the strategy inputs, price data, or other elements within the trading environment. A robust strategy exhibits resilience through relatively unaltered performances, whereas an overfit strategy will typically react disproportionately, sometimes even succumbing entirely, to such changes.

The working definition of an overfit strategy is one that is tailored so precisely to the market during its development phase that it becomes incapable of withstanding any market changes. Ensuring a trading system’s robustness through comprehensive stress testing prepares it for the ever-evolving nature of the markets. Consider the strategy inputs, for instance. Inputs like the lookback length for a moving average may be optimal during the backtest period, but as time progresses, different values may yield greater efficacy. Traders need to ascertain how well the strategy will fare when the inputs are no longer optimal.

Numerous methodologies exist to stress test a strategy. But broadly speaking, they all fall under two main categories:

  1. Testing sensitivity to changing price data;

  2. Testing sensitivity to changing parametric values.

Employing a series of rigorous robustness tests maximises the odds that the system doesn't just excel on training data, but also maintains its performance levels on future live data that has yet to be encountered. These concepts aren’t just intuitive and valid in theory, but they have also been empirically demonstrated by several studies to be statistically significant when put into practice. To name one, a report produced by a managed fund measured the impact of robustness tests on OOS profitability of 1,000 randomly generated US30 strategies. While only 46% of the 1,000 strategies were profitable in the OOS, an impressive 87% of strategies that passed all their robustness tests were profitable. [7]

Clearly, for traders developing their algorithmic systems, adhering closely to such a robustness-driven framework is conducive to sustained success in the long run. The following sections outline a series of common robustness tests traders should consider integrating into their backtesting.

7.1.1 Testing on multiple markets and timeframes

As mentioned previously, maximising the backtest length is essential to minimise the unwanted effects of data mining. The most obvious way of increasing the backtest data sample size is by subjecting the strategy, whenever possible, to rigorous testing on various symbols and timeframes. While each market has its unique traits, such as daily volatility, consistent results (or maintained profitability at a minimum) across multiple markets and timeframes are a positive indicator of robustness.

Some traders have a strong preference for multi-market strategies to avoid overfitting, while others will tend to focus on single-market strategies. When building strategies, consider the trade-off between robustness and performance. Expecting a strategy designed for multiple markets to perform as well on any given market as one designed specifically for that market is unrealistic. However, grouping similar markets and building strategies for each group can strike a good balance.

As an example, should a system be crafted for the E-mini S&P 500 futures (ES) and exhibit outperformance, its prowess would ideally persist on another market, such as the E-mini Russell 2000 futures (RTY), which shares similarities but possesses its own distinct characteristics. Although the two markets are related and highly correlated, if the performance is consistent, we can infer that the strategy is unaffected by random market noise as the random elements differ in related markets. Furthermore, it's plausible to assume that the strategy logic is focusing on elements common to both markets. As both markets are stock index futures, the elements are probably related to the trading patterns of stock index futures.

Similarly, consider a trading strategy developed for the 5-minute bar time frame of the E-mini S&P 500 futures (ES). One way to alleviate concerns about the strategy being overfit to spurious patterns on that timeframe would be to ensure outperformance persists with signals generated on different but similar timeframes, for example, on 4, 6 and 7 minute bars.

7.1.2 Testing on time-shifted data

Data providers typically deliver hourly bars that open at the top of the hour and close at the end of it - for instance, bars covering 8:00 to 9:00, 9:00 to 10:00, and so forth. A simple robustness test involves shifting the opening and closing times of every bar by a few minutes (for example, the 9:00 bar would instead open at 9:04 and close at 10:04). By doing this, we gain additional data that has the same basic properties as the historical market data we used to construct our strategy, but isn't an exact replica.

Shifting the data can work for any timeframe and has the potential to alter bar patterns, highs and lows, consecutive up or down streaks, indicator calculations, and other factors. By retesting your strategy on shifted data, you can determine whether your strategy is overly reliant on the precise patterns found in the historical market data. Should the strategy’s performance degrade significantly on time-shifted data, then it may be an indication that your strategy is overfit and not adaptable to market changes.
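A minimal sketch of this test using pandas is shown below; the 4-minute offset, the column names and the assumption of an underlying 1-minute bar feed are all illustrative. The strategy would then be re-run on the shifted bars and its performance compared with the original results.

```python
import pandas as pd

def shifted_hourly_bars(minute_bars: pd.DataFrame, offset_minutes: int = 4) -> pd.DataFrame:
    """Rebuild hourly OHLC bars whose boundaries are shifted by `offset_minutes`.

    `minute_bars` is assumed to be a DataFrame of 1-minute bars indexed by
    timestamp, with 'open', 'high', 'low', 'close' columns.
    """
    return minute_bars.resample("1h", offset=pd.Timedelta(minutes=offset_minutes)).agg(
        {"open": "first", "high": "max", "low": "min", "close": "last"}
    ).dropna()
```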

7.1.3 Noise testing

In the context of harvesting a trading edge, price data can be broken down into two components: the signal and the noise. To ensure that the strategy being developed remains impervious to market noise, the most straightforward method is to verify that outperformance persists after superimposing random noise on the price data used during the construction phase.

To introduce random price variations, we utilise two settings. First, the probability of changing any price (open, high, low, close) per bar. Second, the maximum percentage change sets the limit for price adjustments. The actual change is randomly chosen within the range of -Max to +Max, where Max is a percentage of the average true range over a certain lookback period. For instance, if the average true range is 10 points and the maximum percentage change is 20%, the alteration falls between -2 and +2 points. To maintain price order, adjustments are made to keep the open and close within the high/low range.

To truly gauge the resilience of a strategy, we need to go beyond single alternative comparisons. Instead, we can employ Monte Carlo simulations, which involve iteratively varying input variables in a random manner to generate a statistical distribution of results for the dependent function. As a general rule of thumb for Monte Carlo tests, 100-1,000 simulations should suffice to get a clear picture of a strategy’s vulnerabilities. If the system’s performance persists across most or all of the noise-testing simulations, it is more likely to persist in live trading as well.
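Below is a minimal sketch of generating one noise-altered price series following the settings described above; the change probability, maximum change, ATR lookback and column names are illustrative assumptions, and price order is maintained here by widening the high/low to bracket the perturbed open/close (one of several reasonable conventions). In a full Monte Carlo run this function would be called a few hundred times and the strategy backtested on each output.

```python
import numpy as np
import pandas as pd

def add_price_noise(bars: pd.DataFrame, change_prob=0.3, max_change=0.2,
                    atr_lookback=14, seed=None) -> pd.DataFrame:
    """Return a copy of OHLC `bars` with random noise added to prices.

    Each open/high/low/close value is altered with probability `change_prob`
    by a random amount in [-max_change, +max_change] * ATR. All parameter
    values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    noisy = bars.copy()

    # Average true range over the lookback window.
    prev_close = bars["close"].shift(1)
    true_range = pd.concat([
        bars["high"] - bars["low"],
        (bars["high"] - prev_close).abs(),
        (bars["low"] - prev_close).abs(),
    ], axis=1).max(axis=1)
    atr = true_range.rolling(atr_lookback).mean().bfill()

    for col in ("open", "high", "low", "close"):
        change = rng.uniform(-max_change, max_change, len(bars)) * atr
        mask = rng.random(len(bars)) < change_prob
        noisy[col] = noisy[col] + np.where(mask, change, 0.0)

    # Keep bars internally consistent: high/low must still bracket open/close.
    noisy["high"] = noisy[["open", "high", "low", "close"]].max(axis=1)
    noisy["low"] = noisy[["open", "high", "low", "close"]].min(axis=1)
    return noisy
```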

7.1.4 Randomised entry and exit testing

Another type of Monte Carlo test involves shifting the exact entry and exit points of trades earlier or later by a moderate, random amount. The intuition here again is very simple: if a strategy’s performance is contingent on hyper-specific timing with little to no flexibility, this is clearly a sign of overfitting to noise, leaving us with little hope of outperformance in live trading. Similarly to noise testing, alterations to the entry and exit times of every trade can be computed using the following settings: the probability of changing the trade times (entry and exit), and the actual change randomly selected within a predetermined -Max to +Max range.
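A minimal sketch of such a perturbation is shown below, operating on arrays of entry and exit bar indices; the change probability and maximum shift are illustrative assumptions, and the trade results would then be recomputed from the perturbed times.

```python
import numpy as np

def perturb_trade_times(entry_bars, exit_bars, change_prob=0.5, max_shift=3, seed=None):
    """Randomly shift trade entry/exit bar indices by up to `max_shift` bars.

    Each entry/exit is moved with probability `change_prob` by a random whole
    number of bars in [-max_shift, +max_shift]; exits are kept at or after
    their entries. Parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    entries = np.asarray(entry_bars).copy()
    exits = np.asarray(exit_bars).copy()

    for arr in (entries, exits):
        shift = rng.integers(-max_shift, max_shift + 1, size=len(arr))
        mask = rng.random(len(arr)) < change_prob
        arr += np.where(mask, shift, 0)

    exits = np.maximum(exits, entries)  # an exit cannot precede its entry
    return entries, exits
```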

7.1.5 System parameter permutation

To ensure the success of any trading system, it's crucial for traders to gain a thorough understanding of what to expect in terms of performance before allocating capital. Failing to comprehend the probable ranges of performance could lead to abandoning a promising system during an unexpected drawdown or period of underperformance. Worse still, capital may be allocated based on unrealistic expectations, which traditional evaluation methods can inflate, resulting in bad investments.

Most traditional system development approaches rely on a single point estimate of performance, based on a single sequence of trades and a single combination of optimised parameter values. But a system that delivers extreme performance in historical simulation is unlikely to maintain the same level of performance in the future. This is because of the statistical phenomenon known as regression toward the mean, which dictates that extreme values will tend to move closer to the mean over time. Hence, relying on such limited information makes it difficult to allocate capital effectively.

This is where system parameter permutation (SPP) comes in, providing valuable probabilistic information that helps traders make informed decisions. First introduced by Dave Walton [8], SPP involves testing all possible parameter combinations of a strategy to generate sampling distributions of system performance metrics. By leveraging the statistical law of regression toward the mean and making the most out of historical market data, SPP provides a practical way to estimate the realistic performance of trading system edges and perform statistical significance testing. The SPP process follows these general steps:

  1. The system developer determines the parameter scan ranges for the system being backtested

  2. Each parameter scan range is divided into an appropriate number of observation points representing specific parameter values.

  3. A historical simulation is run for every combination of parameter values, i.e. every system variant.

  4. The simulated results for each system variant are combined to create a sampling distribution for each performance metric of interest, such as CAGR, maximum drawdown, Sharpe ratio, and so on. Each point on a distribution represents the result of a historical simulation run from a single system variant.

The first use of the generated sampling distributions is to adopt the median values as the estimate of future performance, for several reasons. Firstly, the median is not subject to data mining bias because no selection is involved. Secondly, no assumptions regarding the shape of the distribution are required. Lastly, the median is robust in the presence of outlier values. Hence, by utilising the median performance as the best estimate of true system performance, traders can make informed decisions based on more reliable data.

Moreover, the resulting cumulative distribution functions (CDFs) of system performance metrics allow for practical contingency planning based on probabilities, promoting a more realistic approach to decision-making. For example, you may want to test the statistical significance of the SPP long-run performance estimates in absolute or relative terms to a benchmark. Since SPP generates complete sampling distributions, estimated p-values and confidence levels can be directly obtained from the CDF, as shown in figure 3.

CDFs also allow you to make a probability-based decision on whether or not to invest capital into the system. You can set a probability level that is highly improbable but still acceptable as your worst-case scenario (common levels are 1% or 5%). Alternatively, you may specify the least favourable but tolerable level of performance as your worst-case. Once the worst-case probability or level of performance is determined, the CDF of system metrics is examined as illustrated in figure 3. If the worst-case scenario at the respective probability level cannot be accepted, then capital should not be allocated to the system. [8]

Figure 3: Graphs illustrating two different use cases of CDFs (cumulative distribution functions) generated from System Parameter Permutation [8]

It is worth noting that to prevent bias, we must carefully choose the parameter scan ranges before conducting SPP. Repeating SPP with different parameter scan ranges in an attempt to improve the result can lead to positively biased estimates, which would invalidate one of the core strengths of SPP.
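The core SPP loop is simple to sketch (the backtest function, parameter scan ranges and metric below are hypothetical placeholders): run a backtest for every parameter combination, collect the chosen metric, and read the median estimate and a worst-case percentile off the resulting sampling distribution.

```python
import itertools
import numpy as np

def system_parameter_permutation(backtest_fn, param_ranges, worst_case_pct=5):
    """Run `backtest_fn` for every parameter combination and summarise the results.

    `backtest_fn(**params)` is assumed to return a single performance metric
    (e.g. CAGR); `param_ranges` maps parameter names to the scan values chosen
    *before* running SPP. Returns the median estimate, a worst-case percentile
    and the full sampling distribution.
    """
    names = list(param_ranges)
    results = []
    for combo in itertools.product(*(param_ranges[n] for n in names)):
        results.append(backtest_fn(**dict(zip(names, combo))))
    results = np.asarray(results)

    return {
        "median": np.median(results),                          # best estimate of true performance
        "worst_case": np.percentile(results, worst_case_pct),  # e.g. the 5th percentile of the CDF
        "distribution": np.sort(results),                      # support of the sampling distribution
    }

# Hypothetical usage with a placeholder backtest function:
# spp = system_parameter_permutation(
#     run_backtest, {"lookback": range(10, 101, 10), "threshold": [0.5, 1.0, 1.5]})
```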

7.1.6 System parameter sensitivity testing

Scrutinising the sampling distribution generated using SPP is not the only way to gauge a strategy’s sensitivity to alterations in its parameters. We can also observe the range of outcomes that remains stable even when variables are slightly altered. Imagine a topographical map: the optimal outcome isn't just the highest peak, but the widest expanse of hills. By rigorously testing each parameter of our system, we reduce the risk of accidental discoveries. If a range of values produces consistent levels of profitability, it gives us confidence in the system's longevity. Conversely, if the system is only successful at a single peak or a sparse assortment of peaks, i.e. very narrow ranges of parameter values, that is a prime indicator of overfitting. In such cases, we must reject the system.

To assess an optimisation profile, consider these key factors:

  1. The proportion of positive optimisation runs, indicating that the strategy performs well across a wide range of parameters.

  2. A median profit above zero, indicating overall effectiveness.

  3. A uniform distribution of profits, avoiding extreme fluctuations between positive and negative results.

  4. The top optimisation result should not deviate too much from the average result, to avoid distortion from a single exceptional run.

  5. The 3D chart of the optimisation landscape should exhibit a stable shape, which can be evaluated visually.

Figure 4: Illustrative trading system optimisation profile for a system with 2 degrees of freedom

For a system with two degrees of freedom, a three-dimensional optimisation profile (see Figure 4) can help us immediately identify one or multiple combinations of parameter values in a stable region that are likely to perform well in live trading. It should be noted that in this instance, we are optimising for robustness, not performance - we are not selecting the parameter values corresponding to the highest peak in performance, as that would lead to overfitting.

Lastly, a possible enhancement for system parameter sensitivity testing is to combine it with noise testing. By optimising on not just one data series, but 100-1000 noise altered ones, we can further increase our conviction in the system’s resilience. Creating the new optimisation profile would involve the following steps:

  1. Generate 100-1000 variants of noise induced prices series.

  2. For every combination of parameter values, run a backtest on all the generated price series.

  3. For each set of backtests generated for each combination of parameter values, use the median metric values as data points to plot on the optimisation profile.
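A sketch of how this noise-augmented optimisation profile could be assembled from the steps above is shown below; the backtest function, the two parameter grids and the list of noise-altered series (for example, produced by a function like add_price_noise above) are hypothetical placeholders.

```python
import numpy as np

def noise_robust_profile(backtest_fn, noisy_series, param1_values, param2_values):
    """Build a 2-D optimisation profile where each cell is the median metric
    over many noise-altered price series.

    `backtest_fn(prices, p1, p2)` is assumed to return a single performance
    metric; `noisy_series` is a list of noise-perturbed price histories.
    """
    profile = np.empty((len(param1_values), len(param2_values)))
    for i, p1 in enumerate(param1_values):
        for j, p2 in enumerate(param2_values):
            metrics = [backtest_fn(prices, p1, p2) for prices in noisy_series]
            profile[i, j] = np.median(metrics)
    return profile

# The resulting matrix can then be inspected for broad, stable regions of
# positive performance rather than a single sharp peak.
```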

7.1.7 Walk forward analysis

Walk forward analysis (WFA) involves testing the system's parameters using a rolling optimisation and backtest pattern, creating a series of backtests with non-optimised data (see Figure 5). The process begins with a period of parameter optimisation, followed by a backtest period. By repeating this pattern forward through time, the system's parameters are tested on data they have not been optimised for.

It's essential to note that the system has been developed using the same data that the WFA is conducted over. So while WFA doesn't solve overfitting, create truly unseen data or turn training data into out-of-sample data, it does offer insight into how the parameters react to the optimisation and backtest sequence. If the walk forward optimised parameters perform well across the backtest, it's a sign that the system is adaptable to changing market conditions and can handle unknown future data using parameters that were not optimised for that data.

Figure 5: Diagram showing how data is split in a typical walk forward test
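A minimal sketch of generating the rolling optimisation/backtest windows is shown below; the window lengths and the optimise/backtest helpers referenced in the usage comment are hypothetical assumptions.

```python
def walk_forward_windows(n_bars, train_bars=2000, test_bars=500):
    """Yield (train_start, train_end, test_end) index triples for a walk forward run.

    The strategy is optimised on bars [train_start, train_end) and then
    backtested, with the optimised parameters, on bars [train_end, test_end).
    Window lengths are illustrative assumptions.
    """
    start = 0
    while start + train_bars + test_bars <= n_bars:
        train_end = start + train_bars
        yield start, train_end, train_end + test_bars
        start += test_bars  # roll the whole window forward by one test period

# Hypothetical usage with placeholder optimise/backtest helpers:
# for tr_start, tr_end, te_end in walk_forward_windows(len(prices)):
#     params = optimise(prices[tr_start:tr_end])       # optimisation window
#     stats = backtest(prices[tr_end:te_end], params)  # walk-forward backtest window
```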

While recognising the merits of WFA, we would be remiss if we did not also acknowledge its many limitations. One of the main drawbacks of WFA is that it tests a single price path, unlike other tests such as noise testing or other Monte Carlo methods, which test multiple price paths simultaneously. Furthermore, walk forward analysis involves multiple data splits, leading to an increased risk of data leakage. Data leakage occurs when information from the IS period is unintentionally used during the OOS period. To illustrate this, consider a trading strategy that utilises a 100-period simple moving average. After each data split, the first 99 bars will reference data from the previous in-sample period, resulting in data leakage. The more data splits performed, the greater the extent of leakage.

There are more sophisticated techniques that address some of these limitations, such as combinatorial purged cross-validation, proposed by Marcos Lopez de Prado in Advances in Financial Machine Learning. While these techniques are beyond the scope of this introduction to backtesting, they will be explained in detail in a future article.

7.2 Market regime conditional analysis

The practices and tests presented thus far help us mainly in answering one fundamental and quantitative question: “did this strategy have a statistically significant edge in the past?”. Other questions that are arguably just as important are “Why did this strategy work in the past? Will this strategy continue to thrive in the future amidst changing market conditions?”. These are more qualitative questions that we can begin to answer through market regime conditional analysis.

Market regime conditional analysis involves studying the relationship between market variables and the performance of a trading strategy under different market conditions. It aims to identify specific market characteristics that significantly influence the strategy's performance. To name a few, these characteristics could include volatility levels, trending and mean-reverting behaviour, liquidity conditions, or macroeconomic factors. By conducting a thorough analysis, traders can gain insights into how their strategies perform across different market regimes, helping them understand the strengths and weaknesses of their approaches. This analysis can also serve as a reality check for our hypothesis: if our analysis of regime-specific performance contradicts our expectations, it indicates a lack of understanding or ineffective construction of the strategy. In both cases, further analysis becomes necessary.

The relevance of market regime conditional analysis for backtesting a trading strategy follows from the realisation that any claims made about strategy evaluation without accounting for changing market conditions are severely limited. Markets are highly complex, constantly evolving, and there’s no guarantee that future market regimes will be the ones favourable to your strategy. Hence, discovering and accounting for the market conditions that are necessary for our strategy to thrive can help us predict whether it will continue to deliver on unseen data, given any state of the market. This is why contextualising a strategy’s returns to different market conditions is paramount and highly complementary to tests based on statistical significance. Taken in isolation, statistical analysis cannot predict when these models will cease to be effective. It disregards one of the most critical causes of strategy performance deterioration - the curse of non-stationarity.

Performing such an analysis can also enhance strategy performance and reduce the risk of overfitting. By incorporating the dynamics of different market regimes, traders can develop strategies that are more robust and capable of adapting to changing market conditions. Strategies that perform well in one regime may underperform or fail in another, so understanding the regime-specific behaviour is crucial. By backtesting a strategy across different regimes and analysing the results, traders can gain insights into the strategy's stability, profitability, and potential risks.

Market regime classification takes the analysis a step further by categorising historical market data into distinct regimes. These regimes represent different market environments with unique characteristics. This classification can be based on statistical techniques, machine learning algorithms, or a combination of both. Commonly used techniques include clustering algorithms, hidden Markov models, fuzzy logic or decision trees. By classifying market regimes, traders can better understand underlying market conditions that contribute to profitability and anticipate their potential shifts. Alternatively, strategy inputs could be tailored to each regime, effectively adapting the strategy to different market conditions.
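As one deliberately simple, concrete instance of such an analysis, the sketch below conditions daily strategy returns on terciles of the rolling realised volatility of the underlying market; the regime definition, lookback and inputs are illustrative assumptions standing in for the more sophisticated classifiers mentioned above.

```python
import numpy as np
import pandas as pd

def performance_by_volatility_regime(strategy_returns: pd.Series,
                                     market_close: pd.Series,
                                     vol_lookback: int = 20) -> pd.DataFrame:
    """Summarise daily strategy returns conditional on a simple volatility regime.

    Regimes are defined as terciles of rolling realised volatility of the
    underlying market; both inputs are assumed to share the same daily index.
    """
    market_returns = market_close.pct_change()
    realised_vol = market_returns.rolling(vol_lookback).std() * np.sqrt(252)
    regime = pd.qcut(realised_vol, 3, labels=["low vol", "mid vol", "high vol"])

    grouped = strategy_returns.groupby(regime)
    return pd.DataFrame({
        "mean_daily_return": grouped.mean(),
        "annualised_sharpe": grouped.mean() / grouped.std() * np.sqrt(252),
        "hit_rate": grouped.apply(lambda r: (r > 0).mean()),
    })
```

If the resulting table contradicts the hypothesis behind the strategy - for example, a supposedly trend-following system that only makes money in the low-volatility regime - that is exactly the kind of reality check this analysis is meant to provide.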

Finally, market regime analysis helps traders to manage risk effectively. By recognizing the transition between different market regimes, traders can adjust their positions, portfolio allocations, or risk management strategies accordingly. For example, during periods of high volatility or market stress, a trader may choose to reduce leverage, increase position sizes in safe-haven assets, or implement hedging strategies. Market regime analysis provides a framework for dynamically adjusting risk exposure based on the prevailing market conditions.

8. Conclusion

Backtesting is a double-edged sword. Used inappropriately, it is very easy for traders to be lulled into a false sense of security as backtesting becomes just a source of confirmation bias. Typically, when a research team has faith in a strategy's soundness but the backtest results disappoint, they don't abandon it. Instead, they suggest improvements and add rules to the trading system until the results conform to their preconceived notions. Backtests then become rife with overfitting and data snooping. In such cases, if investors allocate capital based on these backtests, the ensuing performance will likely be disappointing.

When executed correctly, backtesting emerges as an invaluable tool that enhances traders’ understanding of strategy performance. Incorporating the following methods and heuristics into your strategy evaluation framework can help set realistic expectations and improve overall trading performance:

  1. Maximise the backtest sample size;

  2. Minimise the number of backtest trials;

  3. Exclude outlier trades;

  4. Test every strategy component separately;

  5. Backtest every strategy only once on OOS data;

  6. Start from a logical hypothesis;

  7. Minimise model complexity;

  8. Test every strategy on multiple markets;

  9. Backtest strategies on multiple time frames;

  10. Backtest using time-shifted data;

  11. Run Monte Carlo simulations on noise induced price series;

  12. Run Monte Carlo simulations with randomised entry and exit times;

  13. Perform system parameter permutation analysis;

  14. Perform system parameter sensitivity analysis;

  15. Perform walk forward analysis;

  16. Perform market regime conditional analysis.

9. References

[1] “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance”, David H. Bailey, Jonathan Borwein, Marcos Lopez De Prado, Qiji Jim Zhu, 2014

[2] “Limitations of Quantitative Claims About Trading Strategy Evaluation”, Michael Harris, 2016

[3] “Backtesting”, Marcos Lopez De Prado, 2015

[4] “Quantitative Trading: How to Build Your Own Algorithmic Trading Business”, 2nd Edition, Ernest P. Chan, 2021

[5] “A Backtesting Protocol in the Era of Machine Learning”, Robert D. Arnott, Campbell R. Harvey, Harry Markowitz, 2018

[6] “A meeting with Enrico Fermi”, Nature (London) 427(6972), 297, F. Dyson, 2004

[7] “How to Avoid Overfitting Using Robustness Tests”, Whitepaper by TipToeHippo.com, 2021

[8] “Know Your System! – Turning Data Mining from Bias to Benefit Through System Parameter Permutation”, Dave Walton, 2014
