Topic: The Bootstrap Test: How significant are your back-testing results?
Anyone come across this article?
The Bootstrap hurdling method for statistical significance seems pretty neat for evaluating the viability of trading strategies. I'm thinking Forex Strategy Builder could somehow implement a similar feature in the future:
In very brief terms, the concept uses hypothesis testing to verify whether the test statistic (such as mean return of the back-testing sample) is statistically significant. This is done by establishing the p-value of the test statistic based on the sampling distribution. (Aronson covers the basics of statistical analysis earlier in the book. I have also mentioned previously The Cartoon Guide to Statistics, which covers these concepts too)
The problem with back-testing is that the results generated represent a single sample, which does not provide any information on the sample statistic’s variability and its sampling distribution. This is where bootstrapping comes in: by systematically and randomly resampling the single available sample many times, it is possible to approximate the shape of the sampling distribution (and therefore calculate the p-value of the test statistic).
Once the p-value is obtained, it is simply a matter of deciding which threshold qualifies for statistical significance. Scientists usually determine the statistical significance threshold at 0.05 (ie. the null hypothesis would be rejected for any p-value less or equal to 0.05).
As discussed above, the assumption that the rule does not have predictive power is translated to the arithmetic mean of its returns being equal to zero. In the bootstrap method, rejecting the null hypothesis occurs when the mean arithmetic return is statistically significantly positive.
I am usually no big fan of arithmetic mean of returns as it is a flawed indicator of profitability. In effect, a system can have a positive mean arithmetic return and still be unprofitable – think about a return of 50% followed by a return of -40%: arithmetic mean return is +5%, yet the overall return is minus 5.1%
On the other hand, any profitable rule has a positive mean geometric return, and any rule with positive mean geometric return is profitable. On that basis, using the mean geometric return as the test-statistic in the bootstrap must be more appropriate.
Part 2 also looks handy:
http://www.automated-trading-system.com … tric-mean/
The approach described in the single rule test is not valid when performing data mining (whether testing different rules or different parameter values of the same rule). As per the data mining bias (explained previously), the (best) rule selected from the data mining process will invariably owe a large part of its over-performance to random (good) luck.
The way the bootstrap test deals with the data mining bias is by implementing a concept introduced in White’s Reality Check. The Reality Check derives the sampling distribution appropriate to test the statistical significance of the best rule found by data mining.
In part 1, I introduced the idea that the mean arithmetic return being positive is not equivalent to the strategy being profitable (ie. this is not a sufficient condition). On the other hand, the mean geometric return being positive is a necessary and sufficient condition to the strategy being profitable (ie. both conditions are equivalent).
Therefore bootstrapping using the mean geometric return as the test statistic should provide a better evaluation of the system’s profitability statistical siginificance.
To illustrate the multiple applications of the bootstrapping methodology, I decided to run the test on one of the Trend Following Wizards track record (set of monthly returns). I picked Chesapeake and ran the monthly returns (from 1988 to 2009) through the bootstrap test.
The p-value calculated using the arithmetic mean is 0.000098 (less than 1 chance in 10,000 that this kind of results are due to random luck). Using the geometric mean, the p-value is 0.00022. The values are extremely low, which is not surprising given Jerry Parker’s 20-year track record with only one losing year and a monthly average return of 1.7%.
Many people would point out that survivorship bias should be considered, and obviously it depends on how you look at it. The main point of this dual test is that the geometric p-value is higher than the arithmetic p-value, verifying that it is a stricter test of statistical significance.