Topic: Academic Paper: Evaluation of Trading Strategies
Has anyone here come across the research paper entitled, "Evaluation of Trading Strategies" (Campbell R. Harvey and Yan Liu, Aug 2014)? Has this been looked into and considered by the creators of Forex Strategy Builder?
I think the paper has some important implications for analysing the viability of a strategy. Here are some excerpts from it:
TWO VIEWS OF MULTIPLE TESTING
There are two main approaches to the multiple-testing
problem in statistics. They are known as the
family-wise error rate (FWER) and the false discovery
rate (FDR). The distinction between the two is very
In the family-wise error rate, it is unacceptable to
make a single false discovery. This is a very severe rule
but completely appropriate for certain situations. With
the FWER, one false discovery is unacceptable in 100
tests and equally as unacceptable in 1,000,000 tests. In
contrast, the false discovery rate views what is unacceptable in
terms of a proportion. For example, if one false discovery
were unacceptable for 100 tests, then 10 are unacceptable
for 1,000 tests. The FDR is much less severe than the FWER.
Which is the more appropriate method? It depends
on the application. For instance, the Mars One foundation
is planning a one-way manned trip to Mars in 2024
and has plans for many additional landings. It is unacceptable
to have any critical part fail during the mission.
A critical failure is an example of a false discovery (we
thought the part was good but it was not, just as we
thought the investment manager was good but she was not).
The best-known FWER test is called the Bonferroni
test. It is also the simplest test to implement. Suppose
we start with a two-sigma rule for a single (independent)
test. This would imply a t-ratio of 2.0. The interpretation
is that the chance of the single false discovery is only
5% (remember a single false discovery is unacceptable).
Equivalently, we can say that we have 95% confidence
that we are not making a false discovery.
Now consider increasing the number of tests to 10.
The Bonferroni method adjusts for the multiple tests.
Given the chance that one test could randomly show up
as significant, the Bonferroni requires the confidence
level to increase. Instead of 5%, you take the 5% and
divide by the number of tests, that is, 5%/10 = 0.5%.
Again equivalently, you need to be 99.5% confident
with 10 tests that you are not making a single false discovery.
In terms of the t-statistic, the Bonferroni requires
a statistic of at least 2.8 for 10 tests. For 1,000 tests, the
statistic must exceed 4.1.
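For anyone who wants to check those hurdles, here is a minimal Python sketch of the Bonferroni arithmetic described above. It assumes a two-sided test and a normal approximation to the t-statistic; the function names are my own and do not come from the paper.

```python
from statistics import NormalDist


def bonferroni_t_threshold(n_tests, alpha=0.05):
    """Two-sided critical value after dividing alpha by the number of tests.

    Assumption: normal approximation to the t-statistic.
    """
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


def familywise_error_uncorrected(n_tests, alpha=0.05):
    """Chance of at least one false discovery across independent tests
    when every test is still judged at the single-test alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 10, 1000):
    print(n, round(bonferroni_t_threshold(n), 1))

# Why the adjustment matters: with 10 independent tests at the 5% level,
# the chance of at least one spurious "discovery" is far above 5%.
print(round(familywise_error_uncorrected(10), 2))
```

Under these assumptions the thresholds round to the 2.0, 2.8 and 4.1 quoted in the excerpt.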
There are usually more discoveries with BHY (the Benjamini, Hochberg, and Yekutieli procedure, which controls the FDR).
The reason is that BHY allows for an expected proportion
of false discoveries, which is less demanding than
the absolute occurrence of false discoveries under the
FWER approaches. We believe the BHY approach is the
most appropriate for evaluating trading strategies.
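To make the difference concrete, here is a rough sketch of the Benjamini-Hochberg-Yekutieli step-up rule next to Bonferroni. The p-values are invented for illustration, and the function names are my own; this is the textbook form of the procedure, not necessarily the exact variant used in the paper.

```python
def bonferroni_rejections(p_values, alpha=0.05):
    """Indices of tests whose p-value clears the Bonferroni cutoff alpha / M."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def bhy_rejections(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg-Yekutieli step-up rule.

    Sort the p-values, find the largest rank k with
    p_(k) <= k * alpha / (M * c(M)), where c(M) = 1 + 1/2 + ... + 1/M,
    and reject the k smallest p-values.
    """
    m = len(p_values)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten strategies (illustration only).
p_values = [0.0001, 0.0004, 0.0019, 0.0060, 0.02,
            0.028, 0.03, 0.034, 0.046, 0.32]

print("Bonferroni:", bonferroni_rejections(p_values))
print("BHY:       ", bhy_rejections(p_values))
```

On this made-up sample BHY keeps the fourth strategy that Bonferroni throws out, because its cutoff rises with the rank of the p-value instead of staying fixed at alpha / M.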
FALSE DISCOVERIES AND MISSED DISCOVERIES
So far, we have discussed false discoveries, which
are trading strategies that appear to be profitable—but
they are not. Multiple testing adjusts the hurdle for significance
because some tests will appear significant by
chance. The downside of doing this is that some truly
significant strategies might be overlooked because they
did not pass the more stringent hurdle.
This is the classic tension between Type I errors
and Type II errors. The Type I error is the false discovery
(investing in an unprofitable trading strategy). The Type
II error is missing a truly profitable trading strategy.
Inevitably there is a tradeoff between these two errors.
In addition, in a multiple testing setting it is not obvious
how to jointly optimize these two types of errors.
Our view is the following. Making the mistake of
using the single test criteria for multiple tests induces a
very large number of false discoveries (large amount of
Type I error). When we increase the hurdle, we greatly
reduce the Type I error at minimal cost to the Type II error.
In our case, the original p-value is 0.004, and when
multiplied by 200 the adjusted p-value is 0.80 and the
corresponding t-statistic is 0.25. This high p-value is significantly
greater than the threshold, 0.05. Our method
asks how large the Sharpe ratio should be in order to
generate a t-statistic of 0.25. The answer is 0.08. Therefore,
knowing that 200 tests have been tried and under
Bonferroni’s test, we successfully declare the candidate
strategy with the original Sharpe ratio of 0.92 as insignificant—the
Sharpe ratio that adjusts for multiple tests
is only 0.08. The corresponding haircut is large, 91%
(= (0.92 – 0.08)/0.92).
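The haircut arithmetic in that passage can be retraced with a few lines of Python. This is only a sketch under my own assumptions: a normal approximation for the t-statistic, a two-sided p-value, and an annualized Sharpe ratio equal to the t-statistic divided by the square root of the number of years (ten years, as in the excerpt further down). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_haircut(sharpe, p_value, n_tests, years):
    """Retrace the Bonferroni haircut from a single-test p-value.

    Assumptions (not stated in the excerpt): normal approximation,
    two-sided p-value, and annualized Sharpe = t-statistic / sqrt(years).
    """
    p_adj = min(1.0, p_value * n_tests)              # Bonferroni adjustment
    t_adj = NormalDist().inv_cdf(1.0 - p_adj / 2.0)  # t-stat implied by p_adj
    sharpe_adj = t_adj / sqrt(years)                 # haircut Sharpe ratio
    haircut = (sharpe - sharpe_adj) / sharpe
    return p_adj, t_adj, sharpe_adj, haircut


p_adj, t_adj, sharpe_adj, haircut = bonferroni_haircut(
    sharpe=0.92, p_value=0.004, n_tests=200, years=10
)
print(round(p_adj, 2), round(t_adj, 2), round(sharpe_adj, 2), round(haircut, 2))
```

Starting from the rounded p-value of 0.004, this reproduces the 0.80, 0.25, 0.08 and 91% figures quoted above. Starting from the unrounded p-value implied by a Sharpe ratio of 0.92 would give slightly different numbers.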
Turning to the other two approaches, the Holm
test makes the same adjustment as Bonferroni since the
t-statistic for the candidate strategy is the smallest among
the 200 strategies. Not surprisingly, BHY also strongly
rejects the candidate strategy.
The fact that each of the multiple-testing methods
rejects the candidate strategy is a good outcome because
we know all of these 200 strategies are just random numbers.
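The same point is easy to try at home. Below is a small simulation sketch: 200 strategies made of pure noise, tested first with the single-test rule of |t| > 2 and then with the Bonferroni hurdle. The 120 monthly observations, the 4% monthly volatility and the seed are my own assumptions, not the paper's data.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def t_stat(returns):
    """t-statistic for the null hypothesis of zero mean return."""
    return mean(returns) / (stdev(returns) / sqrt(len(returns)))


rng = random.Random(42)             # fixed seed so the run is repeatable
n_strategies, n_months = 200, 120   # assumed: ten years of monthly returns

# Every "strategy" is zero-mean Gaussian noise, so any discovery is false.
t_stats = [
    t_stat([rng.gauss(0.0, 0.04) for _ in range(n_months)])
    for _ in range(n_strategies)
]

single_cut = 2.0
bonf_cut = NormalDist().inv_cdf(1.0 - 0.05 / n_strategies / 2.0)

naive_hits = sum(abs(t) > single_cut for t in t_stats)
bonf_hits = sum(abs(t) > bonf_cut for t in t_stats)

print("Bonferroni hurdle:", round(bonf_cut, 2))
print("False discoveries at |t| > 2:", naive_hits)
print("False discoveries after Bonferroni:", bonf_hits)
```

With a 5% single-test error rate, roughly ten of the 200 noise strategies are expected to look significant under the naive rule, while the Bonferroni hurdle should leave few or none.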
A proper test also depends on the correlation among
test statistics, as we discussed previously. This is not an
issue in the 200 strategies because we did not impose any
correlation structure on the random variables. Harvey
and Liu [2014b] explicitly take the correlation among
tests into account and provide multiple-testing-adjusted
Sharpe ratios using a variety of methods.
Recent research by López de Prado and his coauthors
pursues the out-of-sample route and develops a
concept called the probability of backtest overfitting
(PBO) to gauge the extent of backtest overfitting (see
Bailey et al. [2013a, b] and López de Prado).
In particular, the PBO measures how likely it is for a
superior strategy that is fit in-sample (IS) to underperform in the
out-of-sample (OOS) period. It succinctly captures the degree of backtest
overfitting from a probabilistic perspective and should
be useful in a variety of situations.
To see the differences between the IS and OOS
approach, we again take the 200 strategy returns in
Exhibit 2 as an example. One way to do OOS testing
is to divide the entire sample in half and evaluate the
performances of these 200 strategies based on the first
half of the sample (IS), that is, the first five years. The
evaluation is then put into further scrutiny based on the
second half of the sample (OOS). The idea is that strategies
that appear to be significant for the in-sample period
but are actually not true will likely perform poorly
for the out-of-sample period. Our IS sample approach,
on the other hand, uses all ten years’ information and
makes the decision at the end of the sample. Using the
method developed by López de Prado and his coauthors,
we can calculate PBO to be 0.45. Therefore, there is
a high chance (that is, a probability of 0.45) for the IS best
performer to have a below-median performance in the
OOS. This is consistent with our result that based on
the entire sample, the best performer is insignificant if
we take multiple testing into account. However, unlike
the PBO approach that evaluates a particular strategy
selection procedure, our method determines a haircut
Sharpe ratio for each of the strategies.
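For readers curious how a PBO number might be produced, here is a simplified sketch in the spirit of the combinatorially symmetric cross-validation idea of Bailey et al. It is not their implementation and will not reproduce the 0.45 above: the data are simulated noise, and the block count, function names and volatility are my own assumptions.

```python
import itertools
import random
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_estimate(returns_by_strategy, n_blocks=6):
    """Share of IS/OOS splits in which the IS winner ranks at or below
    the OOS median. n_blocks must be even (hypothetical default)."""
    n_obs = len(returns_by_strategy[0])
    block_len = n_obs // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    n_strategies = len(returns_by_strategy)
    below_median = 0
    n_splits = 0
    for is_blocks in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_blocks for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_blocks
                   for t in blocks[b]]
        is_sr = [sharpe([r[t] for t in is_idx]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[t] for t in oos_idx]) for r in returns_by_strategy]
        best = max(range(n_strategies), key=lambda i: is_sr[i])
        # Relative OOS rank of the IS winner, strictly between 0 and 1.
        omega = sum(s <= oos_sr[best] for s in oos_sr) / (n_strategies + 1)
        if omega <= 0.5:
            below_median += 1
        n_splits += 1
    return below_median / n_splits


rng = random.Random(0)
# Assumed toy data: 20 noise strategies, 120 monthly returns each.
simulated = [[rng.gauss(0.0, 0.04) for _ in range(120)] for _ in range(20)]
pbo = pbo_estimate(simulated)
print("Estimated PBO:", round(pbo, 2))
```

A value near or above 0.5 suggests the in-sample winner is no more likely than a coin flip to beat the median out of sample, which is the sort of result one would expect when every strategy is noise.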
In principle, we believe there are merits in both
the PBO and the multiple-testing approaches. A
successful merger of these approaches could potentially
yield more powerful tools to help asset managers successfully
evaluate trading strategies.
The multiple-testing problem greatly confounds
the identification of truly profitable trading strategies
and the same problems apply to a variety of sciences.
Indeed, there is an influential paper in medicine by
Ioannidis called “Why Most Published Research
Findings Are False.” Harvey et al. look at 315 different
financial factors and conclude that most are likely
false after you apply the insights from multiple testing.
In medicine, the first researcher to publish a new
finding is subject to what they call the winner’s curse.
Given the multiple tests, subsequent papers are likely
to find a lesser effect or no effect (which would mean
the research paper would have to be retracted). Similar
effects are evident in finance, where Schwert
and McLean and Pontiff find that the impact of
famous finance anomalies is greatly diminished out-of-sample,
or never existed in the first place.
So where does this leave us? First, there is no reason
to think that there is any difference between physical
sciences and finance. Most of the empirical research
in finance, whether published in academic journals or
put into production as an active trading strategy by an
investment manager, is likely false. Second, this implies
that half the financial products (promising outperformance)
that companies are selling to clients are false.
To be clear, we are not accusing asset managers of
knowingly selling false products. We are pointing out
that the statistical tools being employed to evaluate these
trading strategies are inappropriate. This critique also
applies to much of the academic empirical literature in
finance—including many papers by one of the authors
of this article (Harvey).
It is also clear that investment managers want to
promote products that are most likely to outperform in
the future. That is, there is a strong incentive to get the
testing right. No one wants to disappoint a client and no
one wants to lose a bonus—or a job. Employing the statistical
tools of multiple testing in the evaluation of trading
strategies reduces the number of false discoveries.
What are your thoughts on this?