#### Topic: Academic Paper: Evaluation of Trading Strategies

Has anyone here come across the research paper "Evaluation of Trading Strategies" (Campbell R. Harvey and Yan Liu, Aug 2014)? Has it been looked into or considered by the creators of Forex Strategy Builder?

https://faculty.fuqua.duke.edu/~charvey … tegies.pdf

I think the paper has some important implications when analysing the viability of a strategy. Here are some snips of it:

TWO VIEWS OF MULTIPLE TESTING

There are two main approaches to the multiple-testing problem in statistics. They are known as the family-wise error rate (FWER) and the false discovery rate (FDR). The distinction between the two is very intuitive. In the family-wise error rate, it is unacceptable to make a single false discovery. This is a very severe rule but completely appropriate for certain situations. With the FWER, one false discovery is unacceptable in 100 tests and equally as unacceptable in 1,000,000 tests. In contrast, the false discovery rate views unacceptable in terms of a proportion. For example, if one false discovery were unacceptable for 100 tests, then 10 are unacceptable for 1,000 tests. The FDR is much less severe than the FWER.

Which is the more appropriate method? It depends on the application. For instance, the Mars One foundation is planning a one-way manned trip to Mars in 2024 and has plans for many additional landings. It is unacceptable to have any critical part fail during the mission. A critical failure is an example of a false discovery (we thought the part was good but it was not—just as we thought the investment manager was good but she was not).

The best-known FWER test is called the Bonferroni test. It is also the simplest test to implement. Suppose we start with a two-sigma rule for a single (independent) test. This would imply a t-ratio of 2.0. The interpretation is that the chance of the single false discovery is only 5% (remember a single false discovery is unacceptable). Equivalently, we can say that we have 95% confidence that we are not making a false discovery.

Now consider increasing the number of tests to 10. The Bonferroni method adjusts for the multiple tests. Given the chance that one test could randomly show up as significant, the Bonferroni requires the confidence level to increase. Instead of 5%, you take the 5% and divide by the number of tests, that is, 5%/10 = 0.5%. Again equivalently, you need to be 99.5% confident with 10 tests that you are not making a single false discovery. In terms of the t-statistic, the Bonferroni requires a statistic of at least 2.8 for 10 tests. For 1,000 tests, the statistic must exceed 4.1.

......
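As a side note, the Bonferroni hurdles they quote (2.0 for one test, 2.8 for 10, 4.1 for 1,000) are easy to reproduce from the standard-normal quantile. A quick Python sketch (the function name is mine):

```python
from statistics import NormalDist

def bonferroni_t_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Two-sided t-statistic hurdle after a Bonferroni correction.

    The per-test significance level becomes alpha / num_tests, and the
    required |t| is the corresponding standard-normal quantile.
    """
    adjusted_alpha = alpha / num_tests
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)

for m in (1, 10, 1000):
    print(f"{m:>5} tests -> |t| > {bonferroni_t_threshold(m):.1f}")
```

Running this prints 2.0, 2.8 and 4.1, matching the numbers in the paper.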

There are usually more discoveries with BHY. The reason is that BHY allows for an expected proportion of false discoveries, which is less demanding than the absolute occurrence of false discoveries under the FWER approaches. We believe the BHY approach is the most appropriate for evaluating trading strategies.

......
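The excerpt doesn't spell out the BHY procedure, but as far as I know it is the standard Benjamini-Hochberg-Yekutieli step-up rule. A rough sketch in Python (the function name and the toy p-values are mine):

```python
def bhy_discoveries(p_values, q=0.05):
    """Benjamini-Hochberg-Yekutieli step-up procedure.

    Controls the false discovery rate at level q under arbitrary
    dependence between tests.  Returns the indices (into the original
    list) of the tests declared significant.
    """
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # Yekutieli penalty for dependence
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose sorted p-value clears its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k = rank
    return sorted(order[:k])

# toy example: two strong signals among eight noise p-values
pvals = [0.0001, 0.0004, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
print(bhy_discoveries(pvals))
```

On this toy input only the first two p-values survive; with plain Benjamini-Hochberg (dropping the `c_m` penalty) the hurdle would be less strict, which is the "less demanding than FWER" point the paper makes.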

FALSE DISCOVERIES AND MISSED DISCOVERIES

So far, we have discussed false discoveries, which are trading strategies that appear to be profitable—but they are not. Multiple testing adjusts the hurdle for significance because some tests will appear significant by chance. The downside of doing this is that some truly significant strategies might be overlooked because they did not pass the more stringent hurdle.

This is the classic tension between Type I errors and Type II errors. The Type I error is the false discovery (investing in an unprofitable trading strategy). The Type II error is missing a truly profitable trading strategy. Inevitably there is a tradeoff between these two errors. In addition, in a multiple testing setting it is not obvious how to jointly optimize these two types of errors.

Our view is the following. Making the mistake of using the single test criteria for multiple tests induces a very large number of false discoveries (large amount of Type I error). When we increase the hurdle, we greatly reduce the Type I error at minimal cost to the Type II (missing discoveries).

......

In our case, the original p-value is 0.004, and when multiplied by 200 the adjusted p-value is 0.80; the corresponding t-statistic is 0.25. This high p-value is significantly greater than the threshold, 0.05. Our method asks how large the Sharpe ratio should be in order to generate a t-statistic of 0.25. The answer is 0.08. Therefore, knowing that 200 tests have been tried and under Bonferroni's test, we declare the candidate strategy with the original Sharpe ratio of 0.92 insignificant—the Sharpe ratio that adjusts for multiple tests is only 0.08. The corresponding haircut is large, 91% (= (0.92 – 0.08)/0.92).

Turning to the other two approaches, the Holm test makes the same adjustment as Bonferroni since the t-statistic for the candidate strategy is the smallest among the 200 strategies. Not surprisingly, BHY also strongly rejects the candidate strategy.

The fact that each of the multiple-testing methods rejects the candidate strategy is a good outcome because we know all of these 200 strategies are just random numbers. A proper test also depends on the correlation among test statistics, as we discussed previously. This is not an issue in the 200 strategies because we did not impose any correlation structure on the random variables. Harvey and Liu [2014b] explicitly take the correlation among tests into account and provide multiple-testing-adjusted Sharpe ratios using a variety of methods.

......
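Out of curiosity I tried to reproduce the haircut arithmetic. Assuming monthly returns over the ten-year sample (my assumption; the excerpt doesn't state the return frequency, but 120 monthly observations reproduce their numbers), it works out like this:

```python
from statistics import NormalDist

N = NormalDist()

def haircut_sharpe(sr_annual, p_single, num_tests, years=10, periods_per_year=12):
    """Bonferroni-style haircut Sharpe ratio.

    Multiplies the single-test p-value by the number of tests, converts the
    adjusted p-value back to a two-sided t-statistic, then rescales that
    t-statistic to an annualized Sharpe ratio over `years` of returns at
    `periods_per_year` frequency (monthly is my assumption).
    """
    n_obs = years * periods_per_year
    p_adj = min(p_single * num_tests, 1.0)          # Bonferroni-adjusted p-value
    t_adj = N.inv_cdf(1 - p_adj / 2)                # matching two-sided t-statistic
    sr_adj = t_adj / (n_obs ** 0.5) * periods_per_year ** 0.5  # back to annual SR
    haircut = (sr_annual - sr_adj) / sr_annual
    return p_adj, t_adj, sr_adj, haircut

p_adj, t_adj, sr_adj, hc = haircut_sharpe(0.92, 0.004, 200)
print(f"adjusted p = {p_adj:.2f}, t = {t_adj:.2f}, "
      f"haircut SR = {sr_adj:.2f}, haircut = {hc:.0%}")
```

This gives an adjusted p-value of 0.80, a t-statistic of 0.25, a haircut Sharpe ratio of 0.08 and a 91% haircut, matching the paper's example.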

Recent research by López de Prado and his coauthors pursues the out-of-sample route and develops a concept called the probability of backtest overfitting (PBO) to gauge the extent of backtest overfitting (see Bailey et al. [2013a, b] and López de Prado [2013]). In particular, the PBO measures how likely it is for a superior strategy that is fit IS to underperform in the OOS period. It succinctly captures the degree of backtest overfitting from a probabilistic perspective and should be useful in a variety of situations.

To see the differences between the IS and OOS approaches, we again take the 200 strategy returns in Exhibit 2 as an example. One way to do OOS testing is to divide the entire sample in half and evaluate the performances of these 200 strategies based on the first half of the sample (IS), that is, the first five years. The evaluation is then put under further scrutiny based on the second half of the sample (OOS). The idea is that strategies that appear to be significant for the in-sample period but are actually not true will likely perform poorly in the out-of-sample period. Our IS approach, on the other hand, uses all ten years' information and makes the decision at the end of the sample. Using the method developed by López de Prado and his coauthors, we can calculate the PBO to be 0.45. Therefore, there is a high chance (that is, a probability of 0.45) for the IS best performer to have a below-median performance in the OOS. This is consistent with our result that, based on the entire sample, the best performer is insignificant if we take multiple testing into account. However, unlike the PBO approach, which evaluates a particular strategy selection procedure, our method determines a haircut Sharpe ratio for each of the strategies.

In principle, we believe there are merits in both the PBO and the multiple-testing approaches. A successful merger of these approaches could potentially yield more powerful tools to help asset managers successfully evaluate trading strategies.

......
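The PBO idea can be illustrated with a quick simulation: generate 200 pure-noise strategies, pick the best performer on the first half of the sample, and see how often it lands below the median on the second half. For pure noise that frequency should sit around 0.5, in the same ballpark as the 0.45 they report. To be clear, this is only a crude stand-in for the actual combinatorial PBO procedure of Bailey et al.; all names and parameters below are mine:

```python
import random
import statistics

def split_sample_check(num_strategies=200, n_obs=120, trials=200, seed=7):
    """Share of trials in which the best in-sample strategy (pure noise)
    ends up below the median out-of-sample -- the event PBO measures.

    All returns are i.i.d. Gaussian noise, mirroring the paper's 200
    random strategies; the monthly frequency (120 obs over ten years)
    is my assumption.
    """
    rng = random.Random(seed)
    below_median = 0
    for _ in range(trials):
        is_means, oos_means = [], []
        for _ in range(num_strategies):
            returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            half = n_obs // 2
            is_means.append(statistics.fmean(returns[:half]))   # in-sample mean
            oos_means.append(statistics.fmean(returns[half:]))  # out-of-sample mean
        best = max(range(num_strategies), key=is_means.__getitem__)
        if oos_means[best] < statistics.median(oos_means):
            below_median += 1
    return below_median / trials

est = split_sample_check()
print(f"IS winner below OOS median in {est:.0%} of trials")
```

Because the noise in the two halves is independent, the in-sample winner carries no information about the second half, so the estimate hovers near 50%: selecting on the IS period alone tells you essentially nothing.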

The multiple-testing problem greatly confounds the identification of truly profitable trading strategies, and the same problems apply to a variety of sciences. Indeed, there is an influential paper in medicine by Ioannidis [2005] called "Why Most Published Research Findings Are False." Harvey et al. [2014] look at 315 different financial factors and conclude that most are likely false after you apply the insights from multiple testing.

In medicine, the first researcher to publish a new finding is subject to what they call the winner's curse. Given the multiple tests, subsequent papers are likely to find a lesser effect or no effect (which would mean the research paper would have to be retracted). Similar effects are evident in finance, where Schwert [2003] and McLean and Pontiff [2014] find that the impact of famous finance anomalies is greatly diminished out-of-sample—or never existed in the first place.

So where does this leave us? First, there is no reason to think that there is any difference between the physical sciences and finance. Most of the empirical research in finance, whether published in academic journals or put into production as an active trading strategy by an investment manager, is likely false. Second, this implies that half the financial products (promising outperformance) that companies are selling to clients are false.

To be clear, we are not accusing asset managers of knowingly selling false products. We are pointing out that the statistical tools being employed to evaluate these trading strategies are inappropriate. This critique also applies to much of the academic empirical literature in finance—including many papers by one of the authors of this article (Harvey).

It is also clear that investment managers want to promote products that are most likely to outperform in the future. That is, there is a strong incentive to get the testing right. No one wants to disappoint a client and no one wants to lose a bonus—or a job. Employing the statistical tools of multiple testing in the evaluation of trading strategies reduces the number of false discoveries.

What are your thoughts on this?