Here are my results from my initial runs incorporating OOS.
My working hypothesis is that through the use of OOS data, I can more quickly and efficiently test whether the strategies I am generating (over the In Sample data) will continue to perform well in the future (i.e., over the OOS data they were not optimized for). The OOS data should simulate placing the EAs on a demo account.
This allows me to change various settings (including, but not limited to, Acceptance Criteria, Monte Carlo variables, Optimizer settings, SL and TP, data sample size, etc.) and determine relatively quickly what effect those changes have in the OOS period. In fact, I can reach a "large enough" sample size to be meaningful, something that is almost impossible deploying strategies on test servers.
I know there will be dissenters, but it is my belief that using the newfound strategy generation settings in this manner will lead to generating EAs that are much more dependable. Please note that it is entirely possible that, once the "optimal" settings are found, the OOS period may be removed completely for live strategy generation. This method is simply using OOS data to determine those settings that make this possible.
This process has been extremely revealing so far. Some things I completely expected; a few surprised me. It will take me a while to fully interpret and understand the data. As always, I am open to changing my mind about anything. Thanks again to Steve/sleytus for getting me interested in this more mass-generation-oriented approach than the one I was using before. Initially, I was quite against generation over short periods.
This test was run on slightly different settings than I originally intended, due to the new OOS validation. Here are those settings:
1. Historical Data:
* Symbol: EURUSD
* Period: M30
(Data Horizon: 22500 bars; 15750 IS, 6750 OOS)
* In Sample: From 10-23-2015 to 01-26-2017
* OOS: From 01-27-2017 to 08-11-2017
2. Strategy Properties:
(Account size: 100,000)
* Entry Lots: 1
* SL: Always/10-100 pips
* TP: May use/10-100 pips
3. Generator settings:
* Search best: System Quality Number
* OOS: 30%
* 5 steps
4. All data validation: YES
5. MC Validation:
* 100 tests
* Validated tests: 95%
(Settings: defaults PLUS "Randomize indicator parameters" [10/10/20 steps])
6. NO market validation
7. Acceptance Criteria:
> Complete Backtest:
* Max Amb. bars: 10
* Min Net Profit: 10 (no effect)
* Min Trades: 100 (no effect)
> IS part:
* Min Trades: 100
* Max DD%: 25
* Max Stag%: 35
* Min PF: 1.1
* Min R/DD: 1
> OOS part:
* Min Trades: 25
* Max DD%: 25
* Max Stag%: 50
* Min PF: 1.1
* Min R/DD: 0.5
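For anyone who wants to replicate this two-sided filtering outside the software, the AC above boils down to a pair of threshold checks. A minimal sketch (the field names and example stats are my own illustration, not an export format from the program):

```python
# Hypothetical stats dictionaries; field names are illustrative only.
IS_CRITERIA = {"min_trades": 100, "max_dd_pct": 25, "max_stag_pct": 35,
               "min_pf": 1.1, "min_r_dd": 1.0}
OOS_CRITERIA = {"min_trades": 25, "max_dd_pct": 25, "max_stag_pct": 50,
                "min_pf": 1.1, "min_r_dd": 0.5}

def passes(stats, c):
    """True if one side's backtest stats meet that side's Acceptance Criteria."""
    return (stats["trades"] >= c["min_trades"] and
            stats["dd_pct"] <= c["max_dd_pct"] and
            stats["stag_pct"] <= c["max_stag_pct"] and
            stats["pf"] >= c["min_pf"] and
            stats["r_dd"] >= c["min_r_dd"])

def passes_both(is_stats, oos_stats):
    """A strategy must clear the IS and the OOS criteria independently."""
    return passes(is_stats, IS_CRITERIA) and passes(oos_stats, OOS_CRITERIA)

# Example: a strategy that clears IS but fails OOS on trade count.
is_stats = {"trades": 140, "dd_pct": 18, "stag_pct": 30, "pf": 1.3, "r_dd": 1.4}
oos_stats = {"trades": 20, "dd_pct": 12, "stag_pct": 40, "pf": 1.2, "r_dd": 0.8}
```

The point of the sketch: the OOS side is a completely separate gate, so loosening one side's thresholds never rescues a strategy that fails the other side.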
Philosophy of Setting Acceptance Criteria
When setting Acceptance Criteria values, I entered values far worse than we would want in real trading systems (with the exception of number of trades IS = 100; that one is necessary to generate a sufficient sample of trades). This way I don't accidentally filter out strategies by being too strict too early, while still eliminating a portion of them.
I will progressively narrow the AC, but only so far that I remove, at most, a very rare potentially good strategy.
Note: I ran more Calculations on IS/OOS data. Because significantly fewer strategies are validated with IS/OOS, the variance is much higher, so I needed a larger sample size. But I was initially very surprised to notice that the IS/OOS run had calculated MORE THAN TEN TIMES the number of strategies on average. First, I thought my settings must be incorrect somewhere; that was not the case. Then I thought perhaps the new backtesting engine had a bug in it. (The actual reason is explained below, after the results.)
And finally, the results:
Run 1 (IS-only, control):
Generated Strategies: 180374
Number Passed Validation (Generation step): 4861
Percent Passed Validation (Generation step): 2.69%
Number Passed Validation (MC step): 760
Percent Passed Validation (MC step): 15.63%
Percent Passed Validation (All steps): 0.4213%
Run 2 (IS/OOS):
Generated Strategies: 4511276
Number Passed Validation (Generation step): 2337
Percent Passed Validation (Generation step): 0.05%
Number Passed Validation (MC step): 842
Percent Passed Validation (MC step): 36.03%
Percent Passed Validation (All steps): 0.0187%
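As a sanity check, the overall pass rate for each run is just the product of the per-step rates. Recomputing the funnels from the raw counts above:

```python
def funnel(generated, passed_gen, passed_mc):
    """Per-step and overall validation rates for one generation run."""
    gen_rate = passed_gen / generated    # Generation-step pass rate
    mc_rate = passed_mc / passed_gen     # MC-step pass rate
    return gen_rate, mc_rate, gen_rate * mc_rate

# Counts from the two runs above.
g1, m1, all1 = funnel(180374, 4861, 760)    # run with 180,374 generated
g2, m2, all2 = funnel(4511276, 2337, 842)   # run with 4,511,276 generated

print(f"{g1:.2%} {m1:.2%} {all1:.4%}")   # 2.69% 15.63% 0.4213%
print(f"{g2:.2%} {m2:.2%} {all2:.4%}")   # 0.05% 36.03% 0.0187%
```

The recomputed numbers match the posted percentages, and they make the trade-off explicit: the second run generated ~25x as many strategies but kept a ~23x smaller fraction of them.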
For this initial experiment, I used a test group without the OOS period for the control group. The In Sample length was identical to that of the generation run with OOS.
When complete, I compared the statistics for each generation run using the number (and percentage) of strategies that Passed Validation in both the Generator step and the Monte Carlo step.
The difference between the Generator-step "Passed Validation" percentages represents the relative number of strategies from In-Sample-only generation that were not viable when trading outside the optimized period (i.e., they were curve-fit) over the following 6.5-month window.
Notice that for the IS/OOS run, nearly all the strategies generated did not go on to be profitable over this 6.5-month period. This is far from the final word, but the reader should at least consider that relying on IS results alone may not be viable. You might say that 6.5 months is too long, that you expect most strategies to fail within that time, and that your pruning strategy will (eventually) solve the problem. Perhaps.
This test can be repeated using only 10% OOS (~2 months), and I bet the results wouldn't be substantially better. Yes, you will see a higher proportion of winning strategies on average, but I contend that is due mostly to random fluctuation and small sample size. Run your own experiment: 10% OOS over the same period as I have done (except change the OOS period...you will need only 1575 bars OOS). Reset the Acceptance Criteria (but change #/trades OOS to 9, which is proportional to my AC #/trades) and quickly generate 100 random strategies. Don't spend time with MC testing. Examine those 100 random strategies and see just how many show a profit after two months OOS. I think you'll be amazed at how many actually do.
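If you'd rather not burn generator time on that experiment, a coin-flip simulation gives a feel for the baseline. This toy model (my own sketch, not anything from the generator) gives each "strategy" 9 zero-expectancy trades of +/-1R and counts how many finish in profit:

```python
import random

def profitable_fraction(n_strategies=100, n_trades=9, seed=1):
    """Fraction of zero-edge 'strategies' that end a short OOS window in profit.

    Toy model: each trade is +1R or -1R with probability 0.5, so any
    profit is pure luck. n_trades=9 mirrors the #/trades OOS suggested above.
    """
    rng = random.Random(seed)
    profitable = 0
    for _ in range(n_strategies):
        pnl = sum(rng.choice((1, -1)) for _ in range(n_trades))
        if pnl > 0:
            profitable += 1
    return profitable / n_strategies

frac = profitable_fraction()
```

With an odd number of zero-edge trades there are no ties, so roughly half the strategies end in profit by pure luck. That is exactly the point: a profit over ~9 OOS trades tells you very little by itself.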
For this test, I used a fixed length of time when testing for profitability. That length can be changed and the process repeated. You will, of course, find more strategies that are "profitable" over shorter terms. But you will also have to base your results on a MUCH reduced number of trades (or accept that the number of strategies generated will go down relative to that same length of time).
Basing your results on fewer trades leads to greater and greater problems with small sample size and reliability of results. Dr. Van Tharp, in his formulation of the System Quality Number (my favorite metric), notes that you need at least 40 trades before that statistic becomes truly accurate. I set "Number of Trades" to only 25 in the OOS period to capture more strategies, so I'm sacrificing some confidence in the results.
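For reference, as I understand Tharp's formula, SQN = sqrt(N) x mean(R) / stdev(R), so the score itself scales with the square root of the trade count. A quick sketch (the trade lists are my own made-up example) shows why a 25-trade sample both scores lower and is noisier than 100 trades with the same per-trade edge:

```python
import math
import statistics

def sqn(r_multiples):
    """Tharp's SQN: sqrt(N) * mean(R) / sample stdev(R)."""
    n = len(r_multiples)
    return math.sqrt(n) * statistics.mean(r_multiples) / statistics.stdev(r_multiples)

# The same per-trade edge at two sample sizes: SQN scales with sqrt(N),
# and in practice the 25-trade estimate of mean/stdev is far noisier.
trades_25 = [0.2, -1.0, 1.5, 0.3, -0.5] * 5    # 25 trades, mean R = 0.1
trades_100 = trades_25 * 4                      # 100 identical-edge trades
```

Doubling the trade count fourfold roughly doubles the SQN here even though the per-trade edge is unchanged, which is why comparing SQNs across very different trade counts is treacherous.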
As for the surprise I mentioned above the results, I quickly realized the reason the IS/OOS run calculated so many more strategies: I am using Monte Carlo simulations that run 100 tests on each strategy that passes the Acceptance Criteria. Because the IS/OOS run was much pickier about which strategies passed, it spent far more of its time actually generating strategies rather than on Monte Carlo validation.
Strategies passed Monte Carlo validation at well over twice the rate with IS/OOS. Monte Carlo validation in the IS/OOS run was over both IS and OOS, which certainly contributed to the higher validation rate to some degree. But consider, too, that the overall quality of a strategy's results is degraded by the inclusion of OOS results. This is partially compensated by the fact that I did filter on OOS, so these are the highest stats of any OOS runs.
One observation: with IS-exclusive generation, many more strategies are generated than the Collection can hold, so in the end it will contain a much better-than-average selection of strategies than the above stats would otherwise indicate.
What do these results tell us and how can they be of use in further testing?
This initial run was not meant to prove anything so much as to provide a baseline for further testing.
The most interesting result to me was that using IS/OOS, my CPU can use its time generating MANY more strategies, rather than spend it on validating Monte Carlo tests.
One thing that disturbs me is that it's hard to immediately tell the difference between the stats of the best strategies generated with IS alone vs. those from IS/OOS. I thought the line would be more discernible. (I can hear "I told you so" already!) This seems to be a strong argument that the actual Acceptance Criteria may have only a very small effect on a proper workflow.
I haven't explored the stats very much yet, though, and I expect the real revelations to come with the further testing I have planned. Perhaps when certain AC are combined, they will generate a more predictable result.
Something else occurred to me about this: it seems to say that the stats we have to work with cannot be heavily relied upon to select strategies. That's not to say, for example, that there is no difference between an R/DD of 6 and an R/DD of 2; of course, lacking more information, the strategy with the R/DD of 6 has the better chance. What I'm saying is that the divide between these two values may be very small (by itself), especially if it is taken from an exclusively IS/Optimized strategy.
This would certainly be in line with the "observations" that there are no discernible performance differences between the "top ten" strategies of a Collection and the "second ten".
So if we cannot depend on the actual stats to make much of a difference, what else do we have? Currently, there is Monte Carlo testing and OOS. Monte Carlo testing is possibly the least-understood yet most powerful tool traders have. From my time using StrategyQuant, I've watched the "gurus" propose all sorts of nonsense about how to validate your strategies, and much of it was focused on Monte Carlo testing.
The big problem is that we are given this tool, and we know it is somehow useful, but most of us don't have the math background to understand how to use it effectively. So someone who wants to sell his courses and/or software comes up with a process that seems logical to him (based on intuition, "observation", astrology, or whatever...) and this process becomes a "gold standard", unquestioned by people who follow it blindly. Because, after all, that's how EVERYONE does it now, so it MUST be the right way.
Trial and error, intuition, and observation are very unlikely to get us moving in the right direction. They can actually set you off in the exact opposite direction for an entire lifetime (I'm not exaggerating!).
We need to come up with a way to scientifically determine the best method of MC analysis. Calculating an exact or even near-exact method is far beyond my math skills, but I'll be happy with being in the ballpark. Right now, I think we're just flailing about blindly.
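As a modest first step in that direction: simply putting error bars on the numbers we compare is already more scientific than eyeballing them. A normal-approximation 95% confidence interval (my own back-of-envelope sketch) on the MC-step pass rates from my two runs shows that particular gap is far larger than sampling noise:

```python
import math

def pass_rate_ci(passed, total, z=1.96):
    """95% normal-approximation confidence interval for a pass rate."""
    p = passed / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

# MC-step pass counts from the two runs above.
is_lo, is_hi = pass_rate_ci(760, 4861)      # IS-only: ~14.6%..16.7%
oos_lo, oos_hi = pass_rate_ci(842, 2337)    # IS/OOS: ~34.1%..38.0%
```

The intervals don't overlap, so that difference is real. Applying the same check before drawing conclusions from smaller comparisons would kill a lot of the "intuition-based" MC folklore on the spot.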
Finally, here is an additional quick test I did. The following sample size is super small, so it doesn't necessarily mean anything at all. But this small sample was rather discouraging.
I took the Collections resulting from the above runs and fed them all through 6750 bars of OOS data that immediately PREceded the In Sample period.
Keep in mind that the IS/OOS strategies were already filtered for good OOS performance in another period adjacent to the In Sample.
The Acceptance Criteria used were identical to the OOS AC above.
Total In Sample strategies: 300
% that Passed Acceptance Criteria: 21.7%
Total OOS strategies: 581
% that Passed AC: 19.6%
As I said, this is a very small sample, but if this trend continues then performance in OOS data is not consistent.
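To quantify "very small sample": a back-of-envelope two-proportion z-test (my own check, using ~65/300 for 21.7% and ~114/581 for 19.6%) shows the gap is well inside sampling noise:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z statistic for H0: the two pass rates are equal (pooled proportion)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 21.7% of 300 is ~65 passes; 19.6% of 581 is ~114 passes.
z = two_prop_z(65, 300, 114, 581)
```

With |z| well under the 1.96 threshold for 95% significance, these two pass rates are statistically indistinguishable at these sample sizes, so no trend can be claimed either way yet.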
As long as the OOS period is adjacent to the IS, I don't think it should matter whether it comes before or after for this kind of performance monitoring.