Topic: Using Your Broker's Data -- Yes or No?
The forum seems a bit quiet these days, so I thought I would stir things up by raising a controversial question. I've often seen it recommended that we use data from our own broker for generating and optimizing strategies. My take is that the data source probably doesn't make much difference. And here's why...
(1) The data source is a *.csv or *.json file of a few thousand OHLC values -- essentially an array of numbers. A strategy is an algebraic formula. If I give you 5 different *.csv files from 5 different brokers -- i.e. 5 different arrays of numbers -- can your algebraic formula really distinguish between the different brokers? And, if so, what exactly differs between the 5 arrays of numbers?
(2) I have accounts with 5 different brokers. I've found that winning strategies win with each of the brokers and losing strategies lose with each of the brokers. I have yet to find a strategy that performs poorly with one broker and great with another -- or vice versa.
(3) When I "refresh" or re-optimize an existing strategy with a data source from a different broker (but using the exact same data horizon), the back test statistics come out fairly close. Not identical, but similar. Again, I have yet to come across a strategy that has great back test statistics with one data source and poor statistics with another (again -- same data horizon).
(4) When I "refresh" or re-optimize an existing strategy using the *same* data source but a different data horizon, the back test statistics can differ significantly. In fact, the change in statistics from switching data horizons is bigger than the change from switching data sources.
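One way to put numbers on points (3) and (4) is to tabulate a single back test statistic for each (data source, data horizon) pair and compare how much it moves along each axis. The sketch below is only an illustration: the function name `spread_by`, the broker and horizon labels, and the profit-factor values are all made-up placeholders, not results from any real back test.

```python
from collections import defaultdict

def spread_by(results, axis):
    """Average max-min spread of a statistic when varying one axis.

    results maps (source, horizon) -> statistic value.
    axis=0 varies the source within each horizon;
    axis=1 varies the horizon within each source.
    """
    groups = defaultdict(list)
    for key, value in results.items():
        fixed = key[1 - axis]  # the coordinate that is held constant
        groups[fixed].append(value)
    spreads = [max(values) - min(values) for values in groups.values()]
    return sum(spreads) / len(spreads)

# Placeholder profit factors, invented purely to show the calculation.
results = {
    ("broker_a", "horizon_1"): 1.5,
    ("broker_b", "horizon_1"): 1.4,
    ("broker_a", "horizon_2"): 0.9,
    ("broker_b", "horizon_2"): 1.0,
}

print("spread across sources: ", spread_by(results, axis=0))
print("spread across horizons:", spread_by(results, axis=1))
```

If the claim in (4) holds for your own strategies, the second number should come out larger than the first once you substitute your real statistics.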
My point is this -- though the array of numbers from different brokers will differ somewhat, so will the array of numbers from different data horizons (but same broker). Inherent in these arrays of numbers are certain patterns. I'm making the claim that within the same data horizon, the patterns probably do not differ very much from broker to broker. However, when comparing different data horizons, the patterns are more likely to differ. This is probably why strategies perform better or worse at different times -- it depends on how well they have been trained to recognize the current data pattern.
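For anyone who wants to check how much two brokers' arrays actually differ over the same horizon, here is a minimal sketch. It assumes each file is a CSV with the columns `time,open,high,low,close` (your export may use other names), and the helper names `load_ohlc` and `compare_ohlc` are my own, not part of any platform. It reports which bars appear in only one file and how far apart the prices are on the bars both files share.

```python
import csv
import io

def load_ohlc(text):
    """Parse CSV text with columns time,open,high,low,close into {time: (o, h, l, c)}."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["time"]: tuple(float(row[k]) for k in ("open", "high", "low", "close"))
        for row in reader
    }

def compare_ohlc(a, b):
    """Summarize how two OHLC series differ in bar coverage and in prices."""
    shared = sorted(set(a) & set(b))
    # Absolute difference for every open/high/low/close value on shared bars.
    diffs = [abs(x - y) for t in shared for x, y in zip(a[t], b[t])]
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "shared_bars": len(shared),
        "mean_abs_diff": sum(diffs) / len(diffs) if diffs else 0.0,
        "max_abs_diff": max(diffs, default=0.0),
    }
```

To use it on real exports, read each file's contents (for example with `open(path).read()`) and pass the text to `load_ohlc`. Note that the bars are matched on the raw `time` string, so two files stamped in different server time zones would need their timestamps aligned first.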
One last point -- many or most of us acknowledge there is a "disconnect" between back testing results and live trading. I suspect this is because live trading runs on tick data, whereas back testing runs on bar data. Tick and bar data are two completely different beasts. Considering the disconnect introduced by using tick data, why would we be concerned about the relatively minor differences in *.csv data from different brokers when generating / training strategies? Whatever differences may exist (in broker *.csv data) will eventually be overwhelmed by the differences between ticks and bars.
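Here is a toy illustration of one way ticks and bars can part company. Two different tick sequences collapse into exactly the same OHLC bar, yet a long trade with a stop and a target resolves differently depending on which level the ticks reach first. The prices, the levels, and the helper names `to_bar` and `first_hit` are all invented for the example; this is a sketch of the idea, not a claim about how any particular back tester fills orders.

```python
def to_bar(ticks):
    """Collapse a tick sequence into an (open, high, low, close) bar."""
    return (ticks[0], max(ticks), min(ticks), ticks[-1])

def first_hit(ticks, stop, target):
    """For a long position, return which level the ticks touch first."""
    for price in ticks:
        if price <= stop:
            return "stop"
        if price >= target:
            return "target"
    return None  # neither level was reached

# Same open, high, low and close, but the high and low occur in opposite order.
ticks_a = [100, 103, 97, 101]
ticks_b = [100, 97, 103, 101]

print(to_bar(ticks_a) == to_bar(ticks_b))          # True
print(first_hit(ticks_a, stop=98, target=102))     # target
print(first_hit(ticks_b, stop=98, target=102))     # stop
```

A bar-based back test sees only the one bar and has to guess the order, while live trading experiences the actual sequence.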
I'm curious whether any of this makes sense...