Topic: Generator checking with more than one pair

Hey,

I'm testing some strategies on demo accounts, and when it comes to generated strategies, it seems that the ones that work well on more than one pair (and possibly more than one interval) behave more consistently. For example, if I generate a strategy that backtests well on EURUSD 15min but fails on every other pair and interval, it will tend to behave worse than one that also works OK on GBPUSD 15min (or other pairs). That made me think that maybe it would be nice to be able to run the Generator against more than one pair?

Can anyone familiar with the source code estimate whether this would be just a matter of changing the Generator, or is the Generator also strongly connected with the rest of the app?


Re: Generator checking with more than one pair

drogus,

I don't believe that this would be a simple change to the application, as internally FSB has access to a single data file at any one time.  It is possible to copy/paste between two instances that have different  symbols loaded, although I am guessing you were hoping for some type of automation.

I've looked into this for my own research and would like to implement a similar type of arrangement, and have even coded some ideas, but have not been able to find a model that works nicely between different instruments.  Still looking though.

ab

Re: Generator checking with more than one pair

I'm really thinking out loud here, but I have an idea I'd like to try. It's pretty simple I think, but I'm unsure if it would work in reality. drogus makes a good find here - I bet there is something to this phenomenon: a strategy performing well across multiple instruments would be more robust in general.

So here is my idea:
I'd like to take 4 different data sets from 4 different pairs. I'll trim each to exactly 50,000 bars for consistency. Then I'll write a script to shift the date codes in each dataset, such that they can be sewn together as a unified set. FSB won't know that I've meddled with the data (if I can do it correctly).

Then to test the theory, I would just need to boost FSB's max bars to 200,000 and let the Generator find a strategy that works across the entire unified set. The theory is that such a strategy would also perform well on each of the individual datasets.

Worth a shot?


Re: Generator checking with more than one pair

Worth a shot?

Most definitely. I have been looking at a bunch of complex ways to achieve this, but your idea of building a synthetic data set is great.

The only issue that I can see is that at some point in time (where the data sets join), there is going to be a big gap in price (say AUDUSD at .80923 joining EURUSD at 1.65231), which would possibly give the backtester a minor stroke.

If your tests work OK, I'd be happy to build an add-in that can splice the data files together from a UI - so the approach could be used by all users of FSB.

Re: Generator checking with more than one pair

It might be good to first check the correlation between the instruments, especially over the past few months. If a strategy performs well across 2 or 3 pairs, but those pairs are highly correlated, it seems the price correlation would be the explanation rather than an "edge" of the strategy. The price correlation could probably be checked more easily than taking more investigative steps.

I like dusktrader's idea too. For splicing together synthetic data, perhaps we could set the first bar to 1.0000, then adjust the rest of the price bars relative to it, as O-H-L-C[iBar] - O-H-L-C[first bar]. Then the next data set gets adjusted the same way, but starting from the last dataset's final bar value. I think this would be returns calculated in pips, not as a percentage difference. I'll see if I can do this in Excel and then post in a bit. This would be an interesting approach.
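In code form, that adjustment might look roughly like this (a minimal PHP sketch; the rebaseBars helper and the $eurusdBars / $gbpusdBars arrays are illustrative, not anything FSB or the spreadsheet provides):

<?php
//rebase a dataset so its first Open becomes $anchor and every other
//O-H-L-C value keeps the same pip distance from that first Open
function rebaseBars(array $bars, $anchor)
   {
   $firstOpen = $bars[0]['open'];
   foreach ($bars as &$bar)
      {
      foreach (array('open', 'high', 'low', 'close') as $field)
         $bar[$field] = ($bar[$field] - $firstOpen) + $anchor;
      }
   unset($bar);
   return $bars;
   }

//first dataset is anchored at 1.0000; each later dataset is anchored at the
//previous dataset's last rebased close, so there is no gap at the seam
$spliced = rebaseBars($eurusdBars, 1.0000);
$lastClose = $spliced[count($spliced) - 1]['close'];
$spliced = array_merge($spliced, rebaseBars($gbpusdBars, $lastClose));
?>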

Re: Generator checking with more than one pair

krog wrote:

It might be good to first check the correlation between the instruments, especially over the past few months. If a strategy performs well across 2 or 3 pairs, but those pairs are highly correlated, it seems the price correlation would be the explanation rather than an "edge" of the strategy. The price correlation could probably be checked more easily than taking more investigative steps.

Yeah, I should probably mention that these are just my subjective observations that aren't backed by any measurements, so if anyone is interested in serious work, you might want to check it more thoroughly wink

Re: Generator checking with more than one pair

Oh yeah, I hadn't thought of the gap in prices between the pairs -- I think that probably would matter, but a formula like what krog is talking about is exactly what I was thinking could fix it (adjust relative to each other). Perhaps a simpler way to "deal with" the gap would be to just have some sort of way to detect the gap and then close the trade.

Another thing that has to be dealt with during the splicing together would be the alignment of the datasets. I think ideally you would want to end on a Friday and start the next dataset on a Monday.

A pretty simple way to accomplish this, I think, would be to import all the data first into separate SQL tables. Then trim the beginnings to all be Sunday's open time and endings to all be Friday's closing time.

Then, my idea to sew them together would be to rewrite the actual dates beginning with the first record. In other words, it could start on any Monday and FSB wouldn't care; the script would just have to keep up with the offset. It would be relatively easy to check the original data record by record to determine if it was still on the correct/expected day-of-week (for example, PHP can easily convert back and forth between date formats and unix timestamps, as well as determine things like day-of-week). One potential hiccup could be crossing through a leap year -- the data could become misaligned (by day) if that happens. So I think you would want to check each record during the output.
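For instance, a quick day-of-week sanity check along those lines could look like this (a rough PHP sketch; the dot-separated date format and the function name are just assumptions):

<?php
//verify that shifting a record by whole weeks leaves it on the same day-of-week
//$feeddate like "2012.01.08", $feedtime like "23:00", $shiftweeks = whole weeks forward
function sameDayOfWeek($feeddate, $feedtime, $shiftweeks)
   {
   $original = strtotime(str_replace(".", "-", $feeddate) . " " . $feedtime);
   $shifted  = $original + (604800 * $shiftweeks); //604800 seconds = 1 week

   //date('w') returns 0 (Sunday) through 6 (Saturday)
   return date('w', $original) === date('w', $shifted);
   }
?>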

I think before we get too fancy with trying to make the dataset values all relative to each other, it would be worth the effort to first accomplish proper sewing-together of the data (maintaining alignment of days) and then just see what happens. You could get around the relative-value issue at first by using an end-of-week close, for example, to ensure that no positions were ever carried across a dataset boundary.

Re: Generator checking with more than one pair

drogus wrote:

Yeah, I should probably mention that these are just my subjective observations that aren't backed by any measurements, so if anyone is interested in serious work, you might want to check it more thoroughly wink

drogus, your idea makes good sense to me. Enough that I want to experiment and find out. It's pretty much common knowledge that a "good" discretionary strategy can work across any number of markets. So I think the same would probably hold true for improving the robustness of a bot, a.k.a. eliminating over-optimization.

Whether or not this synthetic dataset idea has merit I am not sure, but it's too simple not to just try it and find out.

Re: Generator checking with more than one pair

Btw, here is one interesting way to read correlation. This site graphs current price action and weights the currencies equally so you can get a good feel for realtime correlation. There are also some MT4 indicators that do the same thing.

http://www.fxmri.com

Post's attachments

correlation.png, 47.52 KB

Re: Generator checking with more than one pair

Btw, never mind what I said about the leap year. That would never be an issue, because even in a leap year you never actually add or eliminate a real day (LOL)

So that means all you'd have to do is ensure that the synthetic start date was aligned correctly with the actual dataset's day-of-week.

Btw, what does FSB do (if anything) when it comes across a gap in data?  What if a broker feed skips an entire day?  Does it do anything at all?

Re: Generator checking with more than one pair

I was thinking about this at the gym and I came up with another idea that I'd like to try after this one. It might have the same effect but achieve it more efficiently (especially when the Generator is searching for new strategies, for example).

Whereas the original idea was to stitch together a synthetic dataset, this idea would be to create a synthetic dataset by merging.

If you loaded all the minute bars for each desired pair into a separate table, you could then step through the bars one at a time and average them together to get the synthetic bar data. The resulting dataset would still be 50,000 bars, so it would be more efficient from the Generator's standpoint.
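A rough sketch of that bar-by-bar averaging (assuming the per-pair bars have already been aligned by timestamp and pre-adjusted, e.g. Yen divided by 100; the function and field names are just illustrative):

<?php
//average the i-th bar of each pair into one synthetic bar
//$datasets is an array of bar arrays, all the same length and time-aligned
function mergeBarsByAverage(array $datasets)
   {
   $merged = array();
   $count = count($datasets);
   foreach ($datasets[0] as $i => $firstBar)
      {
      $bar = array('open' => 0, 'high' => 0, 'low' => 0, 'close' => 0, 'vol' => 0);
      foreach ($datasets as $set)
         {
         foreach ($bar as $field => $value)
            $bar[$field] += $set[$i][$field];
         }
      foreach ($bar as $field => $value)
         $bar[$field] /= $count; //simple equal-weight average
      $bar['feeddate'] = $firstBar['feeddate']; //keep the first pair's timestamps
      $bar['feedtime'] = $firstBar['feedtime'];
      $merged[] = $bar;
      }
   return $merged;
   }
?>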

The theory in this is that a robust strategy that could perform well on the blended synthetic dataset would therefore also perform well on the individual pair (same objective as the stitched dataset).

If that worked well, it would be easy enough to create similar synthetic data from tickdata as well, except that you'd probably want to weight the ticks in case one pair was more active than another (equal weightings for the movement of each pair).

One issue I can think of would be the Yen pairs that only have 3 digits after the decimal instead of 5 for non-Yen pairs. An easy way to fix this (I think) would be to pad them with zeroes just after the decimal (ie, one point of movement remains the same). The actual value of the pair is irrelevant I think, as we would only be concerned about the movement.

For example:
80.936 --> 80.00936
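A minimal PHP sketch of that padding (the helper name is hypothetical; it just reproduces the example above):

<?php
//re-quote a 3-digit Yen price with 5 decimal places by padding zeroes
//right after the decimal point, so one point of movement stays one point
function padYenQuote($price)
   {
   list($whole, $decimals) = explode(".", $price); //"80.936" -> "80" and "936"
   return $whole . "." . str_pad($decimals, 5, "0", STR_PAD_LEFT); //"80.00936"
   }

echo padYenQuote("80.936"); //prints 80.00936
?>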

Re: Generator checking with more than one pair

drogus wrote:

Yeah, I should probably mention that these are just my subjective observations that aren't backed by any measurements, so if anyone is interested in serious work, you might want to check it more thoroughly wink

I meant that this feature is a good idea because, if a strategy works across more instruments, it is probably more likely to have found something good. For coding the feature, I was just thinking of adding a couple of lightweight checks, to keep up the performance of the Generator -- which runs tens to hundreds of thousands of combinations.

I made a spreadsheet to splice the data together, but I can't post it here because the forum says it's too big. I think you can download it from:
https://github.com/krogKrog/Forex-Strat … s?raw=true
and a sample file:
https://github.com/krogKrog/Forex-Strat … p?raw=true

Unfortunately it's an OpenOffice ods file; OpenOffice says it was too big to save in Excel xls format.
How to use:
1) In A1, paste in 50,000 lines of data from a _USD data file
2) In A50001, paste in 50,000 lines of data from a _USD data file
3) In A100001, paste in 50,000 lines of data from a _JPY data file
4) In A150001, paste in 50,000 lines of data from a _JPY data file
5) Open new spreadsheet
6) from splicer, select H1 to M200000, copy
7) in new sheet, right click, Paste Special
8) deselect formulas, select numbers, paste
9) save as EURUSD5.csv (or you can change EURUSD to any instrument you use in FSB)

It's pretty rough, so it might take a few tries. Mine crashes every 5 minutes. But with that csv file, I was able to run the Generator on it.

Re: Generator checking with more than one pair

To recreate the sheet:
Date column:
1) in H1, put any starting date you like (ex 1/1/2011 00:00:00)
2) in H2, add your interval - I use m5, so:  = H1 + TIME(0;5;0)
3) copy H2 down to H200000
4) format the column in an FSB-friendly format (e.g., MM/DD/YYYY HH:MM:SS --- no AM or PM; this is a common problem, as the spreadsheet app often tries to put it in)

First Data Set, 50,000 lines; this references every value against 1:
1) in I1, enter 1
2) in I2, enter    =(B2-$B$1) + $I$1
3) in J1, enter    =(C1-$B$1)+$I$1
4) copy J1 to K1 and L1
5) for Volume column, in M1, enter   =F1
6) copy J1 to M1 down to next line
7) copy I2 to M2 down to 50000

Second Data Set, lines 50001 to 100000:
1) I50001, enter L50000   // start this set from the last close of the previous data set
2) in I50002, enter    =(B50002-$B$50001) + $I$50001
3) in J50001, enter    =(C50001-$B$50001)+$I$50001
4) copy J50001 to K50001 and L50001
5) for Volume column, in M50001, enter   =F50001
6) copy J50001 to M50001 down to next line
7) copy I50002 to M50002 down to 100000

Third Data and Fourth Data Set:
do the same steps above, but change formulas to divide by 100:
= ((C100001-$B$100001)/100)+$I$100001
this will make it suitable for 2- or 3-digit instruments, like EURJPY and GBPJPY

Re: Generator checking with more than one pair

Cool! I will definitely check this, but I have a small question. From what I understood, the idea is to normalize all the pairs used to make the chart consistent. Wouldn't that change the strategy a bit? I mean, if one chart is stretched vertically, it can change the way the strategy works. A strategy aims to get maximum profit per day with minimum drawdown, so if the price moves are softer, it will produce different results. Am I right, or does this have nothing to do with the way the Generator works?


Re: Generator checking with more than one pair

I don't think 'normalize' is what we are doing here, if I'm understanding correctly. I haven't had time to study krog's formula yet (thanks for whipping this up so quick btw!) but I think all we are trying to accomplish is to make the price levels relative to each other to avoid unnatural gaps that would confuse the Generator at the splicepoints.

For example (correct me if I'm wrong) I believe that at any given place within the synthetic dataset, you would still find the same degree of movement in currency. Even the chart should look identical if you were to plot it. The only change that should take place when stitching the datasets would be adjusting the price level to match (I think) at the splice points.

So the theory is that the Generator would be looking at price action only. The next step would be to take that strategy that works well over the synthetic data and run it with actual data from one of the pairs. If the theory works, it should perform well.

EDIT: after thinking some more on this, I think all we are doing is shifting the next dataset to start at the same price level that the last one ended at. So if dataset1 ends at 1.2350 and dataset2 starts at 1.3468, then I believe the correct way to shift dataset2's data would be to simply subtract .1118 from each value. You'd then do the same shift at each new splicepoint.

Re: Generator checking with more than one pair

Here's a description, better than trying to read spreadsheet formulas: the spreadsheet converts all the values to pip differences, then adds back in the starting value, so each bar's O-H-L-C is expressed as how many pips it is away from the starting reference point. The very first Open is set to 1 -- no reason, just that 1 is usually a friendly number to work with. All values have the first Open subtracted, then 1 added. My first line:
1.03267    1.03317    1.03257    1.03317
which replaces with:
(1.03267-1.03267) + 1,    (1.03317-1.03267) + 1,    (1.03257-1.03267) + 1,    (1.03317-1.03267) + 1
Then line 50,000, the end of the dataset, is:
.99678       .99778     .99578     .99738
The last close of .99738 becomes the starting point for the next data set in line 50001:
.99738,     (1.35580 - 1.35480) + .99738,    (1.35380 - 1.35480) + .99738,   (1.35530 - 1.35480) + .99738
The 1.355... values are the real data from the pasted-in columns B-C-D-E; the formulas go in the synthetic columns I-J-K-L.
The yen-denominated data sets are also divided by 100 to adjust for their 2- and 3-digit quotes.

I don't know, this may or may not work out; it just gives us something to work with to see what actually happens. I suppose one thing to look for is whether the Generator gives you a strategy with a fairly linear equity curve, meaning consistent across pairs, or a curve that looks split into quarters, meaning the pairs are different? I don't know, this is interesting to research and try to interpret.

Here is a screenshot with my synthetic 5-min data:

http://s7.postimage.org/q3gaabbw7/synthetic_Data.jpg

Re: Generator checking with more than one pair

Here is a screenshot of the stitch bar from EURUSD sometime in Feb 2012, with EURJPY sometime in June 2011:

http://s7.postimage.org/53qzrbvg7/synthetic_Stitch.jpg

Re: Generator checking with more than one pair

Hey krog, looking good! One question I had about your above example - it seems like we would want the same data period for all of the groups, because macro events could have affected all pairs to some degree. So for example, if your EURUSD group starts on Jan 1st, then it seems like EURJPY should also start on Jan 1st at the splicepoint.

Re: Generator checking with more than one pair

Oh yes, I meant the EURUSD is from June 2011 to Feb 2012, 5 minute. In the screenshot, you see the last bars of EURUSD ending in Feb 2012, connecting to the first bars of EURJPY which is June 2011. The EURJPY data period is the same period, June 2011 to Feb 2012.

Re: Generator checking with more than one pair

I finally got around to working on this project. It percolated for awhile in my head and then I fought with it for awhile (I'm not really a programmer, just a wannabe lol). But I feel pretty confident that this data is correct now. In my slavejob I do a lot of data-processing, so this is like a fun puzzle to me.

I'm attaching the stitched (synthetic) data that contains the following:

  • exactly 7 weeks of EURUSD m1 beginning Sunday 1/8/2012 @ 23:00 GMT

  • exactly 7 weeks of GBPUSD m1 data

  • exactly 7 weeks of USDCHF m1 data

  • exactly 7 weeks of USDJPY m1 data ending Friday 7/20/2012 @ 21:59 GMT

There are 197691 bars in this synthetic dataset. Pricing has been normalized so that between seams the data MOVEMENT integrity remains (price is shifted accordingly). Yen pairs are divided by 100 before normalizing. I worked very hard on this normalizing portion to ensure that it was "correct" and takes into account ONLY the pip movement of the pairs, which is the most important aspect I think.

I'm attaching a screenshot here as well showing the Generator chugging. This strategy is not "good enough" yet, but you can see the price curve in background (light grey line) looks normal.

Next steps for me:

  • Find a good strategy with Generator

  • Test strategy on non-synthetic (ie real life) data for these pairs, compare stats

I believe this will be a good test to prove or disprove the theory that a strategy generated with synthetic-stitched data is useful.

Shortly thereafter, I plan to tweak the code a little and do that other variation I mentioned above. In that scenario, I will merge the data across all datasets per minute. I think I will probably average the data for each bar. The resulting synthetic dataset should be about 50,000 bars. There will be some issues to deal with, such as missing bars in one of the datasets.

<?php
//stitch together data sequentially
//FSB probably doesn't care about weekends or leapyear, so don't worry about this for now
//prerequisite: ensure all datasets begin/end at exactly start/end of week

//login to database
$linkid = @mysql_connect("localhost", "root", "") or trigger_error(mysql_error(), E_USER_ERROR);
mysql_select_db("dusktrader", $linkid);
$stitched = array(); //holds all stitched, time-shifted bar data
$i=1; //array index, keeps track of stitched bars

//datasets to stitch together, each set prepared with exactly 7 weeks of data
processdata('eurusd_m1', 0); //EURUSD not shifted
processdata('gbpusd_m1', 7);
processdata('usdchf_m1', 14);
processdata('usdjpy_m1', 21);

//insert stitched data to MySQL (easy to export from MySQL in FSB-ready format)
print "$i bars processed<br>";
$j=1;
while ($j < $i)
   {
   $sql = "INSERT INTO stitched_m1 " .
      "(feeddate, feedtime, open, high, low, close, vol) " .
      "VALUES (" .
      "'".$stitched[$j]['feeddate']."'," .
      "'".$stitched[$j]['feedtime']."'," .
      $stitched[$j]['open']."," .
      $stitched[$j]['high']."," .
      $stitched[$j]['low']."," .
      $stitched[$j]['close']."," .
      $stitched[$j]['vol'] .
      ");";
   $resultid = mysql_query($sql, $linkid);
   $j++;
   }

exit;


//read and process bar data from an existing table
//shiftweeks specifies the amount of calendar weeks data will be shifted
//from the original feeddate/feedtime
function processdata($table, $shiftweeks)
   {
   global $linkid, $stitched, $i, $prevclose;   
   
   //get initial opening value so we can calculate the difference
   $sql = "SELECT open FROM $table WHERE id=1"; //first record
   $resultid = mysql_query($sql, $linkid);
   $resulttext = mysql_fetch_object($resultid);
   if (floatval($resulttext->open) > 10) $divisor = 100; else $divisor = 1; //divide Yen pair values by 100
   $open = floatval($resulttext->open) / $divisor;
   if (!$prevclose) $prevclose = $open; //synthetic data start point
   $difference = $prevclose - $open; //shift so this dataset's first open continues from the previous close
   
   //loop through all data with price normalizing and time shifting
   $sql = "SELECT * FROM $table"; // ******** APPLY LIMIT HERE for testing
   $resultid = mysql_query($sql, $linkid);
   while($resulttext = mysql_fetch_object($resultid))
      {
      $open = floatval($resulttext->open) / $divisor;
      $high = floatval($resulttext->high) / $divisor;
      $low = floatval($resulttext->low) / $divisor;
      $close = floatval($resulttext->close) / $divisor;
      $timeshifted = timeshift($resulttext->feeddate, $resulttext->feedtime, $shiftweeks);
      $stitched[$i]['table'] = $table;
      $stitched[$i]['feeddate'] = $timeshifted['feeddate'];
      $stitched[$i]['feedtime'] = $timeshifted['feedtime'];
      $stitched[$i]['open'] = ($open + $difference);
      $stitched[$i]['high'] = ($high + $difference);
      $stitched[$i]['low'] = ($low + $difference);
      $stitched[$i]['close'] = ($close + $difference);
      $stitched[$i++]['vol'] = $resulttext->vol;
      }
   $prevclose = $close + $difference; //to be used on next dataset
   }

//parse and shift time by specified weeks into future
function timeshift($feeddate, $feedtime, $shiftweeks)
   {
   $unixtime = strtotime(str_replace(".","-",$feeddate)." ".$feedtime);
   $unixtime += (604800 * $shiftweeks); //# weeks into future

   $timeshifted['feeddate'] = date('Y.m.d', $unixtime);
   $timeshifted['feedtime'] = date('H:i', $unixtime);  
   return $timeshifted;
   }
?>

http://s13.postimage.org/90lqg54tv/synthetic.jpg

Re: Generator checking with more than one pair

Another variation I want to try will be interposing data bars.

In the example I posted, I noticed they are in this order: EURUSD, GBPUSD, USDCHF, USDJPY

Perhaps I should have alternated them to be: EURUSD, USDCHF, GBPUSD, USDJPY, so that the dollar strength would be more equally distributed.

The current stitching works by gluing an entire dataset to its next dataset. Another way that would be interesting to test would work like this:

bar1: EURUSD
bar2: USDCHF
bar3: GBPUSD
bar4: USDJPY
bar5: EURUSD ...

Between each bar would be the normalized actual movement in pips. The theory is that the strategy would have to be robust enough to deal with any type of movement at any given time.
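Here is a rough sketch of that round-robin interleaving, where each synthetic bar keeps its own O-H-L-C shape but is shifted so it opens where the previous synthetic bar closed (the pair order and field names are just illustrative):

<?php
//interleave bars from several pairs one bar at a time, shifting each bar so it
//opens at the previous synthetic close; the movement within each bar is preserved
function interleaveBars(array $datasets)
   {
   $out = array();
   $prevClose = null;
   $barCount = count($datasets[0]);
   for ($i = 0; $i < $barCount; $i++)
      {
      foreach ($datasets as $set) //e.g. EURUSD, USDCHF, GBPUSD, USDJPY
         {
         $bar = $set[$i];
         if ($prevClose === null)
            $prevClose = $bar['open']; //anchor the very first bar
         $shift = $prevClose - $bar['open'];
         foreach (array('open', 'high', 'low', 'close') as $field)
            $bar[$field] += $shift;
         $prevClose = $bar['close'];
         $out[] = $bar;
         }
      }
   return $out;
   }
?>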

We should also test synthetic data formats on a specific group of non-correlated pairs.

Re: Generator checking with more than one pair

dusktrader wrote:

....

The current stitching works by gluing an entire dataset to its next dataset. Another way that would be interesting to test would work like this:

bar1: EURUSD
bar2: USDCHF
bar3: GBPUSD
bar4: USDJPY
bar5: EURUSD ...

...

How about blocks of bars at a time? The analogy I'm thinking of is when DJs mix songs: they mix sets of notes from each song together, not individual notes, preserving some pattern characteristic of the source song. I wonder if single-bar mixing would remove the anomalies and non-random patterns from naturally occurring datasets? Interesting to research.


Re: Generator checking with more than one pair

As the last post in this thread was in March, I just wondered if any of the posters came to conclusions about the validity of this approach? In other words, were the found strategies more robust, and did you feel it was worthwhile "stitching" multiple datasets together?