"Windowing Help"

maxdama · September 2008

Can someone tell me what the problem is with my setup? Here are the files: experiment setup, data set

Basically the experiment is far too accurate, so I think data from the "future" might be being used to train the SVM. I've checked whether the data is in the correct order and I think it is. I think the problem may be that I am using windowing incorrectly, but it's hard to verify since the attributes are hard to scrutinize (they have unhelpful labels, and after normalization the values themselves are inscrutable). A correlation of .991 seems like it should be impossible for 10-day stock market forecasting, so I'm sure there is an error somewhere.

My experiment is basically a very simple SVR experiment on stock market data, following the basic guidelines for such a system recommended in the thread "Prediction (Forecasting ) with RM" and the LIBSVM "practical guide to SVM classification".

Thanks

Regards,
Max

haddock · September 2008

Hi Max,

You are right to suppose that in some way data from the future has crept in. Although your data is in sequential order, that is not the order in which XValidation was handling it, because it was using "stratified" sampling. If you change the sampling parameter to "linear", I'm afraid the results don't look so good. My advice is to look into the SlidingWindowValidation operator, and bear in mind that if you don't use the "horizon" parameter correctly you will get results that are suspiciously good (been there, got the T-shirt!).
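A small Python sketch (not RapidMiner code) may make the sampling point concrete. It compares contiguous "linear" folds with shuffled folds on time-ordered example indices and counts how many test examples have a training example right next to them in time. The function names are made up for illustration, and the shuffled folds are only a stand-in for any non-contiguous sampling scheme; this is not a reimplementation of XValidation.

```python
import random


def linear_folds(n_examples, n_folds):
    """Split indices 0..n-1 into contiguous blocks, preserving time order."""
    base, extra = divmod(n_examples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def shuffled_folds(n_examples, n_folds, seed=0):
    """Deal shuffled indices into folds (stand-in for non-contiguous sampling)."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[k::n_folds]) for k in range(n_folds)]


def leaky_test_examples(test_fold, n_examples, max_gap=1):
    """Count test examples with at least one training example within max_gap steps."""
    test = set(test_fold)
    count = 0
    for i in test_fold:
        neighbours = range(max(0, i - max_gap), min(n_examples, i + max_gap + 1))
        # Everything outside the test fold is used for training.
        if any(j not in test for j in neighbours):
            count += 1
    return count


lin = linear_folds(100, 10)
shf = shuffled_folds(100, 10)
lin_leaks = leaky_test_examples(lin[0], 100)
shf_leaks = leaky_test_examples(shf[0], 100)
print("linear fold 0, test examples next to training data:", lin_leaks)
print("shuffled fold 0, test examples next to training data:", shf_leaks)
```

With windowed examples, neighbouring rows share almost all of their lagged attributes, so a test row that sits next to a training row is close to a copy of something the learner has already seen. Contiguous folds limit this to the edges of each block but do not remove it.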

Ciao

maxdama · September 2008

Haddock,

Thanks for your suggestions, but even after switching to linear sampling the results are very unrealistic (correlations ~.9). It would be a great help if you would modify the experiment to correctly use the windowing operator and copy the XML to this thread. I have been testing as many configurations of XValidation, SlidingWindowValidation, and MultivariateSeries2Window as I can, but cannot seem to get it to work. It would be even better if you showed one example of each of the two latter operators, since there aren't any examples included. Sorry, it's a lot to ask, but there must be something fundamental I'm overlooking. Thanks.

Regards,
Max

haddock · September 2008

Hi Max,

Intriguing! If I run your code I get the same as you, i.e.

Performance:
PerformanceVector [
*****root_mean_squared_error: 7.301 +/- 0.881 (mikro: 7.354 +/- 0.000)
-----absolute_error: 4.584 +/- 0.480 (mikro: 4.584 +/- 5.750)
-----correlation: 0.991 +/- 0.002 (mikro: 0.991) ]
LibSVMLearner.C = 100
LibSVMLearner.gamma = .01

If I run your original code, but set XValidation to do linear sampling, like this...

<?xml version="1.0" encoding="windows-1252"?>
<process version="4.2">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="Input" class="OperatorChain" expanded="yes">
          <operator name="CSVExampleSource" class="CSVExampleSource">
              <parameter key="filename"	value="C:\Users\CJFP\Documents\rm_workspace\aapl.csv"/>
              <parameter key="id_column"	value="1"/>
              <parameter key="id_name"	value="Date"/>
          </operator>
          <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples" breakpoints="after">
              <parameter key="horizon"	value="10"/>
              <parameter key="label_dimension"	value="3"/>
              <parameter key="series_representation"	value="encode_series_by_examples"/>
              <parameter key="window_size"	value="30"/>
          </operator>
          <operator name="Normalization" class="Normalization">
              <parameter key="z_transform"	value="false"/>
          </operator>
      </operator>
      <operator name="Model" class="OperatorChain" expanded="yes">
          <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
              <list key="parameters">
                <parameter key="LibSVMLearner.C"	value="1,100,10"/>
                <parameter key="LibSVMLearner.gamma"	value=".00001,.0001,.001,.01"/>
              </list>
              <operator name="XValidation" class="XValidation" expanded="yes">
                  <parameter key="sampling_type"	value="linear sampling"/>
                  <operator name="Learner" class="OperatorChain" expanded="yes">
                      <operator name="LibSVMLearner" class="LibSVMLearner">
                          <parameter key="C"	value="10.0"/>
                          <list key="class_weights">
                          </list>
                          <parameter key="gamma"	value="0.01"/>
                          <parameter key="svm_type"	value="epsilon-SVR"/>
                      </operator>
                  </operator>
                  <operator name="Tester" class="OperatorChain" expanded="yes">
                      <operator name="ModelApplier" class="ModelApplier">
                          <list key="application_parameters">
                          </list>
                      </operator>
                      <operator name="RegressionPerformance" class="RegressionPerformance">
                          <parameter key="absolute_error"	value="true"/>
                          <parameter key="correlation"	value="true"/>
                          <parameter key="main_criterion"	value="root_mean_squared_error"/>
                          <parameter key="root_mean_squared_error"	value="true"/>
                      </operator>
                      <operator name="ProcessLog" class="ProcessLog">
                          <parameter key="filename"	value="maxXV"/>
                          <list key="log">
                            <parameter key="Correlation"	value="operator.RegressionPerformance.value.correlation"/>
                            <parameter key="RMSE"	value="operator.RegressionPerformance.value.root_mean_squared_error"/>
                            <parameter key="C"	value="operator.LibSVMLearner.parameter.C"/>
                            <parameter key="Gamma"	value="operator.LibSVMLearner.parameter.gamma"/>
                            <parameter key="Time"	value="operator.XValidation.value.looptime"/>
                          </list>
                          <parameter key="persistent"	value="true"/>
                      </operator>
                  </operator>
              </operator>
          </operator>
      </operator>
  </operator>

</process>

I get very different results, like this....

PerformanceVector [
*****root_mean_squared_error: 7.930 +/- 6.408 (mikro: 10.199 +/- 0.000)
-----absolute_error: 6.567 +/- 5.394 (mikro: 6.569 +/- 7.802)
-----correlation: 0.648 +/- 0.196 (mikro: 0.982) ]
LibSVMLearner.C = 100
LibSVMLearner.gamma = .001

As I mentioned in my last post, stratified sampling, which takes "random subsets with class distribution kept constant", is not appropriate because all class values have to be known in order to derive the distribution. In your case the label starts at about 10 and ends up around 150! There is also the problem that time series prediction requires that a "horizon" be specified, because today's 10-day forecast should not be validated against tomorrow's data. XValidation does not give you that option, but SlidingWindowValidation does. So changing your code to

<?xml version="1.0" encoding="windows-1252"?>
<process version="4.2">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="Input" class="OperatorChain" expanded="yes">
          <operator name="CSVExampleSource" class="CSVExampleSource">
              <parameter key="filename"	value="C:\Users\CJFP\Documents\rm_workspace\aapl.csv"/>
              <parameter key="id_column"	value="1"/>
              <parameter key="id_name"	value="Date"/>
          </operator>
          <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples" breakpoints="after">
              <parameter key="horizon"	value="10"/>
              <parameter key="label_dimension"	value="3"/>
              <parameter key="series_representation"	value="encode_series_by_examples"/>
              <parameter key="window_size"	value="30"/>
          </operator>
          <operator name="Normalization" class="Normalization">
              <parameter key="z_transform"	value="false"/>
          </operator>
      </operator>
      <operator name="Model" class="OperatorChain" expanded="yes">
          <operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
              <list key="parameters">
                <parameter key="LibSVMLearner.C"	value="1,100,10"/>
                <parameter key="LibSVMLearner.gamma"	value=".00001,.0001,.001,.01"/>
              </list>
              <operator name="SlidingWindowValidation" class="SlidingWindowValidation" expanded="yes">
                  <parameter key="test_window_width"	value="10"/>
                  <operator name="Learner" class="OperatorChain" expanded="yes">
                      <operator name="LibSVMLearner" class="LibSVMLearner">
                          <parameter key="C"	value="10.0"/>
                          <list key="class_weights">
                          </list>
                          <parameter key="gamma"	value="0.01"/>
                          <parameter key="svm_type"	value="epsilon-SVR"/>
                      </operator>
                  </operator>
                  <operator name="Tester" class="OperatorChain" expanded="yes">
                      <operator name="ModelApplier" class="ModelApplier">
                          <list key="application_parameters">
                          </list>
                      </operator>
                      <operator name="RegressionPerformance" class="RegressionPerformance">
                          <parameter key="correlation"	value="true"/>
                          <parameter key="main_criterion"	value="root_mean_squared_error"/>
                          <parameter key="root_mean_squared_error"	value="true"/>
                      </operator>
                      <operator name="ProcessLog" class="ProcessLog">
                          <parameter key="filename"	value="MaxSW"/>
                          <list key="log">
                            <parameter key="Correlation"	value="operator.RegressionPerformance.value.correlation"/>
                            <parameter key="RMSE"	value="operator.RegressionPerformance.value.root_mean_squared_error"/>
                            <parameter key="C"	value="operator.LibSVMLearner.parameter.C"/>
                            <parameter key="Gamma"	value="operator.LibSVMLearner.parameter.gamma"/>
                            <parameter key="Time"	value="operator.SlidingWindowValidation.value.looptime"/>
                          </list>
                      </operator>
                  </operator>
              </operator>
          </operator>
      </operator>
  </operator>

</process>

and running produces the following....

PerformanceVector [
*****root_mean_squared_error: 8.406 +/- 9.537 (mikro: 12.713 +/- 0.000)
-----correlation: 0.058 +/- 0.629 (mikro: 0.971) ]
LibSVMLearner.C = 100
LibSVMLearner.gamma = .01

So non-linear sampling without a forecast horizon must add up to a lot of what is termed "data snooping", and normalizing all the examples up front probably does not help (I seem to remember it being discussed by Ingo, he of the pointy head, elsewhere in this forum).
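To illustrate the normalization concern, here is a minimal Python sketch (again not RapidMiner code) of range normalization fitted on the whole series versus on the training rows only. The toy numbers and helper names are invented for illustration; it shows the mechanism only and says nothing about how much this affects the results above.

```python
def fit_min_max(values):
    """Return (lo, hi) computed from the given values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with a previously fitted range; unseen values may fall outside 0..1."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Toy rising series, loosely echoing a label that climbs from about 10 to about 150.
series = [10.0, 20.0, 40.0, 80.0, 150.0]
train, test = series[:3], series[3:]

# Leaky: the range comes from the whole series, so the training rows
# are scaled using a maximum that lies in the future.
lo_all, hi_all = fit_min_max(series)
leaky_train = apply_min_max(train, lo_all, hi_all)

# Safer: the range comes from the training rows only and is reused for the test rows.
lo_tr, hi_tr = fit_min_max(train)
clean_train = apply_min_max(train, lo_tr, hi_tr)
clean_test = apply_min_max(test, lo_tr, hi_tr)

print("leaky train:", leaky_train)
print("clean train:", clean_train)
print("clean test: ", clean_test)
```

In the leaky variant the training values are squeezed into the low end of the range, which already tells the learner that much larger values are coming.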

For more on data-snoop check out http://data-snooping.martinsewell.com/. It is a real issue, and if you plan to trade on what you find by mining, be very sure that your models are snoop-poop free!

Ciao

maxdama · September 2008

Haddock,

Thanks for your help. I'm not sure I understand why a problem remains after switching XValidation to linear sampling. Is it that the training set is wrapping around the test set and due to the inclusion of lagged values by MultivariateSeries2Window (MVS2W) it is training on some test data? As I increase the "horizon" parameter on MVS2W the performance decreases, which I wouldn't expect if good results were simply caused by snooping.

Is SlidingWindowValidation (SWV) meant to be used with MVS2W? If so, should the parameter values for "horizon" in each match up, or what? Thanks so much for your help. I'd like to have as intuitive an understanding of these powerful operators as possible in spite of the lack of documentation. Maybe Ingo will give some tips too.

Regards,
Max

haddock · September 2008

Consider some 10-day forecast training examples, the last of which is for July 4; that last example contains series looking back 30 days as attributes, and a label taken from 10 days into the future, from July 14. Any model generated using the July 4 example as training would not be available until after July 14. So a horizon needs to be specified, to prevent the model from July 4 being tested on examples for the period July 5-14, when it would not be available.

Moreover, that model has some knowledge of the period July 4-14 encoded into it, so using it in validation on examples for July 5-14 introduces data snooping. Therefore, if you have no horizon specified for the validation, you should expect the performance to decrease as you increase the horizon, because less snooping is included.
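The July example can be sketched in Python as a sliding-window split that leaves a gap between the last training example and the first test example. This is only an illustration of the idea: `sliding_window_splits` and its `gap` parameter are hypothetical names, and the sketch makes no claim about how SlidingWindowValidation counts its own "horizon" parameter.

```python
def sliding_window_splits(n_examples, train_width, test_width, gap, step=None):
    """Yield (train_indices, test_indices) over time-ordered examples.

    `gap` is the number of examples skipped between the last training example
    and the first test example (July 5-14 in the example above would be gap=10).
    """
    if step is None:
        step = test_width
    start = 0
    while True:
        train_end = start + train_width   # exclusive
        test_start = train_end + gap      # skip the examples inside the gap
        test_end = test_start + test_width  # exclusive
        if test_end > n_examples:
            break
        yield list(range(start, train_end)), list(range(test_start, test_end))
        start += step


splits = list(sliding_window_splits(100, train_width=50, test_width=10, gap=10))
for train_idx, test_idx in splits:
    print("train", train_idx[0], "to", train_idx[-1],
          "| test", test_idx[0], "to", test_idx[-1])
```

Setting `gap=0` makes the test window start immediately after the training window, which is the situation in which a model trained on the July 4 example would be scored on July 5-14.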

Hope that makes it a bit less murky....
