
Is there a way to select from many forecast methods, as with ensemble classification?

data1maths Member Posts: 27 Contributor I
edited January 2020 in Help

Hello everyone,

I am wondering whether, from many forecast methods (AR, ARMA, ARIMA, MA, Windowing (SVM, NN, ...), ...) applied to a time series, I can select the best forecast and plot it, or take the average of those forecasts at each time point, as is the case for ensemble classification (bagging & boosting) with many machine learning models.

I need your help.

Thank you

Regards.

 

Best Answer

  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Solution Accepted

    Hi data1maths,

     

    The answer to your question is a little bit convoluted. It depends on how you define the "best forecast", on whether you want to select or average, and on the differences between the various ARIMA models on the one hand and Windowing followed by training a regression model on the other.

     

    So I have to open up the scope of my answer.

     

    First, I just want to recap the difference between ARIMA models and Windowing (Regression model):

    An ARIMA model contains p AR terms, d differencing steps and q MA terms; AR(p) = ARIMA(p,0,0), ARMA(p,q) = ARIMA(p,0,q) and MA(q) = ARIMA(0,0,q) are therefore all the same method, just with different parameters. So I will always talk about ARIMA only.

    For training one ARIMA model, only one window (which can be the whole time series) is used. The model is then used to predict all future values (with increasing uncertainty) from the end of the training window on. If the training window changes (e.g. when new values are measured), the ARIMA model has to be trained again, because it depends on the previous values.
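    Just to illustrate that behaviour outside of RapidMiner, here is a minimal Python sketch (using statsmodels; the toy series, the order (2,1,1) and the horizon of 10 are arbitrary assumptions for the example): the order parameter (p, d, q) is what turns the one estimator into AR, MA, ARMA or full ARIMA, and a single fitted model forecasts all future steps from the end of its training window.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # toy series standing in for e.g. the Lake Huron data
    series = np.cumsum(np.random.default_rng(0).normal(size=200))

    # AR(2), MA(1) and ARIMA(2,1,1) are all the same estimator, only the order differs
    ar_model    = ARIMA(series, order=(2, 0, 0)).fit()   # "AR(2)"
    ma_model    = ARIMA(series, order=(0, 0, 1)).fit()   # "MA(1)"
    arima_model = ARIMA(series, order=(2, 1, 1)).fit()   # "ARIMA(2,1,1)"

    # one fitted model forecasts every future step from the end of the
    # training window; if new values arrive, it has to be refitted
    print(arima_model.forecast(steps=10))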

    In contrast, the Windowing operator converts the time series data set into a standard machine learning data set. Each window is one example, with the window values as attributes and the 'to be forecasted' value as the label attribute. A regression model (e.g. SVM, Random Forest, GBT, ...), once trained, can be applied to new windows. But it predicts only one future value (e.g. the next value). If the value two steps into the future shall be predicted, another model has to be trained.
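    And here is a rough Python sketch of what that Windowing approach amounts to (window size 30, horizon 1 and the Random Forest are just example choices): every window of past values becomes one example, the value one step ahead becomes the label, and any regression learner can then be trained on it, but it forecasts only that one step.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def make_windows(series, window_size, horizon=1):
        """Turn a 1D series into (examples, labels) for a regression learner."""
        X, y = [], []
        for start in range(len(series) - window_size - horizon + 1):
            X.append(series[start:start + window_size])           # window values = attributes
            y.append(series[start + window_size + horizon - 1])   # forecasted value = label
        return np.array(X), np.array(y)

    series = np.cumsum(np.random.default_rng(1).normal(size=200))
    X, y = make_windows(series, window_size=30)

    # once trained, the model can be applied to any new window,
    # but each model predicts only one step into the future
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    next_value = model.predict(series[-30:].reshape(1, -1))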

     

    Next, how to select the "best forecast":

    Selecting the best forecast is a typical optimization problem: we try out different parameters and/or different models and select the one which delivers the best performance. To do this we have to define what the best performance is. For the ARIMA models there are two kinds of performance measures we can use as a quality criterion.

    We can use information criteria (AIC, BIC, AICc), which basically describe how well the model fits the training data. The ARIMA Trainer outputs these performance criteria at its performance output port. The 'Automized Arima on US - Consumption data' process in the template folder of the Time Series Extension Samples folder demonstrates how to use this performance to optimize an ARIMA model. Be aware that this is a training performance, which does not necessarily describe the performance on new data correctly.
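    As a small Python sketch of that first option (again only to illustrate the idea; the grid of orders is an arbitrary example): fit the candidate orders on the training window and keep the one with the lowest AIC, remembering that this is still only a training-data criterion.

    import itertools
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    series = np.cumsum(np.random.default_rng(2).normal(size=200))

    # grid over p, d, q and keep the order with the lowest information criterion
    best_order, best_aic = None, np.inf
    for p, d, q in itertools.product(range(4), range(2), range(4)):
        try:
            aic = ARIMA(series, order=(p, d, q)).fit().aic
        except Exception:
            continue                        # some orders may fail to converge
        if aic < best_aic:
            best_order, best_aic = (p, d, q), aic

    print(best_order, best_aic)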

    The other kind of performance measures are the standard performance measures for a regression problem. A forecast model is trained and then used to predict the values for the next horizon. These forecasted values can be compared with the true values, and a performance measure can be calculated in the same way as for any other regression problem.

    The Forecast Validation operator can be used to calculate this performance for an ARIMA model. Please have a look at the 'Forecast Validation of ARIMA Model for Lake Huron' process, again in the template folder of the Time Series Extension Samples folder.
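    Conceptually this works like a rolling evaluation: slide the training window over the series, forecast the next horizon, and compare the forecasts with the values that actually follow. A minimal Python sketch of that idea (training length 100, horizon 10 and the order (2,1,1) are arbitrary assumptions):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA
    from sklearn.metrics import mean_squared_error

    series = np.cumsum(np.random.default_rng(3).normal(size=200))
    train_len, horizon = 100, 10

    errors = []
    # non-overlapping test windows, in the spirit of the no_overlapping_windows setting
    for start in range(0, len(series) - train_len - horizon + 1, horizon):
        train = series[start:start + train_len]
        test = series[start + train_len:start + train_len + horizon]
        forecast = ARIMA(train, order=(2, 1, 1)).fit().forecast(steps=horizon)
        errors.append(mean_squared_error(test, forecast))

    print(np.mean(errors))   # a test performance, comparable across forecast models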

    For a Windowing (Regression Model) setup, the standard Cross Validation can be used. Please have a look at the 'Create Model for Gas Prices' template process in the Time Series Extension Samples folder.

     

    Next, how to average:

    Of course you can apply as many models as you like, which all give you forecasted values, and average the results. When you use the Windowing operator, you get a standard machine learning data set, so you can even use the Bagging operator of RapidMiner.

    For ARIMA models, you can train models with different parameters p, d, q on the same training window. You could also use different lengths for the training window (which would be the closest thing to boosting that is possible here), but the end of the training window always has to be the same, because otherwise the trained models do not calculate predictions for the same values and cannot be averaged anymore.

    It is best to collect all the different models in a collection. Then you apply all models (with Loop Collection) and average over the results.
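    A small Python sketch of that averaging idea (the list of orders is an arbitrary example): all models are fitted on the same training window, so their forecasts refer to the same future values and can simply be averaged step by step.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    series = np.cumsum(np.random.default_rng(4).normal(size=200))
    horizon = 10

    # different (p, d, q) orders, all fitted on the same training window,
    # so all forecasts refer to the same future time steps
    orders = [(1, 0, 0), (2, 1, 1), (0, 0, 2), (3, 1, 2)]
    forecasts = [ARIMA(series, order=o).fit().forecast(steps=horizon) for o in orders]

    # average the individual forecasts at each future time step
    ensemble_forecast = np.mean(forecasts, axis=0)
    print(ensemble_forecast)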

     

    Finally, I have appended a RapidMiner process which tests different parameters of ARIMA models as well as different combinations of the window size in the Windowing operator and the number of trees of a Random Forest regression. All the different forecast models (the different ARIMA models and the different Windowing (Random Forest Regression) setups) are compared with each other, and the best performing one (in terms of least_square_error) is selected.

     

    Hope this helps,

    Best regards,

    Fabian

     


    <process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Lake Huron" width="90" x="112" y="34">
    <parameter key="repository_entry" value="//Time Series Extension Samples/data sets/Lake Huron"/>
    </operator>
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="166" name="Optimize Parameters (2)" width="90" x="447" y="34">
    <list key="parameters">
    <parameter key="Select Subprocess.select_which" value="[1.0;2;2;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="select_subprocess" compatibility="8.2.000" expanded="true" height="124" name="Select Subprocess" width="90" x="380" y="34">
    <parameter key="select_which" value="2"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="145" name="Optimize Parameters (3)" width="90" x="179" y="34">
    <list key="parameters">
    <parameter key="ARIMA Trainer (2).plithiumorder_of_the_autoregressive_model" value="[0.0;3;3;linear]"/>
    <parameter key="ARIMA Trainer (2).dlithiumdegree_of_differencing" value="[0.0;1;1;linear]"/>
    <parameter key="ARIMA Trainer (2).qlithiumorder_of_the_moving-average_model" value="[0.0;3;3;linear]"/>
    </list>
    <parameter key="error_handling" value="ignore error"/>
    <process expanded="true">
    <operator activated="true" class="timeseries:forecast_validation" compatibility="0.3.000-SNAPSHOT" expanded="true" height="145" name="Forecast Validation" width="90" x="380" y="34">
    <parameter key="time_series_attribute" value="Lake surface level / feet"/>
    <parameter key="no_overlapping_windows" value="true"/>
    <process expanded="true">
    <operator activated="true" class="timeseries:arima_trainer" compatibility="0.3.000-SNAPSHOT" expanded="true" height="103" name="ARIMA Trainer (2)" width="90" x="112" y="34">
    <parameter key="time_series_attribute" value="Lake surface level / feet"/>
    </operator>
    <connect from_port="training set" to_op="ARIMA Trainer (2)" to_port="example set"/>
    <connect from_op="ARIMA Trainer (2)" from_port="forecast model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="performance_regression" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
    <connect from_port="test set" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Forecast Validation" to_port="example set"/>
    <connect from_op="Forecast Validation" from_port="model" to_port="output 1"/>
    <connect from_op="Forecast Validation" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Optimize Parameters (3)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (3)" from_port="performance" to_port="output 1"/>
    <connect from_op="Optimize Parameters (3)" from_port="parameter set" to_port="output 2"/>
    <connect from_op="Optimize Parameters (3)" from_port="output 1" to_port="output 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    <portSpacing port="sink_output 4" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="8.2.000" expanded="true" height="145" name="Optimize Parameters (4)" width="90" x="179" y="34">
    <list key="parameters">
    <parameter key="Windowing.window_size" value="[10;30;2;linear]"/>
    <parameter key="Random Forest.number_of_trees" value="[30;50;2;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="timeseries:windowing" compatibility="0.3.000-SNAPSHOT" expanded="true" height="82" name="Windowing" width="90" x="179" y="34">
    <parameter key="time_series_attribute" value="Lake surface level / feet"/>
    <parameter key="window_size" value="30"/>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
    <process expanded="true">
    <operator activated="true" class="concurrency:parallel_random_forest" compatibility="8.2.000" expanded="true" height="103" name="Random Forest" width="90" x="112" y="34">
    <parameter key="number_of_trees" value="50"/>
    <parameter key="criterion" value="least_square"/>
    </operator>
    <connect from_port="training set" to_op="Random Forest" to_port="training set"/>
    <connect from_op="Random Forest" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_regression" compatibility="8.2.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Windowing" to_port="example set"/>
    <connect from_op="Windowing" from_port="windowed example set" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="output 1"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Optimize Parameters (4)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (4)" from_port="performance" to_port="output 1"/>
    <connect from_op="Optimize Parameters (4)" from_port="parameter set" to_port="output 2"/>
    <connect from_op="Optimize Parameters (4)" from_port="output 1" to_port="output 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    <portSpacing port="sink_output 4" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Select Subprocess" to_port="input 1"/>
    <connect from_op="Select Subprocess" from_port="output 1" to_port="performance"/>
    <connect from_op="Select Subprocess" from_port="output 2" to_port="output 1"/>
    <connect from_op="Select Subprocess" from_port="output 3" to_port="output 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Lake Huron" from_port="output" to_op="Optimize Parameters (2)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (2)" from_port="performance" to_port="result 1"/>
    <connect from_op="Optimize Parameters (2)" from_port="parameter set" to_port="result 2"/>
    <connect from_op="Optimize Parameters (2)" from_port="output 1" to_port="result 3"/>
    <connect from_op="Optimize Parameters (2)" from_port="output 2" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>

     

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    tagging @tftemme

     

     

  • data1maths Member Posts: 27 Contributor I

    Honestly, this is by far the most detailed response I've ever had, thank you so much :) .

    As you've recommended, I'll try your process on my case and I'll tell you about my progress later on.

    Thank you again for your help.

     

    Regards, 

    Data1maths

     
