RapidMiner

Time Series & Prediction Label Value Range

Regular Contributor


Following the 2010 tutorial "RapidMiner 5.0 Video Tutorial #10 - Financial Time Series Modeling" by Thomas Ott, I get prediction labels such as '31.000', while my actual label values are between 0 and 9 (see below).

What's going on here? Is it because of my RM version, or did I make an avoidable mistake?

Can anyone help?

PS:

Label = n1

My out-of-sample data are the last 10 rows (the most recent ones) of a larger sample.

My inner sample (training) data is the rest of the data (historically earlier).

 

messed_up_1.gif

 

PS: Are there any newer videos on time series available?


8 REPLIES
Community Manager

Re: Time Series & Prediction Label Value Range

I'm not sure what your process looks like or which algorithm you are using, but you may remember from my tutorials that point forecasting was not as robust as trend forecasting in RapidMiner. If you want point forecasts, I suggest using the forecast library in R and wrapping it inside RapidMiner.

 

There is one updated written tutorial in Vijay and Bala's book, I think Chapter 10.

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: Time Series & Prediction Label Value Range


Thank you for the response Thomas,

 

Actually, I want to predict directions, but I'm wondering about the value ranges in the forecast.

Can you confirm that I made no significant mistake and that this is still the right way to do it?

Here is my process (attachment).

 


Community Manager

Re: Time Series & Prediction Label Value Range

I see that you're using an SVM with a dot kernel. What is this time series? Production units? Sales? The choice of SVM kernel, C value, and gamma can have a dramatic effect on forecasting the direction of your time series (see attached). Without knowing the data, it almost looks like a GLM would work better, but I would check.

Image: C vs gamma

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: Time Series & Prediction Label Value Range


Hello T-Bone, thank you for the response.

 

I see the C parameter of the SVM operator but no gamma. How did you produce the 'C vs gamma' image?

The data is real live data (see attachment).

If I had 100 datasets, could I use 90 of them as inner sample data and the remaining 10 as outer sample/validation data?

Should the validation data be more recent than the training data?

 

Thank you for the advice.

Community Manager

Re: Time Series & Prediction Label Value Range


Ah yes, the gamma parameter becomes available once you change the kernel from dot to anything else. So I changed it to an RBF (radial) kernel, which tends to perform better on time series. I also took your process and wrapped the validation in a grid parameter optimization over C and gamma. Once C and gamma changed, the results started to come into line.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve plus_5_inner_sample_sqlite" width="90" x="45" y="85">
        <parameter key="repository_entry" value="../data/plus_5_inner_sample_sqlite"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.3.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="187">
        <parameter key="attribute_name" value="drDateText"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="series:windowing" compatibility="7.3.000" expanded="true" height="82" name="Windowing" width="90" x="380" y="187">
        <parameter key="window_size" value="1"/>
        <parameter key="create_label" value="true"/>
        <parameter key="label_attribute" value="n1"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.3.001" expanded="true" height="124" name="Optimize Parameters (Grid)" width="90" x="581" y="187">
        <list key="parameters">
          <parameter key="SVM.kernel_gamma" value="[0.001;1000;10;logarithmic]"/>
          <parameter key="SVM.C" value="[0;10000;10;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="series:sliding_window_validation" compatibility="7.3.000" expanded="true" height="124" name="Validation" width="90" x="112" y="34">
            <parameter key="training_window_width" value="20"/>
            <parameter key="training_window_step_size" value="5"/>
            <parameter key="test_window_width" value="20"/>
            <parameter key="horizon" value="5"/>
            <process expanded="true">
              <operator activated="true" class="support_vector_machine" compatibility="7.3.001" expanded="true" height="124" name="SVM" width="90" x="179" y="34">
                <parameter key="kernel_type" value="radial"/>
                <parameter key="kernel_gamma" value="0.0039810717055349725"/>
              </operator>
              <connect from_port="training" to_op="SVM" to_port="training set"/>
              <connect from_op="SVM" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.3.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="series:forecasting_performance" compatibility="7.3.000" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
                <parameter key="horizon" value="1"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="7.3.001" expanded="true" height="82" name="Log" width="90" x="246" y="85">
            <list key="log">
              <parameter key="C" value="operator.SVM.parameter.C"/>
              <parameter key="Gamma" value="operator.SVM.parameter.kernel_gamma"/>
              <parameter key="Forecast Perf" value="operator.Validation.value.performance"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="model" to_port="result 1"/>
          <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="false" class="legacy:write_model" compatibility="7.3.001" expanded="true" height="68" name="Write Model" width="90" x="246" y="34">
        <parameter key="model_file" value="C:\0000_TRANSFER\HTML5\LOTTO_CORE_2016\lottoData\PLUS_5_ARCHIV_DATA\testmod.mod"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve plus_5_outer_sample_sqlite" width="90" x="112" y="493">
        <parameter key="repository_entry" value="../data/plus_5_outer_sample_sqlite"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.3.001" expanded="true" height="82" name="Set Role (2)" width="90" x="246" y="340">
        <parameter key="attribute_name" value="drDateText"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="false" class="read_excel" compatibility="7.3.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="187">
        <parameter key="excel_file" value="C:\0000_TRANSFER\HTML5\LOTTO_CORE_2016\lottoData\PLUS_5_ARCHIV_DATA\plus_5_inner_sample_sqlite.xlsx"/>
        <parameter key="sheet_number" value="2"/>
        <parameter key="imported_cell_range" value="A1:F4451"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="n1.true.integer.attribute"/>
          <parameter key="1" value="n2.true.integer.attribute"/>
          <parameter key="2" value="n3.true.integer.attribute"/>
          <parameter key="3" value="n4.true.integer.attribute"/>
          <parameter key="4" value="n5.true.integer.attribute"/>
          <parameter key="5" value="drDateText.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="false" class="read_excel" compatibility="7.3.001" expanded="true" height="68" name="Read Excel (2)" width="90" x="112" y="340">
        <parameter key="excel_file" value="C:\0000_TRANSFER\HTML5\LOTTO_CORE_2016\lottoData\PLUS_5_ARCHIV_DATA\plus_5_outer_sample_sqlite.xlsx"/>
        <parameter key="sheet_number" value="2"/>
        <parameter key="imported_cell_range" value="A1:F11"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="date_format" value="yyyy-MM-dd"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="n1.true.integer.attribute"/>
          <parameter key="1" value="n2.true.integer.attribute"/>
          <parameter key="2" value="n3.true.integer.attribute"/>
          <parameter key="3" value="n4.true.integer.attribute"/>
          <parameter key="4" value="n5.true.integer.attribute"/>
          <parameter key="5" value="drDateText.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="false" class="legacy:read_model" compatibility="7.3.001" expanded="true" height="68" name="Read Model" width="90" x="514" y="391">
        <parameter key="model_file" value="C:\0000_TRANSFER\HTML5\LOTTO_CORE_2016\lottoData\PLUS_5_ARCHIV_DATA\testmod.mod"/>
      </operator>
      <operator activated="true" class="series:windowing" compatibility="7.3.000" expanded="true" height="82" name="Windowing (2)" width="90" x="380" y="340">
        <parameter key="window_size" value="1"/>
        <parameter key="label_attribute" value="n1"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="782" y="289">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve plus_5_inner_sample_sqlite" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Windowing" to_port="example set input"/>
      <connect from_op="Windowing" from_port="example set output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Retrieve plus_5_outer_sample_sqlite" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Windowing (2)" to_port="example set input"/>
      <connect from_op="Windowing (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="center" color="yellow" colored="false" height="135" resized="true" width="394" x="374" y="26">N(x)-Richtungs-Prediction Part A&lt;br&gt;Vorgehen: Node Windowing n1-n5 schalten&lt;br/&gt;(siehe Bookmarks)&lt;br/&gt;&lt;br&gt;</description>
    </process>
  </operator>
</process>

With respect to your question on using a Cross or Split Validation: you could try those operators, but then you lose the time dependency of the series, because training and test examples may no longer be kept in chronological order.
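
To illustrate why the ordering matters, here is a minimal Python sketch of a sliding-window split. It only mimics the idea behind the Sliding Window Validation operator, using the widths from the process above (training window 20, step 5, test window 20, horizon 5). The function name `sliding_window_splits` and the exact index arithmetic (the test window starting `horizon` rows after the last training row) are assumptions for illustration, not RapidMiner's implementation.

```python
def sliding_window_splits(n_rows, train_width=20, step=5, test_width=20, horizon=5):
    """Return (train_indices, test_indices) pairs in chronological order.

    Assumption: the first test row lies `horizon` rows after the last
    training row, so every test row is later than every training row.
    """
    splits = []
    start = 0
    while True:
        train = range(start, start + train_width)
        test_start = train[-1] + horizon
        test = range(test_start, test_start + test_width)
        if test.stop > n_rows:
            break  # not enough rows left for a full test window
        splits.append((list(train), list(test)))
        start += step  # slide the training window forward
    return splits


splits = sliding_window_splits(n_rows=60)
for train, test in splits:
    print(f"train rows {train[0]}-{train[-1]} -> test rows {test[0]}-{test[-1]}")
```

In every split the model is trained only on rows that come before the rows it is tested on, which is the property a shuffled cross validation does not guarantee.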

 

Note: I don't know how powerful your machine is, but the more parameters (and values per parameter) you choose to optimize, the longer the run time.
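
To get a feeling for the size of the search, the following Python sketch expands the two ranges from the process above, `[0.001;1000;10;logarithmic]` for gamma and `[0;10000;10;linear]` for C. It assumes that "10 steps" means 10 intervals, i.e. 11 values per parameter; the helper names `log_grid` and `linear_grid` are made up for this example. The assumption seems plausible because the second gamma value it produces is approximately the `kernel_gamma` value stored in the process (0.00398...).

```python
import math


def log_grid(lo, hi, steps):
    """steps intervals -> steps + 1 values, evenly spaced on a log scale."""
    ratio = hi / lo
    return [lo * ratio ** (i / steps) for i in range(steps + 1)]


def linear_grid(lo, hi, steps):
    """steps intervals -> steps + 1 values, evenly spaced on a linear scale."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]


gammas = log_grid(0.001, 1000, 10)
cs = linear_grid(0, 10000, 10)

# Every C value is combined with every gamma value.
combinations = [(c, g) for c in cs for g in gammas]
print("gamma candidates:", [round(g, 6) for g in gammas])
print("number of (C, gamma) combinations:", len(combinations))
```

Under these assumptions the grid contains 11 x 11 = 121 combinations, and each one triggers a complete sliding-window validation, which is where the run time goes.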

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: Time Series & Prediction Label Value Range


Thank you very, very much, Thomas.

It's wonderful. I'm speechless.

 

A few questions remain.

If I understand correctly, I can now take the best-performing C and gamma parameters from the log, rewire the setup, and use them "hard-coded" to get the best predictions for the complete dataset in less time. Is this right?

The prediction for tomorrow, based on up-to-date data, is shown in the last row of the result. Is this right?

 

results_tomorrow_0.png

Community Manager

Re: Time Series & Prediction Label Value Range

With respect to your first question, yes. The optimized values of C and gamma can now be used in your process. Just put them into the SVM parameters and run your process again; this time it will run faster.

 

With respect to your last question, yes, you should expect the value to be lower. When you use Windowing and set your label column, the label value is shifted back in time and the window is used to predict the value for the current window. It's a bit confusing, but for a refresher check out this Community thread: http://community.rapidminer.com/t5/RapidMiner-Studio/Time-Series-using-Windowing-operator-in-RapidMi...
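
As a rough illustration of that shift, the following Python sketch builds windowed examples from a single series. The helper `make_windows` is hypothetical, and it assumes the label is the value `horizon` steps after the end of the window, which is one common way to set this up. It is not a reproduction of the Windowing operator.

```python
def make_windows(series, window_size=1, horizon=1):
    """Pair each window of past values with the value `horizon` steps later."""
    rows = []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        window = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        rows.append((window, label))
    return rows


series = [3, 7, 1, 9, 4]
rows = make_windows(series)
for window, label in rows:
    print(window, "->", label)
```

In this sketch the most recent value (4) only ever appears as a label. A window built from it has no known label yet, so that is the row a trained model would be applied to in order to predict the next, unseen value.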

 

In cases like this I usually convert the label to Down or Up values using the Classify by Trend operator. Good luck!
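
The idea of turning a numeric label into a direction can be sketched in a few lines of Python. This is only an illustration of the concept, not the Classify by Trend operator itself; the function name `to_direction` and the choice to treat an unchanged value as "Down" are assumptions.

```python
def to_direction(values):
    """Label each value 'Up' if it is greater than its predecessor, else 'Down'.

    The first value has no predecessor, so its label is None.
    Assumption: an unchanged value counts as 'Down'.
    """
    if not values:
        return []
    labels = [None]
    for prev, cur in zip(values, values[1:]):
        labels.append("Up" if cur > prev else "Down")
    return labels


directions = to_direction([3, 7, 1, 9, 9])
print(directions)
```

A classifier trained on such Up/Down labels predicts a direction directly, so the numeric range of the forecast no longer matters.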

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: Time Series & Prediction Label Value Range


Thank you Thomas,

 

 

Currently I have no idea how to use the Classify by Trend Operator.

But since I will write to Excel I can classify via VBA.

 

Are there any useful features for detecting overfitting in RM?

 

In this case, what do you think, performance-wise, about SVM versus recurrent neural networks?

I tried RNN a little bit in TensorFlow (with no success, still learning).

 

PS: I tried to give you a thumbs up, but my browser fails at that point.