
# Univariate Forecasting

Hello all,

I'm new to RapidMiner and having a problem arranging my operators to carry out univariate time series forecasting. I need some help.

Here, I have a dataset consisting of one attribute, i.e. the amount of beer production each month. There are approximately 476 rows in the dataset, and each row represents the beer production in one month. So I divided this dataset manually into 70% and 30% for training and testing, respectively. After that, I prepared the operators in RapidMiner as follows:

- Applying the Series2WindowExamples operator in order to apply windowing.
- Letting an algorithm (such as NeuralNet or LibSVMLearner) produce a model based on the training data. This is done in a cross-validation scheme.
- Assuming the above steps give me a correct model, loading my testing dataset (the 30% portion) and applying the stored model to it.
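As a sanity check on my understanding, here is a rough sketch of what I believe the windowing step does — plain Python, not RapidMiner code; the function and variable names are my own invention:

```python
# Conceptual sketch (not RapidMiner code): windowing turns a single
# time series into supervised learning examples. Window size and
# horizon values below are illustrative.
def window_examples(series, window_size, horizon=1):
    """Each example: the previous `window_size` values as inputs,
    and the value `horizon` steps after the window as the label."""
    examples = []
    for start in range(len(series) - window_size - horizon + 1):
        inputs = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        examples.append((inputs, label))
    return examples

production = [93, 96, 95, 97, 101, 104, 100]  # made-up monthly values
for inputs, label in window_examples(production, window_size=3):
    print(inputs, "->", label)  # e.g. [93, 96, 95] -> 97
```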

Now that I come to think of it, I believe I have arranged the operators wrongly, but I don't know the correct way to do it. Regarding the cross-validation operator, I've read some threads on this forum and found that it could lead to training the model incorrectly, using values that come after the ones to be forecast. I guess I should use SlidingWindowValidation instead.

```xml
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\Documents and Settings\Wendy\Desktop\newelec.xls"/>
        <parameter key="sheet_number" value="2"/>
    </operator>
    <operator name="Series2WindowExamples" class="Series2WindowExamples">
        <parameter key="series_representation" value="encode_series_by_examples"/>
        <parameter key="window_size" value="10"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="number_of_validations" value="2"/>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="NeuralNet" class="NeuralNet">
                <list key="hidden_layer_types">
                </list>
                <parameter key="training_cycles" value="1000"/>
                <parameter key="learning_rate" value="0.7"/>
                <parameter key="momentum" value="0.7"/>
            </operator>
            <operator name="ModelWriter" class="ModelWriter">
                <parameter key="model_file" value="C:\Documents and Settings\Wendy\Desktop\newelec_model.mod"/>
            </operator>
        </operator>
        <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="Performance" class="Performance">
            </operator>
        </operator>
    </operator>
    <operator name="ExcelExampleSource (2)" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\Documents and Settings\Wendy\Desktop\newelec.xls"/>
        <parameter key="sheet_number" value="3"/>
    </operator>
    <operator name="ModelLoader" class="ModelLoader">
        <parameter key="model_file" value="C:\Documents and Settings\Wendy\Desktop\newelec_model.mod"/>
    </operator>
    <operator name="ModelApplier (2)" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>
```

I realized that there's something wrong with my testing part (the part that begins with ExcelExampleSource (2), ModelLoader, and ModelApplier (2)). I'm supposed to forecast the values for the final 30% of the original dataset, but this testing dataset already contains the actual values; I need those actual values only for comparing against the forecasting results at the end.

I'm so confused about this. Should I actually not divide my original dataset? How do I get RapidMiner to learn a model from the first 70% of the data so that it can produce forecast values for the following 30%?

I'm sorry if anything in my writing is unclear. I've searched this forum's archive on the matter and still don't understand. I'd be very grateful if anyone could help. (Sorry for the very long post ;p)

Thanks in advance,

Wendy


## Answers

RM Founder:

Using the SlidingWindowValidation is definitely a better evaluation for univariate time series forecasting than a single split into training and test sets. In other words, there is no need for a 70:30 training/test split.

The SlidingWindowValidation simulates time and moves the evaluation window over the data, always using the data inside the training window (past data within the simulation) for training, and the next test window on the time series for testing and evaluation. Inside the training window of the SlidingWindowValidation, you should use a Series2WindowExamples or a MultiVariateSeries2WindowExamples operator to create the training examples for your regression learner, e.g. a neural net, an SVM, or a linear regression. Since training and test data need to be represented in the same way, you should also apply the corresponding windowing with Series2WindowExamples or MultiVariateSeries2WindowExamples, respectively, on the test set, i.e. on the data within the test window.

For more information about and examples of univariate and multivariate time series prediction set-ups, I can recommend the training course "Time Series Analysis with Statistical Methods" and the webinar "Time Series Analysis and Forecasts with RapidMiner".

Best regards,

Ralf
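The sliding-window idea Ralf describes can be sketched in plain Python — a conceptual illustration, not RapidMiner code; the function and parameter names below are invented, and only loosely mirror the operator's actual parameters:

```python
# Conceptual sketch of sliding-window validation: paired training/test
# windows move over the series, so the model is always trained on past
# data and evaluated on the data that immediately follows it.
def sliding_windows(n_rows, train_width, test_width, step):
    """Yield (train_rows, test_rows) index pairs over a series of
    `n_rows` rows, advancing the windows by `step` rows each time."""
    start = 0
    while start + train_width + test_width <= n_rows:
        train = range(start, start + train_width)
        test = range(start + train_width, start + train_width + test_width)
        yield train, test
        start += step

# Usage: for each pair, train on `train`, then apply and evaluate on `test`.
for train, test in sliding_windows(n_rows=10, train_width=4, test_width=2, step=2):
    print(list(train), "->", list(test))
```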

Contributor II:

Thanks so much for your reply! It really helps me better understand the windowing approach. By the way, the training course and the webinar are just too costly for a student like me. Besides, I'm located in Asia.

I'm a little curious about this ChangeAttributeRole operator that you used. What does it do, exactly? It changes the attribute role, as written, but what is the implication? Also, in this scheme, do I have to set the same values for horizon, window_size, and step_size for both Series2WindowExamples operators in the training and testing stages?

I have a question about the SlidingWindowValidation operator. Suppose there are 476 rows in my dataset. I want RapidMiner to use the first 337 rows for training and forecast the remaining 139 rows. Initially, I set up the parameters of SlidingWindowValidation as follows:

- training_window_width = 337
- training_window_step_size = 1
- test_window_width = 139
- horizon = 1

(Note: the window size for both Series2WindowExamples operators, in training and testing, is 20.)

However, I realized that in the testing stage, RapidMiner starts forecasting from row 358 onwards, because the test window starts at row 338 (window size: 20).

In order to solve this, should I set training_window_width = 317 and test_window_width = 159? But then the training could miss some rows for learning (especially rows 318 to 337), because they would be used to forecast the value at row 338. Can anyone help me answer this?
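To double-check the arithmetic (a sketch of my own reasoning, with rows 1-indexed, horizon = 1, and all numbers taken from above):

```python
# The first 20 rows of the test window are consumed building the first
# input window, so the first row that can actually be forecast comes
# 20 rows after the start of the test window.
training_window_width = 337
window_size = 20

test_window_start = training_window_width + 1          # row 338
first_forecast_row = test_window_start + window_size   # row 358
print(first_forecast_row)
```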

I'm sorry for asking a lot of questions.

Thanks again,

Wendy

RM Founder:

Setting the attribute role of a (time series) attribute to label tells RapidMiner to use this attribute (time series) as the one to be predicted in the forecasts. In the case of multivariate time series, you may have many input time series that the model can be built upon, but typically you intend to predict one particular time series based on these. Correspondingly, you mark the target time series as the label.

Yes, you have to use the same parameter values, because otherwise the trained model does not fit the test data when you apply it. The window length and all other potential preprocessing steps have to be identical between training and testing; otherwise the model is not appropriate.

Best regards,

Ralf
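A small illustration of the role concept described above — plain Python with invented column names; RapidMiner handles this internally via attribute roles:

```python
# After windowing, each example has the window values as regular
# attributes plus one target column. "Setting the role to label" just
# tells the learner which column to predict; the rest are inputs.
example = {
    "window-0": 93,  # role: regular (input)
    "window-1": 96,  # role: regular (input)
    "window-2": 95,  # role: regular (input)
    "label": 97,     # role: label (the value to be predicted)
}
inputs = {k: v for k, v in example.items() if k != "label"}
target = example["label"]
```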

Contributor II:

Regards,

Wendy

Contributor I:

If not, how can I make these operators usable for my models in RapidMiner Beta 5.0?

Regards,

Partha

Unicorn:

We have a Time Series Extension and decided to include these operators in the extension, rather than splitting things up with part of the operators in the core and the rest in the extension. So you will have to install the Time Series Extension to get access to these operators. It will be available for download with the new RapidMiner 5.0 release this week.

Greetings,

Sebastian

RM Founder:

In other words: the RapidMiner time series data mining processes I posted earlier work fine with RapidMiner 4.6 and its value series plugin as they are now, and they will also work fine with the RapidMiner 5 version and its time series extension that will be released this week.

Best regards,

Ralf