Time Series using Windowing operator in RapidMiner

rainaadirainaadi Member Posts: 4 Contributor I
edited November 2018 in Help

I'm trying to use a time series model in RapidMiner to forecast premium paid to an insurance company. Specifically, I have an entry for each month from January 2009 - December 2015, I want to be able to forecast the data for the next 12 months (January 2016-December 2016).

I'm having trouble understanding how the Windowing operator works, I have a few questions:

1) What goes into selecting a window size? If I want to forecast Premium over the next 12 months, is my window size 12? And if so, why do I get 12 attributes for each original attribute in my data set (the original Premium amount in one of these 12)? I get that this is supposed to explain the corresponding label value (which is just the next row's original Premium, not sure why this is happening either), but where are these numbers coming from and why does RapidMiner generate these?

2) What does the option "create single attributes" do?

3) The horizon field: If this is the distance between the last window value and the value to predict, does this mean I can't at once predict the next 12 months of data? Even if I enter the horizon as 1 (which I take to mean, give me the prediction for January 2016 since the last data point is for December 2015), then why is there no label value for December 2015 or January 2016 in the output when I run the process?

I'm a beginner, and I would really appreciate any help!

Best Answer

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    Hi Rainaddi,

     

    I'm the author of those old videos and you're right, I didn't explain why I chose the Windowing parameters as I did.

     

    First off, there's an older and more detailed explanation of the Windowing operator in our community, so check that out too: http://community.rapidminer.com/t5/RapidMiner-Studio/Prediction-Forecasting-with-RM/td-p/210

     

    Great questions. Let me preface this by saying that the Series extension is fantastic for forecasting trend directions, and it's decent at point forecasts too, but if a point forecast is what you're after, I'd mash up the R Forecast library in Studio. Pretty easy to do.

     

    Note that most of the parameters I chose are only a starting point. I make a "best guess" and then use Parameter Optimization to vary parameters such as Window Size, Training/Testing Window Width, Step Size, etc.

     

    I think Simafore's blog said it best: using the Windowing operator is like taking a "cross section of data" in time. You can have multiple attributes (columns) with different data points that help describe your label (target variable). For example, let's take this simple stock close dataset. It has XOM, FB, and MSFT closing values. We're interested in forecasting the trend of XOM_CLOSE, so we use it as the Label (target variable) and use the FB and MSFT closing prices as part of the input. You want to create a multivariate data set that describes XOM.

     

     

    WindowingExample 1.png

    So how do you use FB_CLOSE and MSFT_CLOSE in your forecast? That's where the Windowing operator comes in: I want to take that data and make a "window" of FB/MSFT data points that describe some XOM data point in time. The question is, what size window to use? That's where a bit of domain knowledge comes in. You'll have to make a first "best guess," remembering that you can change the Window Size when you use Parameter Optimization later.

     

    For the sake of argument, let's take a 5 day window (the trading week is typically 5 days). That is the Window Size. The Step Size is how far you want to advance the window each time. Setting the Step Size also requires a bit of domain knowledge, because you could be forecasting weekly, monthly, or quarterly data. For our example, we advance it by 1 (the next day).

     

    You should see something like this:

     

    WindowingExample 2.png

    I put red boxes on the image above to highlight an important concept. In example row 1, the Date-4 column gives the date of the closing prices stored in XOM_CLOSE-4 and MSFT_CLOSE-4 (FB was cut off in the screenshot). Likewise, in example row 3, Date-3 gives the date of the closing prices stored in XOM_CLOSE-3 and MSFT_CLOSE-3. Now you have a 5 day window of data on an example (row) by example (row) basis. This is good, but we're not done yet.
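
    The column-to-row rotation described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration, not the operator's actual implementation: the helper name `window_rows`, the price values, and the assumption that the suffix `-4` marks the oldest value in a 5 day window and `-0` the newest are all made up for the example.

```python
from typing import Dict, List


def window_rows(series: Dict[str, List[float]],
                window_size: int,
                step_size: int = 1) -> List[Dict[str, float]]:
    """Turn column-wise series into one row per window.

    Every attribute contributes window_size lagged copies to each row,
    named "<attribute>-<lag>". Assumption: the highest lag is the oldest
    value in the window and lag 0 is the newest.
    """
    n = len(next(iter(series.values())))
    rows = []
    # Slide the window over the series, advancing by step_size each time.
    for start in range(0, n - window_size + 1, step_size):
        row = {}
        for name, values in series.items():
            for offset in range(window_size):
                lag = window_size - 1 - offset
                row[f"{name}-{lag}"] = values[start + offset]
        rows.append(row)
    return rows


# Made-up closing prices, for illustration only.
prices = {
    "XOM_CLOSE": [77.0, 78.0, 77.5, 76.0, 75.5, 74.0, 73.5],
    "MSFT_CLOSE": [54.0, 54.5, 55.0, 54.0, 53.5, 52.0, 52.5],
}

rows = window_rows(prices, window_size=5, step_size=1)
print(len(rows))          # 7 days with a 5 day window give 3 rows
print(len(rows[0]))       # 2 attributes x 5 lags = 10 columns per row
print(rows[0]["XOM_CLOSE-4"], rows[0]["XOM_CLOSE-0"])
```

    This is why a window size of 12 produces 12 generated attributes per original attribute: each one is simply the same series shifted by one more time step.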

     

    Why is it important to rotate your data series from columns to rows? You could just use a simple univariate column and do a Linear Regression on it, which is fine, but what if you want to use more than one variable and eventually test the performance (i.e. the trend accuracy)? For that you have to transform the data set into the form shown in the screenshot above, because that preps it for the Sliding Window Validation operator (which is how you backtest your multivariate data series).

     

    Before you can do that, you'll have to create a Label from the data set above. You also have to tell the Windowing operator which column (attribute) the model should be trained to predict. There are two main parameters to use here: the Create Label toggle and the Horizon parameter. Together they tell RapidMiner which attribute to use for the Label column (XOM_CLOSE) and which value you want to forecast; in this case it's the XOM_CLOSE value on Jan 6, 2016 (73.69).

     

    WindowingExample 3.png

    That looks like this:

     

    WindowingExample 4.png
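
    How the label and the horizon relate to the window can also be sketched in plain Python. Again, this is a hedged illustration with made-up numbers and a hypothetical helper `window_with_label`; it assumes the horizon counts time steps from the last value in the window to the value being predicted, as described in the original question.

```python
from typing import List, Tuple


def window_with_label(values: List[float],
                      window_size: int,
                      horizon: int = 1,
                      step_size: int = 1) -> List[Tuple[List[float], float]]:
    """Pair each window with the value `horizon` steps after its last point."""
    examples = []
    # A window is only kept if its label still lies inside the data.
    last_start = len(values) - window_size - horizon
    for start in range(0, last_start + 1, step_size):
        window = values[start:start + window_size]
        label = values[start + window_size - 1 + horizon]
        examples.append((window, label))
    return examples


# Made-up series of 7 values, for illustration only.
values = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

examples = window_with_label(values, window_size=5, horizon=1)
for window, label in examples:
    print(window, "->", label)
```

    With 7 values, a window of 5, and a horizon of 1, only 2 labelled rows come out: the window ending on the last value has nothing after it to use as a label, so it is dropped.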

    The next step would be to feed this data into a Sliding Window Validation operator and nest an algorithm in there to back test your assumptions.

     

    Hope this helps. 

Answers

  • dangdang Member Posts: 11 Contributor II
  • rainaadirainaadi Member Posts: 4 Contributor I

    Yes, that was what I was going off of. The steps in the article are just outlined, not explained.

    For example: "Window size: determines how many "attributes" are created for the cross sectional data. Each row of the original time series within the window width will become a new attribute" - this doesn't really explain why this happens, or what I'm supposed to conclude from the many attributes the Windowing operator generates.

    Same goes for Thomas Ott's youtube videos- these resources are just telling me what to do, rather than explaining why they're doing what they're doing and what that's used for.

    Just hoping for some more clarity on this, since I can't find much online.

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Ok, I should learn to read the first question, not the last one.

     

    Item 1: See my response as well as the link I posted.

    Item 2: The Create Single Attributes parameter controls how you want Studio to represent the data series. Some other operators in the Series extension require the data to be transformed to a "Series" datatype (this is specific to how those operators read in the data). Typically this is not needed, so leave the toggle on.

    Item 3: You should be able to point forecast more than one step ahead, but I've never done that for my specific problems, so I'd suggest you experiment there. Why aren't there label values for Dec 15/Jan 16? Great question. It has to do with how large a window you created in the first pass. This is why you will always need a second Windowing operator (with "Create Label" toggled off) for your testing set. I'll have to follow up on this a bit later this week when I have more time.
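
    To make the missing-label point concrete for the monthly data in the original question (January 2009 to December 2015, 84 rows), here is a rough Python sketch. The helpers `month_range` and `count_windows` are hypothetical, and the counting rule is an assumption about how windowing with and without a label behaves, not something taken from the RapidMiner documentation.

```python
from typing import List


def month_range(start_year: int, end_year: int) -> List[str]:
    """List every month from January of start_year to December of end_year."""
    return [f"{y}-{m:02d}"
            for y in range(start_year, end_year + 1)
            for m in range(1, 13)]


def count_windows(n_rows: int, window_size: int,
                  horizon: int = 0, step_size: int = 1) -> int:
    """Number of complete windows; horizon=0 means no label is created."""
    usable = n_rows - window_size - horizon
    return 0 if usable < 0 else usable // step_size + 1


months = month_range(2009, 2015)   # 84 monthly rows
window_size, horizon = 12, 1

labelled = count_windows(len(months), window_size, horizon)  # training rows
unlabelled = count_windows(len(months), window_size)         # scoring rows

# Last window that still has a label inside the data.
last_start = len(months) - window_size - horizon
last_train_window = months[last_start:last_start + window_size]
last_train_label = months[last_start + window_size - 1 + horizon]

# Last window overall: usable for scoring, but its label lies in the future.
last_score_window = months[-window_size:]

print(labelled, unlabelled)
print(last_train_window[0], last_train_window[-1], "->", last_train_label)
print(last_score_window[0], last_score_window[-1], "-> 2016-01 (not in data)")
```

    Under these assumptions the labelled set stops at the window ending in November 2015 (label: December 2015), while the unlabelled set has one extra row, the window covering all of 2015, which is the one a trained model would score to predict January 2016.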

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    And here is a sample process:

     

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data" compatibility="7.1.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
    <parameter key="target_function" value="sinus classification"/>
    </operator>
    <operator activated="true" class="series:windowing" compatibility="5.3.000" expanded="true" height="82" name="Windowing" width="90" x="179" y="34">
    <parameter key="window_size" value="5"/>
    <parameter key="create_label" value="true"/>
    <parameter key="label_attribute" value="label"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="label"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="series:windowing" compatibility="5.3.000" expanded="true" height="82" name="Windowing (2)" width="90" x="648" y="187">
    <parameter key="window_size" value="5"/>
    <parameter key="label_attribute" value="label"/>
    <parameter key="horizon" value="5"/>
    </operator>
    <operator activated="true" class="series:sliding_window_validation" compatibility="5.3.000" expanded="true" height="124" name="Validation" width="90" x="581" y="34">
    <parameter key="training_window_width" value="10"/>
    <parameter key="test_window_width" value="10"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="7.1.001" expanded="true" height="82" name="k-NN" width="90" x="232" y="34"/>
    <connect from_port="training" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="series:forecasting_performance" compatibility="5.3.000" expanded="true" height="82" name="Performance" width="90" x="313" y="34">
    <parameter key="horizon" value="1"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="849" y="136">
    <list key="application_parameters"/>
    </operator>
    <connect from_op="Generate Data" from_port="output" to_op="Windowing" to_port="example set input"/>
    <connect from_op="Windowing" from_port="example set output" to_op="Validation" to_port="training"/>
    <connect from_op="Windowing" from_port="original" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Windowing (2)" to_port="example set input"/>
    <connect from_op="Windowing (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
  • mallsunita13mallsunita13 Member Posts: 4 Contributor I

    Dear Mr.Thomas Ott,

     

    I am doing time series analysis to predict the size of incoming emails in order to optimize resources.

    I watched your videos and built a model. My dataset consists of only one variable, "Total bytes", recorded over 9 consecutive weeks of the academic year. I divided the dataset into two parts: 8 weeks of data for training and the 9th week for testing.

     

    I am using the SVM and Cross-validation operators. My problems are:

    • In the Windowing operator I want to select the series representation "encode-series-by-attribute" and a window size of "10", but it shows the error message "The parameter window-size specifies a window size, but the value 10 exceed the number of attributes".
    • The SVM shows the error message "Support Vector Machine cannot handle polynomial attributes".

    I am new to RapidMiner Studio. Please help me.

     

    Thank you

     

    Sunita

  • iesnaolaiesnaola Member Posts: 8 Contributor II

    Hi there Sunita,

     

    Regarding the SVM error message: does your dataset have any attribute of type String? Some algorithms only work with numerical attributes and cannot deal with text attributes. I recommend transforming your non-numerical attribute into a numerical one.

     

    If you have a String type attribute, you could use the "Nominal to Numerical" operator.

     

    I hope this helps you

     

    Iker

  • listslists Member Posts: 39 Guru

    Mistakenly doubled.

  • binsetyawanbinsetyawan Member Posts: 46 Guru

    I have a similar problem. For training I use data from 2009-2015, and for testing I use data from 2016 (the data is monthly) to predict 2017. For both training and testing I set the window size to 12, the step size to 1, and the horizon to 1.

     

    But the testing result is only 1 row, when I expected 12 rows (the 12 months of 2017). I know why the result is only 1 row: with a window size of 12 there is only 1 complete window. When I checked "add incomplete windows", 12 rows appear, but I think something is not right......

     

    @Thomas_Ott

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You're going to need the process that Bala D wrote about in his book. Take a look at the last process in this thread. http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Recall-Error/m-p/37302#U37302
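
    For readers who cannot open the linked thread: one generic way to get 12 forecast rows out of a model that only predicts one step ahead is to feed each prediction back into the window and predict again (often called recursive forecasting). The Python sketch below illustrates that general idea only. It is not the process from the linked thread or from the book, and `recursive_forecast` and the placeholder model `naive_drift` are hypothetical names with made-up data.

```python
from typing import Callable, List


def recursive_forecast(history: List[float],
                       window_size: int,
                       steps: int,
                       predict_next: Callable[[List[float]], float]) -> List[float]:
    """Forecast `steps` values by feeding each prediction back into the window."""
    window = list(history[-window_size:])
    forecasts = []
    for _ in range(steps):
        y = predict_next(window)
        forecasts.append(y)
        # Drop the oldest value and append the new prediction.
        window = window[1:] + [y]
    return forecasts


def naive_drift(window: List[float]) -> float:
    """Placeholder model: last value plus the average step inside the window."""
    return window[-1] + (window[-1] - window[0]) / (len(window) - 1)


# 24 made-up monthly values that rise by 1 each month.
history = [float(v) for v in range(100, 124)]

forecasts = recursive_forecast(history, window_size=12, steps=12,
                               predict_next=naive_drift)
print(len(forecasts))
print(forecasts)
```

    In practice `naive_drift` would be replaced by whatever model was trained on the windowed data. Note that errors can accumulate, because later predictions are built on earlier predictions rather than on observed values.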

  • luc_bartkowskiluc_bartkowski Member Posts: 46 Maven

    Hi Thomas,

     

    I still have not reached a satisfactory learning point regarding Time Series Analysis and Prediction. I'm sorry.
    I admire your blogs/videos and answers a lot but I still have questions:

     

    Why do the "Windowing" and the "Sliding Window Validator" operators both offer parameters for Window, Step Size and Horizon?

    How do they influence each other or cooperate/work together?

    Are there any "rules of thumb" to implement the parameter sets of both operators in conjunction with each other?

     

    To explain my question:

    If the Windowing operator parameters are Window=1, Step=1, Horizon=1: What happens if the parameters of the adjacent Sliding Window Validator are configured as Window=5, Step=1, Horizon=1, like in your Time Series videos #9? Examining the output example set from the Windowing operator shows already a Label with training data for the predictions for time n+1 because of Horizon=1. Will the Sliding Window Validator create a new Label for its own Step=1? What do we get then, a label fit for a prediction for n=2, a cumulation of n=1 from Windowing Step=1 plus n=1 from the Sliding Window Validator Step=1? Apparently not but I don't understand it. Suppose both Windowing and Sliding Window Validator operators have Horizon = 5: What do we get then, a Label as input for the training of the algorithm for time n+25 or n+10? Again, I do not understand the end-to-end process.

     

    If the answer is that the Windowing and the Sliding Window Validation operators, including their parameters, work completely independently of each other in the end-to-end process: Why would one train/validate an algorithm such as SVM with 5 times more features/attributes than the initial example set from the Windowing operator? I guess that a Sliding Window Validation parameter Window=5 will result in an SVM hyperplane based on vectors in 5 dimensions. But these vectors point to "nowhere" compared to the vectors from the input example set coming from the Windowing operator with Window=1. Why not also set the Window parameter of the Windowing operator to 5, so that the dimensions of the vectors of the input example set at least match the dimensions of the trained/validated vectors of the SVM hyperplane?

     

    You might understand: I'm feeling overtrained. ;)

    Please elaborate. Thanks.

  • luc_bartkowskiluc_bartkowski Member Posts: 46 Maven
    Dear Thomas,

    Problem solved, I obtained my learning point. By experimenting.
    The Windowing operator parameters determine the learning and prediction.
    Training/validation of the model is separate.

    Thank you again for your work.
  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Oh hey @luc_bartkowski I didn't see your questions. It's best to get my attention by using the '@' symbol and tagging me in the future.

     

    So just to clarify, the Windowing operator is a transformational operator. It just transforms your time series into a multidimensional data set based on the parameters you use. The Sliding Window Validation operator is used for training and testing your model to measure performance. Yes, they have similar parameters for step sizes and widths, but they're applied differently. Good luck!
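
    The difference can be sketched in Python. The Windowing step decides which columns each row has, while the validation step only decides which of those already-windowed rows are used for training and which for testing in each backtest iteration. The helper `sliding_window_splits` is hypothetical, and its handling of the horizon (a horizon of 1 means the test window starts directly after the training window) is an assumption, not confirmed behaviour of the Sliding Window Validation operator.

```python
from typing import List, Tuple


def sliding_window_splits(n_rows: int,
                          training_width: int,
                          test_width: int,
                          step_size: int,
                          horizon: int = 1) -> List[Tuple[range, range]]:
    """Return (train_rows, test_rows) index ranges for each backtest iteration."""
    splits = []
    start = 0
    while True:
        train_end = start + training_width
        test_start = train_end + horizon - 1
        test_end = test_start + test_width
        if test_end > n_rows:
            break  # not enough rows left for a full test window
        splits.append((range(start, train_end), range(test_start, test_end)))
        start += step_size
    return splits


# 40 windowed rows, with the widths used in the sample process (10 and 10).
splits = sliding_window_splits(n_rows=40, training_width=10,
                               test_width=10, step_size=10)
for train_rows, test_rows in splits:
    print(f"train rows {train_rows.start}-{train_rows.stop - 1}, "
          f"test rows {test_rows.start}-{test_rows.stop - 1}")
```

    The widths here count rows of the windowed data set. They do not add any new attributes, which is why a training window width of 5 does not change the dimensionality of what the learner sees.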

  • akaplanakaplan Member Posts: 2 Newbie
    I have the same problem. How can I?