Time Series using Windowing operator in RapidMiner
I'm trying to use a time series model in RapidMiner to forecast premium paid to an insurance company. Specifically, I have an entry for each month from January 2009 to December 2015, and I want to forecast the data for the next 12 months (January 2016 to December 2016).
I'm having trouble understanding how the Windowing operator works. I have a few questions:
1) What goes into selecting a window size? If I want to forecast Premium over the next 12 months, is my window size 12? And if so, why do I get 12 attributes for each original attribute in my data set (the original Premium amount is one of these 12)? I get that this is supposed to explain the corresponding label value (which is just the next row's original Premium, not sure why this is happening either), but where are these numbers coming from and why does RapidMiner generate these?
2) What does the option "create single attributes" do?
3) The horizon field: If this is the distance between the last window value and the value to predict, does this mean I can't at once predict the next 12 months of data? Even if I enter the horizon as 1 (which I take to mean, give me the prediction for January 2016 since the last data point is for December 2015), then why is there no label value for December 2015 or January 2016 in the output when I run the process?
I'm a beginner, and I would really appreciate any help!
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
I'm the author of those old videos and you're right, I didn't explain why I chose the Windowing parameters as I did.
First off, there's another (older) and more detailed explanation of the Windowing operator in our community: http://community.rapidminer.com/t5/RapidMiner-Studio/Prediction-Forecasting-with-RM/td-p/210. Check that out too.
Great questions. Let me start by prefacing that the Series extension is fantastic for forecasting trend directions, and it's decent at doing point forecasts too, but if a point forecast is what you're after, I'd mash up the R Forecast library in Studio. Pretty easy to do.
Note, a lot of the parameters I chose will typically be a first starting point. I will make a "best guess" and then from there use a Parameter Optimization to vary parameters such as Window Size, Training/Testing Window Width, Step Size, etc.
I think Simafore's blog said it best, using the Windowing operator is like taking a "cross section of data" in time. You can have multiple attributes (columns) that have different data points to help describe your label (target variable). For example, let's take this simple stock close dataset. It has XOM, FB, and MSFT Closing values. We're interested in forecasting the trend of XOM_CLOSE using it as the Label (target variable) and FB and MSFT closing prices as part of the input. You want to create a multivariate data set to describe the XOM.
So how do you use FB_CLOSE and MSFT_CLOSE in your forecast? That's where the Windowing operator comes in. I want to take that data and make a "window" of FB/MSFT data points that describe some XOM data point in time. The question is, what size window to use? That's where a bit of domain knowledge comes in, and you'll have to make your first "best guess," remembering that you can change the Window Size when you use Parameter Optimization later.
For this argument, let's take a 5 day Window (the trading week is typically 5 days). That is the Window Size. The Step Size is how far you want to advance the Window. Setting the Step Size also requires a bit of domain knowledge because you could be forecasting for Weekly, Quarterly, or Monthly types of data. For our example, we advanced it by 1 (the next day).
You should see something like this:
The image above is what you should see. I put red boxes on it to illustrate the next point. The red boxes highlight an important concept. In example row 1, the Date-4 column corresponds to the closing prices of XOM and MSFT (FB was cut off in the screenshot) in XOM_CLOSE-4 and MSFT_CLOSE-4. Likewise in example row 3, Date-3 corresponds to the closing prices of XOM and MSFT in XOM_CLOSE-3 and MSFT_CLOSE-3. Now you have a 5 day Window of data on an example (row) by example (row) basis. This is good but we're not complete yet.
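To make the column-to-row rotation concrete, here is a minimal Python sketch of the idea. It is not RapidMiner's implementation: the function name `window_rows` and the prices are made up for illustration, and the `NAME-4 ... NAME-0` naming only mimics the attribute names described above.

```python
def window_rows(series, window_size, step_size=1):
    """Turn time-ordered columns into one row (example) per window.

    series: dict mapping attribute name -> list of values, oldest first.
    Returns a list of dicts with keys such as "XOM_CLOSE-4" ... "XOM_CLOSE-0",
    where "-0" is the newest value inside the window.
    """
    n = len(next(iter(series.values())))
    rows = []
    for start in range(0, n - window_size + 1, step_size):
        row = {}
        for name, values in series.items():
            for offset in range(window_size):
                lag = window_size - 1 - offset
                row[f"{name}-{lag}"] = values[start + offset]
        rows.append(row)
    return rows


# Made-up closing prices (hypothetical, not real market data).
prices = {
    "XOM_CLOSE": [10, 11, 12, 13, 14, 15, 16],
    "MSFT_CLOSE": [50, 51, 52, 53, 54, 55, 56],
}
rows = window_rows(prices, window_size=5, step_size=1)
print(len(rows), len(rows[0]))
```

With a Window Size of 5, every original attribute contributes 5 columns to each row, which is why the number of attributes multiplies after windowing.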
Why is it important to rotate your data series from columns to rows? You could easily just use a simple univariate column and do a Linear Regression on it, which is just fine, but what if you want to use more than one variable and eventually test the performance (i.e. the trend accuracy)? For that you have to transform the data set into the form shown in the screenshot above, because that preps it for the Sliding Window Validation operator (the Sliding Window Validation operator is how you backtest your multivariate data series).
Before you can do that, you'll have to create a label from the data set above. You have to tell the Windowing operator which column (attribute) should be used as the target when training a model. There are two main parameters you should use here: the Create a Label toggle and the Horizon parameter. Those parameters tell RapidMiner which attribute to use for the Label column (XOM_CLOSE) and which value you want to forecast. In this case it's the value on Jan 6, 2016 for XOM_CLOSE (73.69).
That looks like this:
The next step would be to feed this data into a Sliding Window Validation operator and nest an algorithm in there to back test your assumptions.
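As a rough picture of what backtesting with a sliding window means, here is a small Python sketch that only generates the train/test index ranges. It is an assumption-laden illustration, not the Sliding Window Validation operator itself: the function and parameter names are hypothetical, and it ignores the operator's own horizon setting and other options.

```python
def sliding_window_splits(n_examples, training_width, test_width, step_size=1):
    """Yield (train_indices, test_indices) pairs that move forward in time.

    Each test block starts right after its training block, so the model is
    always evaluated on examples that are newer than the ones it was trained on.
    """
    start = 0
    while start + training_width + test_width <= n_examples:
        train = list(range(start, start + training_width))
        test_start = start + training_width
        test = list(range(test_start, test_start + test_width))
        yield train, test
        start += step_size


splits = list(sliding_window_splits(n_examples=10, training_width=5,
                                    test_width=2, step_size=1))
for train, test in splits:
    print("train", train, "test", test)
```

In each split you would fit the nested algorithm on the training rows and score it on the test rows, then average the scores over all splits.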
Hope this helps.
this may be helpful for you to understand series operators
Yes, that was what I was going off of. The steps in the article are just outlined, not explained.
For example: "Window size: determines how many "attributes" are created for the cross sectional data. Each row of the original time series within the window width will become a new attribute" - this doesn't really explain why this happens, or what I'm supposed to conclude from the many attributes the Windowing operator generates.
Same goes for Thomas Ott's YouTube videos: these resources are just telling me what to do, rather than explaining why they're doing what they're doing and what that's used for.
Just hoping for some more clarity on this, since I can't find much online.
Ok, I should learn to read the first question, not the last one.
Item 1: See my response as well as the link I posted.
Item 2: The Create Single Attributes parameter has to do with how you want Studio to recognize the data series. There are additional operators in the Series extension that require the data to be transformed to a "Series" datatype (this is specific to how that particular operator has to read in the data). Typically this is not needed, so leave the toggle on.
Item 3: You should be able to point forecast your values beyond one, but I've never done that for my specific problems, so I'd suggest you experiment there. Why aren't there label values for Dec 15/Jan 16? Great question, and that has to do with how large a window you created in the first pass. This is why you will always need to use a second Windowing operator (without "Create Label" toggled on) for your testing set. I'll have to follow up on this a bit later this week when I have more time.
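The missing label can be illustrated with a short Python sketch. This is my own simplified model of windowing with a label, not RapidMiner's code: `window_with_label` is a hypothetical helper, and the 84 integers simply stand in for the months Jan 2009 to Dec 2015.

```python
def window_with_label(values, window_size, horizon, step_size=1):
    """Return (window, label) pairs.

    The label is the value `horizon` steps after the last value in the window,
    so a window is only kept if that later value actually exists.
    """
    examples = []
    last_start = len(values) - window_size - horizon
    for start in range(0, last_start + 1, step_size):
        window = values[start:start + window_size]
        label = values[start + window_size - 1 + horizon]
        examples.append((window, label))
    return examples


months = list(range(1, 85))  # placeholders: 1 = Jan 2009, ..., 84 = Dec 2015
train = window_with_label(months, window_size=12, horizon=1)

# The newest 12 months still form a window, but no value follows it,
# so it cannot get a label. This is the row you would score for Jan 2016.
forecast_window = months[-12:]
print(len(train), train[-1][1], forecast_window[0], forecast_window[-1])
```

In this sketch the last labeled example uses Dec 2015 as its label, while the window ending in Dec 2015 has nothing after it. That unlabeled window is what a second windowing pass without a label would produce for the model to predict on.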
And here is a sample process:
Dear Mr.Thomas Ott,
I am doing time series analysis for predicting the size of coming emails for optimizing resources.
I watched your videos and built a model. I have a dataset which consists of only one variable, "Total bytes", and this information is based on 9 consecutive weeks of the academic year. I divided my dataset into two parts: 8 weeks of data for training and the 9th week of data for testing.
So I am using SVM and Cross-validation operators. My problem is
I am new to RapidMiner Studio. Please help me.
Hi there Sunita,
Regarding the SVM error message... Does your dataset have any attribute of type String? Some algorithms can only work with numerical attributes, so they cannot deal with text attributes. I recommend transforming your non-numerical attribute into a numerical one.
If you have a String type attribute, you could use the "Nominal to Numerical" operator.
I hope this helps you
I have a similar problem. For training I use data from 2009-2015, and for testing I use data from 2016 (the data is monthly) to predict 2017. For both training and testing I set the window size to 12, the step size to 1, and the horizon to 1.
But the testing result is only 1 row, when I expected 12 rows (the 12 months of 2017). I know why the result is only 1 row: with a window size of 12, there is only 1 complete window. When I checked "add incomplete windows", 12 rows appear, but I think something is not right......
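The row count described here follows from simple arithmetic, sketched below in Python. The helper `count_complete_windows` is hypothetical and only counts windows that fit entirely inside the data; it says nothing about how "add incomplete windows" fills the extra rows.

```python
def count_complete_windows(n_rows, window_size, step_size=1):
    """Number of windows of `window_size` rows that fit fully inside n_rows."""
    if n_rows < window_size:
        return 0
    return (n_rows - window_size) // step_size + 1


one_year = count_complete_windows(12, window_size=12)      # 2016 only
seven_years = count_complete_windows(84, window_size=12)   # 2009-2015
almost_two = count_complete_windows(23, window_size=12)
print(one_year, seven_years, almost_two)
```

Under this counting, 12 monthly rows with a window size of 12 give exactly one complete window, and it takes 23 rows to get 12 complete windows.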
You're going to need the process that Bala D wrote about in his book. Take a look at the last process in this thread. http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Recall-Error/m-p/37302#U37302
I still have not reached a satisfactory learning point regarding Time Series Analysis and Prediction. I'm sorry.
I admire your blogs/videos and answers a lot but I still have questions:
Why do the "Windowing" and the "Sliding Window Validator" operators both offer parameters for Window, Step Size and Horizon?
How do they influence each other or cooperate/work together?
Are there any "rules of thumb" to implement the parameter sets of both operators in conjunction with each other?
To explain my question:
If the Windowing operator parameters are Window=1, Step=1, Horizon=1: What happens if the parameters of the adjacent Sliding Window Validator are configured as Window=5, Step=1, Horizon=1, like in your Time Series videos #9? Examining the output example set from the Windowing operator shows already a Label with training data for the predictions for time n+1 because of Horizon=1. Will the Sliding Window Validator create a new Label for its own Step=1? What do we get then, a label fit for a prediction for n=2, a cumulation of n=1 from Windowing Step=1 plus n=1 from the Sliding Window Validator Step=1? Apparently not but I don't understand it. Suppose both Windowing and Sliding Window Validator operators have Horizon = 5: What do we get then, a Label as input for the training of the algorithm for time n+25 or n+10? Again, I do not understand the end-to-end process.
If the answer is that the Windowing and the Sliding Window Validation operators, including their parameters, work completely independently of each other in the end-to-end process: Why would one train/validate an algorithm such as SVM with 5 times more features/attributes than the initial example set from the Windowing parameter? I guess that a Sliding Window Validation parameter Window=5 will result in an SVM hyperplane based on vectors in 5 dimensions. But these vectors point to "nowhere" compared to the vectors from the input example set coming from the Windowing operator with Window=1. Why not set the Window parameter of the Windowing operator then also to 5, so the dimensions of the vectors of the input example set match at least the dimensions of the trained/validated vectors of the SVM hyperplane?
You might understand: I'm feeling overtrained.
Please elaborate. Thanks.
Problem solved, I obtained my learning point. By experimenting.
The Windowing operator parameters determine the learning and prediction.
Training/validation of the model is separate.
Thank you again for your work.
Oh hey @luc_bartkowski I didn't see your questions. It's best to get my attention by using the '@' symbol and tagging me in the future.
So just to clarify, the Windowing operator is a transformational operator. It just transforms your time series into a multidimensional data set based on the parameters you use. The Sliding Window Validation operator is used for training and testing your model for performance. Yes, they have similar parameters called Steps and Widths, it's just that they're applied differently. Good luck!