Why Doesn't ARIMA Predict Future Time Series Closing Prices?

SkyTrader Member Posts: 88 Contributor I

I’m really hoping someone can explain what’s what when using ARIMA for Time Series Predictions!?

I’m using ARIMA with this setup (please see images).

I used a huge window size so I could see what was happening chart wise over the last few months. There’s no zoom in on the charts?

At first I thought I was training on the window size of data and then testing on unseen data (I have 20 years of Dow Jones Open/High/Low/Close plus technical indicators from 2000 to 2020). The reason is that when I put in a very high window size like 4500 days (approx. 18 years of data) I would only see about 2 years of charting results from 2018 to present (which I assumed was the test data), whereas if I had a window size of only 60 days I would see a whole chart of 2000 to 2020.

But all the relative error figures were very small, around 1 or 2%, which is far too good to be true, right? I assume that is because I am training on one subset of data and testing on the same data (as it rolls along using my window size and step value settings)?

The questions I have are:

1) How do I make ARIMA test on two different data sets? One seen and one unseen and untrained? With the Cross Validation operator? And if that is the operator I need, how do I ensure ARIMA trains on specific date ranges so I can make it include calm, low volatility periods and also highly volatile periods, like during Covid19?

2) How do I make ARIMA test on untrained data?

3) How do I make the training period unanchored and cover the first 75% of the dataset (2000 to 2015)?

and lastly,

4) How do I get ARIMA to predict the next 2 (or 5, or 10) days ahead of the last date or row of data I have in Excel, which will be 3rd August when I update my Excel with Yahoo finance data tomorrow night? I.e., so that ARIMA predicts the closing prices for the 4th and 5th Aug and beyond?

I’ve tried many window size combinations but the low relative errors must be due to the point I raised about not being tested on unseen data.

Even using ARIMA with a small window size of 10 days, it doesn’t make any predictions into the future.

I’m hoping those that are interested in Financial Time Series forecasting will understand these issues!

Thanks very much in advance, 

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,636  RM Data Scientist
    Hi,
    I think the operator you are missing is the Apply Forecast operator. It takes an ARIMA model and forecasts n points ahead.

    Also keep in mind that the whole validation is used to determine the performance of the model. You do this to get the correct settings for ARIMA.
    Unlike other ML algorithms, you need to "retrain" ARIMA on a new data set if you want to forecast.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • SkyTrader Member Posts: 88 Contributor I
    Hi @mschmitz, right thanks for the feedback and answering question 4). I've used the Apply Forecast Operator and connected it but it thinks it's not connected? Pls see images:







    I'm still unsure about testing, training and validation (the latter of which I didn't know I was even doing) in relation to ARIMA and I still don't understand what the answers are to Qu's 1, 2 and 3?  

    "the whole validation" -- what part of my process was "validating?" I thought I was just training and testing on the whole data set and is this why I got amazing low relative error statistics?

    "you need to "retrain" ARIMA on a new data set if you want to forecast." -- how do I do that please?

    Thanks very much for any advice, as I thought this would be a lot simpler, but tbh I find the Help in RM (on the right) very hard to understand for beginners, despite having 4 years of algorithmic trading experience! It's like the Help is talking to people who already understand everything.

    Cheers,
    Best,
    Sky Trader
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 160  RM Research
    Hi @SkyTrader

    Training a model on a training set and testing it on an unseen test set is called validation in general. This is true for "normal" machine learning applications as well as for time series problems. There are some differences; an important one is how the training and test sets are set up. In a "normal" machine learning use case, you use Cross Validation most of the time; for a time series problem you want to use Sliding Window Validation. The latter is achieved by the Forecast Validation operator.

    So your initial setup was already correct for the validation (testing the performance of your model on unseen data). Concerning your questions:

    1) The Forecast Validation operator directly trains a Forecast Model (in your case ARIMA) in the 'Training' subprocess (the left part of the inner process). This Forecast Model is then used to predict the test window in the 'Testing' subprocess and a performance measure is calculated (in your case by the Performance operator). The important output of the Forecast Validation operator is the evaluated performance of your model and the final model, which is trained on the whole input data.

    2) see 1), the Forecast Validation operator does this automatically

    3) Set the window size to 75 % of your input data. For now, window sizes can only be configured in number of examples.

    4) Use the final model of the Forecast Validation operator (the top output port) and connect it to an Apply Forecast operator outside of the Forecast Validation operator. Be aware that you clearly have not connected the Apply Forecast operator in the screenshots of your second post. The operator is only placed on top of the line; the line itself is not connected.

    Hope this helps,
    Best regards,
    Fabian

    PS.: I would recommend going through the in-product tutorials (click on the 'Learn' tab in the Welcome panel) to get familiar with the concepts of RapidMiner and data science. Though they are not exactly directed at time series, they help with getting familiar with the product.
    PPS.: When you try to figure out how operators work, you can also check out the tutorial processes at the end of the help of the operators
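
The sliding window scheme described in the answer above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sliding_windows` and `naive_forecast` are hypothetical helper names, the prices are simulated rather than Dow Jones data, and the "model" simply repeats the last training value instead of fitting ARIMA. It does not reproduce the internals of the Forecast Validation operator; it only shows that each test window lies strictly after its training window and that everything is counted in examples (rows), not dates.

```python
import random

def sliding_windows(n_examples, window_size, step_size, horizon):
    """Yield (train, test) index ranges, counted in examples (rows)."""
    start = 0
    while start + window_size + horizon <= n_examples:
        train = range(start, start + window_size)
        test = range(start + window_size, start + window_size + horizon)
        yield train, test
        start += step_size

def naive_forecast(train_values, horizon):
    """Stand-in model (NOT ARIMA): repeat the last training value."""
    return [train_values[-1]] * horizon

# Simulated closing prices with roughly 1% daily moves (made-up data).
random.seed(0)
prices = [10000.0]
for _ in range(999):
    prices.append(prices[-1] * (1 + random.gauss(0, 0.01)))

errors = []
n_windows = 0
for train, test in sliding_windows(len(prices), window_size=250,
                                   step_size=20, horizon=5):
    n_windows += 1
    train_values = [prices[i] for i in train]
    predicted = naive_forecast(train_values, len(test))
    # Compare each forecast with the real value it was never trained on.
    for i, p in zip(test, predicted):
        errors.append(abs(p - prices[i]) / prices[i])

mean_relative_error = sum(errors) / len(errors)
print(n_windows, round(mean_relative_error, 4))
```

On a series like this, even the naive stand-in tends to produce a relative error of only a few percent, because closing prices rarely move far over a few days. A small relative error on its own is therefore not evidence that the model was tested on its own training data.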
  • SkyTrader Member Posts: 88 Contributor I
    edited August 12
    Hi @tftemme,
    Thanks so much again for the help using ARIMA. I have a few questions about the dates and results. (It looks like a lot, but I've just added many images to help.)

    My Dow Jones data and indicators cover the years 2000 to July 29th 2020. I had a Forecast Validation set up with a Window Size = 250, Step = 20 days and Horizon = 5.

    I was wondering why the Forecast validation results always include, in the very top rows, the test dates of a 5 day horizon in 2020 and from there then start giving results for 2000? Pls see image(s).


    My data starts in Jan 2000, but as an experiment (because Step size is better with monthly values), when I set up a Window Size = 5000 (I have 5177 rows/dates), Step = 1 days and Horizon = 2, the first date in the Example set “Apply Forecast” is 11th Sept. 2000. Why does it start in Sept 2000 and not one day after the first date (Jan 3rd 2000), 4th Jan 2000 (Step size of 1)?


    My "Forecast Validation" results with Window 250, Step =1 and Horizon = 20 shows test results up to 28th August 2020:



    But my "Apply Forecast" chart only shows data up to the 10th July 2020:



    In my "Apply Forecast" charts I was expecting to see a green future prediction plot line til the 28th August 2020?

    So, lastly, how would I get it to plot those August 2020 future values (you can just see a little green prediction at the very end of this chart to July 9th / 10th), and why is it just a flat green line that doesn't show individual dates? I was expecting a zig zaggy plot like my blue close line.


    I actually decided to check why the above is happening and ran the Window 250 and Step =1 but now with a large 50 day Horizon instead of a 20 day -- and now I get the (correct) reverse thing happening because I now have a correct "Apply Forecast" data table and charts (albeit a smooth flat green line) that shows future values til 15th Oct 2020 and my "Forecast Validation" ends July 4th 2020.

    So I'm not sure why my "Apply Forecast" in the former case with Horizon = 20, doesn't show anything after 20th July? I also have a very thick "Apply Forecast" chart?

    Playing around with different ARIMA window sizes etc, I also notice that my predictions are now very similar? 



    Thanks very much, any help gratefully appreciated!

    Cheers,


  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 160  RM Research
    Hi, 

    I'll try to answer the questions briefly:

    - Forecast Validation is normally executed in parallel, so the order of the test results depends on which windows are executed first (which is internal processing). You can either disable the parallel execution or add a Sort operator after it
    - The resulting ExampleSet of the Apply Forecast operator starts either at the beginning of the input time series data (if you have enabled the corresponding parameter) which is used to train the Forecast Model, or it only holds the forecasted values. So in your case the model which is provided to the Apply Forecast operator is trained on the data starting in September. You can insert breakpoints before and after operators (right click on the operators) to go into details where it might behave differently than you expect
    - I have added an answer to the other post about the gaps
    - When you use Apply Forecast it obviously does not have predictions for the training data (missing values) and it does not have real values for the predicted values (in the future). When you use Forecast Validation, the test window contains the predicted values and the real values of the test window, but this is a different situation than using Apply Forecast to predict unknown values in the future
    - Most of the screenshots you have show the results of the Apply Forecast operator. The number of forecasted values by this operator is just defined by the corresponding parameter of the Apply Forecast operator
    - Forecast Models try to predict the future based on past values. It can happen that the best forecast is just a flat line, because there is no proper pattern in the data, and the "zig-zag" in your input data is just noise which cannot be predicted.
    You can try to do an Optimization to find the best parameter setting for the ARIMA model to have the "best" prediction (in terms of the performance measure you use in the Validation)

    Best regards,
    Fabian
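
The point about flat forecasts can be illustrated with a small Python sketch. It is an assumption-laden stand-in, not RapidMiner's ARIMA: `fit_ar1` and `forecast_levels` are hypothetical helpers that fit a single autoregressive coefficient to the day-to-day changes by plain least squares (roughly an ARIMA(1,1,0) without a constant), and the prices are a simulated random walk. When the changes are mostly noise, the fitted coefficient is close to zero, each forecasted change shrinks towards zero, and the forecasted level flattens out almost immediately.

```python
import random

def fit_ar1(values):
    """Least-squares AR(1) coefficient without intercept: x[t] ~ phi * x[t-1]."""
    num = sum(a * b for a, b in zip(values[1:], values[:-1]))
    den = sum(b * b for b in values[:-1])
    return num / den

def forecast_levels(prices, horizon):
    """Forecast price levels from an AR(1) model fitted on daily changes."""
    diffs = [b - a for a, b in zip(prices[:-1], prices[1:])]
    phi = fit_ar1(diffs)
    level, change = prices[-1], diffs[-1]
    levels = []
    for _ in range(horizon):
        change = phi * change   # expected next change shrinks by factor phi
        level = level + change  # accumulate the change into the level
        levels.append(level)
    return phi, levels

# Simulated random walk: every daily change is pure noise (made-up data).
random.seed(1)
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] + random.gauss(0, 1))

phi, forecast = forecast_levels(prices, horizon=10)
# Size of the move between consecutive forecasted values.
steps = [abs(b - a) for a, b in zip(forecast[:-1], forecast[1:])]
print(round(phi, 3), [round(x, 2) for x in forecast])
```

The forecast stays close to the last observed price and the moves between forecasted points shrink at every step, which is what a smooth, nearly flat forecast line looks like on a chart.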

  • SkyTrader Member Posts: 88 Contributor I
    Thanks Fabian, @tftemme

    My Horizon has been fixed at 20 (to try and replicate the first ARIMA Apply Forecast consecutive daily results in August). Step Size is 1.

    Turning off parallel execution still brought up the 2020 data first then 2005 data going forward in Forecast Validation. I'm using 75% of my 5200 rows of data (2000 to 2020), which is 3900 rows, as the Window Size, with a Step Size of 1. Why would Forecast Validation produce results from 2005 onwards (after it has still reproduced the top rows of 2020 data first)? Surely I'm training on 2000 to 2015 (75% of data) and Forecast Validation should start from 2015?

    I am baffled why using many of my standard combinations of Window and Step Sizes (but always with Horizon at 20) I cannot get it to reproduce Apply Forecast results consecutively like I did when I first started using ARIMA last week, and which produced results that didn't skip every 3 days even though I would have tested it and ran the ARIMA model on the same data set (last date 29th July 2020)?

    Changing Step from 1 to 100, I then tried using a Sort operator and that didn't fix the issues of seeing 2020 data first at the top of Forecast Validation results so I deleted Sort went back to Step Size 1 and ran it again and now it is showing Forecast Validation starting from 2015 (as it should be, albeit with 2020 results still at the top) and not 2005, so... I am wondering why changing the Step from 1 to 100 and back to 1, does it now produce the correct results from 2015 onwards?

    " The resulting ExampleSet of the Apply Forecast operators starts either at the beginning of the input time series data (if you have enabled the corresponding parameter) which is used to train the Forecast Model,"

    Which parameter is that please?

    "
    When you use Forecast Validation, the test window contains the predicted values and the real values of the test window, but this is a different situation than using Apply Forecast to predict unknown values in the future"

    I'm interested in getting those future predictions using Apply Forecast, why with Window at 3900, Step at 1 and Horizon at 20 (to try and replicate the first ARIMA Apply Forecast consecutive daily results in August) can I never get Apply Forecast to go beyond 20th July? I want results for a 20 day horizon from the 29th July 2020 onwards. A Step of 1 should accommodate that, no? What am I missing here?

    "
    Most of the screenshots you have show the results of the Apply Forecast operator. The number of forecasted values by this operator is just defined by the corresponding parameter of the Apply Forecast operator"

    So in summary, I've set Apply Forecast to be for a horizon of 20, I am still unclear why it is not giving those values going forward into August 2020 from the end of my data (29th July 2020)? I wish I'd written down the Window and Step Size when I got it to give a perfect daily forecast on consecutive days starting in August last week...

    I looked at auto arima in the Samples/Time Series/templates/Automized Arima on US-Consumption data.

    I added an Optimise Grid Operator but can't understand why the Operator parameter field is unresponsive or how to use the wizard? Pls see image:




    Thanks once again for your input.

    Best,
    Sky Trader.
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 160  RM Research
    Hi @SkyTrader

    When you configure the Sort operator correctly (sorting by the corresponding Date attribute), the 2020 data cannot end up on top, so something must be misconfigured on your side.

    As I already said, windowing is always based on the number of examples. You can count them yourself. Maybe draw a small example of your own to better understand how windowing works. Once you understand windowing better, you will also figure out how it works on your data and which combinations of window size and step size have what effect.
    (Breakpoints help to understand specific steps, because you can directly see the data before and after the step)

    Honestly there are 3 parameters in Apply Forecast. One controls the forecast length, the other two are called "add original time series" and "add combined time series". They are even described in the help text.

    As I said, the configuration of the Forecast Validation has no influence on the predicted values for the future. It is just used to evaluate the performance of the Forecast Model.
    The Forecast Model which is used by the Apply Forecast operator uses the whole input data as the training data (as it is described in the help text). So the time difference between the predicted values is just based on the time difference of the last two values in your input time series. The number of predicted examples is based on the forecast length parameter.

    Have you placed anything inside the Optimize Parameter operator? Please have a look into the help text of the operator and the tutorial processes. 

    In general I would recommend to go through the in-product tutorials ("Learn" tab in the welcome dialog at the start of RapidMiner)
    Also please study the help text and tutorial processes of operators in more detail.

    Best regards,
    Fabian
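
The statement that the spacing of the predicted values follows the time difference between the last two input values can be sketched as follows. This is an illustrative reading of the answer, not the operator's actual code; `forecast_dates` is a hypothetical helper and the input dates are chosen only to show the effect.

```python
from datetime import date

def forecast_dates(input_dates, forecast_length):
    """Continue a date index using the gap between the last two input dates."""
    delta = input_dates[-1] - input_dates[-2]
    return [input_dates[-1] + delta * (k + 1) for k in range(forecast_length)]

# Last two rows one day apart: forecasts land on consecutive days
# (30 July, 31 July, 1 August 2020).
daily = forecast_dates([date(2020, 7, 28), date(2020, 7, 29)], 3)

# Last two rows three days apart (a Friday followed by a Monday):
# forecasts skip days (30 July, 2 August, 5 August 2020).
gapped = forecast_dates([date(2020, 7, 24), date(2020, 7, 27)], 3)

print(daily)
print(gapped)
```

Under this reading, whether the forecast dates come out consecutive or skip days depends only on the last two rows of the input data, not on the Window Size or Step Size used in Forecast Validation.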
  • SkyTrader Member Posts: 88 Contributor I
    edited August 13
    Hi Fabian @tftemme

    Cheers, yes I'm aware of the parameters in the Apply Forecast and what they do. I prefer to just tick the first box (add orig time series) and leave the second unticked for a cleaner results table.

    I'm familiar with windowing in that I've typically used weekly, monthly or quarterly Step Sizes based on how trading companies like hedge funds and ETFs will change their portfolio of stocks, e.g. after quarterly performance is measured.
    Forecast Horizon seems self explanatory. I've used Walk Forward optimisations a lot in my algo trading which uses a similar concept and allows for anchored (best) or unanchored optimisation of data. 

    "The Forecast Model, which is used by the Apply Forecast operator, uses the whole input data as the training data (as it is described in the help text)."

    "uses the whole input data":
    But isn't Window size what splits the dataset into Training and Test sizes?
    (It seems sensible to have at least half the data and certainly best to be able to cover different market regimes (volatile, non volatile, trending, non trending periods). How accurate those quarterly time delineations/Step Sizes will be with "missing" dates for weekends is still something I'm figuring out.)

    "So the time difference between the predicted values is just based on the time difference of your last two value in the input time series."

    Right, this was the issue with there being 3 days between the two inputs and therefore ending up with forecasts that skip dates, e.g. 1st, 4th, 7th August 2020 etc.

    Maybe I am missing the point here, again, but (and dependent upon the size of the Step, so it's better to have a value like 1 or 5), if the data ends 29th July and the Apply Forecast Horizon is for 20 days and the Step is small then future predictions in August can be made.

    That's why I thought I had everything sorted with the ARIMA model last week until I went back to it and couldn't replicate those future consecutive August predictions using a myriad of Window and Step size values (pls see my other post).

    Right, I'll take another look at break points but I still feel like something is not working right (my set up) and that is what is confusing me so much because as you said, ARIMA doesn't have a lot of parameters to alter.

    "Have you placed anything inside the Optimize Parameter operator?"

    I haven't got that far because as mentioned above I don't know how to make the wizard work.

    Best regards,
    Sky Trader.
