Why has ARIMA Started Skipping Time Series Prediction Dates?

SkyTrader Member Posts: 88 Contributor II
edited August 2020 in Help
Hi there,

I wondered if anyone could help me figure out this issue:

My Dow Jones data and indicators cover the years 2000 to July 29th 2020. I had a Forecast Validation set up with a Window Size = 250, Step = 20 days and Horizon = 5.

I've noticed an odd thing with the daily predictions in "Forecast Validation". They are consecutive through most of the data, but near the bottom this is no longer the case and there are gaps between the days.

Please see the following images.


Scrolling down a little further to the very bottom, the predictions now appear only every 3 or 4 days. This happens whether the Step is 1, 5 or 20.


And again with a different Step = 1:



I'm not sure why it's skipping forecast days. I don't recall changing anything, and it used to work fine and give consecutive prediction days (this last image is from last week):



Any help would be gratefully appreciated, as I can't figure out why it's missing days now!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    cc: @tftemme
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @SkyTrader

    This is because your initial data is not equidistant. You have daily data, but it also contains gaps (probably from the weekends). ARIMA (like forecast models in general) uses the difference between the last two index values to create the index values for the predicted values (in general, ARIMA needs equidistant data). When the gap happens to be the last entry in the training window (as it is, for example, for 2020-07-20), you get 3 days in between the forecasted values.
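
    The effect described above can be reproduced with a small Python sketch. This is only an illustration of the "repeat the last gap" rule, not RapidMiner's actual implementation; `forecast_dates` is a hypothetical helper, and the two short training windows are made up around the 2020-07-20 example.

```python
from datetime import date, timedelta


def forecast_dates(train_dates, horizon):
    """Extend the index by repeating the gap between the last two training dates."""
    step = train_dates[-1] - train_dates[-2]
    return [train_dates[-1] + step * (i + 1) for i in range(horizon)]


# Training window ending Thursday -> Friday: the last gap is 1 day.
friday_end = [date(2020, 7, 16), date(2020, 7, 17)]
# Training window ending Friday -> Monday: the last gap spans the weekend (3 days).
monday_end = [date(2020, 7, 17), date(2020, 7, 20)]

daily = forecast_dates(friday_end, 3)
gapped = forecast_dates(monday_end, 3)
print(daily)   # consecutive calendar days
print(gapped)  # every third day
```

    With the window ending on the Monday, every forecast timestamp is 3 days after the previous one, which matches the gaps in the screenshots.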

    What you can try is using Equalize Time Stamps to make all your input data equidistant. Note that this basically interpolates the data for the weekend days, and the trained ARIMA is then also fitted to this interpolated data.
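
    To make the interpolation idea concrete, here is a rough Python sketch of filling weekend gaps linearly. It is an analogy under assumptions, not the Equalize Time Stamps operator itself (whose interpolation options may differ); `equalize_daily` is a hypothetical helper and the closing values are made up.

```python
from datetime import date, timedelta


def equalize_daily(rows):
    """Fill calendar gaps between consecutive (date, value) rows by linear interpolation."""
    filled = [rows[0]]
    for (d0, v0), (d1, v1) in zip(rows, rows[1:]):
        gap = (d1 - d0).days
        for k in range(1, gap + 1):
            # k == gap reproduces the original next row; smaller k are interpolated days.
            filled.append((d0 + timedelta(days=k), v0 + (v1 - v0) * k / gap))
    return filled


# Made-up closing values for a Friday and the following Monday, for illustration only.
rows = [(date(2020, 7, 17), 100.0), (date(2020, 7, 20), 106.0)]
daily = equalize_daily(rows)
for d, v in daily:
    print(d, v)
```

    The Saturday and Sunday rows in the output are synthetic values, so a model trained on the result is partly fitted to data that was never observed.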

    Best regards,
    Fabian
  • SkyTrader Member Posts: 88 Contributor II
    Thanks for getting back to me Fabian @tftemme, that makes sense.

    Consecutive Days:

    What I can't figure out is which window size and step size settings I used (the horizon was 20 days) that gave me those consecutive-day August predictions above (and I have tried about a hundred different combinations!), because it's the same data and I haven't updated it.

    "Future" Predictions:

    I was using 250/5/20 (window/step/horizon), which gave me results up to 28th August (albeit with 3-day gaps!). But when I tried a step size of 6, Apply Forecast only predicted up to July 14th and not into the future month of August. Why is that happening with just a one-day increment in step size?

    I even tried a step size of 1 day (250/1/5), and it still only forecast up to July 27th.

    I don't understand why, if the last data point in the Excel file is the 29th July, it won't forecast from the 29th July forward. The same goes for 250/1/20, which only predicts to July 20th. Why not from the 29th July into August? (I've made sure both Forecast Validation and Apply Forecast have the same horizons.)

    Remove Weekends:

    Re: Equalize Time Stamps, it's an idea. I'm just trying to figure out how I will actually know what dates the predictions apply to if the weekends are removed.

    Thanks again for your help,
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Sorry, I cannot figure this out from this far away. It depends on your data and the settings. But it is basically counting, because the settings for window size, step size, horizon size, ... are always numbers of examples. So you can just count the number of examples in your input data. That's also true for the forecasted values in the Apply Forecast operator. If you have set the length to 20, it will predict 20 examples. And to calculate the timestamps for these 20 examples, it will use the distance between the last two data points in the input data of the forecast model.
    You can insert breakpoints at all the different steps in the process to investigate what the specific operators do.
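
    The counting can be sketched in a few lines of Python. This assumes that windows start at the first example, move forward by the step size, and are only used when a full training window plus horizon still fits; the exact layout inside Forecast Validation may differ, and `validation_windows` is a hypothetical helper run on a made-up series of 30 examples.

```python
def validation_windows(n_examples, window, step, horizon):
    """Yield (train_start, train_end, test_end) as example positions, end-exclusive."""
    start = 0
    while start + window + horizon <= n_examples:
        yield start, start + window, start + window + horizon
        start += step


# Hypothetical series of 30 examples; window 10, horizon 5, two different step sizes.
for step in (1, 6):
    windows = list(validation_windows(30, 10, step, 5))
    last = windows[-1]
    print("step", step, "windows", len(windows), "last", last,
          "unused examples at the end:", 30 - last[2])
```

    Under these assumptions a step of 1 reaches the very last example, while a step of 6 leaves the final 3 examples uncovered, so a small change in step size can move the last predicted date noticeably.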

    Equalize Time Stamps does not remove the weekend. It will add interpolated values for the weekends, so that you truly will have daily data which you can use.
  • SkyTrader Member Posts: 88 Contributor II
    edited August 2020
    Cheers Fabian, @tftemme,
    May I post the XML and the dataset? I'd really appreciate getting to the bottom of this prediction issue that's developed!

    Thanks for the Equalize Time Stamps tip. I've got the same issue with the "has indices" setting and the lack of labels again. I tried deleting and re-adding my Read Excel operator, and there's nothing in the drop-down list again, even though I used the import wizard to point it to the file on my Mac HD. This time that didn't fix the lack of attributes in the drop-down:



    I tried Retrieve (pointing to a process that has retrieved and stored the data) and directed it to that local Dow Excel repository, but that doesn't work either.

    It's been a really long week since thinking I had ARIMA sorted out and then having all these issues! I can only conclude that either it's user error (although I feel like I'm doing everything in the same methodical manner) or RM is not stable?
  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @SkyTrader

    You can post your XML and the data here (make sure that you are allowed to share your data in public). Unfortunately I probably will not find the time to go deep into your data, and as I said, this is counting, so you should be able to check this on your own (insert breakpoints if you are unsure).

    For the indices attribute: make sure that you have selected the indices attribute in all operators in the process. From your screenshot it looks like the attribute is really not there in this case, so it is removed from your data at some point. Check by using breakpoints.

    Unfortunately, time series handling is more complex than it seems at first glance. Figuring out exactly how windowing is applied can be confusing. Also, we are all normally familiar with the concepts of time and dates, but the details are far more complex than people expect (just consider that monthly data is non-equidistant, which is difficult to handle). This (unexpected) complexity of time series problems often leads to a higher number of user errors.
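
    As a quick check of the remark about monthly data, this small Python snippet (independent of RapidMiner) measures the number of days between consecutive month starts in the first half of 2020:

```python
from datetime import date

# First day of each month from January to June 2020.
month_starts = [date(2020, m, 1) for m in range(1, 7)]
# Days between consecutive month starts.
gaps = [(b - a).days for a, b in zip(month_starts, month_starts[1:])]
print(gaps)
```

    The gaps are not all equal, so a rule that repeats the last index difference would drift away from true month boundaries.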

    Best regards,
    Fabian 
  • SkyTrader Member Posts: 88 Contributor II
    edited August 2020
    Thanks Fabian, @tftemme

    Attributes missing from drop down:

    I used the breakpoints to help with the attributes issue, and it appears that the problem starts at the Forecast Validation operator! Cheers for the tip.

    What happened next makes zero sense to me:

    I decided to delete all the ARIMA process operators and rebuild the process, so I put new operators in below the original ones (so as to be able to copy the wiring correctly).

    On this new second row (please see image), I decided to use the Read Excel operator and changed the Date in the import wizard to polynominal, but it still doesn't show Date in the drop-down (although at least I can now see Close in the "time series attribute" drop-down list).



    I swapped Read Excel for Retrieve (which points to a Retrieve/Store process in the repository), and now I can see Date in "indices attribute" in the top, original process row. I'm not sure why adding a second row would "bring the top row to life" and make the attributes appear in the drop-down list?

    I'm wondering why this data issue happens whether I access the Excel data from my local hard drive (Read Excel) or from the RM repository (Retrieve/Store). I understand the metadata is probably more stable and RM gives a faster read if the data is in the local repository.

    "Future" Predictions:

    I did try inserting breakpoints but could not infer anything useful. I will look again.

    The Horizon I had was definitely 20 (from my August daily predictions screenshot) when I got those consecutive daily predictions for Aug 1st, 2nd, 3rd etc. last week, so there are only the Window and Step sizes to play with (I've tried, but can't get any future predictions now). Please see image:



    All other settings were left the same...

    All I can get now are flat (green line) past predictions in July 2020 (the data set ends 29th July), not predictions for August as I was originally getting. Please see image:

     
    From earlier trials of window sizes, with predictions going into August (my data ends 29th July), using Window 5000/Step 10/Horizon 20:



    Yes, I can see that data collation and index type are a big part of ML and of the prediction error. My errors are coming out around 1%, which, as a stop loss on an index like the Dow at 28,000, I can live with.

    Equalize Time Stamps:

    Do you have any tips on the Equalize Time Stamps settings for handling the missing weekend days?

    Are these good parameters to use to overcome the missing dates? Please see image:



    I'm not sure why typing the word "Date" into "indices attribute" didn't work this time.

    What is the significance of the date format always having the months in capital letters?

    Here is the XML and a link to the Dropbox Excel file:

    https://www.dropbox.com/s/2l7397km4bqkcv6/DowJones2000to2020.xlsx 

    Thanks very much again for your time.