
using the ARIMA model

Kipir Member Posts: 3 Contributor I
edited November 26 in Product Ideas
Hi,
I'm trying to use an ARIMA model to predict some values.

I started with the example "Forecast Validation of ARIMA Model for Lake Huron": I replaced the dataset with mine and modified the parameters of the model to obtain an acceptable RMSE value, but I stopped there.

I tried to read up on the subject, but I still have many doubts:

1. Does the ARIMA model have a training phase in which it learns from the data provided, or does the data only serve to verify how good the parameters set in the model are? Or something else?

2. Once the model has been trained, how can I connect it to another data source and verify that it also works on that test data without the model being modified (e.g. restarting/resuming learning)?

Many thanks,
Kipir

Sent to Engineering · Last Updated

TSE-128

Comments

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 911   Unicorn
    Hi Kipir,

    To answer your questions:

    1. An ARIMA(p, d, q) model is defined by:

    y'(t) = c + Alpha(1)*y'(t-1) + ... + Alpha(p)*y'(t-p) + e(t) + Theta(1)*e(t-1) + ... + Theta(q)*e(t-q)

    where y' is the series after differencing d times and e(t) are the error terms.

    The algorithm inside ARIMA finds the coefficients Alpha(i) and Theta(i) that best fit the training data. To do that, it searches for the coefficients that maximize the log-likelihood of the model describing the time series, and thus (by default) minimizes the AIC (Akaike's Information Criterion). So the training phase is used to calculate how good the coefficients / the parameter set of the model are.
    Personally I don't like the expression "learns from the data" to describe the training phase... but that is a question of point of view.

    2. From my point of view, a model is relevant for one specific time series, so the two phases (training and forecasting) are inseparable.
    I don't see the interest in training the ARIMA model on a specific training set and then forecasting with it on another time series as a "test set".

    If you are interested in checking the "quality" of a model's forecast, you can apply the third step of the "Box-Jenkins" method (model diagnostics). To my knowledge it is not implemented in RapidMiner, but it is available in Python and can be a relevant complement to RapidMiner's results.
    I will write how to do that in Python in a next post.

    Hope this helps,

    Regards,

    Lionel 
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 911   Unicorn

    Firstly, the RM staff can see this post as a "feature request"...

    Kipir, once you have created your ARIMA model, you can check its relevance by applying the third step of the "Box-Jenkins" method (the general method to build and validate an ARIMA model).
    During this third step (the model diagnostics), you create 4 plots. (Here, the diagnostics of an ARIMA(p,d,q) = (1,0,1) model used to fit the "Lake Huron" time series dataset.)

    [Diagnostic plots: standardized residuals over time, histogram of the residuals with density estimate, normal Q-Q plot, and correlogram of the residuals]
    You have to check the "statistics" of the residuals (the differences between the observed values and the predicted values).
    Generally, when exploring residual errors we are looking for patterns or structure; a sign of a pattern suggests that the errors are not random.
    We expect the residual errors to be random, because it means the model has captured all of the structure, and the only error left is the random fluctuation in the time series that cannot be modeled.
    A sign of a pattern or structure suggests that there is more information a model could capture and use to make better predictions.
    To check this, you can look at:
    • The bottom right plot (correlogram), which shows the autocorrelation of the residuals. Check that all lags (except lag 0, which is always 1) are not significant. ==> If they are, your residuals may be correlated.
    • The top right plot: the two curves (the density estimate of the residuals and the reference normal density) should almost coincide. ==> If not, your residuals are not random.
    • The bottom left plot (Q-Q plot): the points should lie roughly on the diagonal line. ==> If not, your residuals are not random.
    If your residuals are correlated and/or not random, you can:
    • Choose a better parameter set (another (p,d,q) set), retrain the ARIMA model and perform the diagnostics again.                             AND/OR
    • Inspect your time series: there may be outlier(s) in your training set. You have to determine if these outliers are valid extreme values (in this case, do nothing) or invalid/aberrant values (in this case, remove these value(s) from your training set).
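    In Python, statsmodels draws these four plots for a fitted model via its `plot_diagnostics()` method; the correlogram check can also be done numerically. Under the null hypothesis of white-noise residuals, the sample autocorrelations should fall inside the approximate 95% band of ±1.96/sqrt(n). A minimal numpy sketch (the residual series here are simulated as an illustration):

```python
import numpy as np

def residual_autocorr(resid, max_lag=10):
    """Sample autocorrelation of the residuals for lags 1..max_lag."""
    resid = np.asarray(resid, dtype=float)
    resid = resid - resid.mean()
    denom = np.dot(resid, resid)
    return np.array([np.dot(resid[:-k], resid[k:]) / denom
                     for k in range(1, max_lag + 1)])

def looks_like_white_noise(resid, max_lag=10):
    """True if every autocorrelation stays inside the ~95% band +/-1.96/sqrt(n)."""
    band = 1.96 / np.sqrt(len(resid))
    return bool(np.all(np.abs(residual_autocorr(resid, max_lag)) < band))

rng = np.random.default_rng(0)
noise = rng.normal(size=500)        # genuine white noise
resid_bad = np.cumsum(noise)        # strongly autocorrelated "residuals"

print(looks_like_white_noise(resid_bad))  # random-walk residuals: False
# For genuine white noise most lags stay inside the band, though roughly
# 5% of lags will exceed it by chance, so judge the correlogram as a whole.
```

    This is only the correlogram part of the diagnostics; the normality checks (histogram and Q-Q plot) still have to be looked at visually.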

    Hope this helps,

    Regards,


    Lionel

    PS: Unfortunately, I cannot share the Python notebook (.ipynb file) on this forum. If you need it, send me a PM.








  • Kipir Member Posts: 3 Contributor I
    Hi Lionel,
    thanks for the valuable explanations, everything is a little clearer to me now.
    If I understand correctly then, once the training is completed (on a single dataset) and I get a low error, I can apply the Box-Jenkins method, and if it gives a positive result (based on the suggestions you indicated) there is not much else to do: I can think about deploying the model and trying it directly in the production environment (I'm simplifying). Correct?
    I'll take advantage of your patience and ask one last question. Imagine having a temperature sensor installed in a special laboratory, where the temperature trend depends on a series of devices installed on site. I have to predict the temperature one hour ahead, so I train my ARIMA model on a single laboratory's dataset and get some good results. However, since I have several identical laboratories scattered around the country, I would like to see if the model I just trained also works well for the others. So I would like to use a sensor dataset from another laboratory and validate the results (obviously without modifying the model). Is there a way to do this with RM?

    thank you very much,
    Kipir


    P.S.
    Thanks, but unfortunately I don't use Python...


  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 911   Unicorn
    Hi Kipir,

    Thanks for your message.

     there is not much else to do and I can think of deploying the model and trying it directly in the production environment (I'm simplifying). Correct?
    Yes, correct ! 

     I would like to see if the model I just trained also works well for the others, so I would like to use a sensor dataset from another laboratory and validate the results (obviously without modifying the model). Is there a way to do this with RM?

    OK, I understand what you are trying to do.
    To be honest, I don't see how to do that with an ARIMA model inside RapidMiner. @tftemme, any insights? Thank you...
    To do that, I think you have to use "classic machine learning" after "windowing" your training set (the temperatures of laboratory A) with the Windowing operator.
    As a result you get a trained machine learning model (for example a Gradient Boosted Trees model). In this case, the model "predicts"
    the value of your temperature as value(t + a) = f(value(t), value(t-1), value(t-2), ..., value(t-w+1)), where:
    • a = step size (horizon)
    • w = window size
    Then you have to apply the same windowing parameters to your test set time series (the temperatures of laboratory B) and then apply the trained model to this preprocessed test set.
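    The windowing transform described above can be sketched in plain Python as well: each training row holds the last w values and its label is the value a steps ahead, and the identical transform is applied to laboratory B's series before calling the already-trained model. The window/horizon values below are illustrative:

```python
import numpy as np

def window_series(series, w, a):
    """Turn a 1-D series into (X, y) pairs:
    X row = value(t-w+1), ..., value(t)   (the w most recent values)
    y     = value(t+a)                    (the target, a steps ahead)
    """
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(w - 1, len(series) - a):
        X.append(series[t - w + 1 : t + 1])
        y.append(series[t + a])
    return np.array(X), np.array(y)

temps = np.arange(10.0)              # stand-in for a temperature series
X, y = window_series(temps, w=3, a=2)

print(X[0], y[0])   # [0. 1. 2.] 4.0 -> predict the value 2 steps after the window
```

    Any regression learner (GBT included) can then be trained on (X, y); applying the model to another laboratory only requires running `window_series` with the same w and a on the new series.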

    You can look at the "Create Model for Gas Prices" process in the templates folder of the "Time series" samples folder to understand.

    Hope this helps,

    Regards,

    Lionel


  • Kipir Member Posts: 3 Contributor I
    Lionel,
    thank you very much, you are helping me a lot! What you suggested (Windowing + GBT, the Gas Prices example) is exactly what I had done so far before trying ARIMA and noticing that with that model the RMSE is much lower, which is why I wanted to investigate further in this direction :)

    Thanks again,
    Kipir

  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 130  RM Research
    Hi @Kipir , @lionelderkrikor.

    Thanks for the great illustration of the third step of the Box-Jenkins method. I have moved this thread into the Product Feedback section and created a ticket for it. I cannot promise any timeline, but it is a good addition to the time series processing in RapidMiner, thanks for the input.

    Concerning applying the ARIMA model on different (but similar) time series: currently this is not possible with RapidMiner. The difficulty is that the ARIMA model prediction depends not only on the coefficients, but also on past values and past residuals (the differences between the forecasted and real values of the time series). Thus applying the ARIMA model always means that the whole time series has to be forecasted (due to the recursive structure of ARIMA) to calculate all residuals and then forecast the future values. Technically this could be implemented, but honestly I am not sure it would still be correct from a statistical perspective. As @lionelderkrikor correctly stated, the ARIMA model is specific to one time series, so I am not sure it would still be a proper application for another data set.
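    This recursive structure can be made concrete: even with fixed coefficients, a one-step ARMA forecast needs the residual history, which has to be rebuilt by walking through the whole new series. A minimal ARMA(1,1) sketch with hypothetical coefficients (no intercept, for simplicity):

```python
def arma11_one_step(series, alpha, theta):
    """Walk an ARMA(1,1) model y_hat(t) = alpha*y(t-1) + theta*e(t-1)
    through the series, rebuilding the residuals e(t) = y(t) - y_hat(t)
    as it goes; returns the forecast for the step after the series ends."""
    e_prev = 0.0
    for t in range(1, len(series)):
        y_hat = alpha * series[t - 1] + theta * e_prev
        e_prev = series[t] - y_hat       # this residual feeds the next step
    # one-step-ahead forecast past the end of the series
    return alpha * series[-1] + theta * e_prev

print(arma11_one_step([1.0, 1.0], alpha=0.5, theta=0.5))  # 0.75
```

    Applying the model to laboratory B's series would mean re-running this whole walk on that series first, which is exactly the step that is hard to justify statistically when the coefficients were fitted to laboratory A.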

    But using windowing with GBT (or another machine learning regression method) is a very good alternative. Honestly, my gut feeling is that I would expect better performance from this approach in many cases.
    If you want to forecast more than 1 value with this approach, have a look at the Multi-Horizon Forecast operator, which provides the possibility to build a collection of ML models to forecast more than one value.
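    The multi-horizon idea can be sketched as one regression model per horizon step, all trained on the same windowed data; plain least squares stands in here for the GBT, and the window/horizon settings are illustrative:

```python
import numpy as np

def fit_multi_horizon(series, w, horizons):
    """Fit one linear model per horizon h: value(t+h) ~ last w values.
    np.linalg.lstsq stands in for any regressor (e.g. a GBT)."""
    series = np.asarray(series, dtype=float)
    models = {}
    for h in horizons:
        X, y = [], []
        for t in range(w - 1, len(series) - h):
            X.append(series[t - w + 1 : t + 1])
            y.append(series[t + h])
        coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
        models[h] = coef
    return models

def forecast(models, last_window):
    """Forecast every horizon from the same last window of observations."""
    return {h: float(np.dot(coef, last_window)) for h, coef in models.items()}

series = np.arange(20.0)                    # stand-in series (a linear ramp)
models = fit_multi_horizon(series, w=3, horizons=[1, 2, 3])
print(forecast(models, series[-3:]))        # forecasts for t+1, t+2, t+3
```

    Each horizon gets its own model, so no forecast is fed back recursively, which is the design difference from the ARIMA approach above.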

    Best regards,
    Fabian

