How do I get better predictions?

LeMarc Member Posts: 57 Contributor II
edited April 23 in Help
Hi,

I'm using a regression model to predict sales values (the label attribute) in order to find data points whose sales values could be wrong. A value counts as potentially wrong when the predicted value deviates from the original value.
However, some predictions are quite close to the original values, while for others the error is above 50 %, all within the same (artificial) data set. Using a forecasting model (e.g. ARIMA) does not make sense to me, since I'm not trying to forecast future values for another example set, but rather trying to check whether sales values are right or should be flagged as potentially wrong.

So I was wondering: could the predictions of the sales value be so far off because the data set is not based on real data?
Does anyone have a suggestion for how else to recheck sales values with supervised learning methods?

Thank you!
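The flagging rule described in the question (compare the predicted value with the recorded one and mark large deviations) can be sketched as follows. This is a minimal illustration, not the poster's RapidMiner process: the function name `flag_suspicious`, the sample numbers, and the 0.5 threshold (taken from the "above 50 %" in the post) are all assumptions. The predictions would normally come from a model that did not see the row being checked.

```python
def flag_suspicious(actual, predicted, rel_threshold=0.5):
    """Return the indices whose relative deviation exceeds rel_threshold.

    Relative deviation is |predicted - actual| / |actual|.
    """
    flagged = []
    for i, (a, p) in enumerate(zip(actual, predicted)):
        if a == 0:
            # Relative deviation is undefined for a recorded value of 0;
            # this sketch flags the row whenever the prediction is non-zero.
            if p != 0:
                flagged.append(i)
            continue
        if abs(p - a) / abs(a) > rel_threshold:
            flagged.append(i)
    return flagged


# Hypothetical recorded sales values and model predictions.
actual = [1000.0, 2500.0, 400.0, 1200.0]
predicted = [1050.0, 2400.0, 900.0, 1150.0]

suspicious = flag_suspicious(actual, predicted)
print(suspicious)  # only the third row deviates by more than 50 %
```

A rule like this is only as reliable as the model behind it: if the model's normal error is already around 50 %, most flags will be false alarms.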

Answers

  • LeMarc Member Posts: 57 Contributor II
    I used a regression model with an example set of sales data from the internet. The predictions there are quite close to the actual values, although the time frame covered several years.
    In my case I'm only checking the sales values for a single month.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,462 RM Data Scientist
    Hi @LeMarc ,
    this boils down to a general "how do i get better predictions" question... What model did you use?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • LeMarc Member Posts: 57 Contributor II
    edited April 23
    Thanks @mschmitz! I changed my question according to your suggestion.
    I tried several of the prediction models available in RapidMiner, e.g. DT, RF, GBT, DL etc. I was just experimenting, without optimizing parameters though.
    Edit: Optimizing Parameters & Stacking does not improve performance

    Decision Tree seems to be the best so far. However, since this is a management accounting task, the predicted and actual values should be quite close if there is no mistake in the actual values.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,462 RM Data Scientist
    It is very surprising that a DT is better than a GBT. If a DT works, an RF is usually better.

    That makes me suspicious.
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • varunm1 Moderator, Member Posts: 1,197 Unicorn
    edited April 23
    "Edit: Optimizing Parameters & Stacking does not improve performance"
    Does not or did not? Generally, optimization improves performance unless the operators' default parameters are already the best ones for this data.

    Also, how did you build your models? Did you use any feature selection or generation?

    Did you check correlations between the predictors and outcomes? We can get some idea based on that as well.
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing
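The correlation check suggested above can be sketched outside RapidMiner in a few lines. This is an illustration only: the column names `units_sold` and `weekday` and all numbers are made up, and Pearson correlation only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; report 0.0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predictors and the sales label.
units_sold = [10, 20, 30, 40, 50]
weekday = [3, 1, 4, 1, 5]
sales = [105.0, 198.0, 310.0, 395.0, 502.0]

correlations = {
    "units_sold": pearson(units_sold, sales),
    "weekday": pearson(weekday, sales),
}
for name, value in correlations.items():
    print(f"{name}: {value:+.3f}")
```

If no predictor correlates with the label at all, a regression model has little to learn from, and large prediction errors are to be expected.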

  • LeMarc Member Posts: 57 Contributor II
    edited April 23
    Unfortunately I can't share my data set, but I can show my models.

    (1) Parameters are set to their defaults. Split ratio 0.8/0.2 (100 examples).
    The RMSE values are as follows:
    DT = 18788.462 +/- 0.000
    GBT = 15756.644 +/- 0.000
    RF = 12021.061 +/- 0.000
    @mschmitz you are right. It does make sense that an RF should be better than a simple decision tree, as the data above shows.
    (2) For the last model, the one with the loop parameter, the RMSE values are as follows (settings are the same as in (1)):

    DT  =  7930.069
    GBT = 11496.235
    RF  = 12440.348

    I don't understand why DT has the lowest RMSE now.

    (3) I also tried Auto Model, and the RMSE values look like this:
    GLM = 6994.636 +/- 2003.916 (micro average: 7203.218 +/- 0.000)
    DT = 10789.033 +/- 4282.052 (micro average: 11512.035 +/- 0.000)
    RF = 8101.997 +/- 2561.472 (micro average: 8427.034 +/- 0.000)
    So the result is basically similar to the first one with regard to which model has the lower RMSE.

    (4) The ratio was changed to 0.9/0.1.

    DT = 1550.679 +/- 0.000
    GBT = 9779.131 +/- 0.000
    RF = 6380.126 +/- 0.000
    Now DT has the lowest RMSE. But why?
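One possible explanation for the changing ranking, which nobody in this thread confirms: with 100 examples, a 0.9/0.1 split leaves only 10 test examples, and the "+/- 0.000" suggests that each figure comes from a single split. An RMSE computed on so few rows depends heavily on which rows land in the test set. The sketch below illustrates this with synthetic data and a trivial "predict the training mean" baseline; the data, the baseline, and the function names are assumptions, not the poster's setup.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the label."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def holdout_rmse(labels, test_fraction, rng):
    """RMSE of a 'predict the training mean' baseline on one random split."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    baseline = sum(train) / len(train)
    return rmse(test, [baseline] * len(test))


rng = random.Random(42)
# Synthetic stand-in for 100 sales values.
labels = [rng.gauss(50000, 15000) for _ in range(100)]

# Same data, same "model", 20 different random 0.9/0.1 splits.
scores = [holdout_rmse(labels, 0.1, rng) for _ in range(20)]
print(f"lowest RMSE:  {min(scores):.1f}")
print(f"highest RMSE: {max(scores):.1f}")
```

If the spread between the lowest and highest value is of the same size as the differences between DT, GBT and RF, a single split cannot tell the models apart. Averaging over several splits, as cross-validation does, gives a steadier comparison.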

    @varunm1 & @mschmitz It did not work. I did not use any feature selection. The correlation matrix didn't show any interesting result, since there is no real pattern behind the sales values because the data set is artificial. What do you mean by generation?
    The model with the lowest RMSE should be chosen, right? If the RMSE is e.g. 1550.679, does that mean 15.5 %? I'm a little confused about how to read the numbers.
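On reading the numbers: RMSE is expressed in the unit of the label, so 1550.679 means a typical error of roughly 1,550 sales units (for example, currency units), not 15.5 %. To get a percentage, the RMSE has to be related to the size of the sales values, for instance by dividing by their mean. A minimal sketch with made-up numbers follows; `relative_rmse` is a helper defined here, not a RapidMiner performance criterion.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the label."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def relative_rmse(actual, predicted):
    """RMSE divided by the mean absolute label value (a unit-free ratio)."""
    mean_abs = sum(abs(a) for a in actual) / len(actual)
    return rmse(actual, predicted) / mean_abs


# Hypothetical sales values: every prediction is off by exactly 1000.
actual = [10000.0, 12000.0, 8000.0, 11000.0]
predicted = [11000.0, 11000.0, 9000.0, 10000.0]

error = rmse(actual, predicted)
ratio = relative_rmse(actual, predicted)
print(f"RMSE: {error:.1f} (same unit as sales)")
print(f"RMSE relative to mean sales: {ratio:.1%}")
```

The same RMSE of 1000 would be negligible for sales in the millions and very large for sales around 2000, which is why the absolute number alone says little.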

    Something else I don't understand: when using Deep Learning to predict, the performance changes every time the "start execute" button is pressed, even though nothing else changes.
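A likely reason for the changing Deep Learning results is randomness in training: neural networks usually start from random weights and visit the training rows in random order, so two runs differ unless the random seed is fixed. The toy example below shows the effect with a tiny linear model trained by stochastic gradient descent. It is not RapidMiner's Deep Learning operator, and all names and values are assumptions; whether and how the operator exposes a seed or reproducibility setting should be checked in its parameter panel.

```python
import random


def train_tiny_model(xs, ys, seed=None, epochs=5, lr=0.01):
    """Fit y ~ w * x + b with a few SGD passes from a random start.

    seed=None lets Python pick a different random state on every run.
    """
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initial weights
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # random row order in every pass
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

run_a = train_tiny_model(xs, ys, seed=1992)
run_b = train_tiny_model(xs, ys, seed=1992)
run_c = train_tiny_model(xs, ys, seed=7)

print("same seed, identical result:", run_a == run_b)
print("different seed, identical result:", run_a == run_c)
```

With a fixed seed the result is repeatable. The remaining differences between seeds are a measure of how much the model's performance depends on chance.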

    Thank you for the help!



  • LeMarc Member Posts: 57 Contributor II
    Thank you for the input, the link and your help. It is appreciated.