Not normally distributed data

jeroenheijlen Member Posts: 4 Learner I
Hi,
I'm trying to find a model that predicts the execution time of a process step. I have data from over 200 different recurring process steps from the past 2 years (160,000 rows in an Excel sheet). When I plot the execution-time data per event, the data is not normally distributed but looks more like a Poisson distribution. Simply loading the data into RapidMiner Studio and applying the models does not return a good fit. What can I do? (For data pre-processing in Python or R I would need a step-by-step guide, because I'm pretty new to all of this.)
Some help would really be appreciated!
Best regards
Jeroen  
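
One pre-processing step that is often tried for right-skewed durations is a log transform of the target before modelling. The sketch below uses only the Python standard library; the `durations` values are made-up minutes, not data from this thread, and `skewness` is a helper written for this illustration. Whether the transform actually improves the fit on the real data would have to be tested.

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical execution times in minutes (illustrative only).
durations = [5, 7, 8, 9, 10, 12, 15, 20, 45, 180]

# log1p(x) = log(1 + x) is safe for zero durations as well.
log_durations = [math.log1p(d) for d in durations]

raw_skew = skewness(durations)
log_skew = skewness(log_durations)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# A model trained on the log scale predicts log1p(minutes);
# expm1 maps such predictions back to minutes.
restored = [math.expm1(x) for x in log_durations]
```

In RapidMiner the same idea could be applied by generating a log-transformed copy of the label and training on that, then inverting the transform on the predictions.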

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,068   Unicorn
    Hi @jeroenheijlen,

    Have you tried submitting your data to Auto Model (the AutoML tool of RapidMiner)?

    Regards,

    Lionel
  • jeroenheijlen Member Posts: 4 Learner I
    edited May 31
    Hi @lionelderkrikor , thanks for your reply.
    Yes, I tried Auto Model, but even after I had seriously reduced the variation in the input data, no model does a good job on my data.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,068   Unicorn
    Hi @jeroenheijlen,

    Maybe there is no relationship between your independent features and your label (your target).
    In that case, it is impossible to find a good model and machine learning is of no use...
    In the meantime, you can try to:
     - enable feature selection / feature generation in the options of Auto Model
     - tune the hyper-parameters of your best models to try to increase the accuracy / decrease the error rate.

    Regards,

    Lionel
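
A quick way to check whether a categorical feature such as the process step carries any signal about the execution time is the correlation ratio (eta squared): the share of the target's variance that is explained by the per-group means. The sketch below is a minimal standard-library version; the step names and times are invented for illustration, and `correlation_ratio` is a helper defined here, not a RapidMiner or library function.

```python
from collections import defaultdict
import statistics

def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares divided by the total."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    grand_mean = statistics.fmean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2
        for g in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0

# Hypothetical data: process step per row and execution time in minutes.
steps = ["A", "A", "A", "B", "B", "B"]
informative_times = [10, 12, 11, 50, 55, 52]    # step separates the times
uninformative_times = [10, 50, 30, 12, 48, 31]  # step tells us almost nothing

eta_strong = correlation_ratio(steps, informative_times)
eta_weak = correlation_ratio(steps, uninformative_times)
print(f"informative: {eta_strong:.3f}, uninformative: {eta_weak:.3f}")
```

A value near 0 for every available feature would be consistent with the suspicion above that the features say little about the label; a value near 1 means the group means alone already explain most of the variation.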
  • jeroenheijlen Member Posts: 4 Learner I
    Hi @lionelderkrikor
    I'm indeed afraid that the variation within each of the process steps is too large and that therefore no model can find a correlation or a predictive fit.
    Thanks for your advice.
    I will try a few more things (automatic feature selection fails), such as starting with a smaller dataset (data from only a few of the process steps, with more of the outliers removed, although the data will still never be normally distributed) and also recasting the target as a binomial (two-class) outcome (more than 2 hours vs. less than 2 hours, or so).

    If I ever succeed, I will post the outcome ;-).
    Best regards
    Jeroen 
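
The two ideas mentioned above (removing outliers and turning the target into a two-class label) could be sketched in Python as follows. This is only an illustration with the standard library: the durations are invented, the 1.5 x IQR fence is a common convention rather than anything prescribed in this thread, and `iqr_filter` and `to_label` are hypothetical helper names. On the real data the filter would presumably be applied per process step.

```python
import statistics

TWO_HOURS_MIN = 120  # threshold from the post above, expressed in minutes

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def to_label(minutes, threshold=TWO_HOURS_MIN):
    """Binary target: 'long' if above the threshold, else 'short'."""
    return "long" if minutes > threshold else "short"

# Hypothetical execution times in minutes for one process step.
durations = [30, 45, 60, 90, 100, 130, 150, 200, 1500]

kept = iqr_filter(durations)
labels = [to_label(m) for m in kept]

print("kept:", kept)
print("short:", labels.count("short"), "long:", labels.count("long"))
```

Checking the class counts before training is worthwhile, since a threshold that puts almost all rows in one class makes the classification problem trivial or badly imbalanced.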
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,068   Unicorn
    You're welcome, @jeroenheijlen.

    Good luck ! 

    regards,

    Lionel