Not normally distributed data

jeroenheijlen Member Posts: 4 Learner I
Hi,
I'm trying to find a model to predict the execution time of a process step. I have data from over 200 different recurring process steps from the past 2 years (160,000 rows in an Excel sheet). When I plot the execution-time data per event, the data is not normally distributed but looks more like a Poisson distribution. Just loading the data into RapidMiner Studio and applying the models does not return a good fit. What can I do? (For data pre-processing in Python or R I would need a step-by-step guide because I'm pretty new to all of this.)
Some help would really be appreciated!
Best regards
Jeroen  
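
For readers following along in Python: a strongly right-skewed target such as an execution time is often easier to model after a log transform. The snippet below is only a minimal sketch under stated assumptions. The data is synthetic (drawn from a log-normal distribution, not the poster's data), `skewness` is a helper defined here for illustration, and whether a log transform actually helps on the real process data is untested.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Synthetic stand-in for execution times in hours (an assumption, not real data).
rng = random.Random(42)
durations = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# log1p(x) = log(1 + x) is defined at x = 0, so zero-length runs are safe.
log_durations = [math.log1p(d) for d in durations]

skew_raw = skewness(durations)
skew_log = skewness(log_durations)
print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# A model trained on log1p(duration) predicts on the log scale;
# math.expm1 maps such a prediction back to hours.
predicted_log = statistics.fmean(log_durations)
predicted_hours = math.expm1(predicted_log)
print(f"back-transformed mean prediction: {predicted_hours:.2f} hours")
```

In RapidMiner itself the same idea would mean generating a log-transformed copy of the label before training and inverting the transform on the predictions.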

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @jeroenheijlen,

    Have you tried submitting your data to Auto Model (RapidMiner's AutoML tool)?

    Regards,

    Lionel
  • jeroenheijlen Member Posts: 4 Learner I
    edited May 2020
    Hi @lionelderkrikor, thanks for your reply.
    Yes, I tried Auto Model, but even after I had seriously reduced the variation in the input data, no model does a good job on my data.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @jeroenheijlen,

    Maybe there are no relationships between your independent features and your label (your target).
    In that case, it is impossible to find a good model and machine learning is of no use...
    In the meantime, you can try to:
     - enable feature selection / feature generation in the Auto Model options
     - tune the hyper-parameters of your best models to try to increase the accuracy / decrease the error rate.

    Regards,

    Lionel
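
One quick way to check whether a feature carries any signal at all, before spending time on model tuning, is to measure how much of the label's variance is explained by simply predicting the mean of each group. The sketch below does this for a categorical feature. Everything in it is an assumption made for illustration: the step IDs `A`, `B`, `C`, the synthetic durations, and the helper `variance_explained_by_group` are not from the thread.

```python
import random
import statistics
from collections import defaultdict


def variance_explained_by_group(groups, values):
    """Share of total variance explained by predicting each group's mean."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    group_mean = {g: statistics.fmean(vs) for g, vs in by_group.items()}
    overall = statistics.fmean(values)
    ss_total = sum((v - overall) ** 2 for v in values)
    ss_resid = sum((v - group_mean[g]) ** 2 for g, v in zip(groups, values))
    return 1.0 - ss_resid / ss_total


# Synthetic data: three hypothetical process steps with different typical durations.
rng = random.Random(0)
steps, hours = [], []
for step_id, typical in {"A": 1.0, "B": 3.0, "C": 8.0}.items():
    for _ in range(500):
        steps.append(step_id)
        hours.append(typical * rng.lognormvariate(0.0, 0.5))

share = variance_explained_by_group(steps, hours)

# Shuffling the step IDs destroys any relationship and gives a "no signal" baseline.
shuffled = steps[:]
rng.shuffle(shuffled)
share_shuffled = variance_explained_by_group(shuffled, hours)

print(f"explained by step ID: {share:.3f}, after shuffling: {share_shuffled:.3f}")
```

If the share for the real feature is barely above the shuffled baseline, that feature alone is unlikely to support a useful model, which is consistent with the "no relationship" scenario described above.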
  • jeroenheijlen Member Posts: 4 Learner I
    Hi @lionelderkrikor
    I'm indeed afraid that the variation within each of the process steps is too large and that therefore no model can find a correlation or a good predictive fit.
    Thanks for your advice.
    I will try a few more things (auto feature selection fails), such as starting with a smaller dataset (data from only a few of the process steps, with more of the outliers removed, although the data will still never be normally distributed) and also recoding the label as a binary outcome (more than 2 hours vs. less than 2 hours, or so).

    If I ever succeed, I will post the outcome ;-).
    Best regards
    Jeroen 
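
The two ideas in this post, trimming outliers and recoding the duration as a two-class label, can be sketched in a few lines of Python. This is an illustration only: the durations are synthetic, the 2-hour cut-off is the one mentioned above, and the 99th-percentile trim is an assumed choice, not something the poster specified.

```python
import random
import statistics

THRESHOLD_HOURS = 2.0  # cut-off mentioned in the post; adjust as needed

# Synthetic stand-in for execution times in hours (an assumption, not real data).
rng = random.Random(1)
hours = [rng.lognormvariate(0.5, 0.8) for _ in range(2000)]

# Binary label: True when the step ran longer than the threshold.
is_long = [h > THRESHOLD_HOURS for h in hours]
share_long = sum(is_long) / len(is_long)
print(f"share of runs longer than {THRESHOLD_HOURS} h: {share_long:.2%}")

# Rule-based outlier trim: drop values above the 99th percentile
# rather than removing points by eye.
cut_points = statistics.quantiles(hours, n=100)  # 99 percentile cut points
p99 = cut_points[-1]
trimmed = [h for h in hours if h <= p99]
print(f"kept {len(trimmed)} of {len(hours)} rows after trimming")
```

Checking `share_long` first is worthwhile: if one class is very rare, plain accuracy becomes a misleading measure of how well a classifier does.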
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    You're welcome, @jeroenheijlen.

    Good luck!

    regards,

    Lionel