Estimate values excel

andre5007 Member Posts: 14 Contributor I
Hi guys, I ran into a problem on a project and I don't know where to start. I'm really new to RapidMiner and would appreciate any help.
I have to estimate Feature 8, which is the number of maintenance interventions the device has had. What can I do?
Thanks
André

Answers

  • kayman Member Posts: 606   Unicorn
    Could you share a bit more information, like a current input format example and the expected/wanted outcome? 
  • andre5007 Member Posts: 14 Contributor I
    edited April 7
    Hey Kayman, I hope everything is fine with you.
    I have two CSV files, and both have several features.
    Feat1 is the model, Feat2 is a power measure, Feat3 is something the object either has (1) or does not have (0), Feat4 is a feature whose meaning I don't know, Feat5 is the device installation date, Feat6/7 are the latitude and longitude, and Feat8 is the number of maintenance interventions.
    In the Training CSV I have values for Feat8; in the Test CSV I do not.
    My goal is to estimate Feat8 for the Test set.
    How can I do this?
    If you need more information, please tell me.
    Thanks
    
    
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,619   Unicorn
    I believe this is a duplicate thread. I have responded in the other question. You should probably resolve one; duplicate threads are not generally helpful for the community.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 305  RM Data Scientist
    Hi @andre5007,

    I worked on your training data a bit to build regression trees based on clean features. The predictive model performs pretty well with 10-fold cross validation. The RMSE is as follows:

    My process is attached for your reference.
    Cheers,
    YY
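For readers who want to see what a 10-fold cross-validated RMSE measures, here is a minimal sketch in plain Python. It is not the attached RapidMiner process: a small k-nearest-neighbours regressor stands in for the regression trees, and the function names and toy data are made up for illustration.

```python
import math
import random


def knn_predict(train_rows, train_targets, query, k=3):
    """Average the targets of the k training rows closest to `query`."""
    nearest = sorted(
        range(len(train_rows)),
        key=lambda i: math.dist(train_rows[i], query),
    )[:k]
    return sum(train_targets[i] for i in nearest) / len(nearest)


def cross_validated_rmse(rows, targets, n_folds=10, k=3, seed=0):
    """Return the RMSE over all held-out predictions from n_folds-fold CV."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[f::n_folds] for f in range(n_folds)]

    squared_errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        train_rows = [rows[i] for i in train_idx]
        train_targets = [targets[i] for i in train_idx]
        # Each row is predicted only by a model that never saw it.
        for i in held_out:
            prediction = knn_predict(train_rows, train_targets, rows[i], k)
            squared_errors.append((prediction - targets[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up toy data: one numeric feature, target exactly 2 * feature.
toy_rows = [(float(x),) for x in range(30)]
toy_targets = [2.0 * x for x in range(30)]
rmse = cross_validated_rmse(toy_rows, toy_targets, n_folds=10, k=3)
```

The RMSE is in the same units as the target, so for Feat8 it would read as "typically off by about this many interventions".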
  • andre5007 Member Posts: 14 Contributor I
    I was testing the process you built and it started giving the following error.
    
    When importing the CSV I set all fields to nominal, and that got past the problem. Was that the right solution?
    
    Then it gave this error.
    
    I couldn't solve this one, though. Can you tell me what to do?
  • andre5007 Member Posts: 14 Contributor I
    How do you import the CSV?
    When you import the CSV, do you make any changes?
    Thanks 
    André

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 305  RM Data Scientist
    Hi @andre5007, you have to run the sub-process for data loading first. To run it, right-click it and enable it, and make sure this data loading step is executed before the modeling.




    I used the CSV files you posted in another thread. They are attached here as well.

    Cheers,
    YY
  • andre5007 Member Posts: 14 Contributor I
    Hi @yyhuang
    Now I get it, yes.
    Sorry for so many questions; I'm really quite new to this area.
    Could you explain, for example, why the IDs show up as '?'?

    What can I understand from this?


  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 305  RM Data Scientist
    Oh, good catch. Thanks @andre5007. First, the id column name gets garbled in the data loading step, which usually happens in Read CSV. You can fix the column names with the "Rename" operator. Later, the id values get lost after applying the grouped model, due to a potential bug (grouped model with target encoding). To work around that, you can keep the id as a regular attribute before scoring and set it to the id special role after scoring, as shown in the attached process.
    PS: Feat1 could potentially cause some data leakage if we apply target encoding to a categorical attribute with so many values. I don't have the context here, but you can try dropping it by configuring "Target Encoding".

    PPS: you can round up the predictions after scoring if you prefer integers.
    HTH!
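As a rough illustration of the rounding step outside RapidMiner, the plain-Python sketch below turns raw regression outputs into whole counts. The function name and sample predictions are hypothetical. Clamping negatives to zero is an added assumption (a count of interventions cannot be negative), and both "nearest" and "up" modes are shown because "round up" can be read either way.

```python
import math


def to_intervention_counts(predictions, mode="nearest"):
    """Convert raw regression outputs to non-negative integer counts."""
    counts = []
    for value in predictions:
        if mode == "up":
            rounded = math.ceil(value)
        else:
            # Round half up; Python's built-in round() rounds halves to even.
            rounded = math.floor(value + 0.5)
        # Assumption: a maintenance count can never be below zero.
        counts.append(max(0, rounded))
    return counts


# Made-up predictions for illustration only.
raw_predictions = [2.3, 0.5, -0.2, 4.8]
nearest_counts = to_intervention_counts(raw_predictions)
ceiling_counts = to_intervention_counts(raw_predictions, mode="up")
```

Evaluate RMSE on the raw predictions as well as the rounded ones, since rounding changes the error.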
  • andre5007 Member Posts: 14 Contributor I
    I've been studying a bit and looking at which models I could use, and I realized that at first I was using a classification model.
    From what I understand, you used a regression model.
    But after studying, a doubt arose: wouldn't the associations & correlations model be the best fit for my problem? That model answers questions like "What happens together? What changes together?", while regression answers questions like "How much or how many? How many will happen?".
    I say this because, from what I understand, each feature has an influence on the result. For example, Feat3 is something the object has or not (1 has, 0 has not), and it could influence the result of Feat8. From what I understand (which is very little), this was not taken into account in the process you built, right?
    Sorry for the question, but I'm really just trying to understand as best I can.
    Thanks
    André
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 305  RM Data Scientist
    What are you trying to predict here? Could you explain the meaning of feat1-feat8?
  • andre5007 Member Posts: 14 Contributor I
    edited April 18
    I have two CSV files, and both have various features.
    Feat1 is the model, Feat2 is a power measure, Feat3 is something the object either has (1) or does not have (0), Feat4 is a feature whose meaning I don't know, Feat5 is the installation date of the device, Feat6/7 are the latitude and longitude, and Feat8 is the number of maintenance interventions.
    In the Training CSV I have values for Feat8; in the Test CSV I do not.
    My goal is to predict Feat8 for the Test set.
    I was told I could use RapidMiner for this job.
    I was also told that, to know which model to use, I would first have to relate the features to each other.
    For example, by doing this I can tell that there is an outlier.

    How can I do this?
    I hope it makes sense 
    If you need more information, please let me know 
    Thanks 
    André
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 305  RM Data Scientist
    Good progress on detecting outliers by GPS coordinates, @andre5007! You can use visualization plots and statistical distributions to identify the outliers and exclude them from training.


    According to your definition, the model is predicting "Feat 8, which is the number of maintenance interventions."
    I will stick with regression models (KNN, regression tree, Random Forest, GLM, and GBT are good choices for regression) because you are predicting a numerical target. If the target were categorical, say true/false or broken/normal, then go with classification.

    Besides visualization for data exploration and outlier detection, you can also use some of the outlier detection models (e.g. Tukey test for exponential distribution...)
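One simple statistical check of this kind is the standard Tukey fence rule: flag values more than 1.5 × IQR outside the quartiles. The plain-Python sketch below applies it to a single coordinate column. It is only an illustration under stated assumptions: the latitudes are made up, the function name is hypothetical, and it is the generic fence rule rather than any distribution-specific variant or a particular RapidMiner operator.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up latitudes: seven nearby devices and one far-away point.
latitudes = [38.70, 38.72, 38.74, 38.71, 38.73, 38.75, 38.69, -12.00]
outlier_rows = tukey_outliers(latitudes)
```

Running the same check on latitude and longitude separately and dropping the flagged rows from the training set is one way to act on what the scatter plot shows.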