Data mining

andre5007 Member Posts: 22 Contributor I
edited May 2021 in Help
Hi guys, I'm working on an assignment and have run into a problem that I don't know how to approach. I'm really new to RapidMiner, and I would like to know if anyone could help me.
I have to estimate Feature 8, which is the number of maintenance interventions the device has had. What can I do?
Thanks
André

Best Answer

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Solution Accepted
    Association and correlations are not meant for predictions. They are more like “descriptive” models. What is your purpose here?

Answers

  • kayman Member Posts: 662 Unicorn
    Could you share a bit more information, like a current input format example and the expected/wanted outcome? 
  • andre5007 Member Posts: 22 Contributor I
    edited April 2021
    Hey Kayman, I hope everything is fine with you.
    I have two CSV files, and both have several features.
    Feat1: model; Feat2: power measure; Feat3: something the object either has (1) or does not have (0); Feat4: a feature whose meaning I don't know; Feat5: device installation date; Feat6/7: latitude and longitude; Feat8: the number of maintenance interventions.
    In the Training CSV I have values for Feat8, and in the Test CSV I don't.
    My goal is to estimate Feat8 for the Test set.
    How can I do this?
    If you need more information please tell me 
    Thanks
    
    
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I believe this is a duplicate thread. I have responded in the other question. You should probably resolve one; duplicate threads are not generally helpful for the community.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @andre5007,

    I worked on your training data a bit to build regression trees based on clean features. The predictive model performs pretty well with 10-fold cross validation. RMSE is as follows

    My process is attached for your reference.
    Cheers,
    YY
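    For readers without the attached process, here is a minimal Python sketch of what "RMSE under 10-fold cross validation" means. It is not the attached RapidMiner process: it assumes a simple k-nearest-neighbour regressor instead of regression trees, and the data is synthetic. The names `knn_predict` and `cross_val_rmse` are made up for illustration.

```python
import math
import random

def knn_predict(train_X, train_y, x, k=3):
    """Predict the mean target of the k nearest training rows (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cross_val_rmse(X, y, n_folds=10, k=3, seed=0):
    """Return the RMSE over all held-out predictions from n_folds-fold cross validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled row indices into n_folds disjoint folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        # Each row is predicted by a model that never saw that row.
        for i in held_out:
            pred = knn_predict(train_X, train_y, X[i], k)
            squared_errors.append((pred - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Synthetic stand-in data (NOT the CSV files from this thread):
# an integer target that grows with the first feature, plus noise.
rng = random.Random(42)
X = [(rng.uniform(0, 10), rng.uniform(0, 1)) for _ in range(200)]
y = [round(2 * a + rng.gauss(0, 0.5)) for a, _ in X]

rmse = cross_val_rmse(X, y)
print(f"10-fold CV RMSE: {rmse:.3f}")
```

    The RMSE is in the same unit as the target, so here it would read as "number of interventions the prediction is off by, on average".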
  • andre5007 Member Posts: 22 Contributor I
    I was testing the process you built and it started giving the following error.
    When I imported the CSV I set all fields to nominal and that got past the problem. Was that the right solution?
    
    Then it gave this error.
    I couldn't solve this one. Can you tell me what to do?
  • andre5007 Member Posts: 22 Contributor I
    How do you import the CSV?
    When you import the csv do you make any changes?
    Thanks 
    André

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @andre5007, you have to run the sub-process for data loading first. To run it, just right-click it to enable it, and make sure this data loading step is executed before the modeling.




    I used the csv files from you in another thread. They are attached here as well.

    Cheers,
    YY
  • andre5007 Member Posts: 22 Contributor I
    Hi @yyhuang
    Now I get it, yes.
    Sorry for so many questions; I'm really quite new to this area.
    Could you explain, for example, why the IDs show up as '?'?

    What should I understand from this?


  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Oh, good catch. Thanks @andre5007. First, the id name is messed up in the data loading step, which usually happens in Read CSV. You can fix the column names with the "Rename" operator. Later, the id values get lost after applying the grouped model due to a potential bug (grouped model with target encoding). To work around that, you can keep the id as a regular attribute before scoring and set it to a special role after scoring, as shown in the attached process.
    PS. feat1 could potentially cause some data leakage if we apply target encoding to a categorical attribute with so many values. I don't have the context here, but you can try dropping it by configuring "Target Encoding".

    PPS. You can round the predictions after scoring if you prefer integers.
    HTH!
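    To make the leakage point concrete, here is a small Python sketch, assuming out-of-fold mean encoding as one common way to limit that leakage. It is not how the RapidMiner "Target Encoding" operator is implemented; `target_encode_oof`, `to_counts`, and the sample values are hypothetical.

```python
from collections import defaultdict

def target_encode_oof(categories, targets, n_folds=5):
    """Encode each row with the mean target of its category, computed only
    from rows in OTHER folds, so a row never contributes its own target."""
    global_mean = sum(targets) / len(targets)
    encoded = [0.0] * len(categories)
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        # Collect per-category statistics from rows outside the current fold.
        for i, (c, t) in enumerate(zip(categories, targets)):
            if i % n_folds != fold:
                sums[c] += t
                counts[c] += 1
        # Encode rows inside the current fold; unseen categories fall back
        # to the global mean instead of memorising a single target value.
        for i, c in enumerate(categories):
            if i % n_folds == fold:
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

def to_counts(predictions):
    """Round real-valued predictions to non-negative integers.
    Note: Python's round() rounds exact halves to the nearest even integer."""
    return [max(0, round(p)) for p in predictions]

models = ["A", "A", "B", "B", "C", "A", "B", "A"]  # hypothetical feat1 values
interventions = [2, 4, 0, 1, 7, 3, 1, 2]           # hypothetical feat8 values

enc = target_encode_oof(models, interventions, n_folds=4)
print(enc)                           # [3.0, 2.0, 1.0, 0.5, 2.5, 2.0, 1.0, 3.0]
print(to_counts([1.4, 2.6, -0.3]))   # [1, 3, 0]
```

    Model "C" occurs only once, so a naive per-category mean would encode it as 7.0, which is exactly its own target. With a categorical attribute that has very many values, most categories look like "C", and that is where the leakage comes from.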
  • andre5007 Member Posts: 22 Contributor I
    I've been studying a bit and looking at which models I could use, and I realized that at first I was using a classification model.
    From what I understand, you used a regression model here.
    But after studying, a doubt arose: wouldn't the associations & correlations model be the best fit for my problem? That model answers questions like "What happens together? What changes together?", while regression answers questions like "How much or how many? How many will happen?".
    I ask because, from what I understand, each feat influences the result. For example, feat 3 is something the object has or not (1 = has, 0 = has not), and it could influence feat 8. From what I understand (which is very little), the process you built did not take this into account, right?
    Sorry for the question, but I'm really just trying to understand as best I can.
    Thanks
    André
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    What are you trying to predict here? Could you explain the meaning of feat1-feat8?
  • andre5007 Member Posts: 22 Contributor I
    edited April 2021
    I have two CSV files, and both have various features.
    Feat1: model; Feat2: power measure; Feat3: something the object either has (1) or does not have (0); Feat4: a feature whose meaning I don't know; Feat5: installation date of the device; Feat6/7: latitude and longitude; Feat8: the number of maintenance interventions.
    In the Training CSV I have values for Feat8, and in the Test CSV I don't.
    My goal is to predict Feat8 for the Test set.
    I was told I could use RapidMiner for this job.
    I was also told that, to know which model to use, I would first have to relate the features to each other.
    For example, by doing this I can tell that there is an outlier.

    How can I do this?
    I hope it makes sense 
    If you need more information, please let me know 
    Thanks 
    André
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Good progress on detecting outliers from the GPS coordinates, @andre5007! You can use visualization plots and statistical distributions to identify the outliers and exclude them from training.


    According to your definition, the model is predicting "Feat 8, which is the number of maintenance interventions."
    I would stick to regression models (KNN, regression tree, Random Forest, GLM, and GBT are good choices for regression) because you are predicting a numerical target. If the target were categorical, say true/false or broken/normal, then go with classification.

    Besides visualization for data exploration and outlier detection, you can also use some of the outlier detection models (e.g. Tukey test for exponential distribution... )
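    As a rough illustration of statistical outlier detection on a coordinate column, here is a Python sketch, assuming the plain IQR-based Tukey fences rather than any specific RapidMiner operator. The function names and the latitude values are hypothetical.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (low, high) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the fences."""
    low, high = tukey_fences(values, k)
    return [not (low <= v <= high) for v in values]

# Hypothetical latitudes: one device is recorded at 0.0, far from the rest.
latitudes = [38.7, 38.8, 38.6, 38.9, 38.7, 38.8, 0.0, 38.7]

flags = flag_outliers(latitudes)
kept = [v for v, is_outlier in zip(latitudes, flags) if not is_outlier]
print(flags)
print(kept)
```

    The same check can be run on longitude, and rows flagged in either column can be excluded from training. The multiplier `k=1.5` is the conventional default; a larger value flags fewer points.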
  • andre5007 Member Posts: 22 Contributor I
    Hi @yyhuang
    I fully understand why you use regression and why classification is not the best choice, but I'm still at a loss as to why you don't use the associations & correlations method. Is there a reason?
    I'm not saying I doubt your knowledge, which I believe is much greater than mine.
    Thanks for your attention
    André
  • andre5007 Member Posts: 22 Contributor I
    The attached dataset records a total of 30554 "observations": 15277 in the Training set and 15277 in the Test set.
    The dataset is meant to capture the dependency that may exist between the "independent" characteristics (Features 1 to 7) and the number of maintenance interventions that the device has had (dependent variable, Feature 8) since it was put into operation (Feature 5).
    The dataset refers to various types of devices that are considered to operate under the same conditions except for those features recorded in the table. Features 6 and 7 refer to the location (latitude and longitude) in which the device operates.
    I now have to perform all the Data Mining steps that allow me to estimate Feature 8 for the Test set and get its 'estimation vector'. From what I understand and what I've found so far, I can't get that from RapidMiner.
    Thanks
    André