Applying New Dataset on the Model

JaspreetKaur Member Posts: 11 Contributor I
edited May 2020 in Help
I have built a decision tree model in RapidMiner and I get an accuracy of 96.06%. Now I have a new dataset, and I want to apply this decision tree model to it. How should I do this to confirm that my accuracy is still at least 95%, with a confidence of at least 90%?
Please advise ASAP!  

Answers

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @JaspreetKaur

    You need to store the trained model in your repository using the Store operator.

    Then you can retrieve the stored model by dragging and dropping it into the process window, and connect both the new dataset and this model to the Apply Model operator, followed by a Performance operator.
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing
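
    For readers who think in code, the store / retrieve / apply sequence described above can be sketched outside RapidMiner. This is only an illustrative analogue, not the RapidMiner process itself: the one-split "stump" model, the `signal` attribute, the JSON file standing in for the repository, and all data values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def train_stump(rows, attribute, label):
    """Fit a one-split 'tree': keep the threshold with the most correct training predictions."""
    best = None
    for threshold in sorted({r[attribute] for r in rows}):
        correct = sum((r[attribute] >= threshold) == r[label] for r in rows)
        if best is None or correct > best[0]:
            best = (correct, threshold)
    return {"attribute": attribute, "threshold": best[1]}


def apply_model(model, rows):
    """Predict True when the attribute reaches the learned threshold."""
    return [r[model["attribute"]] >= model["threshold"] for r in rows]


# Hypothetical labelled training data (stands in for the original dataset).
labelled = [
    {"signal": 0.2, "alarm": False},
    {"signal": 0.4, "alarm": False},
    {"signal": 0.7, "alarm": True},
    {"signal": 0.9, "alarm": True},
]
model = train_stump(labelled, "signal", "alarm")

# "Store" the model, then "Retrieve" it (a temp directory plays the repository).
with tempfile.TemporaryDirectory() as repo:
    path = Path(repo) / "model.json"
    path.write_text(json.dumps(model))
    restored = json.loads(path.read_text())

# "Apply Model" on new, unlabelled rows: this yields predictions only.
new_rows = [{"signal": 0.1}, {"signal": 0.8}]
predictions = apply_model(restored, new_rows)
print(predictions)
```

    The last step produces predictions but no accuracy figure, because the new rows carry no label to compare against.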

  • JaspreetKaur Member Posts: 11 Contributor I
    Will I get the accuracy by doing so?
  • JaspreetKaur Member Posts: 11 Contributor I
    The Alarm file is what I used to train and build the model. Now I have the New Alarm file, which does not have true labels, and I want to apply my model to this new dataset and check the accuracy.
    Could you help me understand how I should do this now?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @JaspreetKaur

    You cannot get performance metrics without true labels. You can only make predictions on this new dataset with the trained model, using the Apply Model operator.

    Simply connect the dataset to Apply Model and the trained model to the mod port of Apply Model to make predictions on the new dataset.
    Regards,
    Varun

  • JaspreetKaur Member Posts: 11 Contributor I
    So, what should I do to get the accuracy results? Like a performance classification matrix?
  • JaspreetKaur Member Posts: 11 Contributor I
    How will I know if my model still would give me at least 95% accuracy?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    You need to rely on your validated model performance.
    Regards,
    Varun

  • JaspreetKaur Member Posts: 11 Contributor I
    Okay, wait, I have been given a hint here. 
    The new Alarm file contains 3464 records, with 453 true alarms. Now, can you help me with how I should proceed?
  • JaspreetKaur Member Posts: 11 Contributor I
    But, the trick is I don't know which records are the 453 true ones. How do I find that? PLEASE HELP!
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @JaspreetKaur,

    If you have labeled data, you can validate the model predictions.
    If you have unlabeled data, there is no machine learning process to validate the predictions. They are often validated in the real world later.

    In validation, you compare the model prediction to the actual label. If you don't have a label, you can't compare.

    As @varunm1 mentioned, you perform a validation during model building. Experience shows that this validation result is applicable to future predictions with the same model if the data doesn't change too much (e.g. there is no concept shift). If the data-generating process changes (e.g. new machines are introduced, the weather becomes warmer; it depends on your scenario), the model starts to get worse. In that case you would retrain the model with recent data once you have the labels.

    Best regards,

    Balázs
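
    The original question asked for "at least 95% accuracy with at least 90% confidence". One possible way to read that, given only the validated performance, is a one-sided lower confidence bound on the validation accuracy. The sketch below uses the Wilson score bound and is an assumption on my part, not something proposed in this thread. It presumes the validation examples are independent and representative of the new data, and the validation set sizes shown are hypothetical, since the thread does not state them.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(correct, total, confidence=0.90):
    """One-sided Wilson score lower bound for a proportion (here: accuracy)."""
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value
    p = correct / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


# Hypothetical validation results: 961 of 1000 correct, and 96 of 100 correct.
large = wilson_lower_bound(961, 1000)
small = wilson_lower_bound(96, 100)
print(f"n=1000: lower bound {large:.4f}, clears 95%: {large >= 0.95}")
print(f"n=100:  lower bound {small:.4f}, clears 95%: {small >= 0.95}")
```

    The same observed accuracy can clear or miss the 95% threshold depending on how many validated examples stand behind it, so the size of the validation set matters as much as the headline figure.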
  • JaspreetKaur Member Posts: 11 Contributor I
    But my question is how will I get the accuracy on the new data set?
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    If you have labels, apply your model to the new data set. You will then have a column with the prediction and one with the label. (Make sure they have the appropriate roles.) Then use Performance (or a more specific operator like Performance (Binominal Classification)) to calculate the accuracy.
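
    What a Performance operator computes from the label and prediction columns can be sketched in a few lines. This is a plain-Python illustration, not RapidMiner code, and the five example rows are made up.

```python
def confusion_counts(labels, predictions, positive=True):
    """Count TP, FP, FN, TN by comparing each prediction with its true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(labels, predictions):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Share of examples whose prediction matches the label."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Hypothetical label and prediction columns.
labels = [True, False, True, False, False]
predictions = [True, False, False, False, True]

counts = confusion_counts(labels, predictions)
print(counts, accuracy(counts))
```

    Every count needs the true label of each row, which is why none of this can be computed for an unlabelled file.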
  • JaspreetKaur Member Posts: 11 Contributor I
    edited May 2020
    I had uploaded the dataset earlier. I don't have the True labels. But I have been given this information that my new dataset contains 3464 records with 453 True values. Now, how should I find out which ones are the true values?
  • JaspreetKaur Member Posts: 11 Contributor I
    @BalazsBarany @varunm1 I still didn't get my question answered. I have been asked to find the accuracy based on this information and, of course, the new dataset. Is there any other RapidMiner tool that could help me do so?
    Given that I have 453 true values in the new dataset, how can I use this information to find out which records are the 453 true ones?
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @JaspreetKaur,

    As mentioned before by @BalazsBarany and @varunm1, the usual methodology in a data science project is:
    1/ Train and validate a model on a LABELLED dataset, which allows you to calculate the accuracy of the model.
    2/ Then apply the validated model to the new UNLABELLED dataset to make predictions. BUT you cannot determine the exact accuracy of the model on this UNLABELLED dataset.
    Anyway, I think there is a misunderstanding about the word "True": by "True", you mean the examples that have the value "True" for your predicted label ("Alarm"), right?
    So I applied this methodology: I trained a model (Decision Tree) on your LABELLED dataset (called "Alarm file"), then applied this model to your UNLABELLED dataset (called "New Alarm unscored file"), and obtained the predictions for your label "Alarm": there are 410 values equal to "True" (maybe these are the values you are talking about) and 3054 values equal to "False". These results were obtained with a Decision Tree model; with another model you might obtain 453 values equal to "True".

    The attached file contains the process that, in my view, you need.

    Hope it is clear for you now,

    Regards,

    Lionel
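
    Although the exact accuracy stays unknown without per-record labels, the two totals mentioned in this thread (453 actual "True" records from the hint, 410 predicted "True" from the Decision Tree run above) do constrain it. The sketch below works out the best and worst case. It is my own arithmetic on those numbers rather than anything computed in RapidMiner, and it assumes the hint of 453 is correct.

```python
def accuracy_bounds(total, actual_positive, predicted_positive):
    """Lowest and highest accuracy consistent with the given totals."""
    # Best case: predicted and actual positives overlap as much as possible.
    max_tp = min(actual_positive, predicted_positive)
    # Worst case: they overlap as little as the record count allows.
    min_tp = max(0, actual_positive + predicted_positive - total)

    def accuracy(tp):
        fp = predicted_positive - tp  # predicted True, actually False
        fn = actual_positive - tp     # actually True, predicted False
        return (total - fp - fn) / total

    return accuracy(min_tp), accuracy(max_tp)


low, high = accuracy_bounds(total=3464, actual_positive=453, predicted_positive=410)
print(f"accuracy is between {low:.4f} and {high:.4f}")
```

    In the best case all 410 predicted alarms are real and only the 43 missed alarms are errors, giving 3421/3464, about 98.8%. In the worst case none of the predicted alarms are real, giving 2601/3464, about 75.1%. Since 95% lies inside that range, the totals alone cannot confirm or rule out the target.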
  • JaspreetKaur Member Posts: 11 Contributor I
    Thanks so much @lionelderkrikor . This helps me in better understanding the answer. 
    Thank you @BalazsBarany and @varunm1

