
How to resolve 100% accuracy in RapidMiner? [Urgent]

StudentNeedsHelp Member Posts: 2 Learner I
edited August 2020 in Help
Hello everyone,

The aim is to catch and predict fraud cases with optimum accuracy based on the dataset provided. For example, cases that are flagged as fraudulent and turn out to be non-fraudulent are not as critical as cases that are predicted to be non-fraudulent and turn out to be fraudulent.

For this, I wanted to use Logistic Regression, Neural Net and Decision Tree for comparison (the work is provided). Whenever I run the models, the accuracy of every one is near 100%; surely this is not correct.

I am new to RapidMiner and data preprocessing. Could someone advise me on which direction I should be heading?

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @StudentNeedsHelp,

    Your dataset is highly imbalanced (there are many more "non-fraudulent" than "fraudulent" cases),
    so the model has difficulty establishing the relationship between your features and the minority class of your label ("fraudulent").
    In the end, the model classifies all of your transactions as "non-fraudulent", which is why your accuracy is near 100%.
    I think that in your case a better performance indicator is the "class recall". Your priority is to correctly predict the fraudulent cases, isn't it?
    For that, you have to upsample your initial dataset by increasing the number of "fraudulent" examples, for example with the
    SMOTE Upsampling operator. This should increase the class recall of the fraudulent cases.

    Ideally, you can use Auto-Model after the upsampling operator and define the cost matrix at the "prepare target" screen (typically you quantify the cost of a "false negative" misclassification and the cost of a "false positive" misclassification).
    Auto-Model will then be executed to minimize the cost of misclassification and ultimately to maximize the gain...

    Hope this helps,

    Regards,

    Lionel
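
A minimal sketch of the point made above, in plain Python rather than RapidMiner. The 990/10 class split is a made-up toy example, not the poster's dataset, and the helper names `accuracy` and `class_recall` are hypothetical. It shows how a model that labels every case "non-fraud" can still score about 99% accuracy while its recall on the "fraud" class is zero.

```python
# Toy data (assumption): 990 legitimate cases and 10 fraudulent ones.
actual = ["non-fraud"] * 990 + ["fraud"] * 10

# A "model" that always predicts the majority class.
predicted = ["non-fraud"] * len(actual)


def accuracy(actual, predicted):
    """Fraction of all cases where the prediction matches the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def class_recall(actual, predicted, label):
    """Fraction of the true `label` cases that were also predicted as `label`."""
    relevant = [(a, p) for a, p in zip(actual, predicted) if a == label]
    if not relevant:
        return 0.0
    return sum(a == p for a, p in relevant) / len(relevant)


acc = accuracy(actual, predicted)
fraud_recall = class_recall(actual, predicted, "fraud")

print(f"accuracy:     {acc:.3f}")           # accuracy:     0.990
print(f"fraud recall: {fraud_recall:.3f}")  # fraud recall: 0.000
```

Accuracy is dominated by the majority class here, which is why class recall on the minority class is the more informative number for this kind of problem.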
  • StudentNeedsHelp Member Posts: 2 Learner I
    Hi @lionelderkrikor, thank you for the explanation; it makes a lot more sense now. Yes, the priority is to correctly predict the fraud and make sure fraud isn't marked as non-fraud. I have now used SMOTE Upsampling on the Logistic Regression with non-negative coefficients. The accuracy has dropped to around 97-98%. Is there a way I can quantify both false negatives and false positives without using Auto-Model? The second model, the neural network, is still displaying imbalance, and I am confused as to how to find the rare class responsible.

    thanks
  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    @StudentNeedsHelp

    Yes, without Auto-Model, you can use the Performance (Costs) operator to first quantify the cost of a FN and the cost of a FP, and then to calculate the final cost of misclassification.
    Please take a look at the process in the attached file, which uses your data, to experiment and to understand...

    Hope this helps,

    Regards,

    Lionel
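
The idea of putting a cost on each FN and FP can be sketched outside RapidMiner as follows. This is only an illustration of the arithmetic, not the Performance (Costs) operator itself. The cost values (a missed fraud assumed to be ten times as expensive as a false alarm), the toy predictions, and the function names are all assumptions made for the example.

```python
# Assumed costs: a false negative (missed fraud) is 10x worse than a false positive.
COST_FN = 10.0
COST_FP = 1.0


def confusion_counts(actual, predicted, positive="fraud"):
    """Return (tp, fp, fn, tn) with `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def misclassification_cost(actual, predicted):
    """Total cost: each FN and each FP weighted by its assumed cost."""
    _, fp, fn, _ = confusion_counts(actual, predicted)
    return fn * COST_FN + fp * COST_FP


# Toy data (assumption): 10 fraud cases followed by 990 non-fraud cases.
actual = ["fraud"] * 10 + ["non-fraud"] * 990

# Model A never flags fraud: 10 false negatives, 0 false positives.
model_a = ["non-fraud"] * 1000

# Model B catches 8 of 10 frauds but raises 30 false alarms.
model_b = ["fraud"] * 8 + ["non-fraud"] * 2 + ["fraud"] * 30 + ["non-fraud"] * 960

print(misclassification_cost(actual, model_a))  # 100.0
print(misclassification_cost(actual, model_b))  # 50.0
```

Model B is less accurate than model A (968 versus 990 correct out of 1,000) yet has half the total cost, which is the trade-off a cost-based performance measure is meant to expose.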