
# Algorithm Changes for False Negative/False Positive Manipulation

Member Posts: 1 Contributor I

I am puzzled by a particular issue, and I would greatly appreciate it if anyone could point me in the right direction to learn about solving it.

What I am hoping to do is use data mining to identify patients who could benefit from a cancer screening test that would not be beneficial to the general public. I am treating this as a classification problem with two groups: potential cancer patient and not potential cancer patient. However, I want the algorithm to be biased in a sense. What I mean is that I'm OK with it calling 100 disease-free people potential cancer patients; what matters most is minimizing the number of people with actual disease whom the algorithm says have no disease. In one case, falsely triggering the screening test, very little harm is done, but if you skip the screening test in someone with the disease, a great deal of harm is done. There is obviously a balance to be reached, because the little harm done by screening healthy people adds up if you screen too many of them to find one case of disease. That's basically the problem I'm working on: how to find the right way to identify the people who will benefit.

So, I've gotten to know my way around the basics, but at this point, do I need to learn to write my own algorithm?  Or are there algorithms where I can set some parameters that will bias them in various ways so I can evaluate the results of those biases?

Any assistance you can provide, and especially direction to resources where I can learn about this topic in depth, will be met with my sincere gratitude.

RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
Hey,

What you are looking for is cost-sensitive learning. Useful operators in RapidMiner are MetaCost and Find Threshold. You may also want to read up on confidences and ROC analysis.
If you need further help, just come back! But rest assured that you definitely don't need to implement a new algorithm.

Best, Marius
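
To make the idea of cost-sensitive thresholding concrete, here is a minimal sketch in plain Python. It is not how MetaCost or Find Threshold are implemented; the function names, the sample confidences, and the 100:1 cost ratio are all assumptions chosen for illustration. It tries each confidence value as a cut-off and keeps the one with the lowest total misclassification cost.

```python
def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total misclassification cost when flagging every score >= threshold."""
    cost = 0.0
    for score, has_disease in zip(scores, labels):
        flagged = score >= threshold
        if has_disease and not flagged:
            cost += cost_fn   # missed case: the expensive error
        elif flagged and not has_disease:
            cost += cost_fp   # unnecessary screening: the cheap error
    return cost


def cheapest_threshold(scores, labels, cost_fn, cost_fp):
    """Try each distinct score as a cut-off; return (threshold, cost) with the lowest cost."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf means "flag nobody"
    return min(
        ((t, total_cost(scores, labels, t, cost_fn, cost_fp)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical validation-set confidences and true labels (1 = disease).
scores = [0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

# Assumed cost ratio: one missed case is as bad as 100 unnecessary screenings.
threshold, cost = cheapest_threshold(scores, labels, cost_fn=100.0, cost_fp=1.0)
print(threshold, cost)
```

Changing the ratio of `cost_fn` to `cost_fp` moves the chosen cut-off, which is the kind of bias the original question asks for. The costs themselves have to come from domain knowledge, not from the algorithm.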
Member Posts: 5 Contributor II
You don't need to modify or create new algorithms.

You can use Optimize Parameters with AUC as the performance criterion instead of the default accuracy. Also take a look at recall, which is the percentage of the actual cancer cases that the model catches.

The score (confidence) then gives you a segmentation such as the following (the numbers are made up):

0.9-1.0: 95% chance of having cancer, 10% recall.
0.8-0.9: 80% chance of having cancer, 30% recall.
0.7-0.8: 50% chance of having cancer, 70% recall.

And so on. You then decide how much accuracy you are willing to sacrifice in favor of recalling as many cancer patients as possible. That's your threshold.

So first you find the best possible model, then you determine a threshold.
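
The trade-off described above can be sketched in a few lines of plain Python. The scores and labels below are invented, and the helper names are hypothetical; the point is only to show how precision and recall move as the cut-off is lowered, and how one might pick the highest cut-off that still reaches a target recall.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def highest_threshold_with_recall(scores, labels, min_recall):
    """Highest cut-off whose recall still meets min_recall, or None if none does."""
    for t in sorted(set(scores), reverse=True):
        if precision_recall_at(scores, labels, t)[1] >= min_recall:
            return t
    return None


# Hypothetical model confidences and true labels (1 = cancer).
scores = [0.95, 0.85, 0.75, 0.72, 0.60, 0.45, 0.30, 0.15]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t in (0.9, 0.8, 0.7, 0.4):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold {t:.1f}: precision {p:.2f}, recall {r:.2f}")

# Highest cut-off that still catches at least 75% of the true cases.
print(highest_threshold_with_recall(scores, labels, min_recall=0.75))
```

In practice the threshold should be chosen on held-out data rather than on the data used to train the model, otherwise the recall estimate will be optimistic.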