# Reject Inferencing

I have just started using (read: discovering) Rapid-i, and I am using an old dataset to learn things.

One of the things I would like to do is reject inferencing.

Let me explain what I feel it should do:

I have a dataset in which only a certain percentage of cases has a known outcome (say 'good' and 'bad'); there is a big group for which I do not know the outcome (say 'rejects' in credit risk, or 'indeterminates' in any other project).

As I do want to incorporate this last group in my modelling (so my data sample has a complete view of the dataset), I would have to use a technique called "reject inferencing"; by doing so I get a more trustworthy model that is based on my entire customer base.

But as I am using these cases to model on, I want a probability weighting on these cases for their likely behaviour, based on the variables used, thus creating a training set that is representative of the entire population.

This is something that should be done in the pre-modelling stage:

So giving the known cases (good and bad) a weight of 100%

and unknown cases a likelihood weight of good behaviour between 0 and 100%.

Now my question:

How do I do this in Rapid-i?

## Answers

347Maven

I must admit that I am really interested in this problem. I am currently dealing with the correction of sample selection bias, which can be viewed as the rejection inference problem as well (according to this source: http://people.cs.uu.nl/ad/mbrejinf.pdf). Hence I am a little confused by your explanation. Do you want to include the outcome or not? Correcting the sample selection bias (the rejection inference problem) is a nontrivial task: it means adjusting the distribution of the training set to the distribution of the underlying population to get more reliable results.

I don't know if I got you right, but: did you plan to infer the outcome where it is not known, and then learn a model based on both groups (which now both have a known outcome)? Sorry, this makes no sense to me. Maybe you have confused the mechanism of whether the outcome is known with the mechanism of the outcome itself? Please be more specific about your task!

Summary: maybe you are willing to tell us which techniques you want to use to deal with the rejection inference? That way, help with how to use RapidMiner can be provided more easily.

greetings

Steffen

23Maven

For instance, in application scoring the outcome information (loan repaid) is missing for a non-random sample (the rejected applicants). So let's use credit scoring as an example.

Each customer is qualified as either a good or a bad case. This qualification is based on the amount of time repayment is in arrears. The evidence of good or bad performance is clearly not available for credit applicants whose application had been rejected.

Nevertheless, these applicants represented a part of the population that would have to be rated in the future. To classify these cases I'd like to use a facility for the automatic inference of outcomes (reject inferencing).

This is based on a unique probabilistic modelling process. The sample contained X attributes for each application for a particular type of loan. Out of N applicants, Rn have been rejected, Rg are classified as good (i.e. they repaid their loans) and Rb are classified as bad (i.e. they defaulted).

Now you have a choice:

You either just model the known cases (in which case you incorporate the model already in place into your training, which is something you do not want).

Or you infer, which allows an overall probability of positive behaviour to be set for the rejects as a whole. That way you make good use of the entire population, one that is representative of all the customers that come to the door and not just of the ones you let in previously.

So what you do, or what the software should do for you, is the following:

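You have known cases:

| id | outcome | age | income | loan | ... etc |
|----|---------|-----|--------|------|---------|
| 1  | G       | 32  | 23000  | 5000 |         |
| 2  | B       | 19  | 9000   | 2500 |         |

And unknown cases:

| id | outcome | age | income | loan | ... etc |
|----|---------|-----|--------|------|---------|
| 3  | ..      | 27  | 21000  | 4000 |         |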

Now, by using clustering techniques (looking at the known cases), the software should be able to infer a probability of the outcome being good (G) or bad (B) based on the variables which are known (age, income, loan, ... etc).

After which you can use the entire population to create a model on the outcome/inferred outcome.
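To make the last step concrete, here is a minimal sketch in plain Python (not a RapidMiner process) of one common way to combine both groups, often called fuzzy augmentation: known cases keep a weight of 100%, and each unknown case is entered twice, once as good with weight P(good) and once as bad with weight 1 - P(good). The function name `build_weighted_training_set` is hypothetical, and the probability 0.7 for case 3 is a made-up placeholder, not something inferred from the data.

```python
def build_weighted_training_set(known, unknown, p_good):
    """Combine known and unknown cases into one weighted training set.

    known   : list of (features, outcome) with outcome 'G' or 'B'
    unknown : list of features without an outcome
    p_good  : inferred probability of 'G' for each unknown case
    Returns a list of (features, outcome, weight) rows.
    """
    # Known cases count fully (weight 100%).
    rows = [(features, outcome, 1.0) for features, outcome in known]
    for features, p in zip(unknown, p_good):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p_good must lie between 0 and 1")
        rows.append((features, "G", p))        # partial 'good' copy
        rows.append((features, "B", 1.0 - p))  # partial 'bad' copy
    return rows


# Rows 1-3 from the example above: (age, income, loan)
known = [((32, 23000, 5000), "G"), ((19, 9000, 2500), "B")]
unknown = [(27, 21000, 4000)]

# 0.7 is an assumed value for illustration only.
training = build_weighted_training_set(known, unknown, p_good=[0.7])
for row in training:
    print(row)
```

Any learner that accepts example weights can then be trained on `training`. The two copies of an unknown case always add up to a weight of one, so a reject never counts for more than a known case.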

347Maven

First of all, thank you for your detailed explanation. This is exactly the same thing I am currently dealing with. Unfortunately I am not past the "collect information" step yet, so I cannot provide any complete setups in RapidMiner (yet).

Regarding the general problem: you are able to infer the outcome based on this setup by simply learning classification models which provide confidences (aka estimated probabilities), e.g. Naive Bayes, Decision Tree etc. This is no problem in RapidMiner. Just study the tutorial (online and manual.pdf) and you will be able to do it.
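As an illustration of what "confidences" means here, the following is a rough sketch of a Gaussian Naive Bayes written in plain Python. It is not RapidMiner's implementation, the function names are hypothetical, and only the first and fourth known rows come from this thread; the other four rows are invented so that each class has more than one example.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Estimate prior, per-feature mean and variance for each class."""
    model = {}
    for c in sorted(set(labels)):
        xs = [r for r, y in zip(rows, labels) if y == c]
        n = len(xs)
        means = [sum(col) / n for col in zip(*xs)]
        # A small floor keeps the variance positive if all values coincide.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(zip(*xs), means)]
        model[c] = (n / len(rows), means, variances)
    return model


def confidences(model, x):
    """Return the estimated probability of each class for one case x."""
    log_scores = {}
    for c, (prior, means, variances) in model.items():
        s = math.log(prior)
        for v, m, var in zip(x, means, variances):
            s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_scores[c] = s
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: e / total for c, e in exp_scores.items()}


# (age, income, loan); rows marked "made up" are illustrative only.
rows = [
    (32, 23000, 5000),  # from the thread, G
    (45, 31000, 6000),  # made up
    (38, 27000, 3000),  # made up
    (19, 9000, 2500),   # from the thread, B
    (23, 12000, 4000),  # made up
    (21, 10000, 3500),  # made up
]
labels = ["G", "G", "G", "B", "B", "B"]

model = fit_gaussian_nb(rows, labels)
conf = confidences(model, (27, 21000, 4000))  # the unknown case 3
print(conf)
```

The value `conf["G"]` is the kind of confidence that could serve as the likelihood weight for an unknown case. Note that the model is fitted on the known cases only, so it inherits whatever bias the earlier accept/reject decision introduced.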

BUT: learning these models based only on the known cases will bias the probabilities; this is the sample selection bias. So learning a new model based on the known cases and the inferred outcomes will be biased as well. One direct way to solve this is using transduction. An implementation of the Transductive SVM can be found in RapidMiner Enterprise Edition.

Other attempts have been made to solve this problem, but none of them can be easily used in RapidMiner by just using existing operators (as far as I can see). One of them does indeed use clustering techniques, but it is quite new, so I guess you are referring to something else (Type Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing).

Here you find a survey regarding the general techniques to solve the Sample Selection Bias.

Sorry for throwing too much information at you. I am rather focused on this problem right now and so forget that others may not want to study it that deeply.

greetings,

Steffen

PS: Aside: Google brought this up: Credit Scoring and Sample Selection Bias. I will have a look at it tomorrow, seems to be quite interesting.

23Maven

No, there is not too much information... in fact... I'd love to see more...

Before (that is 5 years ago, from 1997 to 2003) I worked with a tool we developed at the company I was working for, called OMEGA. This tool had reject inferencing embedded in the pre-analysis step, after which the modelling started (using a genetic algorithm).

Doing so, it was easy to generate very reliable models with good performance.

Being spoiled, I now have to retrain myself, using rapid-i...

At least I now have a direction to follow... thanx

347Maven

Just curious

Steffen

23Maven

For the inferencing we used clustering techniques (k-nearest neighbour).
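For readers who want to see the idea, here is a small sketch of k-nearest-neighbour inference in plain Python. It is only an assumption about how such a step could look, not a description of what OMEGA did: the function name `knn_p_good`, the min-max scaling and the choice of k are all illustrative, and four of the six known rows are invented for the example.

```python
import math


def knn_p_good(known, x, k=3):
    """Share of 'G' outcomes among the k known cases nearest to x."""
    features = [f for f, _ in known]
    lows = [min(col) for col in zip(*features)]
    highs = [max(col) for col in zip(*features)]

    def scale(row):
        # Min-max scaling so that income does not dominate age and loan.
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    target = scale(x)
    by_distance = sorted(
        known, key=lambda case: math.dist(scale(case[0]), target))
    nearest = by_distance[:k]
    return sum(1 for _, outcome in nearest if outcome == "G") / len(nearest)


# (age, income, loan) with outcome; "made up" rows are illustrative only.
known = [
    ((32, 23000, 5000), "G"),  # from the thread
    ((45, 31000, 6000), "G"),  # made up
    ((38, 27000, 3000), "G"),  # made up
    ((19, 9000, 2500), "B"),   # from the thread
    ((23, 12000, 4000), "B"),  # made up
    ((21, 10000, 3500), "B"),  # made up
]

p = knn_p_good(known, (27, 21000, 4000), k=3)  # the unknown case 3
print(p)
```

The returned share can be used directly as the likelihood weight of good behaviour for a rejected case. With small k the estimate is coarse (only k + 1 possible values), so in practice a larger k or distance-weighted votes may be preferable.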