How do I split up scored data into 20 equally sized segments?
simon_philipose
Member Posts: 3 Learner I
Hi there-- still only a few days into using RapidMiner and wasn't sure if/how I could go about doing the following:
I created a logistic regression model for direct mail marketing. I've scored my model onto new data but what I want to be able to do is split the scored data up into 20 different groups based on their descending confidence(responder) value resulting in the A's having 1/20th of the most likely responders, the Bs having 1/20th of the next most likely and so on.
Your help is much appreciated.
-Simon
Best Answer
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hi @simon_philipose, and welcome to the community!
Well, I have a few things for you today. I don't want to be rude by telling you how to manage your business, but before working on a solution, I wanted to give you some small advice.
Old Man's Advice:
First of all, please note that what you are asking for is counterintuitive from a business perspective. Normally you would not want equal-sized bins: if you have, e.g., 100 examples divided into 10 groups and 40 of them have a confidence of 0.1, groups 7, 8, 9 and 10 would all be the same thing. If you build a rule system on these groups and then nail it with your next campaign (and that happens!), you will have to change your rule system and fine-tune it for every mail campaign.
If I were you, and having worked with e-mail marketing systems in the past, I would have discretized with one of the options that are already available in RapidMiner.
Now, Counterintuitive is my codename, hence I decided to see if I could solve your issue. Here it is.
I separated your problem into two subprocesses and an operator:
- Rank
- Bin
- Clean
This is how the overall process works.
Inside the Rank subprocess, I used the Sort operator to sort by what would be your Confidence Value, then used Generate ID to add a number, and then Set Role to not use that generated ID as an ID role, because we are going to do math with it. This is how it looks.
Inside the Bin subprocess, I used the Extract Macro operator to extract the number of examples into a variable named MaxID, then used Generate Attributes to introduce a new attribute with a small calculation, which is
100 / MaxID * id
However, since MaxID is a scalar value coming from the Extract Macro operator, I need to put it inside %{} and then eval() it, because macros are usually nominal or text (can't remember which one, doesn't matter for this explanation):
100 / eval(%{MaxID}) * id
Finally, I used the Discretize by Binning operator to generate 20 bins based on the Group_Model attribute. That overwrites the value stored there with the range. You can discretize by user specification too, if you want to change the names of range1, range2, range3... or use the Replace operator to change range to Group or whatever you want. As usual, I wouldn't want to take the fun of exploring RapidMiner away from you. This is how it looks:
And the Clean stage is only a Select Attributes operator, meant to remove the ID we generated at the beginning.
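For readers who want to check the arithmetic outside RapidMiner, the Rank, Bin and Clean stages described above can be sketched in plain Python. This is only an illustration under assumptions: the function name `assign_groups` and the 1-to-20 numbering are made up here, and the integer formula is a close analogue of binning `100 / MaxID * id` into 20 equal-width ranges, not a reproduction of what Discretize by Binning does internally.

```python
def assign_groups(scores, n_groups=20):
    """Return one group number (1..n_groups) per score; group 1 holds the highest scores.

    Hypothetical helper for illustration only, not part of RapidMiner.
    """
    n = len(scores)
    # Rank: order the row positions by descending score (Sort + Generate ID).
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    groups = [0] * n
    for rank, row in enumerate(order, start=1):
        # Bin: ceil(rank * n_groups / n) in integer arithmetic, i.e. the rank
        # rescaled to a percentage and cut into n_groups equal-width ranges.
        groups[row] = (rank * n_groups + n - 1) // n
    # Clean: only the group numbers are returned, the helper rank is dropped.
    return groups
```

Mapping group 1 to A, group 2 to B and so on gives the lettering asked for in the question. Note that tied scores are split between neighbouring groups in whatever order the sort leaves them, which is the caveat raised in the advice above.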
Please find attached process.
Hope it helps,
Rodrigo.
Answers
You can first use the Sort operator to sort the confidence values in descending order, followed by the Split Data operator.
In the Split Data operator's Parameters window, add a partition ratio of 1/20.
Hope this helps.
Cheers,
Pavithra
Hi Pavithra,
Thank you for your response. So I ran into a few problems with using the Split Data operator.
1. It splits the dataset into multiple datasets. What I need is one data set but with a field called Model_Group with a value of A, B, C, D, etc. depending on the confidence values.
2. It appears the maximum number of data sets I can split into is 8, by putting .125 in the partitions ratio field 8 times. I can't do 10, much less 20 different splits.
I would do the following:
Sort - by confidence
Generate ID - to get a index
Use Generate attributes with id%10 to get your Model_Group
Best,
Martin
Dortmund, Germany
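One thing worth checking with an ID-based expression is how the arithmetic treats neighbouring rows. A modulo cycles through the group numbers row by row, while an integer division keeps consecutive ranked rows together. The small Python sketch below contrasts the two; the function names are hypothetical, and the IDs are assumed to start at 1.

```python
def cyclic_group(row_id, n_groups):
    # row_id % n_groups: consecutive ranked rows land in different groups.
    return row_id % n_groups


def block_group(row_id, n_rows, n_groups):
    # Integer division keeps consecutive ranked rows in the same group.
    # row_id is assumed to be 1-based; groups are numbered from 0.
    return (row_id - 1) * n_groups // n_rows


ids = range(1, 21)  # 20 rows already sorted by descending confidence
cyclic = [cyclic_group(i, 10) for i in ids]
blocks = [block_group(i, 20, 10) for i in ids]
```

With 20 sorted rows and 10 groups, `cyclic` starts 1, 2, 3, ... and wraps around, whereas `blocks` starts 0, 0, 1, 1, ..., so only the second form puts the top-ranked rows together in the first segment.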
If you copy your score attribute first, Discretize by Frequency should be able to do this directly for your score attribute by selecting that attribute and setting the number of bins to 20. This will create exactly the bins you are looking for, although if there are a large number of ties this can sometimes cause problems for the Discretize operators. (The reason you copy the score first is Discretize will replace your selected attribute with a new attribute, so if you still want to have the raw score, you will need two copies of it, one which is binned and one which is not).
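The tie caveat can be illustrated with a generic equal-frequency binning sketch in Python. This is not RapidMiner's implementation; `equal_frequency_bins` is a hypothetical helper that places cut points at evenly spaced positions in the sorted values and assumes there are at least as many values as bins.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin 0..n_bins-1 using evenly spaced cut points.

    Hypothetical helper; assumes len(values) >= n_bins.
    """
    ordered = sorted(values)
    n = len(ordered)
    # The upper cut point of bin b is the last value of the b-th slice.
    cuts = [ordered[(b + 1) * n // n_bins - 1] for b in range(n_bins - 1)]
    # A value's bin is the number of cut points it exceeds, so equal
    # values always end up in the same bin.
    return [sum(v > c for c in cuts) for v in values]


distinct = equal_frequency_bins(list(range(1, 11)), 5)
tied = equal_frequency_bins([0.1] * 6 + [0.5, 0.6, 0.7, 0.8], 5)
```

With ten distinct values every bin receives two rows. With six tied values, three of the four cut points coincide, so one bin swallows all six ties and two bins stay empty.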
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts