"Dealing with nulls in FP Growth"
Hello everyone,
I am very new to RapidMiner and have a question about how variables with many levels are translated into binomial variables for FP Growth.
For example, in the survey data I am trying to analyze, one of our variables is age group. The groups go 0-13, 13-18, 19-24, 25-29, etc. I understand that it will create a variable for each of these groups (except one) and assign 1's and 0's depending on whether the respondent falls into the group or not.
My problem is when a participant did not answer the age question. How does RapidMiner handle these observations? Will it assign a zero in each of the categories, or eliminate the observation completely? I believe the former would distort the results, so I would prefer that people who did not answer the age question not be considered when discovering rules involving age.
Is this how RapidMiner already works, and if not, is there some way to set the process up to do this?
Thank you in advance.
Matt
Answers
Interesting question. In RapidMiner terms this is about 'missing values', and you have two choices:
1. Filter the examples, so that those with missing values are removed.
2. Transform the data, by 'data cleansing', where you fill in the blanks.
It's a difficult call whether to filter or to clean, so I tend to do both and compare the results.
Wish I could be more helpful! Good weekend.
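As a rough illustration of the two options in the reply above, here is a minimal Python sketch outside RapidMiner. The survey rows, the function names, and the "unknown" placeholder are all made up for the example; None stands for an unanswered question.

```python
# Hypothetical survey rows; None marks an unanswered question.
rows = [
    {"age": "19-24", "beverage": "Beer"},
    {"age": None, "beverage": "Wine"},
    {"age": "25-29", "beverage": None},
]

def drop_incomplete(rows):
    """Option 1: keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_missing(rows, placeholder="unknown"):
    """Option 2: replace every missing value with a constant placeholder."""
    return [
        {k: (placeholder if v is None else v) for k, v in r.items()}
        for r in rows
    ]

filtered = drop_incomplete(rows)  # only the first row survives
filled = fill_missing(rows)       # all rows kept, blanks become "unknown"
```

Running both and comparing what comes out is one way to see how much the choice matters for a given dataset.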
Thank you for your response! My question is more regarding how FPGrowth handles missing values. My dataset is a survey with approximately 200 attributes. Every observation has at least one attribute missing, therefore elimination is not an option. I understand I can impute missing values but if FPGrowth can handle missing values this is not required.
To summarize, I'm looking to find out:
1. Can FPGrowth deal with missing values?
2. If FPGrowth can handle missing values, how can I get Polynomial2Binomial to preserve nulls?
For example, for the question about age, if someone did not answer, I would want it to say:
0-13 = null
14-19 = null
20-24 = null... etc
As opposed to:
0-13 = false
14-19 = false
20-24 = false... etc
So for this example, people who did not specify age would have no influence on discovering rules relating to age.
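To make the desired encoding concrete, here is a small Python sketch of a null-preserving binarization. It is not how any RapidMiner operator is known to behave; the `binarize` function and the column naming are assumptions made up for this illustration.

```python
def binarize(attribute, value, levels):
    """Return one flag per level: True/False if answered, None if not answered."""
    if value is None:
        # Unanswered question: every derived column stays null instead of False.
        return {f"{attribute} = {level}": None for level in levels}
    return {f"{attribute} = {level}": value == level for level in levels}

AGE_LEVELS = ["0-13", "14-19", "20-24"]

answered = binarize("age", "14-19", AGE_LEVELS)
skipped = binarize("age", None, AGE_LEVELS)

print(answered)  # one True flag, the rest False
print(skipped)   # every flag is None
```

Whatever consumes these columns still has to know what to do with the None flags, which is the open part of the question.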
Thanks again for your time!
FPGrowth cannot handle missing values, hence my advice. Try it for yourself.
I also tried to find association rules with the Apriori algorithm and ran into a similar problem. It is my understanding that confidence is calculated (simply put) as: #FollowingRule / (#FollowingRule + #BreakingRule)
The problem is that when the premises are true and there is a null value in the conclusion, the observation is considered to be breaking the rule. For example, suppose I have a rule that states the following:
Premise 1: Favorite beverage - Beer
Premise 2: Car or Truck? - Truck
Conclusion: Likes football? - True
If a respondent indicates that their favorite beverage is beer and that they prefer trucks to cars, but doesn't answer whether they like football, the Apriori algorithm will consider this to be breaking the rule just as much as if the same respondent had indicated they did not like football. Instead, I want that respondent not to count in the support/confidence calculations at all for that particular rule.
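The difference between the two ways of counting can be sketched in a few lines of Python. This is a hand-rolled illustration, not the behavior of any existing Apriori or FP-Growth implementation; the respondents, attribute names, and the `rule_confidence` helper are hypothetical.

```python
def rule_confidence(rows, premises, conclusion):
    """Return (standard, adjusted) confidence for one rule.

    standard: a missing conclusion counts as breaking the rule.
    adjusted: rows with a missing conclusion are left out of the denominator.
    """
    attr, wanted = conclusion
    # Rows where every premise holds (a missing premise value never matches).
    matching = [r for r in rows if all(r.get(a) == v for a, v in premises.items())]
    answered = [r for r in matching if r.get(attr) is not None]
    following = sum(1 for r in answered if r[attr] == wanted)
    standard = following / len(matching) if matching else None
    adjusted = following / len(answered) if answered else None
    return standard, adjusted

# Hypothetical respondents; None marks an unanswered question.
rows = [
    {"beverage": "Beer", "vehicle": "Truck", "football": True},
    {"beverage": "Beer", "vehicle": "Truck", "football": True},
    {"beverage": "Beer", "vehicle": "Truck", "football": False},
    {"beverage": "Beer", "vehicle": "Truck", "football": None},
]

standard, adjusted = rule_confidence(
    rows, {"beverage": "Beer", "vehicle": "Truck"}, ("football", True)
)
print(standard, adjusted)
```

With these four respondents the standard count gives 2 out of 4, while the adjusted count gives 2 out of 3, because the respondent who skipped the football question is dropped for this rule only.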
Is there any rule association algorithm out there that I can accomplish this with?
Thanks again for your time.
Missing values and frequent item mining just don't mix conceptually; each example can increase the count of several patterns, so allowing wild cards would explode the combinations. Now you see why I originally said it was an interesting question!
It is a bummer that each of your examples has a missing value, because it means you're going to have to do something with the blanks. One approach might be to replace every missing value with a constant, say 'dummy', and then have a look at the frequent item sets that come out of FPGrowth. You could at least use any item sets that did not contain 'dummy' to make your association rules.
Good luck, just keep grinding it down!
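The 'dummy' suggestion above can be sketched as follows. This uses a brute-force item set counter in plain Python rather than FPGrowth itself, purely to show the filtering step; the rows, the `max_size` cap, and all function names are assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations

DUMMY = "dummy"

def to_items(row):
    """Turn a row into a set of 'attribute=value' items, with blanks as 'dummy'."""
    return frozenset(f"{k}={DUMMY if v is None else v}" for k, v in row.items())

def frequent_itemsets(transactions, min_count, max_size=2):
    """Count every item set up to max_size and keep those seen min_count times."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

def without_dummy(itemsets):
    """Drop every item set that contains a 'dummy' item."""
    return {
        s: c for s, c in itemsets.items()
        if not any(item.endswith("=" + DUMMY) for item in s)
    }

# Hypothetical survey rows; None marks an unanswered question.
rows = [
    {"beverage": "Beer", "vehicle": "Truck", "football": None},
    {"beverage": "Beer", "vehicle": "Truck", "football": True},
    {"beverage": "Beer", "vehicle": None, "football": True},
    {"beverage": "Beer", "vehicle": "Truck", "football": None},
]

frequent = frequent_itemsets([to_items(r) for r in rows], min_count=2)
usable = without_dummy(frequent)
```

Note that the counts of the surviving item sets are still taken over all rows, so this only hides the 'dummy' item sets; it does not change how the remaining ones are counted.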
I don't think wildcards are what Matt needs. From what I understand, he just does not want missing values to be counted against the rule.
That is hard to accomplish when the data is represented as a 'transaction', i.e. an item counts against the rule whenever it is not present in the transaction. Hence there is no way to account for missing values.
If the data is represented in tabular form:
I don't know of an existing algorithm that approaches the problem from this angle, but I see its value for sociology, business intelligence, etc.