Help on How to Use FP-Growth
Hi,
Can someone help me with how to use the FP-Growth operator? I am new to RapidMiner and am trying to use it for some data mining work.
Here is the toy problem I used:
Transaction  Beef   Boots  Cheese  Chicken  Clothes  Milk
1            TRUE   FALSE  FALSE   TRUE     FALSE    TRUE
2            TRUE   FALSE  TRUE    FALSE    FALSE    FALSE
3            FALSE  TRUE   TRUE    FALSE    FALSE    FALSE
4            TRUE   FALSE  TRUE    TRUE     FALSE    FALSE
5            TRUE   FALSE  TRUE    TRUE     TRUE     TRUE
6            FALSE  FALSE  FALSE   TRUE     TRUE     TRUE
7            FALSE  FALSE  FALSE   TRUE     TRUE     TRUE
With the minimum support set at 0.3, I can easily find the following frequent itemsets by hand:
Itemset                 Trans Count  Support
Beef                    4            0.57
Cheese                  4            0.57
Chicken                 5            0.71
Clothes                 3            0.43
Milk                    4            0.57
Beef, Cheese            3            0.43
Beef, Chicken           3            0.43
Chicken, Clothes        3            0.43
Chicken, Milk           4            0.57
Clothes, Milk           3            0.43
Chicken, Clothes, Milk  3            0.43
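The hand calculation above can be cross-checked with a small brute-force sketch in plain Python. It assumes that TRUE means the item is in the transaction and that an itemset is frequent when its support is at least the threshold. The names `ITEMS`, `ROWS` and `frequent_itemsets` are made up for this illustration; this is not how the FP-Growth operator works internally, only an exhaustive count that is practical for a toy table.

```python
from itertools import combinations

ITEMS = ["Beef", "Boots", "Cheese", "Chicken", "Clothes", "Milk"]
# One row per transaction, columns in the same order as ITEMS (1 = TRUE, 0 = FALSE).
ROWS = [
    (1, 0, 0, 1, 0, 1),
    (1, 0, 1, 0, 0, 0),
    (0, 1, 1, 0, 0, 0),
    (1, 0, 1, 1, 0, 0),
    (1, 0, 1, 1, 1, 1),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 0, 1, 1, 1),
]


def frequent_itemsets(rows, items, min_support):
    """Count every possible itemset; exponential, so only for tiny examples."""
    # Turn each row into the set of items whose value is 1 (TRUE).
    baskets = [{item for item, v in zip(items, row) if v == 1} for row in rows]
    n = len(baskets)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = sum(1 for basket in baskets if set(combo) <= basket)
            if count / n >= min_support:
                result[combo] = count
    return result


found = frequent_itemsets(ROWS, ITEMS, 0.3)
for itemset, count in found.items():
    print(", ".join(itemset), count, round(count / len(ROWS), 2))
```

With seven transactions, a threshold of 0.3 means an itemset has to appear in at least three of them, which is why Boots (one transaction) drops out.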
However, FP-Growth outputs:
Size  Support  Item1   Item2
1     0.571    Cheese
1     0.429    Milk
1     0.429    Clothes
1     0.429    Beef
2     0.429    Cheese  Milk
Both the support values and the itemsets differ from my hand calculation.
I only used two operators: one to retrieve the data from the repository (I checked the data output and it looks good) and FP-Growth, with "Find min number of itemsets" unchecked and "min support" set to 0.3.
Maybe there are some parameters I should set up? Really appreciate your help!
Jie
Answers
It looks like your value count is inverted, i.e. 'FALSE' is being counted as the positive value. You need to declare explicitly that the positive value is 'TRUE' and the negative value is 'FALSE'. You can do this by placing a 'Remap Binominals' operator upstream of the 'FP-Growth' operator. While this may seem onerous, it can be useful in other applications, for instance where the presence or absence of something is being investigated.
Good luck!
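To see why the choice of positive value matters, here is a minimal Python sketch that computes single-item support on the toy table twice, once counting TRUE as "item present" and once counting FALSE. The helper `item_support` and the variable names are hypothetical. It only illustrates the effect of the mapping and does not try to reproduce the exact output of the FP-Growth or Remap Binominals operators.

```python
ITEMS = ["Beef", "Boots", "Cheese", "Chicken", "Clothes", "Milk"]
T, F = True, False
# The toy transactions from the question, columns in the same order as ITEMS.
ROWS = [
    (T, F, F, T, F, T),
    (T, F, T, F, F, F),
    (F, T, T, F, F, F),
    (T, F, T, T, F, F),
    (T, F, T, T, T, T),
    (F, F, F, T, T, T),
    (F, F, F, T, T, T),
]


def item_support(rows, items, positive):
    """Fraction of rows in which each item's value equals `positive`."""
    n = len(rows)
    return {
        item: sum(1 for row in rows if row[i] == positive) / n
        for i, item in enumerate(items)
    }


as_true = item_support(ROWS, ITEMS, True)    # intended reading
as_false = item_support(ROWS, ITEMS, False)  # inverted reading

for item in ITEMS:
    print(f"{item:8s} TRUE as positive: {as_true[item]:.3f}   "
          f"FALSE as positive: {as_false[item]:.3f}")
```

Under the inverted reading every support becomes one minus the intended value, so a different set of items clears the 0.3 threshold.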
It works! Thank you very much for the help!
May I ask a follow-up question?
I am looking for a web mining tool to do web usage analysis (association rules, sequential patterns, etc.) on clickstream data. The dataset size is about 10 to 100 million records with 100 variables. Do you think RapidMiner is the right tool? I know of companies using SAS Enterprise Miner, but it is really pricey. Some friends recommend Knowledge Studio or Revolution R. I have watched several RapidMiner video tutorials, and I like the elegant GUI design and the simplicity of drag-and-drop. There is a rich set of operators covering a wide range of problems. What about performance and accuracy? I would really appreciate your advice.
Thanks a lot in advance.
Jie
Glad that worked. On your more general questions, it is difficult to be specific; I rather doubt that anyone has sufficient knowledge of all the available packages. FWIW, I use RapidMiner to sift for patterns in datasets of the size you mention, and because I need the answers fast, I greatly value that RM is open source and therefore checkable and extendable. I use RM to marshal the data and CUDA to grind it. Zoooom!
Thank you very much for the quick response. It is very nice to know that you are using RM to mine datasets of a similar size. I will give RM a try.
Thanks again.
Jie