Two simple questions
I teach Data Mining at a Business School and I'm considering using RapidMiner as the official software (last year I used XLMiner and Rattle/R). I'm translating everything I did with those two packages to RapidMiner. I have two very simple questions.
1) After running a cluster algorithm (say k-means), I'd like to get some basic stats (means, medians, st devs) BY cluster membership. Can I do that?
2) Suppose I have a set of variables (beer=label, income, education, age, woman, etc = attributes) and I want to run a simple linear regression. I want to be able to manually leave some variables out. For instance, I want to omit "age" and "woman". How could I do that? I've tried to use FeatureNameFilter but I can only list one of the two. (I've tried to separate the list of variables I want to omit with commas, semi-colons, etc with no success).
Thanks in advance for any help,
E.
Answers
Now back to your questions, they are actually ... well ... quite simple!

1) Place an [tt]Aggregation[/tt] operator after the clustering algorithm. You then have to specify which attributes should be aggregated and by which function (mean, median, stddev, min, max, etc.). As the [tt]group_by[/tt] attribute, specify the cluster id.

2) The [tt]FeatureNameFilter[/tt] accepts regular expressions. The regular expression covering both attributes age and woman is [tt]age|woman[/tt]; the [tt]|[/tt] acts as a logical or. By the way: the [tt]FeatureNameFilter[/tt] has been replaced by the [tt]AttributeFilter[/tt] operator, which also lets you filter by conditions other than names or regular expressions.
Hope that helps,
Tobias
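For anyone translating from R or another scripting environment, the two ideas in the answer above (dropping attributes whose names match [tt]age|woman[/tt], and computing statistics grouped by cluster id) can be sketched in plain Python. This is only an illustration of the logic, not RapidMiner code: the toy rows, the helper names drop_matching and stats_by_group, and the choice to match the whole attribute name are all assumptions made for the example.

```python
import re
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical toy data: column names follow the question, values are made up.
rows = [
    {"income": 30.0, "age": 25, "woman": 0, "cluster": "cluster_0"},
    {"income": 40.0, "age": 31, "woman": 1, "cluster": "cluster_0"},
    {"income": 50.0, "age": 47, "woman": 0, "cluster": "cluster_0"},
    {"income": 60.0, "age": 52, "woman": 1, "cluster": "cluster_1"},
    {"income": 80.0, "age": 38, "woman": 1, "cluster": "cluster_1"},
]


def drop_matching(names, pattern):
    """Keep only the names that do NOT match the pattern as a whole.

    Whole-name matching is an assumption of this sketch; with a substring
    search, 'age' would also remove a name such as 'wage'.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name) is None]


def stats_by_group(rows, value_key, group_key):
    """Compute mean, median and sample standard deviation per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    # stdev needs at least two values per group.
    return {
        group: {"mean": mean(vals), "median": median(vals), "stdev": stdev(vals)}
        for group, vals in groups.items()
    }


attributes = ["beer", "income", "education", "age", "woman"]
kept = drop_matching(attributes, "age|woman")
income_stats = stats_by_group(rows, "income", "cluster")

print(kept)
for cluster, stats in sorted(income_stats.items()):
    print(cluster, stats)
```

The alternation [tt]age|woman[/tt] is tried against each name in turn, so one pattern removes both attributes; the grouping step simply collects the values per cluster id before applying each aggregation function.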
<operator name="Root" class="Process" expanded="yes">
    <parameter key="logverbosity" value="warning"/>
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="../data/iris.aml"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
    <operator name="Aggregation" class="Aggregation">
        <list key="aggregation_attributes">
            <parameter key="a1" value="average"/>
        </list>
        <parameter key="group_by_attributes" value="cluster"/>
    </operator>
</operator>
The problem here is that the [tt]Aggregation[/tt] operator does not consider special attributes when matching the names given as parameters. Hence, you have to turn the special cluster attribute (named cluster) into a regular attribute. You can do this by placing a [tt]ChangeAttributeRole[/tt] operator between the clustering operator and the aggregation operator. You can use this code ... Hope that solves the problem.
Regards,
Tobias