Preprocessing grouped data

wessel · June 2010

Dear All,

Is it possible to pre-process some groups of data points differently from other groups?

For example, consider a dataset with 2 groups (ordered by object ID) and 4 data points in each group.

GroupID, Date, Red, Blue, Class
0000001, 12-4, 113, 122, 0
0000001, 13-4, 114, 122, 0
0000001, 14-4, 112, 121, 1
0000001, 15-4, 113, 122, 0

0000002, 12-4, 119, 122, 0
0000002, 13-4, 133, 122, 0
0000002, 14-4, 100, 121, 1
0000002, 15-4, 114, 122, 0

Is it possible to discretise the attributes Red and Blue into {High, Medium, Low} separately for each group?
H = "greater than group_mean + group_standard_deviation"
M = "between group_mean - group_standard_deviation and group_mean + group_standard_deviation"
L = "smaller than group_mean - group_standard_deviation"

So the result would be:
0000001, 12-4, M, M, 0
0000001, 13-4, H, M, 0
0000001, 14-4, L, L, 1
0000001, 15-4, M, M, 0

0000002, 12-4, M, M, 0
0000002, 13-4, H, M, 0
0000002, 14-4, L, L, 1
0000002, 15-4, M, M, 0
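
As a rough illustration of the rule above, outside RapidMiner, here is a minimal Python sketch. It assumes the sample standard deviation (`statistics.stdev`) and treats values exactly on a boundary as M. The names `ROWS`, `label`, and `discretise_by_group` are made up for this example.

```python
from collections import defaultdict
from statistics import mean, stdev

# Rows copied from the example: (GroupID, Date, Red, Blue, Class)
ROWS = [
    ("0000001", "12-4", 113, 122, 0),
    ("0000001", "13-4", 114, 122, 0),
    ("0000001", "14-4", 112, 121, 1),
    ("0000001", "15-4", 113, 122, 0),
    ("0000002", "12-4", 119, 122, 0),
    ("0000002", "13-4", 133, 122, 0),
    ("0000002", "14-4", 100, 121, 1),
    ("0000002", "15-4", 114, 122, 0),
]


def label(value, mu, sd):
    """Return H above mu + sd, L below mu - sd, and M otherwise."""
    if value > mu + sd:
        return "H"
    if value < mu - sd:
        return "L"
    return "M"


def discretise_by_group(rows):
    # Collect the row indices of each group so that the statistics
    # are computed per group and not over the whole dataset.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[0]].append(i)

    out = [None] * len(rows)
    for indices in groups.values():
        stats = {}
        for col in (2, 3):  # column 2 = Red, column 3 = Blue
            values = [rows[i][col] for i in indices]
            stats[col] = (mean(values), stdev(values))  # sample deviation (assumption)
        for i in indices:
            gid, date, red, blue, cls = rows[i]
            out[i] = (gid, date, label(red, *stats[2]), label(blue, *stats[3]), cls)
    return out


result = discretise_by_group(ROWS)
for row in result:
    print(", ".join(str(v) for v in row))
```

On this small dataset the sketch gives the same H/M/L labels as the table above.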

Best regards,

Wessel

IngoRM · June 2010

Hi,

Sure, this is easily possible. Processes like this one can look complex, however, because they contain a loop and use macros. I have created a process and uploaded it to myexperiment.org under the name "Discretization into Deviation Interval around Mean".

You can simply download and install our Community Extension via the update and installation option in the Help menu and then activate the "myExperiment Browser" in the View menu of RapidMiner. In this view, you can search for the process named above and download it directly into RapidMiner with a single click. More information can be found in my signature below.

Cheers,
Ingo

wessel · June 2010

Cool, I managed to download your experiment.

But it doesn't show how to deal with groups in data.

I want to discretise data points in group 1 differently from group 2.

Should I split my file into different datasets?
And then loop by dataset?
Because there is no loop by group operator as far as I know.

IngoRM · June 2010

Hi,

I must have overlooked the "groups" part in your request. Handling the groups does make things a bit harder, but it is still possible. You have several options:

- Divide the data set according to the groups and handle each resulting data set on its own before merging them again. I would not recommend this in general because of the inefficient memory usage. It can become tedious if you have lots of groups, but it is pretty easy if you have only a few.
- Instead of really dividing the data set, use the Loop Values operator together with Filter Examples to loop through the subsets without generating copies of the data set. For each subset you will have to calculate the macros for each attribute, as in the example process I have uploaded, and generate new attributes from the old ones with the construction operator. With Remember, Recall, and Append you can then build a new data set from those subsets (this needs more memory), or you can create the new attributes in advance and fill them on the fly with Set Data.
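
The second option, in its "create the new attributes in advance and fill them on the fly" variant, can be sketched in plain Python as follows. This is only an analogy and not RapidMiner code: the list of dicts `table` and the attribute name `Red_level` are hypothetical, and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Hypothetical in-memory table: one dict per example.
table = [
    {"group": "0000001", "Red": 113},
    {"group": "0000001", "Red": 114},
    {"group": "0000001", "Red": 112},
    {"group": "0000001", "Red": 113},
    {"group": "0000002", "Red": 119},
    {"group": "0000002", "Red": 133},
    {"group": "0000002", "Red": 100},
    {"group": "0000002", "Red": 114},
]

# Create the new attribute in advance, still empty.
for row in table:
    row["Red_level"] = None

# Loop over the distinct group values (the role of Loop Values).
for group in sorted({row["group"] for row in table}):
    # Select the matching rows (the role of Filter Examples). The list
    # holds references to the same dicts, so the row data is not copied.
    subset = [row for row in table if row["group"] == group]

    # Per-group statistics (the role of the macros).
    values = [row["Red"] for row in subset]
    mu, sd = mean(values), stdev(values)

    # Write the result back in place (the role of Set Data).
    for row in subset:
        if row["Red"] > mu + sd:
            row["Red_level"] = "H"
        elif row["Red"] < mu - sd:
            row["Red_level"] = "L"
        else:
            row["Red_level"] = "M"

print([row["Red_level"] for row in table])
```

Because the subsets only reference rows of the original table, nothing has to be appended or merged at the end.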

Have fun trying out. Processes like those are definitely far beyond the scope of the free support we are giving here at the Community Forum - but we are of course happy to support you as one of our Enterprise Edition customers. Worth a thought...

Cheers,
Ingo

wessel · June 2010

Man, using recall and remember is a nightmare!
It won't recall the dataset from the last loop.
Or it crashes at the first loop iteration because no dataset yet exists.

In case someone else is interested, you can also use "Loop Clusters".
And you can write to an AML file as an alternative to Append, Recall, and Remember.

test.csv

ObjectID, Date, Red, Blue, Class
0000001, 12-4, 113, 122, 0
0000001, 13-4, 114, 122, 0
0000001, 14-4, 112, 121, 1
0000001, 15-4, 113, 122, 0
0000002, 12-4, 119, 122, 0
0000002, 13-4, 133, 122, 0
0000002, 14-4, 100, 121, 1
0000002, 15-4, 114, 122, 0

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
<process expanded="true" height="508" width="681">
<operator activated="true" class="read_csv" compatibility="5.0.8" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="file_name" value="/home/wessel/Desktop/test.csv"/>
</operator>
<operator activated="true" class="numerical_to_polynominal" compatibility="5.0.8" expanded="true" height="76" name="Numerical to Polynominal" width="90" x="180" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="ObjectID"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="att1 role" width="90" x="315" y="30">
<parameter key="name" value="ObjectID"/>
<parameter key="target_role" value="cluster"/>
</operator>
<operator activated="true" class="read_aml" compatibility="5.0.8" expanded="true" height="60" name="Read AML" width="90" x="447" y="120">
<parameter key="attributes" value="/home/wessel/att"/>
</operator>
<operator activated="true" class="loop_clusters" compatibility="5.0.8" expanded="true" height="76" name="Loop Clusters" width="90" x="447" y="30">
<process expanded="true" height="508" width="748">
<operator activated="true" class="extract_macro" compatibility="5.0.8" expanded="true" height="60" name="avg" width="90" x="45" y="30">
<parameter key="macro" value="average"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="attribute_name" value="Red"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="5.0.8" expanded="true" height="60" name="sd" width="90" x="180" y="30">
<parameter key="macro" value="deviation"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="statistics" value="deviation"/>
<parameter key="attribute_name" value="Red"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="5.0.8" expanded="true" height="76" name="boundaries" width="90" x="315" y="30">
<list key="function_descriptions">
<parameter key="lower_bound" value="%{average} - %{deviation}"/>
<parameter key="upper_bound" value="%{average} + %{deviation}"/>
</list>
</operator>
<operator activated="true" class="discretize_by_user_specification" compatibility="5.0.8" expanded="true" height="94" name="Discretize" width="90" x="447" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Red"/>
<parameter key="include_special_attributes" value="true"/>
<list key="classes">
<parameter key="L" value="%{lower_bound}"/>
<parameter key="M" value="%{upper_bound}"/>
<parameter key="H" value="Infinity"/>
</list>
</operator>
<operator activated="true" class="write_aml" compatibility="5.0.8" expanded="true" height="60" name="Write AML" width="90" x="581" y="30">
<parameter key="example_set_file" value="/home/wessel/hmm.csv"/>
<parameter key="attribute_description_file" value="/home/wessel/att"/>
</operator>
<connect from_port="cluster subset" to_op="avg" to_port="example set"/>
<connect from_op="avg" from_port="example set" to_op="sd" to_port="example set"/>
<connect from_op="sd" from_port="example set" to_op="boundaries" to_port="through 1"/>
<connect from_op="boundaries" from_port="through 1" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Write AML" to_port="input"/>
<connect from_op="Write AML" from_port="through" to_port="out 1"/>
<portSpacing port="source_cluster subset" spacing="0"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
<connect from_op="Numerical to Polynominal" from_port="example set output" to_op="att1 role" to_port="example set input"/>
<connect from_op="att1 role" from_port="example set output" to_op="Loop Clusters" to_port="example set"/>
<connect from_op="Read AML" from_port="output" to_port="result 2"/>
<connect from_op="Loop Clusters" from_port="out 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
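
The Discretize step in the process above lists each class together with a value: L with %{lower_bound}, M with %{upper_bound}, and H with Infinity. Read as upper limits, this can be sketched in Python as below. The sketch assumes that each limit is inclusive, which may not match the operator exactly, and the function name `discretise` and the numeric limits are made up for illustration.

```python
import math
from bisect import bisect_left


def discretise(value, classes):
    """Map value to the first class whose upper limit is >= value.

    classes is a list of (class_name, upper_limit) pairs sorted by limit.
    Treating the limits as inclusive is an assumption of this sketch.
    """
    uppers = [upper for _, upper in classes]
    return classes[bisect_left(uppers, value)][0]


# Illustrative limits: lower_bound = 112.0 and upper_bound = 114.0.
classes = [("L", 112.0), ("M", 114.0), ("H", math.inf)]

for v in (100, 112.0, 113, 114, 115):
    print(v, discretise(v, classes))
```

Ending the list with an infinite limit ensures that every finite value falls into some class.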

IngoRM · June 2010

Hi,

wessel wrote: "Man, using recall and remember is a nightmare! It won't recall the dataset from the last loop. Or it crashes at the first loop iteration because no dataset yet exists."

It's exactly like in programming.

You have to initialize a data set with the correct structure first before you can actually start working. Or you use the operator Handle Exception for handling the special case in the first iteration...
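
The programming analogy can be made concrete with a small Python sketch. The dict `store` stands in for the objects kept by Remember and Recall, and the key "qwe" and both function names are hypothetical. The first variant initialises an empty result before the loop. The second handles the missing object in the first iteration as an exception.

```python
def append_with_init(chunks):
    # Initialise an empty result before the loop (Remember an empty data set).
    store = {"qwe": []}
    for chunk in chunks:
        previous = store["qwe"]          # Recall
        store["qwe"] = previous + chunk  # Append, then Remember again
    return store["qwe"]


def append_with_exception(chunks):
    store = {}
    for chunk in chunks:
        try:
            previous = store["qwe"]      # Recall fails in the first iteration
        except KeyError:
            previous = []                # nothing has been remembered yet
        store["qwe"] = previous + chunk
    return store.get("qwe", [])


print(append_with_init([[1, 2], [3], [4]]))
print(append_with_exception([[1, 2], [3], [4]]))
```

Both variants give the same result. The first keeps the loop body free of special cases, at the cost of one extra step before the loop.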

But I am glad to see that you have found a solution!

Cheers,
Ingo

wessel · June 2010

Ah, your tip worked like a charm.
I used Filter Examples with the condition "all" plus invert filter to create an empty dataset.
I remember it, recall it inside the loop, and remember the result again.
The process takes about 3 seconds for a dataset with 1M data points.
That is pretty good, I think?

I'll try to upload the process to the community myExperiment browser
under the title "Pre process data per group using recall remember append and loop clusters",
with random data generated by "two Gaussian classification" as an example.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
<process expanded="true" height="508" width="748">
<operator activated="true" class="generate_data" compatibility="5.0.8" expanded="true" height="60" name="Generate Data" width="90" x="45" y="120">
<parameter key="target_function" value="two gaussians classification"/>
<parameter key="number_examples" value="1000000"/>
<parameter key="number_of_attributes" value="1"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="set cluster" width="90" x="179" y="120">
<parameter key="name" value="label"/>
<parameter key="target_role" value="cluster"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.0.8" expanded="true" height="76" name="empty data" width="90" x="313" y="120">
<parameter key="invert_filter" value="true"/>
</operator>
<operator activated="true" class="remember" compatibility="5.0.8" expanded="true" height="60" name="init qwe" width="90" x="447" y="75">
<parameter key="name" value="qwe"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
<operator activated="true" class="loop_clusters" compatibility="5.0.8" expanded="true" height="76" name="Loop Clusters" width="90" x="514" y="210">
<process expanded="true" height="508" width="748">
<operator activated="true" class="extract_macro" compatibility="5.0.8" expanded="true" height="60" name="avg" width="90" x="45" y="30">
<parameter key="macro" value="average"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="attribute_name" value="att1"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="5.0.8" expanded="true" height="60" name="sd" width="90" x="180" y="30">
<parameter key="macro" value="deviation"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="statistics" value="deviation"/>
<parameter key="attribute_name" value="att1"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="5.0.8" expanded="true" height="76" name="boundaries" width="90" x="315" y="30">
<list key="function_descriptions">
<parameter key="lower_bound" value="%{average} - %{deviation}"/>
<parameter key="upper_bound" value="%{average} + %{deviation}"/>
</list>
</operator>
<operator activated="true" class="discretize_by_user_specification" compatibility="5.0.8" expanded="true" height="94" name="Discretize" width="90" x="447" y="30">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="att1"/>
<parameter key="include_special_attributes" value="true"/>
<list key="classes">
<parameter key="L" value="%{lower_bound}"/>
<parameter key="M" value="%{upper_bound}"/>
<parameter key="H" value="Infinity"/>
</list>
</operator>
<operator activated="true" class="recall" compatibility="5.0.8" expanded="true" height="60" name="Recall" width="90" x="45" y="255">
<parameter key="name" value="qwe"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
<operator activated="true" class="append" compatibility="5.0.8" expanded="true" height="94" name="Append" width="90" x="251" y="211"/>
<operator activated="true" class="remember" compatibility="5.0.8" expanded="true" height="60" name="qwe" width="90" x="447" y="210">
<parameter key="name" value="qwe"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
<connect from_port="cluster subset" to_op="avg" to_port="example set"/>
<connect from_op="avg" from_port="example set" to_op="sd" to_port="example set"/>
<connect from_op="sd" from_port="example set" to_op="boundaries" to_port="through 1"/>
<connect from_op="boundaries" from_port="through 1" to_op="Discretize" to_port="example set input"/>
<connect from_op="Discretize" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="qwe" to_port="store"/>
<portSpacing port="source_cluster subset" spacing="0"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
</process>
</operator>
<operator activated="true" class="recall" compatibility="5.0.8" expanded="true" height="60" name="qwe final" width="90" x="648" y="120">
<parameter key="name" value="qwe"/>
<parameter key="io_object" value="ExampleSet"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="set cluster" to_port="example set input"/>
<connect from_op="set cluster" from_port="example set output" to_op="empty data" to_port="example set input"/>
<connect from_op="empty data" from_port="example set output" to_op="init qwe" to_port="store"/>
<connect from_op="empty data" from_port="original" to_op="Loop Clusters" to_port="example set"/>
<connect from_op="qwe final" from_port="result" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

IngoRM · June 2010

Hi again,

Great to hear that you managed this. And a few seconds for a million data points sounds reasonable.

wessel wrote: "I'll try and upload the process to the community myExperiment browser."

Yes, please! I would really like to see more people sharing their processes there. Those "real-life" processes can serve others to learn more about RapidMiner and what can be done with it. Thanks for that!

Cheers,
Ingo
