Aggregate Duplicates

iason Member Posts: 20 Contributor II
edited November 2019 in Help
Can you suggest a method to remove duplicate examples and add a "count" attribute to the remaining unique items?
I would like to do that to reduce the size of the dataset and then use this counter attribute with a k-NN operator. Is that even possible in RM?
Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    The Aggregate operator is your friend. Here's an example that groups by the label attribute and counts the examples in each group:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
        <process expanded="true" height="206" width="413">
          <operator activated="true" class="generate_data" compatibility="5.1.011" expanded="true" height="60" name="Generate Data" width="90" x="112" y="75">
            <parameter key="target_function" value="three ring clusters"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="aggregate" compatibility="5.1.011" expanded="true" height="76" name="Aggregate" width="90" x="313" y="75">
            <list key="aggregation_attributes">
              <parameter key="label" value="count"/>
            </list>
            <parameter key="group_by_attributes" value="|label"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards

    Andrew
  • iason Member Posts: 20 Contributor II
    Thank you for your reply.
    If I understand correctly, you suggest aggregating duplicates using the aggregate operator and "group by" all attributes.
    How can this be utilized to make a k-NN faster?
    Having 20 million samples with 20 attributes but only 1 million possible attribute combinations will result in a dataset of 1 million examples with 21 attributes.
    How will k-NN work on that (i.e., would it use the 21st attribute as a weight or count)?

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    I think k-NN would still work, but the new aggregation attribute would need to be selected carefully to ensure that unseen data ends up near representative examples.

    As always, an experiment is needed.

    regards

    Andrew
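
To make the "count as weight" idea from this thread concrete, here is a minimal sketch in plain Python, outside RapidMiner. It is not how RapidMiner's k-NN operator works, and the function names `aggregate_duplicates` and `count_aware_knn` are made up for illustration. It assumes numeric attributes, Euclidean distance, and a positive k. The first function collapses identical examples into unique rows with a count. The second lets each unique row vote with its count and stops once k original examples have been covered, which should mimic plain k-NN on the full dataset apart from tie-breaking.

```python
import math
from collections import Counter, defaultdict


def aggregate_duplicates(rows):
    """Collapse identical (features, label) rows into (features, label, count)."""
    counts = Counter(rows)  # rows must be hashable, e.g. tuples
    return [(features, label, n) for (features, label), n in counts.items()]


def count_aware_knn(unique_rows, query, k):
    """Majority vote over the k nearest ORIGINAL examples.

    Each unique row stands in for `count` identical originals, so rows are
    consumed nearest-first until k originals have voted.
    """
    ranked = sorted(unique_rows, key=lambda r: math.dist(r[0], query))
    votes = defaultdict(int)
    remaining = k
    for features, label, count in ranked:
        if remaining <= 0:
            break
        used = min(count, remaining)  # the last row may only partly fit into k
        votes[label] += used
        remaining -= used
    return max(votes, key=votes.get)


# Toy data (hypothetical): one "a" near the query, five duplicates of a "b",
# and two duplicates of a distant "a".
rows = [((0.0, 0.0), "a")] + [((1.0, 0.0), "b")] * 5 + [((5.0, 5.0), "a")] * 2

unique_rows = aggregate_duplicates(rows)
prediction = count_aware_knn(unique_rows, query=(0.1, 0.0), k=3)

print(unique_rows)
print(prediction)
```

In this toy case, ignoring the count and taking the three nearest unique rows would give two votes for "a" and one for "b", whereas the count-aware vote covers one "a" and two "b" originals. Whether a given k-NN implementation can consume such a count or weight column is something to check in its documentation and, as Andrew says, by experiment.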