
How to log the number of positive and negative examples?

DrGary Member Posts: 8 Contributor II
edited November 2018 in Help
When you stop the GUI on an ExampleSet, you can look at the "label" attribute row to see how many positive and negative examples there are in the dataset. But I want to run from the command line and see the dataset class counts in the log.

The DataStatistics operator will write dataset info to the log, but it doesn't include the counts of the label classes. You can add in a DataMacroDefinition operator, but it only offers the total ExampleSet size, not the class counts.

Is there a way to log the class sizes?

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you could first filter the example set by label value and then count the examples using DataStatistics or DataMacroDefinition.
    For this purpose I recommend a ValueIterator, which gives you each value of an attribute as a macro, so you can filter the examples accordingly.
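
The filter-then-count idea suggested above can be sketched outside RapidMiner in a few lines of plain Python. This is only an illustration of the logic, not RapidMiner API: the `labels` list and the function name `class_sizes_by_filtering` are hypothetical, and the comments map each step to the operator role it loosely corresponds to.

```python
# Hypothetical label column; the values are made up for illustration.
labels = ["positive", "negative", "positive", "positive", "negative"]


def class_sizes_by_filtering(labels):
    """Count each class by filtering on every distinct label value in turn."""
    sizes = {}
    for value in sorted(set(labels)):               # iterate over the distinct values
        subset = [x for x in labels if x == value]  # keep only examples with this value
        sizes[value] = len(subset)                  # size of the filtered subset
    return sizes


for value, size in class_sizes_by_filtering(labels).items():
    print(f"class '{value}' size = {size}")
```

Each distinct value triggers one full pass over the data, which is why this approach does more work than a single grouped count.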

    Greetings,
      Sebastian
  • DrGary Member Posts: 8 Contributor II
    Sebastian, thanks for the suggestion. Here's what I came up with:

            <operator name="Count class sizes" class="OperatorChain" expanded="yes">
                <operator name="ValueIterator" class="ValueIterator" expanded="no">
                    <parameter key="attribute" value="target_"/>
                    <parameter key="iteration_macro" value="target_value"/>
                    <operator name="ExampleFilter" class="ExampleFilter">
                        <parameter key="condition_class" value="attribute_value_filter"/>
                        <parameter key="parameter_string" value="target_=%{target_value}"/>
                    </operator>
                    <operator name="DataMacroDefinition" class="DataMacroDefinition">
                        <parameter key="macro" value="class_size"/>
                    </operator>
                    <operator name="echo the target value" class="CommandLineOperator">
                        <parameter key="command" value="echo &quot; class &#39;%{target_value}&#39; size = %{class_size}&quot;"/>
                        <parameter key="log_stderr" value="false"/>
                    </operator>
                </operator>
                <operator name="ExampleSetMerge" class="ExampleSetMerge">
                </operator>
            </operator>

    Seems to work pretty well. Is there a way to keep the original ExampleSet and drop the new ones instead of merging the new ones?

    I pushed it with large datasets, and it doesn't seem to use as much memory as you might expect from creating new ExampleSets. I assume that's because views into the current ExampleSet are being created and rows are not duplicated.

    Still, it seems like a lot of overhead for a simple count...
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you could use the IOStorer and IORetriever for storing it if it is not possible to pass it the usual way. IOMultiplier and IOConsumer might help as well.
    In general I would recommend switching to RM 5.0 RC, because the flow layout gives you a much more intuitive way of handling such problems.

    Greetings,
      Sebastian
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Hmm, maybe I misunderstood, but why not simply aggregate and count? Use the label as the group-by attribute and a count of the label as the aggregation attribute. Just one operator and you are done  ;)

    Here is the process for RM 5 RC (based on the Iris sample data set):

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="280" width="413">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="112" y="165">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="246" y="165">
            <list key="aggregation_attributes">
              <parameter key="label" value="count"/>
            </list>
            <parameter key="group_by_attributes" value="label"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Aggregate" to_port="example set input"/>
          <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
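
For readers who want to see the group-by-and-count idea behind the process above without opening RapidMiner, here is a minimal Python sketch. It is an analogy, not RapidMiner code: the `labels` list is hypothetical stand-in data, and `class_counts` is a name chosen for this example.

```python
from collections import Counter

# Hypothetical label column standing in for the label attribute of a data set.
labels = ["setosa", "virginica", "setosa", "versicolor", "virginica", "setosa"]


def class_counts(labels):
    """Group by label value and count the members of each group in one pass."""
    return Counter(labels)


counts = class_counts(labels)
for value, n in sorted(counts.items()):
    print(f"{value}: {n}")
```

Unlike filtering once per value, this makes a single pass over the data and yields one count per distinct label value.
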
    Cheers,
    Ingo
  • ui3o Member Posts: 9 Contributor II
    Hi,

    can anyone help me set up a process that filters out examples for which an attribute has a rarely occurring value? The Aggregate operator (count) calculates the occurrences as described above, but how can I use the result to filter?

    Thanks for advice.

    Greetings,


    ui3o
  • ui3o Member Posts: 9 Contributor II
    Hi there,

    does anybody have an idea on that?
    Thanks for your help.

    Greetings,


    ui3o
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hey, creating a process like this is usually more of a consulting task than a simple example process for technical support. However, I just felt like "it would be fun to create a nice looping process before the holidays", and here we are:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="341" width="614">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="single gaussian cluster"/>
            <parameter key="number_examples" value="500"/>
            <parameter key="number_of_attributes" value="3"/>
          </operator>
          <operator activated="true" class="discretize_by_bins" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
            <parameter key="number_of_bins" value="5"/>
            <parameter key="range_name_type" value="short"/>
          </operator>
          <operator activated="true" class="remember" expanded="true" height="60" name="Remember" width="90" x="313" y="30">
            <parameter key="name" value="filtered_data"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <operator activated="true" class="loop_attributes" expanded="true" height="60" name="Loop Attributes" width="90" x="447" y="30">
            <process expanded="true" height="603" width="626">
              <operator activated="true" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="45" y="30">
                <list key="aggregation_attributes">
                  <parameter key="label" value="count"/>
                </list>
                <parameter key="group_by_attributes" value="%{loop_attribute}"/>
              </operator>
              <operator activated="true" class="sort" expanded="true" height="76" name="Sort" width="90" x="179" y="30">
                <parameter key="attribute_name" value="count(label)"/>
                <parameter key="sorting_direction" value="decreasing"/>
              </operator>
              <operator activated="true" class="filter_example_range" expanded="true" height="76" name="Filter Example Range" width="90" x="313" y="30">
                <parameter key="first_example" value="1"/>
                <parameter key="last_example" value="3"/>
                <parameter key="invert_filter" value="true"/>
              </operator>
              <operator activated="true" class="loop_values" expanded="true" height="60" name="Loop Values" width="90" x="447" y="30">
                <parameter key="attribute" value="%{loop_attribute}"/>
                <process expanded="true" height="603" width="626">
                  <operator activated="true" class="recall" expanded="true" height="60" name="Recall (2)" width="90" x="45" y="30">
                    <parameter key="name" value="filtered_data"/>
                    <parameter key="io_object" value="ExampleSet"/>
                  </operator>
                  <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="%{loop_attribute} = %{loop_value}"/>
                    <parameter key="invert_filter" value="true"/>
                  </operator>
                  <operator activated="true" class="remember" expanded="true" height="60" name="Remember (2)" width="90" x="313" y="30">
                    <parameter key="name" value="filtered_data"/>
                    <parameter key="io_object" value="ExampleSet"/>
                  </operator>
                  <connect from_op="Recall (2)" from_port="result" to_op="Filter Examples" to_port="example set input"/>
                  <connect from_op="Filter Examples" from_port="example set output" to_op="Remember (2)" to_port="store"/>
                  <portSpacing port="source_example set" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                </process>
              </operator>
              <connect from_port="example set" to_op="Aggregate" to_port="example set input"/>
              <connect from_op="Aggregate" from_port="example set output" to_op="Sort" to_port="example set input"/>
              <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
              <connect from_op="Filter Example Range" from_port="example set output" to_op="Loop Values" to_port="example set"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="recall" expanded="true" height="60" name="Recall" width="90" x="447" y="120">
            <parameter key="name" value="filtered_data"/>
            <parameter key="io_object" value="ExampleSet"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Discretize" to_port="example set input"/>
          <connect from_op="Discretize" from_port="example set output" to_op="Remember" to_port="store"/>
          <connect from_op="Remember" from_port="stored" to_op="Loop Attributes" to_port="example set"/>
          <connect from_op="Recall" from_port="result" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    The first operators simply create a Gaussian-distributed data set and discretize it to produce rarely occurring values. You will of course have to adapt some of the parameters for your own data set.
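
As a rough reading of what the process above does, the following Python sketch keeps, for every attribute, only the rows whose value is among that attribute's most frequent values. It is an approximation under stated assumptions rather than a translation of the operators: `drop_rare_values` and `keep_top` are hypothetical names, rows are modeled as plain dictionaries, every column is looped over (the RapidMiner loop may skip special attributes such as the label), and ties between equally frequent values may be broken differently than by RapidMiner's Sort operator.

```python
from collections import Counter


def drop_rare_values(rows, keep_top=3):
    """Keep rows whose value in every column is among that column's
    keep_top most frequent values."""
    if not rows:
        return []
    filtered = list(rows)
    for column in rows[0].keys():
        # Frequencies are taken from the original rows, as the loop in the
        # process receives the unfiltered data set on every iteration.
        counts = Counter(row[column] for row in rows)
        frequent = {value for value, _ in counts.most_common(keep_top)}
        # The running result plays the role of the remembered data set.
        filtered = [row for row in filtered if row[column] in frequent]
    return filtered


# Hypothetical data: "z" is the rarely occurring value of attribute "a".
rows = [{"a": "x"}] * 3 + [{"a": "y"}] * 2 + [{"a": "z"}]
print(len(drop_rare_values(rows, keep_top=2)))
```

Filtering by a fixed number of top values mirrors the range filter in the process; a minimum-count threshold would be an alternative design if "rare" is better defined by frequency than by rank.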

    Cheers and happy holidays,
    Ingo
  • ui3o Member Posts: 9 Contributor II
    Ingo,

    thanks a lot! I didn't know that my question was not just a matter of setting the right parameter in the right operator... Great work, and thanks again for your effort.

    Best regards,


    ui3o