"to ask about data sampling"

m_r_nour Member Posts: 35 Maven
edited May 2019 in Help
Hi all

I have an unbalanced dataset. One class contains about 500 times more examples than the other classes.

I want to resample it so that every class has the same number of examples.
How can I do that?
I tried the sampling operators, but all of them keep the original ratio between the classes when they resample.

Thank you in advance for your time and consideration.

Regards
REZA
Answers

  •
    land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    Which RapidMiner version are you using?

    Greetings,
      Sebastian
  •
    m_r_nour Member Posts: 35 Maven

    Version 4.6.


    To clarify, I want to repeat this balanced sampling several times and average the performance results, so that I know the overall performance of this method.


    thanks

    Regards
    REZA
  •
    land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I think there are several possibilities you could use:
    If you are going to use a learner that supports example weights, you could use EqualLabelWeighting. This does not sample the examples at all; instead it equalizes the total weight assigned to each label. That might be even better, because no examples are lost.
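To make the weighting idea concrete outside RapidMiner, here is a minimal Python sketch. It is not the EqualLabelWeighting implementation, and the exact normalization RapidMiner applies may differ; `equal_label_weights` is a made-up name for illustration. It only shows that each class can end up with the same total weight while every example is kept.

```python
from collections import Counter

def equal_label_weights(labels):
    """Return one weight per example so that every label's weights sum to the same total.

    Hypothetical helper for illustration; not a RapidMiner API.
    """
    counts = Counter(labels)
    n_labels = len(counts)
    total = len(labels)
    # Each class gets total / n_labels weight overall, split evenly among its examples.
    return [total / (n_labels * counts[y]) for y in labels]

# A 500:1 imbalance, as in the question.
labels = ["a"] * 500 + ["b"]
weights = equal_label_weights(labels)
```

With this choice, the single "b" example carries as much weight as all 500 "a" examples together, and the weights still sum to the number of examples.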
    Another possibility would be to split the example set by label, sample each subset down to the same size, and then merge the subsets. Voilà: you have a balanced example set.
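The split, sample, and merge steps can be sketched in plain Python as follows. This is only an illustration of the logic, not RapidMiner code; `balanced_sample`, `label_of`, and `per_class` are hypothetical names, and the sketch assumes downsampling every class to the size of the smallest one unless a size is given.

```python
import random
from collections import defaultdict

def balanced_sample(examples, label_of, per_class=None, seed=None):
    """Split by label, sample each group to the same size, and merge the groups."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[label_of(ex)].append(ex)
    if per_class is None:
        # Assumption: use the size of the smallest class.
        per_class = min(len(g) for g in groups.values())
    merged = []
    for label, group in groups.items():
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} examples")
        merged.extend(rng.sample(group, per_class))  # sampling without replacement
    rng.shuffle(merged)
    return merged

# Toy data with a 500:1 imbalance: (id, label) pairs.
data = [(i, "big") for i in range(5000)] + [(i, "small") for i in range(10)]
sample = balanced_sample(data, label_of=lambda ex: ex[1], seed=42)
```

Here `sample` contains 10 examples of each class. Note that this discards most of the majority class, which is the trade-off compared with weighting.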
    If this becomes unwieldy because you have too many label values, you could use the ValueIterator together with an IOStorer and an IORetriever...
    OK, that sounds rather complex. Here's how it would work:
    <operator name="Root" class="Process" expanded="yes">
        <!-- Demo data only; replace this generator with your own data source. -->
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator" breakpoints="after">
            <parameter key="target_function" value="polynomial classification"/>
            <parameter key="number_examples" value="1000"/>
        </operator>
        <!-- Run the inner operators once for every value of the label attribute. -->
        <operator name="ValueIterator" class="ValueIterator" expanded="yes">
            <parameter key="attribute" value="label"/>
            <!-- Keep only the examples of the current label value. -->
            <operator name="ExampleFilter" class="ExampleFilter">
                <parameter key="condition_class" value="attribute_value_filter"/>
                <parameter key="parameter_string" value="label = %{loop_value}"/>
            </operator>
            <!-- Sample a fixed number of examples from this class (operator defaults are used here). -->
            <operator name="AbsoluteSampling" class="AbsoluteSampling">
            </operator>
            <!-- Merge with the previously stored set; in the first iteration nothing is stored yet, so the failure is ignored. -->
            <operator name="Only do if already stored" class="ExceptionHandling" expanded="yes">
                <operator name="Retrieve" class="IORetriever">
                    <parameter key="name" value="SetStorage"/>
                    <parameter key="io_object" value="ExampleSet"/>
                </operator>
                <operator name="ExampleSetMerge" class="ExampleSetMerge">
                </operator>
            </operator>
            <!-- Store the merged result so far under the name SetStorage. -->
            <operator name="In every case: Store" class="IOStorer">
                <parameter key="name" value="SetStorage"/>
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
        </operator>
        <!-- Discard the leftover example set, then fetch the balanced result. -->
        <operator name="IOConsumer" class="IOConsumer">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="Final Retrieve" class="IORetriever">
            <parameter key="name" value="SetStorage"/>
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
    </operator>
    I hope this helps you understand what I'm suggesting.
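As for repeating the balanced sampling several times and averaging the performance, the logic could look roughly like this Python sketch. It is not a RapidMiner process; `balanced_indices`, `average_over_resamples`, and the `evaluate` callback are hypothetical names, and `evaluate` stands in for whatever training and testing you run on each balanced subset.

```python
import random
from statistics import mean

def balanced_indices(labels, rng):
    """Pick the same number of example indices from every class."""
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)
    size = min(len(idx) for idx in by_label.values())  # smallest class sets the size
    picked = []
    for idx in by_label.values():
        picked.extend(rng.sample(idx, size))
    return picked

def average_over_resamples(labels, evaluate, runs=10, seed=0):
    """Draw `runs` balanced samples and average the score returned by `evaluate`."""
    rng = random.Random(seed)
    scores = [evaluate(balanced_indices(labels, rng)) for _ in range(runs)]
    return mean(scores)

# Placeholder evaluation: fraction of minority examples in the sample.
# In practice this would train and test a model and return its performance.
labels = ["a"] * 500 + ["b"] * 5
avg = average_over_resamples(
    labels,
    evaluate=lambda idx: sum(labels[i] == "b" for i in idx) / len(idx),
    runs=4,
)
```

Each run draws a different balanced subset, so the averaged score is less dependent on which majority-class examples happened to be picked.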

    Greetings,
      Sebastian
  •
    m_r_nour Member Posts: 35 Maven
    thanks