"to ask about data sampling"

m_r_nour · November 2009

Hi all

I have an unbalanced dataset. One class contains 500 times as many examples as the other groups.

I want to resample so that every group has the same number of examples.
How can I do that?
I tried the sampling techniques, but all of them preserve the ratio between the groups when they resample.

Thank you in advance for your time and consideration.

Regards
REZA

land · November 2009

Hi,
Which RapidMiner version are you using?

Greetings,
Sebastian

m_r_nour · November 2009

Version 4.6.

To clarify, I want to run this balanced sampling several times and average the performance results, so that I get the overall performance of this method.

thanks

Regards
REZA

land · November 2009

Hi,
I think there are several possibilities you could use:
If you are going to use a learner that supports example weights, you could use the EqualLabelWeighting operator. It does not sample the examples at all; instead it equalizes the total weight assigned to each label. That might be even better, because no examples are lost.
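To illustrate the weighting idea outside RapidMiner, here is a minimal Python sketch. It is not the EqualLabelWeighting implementation; the function name `equal_label_weights` and the choice of a total weight of 1.0 per label are assumptions made only for this example.

```python
from collections import Counter


def equal_label_weights(labels):
    """Return one weight per example so that every label has the same total weight.

    Hypothetical helper for illustration: each label receives a total
    weight of 1.0, split evenly among the examples carrying that label.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# A class "a" that is 500 times larger than class "b".
labels = ["a"] * 500 + ["b"]
weights = equal_label_weights(labels)
```

Every example is kept; the single "b" example simply counts as much as all 500 "a" examples together.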
Another possibility would be to split the example set by label, once per label value, and sample each subset to the same size. After that, all subsets have to be merged again, and voilà: you have a balanced example set.
If this becomes unwieldy because you have too many label values, you could use the ValueIterator together with an IOStorer and an IORetriever...
OK, that seems rather complex. Here's how it would work:

<operator name="Root" class="Process" expanded="yes">
    <!-- Generate 1000 labelled examples as demo data -->
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator" breakpoints="after">
        <parameter key="target_function" value="polynomial classification"/>
        <parameter key="number_examples" value="1000"/>
    </operator>
    <!-- Loop once per value of the label attribute -->
    <operator name="ValueIterator" class="ValueIterator" expanded="yes">
        <parameter key="attribute" value="label"/>
        <!-- Keep only the examples with the current label value -->
        <operator name="ExampleFilter" class="ExampleFilter">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="label = %{loop_value}"/>
        </operator>
        <!-- No sample size is set here, so the operator's default applies -->
        <operator name="AbsoluteSampling" class="AbsoluteSampling">
        </operator>
        <!-- Fails harmlessly in the first iteration, when nothing is stored yet -->
        <operator name="Only do if already stored" class="ExceptionHandling" expanded="yes">
            <operator name="Retrieve" class="IORetriever">
                <parameter key="name" value="SetStorage"/>
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
            <operator name="ExampleSetMerge" class="ExampleSetMerge">
            </operator>
        </operator>
        <operator name="In every case: Store" class="IOStorer">
            <parameter key="name" value="SetStorage"/>
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
    </operator>
    <!-- Drop the leftover example set, then fetch the merged, balanced one -->
    <operator name="IOConsumer" class="IOConsumer">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="Final Retrieve" class="IORetriever">
        <parameter key="name" value="SetStorage"/>
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
</operator>
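The logic of the process above (filter per label value, sample each subset to the same size, merge everything) can also be sketched in plain Python. This is only an illustration of the idea, not RapidMiner code; `balanced_sample`, `label_of`, `per_label` and `seed` are names invented for this sketch, and using the smallest group size as the default sample size is an assumption.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, label_of, per_label=None, seed=None):
    """Draw the same number of rows from every label group and merge them.

    Hypothetical helper: `label_of` maps a row to its label. The sample
    size per label defaults to the size of the smallest group (assumption)
    and is capped there so that sampling without replacement always works.
    """
    rng = random.Random(seed)

    # Split the rows into one group per label value.
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    if not groups:
        return []

    size = min(len(group) for group in groups.values())
    if per_label is not None:
        size = min(size, per_label)

    # Sample each group to the same size and merge the results.
    merged = []
    for group in groups.values():
        merged.extend(rng.sample(group, size))
    return merged


# 500 majority rows against 5 minority rows.
rows = [(i, "majority") for i in range(500)] + [(i, "minority") for i in range(5)]
sample = balanced_sample(rows, label_of=lambda row: row[1], seed=0)
counts = Counter(label for _, label in sample)
```

Calling the function again with a different `seed` gives a different balanced draw, which is one way to repeat the sampling several times and average the performance results afterwards.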

I hope this helps you understand what I'm suggesting.

Greetings,
Sebastian

m_r_nour · November 2009

thanks
