Train on subset of data, XValidate on full set of data

noah977 Member Posts: 32  Guru
edited November 2018 in Help
Hi,

Thanks for all the help so far.  I couldn't have gotten this far without all the advice of the people here.  You guys are great!

My next challenging question...

I want to train a model on a subset of the data, but then test it during the XV stage on the FULL set of data.

For example, imagine data where the label is height and the input variable is birth-weight.
I want to say,

  1) Train an SVM to regress height from birth-weight, but ONLY use birth-weight > 6 kg for training.
  2) TEST using XValidation against ALL the input data.

The premise is that learning from a subset of data will create a more accurate model to use against all the data.  (yes, for my application, this has been proven to work.)

So as I iterate through different values of the SVM parameters, I want to train on a subset, but test on the full set.

How can I do this in RM?? 

Thanks

Answers

  • steffen Member Posts: 347  Guru
    Hello

    I am afraid I may have misunderstood you. As far as I understand, by "all input data" you mean "input data without restrictions".
    If so, you can put an ExampleFilter in the training step of the XValidation, so that only the training part of each fold is filtered. Something like this:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
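                <operator name="ExampleFilter" class="ExampleFilter">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <!-- birth_weight is an assumed attribute name; use your own column and threshold -->
                    <parameter key="parameter_string" value="birth_weight &gt; 6"/>
                </operator>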
                <operator name="LibSVMLearner" class="LibSVMLearner">
                </operator>
            </operator>
            <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                </operator>
            </operator>
        </operator>
    </operator>
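
For readers who want to see the logic outside of RM, here is a minimal Python sketch of what the process above is meant to do: in every fold, only the training part is filtered, while the held-out part is scored unfiltered. It is an illustration under assumptions, not the RapidMiner implementation. A plain least-squares line stands in for the SVM learner, and the names `fit_line`, `filtered_cross_validation`, `rows` and `keep` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in for the SVM)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def filtered_cross_validation(rows, keep, k=5, seed=0):
    """k-fold CV that trains on filtered rows and tests on all held-out rows.

    rows: list of (x, y) pairs, keep: predicate on x.
    Returns (rmse, number_of_tested_rows).
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    squared_errors = []
    for i, test_fold in enumerate(folds):
        # Training part: every other fold, filtered (the ExampleFilter step).
        train = [
            row
            for j, fold in enumerate(folds)
            if j != i
            for row in fold
            if keep(row[0])
        ]
        if len(train) < 2:
            raise ValueError("filter left too few training rows in a fold")
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        # Test part: the held-out fold is NOT filtered.
        for x, y in test_fold:
            squared_errors.append((slope * x + intercept - y) ** 2)
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse, len(squared_errors)


if __name__ == "__main__":
    # Synthetic, made-up data purely for demonstration.
    data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
    print(filtered_cross_validation(data, keep=lambda x: x > 6, k=5))
```

Note that every row is still scored by a model that never saw it during training, so "test on the full set" here means that the unfiltered held-out folds together cover all rows.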
    noah977 wrote:

    The premise is that learning from a subset of data will create a more accurate model to use against all the data.  (yes, for my application, this has been proven to work.)

    That's interesting. Training only on a filtered subset normally introduces what is called "sample selection bias" or "incidental truncation", and such a bias normally HARMS performance on the unfiltered data. I am interested in this problem, so I would really appreciate some comments about this issue from your side (maybe per PM?) :).

    kind regards,

    Steffen
