Train on subset of data, XValidate on full set of data

noah977 Member Posts: 32 Maven
edited November 2018 in Help
Hi,

Thanks for all the help so far.  I couldn't have gotten this far without all the advice of the people here.  You guys are great!

My next challenging question...

I want to train a model on a subset of the data, but then test it during the XV stage on the FULL set of data.

For example, imagine data where the label is height and the input variable is birth-weight.
I want to say,

  1) Train an SVM to regress height from birth-weight, but ONLY use birth-weight > 6 kg for training.
  2) TEST using XValidation against ALL the input data.

The premise is that learning from a subset of data will create a more accurate model to use against all the data.  (yes, for my application, this has been proven to work.)

So as I iterate through different values of the SVM parameters, I want to train on a subset, but test on the full set.

How can I do this in RM?? 

Thanks

Answers

  • steffen Member Posts: 347 Maven
    Hello

    I am afraid I may have misunderstood you. As far as I understand, by "all input data" you mean "input data without restrictions".
    If so, you can use ExampleFilter in the training step only, something like this:

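    <operator name="Root" class="Process" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
            <!-- Training step: filter first, so the learner only sees the subset. -->
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ExampleFilter" class="ExampleFilter">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <!-- Attribute name is assumed from the example in the question; adjust to your data. -->
                    <parameter key="parameter_string" value="birth_weight &gt; 6"/>
                </operator>
                <operator name="LibSVMLearner" class="LibSVMLearner">
                    <!-- The label (height) is numerical, so use a regression SVM type. -->
                    <parameter key="svm_type" value="epsilon-SVR"/>
                </operator>
            </operator>
            <!-- Test step: no filter here, so the whole test fold is scored. -->
            <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                </operator>
                <operator name="RegressionPerformance" class="RegressionPerformance">
                    <parameter key="root_mean_squared_error" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>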
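
For readers who want to see the logic outside RapidMiner, here is a minimal sketch of the same idea in plain Python: in each cross-validation fold, the filter is applied to the training part only, and the model is scored on the complete held-out fold. Everything in it is an assumption made for illustration: the names `fit_line`, `rmse`, `cross_validate` and `keep` are hypothetical, a least-squares line stands in for the SVM, and the data is synthetic. Only the threshold of 6 comes from the question.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (a simple stand-in for the SVM learner)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on the given examples."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in zip(xs, ys)) ** 0.5


def cross_validate(xs, ys, keep, k=5, seed=0):
    """k-fold CV that filters only the training part of each fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Training examples: outside the test fold AND passing the filter.
        train_idx = [i for i in idx if i not in held_out and keep(xs[i])]
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Test examples: the whole held-out fold, with no filter applied.
        scores.append(rmse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return mean(scores)


# Synthetic data, for illustration only (not real measurements).
rng = random.Random(42)
weights = [rng.uniform(2.0, 10.0) for _ in range(200)]
heights = [150.0 + 3.0 * w + rng.gauss(0.0, 1.0) for w in weights]

score = cross_validate(weights, heights, keep=lambda w: w > 6.0)
print(f"mean RMSE over folds: {score:.3f}")
```

Putting the filter inside the training branch, rather than in front of the whole validation, is what keeps the test folds unrestricted. Filtering before the split would also shrink the test data.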
    noah977 wrote:

        The premise is that learning from a subset of data will create a more accurate model to use against all the data.  (yes, for my application, this has been proven to work.)

    That's interesting. Normally, training on a selected subset like this introduces something called "sample selection bias" or "incidental truncation", and such a bias normally HARMS performance. I am interested in this problem, so I would really appreciate some comments about this issue from your side (maybe by PM?) :).

    kind regards,

    Steffen
