Filter test data attributes to be contained in the range of the training dataset

Fred12 Member Posts: 344 Unicorn
edited November 2018 in Help

Hi,

I have a training dataset and a test dataset. I want to clean the test set so that the values of each attribute fall within the range (min/max) of the same attribute in the training dataset.

Is there an easy way to do that? For now, I have to enter every attribute range by hand in a Filter Examples operator. That is tedious, and I'm sure there is an easier way to do it.

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226 Unicorn

    I am interpreting your question to mean that you are trying to cap the values of a variable so that they do not fall outside the range observed in your training set, by replacing values outside that range with the observed min or max from the training set.

     

    If that is correct, what are you trying to accomplish through this transformation? In many cases, this would be a superfluous step and would actually represent an unnecessary loss of information. In the past, when people have talked to me about doing this type of transformation, their concern was the treatment of extreme outliers and their potential influence on a scoring model.

     

    However, many types of predictive models don't use the variables in their scalar form, but rather look at ranges. Anything in the tree/rule/random forest/SVM family is already largely insensitive to outliers because of this feature.

     

    The most potentially troublesome model forms would be anything based on a regression framework (linear or logistic) where a raw attribute is used as a predictor. If that is your concern, let me suggest two alternative methods of dealing with the problem.

     

    First, you can apply binning to your attribute values in preprocessing and then run the regression on the bins rather than on the raw data. In many cases, this produces results that are nearly as accurate as the original regression, but the results are essentially insulated from the effect of outliers: the bin cutpoints (depending on whether you bin by range, by frequency, etc.) are not really affected by outliers, because the outliers all fall in the highest or lowest bins.
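    The binning idea above can be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: equal-width bins, cutpoints taken from the training values, and hypothetical helper names (`equal_width_cutpoints`, `assign_bin`) that do not correspond to any RapidMiner operator.

```python
from bisect import bisect_right

def equal_width_cutpoints(values, n_bins):
    """Interior cutpoints for n_bins equal-width bins over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def assign_bin(value, cutpoints):
    """Bin index in 0..len(cutpoints); anything beyond the range lands in an end bin."""
    return bisect_right(cutpoints, value)

# Hypothetical training and test values for a single attribute.
train = [10.0, 12.0, 15.0, 18.0, 20.0]
test = [11.0, 16.0, 500.0, -40.0]

cuts = equal_width_cutpoints(train, 4)           # cutpoints come from training data only
test_bins = [assign_bin(v, cuts) for v in test]  # extreme values fall into bin 0 or bin 3
print(cuts, test_bins)
```

    In this sketch the test values 500.0 and -40.0 simply end up in the highest and lowest bin, so a model trained on bin indices never sees their magnitude.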

     

    A somewhat similar approach is to normalize all inputs by z-score transformation before putting them into the regression. The advantage of this approach is that you retain the true scalar relationship between the individual values, but you can very easily filter out examples that are outliers (either low or high) because all attributes are now scaled consistently. A single "Filter Examples" operator can be used, or a "Generate Attributes" operator in which you simply substitute the min and max values you are willing to allow, in terms of standard deviations from the mean.
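    As a rough illustration of the z-score approach, here is a minimal Python sketch. It assumes the sample standard deviation and a limit of 3 standard deviations; both choices, and the helper names, are assumptions made for the example and are not taken from RapidMiner.

```python
from statistics import mean, stdev

def to_z(value, mu, sigma):
    """Z-score of a value relative to the training mean and standard deviation."""
    return (value - mu) / sigma

def cap(z, limit=3.0):
    """Replace z-scores beyond +/- limit with the limit itself."""
    return max(-limit, min(limit, z))

# Hypothetical values for one attribute.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 40.0, -30.0]

mu, sigma = mean(train), stdev(train)      # parameters come from the training set only
z = [to_z(v, mu, sigma) for v in test]

capped = [cap(v) for v in z]                            # option 1: substitute the limit
kept = [v for v, zv in zip(test, z) if abs(zv) <= 3.0]  # option 2: filter the example out
print(capped, kept)
```

    Because every attribute is on the same scale after the transformation, the same limit can be reused for all of them.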

     

    If you really need to do the exact transformation you describe (keep the raw values but replace high or low outlier values), one way would be to use Loop Attributes to find the min and max of each attribute in the training set, store those values as macros, and then generate new attributes on the test data using the min and max from the training set. It would be tedious, though, so you may want to think first about whether either of the simpler methods I describe would be suitable.
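    The loop described above (collect min and max per attribute from the training set, then apply them to the test set) looks roughly like this in plain Python. The row format (a list of dicts) and the function names are assumptions for the sketch; it is not a RapidMiner process.

```python
def training_ranges(train_rows):
    """Return {attribute: (min, max)} as observed in the training rows."""
    return {
        name: (min(r[name] for r in train_rows), max(r[name] for r in train_rows))
        for name in train_rows[0]
    }

def clamp_rows(rows, ranges):
    """Copy the rows, replacing out-of-range values with the training min or max."""
    return [
        {name: max(lo, min(hi, row[name])) for name, (lo, hi) in ranges.items()}
        for row in rows
    ]

# Hypothetical data with two numeric attributes.
train = [
    {"age": 20, "income": 30000},
    {"age": 65, "income": 90000},
    {"age": 40, "income": 52000},
]
test = [
    {"age": 18, "income": 45000},
    {"age": 70, "income": 250000},
]

ranges = training_ranges(train)
capped = clamp_rows(test, ranges)
print(capped)
```

    The test rows themselves are left untouched; `capped` holds the clamped copies.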

     

    Regards,

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Fred12 Member Posts: 344 Unicorn

    Well, this was suggested in the book "Data Mining for the Masses", in the chapter on linear/logistic regression, as a step to do before applying the model.

     

    I wanted to do the same, but it gets tedious with many attributes, so maybe I will consider binning instead. But how do I do binning correctly? Which method should I use? Do the methods differ in their end results?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226 Unicorn

    There isn't really one correct answer for the optimal binning approach; it depends on your dataset, so you will likely have to try a few different binning methods and see which gives you the best results. But if you already have a model built and you are just trying to insulate it from the potential effect of future outliers, then normalizing your inputs and filtering out the outliers will probably be less work. You can easily have RapidMiner normalize all your attributes and then cap them at whatever values you want with just a few operators.

    Brian T.
  • Fred12 Member Posts: 344 Unicorn

    OK, then I will normalize and filter out the outliers.

     

    Can you give me a hint on the best approach to outlier detection?

    Is it better to normalize before outlier detection or not? And which method is best for outlier detection: COF, LOF (Local Outlier Factor), or clustering approaches? I am really confused by the many operators that exist for outlier detection and which of them works best for a given dataset.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226 Unicorn

    Once again, I suspect there isn't any single answer here that is going to give the best results in all situations.  It depends on your underlying data and also what you are trying to accomplish.

     

    If you are simply trying to do something in the spirit of your original request, you don't need any of the built-in outlier detection functions at all. Capping the values of the test set based on the values observed in the training set keeps their impact in linear models inside the range that was used to build the model (it prevents extrapolation, usually to avoid catastrophic and unintended model blow-outs). You can simply normalize the attributes in the training set, observe the minimum and maximum values, and then screen out or modify any data in the test set that falls outside that range. In that case, you should definitely normalize first, since you want to deal with the normalized values, which are all on the same scale.
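    The "screen out" variant mentioned above, where test examples outside the training range are dropped rather than capped, might look like this in plain Python. The data layout and function names are assumptions for illustration only.

```python
def fit_ranges(train_rows):
    """Return {attribute: (min, max)} as observed in the training rows."""
    return {
        name: (min(r[name] for r in train_rows), max(r[name] for r in train_rows))
        for name in train_rows[0]
    }

def in_training_range(row, ranges):
    """True if every attribute of the row lies inside the training range."""
    return all(lo <= row[name] <= hi for name, (lo, hi) in ranges.items())

# Hypothetical data; x1 and x2 stand in for any numeric attributes.
train = [{"x1": 1.0, "x2": 100.0}, {"x1": 5.0, "x2": 300.0}, {"x1": 3.0, "x2": 250.0}]
test = [{"x1": 2.0, "x2": 150.0}, {"x1": 6.5, "x2": 200.0}, {"x1": 4.0, "x2": 90.0}]

ranges = fit_ranges(train)
kept = [row for row in test if in_training_range(row, ranges)]
dropped = [row for row in test if not in_training_range(row, ranges)]
print(len(kept), len(dropped))
```

    A per-attribute range check like this selects the same rows on raw or z-scored values, as long as the same training parameters are applied to both sets; normalizing first matters when one shared threshold is used across all attributes.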

     

    If you actually want to go down the road of more complex outlier detection, you can do that as part of the training set data analysis first (prior to building the model), and see what kind of results the different approaches give you.  Once you pick one that you like then you would simply apply the same approach and parameter values to your test data as well and it will accomplish the same effect.

     

    Brian T.
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226 Unicorn

    Two other comments that may be of interest:

    1) You can and should apply the normalization parameters from the training data to any test data or future data. This is done by feeding the preprocessing output of the Normalize operator into an Apply Model operator, and it should happen before filtering out or replacing outlier values as discussed.

    2) You should also generally normalize attributes that are going to be used in any numerical distance calculations, such as those used in several outlier detection algorithms, so you should probably do your normalization first in those cases as well. When you normalize your data with the z-score transformation you really don't lose any information (in fact there is an operator, De-Normalize, that allows you to undo the transformation if necessary). In short, there is little downside to normalizing data, especially if you have numeric attributes with inherently different scales, but if you don't do it you can run into problems.
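    Both points can be illustrated with a small Python sketch: the parameters are estimated once on the training data, reused unchanged on the test data, and the transformation can be inverted. The class `ZScoreModel` is a hypothetical stand-in for the preprocessing model, not RapidMiner's implementation, and it assumes the sample standard deviation.

```python
from statistics import mean, stdev

class ZScoreModel:
    """Holds the mean and standard deviation estimated from training values."""

    def __init__(self, train_values):
        self.mu = mean(train_values)
        self.sigma = stdev(train_values)

    def apply(self, values):
        """Normalize values with the stored training parameters."""
        return [(v - self.mu) / self.sigma for v in values]

    def invert(self, z_values):
        """Undo the normalization, recovering values on the original scale."""
        return [z * self.sigma + self.mu for z in z_values]

# Hypothetical values for one attribute.
train = [10.0, 20.0, 30.0]
test = [20.0, 50.0]

model = ZScoreModel(train)           # fitted on the training data only
test_z = model.apply(test)           # test data reuses the training mean and std
restored = model.invert(test_z)      # no information is lost by the transformation
print(test_z, restored)
```

    Note that the test value 50.0 still comes out as a large z-score; applying the stored parameters rescales it but does not remove or cap it.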

    Brian T.
  • Fred12 Member Posts: 344 Unicorn

    As to:

    1) I am already doing that. However, does the Apply Model operator filter out my outliers automatically? I thought I would have to do this manually.

    2) OK, generally I am doing that, but I noticed that with the k-NN algorithm, for k = 1 to about 17, I always get about 10% to 15% higher performance if I do not apply normalization, and my dataset is purely numerical. Why is that? I am testing it with X-Validation on the validation data, and the result is always the same.

     

    Additional question: isn't normalization applied automatically as a preprocessing step in the libSVM implementation? I read that at least the R e1071 package (libSVM) does it automatically, because normalization has shown better performance in almost all earlier experiments. Is that true?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226 Unicorn

    1) No, applying the normalization in preprocessing isn't going to filter out your outliers; it just makes sure that the normalization parameters aren't recalculated on the test data. You would still need to filter out the outliers as described previously, if that's what you want to do.

    2) I'll have to defer this question regarding k-NN, and your additional question about libSVM as well, to one of the RapidMiner staff, whose knowledge of the inner workings of these operators is better than my own. Conceptually, the reason un-normalized data could give better model performance with k-NN is probably an unintended consequence of attribute weighting: un-normalized data gives higher weights to numerical variables that have larger values when computing distances. However, whether this is a substantive improvement or overfitting is an open question. If it is substantive, then perhaps a better solution would be to filter out the extra attributes that are bringing performance down in the k-NN calculations.
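    The implicit weighting effect described in point 2 can be seen in a tiny Python example. The data points are made up for illustration: one attribute lives on a 0 to 1 scale, the other in the thousands, and the nearest neighbor of the same query changes once both are z-scored.

```python
from math import dist
from statistics import mean, stdev

def nearest(points, query):
    """Index of the point with the smallest Euclidean distance to the query."""
    return min(range(len(points)), key=lambda i: dist(points[i], query))

def zscore_transform(points):
    """Build a function that z-scores a point using per-column training statistics."""
    params = [(mean(col), stdev(col)) for col in zip(*points)]
    return lambda p: tuple((v - mu) / sd for v, (mu, sd) in zip(p, params))

# Hypothetical training points: (small-scale attribute, large-scale attribute).
train = [(0.10, 1000.0), (0.90, 1190.0), (0.50, 2000.0)]
query = (0.12, 1200.0)

raw_nn = nearest(train, query)  # the large-scale attribute dominates the distance

transform = zscore_transform(train)
norm_nn = nearest([transform(p) for p in train], transform(query))

print(raw_nn, norm_nn)
```

    On the raw data the second attribute decides the neighbor almost alone (index 1); after normalization both attributes contribute and the neighbor becomes index 0. Whether the raw weighting helps or hurts depends on which attributes actually carry signal.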

     

    Regards,

    Brian T.