Model performance estimation

npapan69 · November 2018

Dear All,
I have a relatively small dataset with 130 samples and 2150 attributes, and I want to build a classifier to predict 2 classes. Apparently, I need to reduce the number of attributes to avoid overfitting, so I could use, e.g., RFE-SVM to reduce the number of attributes to one tenth of my samples, which is 13. I'm using a Logistic Regression model, and I need to do some fine tuning of parameters like lambda and alpha. After reading the very informative blog from Ingo, I would like some help on the practical implementation. May I kindly ask a more experienced member to check the following workflow? Can I trust this implementation, and in particular the performance estimates? Is it good practice to compare the performance from CV with that from a single hold-out set? And if yes, should these numbers be more or less the same?

Many thanks in advance,

npapan69

Telcontar120 · November 2018

Cross validation is generally believed to be more accurate than a simple split validation. Split validation measures performance based on only one random sample of the data, whereas cross-validation uses all the data for validation. Think about it this way: the hold-out from a split validation is simply one of the k folds of a cross-validation. It's inherently inferior to taking multiple holdouts and averaging their performance, which provides not only a point estimate but also a sense of the variance of the model performance.

It's different if you have a totally separate dataset (sometimes called an "out of sample" validation, perhaps from a different set of users, or different time period, etc.) that you want to test your model on after the initial construction. In that case your separate holdout might provide additional insight into your expected model performance on new data. But in a straight comparison between split and cross validation, you should prefer cross validation.
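To make the "one fold versus all folds" point concrete, here is a minimal Python sketch, outside RapidMiner. Everything in it is an assumption for illustration: the data is synthetic (130 rows, one numeric feature), and the "model" is a toy threshold classifier, not the logistic regression from the original process.

```python
import random
import statistics

def make_data(n=130, seed=0):
    """Synthetic two-class data: one feature, class 1 shifted by 1.5."""
    rng = random.Random(seed)
    data = [(rng.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(n)]
    rng.shuffle(data)
    return data

def fit_threshold(train):
    """Toy classifier: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, test):
    """Fraction of rows where 'x above threshold' agrees with the label."""
    hits = sum((x > threshold) == bool(y) for x, y in test)
    return hits / len(test)

def k_fold_scores(data, k=5):
    """Return one accuracy per fold; every row is tested exactly once."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [row for j, row in enumerate(data) if j % k != i]
        scores.append(accuracy(fit_threshold(train), test))
    return scores

data = make_data()
scores = k_fold_scores(data, k=5)
single_split = scores[0]  # a split validation is just one of these folds
print("single hold-out accuracy:", round(single_split, 3))
print("5-fold mean:", round(statistics.mean(scores), 3),
      "stdev:", round(statistics.stdev(scores), 3))
```

The single hold-out number is simply `scores[0]`; the cross validation reports the mean of all five together with their spread, which is the extra information a single split cannot give.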

rfuentealba · November 2018

Hi @npapan69,

(My poor old Apple MacBook Air is showing signs of age, hence it took me a massive amount of time to check your process without RapidMiner hanging up, so sorry for the delay!)

Now, quoting your last response:

In the -omics field in which I'm working, it's very common to have few samples and far too many attributes, therefore feature selection methods are very important to reduce overfitting.

Yes, in oceanic research I have a similar situation: models with 240 samples where each sample contains 75 attributes, and I struggle to find the smallest set of features. If you have more attributes than rows, the number of combinations you would have to analyze is higher than the number of samples you have, so mathematically your data accounts for only a fraction of the truth.

In my feature selection approach (as you will see in my process) I start by removing useless and highly correlated features...

Great, but in your process you are doing it inside a cross validation that is inside an optimization operator that is inside another cross validation. I have moved these steps to the beginning to gain a bit of speed. You don't need to do that on every loop. To illustrate, here is some pseudocode:

// The cross validation operator executes everything inside it once per
// block (fold) of data.
for each block in dataset as i:
    read the block i
    pass the block i to the optimization operator
    // The optimization operator executes everything inside it once per
    // parameter combination to optimize.
    for each value in min-svmrfe, max-svmrfe as j:
        for each value in min-logreg-alpha, max-logreg-alpha as k:
            for each value in min-logreg-lambda, max-logreg-lambda as l:
                // another cross validation
                for each sub-block in block i as m:
                    read the block m
                    divide the block m into n, o
                    model = train(j, k, l, n)
                    performance_data = test(model, o)
                    save(performance_data)

The problem with your approach is that your process has a lot of nesting that chops up the data, plus a number of data preparation steps that don't really contribute to the validation. You might get better value by running these before you begin optimization and cross validation.
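To get a feel for how quickly that nesting multiplies the work, here is a small Python sketch that counts how many times the innermost `train(...)` step runs. The fold counts and grid sizes are hypothetical; the thread does not state the actual settings of the process.

```python
def count_model_fits(outer_folds, grid_sizes, inner_folds):
    """Number of train() calls made by nested CV around a parameter grid."""
    combinations = 1
    for size in grid_sizes:
        combinations *= size
    return outer_folds * combinations * inner_folds

# Hypothetical settings: 10 outer folds, 5 inner folds, and a grid of
# 5 SVM-RFE values x 10 alpha values x 10 lambda values.
fits = count_model_fits(outer_folds=10, grid_sizes=[5, 10, 10], inner_folds=5)
print(fits)  # 25000
```

Under these assumed settings, any preparation step placed in the innermost loop would also run 25000 times, versus once if it is done before the validation starts.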

...and then apply RFE-SVM. As a rule of thumb the maximum number of features that will finally comprise the model (signature) should not exceed the 1/10 of the total number of samples used to train the model.

I'm not aware of the specifics of your project, so we'll go ahead with this.
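As a quick check of what the 1/10 rule of thumb quoted above implies, here is the arithmetic in Python. The only assumption is that the 75% training share is rounded down to whole rows; the function name is made up for this example.

```python
def max_features(n_training_samples, samples_per_feature=10):
    """Largest feature count allowed by the '10 samples per feature' rule."""
    return n_training_samples // samples_per_feature

total_samples = 130
training_samples = int(total_samples * 0.75)  # 97 rows in the 75% split

print(max_features(total_samples))     # 13
print(max_features(training_samples))  # 9
```

Since the rule as quoted refers to the samples used to train the model, holding out 25% lowers the cap from 13 to 9 features.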

Now the question is if my approach using a nested cross validation operator to select features, train and fine tune the model using 75% of the samples while testing the performance with the 25% of samples test hold out set is correct.

It is.

<comment style="impostor syndrome">
As the only unicorn on Earth who doesn't know how to do data science properly,
I do the same exercise with the golden ratio, 75/25 and 80/20 Pareto rule.
</comment>

And if yes the difference in my performance metrics (accuracy, AUC, etc) between the CV output and my test data output should be minimal? If not is that a sign of overfitting?

It can be overfitting or underfitting. Overfitting is when your model fits the training data too closely; underfitting is when it has learned too little from it. To estimate which one it is, you should examine your data first. Remember that I recommended you use X-Means to evaluate how your data is spread? That is why. It will help you figure out how different your training and testing datasets are.

Should I trust one or the other?

Use the second one to evaluate the first one, go back, retune, retest. Rinse and spin.

Should I verify the absence of overfitting by comparing the 2 outputs?

Yes. However, notice that the details are specific for each business case and it's up to you to decide whether your model is right or not. If you are recommending medicines, you want your model to be as perfect as you can. If you are detecting fraud, it is ok to flag outliers once you do the calculations and check these manually.

Regardless of the specifics, you are doing an excellent job. I made some corrections for you. Please find them attached.

All the best,

Rodrigo.

rfuentealba · November 2018

On another note:

Now that my sensei @Telcontar120 mentions it, you have two files: one is filename75.csv and the other is filename25.csv, right? (say yes even if it's not the same name).

If you did that because you want to replace the filename25.csv file with data coming from elsewhere, the process you wrote (and then, the process I sent you) is fine. If you did the split because your goal is to prepare that model and perform a split validation after a cross validation, that's not really required. It's safe to treat Cross Validation as a better-thought-out Split Validation (until Science says otherwise, but that hasn't happened). In that case, your question:

Should I trust one or the other?

Be safe trusting the Cross Validation.

In the case I sent, I assume that your testing data is new data that comes from outside your sample. A good case to do that is what happened to me in my oceanic research project:

Trained my model with a portion of valid data from 2015 and 2016.
Tested my model with a portion of valid data from 2015 and 2016, but different chunk of it.
Then I have data between 2009 and 2014 that is outside of my sample and I want to score it.

My question is: should I use a new performance validator?

If what I want to validate is how my algorithm behaves, then no, one validation is enough.
If what I want to validate is the way historical data has been scored, then yes, you might see if your algorithm holds against older data: one validator for the model and other for the old data on applied model data.
Everything else, no.

So, rule of thumb: if what's important is the model, go with Cross Validation. If it's historical data that is also scored, perform the validation yourself. If it's new data, don't validate anything, because for new data you only have predicted labels, not true ones, and validations ALWAYS come from data you already know.

Hope this helps.

rfuentealba · November 2018

Hello, @npapan69

I got a bit lost reading your description, so let me explain it back to myself like I'm 5.

You have 130 samples or horizontal rows. Each sample has 2150 attributes or vertical columns. And you want them to fit in 2 classes on a label. Am I right?

In that case, holy moly... your data is EXTREMELY prone to overfitting and I would run in circles before doing something like that again (someday I'll tell you my story with @grafikbg). There is a massive number of possible combinations to make them fit into your classes, and there is little chance that none of these 2150 attributes is correlated to another. If you want to continue, the first thing you should do is to either remove the correlated attributes or select the most important ones.

What has me confused is that you later explained that you can use SVM-RFE to reduce the attributes to a tenth, so 13. Am I right? Could it be that you have 2150 samples or horizontal rows, and each sample has 130 attributes or vertical columns? I would still do the same: remove the correlated attributes and only then apply SVM-RFE, as you said. In fact, SVM-RFE doesn't behave well when there are too many correlated attributes, and 130 is still too large a number, so there may be some correlations that are not identifiable at first sight with the naked eye.
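As an illustration of the "remove the correlated attributes first" step, here is a minimal Python sketch of a greedy correlation filter. It is not the RapidMiner operator; the column names, the values and the 0.95 threshold are all made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| with every already kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical attributes: gene_b is almost exactly 2 * gene_a.
columns = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 4.0, 6.2, 8.1, 9.9],
    "gene_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(columns))  # ['gene_a', 'gene_c']
```

The filter is order dependent: of two highly correlated columns, the one that appears first is kept. Note also that constant (zero-variance) columns pass this filter and need their own removal step.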

I would save the results of this operation in the repository before continuing with the logistic regression and whatever you want, but at least the data preparation phase would be ready at this point, and you can take advantage of the Optimize Parameters super-operator to do your fine tuning. Regarding your questions:

Q: May I kindly ask from a more experienced member to check the following workflow?
A: I can't fire RapidMiner Studio right now but I promise I will take a look as soon as I finish with my massive thing (it's almost midnight here in Chile).

Q: Can I trust this implementation and in particular the performance estimates?
A: What you are planning to do seems correct, but I would still rule out correlations before saying it is.

Q: Is it a good practice to compare the performance from CV with that from a hold-out single set? If yes these numbers should be more or less the same?
A: My level of English isn't that good. Let's see if I win the lottery with this explanation: Can a Cross-Validation be trusted? Yes, but the amount of data required to make it trustworthy depends on how variable your data is. Take your data after preparation and run a few X-Means clusterings to get a good grasp of your data variability (or is it variety? I'm sleepy).

I am keeping my promise of checking the process.

Hope this helps,

Rodrigo.

Maerkli · November 2018

Rodrigo, it is brilliant!
Maerkli

sgenzer · November 2018

@Maerkli if you like pls use new "reaction" tags: Promote, Insightful, Like, Vote Up, Awesome, LOL

npapan69 · November 2018

Dear Rodrigo,

Thank you for taking the time to respond in detail to my post. Let me clarify: in the -omics sector (in which I'm working) it is very common to have far fewer samples (horizontal entries) than attributes (vertical entries) or features. Therefore various methods are used to narrow down to the few most informative features that will comprise the -omics signature. In the xml file you will see that apart from RFE I'm removing highly correlated features, as well as features with zero or near-zero variance (useless features). As a rule of thumb, one should use at least 10 samples for every feature that will finally contribute to the model. So given the 130 samples available, I'm not supposed to exceed 13 features after the feature reduction techniques are applied. Actually, after watching Ingo's webinar, I will try the evolutionary feature selection techniques, keeping the maximum number of features at 13. Now the most important part for me is how to validate the model. In our field external validation is considered the most reliable technique; however, it's not very easy to get external data. So if I don't have external data, is it correct to start with a data split before doing anything else, keep 25% of the data as a hold-out test set, train and save my model, and afterwards test it with the hold-out set? Or forget about splitting and report (and trust) CV results? Is there a way to do repeated cross validation (like 100 times for example)?

Again many thanks for your time and greetings from Lisbon to the beautiful Chile.

Nikos
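For reference, this is what "repeated cross validation" means in procedural terms: reshuffle the rows, run a full k-fold cross validation, and repeat with a new shuffle each time. The Python sketch below is outside RapidMiner and rests on assumptions: synthetic data with 130 rows, and a toy threshold classifier standing in for the real feature selection plus logistic regression.

```python
import random
import statistics

def toy_accuracy(train, test):
    """Stand-in for 'train a model, then score it' on (value, label) rows."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    threshold = (mean0 + mean1) / 2
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)

def repeated_k_fold(data, k=10, repeats=100, seed=42):
    """Return the mean k-fold accuracy of each repetition."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        rows = data[:]
        rng.shuffle(rows)  # a new random fold assignment per repetition
        fold_scores = []
        for i in range(k):
            test = rows[i::k]
            train = [r for j, r in enumerate(rows) if j % k != i]
            fold_scores.append(toy_accuracy(train, test))
        means.append(statistics.mean(fold_scores))
    return means

gen = random.Random(0)
data = [(gen.gauss(1.5 * (i % 2), 1.0), i % 2) for i in range(130)]
means = repeated_k_fold(data, k=10, repeats=100)
print("mean over 100 repetitions:", round(statistics.mean(means), 3))
print("spread between repetitions:", round(statistics.stdev(means), 3))
```

The spread between repetitions shows how much the estimate depends on the particular fold assignment, which matters with only 130 samples.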

npapan69 · November 2018

Many thanks, Rodrigo, for taking the time to answer my post in such detail. In the -omics field in which I'm working, it's very common to have few samples and far too many attributes, therefore feature selection methods are very important to reduce overfitting. In my feature selection approach (as you will see in my process) I start by removing useless and highly correlated features and then apply RFE-SVM. As a rule of thumb the maximum number of features that will finally comprise the model (signature) should not exceed the 1/10 of the total number of samples used to train the model. Now the question is if my approach using a nested cross validation operator to select features, train and fine tune the model using 75% of the samples while testing the performance with the 25% of samples test hold out set is correct. And if yes the difference in my performance metrics (accuracy, AUC, etc) between the CV output and my test data output should be minimal? If not is that a sign of overfitting? Should I trust one or the other? Should I verify the absence of overfitting by comparing the 2 outputs?

Nikos

npapan69 · November 2018

Again, many thanks, Rodrigo, for your enlightening answer and the time devoted to correcting my process.

Best wishes,
Nikos

npapan69 · November 2018

Dear Rodrigo,
I must admit that I couldn't find a way to evaluate the training and test data variance by X-means. Probably this is very basic, and I apologise for that, but the X-means operator can receive only a single file as input, and I guess I have to provide 2 files as inputs (75% training, 25% testing). Any workarounds?

Many thanks
Nikos

rfuentealba · November 2018

Hi @npapan69

Sure, just use the Append operator to merge both files as a single one. Make sure that most of the columns have the same names and that's it.

All the best,

Rodrigo.
