I am building an in-hospital mortality prediction model. I have data on 1266 patients, of whom 63 died. A first selection of my variables was obtained with RapidMiner's Auto Model, which checks the quality of the columns for modelling.
--> Through its preprocessing, 53 columns remain. The columns are binary, categorical, or continuous. Missingness was addressed either by categorization or by random-forest imputation, so there are no missing values left in the columns.
(Another approach, which did not give me good results, was a univariate statistical analysis with t-tests, Mann-Whitney U tests, and odds-ratio calculations. I included the significant variables, but this univariate selection led to poor modelling results afterwards.)
So I switched to the Auto Model preselection and now want to do the feature selection as follows.
For the selection I decided on backward selection. Combining it with weighting by correlation and by random forest also seemed appealing to me.
So I thought I would do something like this:
Select the 53 columns and multiply the stream three times:
1. Upsample - Weight by Correlation - Weights to Data - Generate Attributes
2. Upsample - random forest weights - Weights to Data - Generate Attributes
3. Upsample by SMOTE on the training side inside the Backward Elimination operator, training Naive Bayes and validating with cross-validation.
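Outside RapidMiner, the third branch can be sketched in scikit-learn terms. This is only a hedged illustration with synthetic, imbalanced toy data standing in for the real patient table; the column counts and the AUC scoring criterion are my assumptions, and plain backward elimination replaces the SMOTE step (SMOTE would need to be applied inside each training fold, e.g. with an imbalanced-learn pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Imbalanced synthetic data as a stand-in for the 1266-patient table
# (~5% positives, mirroring 63 deaths).
X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)

nb = GaussianNB()
# Backward elimination: start from all columns and drop the least useful
# ones. AUC is a safer criterion than accuracy with so few positives.
selector = SequentialFeatureSelector(nb, n_features_to_select=6,
                                     direction="backward",
                                     scoring="roc_auc", cv=3)
selector.fit(X, y)
kept = np.flatnonzero(selector.get_support())
print("kept columns:", kept)

# Cross-validated AUC of Naive Bayes on the reduced feature set.
auc = cross_val_score(nb, X[:, kept], y, scoring="roc_auc", cv=3).mean()
print("mean AUC:", round(auc, 3))
```

The number of features to keep (6 of 12 here) is arbitrary for the demo; with the real 53 columns it would be chosen by cross-validated performance.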
Then I got stuck at the Append operator: I don't get a final selection, only three separate sets of weight information, or a selection from the backward elimination alone.
Backward selection over this many columns also takes a long time. Could I make a sensible preselection using only random forest and Weights to Data, and then run the backward selection on just those attributes?
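The two-stage idea (cheap random-forest ranking first, slow backward elimination only on the survivors) would roughly look like this in scikit-learn; again a hedged sketch on synthetic data, and the cut-off of 10 columns is an assumption, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)

# Stage 1: cheap preselection by random-forest feature importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_k = 10  # assumed cut-off; with 53 real columns, 15-20 may be sensible
keep = np.argsort(rf.feature_importances_)[::-1][:top_k]

# Stage 2: backward elimination with Naive Bayes on the reduced set only,
# which is far faster than starting from all columns.
sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=5,
                                direction="backward", scoring="roc_auc",
                                cv=3).fit(X[:, keep], y)
final = keep[sfs.get_support()]
print("final columns:", sorted(final))
```

The same two-stage structure should be reproducible in RapidMiner by feeding the output of Weights to Data / Select by Weights into the Backward Elimination operator.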
I am very unsure how to proceed. It doesn't need to be fancy; a simple, proper feature selection would make me happy, and any advice would be appreciated. I am not well established in modelling, so I don't know what I can rely on.
All in all, I have to say I am very unsure how to obtain a good feature selection. As easy as possible would be fine too, since I come from medicine rather than data science, even though I like getting into it.