Explain Predictions: Ranking attributes that support and contradict correct predictions
Most feature selection techniques give us the best predictors for the target label. These rankings depend mainly on the correlation between each predictor and the output label (class).
How can we know which of these variables performed better in predicting a correct label for a particular algorithm?
In RapidMiner, the Explain Predictions operator provides statistical and visual output that helps you understand the role each attribute plays in a prediction. The operator uses local correlation values to determine the role of each attribute (predictor) in the prediction made for a single example in the data. That role can be supporting or contradicting the prediction. The result is visualized with shades of red and green: red marks attributes that contradict the prediction, and green marks attributes that support it.
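To make the idea of local correlation more concrete, here is a minimal Python sketch. It is not RapidMiner's implementation: the model (`toy_confidence`), the attribute values, the neighborhood size, and the noise scale are all hypothetical, and reading the sign of the correlation as "supports" or "contradicts" is a simplification. It samples points around one example, scores them with the model, and correlates each attribute with the model's confidence.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def toy_confidence(sample):
    """Hypothetical model: confidence rises with a1, falls with a2, ignores a3."""
    a1, a2, _a3 = sample
    return 1.0 / (1.0 + math.exp(-(2.0 * a1 - 1.5 * a2)))


def local_weights(confidence, sample, n_neighbors=500, scale=0.3, seed=0):
    """Correlate each attribute with the model confidence near one example."""
    rng = random.Random(seed)
    # Neighbors are the example plus Gaussian noise on every attribute.
    neighbors = [[v + rng.gauss(0.0, scale) for v in sample]
                 for _ in range(n_neighbors)]
    confs = [confidence(nb) for nb in neighbors]
    return [pearson([nb[j] for nb in neighbors], confs)
            for j in range(len(sample))]


weights = local_weights(toy_confidence, [0.5, 0.2, 1.0])
for name, w in zip(["a1", "a2", "a3"], weights):
    role = "supports" if w > 0 else "contradicts"
    print(f"{name}: local correlation {w:+.2f} ({role})")
```

In this toy setup a1 gets a clearly positive weight, a2 a clearly negative one, and a3 stays close to zero, which is the kind of per-example ranking the green and red colors express.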
The process file attached below is based on the Iris dataset. The problem is the classification of different flowers based on four attributes (a1 to a4). I first tried to find attribute importance using Auto Model, which rates attributes based on four factors (https://docs.rapidminer.com/8.1/studio/auto-model/). In Auto Model I observed that a2 is the best predictor; as you can see in the figure below, it is shown in green. The other three attributes are shown in yellow, which means they have a medium impact on model predictions. To test this, I ran the models (5-fold cross-validation) with these three attributes included and removed.
Next, to observe the effect of keeping only supporting attributes, I removed the attributes that were identified above as contradicting correct predictions and ran the models again. The Decision Tree and Gradient Boosted Trees performance improved. Random Forest performance did not change, but Neural Net performance dropped. In machine learning we try all kinds of crazy things, as there are no set rules for getting better predictions.
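The include/remove comparison can also be sketched outside RapidMiner. The Python example below is only an illustration under stated assumptions: the data is synthetic (two informative attributes and one noisy one, not the Iris data), and the model is a simple nearest-centroid classifier rather than any of the models used above. It runs 5-fold cross-validation once with all attributes and once with the noisy attribute removed.

```python
import random


def make_data(n_per_class=50, seed=0):
    """Synthetic data (assumption): a1 and a2 are informative, a3 is noise."""
    rng = random.Random(seed)
    rows = []
    for label, center in (("A", 0.0), ("B", 2.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0),
                 rng.gauss(0.0, 3.0)]
            rows.append((x, label))
    rng.shuffle(rows)
    return rows


def fit_centroids(train, cols):
    """Mean of the selected attributes for each class."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(cols))
        for i, c in enumerate(cols):
            acc[i] += x[c]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x, cols):
    """Return the class whose centroid is closest to x."""
    def dist(center):
        return sum((x[c] - center[i]) ** 2 for i, c in enumerate(cols))
    return min(centroids, key=lambda y: dist(centroids[y]))


def cross_val_accuracy(rows, cols, k=5):
    """Mean accuracy over k folds, using only the attributes in cols."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        centroids = fit_centroids(train, cols)
        hits = sum(predict(centroids, x, cols) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / k


rows = make_data()
acc_all = cross_val_accuracy(rows, cols=[0, 1, 2])
acc_reduced = cross_val_accuracy(rows, cols=[0, 1])
print(f"all attributes:  {acc_all:.3f}")
print(f"without a3:      {acc_reduced:.3f}")
```

Whether the reduced set scores higher depends on the data and the model, so the comparison needs to be repeated for each model.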
Comments and feedback are much appreciated.