Explain Predictions: Ranking attributes that support and contradict correct predictions

varunm1 Moderator, Member Posts: 1,207 Unicorn
edited April 2019 in Knowledge Base
Hello,

Most feature selection techniques provide us with the best predictors for the target label. These are mainly based on the correlation between each predictor and the output label (class).

A limitation of this process is that the importance of attributes changes from one model to another. This depends mainly on how the strength of an attribute varies in the presence of other attributes, and on the statistical background of the model.

How can we know which of these variables performed better in predicting a correct label for a particular algorithm? 
In RapidMiner, the "Explain Predictions" operator provides statistical and visual output to help understand the role of each attribute in a prediction. The operator uses local correlation values to quantify each attribute's (predictor's) role in predicting a particular value for a single example in the data. This role can support or contradict the prediction, and it is visualized with two colors: red marks attributes that contradict the prediction, and green marks attributes that support it.
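The operator's exact algorithm is not shown here, but the general idea behind such local explanations can be sketched in a few lines of Python (a rough sketch only, not RapidMiner's actual implementation): perturb a single example and correlate each attribute with the model's predicted-class probability in that neighborhood.

```python
# Sketch of a local, correlation-based explanation
# (not RapidMiner's actual Explain Predictions algorithm).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

def local_attribute_roles(model, X, x, n_samples=500, seed=0):
    """Signed role of each attribute for one example: correlation of the
    attribute with the predicted-class probability in a local neighborhood."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 0.2 * X.std(axis=0), size=(n_samples, X.shape[1]))
    neighborhood = x + noise
    target = model.predict(x.reshape(1, -1))[0]
    class_idx = list(model.classes_).index(target)
    proba = model.predict_proba(neighborhood)[:, class_idx]
    if proba.std() == 0:            # model is locally constant around x
        return np.zeros(X.shape[1])
    # > 0: attribute supports the prediction (green);
    # < 0: attribute contradicts the prediction (red).
    return np.array([np.corrcoef(neighborhood[:, j], proba)[0, 1]
                     for j in range(X.shape[1])])

print(local_attribute_roles(model, X, X[0]).round(2))
```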

How do we know which attributes supported and contradicted correct predictions, and vice versa?
As explained earlier, the color codes you see in the visualization belong to both correct and incorrect predictions. What if you are interested in finding the attributes that support and contradict only the correct predictions? That is the motivation for writing this post. In predictive modeling, only a few models can provide a global importance of variables, and finding globally significant attributes is difficult in the case of complex algorithms. But with the help of the Explain Predictions operator, we can generate rankings of the attributes that supported and contradicted correct predictions. I explain this in a process example below.
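Outside RapidMiner, the ranking idea can be expressed in a few lines (a hedged sketch that reuses the `local_attribute_roles` helper from the sketch above): average the signed local roles over all correctly predicted examples and sort.

```python
# Sketch of the ranking: aggregate local roles over correct predictions.
import numpy as np

def rank_attributes_for_correct_predictions(model, X, y, names, explain):
    """explain(model, X, x) returns the signed per-attribute roles for x."""
    preds = model.predict(X)
    correct = X[preds == y]                  # correctly predicted examples
    roles = np.array([explain(model, X, x) for x in correct])
    mean_roles = roles.mean(axis=0)          # > 0 supports, < 0 contradicts
    order = np.argsort(-mean_roles)          # sort by importance, descending
    return [(names[j], round(float(mean_roles[j]), 3)) for j in order]
```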

The process file attached below is based on the Iris dataset. The problem we are looking at here is the classification of different flowers based on four attributes (a1 to a4). I try to find attribute importance using Auto Model, which rates attributes based on four factors (https://docs.rapidminer.com/8.1/studio/auto-model/). Now, I first observed the importance of attributes in Auto Model and found that a2 is the best predictor, as it is represented in green in the figure below. The other three attributes are in yellow, which means they have a medium impact on model predictions. To test this, I ran the models (5-fold cross-validation) with these three attributes included and removed.



Interestingly, the models did much better in the presence of all four attributes than in their absence: the kappa values increased from 0.3 to 0.9. So for this dataset we are better off including all four attributes. Now, the next task is to understand which attributes did well in predicting the correct label. For this, we utilize the Explain Predictions operator along with some regular operators to rank attribute performance (a community member provided this ranking method).
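For readers who want to reproduce the spirit of this inclusion/removal test outside RapidMiner, here is a minimal scikit-learn sketch (assuming a1 to a4 map to the four Iris columns in order):

```python
# 5-fold cross-validated kappa with all four attributes vs. only a2.
from sklearn.datasets import load_iris
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kappa = make_scorer(cohen_kappa_score)

for cols, label in [([0, 1, 2, 3], "all four attributes"),
                    ([1], "only a2")]:
    scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X[:, cols], y, cv=5, scoring=kappa)
    print(f"{label}: mean kappa = {scores.mean():.2f}")
```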


I compare the performance of four classification models (Decision Tree, Random Forest, Gradient Boosted Tree, and Neural Network) and identify the importance of attributes in each model for correct predictions. In the figure below, you can observe that the importance of each attribute varies by algorithm. A positive value indicates a supporting attribute and a negative value a contradicting attribute with respect to correct predictions. The attributes are sorted by importance.
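A sketch of this comparison, reusing the two helpers from the sketches above and scikit-learn stand-ins for the four RapidMiner learners:

```python
# Signed attribute importance per model, using the helper sketches above.
from sklearn.datasets import load_iris
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
names = ["a1", "a2", "a3", "a4"]
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosted Tree": GradientBoostingClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=0),
}
for label, model in models.items():
    model.fit(X, y)
    print(label, rank_attributes_for_correct_predictions(
        model, X, y, names, explain=local_attribute_roles))
```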

 
Now, to observe the effect of having only supporting attributes, I removed the attributes identified above as contradicting correct predictions and ran the models again. From the results, I observed that Decision Tree and Gradient Boosted Tree performance improved, Random Forest performance did not change, and Neural Network performance was reduced. In machine learning we try many different things, as there are no set rules for getting better predictions.

Comments and feedback are much appreciated.

Thanks
Regards,
Varun
https://www.varunmandalapu.com/

Be Safe. Follow precautions and Maintain Social Distancing

Comments

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hey Varun,
    Thanks for the discussion and the thoughts.  I would like to comment on a few of the aspects you have mentioned.
    Now, I first observed the importance of attributes in Auto Model and found that a2 is the best predictor, as it is represented in green in the figure below. The other three attributes are in yellow, which means they have a medium impact on model predictions.
    That is actually not the meaning of those colors.  I have pasted the help section on the colors below as a spoiler.

    The colored status bubble provides a quality indicator for a data column.

    • Red: A red bubble indicates a column of poor quality, which in most cases you should remove from the data set. Red can indicate one of the following problems:
      • More than 70% of all values in this column are missing,
      • The column is practically an ID with (almost) as many different values as you have rows in your data set but does not look like a text column at the same time (see below),
      • The column is practically constant, with more than 90% of all values being the same (stable), or
      • The column has a correlation of lower than 0.0001% or higher than 95% with the label to predict (if a label is existing).
    • Yellow: A yellow bubble indicates a column which behaves like an ID but also looks like text, or which has either a very low or a very high correlation with the target column. The correlation-based yellow bubbles can only appear if the task is "Predict".
      • ID which looks like text: this column has a high ID-ness and would be marked as red but at the same time has a text-ness of more than 85%.
      • Low Correlation: a correlation of less than 0.01% indicates that this column is not likely to contribute to the predictions. While keeping such a column is not problematic, removing it may speed up the model building.
      • High Correlation: a correlation of more than 40% may be an indicator for information you don't have at prediction time. In that case, you should remove this column. Sometimes, however, the prediction problem is simple, and you will get a better model when the column is included. Only you can decide.
    So green does not mean the feature is most important; it simply means it is safe to use for modeling.  Yellow, on the other hand, should be checked.  In this case, not because of low correlation but because of high correlation.

    A better piece of information for judging the likely importance of a feature for the label / the model is the correlation column.  If you sort by this column, you will see that the order with respect to importance (correlation with the label) is a3, a4, a1, a2.  So a2, while safe to use for modeling without additional checks, is in fact likely to be the least important feature.
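    For reference, this ordering can be approximated outside RapidMiner in a couple of lines (a sketch: the a1..a4 column mapping is assumed, and correlating a numerically encoded multi-class label is only a rough proxy for Auto Model's correlation measure):

    ```python
    # Rank attributes by absolute correlation with the (encoded) label.
    import pandas as pd
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    df = pd.DataFrame(X, columns=["a1", "a2", "a3", "a4"])
    df["label"] = y
    print(df.corr()["label"].drop("label").abs().sort_values(ascending=False))
    ```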

    Interestingly, the models did much better in the presence of all four attributes than in their absence: the kappa values increased from 0.3 to 0.9.
    I would recommend re-doing this analysis based on the information above: a3 and a4 are the most important attributes, not a2.

    I compare the performance of four classification models (Decision Tree, Random Forest, Gradient Boosted Tree, and Neural Network) and identify the importance of attributes in each model for correct predictions.
    I am actually thinking about creating a new operator for calculating feature importance based on the Explain Predictions output as well.  I am not sure yet whether focusing only on correct predictions is a good idea.  To be honest, I could see an argument for including both sides of the coin, correct and wrong predictions.  The reason is that the feature value was important for the model independent of whether the prediction was correct or not.  Is this not what we are after?

    Just my 2c,
    Ingo
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Thanks, @IngoRM, for your comments. In my view, it is not appropriate to analyze based on correct predictions alone; we need to consider the total set of predictions, both correct and incorrect. I saw a trend where an attribute that strongly supports correct predictions also supports incorrect predictions. If the performance of an algorithm is low (i.e., there are many incorrect predictions), an attribute that supports both correct and incorrect predictions does more harm than good.

    Correct me if there is any misconception about this.
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    I think we are on the same page here.  I was only bringing it up since you mentioned
    ...and identify the importance of attributes in each model for correct predictions.
    So I thought it was a good opportunity to have this discussion quickly :)

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Thanks, I got it. I am looking forward to your idea for a new feature selection operator based on Explain Predictions; actually, this is the major reason I posted this thread. :smile:
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yip, this thread and another one from recent days really made me think about how this could lead to a model-agnostic but model-dependent global feature importance weighting based on the local explanations / importances.  Stay tuned...
  • aileenzhou Member Posts: 12 Contributor II
    @IngoRM A new operator for calculating feature importance would be a great idea. But I don't see this in the latest version 9.7. Or have I missed it somehow?
    I love RM because it is easy for beginners to step into the big-data analysis world, and they can actually conduct complex projects using RM. However, its use is sometimes constrained because some commonly used functions are missing. I would say feature importance is one of them.
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hi @aileenzhou

    What kind of feature importance algorithm are you looking for?


    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • aileenzhou Member Posts: 12 Contributor II
    @varunm1 I used the algorithm called 'information fusion-based sensitivity analysis (IFSA)'.
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    As an idea, I thought to investigate not so much which attributes supported or contradicted the generation of specific results, but rather how those attributes lead to the generation of correct or incorrect results. This could assist in determining how a specific model failed to classify some of the examples correctly. As a simplistic exercise, I converted the label-prediction pair into a correct/incorrect flag as a new label for generating a tree model, which (in my case, after cross-validating a tree model) resulted in the following model:
    While the approach is theory-free, it may help determine why incorrect decisions were made.
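    A minimal sketch of this meta-labeling idea in scikit-learn (an assumed stand-in for the RapidMiner process; out-of-fold predictions from the base model give an honest correctness label):

    ```python
    # Relabel each example as correct/incorrect, then fit a tree on that.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    base_preds = cross_val_predict(DecisionTreeClassifier(random_state=0),
                                   X, y, cv=10)
    meta_label = (base_preds == y).astype(int)   # 1 = correct, 0 = incorrect
    meta_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    meta_tree.fit(X, meta_label)
    print(export_text(meta_tree, feature_names=["a1", "a2", "a3", "a4"]))
    ```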
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited July 2020
    The second attempt, again exploring ideas rather than being theoretically pure, was to use association analysis. The idea was: can we associate different value ranges of the predictors with correct or incorrect classification?
    I assume it is not the "whole" attributes we need to look at but rather the ranges of their values!
    This is similar to what we commonly see in cluster analysis. I'll do the association analysis between the discretised attributes and the generated classification correctness. As the classifier is 95% correct, we have no problem generating association rules for the correct predictions; e.g., here are the visualised "correct" rules for attributes binned into 3 classes.
    However, these are not very interesting, as they do not help us identify rules associated with incorrect predictions. Unfortunately, there are too few failure cases for association rules to be sensitive to them. Up-sampling the failures, however, can be useful in this regard. I'd recommend an approach where you can train the up-sampling method, e.g. using a GAN, but as a quick and dirty approach I have used SMOTE, with sensitivity increased by binning the attributes into 5 classes, with the following result.
    Note that the visualisation of association rules is a bit ambiguous, but if needed, a detailed analysis of the association tables can further reveal the rule details (a code sketch follows the list), including:
    • Ranking of combinations of attribute value ranges that support or contradict correct and incorrect predictions using a variety of criteria, e.g. support, confidence, conviction, lift, etc.
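    A sketch of this pipeline (assumes the imbalanced-learn and mlxtend packages; X and meta_label as in the previous sketch):

    ```python
    # SMOTE up-sampling of failures, then association rules on binned values.
    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from mlxtend.frequent_patterns import apriori, association_rules

    # Few failure cases, so keep the SMOTE neighborhood small.
    X_res, label_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(
        X, meta_label)

    # Discretise each attribute into 5 bins, one-hot encode bin membership.
    df = pd.DataFrame(X_res, columns=["a1", "a2", "a3", "a4"])
    binned = df.apply(lambda col: pd.qcut(col, 5, duplicates="drop"))
    items = pd.get_dummies(binned.astype(str)).astype(bool)
    items["incorrect"] = (label_res == 0)

    freq = apriori(items, min_support=0.05, use_colnames=True)
    rules = association_rules(freq, metric="confidence", min_threshold=0.6)
    # Keep rules that conclude "incorrect" and rank them by lift.
    bad = rules[rules["consequents"].apply(lambda s: s == {"incorrect"})]
    print(bad.sort_values("lift", ascending=False).head())
    ```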
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited July 2020
    Note that in the above it was "confidence" that was used for weighting the association rules to determine the attribute value proximity. However, we may like to use other statistical measures of rule "importance", e.g. lift. And we can try different layout algorithms, e.g. Fruchterman and Reingold's.
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited July 2020
    An additional comment on why it is difficult to explain the attribute impact on the classification outcome, which will also show that SMOTE up-sampling did not disturb the distributional properties of the attributes within the classification outcome classes.
    The following chart shows whether or not individual attribute values had an impact (and how) on the classification outcome.
    As can be seen, the means of the attribute values in the outcome classes (success - blue, failure - red) are well differentiated for attributes a2, a3 and a4 (we tend to do this in cluster analysis). However, the variance of the successful cases is huge (due to 95% of the data being in this class), which really invalidates our interpretation. This can be clearly seen in the density distributions of the two "best" separated attributes (by their means).
    We can see that it is virtually impossible to separate the failures from successes in the density distribution. However, let's see what happened after SMOTE.
    The distribution of each attribute within its performance class is virtually the same as before SMOTE. However, the separation of the two performance classes is far better (RM assigned the class colours in reverse order).
    So perhaps we could try interpreting the "simple" distribution chart based on the class mean.
    This also explains why we were able to construct association rules for the classifier's failures, and why we can trust them - as long as we trust the synthetic examples added to the mix.
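    This distributional check can also be done numerically (a small sketch; reuses X, meta_label, X_res and label_res from the earlier sketches):

    ```python
    # Compare class-conditional means/stds before and after SMOTE.
    import pandas as pd

    def class_stats(X, labels, names=("a1", "a2", "a3", "a4")):
        df = pd.DataFrame(X, columns=list(names))
        df["outcome"] = labels          # 1 = correct, 0 = incorrect
        return df.groupby("outcome").agg(["mean", "std"]).round(2)

    print("before SMOTE:\n", class_stats(X, meta_label))
    print("after SMOTE:\n", class_stats(X_res, label_res))
    ```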

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hey,
    The second attempt, again exploring ideas rather than being theoretically pure, was to use association analysis. The idea was: can we associate different value ranges of the predictors with correct or incorrect classification?


    This is, by the way, the idea people use to generate a learner that predicts the uncertainty of another learner, isn't it?


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany