Determining which attributes contribute to value of a label

goccolo · February 2017

Hi Community, I'm currently running my first data mining project and I'm having some serious doubts. I hope I'm posting my question in the right place and that some of you can help me with a hint or some advice. It may be obvious to you, but to me the differences between many tools and techniques are still hard to tell apart.

I have my data in an Oracle database, stored into 4 tables: CUSTOMERS, ACCOUNTS, TRANSACTIONS and ALERTS.

The common attribute for each of them is CUSTOMER_ID.

The attribute which is most "interesting" to me is called TRUE_POSITIVE, it's a column from table ALERTS, and it takes either value "Yes" or "No".

The GOAL of my project is to determine which of the attributes contribute the most to the value of TRUE_POSITIVE being = "Yes".

My dataset is moderate in size (maybe 50 attributes in total, tables having between 50k to 700k examples).

At this point I've imported my data into RapidMiner Studio and done some initial data cleaning (rejected certain columns, filtered out examples with important attributes missing, etc.).

Many attributes take binominal values (for example: CUSTOMERS.FACE_TO_FACE_IDENTIFIED), and many are polynominal (for example: CUSTOMERS.NATIONALITY).

I've also created some new attributes in table CUSTOMERS, like NO_OF_ALERTS_POS, which stores the number of true positive alerts for the particular customer, or HR_CASHFLOW which stores customers' average monthly value of transactions made "with" high risk countries.

My main question is:

Which tool / operator should I use to achieve my goal? Correlation matrix? Regression?

And some additional questions:

What would be the optimal number of attributes? Does my current dataset require much dimensionality reduction?

Can I use my new attributes to avoid joining tables? Does that make sense, or is there a big risk that I will miss the chance of detecting some non-obvious correlations?

Many thanks in advance for your help.

Thomas_Ott · February 2017

Hi! Welcome to the boards. I moved your post to the RapidMiner Studio forum because you're using Studio.

OK, your task is really a standard classification analysis. You're trying to use the data you cleaned to learn any patterns that make a record a "Yes" or a "No."

What I would suggest is to use a predefined Cross Validation building block (right click in the design canvas, select Insert Building Block, insert Nominal Cross Validation). The default algorithm is a Decision Tree (double click to see inside), and if you run it, it will output a confusion matrix and tell you how well that algorithm was able to discern between Yes and No. You can swap out the Decision Tree for, say, a Logistic Regression or some other algorithm and test again to see which gives you a better model.

Why do we try different algos? It's because some algos perform better on some data sets than on others, so it sometimes becomes an iterative process. With RapidMiner, it's simple to swap out different algos, and once you become more advanced you can build a process to auto model and auto select the best algorithm. We've been doing automodeling, tuning, and selection for years but never talked about it.

You might well need dimensionality reduction if your data set is really wide. There are different techniques to do it, but first try the Cross Validation one and we'll go from there.

goccolo · February 2017

Thomas, thanks for your response!

I've tried a few algorithms, results below. Could you help me interpret them?

Decision Tree:

accuracy: 93.77% +/- 0.01% (mikro: 93.77%)

ConfusionMatrix:

True: Nie Tak

Nie: 42254 2806

Tak: 0 0

precision: unknown (positive class: Tak)

recall: 0.00% +/- 0.00% (mikro: 0.00%) (positive class: Tak)

AUC (optimistic): 1.000 +/- 0.000 (mikro: 1.000) (positive class: Tak)

AUC: 0.500 +/- 0.000 (mikro: 0.500) (positive class: Tak)

AUC (pessimistic): 0.000 +/- 0.000 (mikro: 0.000) (positive class: Tak)

k-NN:

accuracy: 88.27% +/- 2.31% (mikro: 88.27%)

ConfusionMatrix:

True: Nie Tak

Nie: 39582 2612

Tak: 2672 194

precision: 6.58% +/- 1.82% (mikro: 6.77%) (positive class: Tak)

recall: 6.68% +/- 1.39% (mikro: 6.91%) (positive class: Tak)

AUC (optimistic): 0.941 +/- 0.011 (mikro: 0.941) (positive class: Tak)

AUC: 0.500 +/- 0.000 (mikro: 0.500) (positive class: Tak)

AUC (pessimistic): 0.062 +/- 0.012 (mikro: 0.062) (positive class: Tak)

GLM:

accuracy: 91.27% +/- 4.57% (mikro: 91.27%)

ConfusionMatrix:

True: Nie Tak

Nie: 40693 2371

Tak: 1561 435

precision: 32.86% +/- 13.99% (mikro: 21.79%) (positive class: Tak)

recall: 14.18% +/- 6.19% (mikro: 15.50%) (positive class: Tak)

AUC (optimistic): 0.575 +/- 0.018 (mikro: 0.575) (positive class: Tak)

AUC: 0.557 +/- 0.017 (mikro: 0.557) (positive class: Tak)

AUC (pessimistic): 0.539 +/- 0.018 (mikro: 0.539) (positive class: Tak)

Looking at the confusion matrix, the last one seems to be the best, but it is still far from actually good, as the prediction was correct only about 1 in 3 times when the predicted label value was "Tak".
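As a rough cross-check on the numbers above, the metrics can be recomputed from the confusion-matrix counts. This is a minimal Python sketch, not RapidMiner output. It assumes that rows are predictions and columns are true classes (the layout under which the posted "mikro" precision and recall work out), and the helper name `metrics` is made up for illustration.

```python
def metrics(tn, fn, fp, tp):
    """Return (accuracy, precision, recall) for the positive class "Tak".

    precision is None when nothing was predicted positive,
    which corresponds to "precision: unknown" in the posted output.
    """
    total = tn + fn + fp + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


# Counts copied from the posted matrices (assumed: rows = predicted, columns = true).
results = {
    "Decision Tree": metrics(tn=42254, fn=2806, fp=0, tp=0),
    "k-NN": metrics(tn=39582, fn=2612, fp=2672, tp=194),
    "GLM": metrics(tn=40693, fn=2371, fp=1561, tp=435),
}

for name, (acc, prec, rec) in results.items():
    print(name, acc, prec, rec)
```

Under that reading, the Decision Tree's 93.77% accuracy is simply the share of "Nie" examples (42254 of 45060), which is what any model gets by predicting "Nie" for every record.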

Andrew · February 2017

You could also use the various "Weight by" operators. These will create a set of weights for the attributes where the value of the weight is nearer 1 if the attribute is relevant for the label and nearer 0 if it is not. You can then use the "Select by Weights" operator to select attributes of interest based on the weights to yield an example set with only the attributes of interest.

Andrew
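To make the idea of attribute weights concrete, here is a minimal Python sketch of one possible weighting scheme, information gain for nominal attributes, followed by a threshold-based selection. It is not the RapidMiner implementation: the toy rows, the function names, the max-based normalisation and the 0.5 cut-off are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the label minus its expected entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def normalized_weights(columns, labels):
    """Scale gains so the most relevant attribute gets weight 1 (assumed scheme)."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    top = max(gains.values())
    return {name: (g / top if top > 0 else 0.0) for name, g in gains.items()}


# Hypothetical toy data, not taken from the thread.
labels = ["Tak", "Tak", "Nie", "Nie", "Nie", "Nie"]
columns = {
    "COUNTRY_OF_RESIDENCE": ["X", "X", "PL", "PL", "PL", "PL"],
    "SEX": ["F", "M", "F", "M", "F", "M"],
}

weights = normalized_weights(columns, labels)
# Keep only attributes whose weight clears an arbitrary threshold.
selected = [name for name, w in weights.items() if w >= 0.5]
print(weights, selected)
```

In this toy set the country column separates the classes perfectly and gets weight 1, while the sex column carries no information about the label and gets a weight of essentially 0.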

Thomas_Ott · February 2017

So now we start doing some basic data science. Each Algo you chose has its shortcomings and the results all suck IMHO.

The Decision Tree model is pure garbage; it thinks everything is "Nie."

The k-NN is a bit better, but it really has a hard time finding "Tak." While this looks bad on the surface, there might be some opportunity to tune the k value and make sure the attributes are properly normalized.

The GLM is slightly better in a different way, but it too has a hard time discerning "Tak."

So what are some of the ways you can make this better? You might want to first go back to your dataset and try to balance the data. There are far fewer instances of "Tak" than of "Nie." This is what we call an unbalanced set, and in classification tasks it can cause the algo to just lump everyone into the "Nie" category, like the Decision Tree did.

I would add a Sample operator inside the Cross Validation operator (right before the algo on the training side), toggle on "balance data," select an equal amount of each class, and then train the algo again. Also check whether the attributes you're using for training can be normalized if you use k-NN; scaling can have a big impact with that algo.
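The balancing step described above can be sketched outside RapidMiner as plain downsampling of the majority class in the training fold. This is a minimal Python illustration under stated assumptions: the rows, the `id` field, the function name and the fixed seed are hypothetical, and the test fold is deliberately left alone so that performance is still measured on the real class ratio.

```python
import random
from collections import defaultdict


def balance_by_downsampling(rows, label_key, seed=42):
    """Return a shuffled copy of rows with an equal number of examples per class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    # Every class is cut down to the size of the smallest one.
    n = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training fold: 5 "Tak" rows and 95 "Nie" rows.
train = [{"id": i, "TRUE_POSITIVE": "Tak" if i < 5 else "Nie"} for i in range(100)]
balanced = balance_by_downsampling(train, "TRUE_POSITIVE")
print(len(balanced))
```

Downsampling throws away majority-class examples, so on a small training set it may be worth comparing against weighting the classes instead.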

Telcontar120 · February 2017

You may also be interested in looking at the performance(costs) operator, which allows you to specify different costs of classification and misclassification. It may be that not all errors are equal, and the performance(costs) gives you a way to indicate the relative importance of misclassification. The modeling algorithm will then seek to minimize the costs (this operator doesn't work for all algorithms but it does for the main ones).
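To illustrate what unequal misclassification costs do to a comparison, here is a small Python sketch that scores the three confusion matrices posted earlier in the thread. The cost values (a missed "Tak" costing 10, a false alarm costing 1) are purely hypothetical, the matrices are read as rows = predicted and columns = true, and the sketch covers only the evaluation side, not cost-sensitive training.

```python
def total_cost(fn, fp, cost_fn, cost_fp):
    """Total misclassification cost; correct predictions are assumed to cost nothing."""
    return fn * cost_fn + fp * cost_fp


# Hypothetical costs: missing a true positive alert is 10x worse than a false alarm.
COST_FN, COST_FP = 10, 1

# (fn, fp) counts copied from the posted matrices.
errors = {
    "Decision Tree": (2806, 0),
    "k-NN": (2612, 2672),
    "GLM": (2371, 1561),
}

costs = {name: total_cost(fn, fp, COST_FN, COST_FP) for name, (fn, fp) in errors.items()}
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

With these assumed costs the ranking differs from the accuracy ranking: the Decision Tree has the highest accuracy of the three but is no longer the cheapest model.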

goccolo · February 2017

Thanks to all for your answers. I'm not implementing all your suggestions yet, because I want to take one step at a time and make sure I know what I'm doing.

My current model performance is as follows:

accuracy: 57.55% +/- 4.12% (mikro: 57.55%)
ConfusionMatrix:
True:	Nie	Tak
Nie:	35803	136
Tak:	26925	884
precision: 3.20% +/- 0.24% (mikro: 3.18%) (positive class: Tak)
recall: 86.67% +/- 2.98% (mikro: 86.67%) (positive class: Tak)
AUC (optimistic): 0.802 +/- 0.011 (mikro: 0.802) (positive class: Tak)
AUC: 0.800 +/- 0.011 (mikro: 0.800) (positive class: Tak)
AUC (pessimistic): 0.798 +/- 0.010 (mikro: 0.798) (positive class: Tak)

And the process itself looks like this:

There is improvement, but the results are still not satisfactory.

I'm pretty sure it's necessary to prepare the data more thoroughly, but I don't know how exactly.

Thomas_Ott · February 2017

You're right, the overall model still isn't that great, BUT the classifier is starting to pick up "Tak" a lot better. I think this model can be optimized for sure.

I would try a GBT and an SVM algo. For the SVM, try a radial kernel with C = 1000 and gamma = 0.01 initially. If either of these algos shows improvement, then I would suggest using the Grid optimization operator to automatically vary and test combinations of parameters.

Another thought, if you have a wide data set (i.e. many attributes), is to do some automatic feature selection to reduce the attributes but keep the ones with the best predictive power. I'm attaching the XML for that operator here. You'd have to put it between the Cross Validation and Replace Missing Values operators. Before you use it, I would take another look at the process as a whole. The Replace Missing Values operator gives me pause; I would think about that one to make sure that it's what you want to do.

We use this method for our own PQL scoring system and it works really well. We distill 100's of attributes down to the best 15 but it does take processing time. To make it go faster, adjust the k-folds in the Cross Validation operator and vary the initial generations/pop size, etc.

Good luck.

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="optimize_selection_evolutionary" compatibility="7.3.001" expanded="true" height="103" name="Optimize Selection (Evolutionary)" width="90" x="313" y="34">
        <parameter key="population_size" value="20"/>
        <parameter key="maximum_number_of_generations" value="100"/>
        <parameter key="show_population_plotter" value="true"/>
        <parameter key="plot_generations" value="1"/>
        <parameter key="selection_scheme" value="non dominated sorting"/>
        <parameter key="keep_best_individual" value="true"/>
        <process expanded="true">
          <operator activated="false" class="x_validation" compatibility="5.1.002" expanded="true" height="124" name="Validation" width="90" x="45" y="34">
            <parameter key="number_of_validations" value="4"/>
            <parameter key="sampling_type" value="2"/>
            <parameter key="use_local_random_seed" value="true"/>
            <process expanded="true">
              <operator activated="true" class="weka:W-REPTree" compatibility="7.3.000" expanded="true" height="82" name="W-REPTree" width="90" x="112" y="34"/>
              <connect from_port="training" to_op="W-REPTree" to_port="training set"/>
              <connect from_op="W-REPTree" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="7.3.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="7.3.001" expanded="true" height="145" name="Cross Validation" width="90" x="45" y="289">
            <process expanded="true">
              <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.3.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="203" y="34">
                <list key="expert_parameters"/>
              </operator>
              <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
              <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="7.3.001" expanded="true" height="82" name="Performance (3)" width="90" x="246" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
              <connect from_op="Performance (3)" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="performance_attribute_count" compatibility="7.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="238"/>
          <operator activated="true" class="log" compatibility="7.3.001" expanded="true" height="82" name="Log" width="90" x="447" y="34">
            <list key="log">
              <parameter key="Performance" value="operator.Validation.value.performance"/>
              <parameter key="Generation" value="operator.Optimize Selection (Evolutionary).value.generation"/>
              <parameter key="LengthBest" value="operator.Optimize Selection (Evolutionary).value.best_length"/>
              <parameter key="LengthAverage" value="operator.Optimize Selection (Evolutionary).value.average_length"/>
              <parameter key="PerformanceBest" value="operator.Optimize Selection (Evolutionary).value.best"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="example set" to_op="Performance (2)" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_op="Performance (2)" to_port="performance"/>
          <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>


goccolo · February 2017

I limited my dataset to 16 attributes (incl. the label), which I think is good enough.

Then I've repeated the modelling and here are performances of different algos:

Algorithm	Accuracy	"Nie" class recall	"Tak" class recall
k-NN	68%	68%	63%
GLM	57%	57%	87%
Decision Tree	98%	99%	0%
GBT	67%	67%	77%
Neural Net	71%	71%	73%
Logistic regression	68%	68%	80%
SVM	83%	83%	53%

I used the Grid optimization operator, but I was a bit disappointed with the results - in most cases manipulating parameters was basically changing the "Nie"/"Tak" class recall ratio. When the recall for "Nie" was going up, so was the overall accuracy, but that's just because there are a lot more examples of "Nie" in my exampleset. (63k examples vs 1k examples).
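The effect described above can be checked with a little arithmetic: overall accuracy is the class-size-weighted mean of the two recalls, so with roughly 63k "Nie" and 1k "Tak" examples it tracks the "Nie" recall almost exactly. The Python sketch below uses the rounded figures from the table and the approximate class sizes quoted in the post. Balanced accuracy (the plain mean of the two recalls) is shown as one possible alternative yardstick; nobody in the thread proposed it.

```python
# Approximate class sizes quoted in the post.
N_NIE, N_TAK = 63_000, 1_000

# (Nie recall, Tak recall), rounded values from the table.
recalls = {
    "k-NN": (0.68, 0.63),
    "GLM": (0.57, 0.87),
    "Decision Tree": (0.99, 0.00),
    "GBT": (0.67, 0.77),
    "Neural Net": (0.71, 0.73),
    "Logistic regression": (0.68, 0.80),
    "SVM": (0.83, 0.53),
}


def overall_accuracy(r_nie, r_tak):
    """Accuracy implied by the per-class recalls and the class sizes."""
    return (r_nie * N_NIE + r_tak * N_TAK) / (N_NIE + N_TAK)


def balanced_accuracy(r_nie, r_tak):
    """Mean of the per-class recalls; both classes count equally."""
    return (r_nie + r_tak) / 2


for name, (r_nie, r_tak) in recalls.items():
    print(name, round(overall_accuracy(r_nie, r_tak), 3), balanced_accuracy(r_nie, r_tak))

best = max(recalls, key=lambda name: balanced_accuracy(*recalls[name]))
```

By this measure the Decision Tree drops from the top of the table to the bottom, since a "Tak" recall of 0% is no longer hidden by the class imbalance.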

I went back to my data set, because I thought it's the reason why I can't squeeze out more effectiveness from these models.

I created my data purely for this project. I incorporated some patterns and regularities into it, but there is still a lot of randomness.

So to make the job easier (possible?) for the modelling algorithms, I turned some alerts labelled "Nie" to "Tak", for some specific customers.

Now there were 60.5k examples of "Nie" and 3.5k examples of "Tak" in my post-cleansing exampleset (though for the model learning subprocess there was still 50/50 sampling).

This was enough to boost the performance of GBT to the following results:

accuracy: 84.54% +/- 2.57% (mikro: 84.54%)
ConfusionMatrix:
True:	Nie	Tak
Nie:	50899	446
Tak:	9410	2993
precision: 24.58% +/- 3.23% (mikro: 24.13%) (positive class: Tak)

...which is good enough for me.

After all I'm studying a hypothetical piece of AML software. If I managed to make my model predict with 100% accuracy which customers are likely to commit money-laundering, that wouldn't be very realistic.

Now coming back to my goal:

I'm trying to figure out which attributes indicate that the customer will have a true positive alert.

The model description contains this section:

Variable Importances:
            Variable Relative Importance Scaled Importance Percentage
COUNTRY_OF_RESIDENCE          659.554016          1.000000   0.734397
          PROFESSION          107.786240          0.163423   0.120017
       NO_OF_HR_TXNS           65.660095          0.099552   0.073111
      F2F_IDENTIFIED           16.260277          0.024653   0.018105
    COUNTRY_OF_BIRTH           12.987129          0.019691   0.014461
         NATIONALITY           10.082042          0.015286   0.011226
       ANNUAL_INCOME            9.467501          0.014354   0.010542
      OLDEST_ACCOUNT            7.385976          0.011198   0.008224
      NO_OF_ACCOUNTS            6.033924          0.009148   0.006719
           CITY_SIZE            2.696423          0.004088   0.003002
           CUST_TYPE            0.175396          0.000266   0.000195
      MARITAL_STATUS            0.000000          0.000000   0.000000
                 SEX            0.000000          0.000000   0.000000
         HR_CASHFLOW            0.000000          0.000000   0.000000
                 AGE            0.000000          0.000000   0.000000

I guess I can conclude that the top 3 variables are the attributes which should be taken into account when estimating the customer's risk.

If I go to the description of the trees, I will also be able to determine what values of these attributes are most likely to give a TRUE_POSITIVE = "Tak".

Is my understanding correct?

MartinLiebig · February 2017

Hi,

yes, you are right. To be a bit more precise, the table tells you the overall, global importance for the tree. It reads like this: 73% of the information needed for the classification is contained in the COUNTRY_OF_RESIDENCE attribute. What I would have a look at is the cumulative sum of the last column. Seeing this, I would argue for taking more than 5 attributes into account.
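The cumulative sum of the Percentage column can be worked out directly from the posted table. The Python sketch below copies the values from the Variable Importances output; the helper `attributes_needed` and the 90/95/99% thresholds are illustrative choices, not something prescribed in the thread.

```python
from itertools import accumulate

# (attribute, percentage) pairs copied from the posted Variable Importances table.
percentages = [
    ("COUNTRY_OF_RESIDENCE", 0.734397),
    ("PROFESSION", 0.120017),
    ("NO_OF_HR_TXNS", 0.073111),
    ("F2F_IDENTIFIED", 0.018105),
    ("COUNTRY_OF_BIRTH", 0.014461),
    ("NATIONALITY", 0.011226),
    ("ANNUAL_INCOME", 0.010542),
    ("OLDEST_ACCOUNT", 0.008224),
    ("NO_OF_ACCOUNTS", 0.006719),
    ("CITY_SIZE", 0.003002),
    ("CUST_TYPE", 0.000195),
    ("MARITAL_STATUS", 0.0),
    ("SEX", 0.0),
    ("HR_CASHFLOW", 0.0),
    ("AGE", 0.0),
]

# Running total of importance, most important attribute first.
cumulative = list(accumulate(p for _, p in percentages))


def attributes_needed(threshold):
    """Smallest number of top attributes whose summed importance reaches threshold."""
    for k, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return k
    return None


for (name, _), total in zip(percentages, cumulative):
    print(f"{name:>22} {total:.6f}")

print(attributes_needed(0.90), attributes_needed(0.95), attributes_needed(0.99))
```

The top three attributes cover about 93% of the total importance, and the curve flattens quickly after that, which is the picture the cumulative sum is meant to give.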

What happens if you learn the GBT only on the top4/5/6/7 attributes? I would be interested to see AUC vs Nr. Attributes for the GBT. That chart might be helpful.

A side note: this is a global number. It can still mean that for a single customer other attributes have a huge effect, but that would only happen in a small fraction of your customer base.

Best,

Martin

goccolo · February 2017

For 15 attributes:

Accuracy: 88.07% +/- 1.80% (mikro: 88.07%)

AUC: 0.941 +/- 0.005 (mikro: 0.941) (positive class: Tak)

For 6 top attributes:

Accuracy: 90.56% +/- 1.53% (mikro: 90.56%)

AUC: 0.934 +/- 0.007 (mikro: 0.934) (positive class: Tak)

However, the increase in accuracy came at the cost of reduced "Tak" class recall, so I went back to the wider attribute set.

OK, so now the model is built and I know the attribute importance, but one question remains:

How can I get to know which values make the model predict a "Tak" or a "Nie"?

In my example the top 2 attributes are COUNTRY_OF_RESIDENCE and PROFESSION. What are the actual countries/professions that give me a "Tak"?

Telcontar120 · February 2017

If this is from the Random Forest learner, you would have to inspect the individual trees to determine that relationship.

Alternatively, you can run a Naive Bayes model on your reduced dataset with the top 16 attributes (or whatever you want to see). While the overall model might not be that accurate, the model output provides a set of views that show the relationship between your attribute values (both numerical and nominal) and your label.
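A simple way to see which values of a nominal attribute go together with "Tak" is to count the label per value. The Python sketch below does only that; it is related to, but not the same as, what a Naive Bayes model stores (which is the distribution of values within each class). The rows, the profession codes and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def positive_rate_by_value(rows, attribute, label="TRUE_POSITIVE", positive="Tak"):
    """Share of positive labels for each value of a nominal attribute, highest first."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[label]] += 1
    rates = {value: c[positive] / sum(c.values()) for value, c in counts.items()}
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Hypothetical rows; "A", "B" and "C" stand in for real profession names.
rows = (
    [{"PROFESSION": "B", "TRUE_POSITIVE": "Tak"}]
    + [{"PROFESSION": "B", "TRUE_POSITIVE": "Nie"}] * 3
    + [{"PROFESSION": "A", "TRUE_POSITIVE": "Tak"}] * 3
    + [{"PROFESSION": "A", "TRUE_POSITIVE": "Nie"}]
    + [{"PROFESSION": "C", "TRUE_POSITIVE": "Nie"}] * 2
)

rates = positive_rate_by_value(rows, "PROFESSION")
print(rates)
```

Values with only a handful of examples can show extreme rates by chance, so it is worth looking at the underlying counts next to the rates before drawing conclusions.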

MartinLiebig · February 2017

If this is from the Random Forest learner, you would have to inspect the individual trees to determine that relationship.

Or use the Weight by Tree Importance Operator

~Martin

Slyart2046 · June 2017

Hi, good day. I am starting to learn how to use RapidMiner and I want to ask which operator you used to get this output.

Variable Importances:
            Variable Relative Importance Scaled Importance Percentage
COUNTRY_OF_RESIDENCE          659.554016          1.000000   0.734397
          PROFESSION          107.786240          0.163423   0.120017
       NO_OF_HR_TXNS           65.660095          0.099552   0.073111
      F2F_IDENTIFIED           16.260277          0.024653   0.018105
    COUNTRY_OF_BIRTH           12.987129          0.019691   0.014461
         NATIONALITY           10.082042          0.015286   0.011226
       ANNUAL_INCOME            9.467501          0.014354   0.010542
      OLDEST_ACCOUNT            7.385976          0.011198   0.008224
      NO_OF_ACCOUNTS            6.033924          0.009148   0.006719
           CITY_SIZE            2.696423          0.004088   0.003002
           CUST_TYPE            0.175396          0.000266   0.000195
      MARITAL_STATUS            0.000000          0.000000   0.000000
                 SEX            0.000000          0.000000   0.000000
         HR_CASHFLOW            0.000000          0.000000   0.000000
                 AGE            0.000000          0.000000   0.000000

thank you!

Telcontar120 · June 2017

It comes from the model output of the Gradient Boosted Trees operator.

Thomas_Ott · December 2017

This has turned out to be a very good thread.

MartinLiebig · December 2017

And by the way: 8.0 has an update on Random Forest and Decision Tree. Both of them are now delivering their importance on a weight port.

Best,

Martin
