Determining which attributes contribute to the value of a label

goccolo Member Posts: 5 Contributor I
edited November 2018 in Help

Hi Community, I'm currently running my first data mining project and I'm having some serious doubts. I hope I'm posting my question in the right place and that some of you can help me with a hint or some advice. It may be obvious to you, but to me the differences between the many tools and techniques are still blurry.

 

I have my data in an Oracle database, stored into 4 tables: CUSTOMERS, ACCOUNTS, TRANSACTIONS and ALERTS.

The common attribute for each of them is CUSTOMER_ID.

The attribute which is most "interesting" to me is called TRUE_POSITIVE, it's a column from table ALERTS, and it takes either value "Yes" or "No".

The GOAL of my project is to determine which of the attributes contribute the most to the value of TRUE_POSITIVE being = "Yes".

 

My dataset is moderate in size (maybe 50 attributes in total, tables having between 50k to 700k examples).

At this point I've imported my data into RapidMiner Studio and done some initial data cleaning (rejected certain columns, filtered out examples with missing important attributes, etc.).

Many attributes take binominal values (for example: CUSTOMERS.FACE_TO_FACE_IDENTIFIED), many are polynominal (for example: CUSTOMERS.NATIONALITY).

I've also created some new attributes in table CUSTOMERS, like NO_OF_ALERTS_POS, which stores the number of true positive alerts for the particular customer, or HR_CASHFLOW which stores customers' average monthly value of transactions made "with" high risk countries.

 

My main question is:

Which tool/operator should I use to achieve my goal? A correlation matrix? Regression?

 

And some additional questions:

What would be the optimal number of attributes? Does my current dataset require much dimensionality reduction?

Can I use my new attributes to avoid joining tables? Does that make sense, or is there a big risk that I will miss the chance of detecting some unobvious correlations?

 

Many thanks in advance for your help.

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hi! Welcome to the boards. I moved your post to the RapidMiner Studio forum because you're using Studio.

     

    OK, your task is really a standard classification analysis. You're trying to use the data you cleaned to learn the patterns that make one record a "Yes" or "No."

     

    What I would suggest is to use a predefined Cross Validation building block (right click in the design canvas, select Insert Building Block, insert Nominal Cross Validation). The default algorithm is a Decision Tree (double click to see inside); if you run it, it will output a confusion matrix and tell you how well that algorithm was able to discern between Yes and No. You can swap out the Decision Tree for maybe a Logistic Regression or some other algorithm and test again to see which gives you a better model.
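    For readers who want the same idea outside RapidMiner, here is a minimal sketch in scikit-learn: k-fold cross-validation comparing a decision tree against logistic regression. The dataset, class ratio, and fold count are made up for illustration; they are not the thread's actual data.

```python
# Cross-validate two classifiers and compare their accuracy,
# mirroring the "swap the learner inside Cross Validation" idea.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary data (roughly like a rare "Yes" class).
X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.94, 0.06], random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

    Note that plain accuracy can look deceptively high on imbalanced data, which is exactly what happens later in this thread.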

     

    Why do we try different algos? Because some algos perform better on different data sets, so it becomes an iterative process sometimes. With RapidMiner it's simple to swap out different algos, and once you become more advanced you can build a process to auto-model and auto-select the best algorithm. We've been doing auto-modeling, tuning, and selection for years but never talked about it. :)

     

    Most likely you'll need dimensionality reduction if your data set is really wide. There are different techniques for it, but first try the Cross Validation one and we'll go from there.

  • goccolo Member Posts: 5 Contributor I

    Thomas, thanks for your response!

     

    I've tried a few algorithms; results below. Could you help me interpret them? (The class labels are in Polish: "Nie" = No, "Tak" = Yes.)

    Decision Tree:

            accuracy: 93.77% +/- 0.01% (mikro: 93.77%)

            ConfusionMatrix:

            True:   Nie     Tak

            Nie:    42254   2806

            Tak:    0       0

            precision: unknown (positive class: Tak)

            recall: 0.00% +/- 0.00% (mikro: 0.00%) (positive class: Tak)

            AUC (optimistic): 1.000 +/- 0.000 (mikro: 1.000) (positive class: Tak)

            AUC: 0.500 +/- 0.000 (mikro: 0.500) (positive class: Tak)

            AUC (pessimistic): 0.000 +/- 0.000 (mikro: 0.000) (positive class: Tak)

     

    k-NN:

           accuracy: 88.27% +/- 2.31% (mikro: 88.27%)

           ConfusionMatrix:

           True:   Nie     Tak

           Nie:    39582   2612

           Tak:    2672    194

           precision: 6.58% +/- 1.82% (mikro: 6.77%) (positive class: Tak)

           recall: 6.68% +/- 1.39% (mikro: 6.91%) (positive class: Tak)

           AUC (optimistic): 0.941 +/- 0.011 (mikro: 0.941) (positive class: Tak)

           AUC: 0.500 +/- 0.000 (mikro: 0.500) (positive class: Tak)

           AUC (pessimistic): 0.062 +/- 0.012 (mikro: 0.062) (positive class: Tak)

     

     

    GLM:

           accuracy: 91.27% +/- 4.57% (mikro: 91.27%)

           ConfusionMatrix:

           True:   Nie     Tak

           Nie:    40693   2371

           Tak:    1561    435

           precision: 32.86% +/- 13.99% (mikro: 21.79%) (positive class: Tak)

           recall: 14.18% +/- 6.19% (mikro: 15.50%) (positive class: Tak)

           AUC (optimistic): 0.575 +/- 0.018 (mikro: 0.575) (positive class: Tak)

           AUC: 0.557 +/- 0.017 (mikro: 0.557) (positive class: Tak)

           AUC (pessimistic): 0.539 +/- 0.018 (mikro: 0.539) (positive class: Tak)

     

    Looking at the confusion matrices, the last one seems to be the best, but still far from actually good, as the prediction was correct only about 1 in 3 times when the predicted label was "Tak".

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    You could also use the various "Weight by" operators. These will create a set of weights for the attributes where the value of the weight is nearer 1 if the attribute is relevant for the label and nearer 0 if it is not. You can then use the "Select by Weights" operator to select attributes of interest based on the weights to yield an example set with only the attributes of interest.
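    One way to mimic that "Weight by ..." then "Select by Weights" chain outside RapidMiner is sketched below: score each attribute with mutual information against the label, scale the scores to [0, 1], and keep attributes above a threshold. The data, threshold, and weighting function are illustrative assumptions, not the thread's process.

```python
# Attribute weighting + selection by weight threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=3, random_state=0)

weights = mutual_info_classif(X, y, random_state=0)
weights = weights / weights.max()      # scale so the best attribute gets weight 1.0
keep = np.where(weights > 0.5)[0]      # analogue of "Select by Weights", threshold 0.5

print("kept attribute indices:", keep)
X_selected = X[:, keep]                # example set with only the relevant attributes
```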

     

    Andrew

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    So now we start doing some basic data science. Each Algo you chose has its shortcomings and the results all suck IMHO.


    The Decision Tree model is pure garbage, it thinks everything is "Nie".

     

    The k-NN is a bit better, but it really has a hard time finding "Tak." While on the surface this appears bad, there might be some opportunity to tune the k value and make sure attributes are properly normalized.

     

    The GLM is slightly better in a different way, but it too has a hard time discerning "Tak."

     

    So what are some of the ways you can make this better? You might want to first go back to your dataset and try to balance the data. It appears that there are far fewer instances of "Tak" than of "Nie." This is what we call an unbalanced set, and in classification tasks it can cause the algo to just lump everything into the "Nie" category, like the Decision Tree did.

     

    I would add a Sample operator inside the Cross Validation operator (right before the algo on the training side), toggle on "balance data," select an equal amount of each class, and then train the algo again. Also check whether the attributes you're using for training can be normalized; if you use k-NN, scaling can have a big impact with that algo.
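    The balancing step can be sketched as follows: downsample the majority class to match the minority before fitting (in practice, do this on the training folds only, never on the test data). The class sizes and feature count here are toy values mirroring the thread's "Nie"/"Tak" labels.

```python
# Downsample the majority class to a 50/50 training set.
import numpy as np

rng = np.random.default_rng(0)
y = np.array(["Nie"] * 950 + ["Tak"] * 50)   # heavily imbalanced toy labels
X = rng.normal(size=(1000, 5))               # toy attributes

majority = np.where(y == "Nie")[0]
minority = np.where(y == "Tak")[0]
majority_down = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([majority_down, minority])

X_bal, y_bal = X[idx], y[idx]
counts = {str(c): int(n) for c, n in zip(*np.unique(y_bal, return_counts=True))}
print(counts)  # {'Nie': 50, 'Tak': 50}
```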

     

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You may also be interested in looking at the Performance (Costs) operator, which allows you to specify different costs for classification and misclassification. It may be that not all errors are equal, and Performance (Costs) gives you a way to indicate the relative importance of misclassification. The modeling algorithm will then seek to minimize the costs (this operator doesn't work for all algorithms, but it does for the main ones).
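    As an illustration of the same idea in scikit-learn (not the RapidMiner operator itself), the closest knob is `class_weight`: making errors on the rare class cost more shifts the learner toward catching positives. The data and the 10x weight are assumptions for demonstration.

```python
# Cost-sensitive learning: weight errors on the rare class 10x.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
costed = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X, y)

for name, m in [("equal costs", plain), ("10x cost on positives", costed)]:
    cm = confusion_matrix(y, m.predict(X))
    # cm[1, 0] counts positives the model missed (predicted negative).
    print(name, "-> missed positives:", cm[1, 0])
```

    Typically the weighted model misses fewer positives at the price of more false alarms, which is exactly the trade-off the thread is wrestling with.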

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • goccolo Member Posts: 5 Contributor I

    Thanks to all for the answers. I'm not implementing all your suggestions yet, because I want to take one step at a time and make sure I know what I'm doing :)

     

    My current model performance is as follows:

    accuracy: 57.55% +/- 4.12% (mikro: 57.55%)
    ConfusionMatrix:
    True: Nie Tak
    Nie: 35803 136
    Tak: 26925 884
    precision: 3.20% +/- 0.24% (mikro: 3.18%) (positive class: Tak)
    recall: 86.67% +/- 2.98% (mikro: 86.67%) (positive class: Tak)
    AUC (optimistic): 0.802 +/- 0.011 (mikro: 0.802) (positive class: Tak)
    AUC: 0.800 +/- 0.011 (mikro: 0.800) (positive class: Tak)
    AUC (pessimistic): 0.798 +/- 0.010 (mikro: 0.798) (positive class: Tak)
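    As a sanity check, the micro precision/recall figures above follow directly from the confusion matrix (rows = predicted class, columns = true class):

```python
# Recompute the "Tak" precision and recall from the confusion matrix.
tp = 884      # predicted Tak, truly Tak
fp = 26925    # predicted Tak, truly Nie
fn = 136      # predicted Nie, truly Tak

precision = tp / (tp + fp)   # 884 / 27809
recall = tp / (tp + fn)      # 884 / 1020

print(f"precision: {precision:.2%}")   # precision: 3.18%
print(f"recall: {recall:.2%}")         # recall: 86.67%
```

    In other words, the balanced model now catches almost all true positives but raises a false alarm on a huge number of "Nie" cases.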

    And the process itself looks like this:

     

    proc_1.png

    proc_2.png

     

    There is improvement, but the results are still not satisfactory.

    I'm pretty sure it's necessary to prepare the data more thoroughly, but I don't know how exactly.

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You're right, the overall model still isn't that great, BUT the classifier is starting to pick up "Tak" a lot better. I think this model can be optimized for sure.

     

    I would try a GBT and an SVM algo. For the SVM, try a radial kernel with C=1000 and gamma=0.01 initially. If any of these algos shows improvement, then I would suggest using the Grid optimization operator to automatically vary and test combinations of parameters.
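    The suggested parameter sweep can be sketched in scikit-learn as a grid search over an RBF-kernel SVM, centered on the starting values above (this mirrors the Grid optimization operator; the data and grid bounds are illustrative assumptions):

```python
# Grid search over C and gamma for an RBF SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [10, 100, 1000], "gamma": [0.001, 0.01, 0.1]},
    cv=3,
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("best CV accuracy:", round(grid.best_score_, 3))
```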

     

    Another thought, if you have a wide data set (i.e. many attributes), is to do some automatic feature selection to reduce the attributes but keep the ones with the best predictive power. I'm attaching the XML for that operator here. You'd have to put it between the Cross Validation and Replace Missing Values. Before you use it, I would take another look at the process as a whole. The Replace Missing Values operator gives me pause; I would think about that one to make sure it's what you want to do.

     

    We use this method for our own PQL scoring system and it works really well. We distill hundreds of attributes down to the best 15, but it does take processing time. To make it go faster, adjust the k-folds in the Cross Validation operator and vary the initial generations/population size, etc.

     

    Good luck. 

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="optimize_selection_evolutionary" compatibility="7.3.001" expanded="true" height="103" name="Optimize Selection (Evolutionary)" width="90" x="313" y="34">
    <parameter key="population_size" value="20"/>
    <parameter key="maximum_number_of_generations" value="100"/>
    <parameter key="show_population_plotter" value="true"/>
    <parameter key="plot_generations" value="1"/>
    <parameter key="selection_scheme" value="non dominated sorting"/>
    <parameter key="keep_best_individual" value="true"/>
    <process expanded="true">
    <operator activated="false" class="x_validation" compatibility="5.1.002" expanded="true" height="124" name="Validation" width="90" x="45" y="34">
    <parameter key="number_of_validations" value="4"/>
    <parameter key="sampling_type" value="2"/>
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="weka:W-REPTree" compatibility="7.3.000" expanded="true" height="82" name="W-REPTree" width="90" x="112" y="34"/>
    <connect from_port="training" to_op="W-REPTree" to_port="training set"/>
    <connect from_op="W-REPTree" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.3.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.3.001" expanded="true" height="145" name="Cross Validation" width="90" x="45" y="289">
    <process expanded="true">
    <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.3.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="203" y="34">
    <list key="expert_parameters"/>
    </operator>
    <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
    <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.3.001" expanded="true" height="82" name="Performance (3)" width="90" x="246" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
    <connect from_op="Performance (3)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="performance_attribute_count" compatibility="7.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="238"/>
    <operator activated="true" class="log" compatibility="7.3.001" expanded="true" height="82" name="Log" width="90" x="447" y="34">
    <list key="log">
    <parameter key="Performance" value="operator.Validation.value.performance"/>
    <parameter key="Generation" value="operator.Optimize Selection (Evolutionary).value.generation"/>
    <parameter key="LengthBest" value="operator.Optimize Selection (Evolutionary).value.best_length"/>
    <parameter key="LengthAverage" value="operator.Optimize Selection (Evolutionary).value.average_length"/>
    <parameter key="PerformanceBest" value="operator.Optimize Selection (Evolutionary).value.best"/>
    </list>
    </operator>
    <connect from_port="example set" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="example set" to_op="Performance (2)" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_op="Performance (2)" to_port="performance"/>
    <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <portSpacing port="source_example set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    </process>
    </operator>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    </process>


  • goccolo Member Posts: 5 Contributor I

    I limited my dataset to 16 attributes (incl. the label), which I think is good enough.

     

    Then I've repeated the modelling and here are performances of different algos:

    Algorithm            Accuracy   "Nie" class recall   "Tak" class recall
    k-NN                 68%        68%                  63%
    GLM                  57%        57%                  87%
    Decision Tree        98%        99%                  0%
    GBT                  67%        67%                  77%
    Neural Net           71%        71%                  73%
    Logistic regression  68%        68%                  80%
    SVM                  83%        83%                  53%

     

    I used the Grid optimization operator, but I was a bit disappointed with the results: in most cases manipulating the parameters basically just changed the "Nie"/"Tak" class recall ratio. When the recall for "Nie" went up, so did the overall accuracy, but that's just because there are a lot more examples of "Nie" in my example set (63k examples vs 1k examples).

     

    I went back to my data set, because I thought it was the reason why I couldn't squeeze more effectiveness out of these models.

    I created my data purely for this project. I incorporated some patterns and regularities into it, but there is still much randomness.

    So to make the job easier (possible?) for the modelling algorithms, I turned some alerts labelled "Nie" into "Tak" for some specific customers.

    Now there were 60.5k examples of "Nie" and 3.5k examples of "Tak" in my post-cleansing example set (though for the model learning subprocess there was still 50/50 sampling).

     

    This was enough to boost the performance of GBT to the following results:

    accuracy: 84.54% +/- 2.57% (mikro: 84.54%)
    ConfusionMatrix:
    True: Nie Tak
    Nie: 50899 446
    Tak: 9410 2993
    precision: 24.58% +/- 3.23% (mikro: 24.13%) (positive class: Tak)

    ...which is good enough for me.

    After all, I'm studying a hypothetical piece of AML software. If I managed to make my model predict with 100% accuracy which customers are likely to commit money laundering, that wouldn't be very realistic.

     

    Now coming back to my goal:

    I'm trying to figure out which attributes indicate that the customer will have a true positive alert.

    The model description contains this section:

    Variable Importances:
    Variable Relative Importance Scaled Importance Percentage
    COUNTRY_OF_RESIDENCE 659.554016 1.000000 0.734397
    PROFESSION 107.786240 0.163423 0.120017
    NO_OF_HR_TXNS 65.660095 0.099552 0.073111
    F2F_IDENTIFIED 16.260277 0.024653 0.018105
    COUNTRY_OF_BIRTH 12.987129 0.019691 0.014461
    NATIONALITY 10.082042 0.015286 0.011226
    ANNUAL_INCOME 9.467501 0.014354 0.010542
    OLDEST_ACCOUNT 7.385976 0.011198 0.008224
    NO_OF_ACCOUNTS 6.033924 0.009148 0.006719
    CITY_SIZE 2.696423 0.004088 0.003002
    CUST_TYPE 0.175396 0.000266 0.000195
    MARITAL_STATUS 0.000000 0.000000 0.000000
    SEX 0.000000 0.000000 0.000000
    HR_CASHFLOW 0.000000 0.000000 0.000000
    AGE 0.000000 0.000000 0.000000

    I guess I can conclude that the top 3 variables are the attributes which should be taken into account when estimating the customer's risk.

    If I go to the description of the trees, I will also be able to determine what values of these attributes are most likely to give a TRUE_POSITIVE = "Tak".

     

    Is my understanding correct?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    yes, you are right. To be a bit more precise, the table tells you the overall, global importance for the tree. It reads like: ~73% of the information needed for the classification is contained in the COUNTRY_OF_RESIDENCE attribute. What I would have a look at is the cumulative sum of the last column. Seeing this, I would argue for taking more than 5 attributes into account.
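    That cumulative sum can be computed directly from the Percentage column of the importance table above:

```python
# Cumulative importance coverage per number of top attributes.
from itertools import accumulate

pct = [0.734397, 0.120017, 0.073111, 0.018105, 0.014461,
       0.011226, 0.010542, 0.008224, 0.006719, 0.003002, 0.000195]

totals = list(accumulate(pct))
for k, total in enumerate(totals, start=1):
    print(f"top {k:2d} attributes cover {total:.1%} of the importance")
```

    The top 3 attributes already cover about 92.8%, and the top 5 about 96.0%, which is the kind of cut-off this check is meant to reveal.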

     

    What happens if you learn the GBT on only the top 4/5/6/7 attributes? I would be interested to see AUC vs. number of attributes for the GBT. That chart might be helpful.
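    One way to produce that chart: rank attributes by a gradient-boosting model's importances, retrain on the top k only, and record cross-validated AUC for each k. This is a scikit-learn sketch on synthetic data, not the thread's RapidMiner process.

```python
# AUC vs. number of top-ranked attributes for a GBT.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1500, n_features=15,
                           n_informative=5, random_state=0)

# Rank attributes by importance from a model fit on everything.
rank = np.argsort(GradientBoostingClassifier(random_state=0)
                  .fit(X, y).feature_importances_)[::-1]

for k in (4, 5, 6, 7):
    auc = cross_val_score(GradientBoostingClassifier(random_state=0),
                          X[:, rank[:k]], y, cv=3, scoring="roc_auc").mean()
    print(f"top {k} attributes -> AUC {auc:.3f}")
```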

     

    A side note: this is a global number. It can still mean that, for a single customer, other attributes have a huge effect, but that would only happen in a small fraction of your customer base.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • goccolo Member Posts: 5 Contributor I

    For 15 attributes:

    Accuracy: 88.07% +/- 1.80% (mikro: 88.07%)

    AUC: 0.941 +/- 0.005 (mikro: 0.941) (positive class: Tak)

     

    For 6 top attributes:

    Accuracy: 90.56% +/- 1.53% (mikro: 90.56%)

    AUC: 0.934 +/- 0.007 (mikro: 0.934) (positive class: Tak)

     

    However, the increase in accuracy came at the cost of reduced "Tak" class recall, so I went back to the wider attribute set.

     

     

    OK, so now the model is built and I know the attribute importance, but one question remains:

    How can I get to know which values make the model predict a "Tak" or a "Nie"?

     

    In my example the top 2 attributes are COUNTRY_OF_RESIDENCE and PROFESSION. What are the actual countries/professions that give me a "Tak"?

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If this is from the Random Forest learner, you would have to inspect the individual trees to determine that relationship.

    Alternatively, you can run a Naive Bayes model on your reduced dataset with the top 16 attributes (or whatever you want to see).  While the overall model might not be that accurate, the model output provides a set of views that show the relationship between your attribute values (both numerical and nominal) and your label.
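    A quick, model-free version of what those Naive Bayes views show is the "Tak" rate per value of a nominal attribute. The rows below are toy data; in the actual thread the attribute would be COUNTRY_OF_RESIDENCE or PROFESSION.

```python
# Per-value positive rate for a nominal attribute.
from collections import Counter

rows = [("PL", "Nie"), ("PL", "Nie"), ("PL", "Tak"),
        ("XX", "Tak"), ("XX", "Tak"), ("XX", "Nie"),
        ("DE", "Nie"), ("DE", "Nie")]

totals = Counter(country for country, _ in rows)
positives = Counter(country for country, label in rows if label == "Tak")

for country in sorted(totals):
    rate = positives[country] / totals[country]
    print(f"{country}: {rate:.0%} Tak")
# DE: 0% Tak
# PL: 33% Tak
# XX: 67% Tak
```

    Values with a much higher "Tak" rate than the base rate are the ones driving the prediction toward "Tak".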

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    If this is from the Random Forest learner, you would have to inspect the individual trees to determine that relationship.

    Or use the Weight by Tree Importance Operator :)

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Slyart2046 Member Posts: 8 Contributor II

    Hi, good day. I am starting to learn how to use RapidMiner and I want to ask what operator you used to get this output?

     

    Variable Importances:
    Variable Relative Importance Scaled Importance Percentage
    COUNTRY_OF_RESIDENCE 659.554016 1.000000 0.734397
    PROFESSION 107.786240 0.163423 0.120017
    NO_OF_HR_TXNS 65.660095 0.099552 0.073111
    F2F_IDENTIFIED 16.260277 0.024653 0.018105
    COUNTRY_OF_BIRTH 12.987129 0.019691 0.014461
    NATIONALITY 10.082042 0.015286 0.011226
    ANNUAL_INCOME 9.467501 0.014354 0.010542
    OLDEST_ACCOUNT 7.385976 0.011198 0.008224
    NO_OF_ACCOUNTS 6.033924 0.009148 0.006719
    CITY_SIZE 2.696423 0.004088 0.003002
    CUST_TYPE 0.175396 0.000266 0.000195
    MARITAL_STATUS 0.000000 0.000000 0.000000
    SEX 0.000000 0.000000 0.000000
    HR_CASHFLOW 0.000000 0.000000 0.000000
    AGE 0.000000 0.000000 0.000000

    thank you!



  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    It comes from the model output of the Gradient Boosted Trees operator.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    This has turned out to be a very good thread.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    And by the way: 8.0 has an update to Random Forest and Decision Tree. Both of them now deliver their importance on a weight port.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany