Making some columns more important

mbs Member Posts: 125  Guru
edited May 12 in Help
Hello, good day.
How can I make some of my Excel columns more influential and efficient when applying the model?
How can we make some columns more important and effective?

Best Answer


  • varunm1 Member Posts: 497   Unicorn
    Hello @mbs

    Can you please explain more about your requirement? 
  • mbs Member Posts: 125  Guru
    edited May 12

    I mean that I want to make some columns more influential on the algorithm.
    Some columns are more important features, so I need to make them more effective.
    Thank you
  • mbs Member Posts: 125  Guru
    edited May 12
    I did this, but it doesn't work.
    It has a label :/
  • mbs Member Posts: 125  Guru
    edited May 12
    Thank you for your help. I will try it, and if I have any problem with the weights I will ask.
    Also, is weighting a good way to make some columns more effective?
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 361   Unicorn
    Hello @mbs,

    Here is an example of how to use Select by Weights. There are a few more things you should know, but first, the steps:

    • Convert everything to Numerical, because weighting cannot be applied to categories.
    • Split the data, stratifying the examples.
    • Apply the weighting-by-correlation method to the stratified examples. You can choose any kind of weighting at this point.
    • Select the most important weights to train our Decision Tree.
    • The rest is standard procedure.
    You may also want to use Decision Tree (Weight Based) or DBSCAN (Weight Based), as not all ML algorithms support weight-based operations.
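    The steps above can be sketched outside RapidMiner as well. Here is a rough scikit-learn analogue (a sketch, not the exact process: the synthetic dataset, the top-50% threshold, and the random seeds are illustrative assumptions):

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Toy numeric data, standing in for a table already converted
    # from nominal to numerical columns.
    X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                               random_state=2001)

    # Stratified 80/20 split, like Split Data with stratified sampling.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=1992)

    # "Weight by Correlation": squared correlation of each column with the label.
    weights = np.array([np.corrcoef(X_tr[:, j], y_tr)[0, 1] ** 2
                        for j in range(X_tr.shape[1])])

    # "Select by Weights" with top p% = 50%: keep the highest-weighted half.
    k = max(1, X_tr.shape[1] // 2)
    keep = np.argsort(weights)[-k:]

    # Train and evaluate a Decision Tree on the selected columns only.
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(X_tr[:, keep], y_tr)
    print("accuracy:", accuracy_score(y_te, tree.predict(X_te[:, keep])))
    ```

    The ordering matters: the weights are computed on the training partition only, so the held-out partition never influences which columns are kept.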

    Now, this process keeps only the most important weighted columns and discards the others. Here is the XML, in case you wish to experiment:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
            <description align="center" color="transparent" colored="false" width="126">First, we get the information</description>
          </operator>
          <operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="true"/>
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="Passenger Class|Sex"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="coding_type" value="unique integers"/>
            <parameter key="use_comparison_groups" value="false"/>
            <list key="comparison_groups"/>
            <parameter key="unexpected_value_handling" value="all 0 and warning"/>
            <parameter key="use_underscore_in_name" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">We change it all to numerical if needed (It is your job to determine if this is needed or not)</description>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.2.001" expanded="true" height="103" name="Split Data" width="90" x="313" y="289">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.8"/>
              <parameter key="ratio" value="0.2"/>
            </enumeration>
            <parameter key="sampling_type" value="stratified sampling"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="weight_by_correlation" compatibility="9.2.001" expanded="true" height="82" name="Weight by Correlation" width="90" x="447" y="34">
            <parameter key="normalize_weights" value="true"/>
            <parameter key="sort_weights" value="true"/>
            <parameter key="sort_direction" value="ascending"/>
            <parameter key="squared_correlation" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">Weighting by (*) is basically the application of a strategy for determining the most important columns</description>
          </operator>
          <operator activated="true" class="select_by_weights" compatibility="9.2.001" expanded="true" height="103" name="Select by Weights" width="90" x="581" y="34">
            <parameter key="weight_relation" value="top p%"/>
            <parameter key="weight" value="1.0"/>
            <parameter key="k" value="5"/>
            <parameter key="p" value="0.5"/>
            <parameter key="deselect_unknown" value="true"/>
            <parameter key="use_absolute_weights" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">You can select only the attributes you are going to use the most.</description>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.001" expanded="true" height="103" name="Decision Tree" width="90" x="715" y="34">
            <parameter key="criterion" value="accuracy"/>
            <parameter key="maximal_depth" value="5"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.2"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="849" y="187">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance" width="90" x="983" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="false"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Weight by Correlation" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Weight by Correlation" from_port="weights" to_op="Select by Weights" to_port="weights"/>
          <connect from_op="Weight by Correlation" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
          <connect from_op="Select by Weights" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Hope this helps.

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 361   Unicorn

    Yes, it's highly recommended to use weighting for this. Most of the time, it's even more advisable than upsampling or downsampling.
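    In scikit-learn terms, the analogue of weighting instead of resampling is class or example weights (a rough sketch; the dataset and its 90/10 class imbalance are made-up assumptions):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Imbalanced toy data: roughly 90% of examples in class 0, 10% in class 1.
    X, y = make_classification(n_samples=400, weights=[0.9, 0.1],
                               random_state=0)

    # class_weight="balanced" reweights examples inversely to class frequency,
    # so no rows are duplicated (upsampling) or discarded (downsampling).
    clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
    clf.fit(X, y)
    ```

    Because no rows are dropped or copied, the split statistics still reflect every real example, which is one reason weighting is often preferred over resampling.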

    All the best,

  • mbs Member Posts: 125  Guru
    Thank you very much for your help :)
  • mbs Member Posts: 125  Guru
    Using your example, I replaced the data with my own dataset, but I had to add some more operators to make the process work with my data. Please look at these screenshots; also, I cannot understand the result :/
    Anyway, thank you for your help.

  • mbs Member Posts: 125  Guru
    The points you mentioned in the screenshot work, but I am still using weighting by information gain, because the correlation operator doesn't work with my data. I also changed the tree to Rule Induction, and the result is 98.86 :)
    Thank you
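    Information-gain-style weighting can be sketched with scikit-learn's mutual information scorer (an analogy, not RapidMiner's exact implementation; the toy dataset is a made-up assumption). Unlike Pearson correlation, it captures non-linear dependence and works naturally with a nominal label:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    # Toy data: 6 columns, only 2 of which actually carry label information.
    X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                               random_state=0)

    # One non-negative score per column; higher means more informative,
    # much like the weight vector produced by an information-gain operator.
    scores = mutual_info_classif(X, y, random_state=0)
    ```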
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 361   Unicorn
    Hello @mbs,

    To be clear: my example was a quick one to show the specific ordering of the elements. If you want to do weighting by awesomeness, go ahead, hahaha.

    The result is a confusion matrix, where you need to look at:
    • How many predicted positives are actually true positives?
    • How many predicted negatives are actually true negatives?
    • The class precision: of the examples predicted as a class, how many truly belong to it.
    • The class recall: of the examples that truly belong to a class, how many were found.
    In the case you showed, 100% is a perfect score. The real question is whether that 100%-scoring model can also score 100% on new data; if it cannot, your model is overfitting. :)
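    A tiny worked example of those definitions, using made-up counts for a 2x2 confusion matrix:

    ```python
    # Hypothetical confusion matrix counts (illustrative only):
    tp, fn = 40, 10   # actual positives: 40 found, 10 missed
    fp, tn = 5, 45    # actual negatives: 5 mislabelled, 45 correct

    # Precision: of the 45 predicted positives, how many are truly positive.
    precision = tp / (tp + fp)

    # Recall: of the 50 actual positives, how many were found.
    recall = tp / (tp + fn)

    print(round(precision, 3), recall)  # 0.889 0.8
    ```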

    All the best,

  • mbs Member Posts: 125  Guru
    With your example and my data, I cannot understand the result; it is not clear. But with my own example, everything is clear.