Making some columns more important

mbs
Hello, good day.
How can I make some columns of my Excel data more important and effective when the model is applied?

  varunm1
    Hello @mbs

    Can you please explain more about your requirement? 
  mbs
    I mean that I want to make some columns have more effect on the algorithm.
    Some columns are more important features, so I need to give them more influence.
    Thank you.
  mbs
    I also tried this, but it doesn't work.
    It has a label :/
  mbs
    Thank you for your help. I will try it, and if I have any problem with the weighting I will ask.
    Also, is weighting a good way to make some columns more effective?
  rfuentealba
    Hello @mbs,

    Here is an example of how to use Select by Weights. There is more you should know, but first, what the process does:

    • Converting the nominal attributes to numerical, because correlation-based weighting can't be applied to categories.
    • Splitting the data, stratifying the examples.
    • Applying the Weight by Correlation method to the stratified training examples. You can use any other kind of weighting at this point.
    • Selecting the attributes with the highest weights to train our Decision Tree.
    • The rest is standard procedure.
    You may also want to use Decision Tree (Weight Based) or DBSCAN (Weight Based), as not all ML algorithms support weight-based operations.
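For anyone who prefers to read code, the weighting and selection steps listed above can be sketched in plain Python. This is only a rough illustration and not how RapidMiner implements its operators: the function names, the toy columns, the choice to scale the weights so that the largest equals 1, and the rounding used when keeping the top half are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def weight_by_correlation(columns, label):
    """Squared correlation of each column with the label, scaled so the top weight is 1."""
    raw = {name: pearson(values, label) ** 2 for name, values in columns.items()}
    top = max(raw.values())
    if top == 0:
        return raw  # no column correlates with the label at all
    return {name: w / top for name, w in raw.items()}


def select_top_fraction(weights, fraction):
    """Names of the highest-weighted columns; always keeps at least one."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy data (hypothetical): four numeric columns and a 0/1 label.
columns = {
    "a": [0, 0, 1, 1],  # identical to the label
    "b": [1, 2, 1, 2],  # unrelated to the label
    "c": [5, 5, 5, 5],  # constant, carries no information
    "d": [1, 0, 1, 1],  # partly related to the label
}
label = [0, 0, 1, 1]

weights = weight_by_correlation(columns, label)
selected = select_top_fraction(weights, 0.5)  # keep the top half
print(weights)
print(selected)
```

Only the selected columns would then be passed to the learner. The discarded columns are dropped entirely, which is different from keeping every column and giving some of them more influence.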

    Note that this process keeps only the columns with the highest weights and discards the others. Here is the XML, in case you wish to experiment:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
            <description align="center" color="transparent" colored="false" width="126">First, we get the information</description>
          </operator>
          <operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="true"/>
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="Passenger Class|Sex"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="coding_type" value="unique integers"/>
            <parameter key="use_comparison_groups" value="false"/>
            <list key="comparison_groups"/>
            <parameter key="unexpected_value_handling" value="all 0 and warning"/>
            <parameter key="use_underscore_in_name" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">We change it all to numerical if needed (It is your job to determine if this is needed or not)</description>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.2.001" expanded="true" height="103" name="Split Data" width="90" x="313" y="289">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.8"/>
              <parameter key="ratio" value="0.2"/>
            </enumeration>
            <parameter key="sampling_type" value="stratified sampling"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="weight_by_correlation" compatibility="9.2.001" expanded="true" height="82" name="Weight by Correlation" width="90" x="447" y="34">
            <parameter key="normalize_weights" value="true"/>
            <parameter key="sort_weights" value="true"/>
            <parameter key="sort_direction" value="ascending"/>
            <parameter key="squared_correlation" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">Weighting by (*) is basically the application of a strategy for determining the most important columns</description>
          </operator>
          <operator activated="true" class="select_by_weights" compatibility="9.2.001" expanded="true" height="103" name="Select by Weights" width="90" x="581" y="34">
            <parameter key="weight_relation" value="top p%"/>
            <parameter key="weight" value="1.0"/>
            <parameter key="k" value="5"/>
            <parameter key="p" value="0.5"/>
            <parameter key="deselect_unknown" value="true"/>
            <parameter key="use_absolute_weights" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">You can select only the attributes you are going to use the most.</description>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.001" expanded="true" height="103" name="Decision Tree" width="90" x="715" y="34">
            <parameter key="criterion" value="accuracy"/>
            <parameter key="maximal_depth" value="5"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.2"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="849" y="187">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance" width="90" x="983" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="false"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Weight by Correlation" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Weight by Correlation" from_port="weights" to_op="Select by Weights" to_port="weights"/>
          <connect from_op="Weight by Correlation" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
          <connect from_op="Select by Weights" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Hope this helps.

  rfuentealba

    Yes, weighting is highly recommended for this. Most of the time, it is preferable even to upsampling or downsampling.

    All the best,

  mbs
    Thank you very much for your help :)
  mbs
    Using your example, I replaced the data with my own dataset, but I had to add some more operators to make the process work with my data. Please look at these screenshots. Also, I cannot understand the result :/
    Anyway, thank you for your help.

  mbs
    The points that you mentioned in the screenshot work, but I still use weighting by information gain because the correlation operator doesn't work with my data. I also changed the tree to Rule Induction, and the result is 98.86 :)
    Thank you.
  rfuentealba
    Hello @mbs,

    To be clear: my example was a quick one to show the specific ordering of the elements. If you want to do weighting by awesomeness, go ahead, hahaha.

    The result is a confusion matrix (or something similar), where you need to look at:
    • How many of the predicted positives are true positives?
    • How many of the predicted negatives are true negatives?
    • The class precision: of the examples predicted as a class, the fraction that truly belong to it.
    • The class recall: of the examples that truly belong to a class, the fraction that were predicted as that class.
    In the case you showed, 100% is a perfect score. The question is whether the model that scores 100% here also scores 100% on new, unseen data, because if it doesn't, your model is overfitting. :)
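To make the confusion matrix terms concrete, here is a small Python sketch that counts the four cells and derives class precision and class recall from them. The labels and predictions are made-up toy values and the function names are hypothetical; the performance operator in the process reports the same kind of numbers, but this is not its implementation.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class treated as 'positive'."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return tp, fp, fn, tn


def class_precision(tp, fp):
    """Fraction of the predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def class_recall(tp, fn):
    """Fraction of the truly positive examples that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0


# Toy labels (made up for this example).
actual = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "no", "yes", "no", "no", "no"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="yes")
precision = class_precision(tp, fp)
recall = class_recall(tp, fn)
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn)
print(precision, recall, accuracy)
```

Running the same calculation once on the training data and once on held-out data is a simple way to spot overfitting: a large gap between the two sets of scores suggests the model has memorized the training examples.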

    All the best,

  mbs
    With your example and my data I cannot understand the result; it is not clear. But with my own example everything is clear.
