"Log Feature Selection (per selected feature)"

MuehliMan Member Posts: 85 Maven
edited May 2019 in Help
Hi,

I am trying to build a workflow that writes one log entry for each newly selected feature. Currently it writes an entry for every iteration step. I have thought about wrapping the log in a Branch operator, but I don't know whether that would slow the process down.

What I want would be a log like this:

chosen_attributes | performance
atts45 | 0.500
atts45, atts90 | 0.750
atts45, atts90, atts2 | 0.800
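
To make the target format concrete, here is a minimal Python sketch (not a RapidMiner process) of the reduction behind such a log: from one row per evaluated attribute combination, keep only the best row for each number of attributes. The function names, the sample rows, and the assumption that a higher performance value is better are illustrative and not taken from RapidMiner.

```python
from typing import Dict, Iterable, List, Tuple

# One log row: (attribute names that were tested, measured performance).
Record = Tuple[List[str], float]


def best_per_feature_count(records: Iterable[Record],
                           higher_is_better: bool = True) -> List[Record]:
    """Keep the best record for each attribute-set size, ordered by size."""
    best: Dict[int, Record] = {}
    for attributes, performance in records:
        size = len(attributes)
        current = best.get(size)
        if current is None:
            best[size] = (list(attributes), performance)
            continue
        if higher_is_better:
            improved = performance > current[1]
        else:
            improved = performance < current[1]
        if improved:
            best[size] = (list(attributes), performance)
    return [best[size] for size in sorted(best)]


def format_line(attributes: List[str], performance: float) -> str:
    """Render one summary line as 'att_a, att_b | 0.750'."""
    return f"{', '.join(attributes)} | {performance:.3f}"


# Hypothetical per-iteration log: every tested combination is one row.
iteration_log: List[Record] = [
    (["atts45"], 0.500),
    (["atts90"], 0.400),
    (["atts45", "atts90"], 0.750),
    (["atts45", "atts2"], 0.600),
    (["atts45", "atts90", "atts2"], 0.800),
]

summary_lines = [format_line(a, p) for a, p in best_per_feature_count(iteration_log)]
for line in summary_lines:
    print(line)
```

For an error measure such as RMSE, pass `higher_is_better=False` so that the smallest value wins.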

Here is an example workflow:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Global">
    <parameter key="logfile" value="advanced1.log"/>
    <process expanded="true" height="668" width="1421">
      <operator activated="true" class="generate_data" compatibility="5.0.11" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="number_examples" value="1000"/>
        <parameter key="number_of_attributes" value="300"/>
      </operator>
      <operator activated="true" class="optimize_selection" compatibility="5.0.11" expanded="true" height="94" name="FS" width="90" x="179" y="30">
        <process expanded="true" height="642" width="1070">
          <operator activated="true" class="materialize_data" compatibility="5.0.11" expanded="true" height="76" name="Materialize Data" width="90" x="180" y="30"/>
          <operator activated="true" class="multiply" compatibility="5.0.11" expanded="true" height="94" name="Multiply (2)" width="90" x="315" y="30"/>
          <operator activated="true" class="x_validation" compatibility="5.0.11" expanded="true" height="112" name="XValidation" width="90" x="450" y="30">
            <parameter key="number_of_validations" value="5"/>
            <process expanded="true" height="660" width="519">
              <operator activated="true" class="linear_regression" compatibility="5.0.11" expanded="true" height="94" name="Linear Regression (2)" width="90" x="282" y="30">
                <parameter key="feature_selection" value="none"/>
              </operator>
              <connect from_port="training" to_op="Linear Regression (2)" to_port="training set"/>
              <connect from_op="Linear Regression (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="660" width="519">
              <operator activated="true" class="apply_model" compatibility="5.0.11" expanded="true" height="76" name="Applier" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_regression" compatibility="5.0.11" expanded="true" height="76" name="Performance" width="90" x="282" y="30"/>
              <connect from_port="model" to_op="Applier" to_port="model"/>
              <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/>
              <connect from_op="Applier" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="5.0.11" expanded="true" height="94" name="ProcessLog" width="90" x="581" y="30">
            <parameter key="filename" value="%{data}_%{ham}_%{mode}_fs.log"/>
            <list key="log">
              <parameter key="generation" value="operator.FS.value.generation"/>
              <parameter key="performance" value="operator.FS.value.performance"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Materialize Data" to_port="example set input"/>
          <connect from_op="Materialize Data" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_op="XValidation" to_port="training"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="ProcessLog" to_port="through 2"/>
          <connect from_op="XValidation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/>
          <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="FS" to_port="example set in"/>
      <connect from_op="FS" from_port="example set out" to_port="result 1"/>
      <connect from_op="FS" from_port="weights" to_port="result 2"/>
      <connect from_op="FS" from_port="performance" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="126"/>
    </process>
  </operator>
</process>
Best,
Markus

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Markus,
    if you replace the "Optimize Selection" operator with a Forward Selection or Backward Selection operator, you can log the currently tested attribute combination together with the performance from the cross-validation.
    This log can later be turned into a data set, and a relatively simple process can then find the best of all tested attribute combinations in each iteration.

    Here is a process that performs this post-processing:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
        <process expanded="true" height="463" width="710">
          <operator activated="true" class="retrieve" compatibility="5.0.11" expanded="true" height="60" name="Retrieve Generation Log" width="90" x="45" y="30">
            <parameter key="repository_entry" value="results/%{identifier} - Linear Regression Interaction %{round} Round Log"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="5.0.11" expanded="true" height="76" name="Report Selection Process" width="90" x="180" y="30">
            <process expanded="true" height="481" width="728">
              <operator activated="true" class="generate_copy" compatibility="5.0.11" expanded="true" height="76" name="Generate Copy (2)" width="90" x="45" y="30">
                <parameter key="attribute_name" value="attributes"/>
                <parameter key="new_name" value="chosenAttributes"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.0.11" expanded="true" height="76" name="Replace (4)" width="90" x="180" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="chosenAttributes"/>
                <parameter key="replace_what" value="(.+)"/>
                <parameter key="replace_by" value=",$1"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.0.11" expanded="true" height="76" name="Replace (5)" width="90" x="315" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="chosenAttributes"/>
                <parameter key="replace_what" value=",*(.*),[^,]*"/>
                <parameter key="replace_by" value="$1"/>
              </operator>
              <operator activated="true" class="replace_missing_values" compatibility="5.0.11" expanded="true" height="94" name="Replace Missing Values (2)" width="90" x="450" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="chosenAttributes"/>
                <parameter key="default" value="value"/>
                <list key="columns"/>
                <parameter key="replenishment_value" value="none"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="5.0.11" expanded="true" height="76" name="Filter Examples (3)" width="90" x="585" y="30">
                <parameter key="invert_filter" value="true"/>
              </operator>
              <operator activated="true" class="remember" compatibility="5.0.11" expanded="true" height="60" name="Remember (3)" width="90" x="45" y="120">
                <parameter key="name" value="data"/>
                <parameter key="io_object" value="ExampleSet"/>
              </operator>
              <operator activated="true" class="loop_values" compatibility="5.0.11" expanded="true" height="60" name="Loop Values (2)" width="90" x="180" y="120">
                <parameter key="attribute" value="chosenAttributes"/>
                <parameter key="iteration_macro" value="attribute"/>
                <process expanded="true">
                  <operator activated="true" class="filter_examples" compatibility="5.0.11" expanded="true" name="Filter Examples (4)">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="chosenAttributes=%{attribute}"/>
                  </operator>
                  <operator activated="true" class="sort" compatibility="5.0.11" expanded="true" name="Sort (3)">
                    <parameter key="attribute_name" value="sqrd cor"/>
                    <parameter key="sorting_direction" value="decreasing"/>
                  </operator>
                  <operator activated="true" class="filter_example_range" compatibility="5.0.11" expanded="true" name="Filter Example Range (2)">
                    <parameter key="first_example" value="1"/>
                    <parameter key="last_example" value="1"/>
                  </operator>
                  <operator activated="true" class="recall" compatibility="5.0.11" expanded="true" name="Recall (3)">
                    <parameter key="name" value="data"/>
                    <parameter key="io_object" value="ExampleSet"/>
                  </operator>
                  <operator activated="true" class="append" compatibility="5.0.11" expanded="true" name="Append (2)"/>
                  <operator activated="true" class="remember" compatibility="5.0.11" expanded="true" name="Remember (4)">
                    <parameter key="name" value="data"/>
                    <parameter key="io_object" value="ExampleSet"/>
                  </operator>
                  <connect from_port="example set" to_op="Filter Examples (4)" to_port="example set input"/>
                  <connect from_op="Filter Examples (4)" from_port="example set output" to_op="Sort (3)" to_port="example set input"/>
                  <connect from_op="Sort (3)" from_port="example set output" to_op="Filter Example Range (2)" to_port="example set input"/>
                  <connect from_op="Filter Example Range (2)" from_port="example set output" to_op="Append (2)" to_port="example set 1"/>
                  <connect from_op="Recall (3)" from_port="result" to_op="Append (2)" to_port="example set 2"/>
                  <connect from_op="Append (2)" from_port="merged set" to_op="Remember (4)" to_port="store"/>
                  <portSpacing port="source_example set" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="recall" compatibility="5.0.11" expanded="true" height="60" name="Recall (4)" width="90" x="315" y="120">
                <parameter key="name" value="data"/>
                <parameter key="io_object" value="ExampleSet"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.0.11" expanded="true" height="76" name="Select Attributes (2)" width="90" x="450" y="120">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="chosenAttributes"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.0.11" expanded="true" height="76" name="Replace (6)" width="90" x="585" y="120">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="attributes"/>
                <parameter key="replace_what" value=".*,([^,]*)"/>
                <parameter key="replace_by" value="$1"/>
              </operator>
              <operator activated="true" breakpoints="after" class="sort" compatibility="5.0.11" expanded="true" height="76" name="Sort (4)" width="90" x="45" y="210">
                <parameter key="attribute_name" value="sqrd cor"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="5.0.11" expanded="true" height="76" name="Generate Attributes" width="90" x="180" y="210">
                <list key="function_descriptions">
                  <parameter key="lowerBound" value="sqrdCorrelation - stdDeviation"/>
                  <parameter key="upperBound" value="sqrdCorrelation + stdDeviation"/>
                </list>
              </operator>
              <connect from_port="in 1" to_op="Generate Copy (2)" to_port="example set input"/>
              <connect from_op="Generate Copy (2)" from_port="example set output" to_op="Replace (4)" to_port="example set input"/>
              <connect from_op="Replace (4)" from_port="example set output" to_op="Replace (5)" to_port="example set input"/>
              <connect from_op="Replace (5)" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
              <connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Filter Examples (3)" to_port="example set input"/>
              <connect from_op="Filter Examples (3)" from_port="example set output" to_op="Remember (3)" to_port="store"/>
              <connect from_op="Filter Examples (3)" from_port="original" to_op="Loop Values (2)" to_port="example set"/>
              <connect from_op="Recall (4)" from_port="result" to_op="Select Attributes (2)" to_port="example set input"/>
              <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Replace (6)" to_port="example set input"/>
              <connect from_op="Replace (6)" from_port="example set output" to_op="Sort (4)" to_port="example set input"/>
              <connect from_op="Sort (4)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Generation Log" from_port="output" to_op="Report Selection Process" to_port="in 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
      Sebastian
  • MuehliMan Member Posts: 85 Maven
    Hi Sebastian,

    Thank you for this process, which works. However, it runs only after the feature selection has finished. I was looking for a solution that produces the desired results on the fly, so that you can follow the progress of the selection.

    I am currently running a feature selection on a large data set, which takes more than a day. I am also worried that the log will exceed the available memory and kill the process.

    So it is not possible to write a log entry only when a new feature has been selected?

    Best regards,
    Markus
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    no, there's no way. A log can only be updated when the Log operator is executed. Since there is no subprocess that is executed only when the number of attributes changes, there is no place to hook the Log operator in at the right point.
    I would suggest the following:
    Write the log to a file, append to it there, and clear the log in memory to prevent out-of-memory exceptions.

    Greetings,
    Sebastian
  • MuehliMan Member Posts: 85 Maven
    So Materialize Data and Free Memory should help here?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    actually I would use the "Clear Log" operator. :)

    Greetings,
      Sebastian
  • MuehliMan Member Posts: 85 Maven
    Well, that was a stupid question, sorry for that!
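
Sebastian's suggestion (append the log to a file and clear it in memory) can be illustrated outside RapidMiner with a small Python sketch. The class name `SpillingLog`, the row limit, and the file name are hypothetical; inside RapidMiner the same effect would come from the Log operator's file output combined with the "Clear Log" operator mentioned above.

```python
import os
import tempfile


class SpillingLog:
    """Collects log rows in memory and appends them to a file in batches."""

    def __init__(self, path: str, max_rows_in_memory: int = 1000) -> None:
        self.path = path
        self.max_rows_in_memory = max_rows_in_memory
        self.rows = []

    def add(self, attributes: str, performance: float) -> None:
        """Store one row; spill to disk once the in-memory limit is reached."""
        self.rows.append(f"{attributes} | {performance:.3f}")
        if len(self.rows) >= self.max_rows_in_memory:
            self.flush()

    def flush(self) -> None:
        """Append all buffered rows to the file, then clear the buffer."""
        if not self.rows:
            return
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write("\n".join(self.rows) + "\n")
        self.rows.clear()


# Demonstration with a temporary file and a deliberately tiny limit.
demo_path = os.path.join(tempfile.mkdtemp(), "fs.log")
log = SpillingLog(demo_path, max_rows_in_memory=2)
for i in range(5):
    log.add(f"atts{i}", 0.1 * i)

rows_still_in_memory = len(log.rows)  # rows not yet written to disk
log.flush()                           # write the remainder at the end

with open(demo_path, encoding="utf-8") as handle:
    lines_on_disk = handle.read().splitlines()
```

Because the file is opened in append mode, the partial log can be inspected while a long run is still in progress, and memory use stays bounded by the row limit.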