RapidMiner

How can I sum up weights in each training round of a GBT or Random Forest?

SOLVED
Elite II


Hello,

I am using a GBT and/or Random Forest algorithm inside an X-Validation, which in turn sits inside a parameter optimization. The Gradient Boosted Trees and Random Forest operators can output the attribute weights for each model they build. I would like to catch those weights and sum them up from each training round (or cross-validation). I tried to pass the attribute weights out of the X-Validation with a Remember operator inside and a Recall operator outside, and then sum them up with the Aggregate operator. That way I would get an averaged importance of the attributes over the training rounds.

However, for each round my old values seem to disappear, as nothing is summed up. Why is that? Is it because of the parameter optimization?

5 REPLIES
RMStaff


Hi,

 

There is actually a little trick that makes this very easy.  You simply pass the weights from the GBT through to the Testing subprocess and convert them into an Example Set with the "Weights to Data" operator.  Then you deliver this data to the output port for the test set results.  As a consequence, the 10 weight vectors (for a 10-fold cross-validation) are all appended into one large data set, which is delivered outside of the cross-validation.  There you can simply aggregate the data grouped by attribute (e.g. build the average, sum, maximum, etc.).

 

The process below shows how this works:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="112" y="136">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.4.000" expanded="true" height="145" name="Validation" width="90" x="246" y="136">
        <parameter key="sampling_type" value="stratified sampling"/>
        <process expanded="true">
          <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.4.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="45" y="34">
            <list key="expert_parameters"/>
          </operator>
          <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
          <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
          <connect from_op="Gradient Boosted Trees" from_port="weights" to_port="through 1"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="84"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <operator activated="true" class="weights_to_data" compatibility="7.4.000" expanded="true" height="68" name="Weights to Data" width="90" x="45" y="136"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_port="through 1" to_op="Weights to Data" to_port="attribute weights"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Weights to Data" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="63"/>
          <portSpacing port="source_through 2" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.4.000" expanded="true" height="82" name="Aggregate" width="90" x="447" y="34">
        <list key="aggregation_attributes">
          <parameter key="Weight" value="sum"/>
        </list>
        <parameter key="group_by_attributes" value="Attribute"/>
      </operator>
      <connect from_op="Retrieve Sonar" from_port="output" to_op="Validation" to_port="example set"/>
      <connect from_op="Validation" from_port="test result set" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Validation" from_port="performance 1" to_port="result 2"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
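
For readers who want to see what the final Aggregate step computes, here is a minimal Python sketch of the same group-by-sum, outside of RapidMiner. The fold data and the attribute names are made up for illustration; only the column names Attribute and Weight are taken from the process above.

```python
from collections import defaultdict


def sum_weights(rows):
    """Sum the Weight values per Attribute (a group-by-sum)."""
    totals = defaultdict(float)
    for attribute, weight in rows:
        totals[attribute] += weight
    return dict(totals)


# Hypothetical weights from three folds, stacked into one table in the
# same way the cross-validation stacks its test set results.
stacked = [
    ("attribute_1", 0.5), ("attribute_2", 0.25),   # fold 1
    ("attribute_1", 0.25), ("attribute_2", 0.5),   # fold 2
    ("attribute_1", 0.75), ("attribute_2", 0.25),  # fold 3
]

totals = sum_weights(stacked)
print(totals)
```

Swapping the sum for an average or a maximum only changes the body of the helper, which mirrors choosing a different aggregation function in the Aggregate operator.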

Hope this helps,

Ingo


How to load processes in XML from the forum into RapidMiner: Read this!
Elite II


Thanks, this seems to work, but only within one complete cross-validation. For the next cross-validation it starts again from zero.

It does not sum up the values over several X-Validations, as the sum never goes above a certain value.

Inside the Parameter Optimization, or when I pass the weights out of the Parameter Optimization to sum them up there, the sum does not seem to increase either.

RMStaff


Ah, when I read "...to catch those weights and sum them up from each training round (or cross-validation)...", I thought you wanted to do exactly that.  Anyway, if you want to do this across multiple cross-validation runs, you would indeed need to make use of Remember and Recall.  Here is a quick demo:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="85">
        <list key="attribute_values">
          <parameter key="Attribute" value="&quot;dummy&quot;"/>
          <parameter key="Weight" value="0"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="remember" compatibility="7.4.000" expanded="true" height="68" name="Remember (2)" width="90" x="313" y="85">
        <parameter key="name" value="weights"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.4.000" expanded="true" height="103" name="Optimize Parameters (Grid)" width="90" x="447" y="34">
        <list key="parameters">
          <parameter key="Gradient Boosted Trees.number_of_trees" value="[5;20;3;linear]"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="7.4.000" expanded="true" height="145" name="Validation" width="90" x="246" y="34">
            <parameter key="sampling_type" value="stratified sampling"/>
            <process expanded="true">
              <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.4.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="45" y="34">
                <list key="expert_parameters"/>
              </operator>
              <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
              <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
              <connect from_op="Gradient Boosted Trees" from_port="weights" to_port="through 1"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
              <portSpacing port="sink_through 2" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
              <operator activated="true" class="weights_to_data" compatibility="7.4.000" expanded="true" height="68" name="Weights to Data" width="90" x="112" y="136"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_port="through 1" to_op="Weights to Data" to_port="attribute weights"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <connect from_op="Weights to Data" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="source_through 2" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a gradient boosted trees model.</description>
          </operator>
          <operator activated="true" class="recall" compatibility="7.4.000" expanded="true" height="68" name="Recall" width="90" x="313" y="340">
            <parameter key="name" value="weights"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.4.000" expanded="true" height="103" name="Append" width="90" x="447" y="289"/>
          <operator activated="true" class="remember" compatibility="7.4.000" expanded="true" height="68" name="Remember" width="90" x="581" y="289">
            <parameter key="name" value="weights"/>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="example set"/>
          <connect from_op="Validation" from_port="test result set" to_op="Append" to_port="example set 1"/>
          <connect from_op="Validation" from_port="performance 1" to_port="performance"/>
          <connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="Remember" to_port="store"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="recall" compatibility="7.4.000" expanded="true" height="68" name="Recall (2)" width="90" x="581" y="34">
        <parameter key="name" value="weights"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.4.000" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Attribute.does_not_equal.dummy"/>
        </list>
      </operator>
      <connect from_op="Retrieve Sonar" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Remember (2)" to_port="store"/>
      <connect from_op="Recall (2)" from_port="result" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
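
To make the Remember / Recall pattern in this process easier to follow, here is a rough Python sketch of the same accumulation loop. It is only an illustration under some assumptions: the store is a plain dictionary, `fake_cross_validation` is a hypothetical stand-in for the Validation operator, the parameter values in the loop are made up, and recalling is assumed to remove the object from the store (which is why the process remembers it again after Append).

```python
store = {}


def remember(name, table):
    """Put a table into the store under the given name."""
    store[name] = table


def recall(name):
    """Fetch a table; assumed to remove it from the store as well."""
    return store.pop(name)


def fake_cross_validation(number_of_trees):
    """Hypothetical stand-in returning stacked (Attribute, Weight) rows."""
    return [("attribute_1", 0.5), ("attribute_2", 0.5)]


# Seed the store with a dummy row so the very first recall finds something.
remember("weights", [("dummy", 0.0)])

# One iteration per parameter setting tried by the optimization.
for number_of_trees in (5, 10, 15, 20):
    new_rows = fake_cross_validation(number_of_trees)
    # Recall the rows collected so far, append the new ones, remember again.
    remember("weights", new_rows + recall("weights"))

# After the loop, recall once more and drop the dummy row.
collected = [row for row in recall("weights") if row[0] != "dummy"]
print(len(collected))
```

The dummy row plays the role of the Generate Data by User Specification operator: without it the first recall would have nothing to append to, and it is filtered out again at the end.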

Let me add that this does not actually make much sense to me, but I am sure you have your reasons for trying it :-)

 

Cheers,

Ingo


Elite II



Hi,

thanks, this was what I was looking for :-)

I wanted to get an averaged importance ranking for all the attributes over all the different parameter settings. But usually the best parameters come at the end, and often it is only one or a few parameter combinations. That's why I use the best parameters again to build the best model and additionally derive the most important attributes for that model in a separate round.

By the way, is it possible to reverse the order in which the grid search goes through the parameter values? The bigger or more complex parameter values often come at the end of the process. Is it possible to turn this around and start with the more complex values, going down to the simpler ones? Maybe that would save CPU time when one sees that the performance does not increase anymore at some point.

RMStaff


Hi Fred,

 

Got it.  Regarding reversing the direction: this is not possible with the "Grid" setting in the Optimize Parameters dialog, but you could switch it to "List" instead and enter the values manually in the opposite order.  This is of course not feasible for hundreds of values, but for a handful it works fine.

 

The following process shows a quick example.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="7.4.000" expanded="true" height="103" name="Optimize Parameters (Grid)" width="90" x="179" y="34">
        <list key="parameters">
          <parameter key="Gradient Boosted Trees.number_of_trees" value="20,15,10,5"/>
        </list>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="7.4.000" expanded="true" height="145" name="Validation" width="90" x="45" y="34">
            <parameter key="sampling_type" value="stratified sampling"/>
            <process expanded="true">
              <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.4.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="45" y="34">
                <parameter key="number_of_trees" value="5"/>
                <list key="expert_parameters"/>
              </operator>
              <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
              <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
              <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <connect from_op="Performance" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
              <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
            </process>
            <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a gradient boosted trees model.</description>
          </operator>
          <operator activated="true" class="log" compatibility="7.4.000" expanded="true" height="82" name="Log" width="90" x="179" y="34">
            <list key="log">
              <parameter key="trees" value="operator.Gradient Boosted Trees.parameter.number_of_trees"/>
              <parameter key="accuracy" value="operator.Validation.value.performance 1"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="example set"/>
          <connect from_op="Validation" from_port="performance 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Sonar" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
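
If typing the reversed list by hand gets tedious, a small script can generate it. The Python sketch below is only an illustration: it assumes a linear grid with `steps` intervals yields `steps + 1` evenly spaced values, and it rounds to integers because number_of_trees is an integer parameter. The function names are made up.

```python
def linear_grid(minimum, maximum, steps):
    """Return steps + 1 evenly spaced values from minimum to maximum."""
    width = (maximum - minimum) / steps
    return [minimum + i * width for i in range(steps + 1)]


def reversed_list_value(minimum, maximum, steps):
    """Build a comma-separated list of the grid values, largest first."""
    values = [round(v) for v in reversed(linear_grid(minimum, maximum, steps))]
    return ",".join(str(v) for v in values)


# Same range as the earlier grid definition [5;20;3;linear].
list_value = reversed_list_value(5, 20, 3)
print(list_value)
```

The resulting string can then be pasted as the value list for the parameter when the "List" setting is used.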

Hope this helps,

Ingo

