Feature Request: Explain Predictions: Select the top X attributes by importance for the imp port
christos_karras
Member Posts: 50 Guru
in Help
It seems there's a missing feature with Explain Predictions: the "imp" output port always returns the importance of all attributes for all examples. I would like to get only the top X for both positive and negative importance for visualization. The "exa" port uses the "maximal explaining attributes" parameter for this, but it does not return the information in a format adequate for the visualization I'm trying to build.
I experimented with a few convoluted solutions to retrieve this list of top X attributes by group but I can't get to a reasonably simple solution.
I was trying to do something similar to this SQL query but did not find a simple solution to implement it:
-- top 5 positive importances
SELECT * FROM (
  SELECT t.*, row_number() OVER (PARTITION BY GroupingColumn1, GroupingColumn2 ORDER BY Importance DESC) AS rn
  FROM importance_table t
) ranked_desc
WHERE rn <= 5
UNION
-- top 5 negative importances
SELECT * FROM (
  SELECT t.*, row_number() OVER (PARTITION BY GroupingColumn1, GroupingColumn2 ORDER BY Importance ASC) AS rn
  FROM importance_table t
) ranked_asc
WHERE rn <= 5
What would be the simplest way to do this in RapidMiner?
I would also like to create a feature request to add a boolean option "Apply maximal explaining attributes to the imp output port" to the Explain Predictions operator to avoid the need to implement this kind of filtering in the future.
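For readers who want to see the intended selection outside of SQL, here is a minimal Python sketch of "top X positive and top X negative importances per group". It assumes the long-format output has one record per (example, attribute) pair with the hypothetical keys `Row No`, `Name` and `Importance`; the sample values are made up for illustration. Unlike the SQL above, it splits by sign first, so a row can never be selected twice.

```python
from collections import defaultdict

def top_positive_and_negative(rows, group_key, x):
    """Per group, keep the x largest positive and the x most negative importances."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    selected = []
    for key in sorted(groups):
        members = groups[key]
        # Largest positive importances first.
        positives = sorted((r for r in members if r["Importance"] > 0),
                           key=lambda r: r["Importance"], reverse=True)[:x]
        # Most negative importances first.
        negatives = sorted((r for r in members if r["Importance"] < 0),
                           key=lambda r: r["Importance"])[:x]
        selected.extend(positives + negatives)
    return selected

# Illustrative data only; column names and values are assumptions.
rows = [
    {"Row No": 1, "Name": "Outlook", "Importance": 0.40},
    {"Row No": 1, "Name": "Temperature", "Importance": -0.10},
    {"Row No": 1, "Name": "Humidity", "Importance": -0.30},
    {"Row No": 1, "Name": "Wind", "Importance": 0.05},
    {"Row No": 2, "Name": "Outlook", "Importance": -0.20},
    {"Row No": 2, "Name": "Temperature", "Importance": 0.15},
    {"Row No": 2, "Name": "Humidity", "Importance": 0.25},
    {"Row No": 2, "Name": "Wind", "Importance": -0.05},
]

result = top_positive_and_negative(rows, "Row No", 1)
for r in result:
    print(r["Row No"], r["Name"], r["Importance"])
```

With `x=1` this keeps two records per example: the strongest supporting and the strongest contradicting attribute.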
Best Answers
MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
Hi @christos_karras,
here is a process which does it. It's not too crazy, so I am not sure if this justifies a new parameter.
Cheers,
Martin
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="85">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="h2o:generalized_linear_model" compatibility="9.3.001" expanded="true" height="124" name="Generalized Linear Model" width="90" x="313" y="85">
<parameter key="family" value="AUTO"/>
<parameter key="link" value="family_default"/>
<parameter key="solver" value="AUTO"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_regularization" value="true"/>
<parameter key="lambda_search" value="false"/>
<parameter key="number_of_lambdas" value="0"/>
<parameter key="lambda_min_ratio" value="0.0"/>
<parameter key="early_stopping" value="true"/>
<parameter key="stopping_rounds" value="3"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="standardize" value="true"/>
<parameter key="non-negative_coefficients" value="false"/>
<parameter key="add_intercept" value="true"/>
<parameter key="compute_p-values" value="false"/>
<parameter key="remove_collinear_columns" value="false"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_iterations" value="0"/>
<parameter key="specify_beta_constraints" value="false"/>
<list key="beta_constraints"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply" width="90" x="447" y="106"/>
<operator activated="true" class="model_simulator:explain_predictions" compatibility="9.6.000" expanded="true" height="124" name="Explain Predictions" width="90" x="581" y="85">
<parameter key="maximal explaining attributes" value="3"/>
<parameter key="local sample size" value="500"/>
<parameter key="only create predictions" value="false"/>
<parameter key="normalize global weights" value="false"/>
<parameter key="sort_weights" value="true"/>
<parameter key="sort_direction" value="descending"/>
</operator>
<operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.4.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="715" y="136">
<parameter key="group_by_attribute" value="Row No"/>
<parameter key="group_by_attribute (numerical)" value=""/>
<parameter key="sorting_order" value="none"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="849" y="136">
<parameter key="set_iteration_macro" value="false"/>
<parameter key="macro_name" value="iteration"/>
<parameter key="macro_start_value" value="1"/>
<parameter key="unfold" value="false"/>
<process expanded="true">
<operator activated="true" class="sort" compatibility="9.6.000" expanded="true" height="82" name="Sort" width="90" x="112" y="34">
<parameter key="attribute_name" value="Value"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="3"/>
<parameter key="invert_filter" value="false"/>
</operator>
<connect from_port="single" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="9.6.000" expanded="true" height="82" name="Append" width="90" x="983" y="136">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_op="Retrieve Golf" from_port="output" to_op="Generalized Linear Model" to_port="training set"/>
<connect from_op="Generalized Linear Model" from_port="model" to_op="Explain Predictions" to_port="model"/>
<connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Explain Predictions" to_port="training data"/>
<connect from_op="Multiply" from_port="output 2" to_op="Explain Predictions" to_port="test data"/>
<connect from_op="Explain Predictions" from_port="importances output" to_op="Group Into Collection" to_port="exa"/>
<connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
yup I agree with @mschmitz - pretty easy to do this with a few operators. You could also just wrap these into one subprocess + then turn it into a building block or a new "custom operator"
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="85">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="h2o:generalized_linear_model" compatibility="9.3.001" expanded="true" height="124" name="Generalized Linear Model" width="90" x="313" y="85">
<parameter key="family" value="AUTO"/>
<parameter key="link" value="family_default"/>
<parameter key="solver" value="AUTO"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_regularization" value="true"/>
<parameter key="lambda_search" value="false"/>
<parameter key="number_of_lambdas" value="0"/>
<parameter key="lambda_min_ratio" value="0.0"/>
<parameter key="early_stopping" value="true"/>
<parameter key="stopping_rounds" value="3"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="standardize" value="true"/>
<parameter key="non-negative_coefficients" value="false"/>
<parameter key="add_intercept" value="true"/>
<parameter key="compute_p-values" value="false"/>
<parameter key="remove_collinear_columns" value="false"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_iterations" value="0"/>
<parameter key="specify_beta_constraints" value="false"/>
<list key="beta_constraints"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply" width="90" x="447" y="106"/>
<operator activated="true" class="model_simulator:explain_predictions" compatibility="9.6.000" expanded="true" height="124" name="Explain Predictions" width="90" x="581" y="85">
<parameter key="maximal explaining attributes" value="3"/>
<parameter key="local sample size" value="500"/>
<parameter key="only create predictions" value="false"/>
<parameter key="normalize global weights" value="false"/>
<parameter key="sort_weights" value="true"/>
<parameter key="sort_direction" value="descending"/>
</operator>
<operator activated="true" class="subprocess" compatibility="9.6.000" expanded="true" height="82" name="Subprocess" width="90" x="715" y="85">
<process expanded="true">
<operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.3.000" expanded="true" height="82" name="Group Into Collection" width="90" x="45" y="34">
<parameter key="group_by_attribute" value="Row No"/>
<parameter key="group_by_attribute (numerical)" value=""/>
<parameter key="sorting_order" value="none"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34">
<parameter key="set_iteration_macro" value="false"/>
<parameter key="macro_name" value="iteration"/>
<parameter key="macro_start_value" value="1"/>
<parameter key="unfold" value="false"/>
<process expanded="true">
<operator activated="true" class="sort" compatibility="9.6.000" expanded="true" height="82" name="Sort" width="90" x="112" y="34">
<parameter key="attribute_name" value="Value"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="3"/>
<parameter key="invert_filter" value="false"/>
</operator>
<connect from_port="single" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="9.6.000" expanded="true" height="82" name="Append" width="90" x="313" y="34">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_port="in 1" to_op="Group Into Collection" to_port="exa"/>
<connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">turn this into a building block or custom operator</description>
</operator>
<connect from_op="Retrieve Golf" from_port="output" to_op="Generalized Linear Model" to_port="training set"/>
<connect from_op="Generalized Linear Model" from_port="model" to_op="Explain Predictions" to_port="model"/>
<connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Explain Predictions" to_port="training data"/>
<connect from_op="Multiply" from_port="output 2" to_op="Explain Predictions" to_port="test data"/>
<connect from_op="Explain Predictions" from_port="importances output" to_op="Subprocess" to_port="in 1"/>
<connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Scott
Answers
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="false" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve Golf" width="90" x="112" y="289">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="generate_data" compatibility="9.6.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="85">
<parameter key="target_function" value="random"/>
<parameter key="number_examples" value="10000"/>
<parameter key="number_of_attributes" value="5"/>
<parameter key="attributes_lower_bound" value="-10.0"/>
<parameter key="attributes_upper_bound" value="10.0"/>
<parameter key="gaussian_standard_deviation" value="10.0"/>
<parameter key="largest_radius" value="10.0"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<description align="center" color="transparent" colored="false" width="126">Maybe Change to 100k for testing.</description>
</operator>
<operator activated="true" class="h2o:generalized_linear_model" compatibility="9.3.001" expanded="true" height="124" name="Generalized Linear Model" width="90" x="313" y="85">
<parameter key="family" value="AUTO"/>
<parameter key="link" value="family_default"/>
<parameter key="solver" value="AUTO"/>
<parameter key="reproducible" value="false"/>
<parameter key="maximum_number_of_threads" value="4"/>
<parameter key="use_regularization" value="true"/>
<parameter key="lambda_search" value="false"/>
<parameter key="number_of_lambdas" value="0"/>
<parameter key="lambda_min_ratio" value="0.0"/>
<parameter key="early_stopping" value="true"/>
<parameter key="stopping_rounds" value="3"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="standardize" value="true"/>
<parameter key="non-negative_coefficients" value="false"/>
<parameter key="add_intercept" value="true"/>
<parameter key="compute_p-values" value="false"/>
<parameter key="remove_collinear_columns" value="false"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_iterations" value="0"/>
<parameter key="specify_beta_constraints" value="false"/>
<list key="beta_constraints"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply" width="90" x="447" y="136"/>
<operator activated="true" class="model_simulator:explain_predictions" compatibility="9.6.000" expanded="true" height="124" name="Explain Predictions" width="90" x="581" y="85">
<parameter key="maximal explaining attributes" value="3"/>
<parameter key="local sample size" value="500"/>
<parameter key="only create predictions" value="false"/>
<parameter key="normalize global weights" value="false"/>
<parameter key="sort_weights" value="true"/>
<parameter key="sort_direction" value="descending"/>
</operator>
<operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.4.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="715" y="136">
<parameter key="group_by_attribute" value="Row No"/>
<parameter key="group_by_attribute (numerical)" value=""/>
<parameter key="sorting_order" value="none"/>
</operator>
<operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="916" y="136">
<parameter key="set_iteration_macro" value="false"/>
<parameter key="macro_name" value="iteration"/>
<parameter key="macro_start_value" value="1"/>
<parameter key="unfold" value="false"/>
<process expanded="true">
<operator activated="true" class="parse_numbers" compatibility="9.6.000" expanded="true" height="82" name="Parse Numbers" width="90" x="45" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Name"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="decimal_character" value="."/>
<parameter key="grouped_digits" value="false"/>
<parameter key="grouping_character" value=","/>
<parameter key="infinity_representation" value=""/>
<parameter key="unparsable_value_handling" value="fail"/>
</operator>
<operator activated="true" class="sort" compatibility="9.6.000" expanded="true" height="82" name="Sort" width="90" x="179" y="34">
<parameter key="attribute_name" value="Value"/>
<parameter key="sorting_direction" value="decreasing"/>
</operator>
<operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="3"/>
<parameter key="invert_filter" value="false"/>
</operator>
<connect from_port="single" to_op="Parse Numbers" to_port="example set input"/>
<connect from_op="Parse Numbers" from_port="example set output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="9.6.000" expanded="true" height="82" name="Append" width="90" x="1050" y="136">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generalized Linear Model" to_port="training set"/>
<connect from_op="Generalized Linear Model" from_port="model" to_op="Explain Predictions" to_port="model"/>
<connect from_op="Generalized Linear Model" from_port="exampleSet" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Explain Predictions" to_port="training data"/>
<connect from_op="Multiply" from_port="output 2" to_op="Explain Predictions" to_port="test data"/>
<connect from_op="Explain Predictions" from_port="importances output" to_op="Group Into Collection" to_port="exa"/>
<connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
<connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Dortmund, Germany
My other concern is about users with Professional licenses. For example, what would happen in the following scenario to a user whose Professional license is limited to 100,000 rows?
* Explain Predictions is used on a test data set of 10,000 rows and 30 columns
* The output of the "imp" port has 300,000 rows
* This exceeds the 100,000 row limit of the license: what happens here?
* After filtering the 3 most important features for each data row, the result has 30,000 rows, which is acceptable for the license
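The row counts in this scenario follow from simple multiplication, sketched below. The sketch assumes, as the scenario states, that the "imp" port emits one row per (example, attribute) pair; the function and constant names are hypothetical, and it says nothing about how RapidMiner actually enforces the license limit.

```python
LICENSE_ROW_LIMIT = 100_000  # Professional license limit quoted in the scenario

def unfiltered_importance_rows(n_examples, n_attributes):
    """One importance row per (example, attribute) pair."""
    return n_examples * n_attributes

def filtered_importance_rows(n_examples, n_attributes, top_x):
    """Rows left after keeping at most top_x attributes per example."""
    return n_examples * min(top_x, n_attributes)

unfiltered = unfiltered_importance_rows(10_000, 30)
filtered = filtered_importance_rows(10_000, 30, 3)

print(unfiltered, unfiltered > LICENSE_ROW_LIMIT)  # 300000 True
print(filtered, filtered > LICENSE_ROW_LIMIT)      # 30000 False
```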
Thanks
Dortmund, Germany
Not sure if you guys had the chance to connect, but I wanted to let you know that a new parameter "apply maximum to importances output" will be part of the upcoming 9.7 release, so the postprocessing will no longer be needed.
Also, the process above had two errors. First, it sorted by the column "Value" when it should have used "Importance". Second, it should not sort by the signed "Importance" at all but by the absolute value of the importance. The process below fixes both problems and also contains the alternative path with the new parameter (which obviously will only work for you guys after the 9.7 release).
Hope this helps,
Ingo
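As an illustration of the second correction (ranking by absolute importance rather than signed importance), here is a small Python sketch. It is not the RapidMiner process itself; the record keys `Row No`, `Name` and `Importance` and the sample values are assumptions made for the example.

```python
from collections import defaultdict

def top_by_absolute_importance(rows, n, group_key="Row No", value_key="Importance"):
    """Per group, keep the n rows whose importance has the largest magnitude."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    result = []
    for key in sorted(groups):
        # Rank by |importance| so strong negative contributions are kept too.
        ranked = sorted(groups[key], key=lambda r: abs(r[value_key]), reverse=True)
        result.extend(ranked[:n])
    return result

# Illustrative data only.
rows = [
    {"Row No": 1, "Name": "Outlook", "Importance": 0.40},
    {"Row No": 1, "Name": "Temperature", "Importance": -0.10},
    {"Row No": 1, "Name": "Humidity", "Importance": -0.30},
    {"Row No": 1, "Name": "Wind", "Importance": 0.05},
]

top2 = top_by_absolute_importance(rows, 2)
print([r["Name"] for r in top2])  # ['Outlook', 'Humidity']
```

Sorting the same rows by signed importance would instead keep Outlook and Wind and drop Humidity, even though Humidity has the second-largest effect on the prediction.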