
Feature Request: In feature selection operators, allow specifying features to always include

christos_karras Member Posts: 50 Guru
It would be useful for the feature selection operators (Forward Selection, Backward Elimination, Optimize Selection, Optimize Selection (Evolutionary), etc.) to accept a starting set of features that are always included. This would help when we already have a working model built on a basic set of features and want to evaluate adding further features that might be expensive to obtain (for example due to implementation complexity or high CPU/memory usage).

In this situation, we don't want to evaluate the existing set of features: they shouldn't be considered for addition or removal in the iterations of the feature selection operator; they should simply always be there. What we want to know is whether any of the additional features bring a significant improvement to the model. We also want to avoid an outcome where the feature selection process chooses an expensive feature instead of one of our basic features.

For example, we might be able to calculate the mean value over a time window simply and efficiently, but need a more complex implementation and more CPU/RAM to get the median or 95th percentile (for instance, we can query the mean directly from the source system, but must query high-resolution raw data to compute the median or 95th percentile). Without the ability to enforce a fixed set of features, the feature selection process might tell us that the 95th percentile is more useful than the mean, but not how much we gained by selecting it (was it a 0.01% improvement in accuracy, or 10%?). If we could instead force the mean to be included in every feature set tested by the feature selection operators, we would expect the 95th percentile to be picked only if it adds a significant improvement on top of the mean, which is always there.

To work around the lack of this feature, I implemented a hack that hides our "Fixed Features" from the feature selection operator but adds them back when training a model in the operator's inner process. At each iteration, we train the model on our Fixed Features plus the subset of the remaining features that the feature selection operator chose for that iteration. I'm sharing an example of this process below, but it is meant only to illustrate the feature request, not as a real solution. It would be much easier and more maintainable if the feature selection operators could take a set of fixed features directly. I would expect the flexibility to provide this set in two ways:
* Using the usual feature specification options (regex, list of features, invert selection, etc.).
* Using an input ExampleSet that contains the list of features we want to use as "fixed features". This ExampleSet would have a structure similar to what we get from the "Weights to Data" operator.
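
To make the requested behaviour concrete outside of RapidMiner, here is a minimal Python sketch of greedy forward selection where a fixed set of features is part of every model tested. It is not how the RapidMiner operators are implemented. The function name, the `min_relative_gain` threshold, the feature names, and the toy scoring function are all assumptions made for illustration; in practice `score` would be something like a cross-validated performance.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def forward_selection_with_fixed(
    fixed: Sequence[str],
    candidates: Iterable[str],
    score: Callable[[List[str]], float],
    min_relative_gain: float = 0.05,  # hypothetical stopping threshold
) -> Tuple[List[str], float]:
    """Greedy forward selection that never drops the fixed features.

    score(features) must return a performance where higher is better.
    """
    selected: List[str] = []
    remaining = list(candidates)
    # Baseline: the model trained on the fixed features alone.
    best = score(list(fixed))
    while remaining:
        # Score each remaining candidate on top of fixed + already selected.
        trials = [(score(list(fixed) + selected + [c]), c) for c in remaining]
        trial_score, trial_feature = max(trials)
        # Stop when the best candidate does not clear the threshold.
        if trial_score - best <= min_relative_gain * abs(best):
            break
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best = trial_score
    return selected, best


def toy_score(features: List[str]) -> float:
    # Made-up contributions, standing in for a real cross-validation.
    gains = {"att1_mean": 0.70, "att1_median": 0.001, "att1_p95": 0.10}
    return sum(gains.get(f, 0.0) for f in features)


extra, final = forward_selection_with_fixed(
    fixed=["att1_mean"],
    candidates=["att1_median", "att1_p95"],
    score=toy_score,
)
print(extra, round(final, 3))
```

Because the baseline is the score of the fixed features alone, the returned list contains only the additional features that improve on that baseline, and the gain from each one can be read directly from the scores.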

<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>FixedFeaturesRegex</key>
        <value>.*mean$</value>
      </macro>
    </macros>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Root" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2000"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="9.6.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="target_function" value="driller oscillation timeseries"/>
        <parameter key="number_examples" value="500"/>
        <parameter key="number_of_attributes" value="5"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate label" width="90" x="179" y="34">
        <list key="function_descriptions">
          <parameter key="label" value="att1*att2+att5/att4"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.6.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="9.6.000" expanded="true" height="82" name="Generate expensive features" width="90" x="514" y="34">
        <process expanded="true">
          <operator activated="true" class="time_series:process_windows" compatibility="9.6.000" expanded="true" height="82" name="Process Windows" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="real"/>
            <parameter key="block_type" value="value_series"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_series_end"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="has_indices" value="false"/>
            <parameter key="indices_attribute" value="Date"/>
            <parameter key="window_size" value="30"/>
            <parameter key="no_overlapping_windows" value="false"/>
            <parameter key="step_size" value="1"/>
            <parameter key="create_horizon_(labels)" value="true"/>
            <parameter key="horizon_attribute" value="label"/>
            <parameter key="horizon_size" value="1"/>
            <parameter key="horizon_offset" value="0"/>
            <parameter key="add_last_index_in_window_attribute" value="true"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="time_series:extract_std_descriptive_features" compatibility="9.6.000" expanded="true" height="82" name="Extract Aggregates" origin="GENERATED_TUTORIAL" width="90" x="447" y="34">
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="real"/>
                <parameter key="block_type" value="value_series"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_series_end"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="sum" value="true"/>
                <parameter key="mean" value="true"/>
                <parameter key="geometric_mean" value="true"/>
                <parameter key="first_quartile" value="true"/>
                <parameter key="median" value="true"/>
                <parameter key="third_quartile" value="true"/>
                <parameter key="min" value="true"/>
                <parameter key="max" value="true"/>
                <parameter key="std_deviation" value="true"/>
                <parameter key="kurtosis" value="true"/>
                <parameter key="skewness" value="true"/>
                <parameter key="add_time_series_name" value="true"/>
                <parameter key="ignore_invalid_values" value="false"/>
              </operator>
              <connect from_port="windowed example set" to_op="Extract Aggregates" to_port="example set"/>
              <connect from_op="Extract Aggregates" from_port="features" to_port="output 1"/>
              <portSpacing port="source_windowed example set" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="9.6.000" expanded="true" height="82" name="Append" width="90" x="179" y="34">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <connect from_port="in 1" to_op="Process Windows" to_port="example set"/>
          <connect from_op="Process Windows" from_port="output 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" height="103" name="Multiply" width="90" x="916" y="34"/>
      <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Exclude Fixed Features" width="90" x="1519" y="34">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="regular_expression" value="%{FixedFeaturesRegex}"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Fixed Features" width="90" x="1117" y="187">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="regular_expression" value="%{FixedFeaturesRegex}"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="remember" compatibility="9.6.000" expanded="true" height="68" name="Remember FixedFeaturesExampleSet" width="90" x="1318" y="187">
        <parameter key="name" value="FixedFeaturesExampleSet"/>
        <parameter key="io_object" value="ExampleSet"/>
        <parameter key="store_which" value="1"/>
        <parameter key="remove_from_process" value="true"/>
      </operator>
      <operator activated="true" class="optimize_selection_forward" compatibility="9.6.000" expanded="true" height="103" name="Forward Selection" origin="GENERATED_TUTORIAL" width="90" x="1720" y="34">
        <parameter key="maximal_number_of_attributes" value="10"/>
        <parameter key="speculative_rounds" value="0"/>
        <parameter key="stopping_behavior" value="without increase"/>
        <parameter key="use_relative_increase" value="true"/>
        <parameter key="alpha" value="0.05"/>
        <process expanded="true">
          <operator activated="true" class="recall" compatibility="9.6.000" expanded="true" height="68" name="Recall FixedFeaturesExampleSet" width="90" x="112" y="187">
            <parameter key="name" value="FixedFeaturesExampleSet"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="remove_from_store" value="false"/>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="9.6.000" expanded="true" height="82" name="Join" width="90" x="313" y="85">
            <parameter key="remove_double_attributes" value="true"/>
            <parameter key="join_type" value="inner"/>
            <parameter key="use_id_attribute_as_key" value="true"/>
            <list key="key_attributes">
              <parameter key="DateTime" value="DateTime"/>
            </list>
            <parameter key="keep_both_join_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">Join the current subset of features (chosen by the feature selection operator for this iteration) to the &amp;quot;fixed features&amp;quot; that we always want to keep</description>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.6.000" expanded="true" height="145" name="Cross Validation" origin="GENERATED_TUTORIAL" width="90" x="514" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="9.6.000" expanded="true" height="82" name="K-NN (2)" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
                <parameter key="k" value="5"/>
                <parameter key="weighted_vote" value="false"/>
                <parameter key="measure_types" value="MixedMeasures"/>
                <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
                <parameter key="nominal_measure" value="NominalDistance"/>
                <parameter key="numerical_measure" value="EuclideanDistance"/>
                <parameter key="divergence" value="GeneralizedIDivergence"/>
                <parameter key="kernel_type" value="radial"/>
                <parameter key="kernel_gamma" value="1.0"/>
                <parameter key="kernel_sigma1" value="1.0"/>
                <parameter key="kernel_sigma2" value="0.0"/>
                <parameter key="kernel_sigma3" value="2.0"/>
                <parameter key="kernel_degree" value="3.0"/>
                <parameter key="kernel_shift" value="1.0"/>
                <parameter key="kernel_a" value="1.0"/>
                <parameter key="kernel_b" value="0.0"/>
              </operator>
              <connect from_port="training set" to_op="K-NN (2)" to_port="training set"/>
              <connect from_op="K-NN (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.6.000" expanded="true" height="82" name="Apply Model (2)" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance" compatibility="9.6.000" expanded="true" height="82" name="Performance (2)" origin="GENERATED_TUTORIAL" width="90" x="313" y="34">
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set" to_op="Join" to_port="left"/>
          <connect from_op="Recall FixedFeaturesExampleSet" from_port="result" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_performance" spacing="36"/>
        </process>
      </operator>
      <operator activated="true" class="weights_to_data" compatibility="9.6.000" expanded="true" height="68" name="Weights to Data" width="90" x="1251" y="34"/>
      <connect from_op="Generate Data" from_port="output" to_op="Generate label" to_port="example set input"/>
      <connect from_op="Generate label" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Generate expensive features" to_port="in 1"/>
      <connect from_op="Generate expensive features" from_port="out 1" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Exclude Fixed Features" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Select Fixed Features" to_port="example set input"/>
      <connect from_op="Exclude Fixed Features" from_port="example set output" to_op="Forward Selection" to_port="example set"/>
      <connect from_op="Select Fixed Features" from_port="example set output" to_op="Remember FixedFeaturesExampleSet" to_port="store"/>
      <connect from_op="Forward Selection" from_port="example set" to_port="result 1"/>
      <connect from_op="Forward Selection" from_port="attribute weights" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="168"/>
      <description align="center" color="yellow" colored="false" height="169" resized="true" width="411" x="444" y="139">Generate additional features that might be expensive to obtain (complex implementation, additional data integration needed, high CPU or memory usage, etc).&lt;br&gt;&lt;br&gt;We want to add them in our final model only if a Feature Selection method demonstrates that they are helpful in addition to the basic features, and to only add the subset of additional features that really improve the model.</description>
      <description align="center" color="yellow" colored="false" height="109" resized="true" width="497" x="988" y="280">These attributes will always be included when training a model.&lt;br&gt;&lt;br&gt;In this example, we keep all &amp;quot;means&amp;quot; by default because we know they are simple and efficient to compute, and want to know which additional features bring an improvement to the model.</description>
      <description align="center" color="yellow" colored="false" height="184" resized="true" width="497" x="1515" y="168">The forward selection operator won't be aware of the existence of the fixed features, so it will never attempt any combination of features where they are excluded.&lt;br&gt;&lt;br&gt;However, inside the operator, we add back the features that have been excluded, so all models tested include these FixedFeatures.&lt;br&gt;The result from this process will be a list of features that are most useful on top of the existing FixedFeatures.</description>
    </process>
  </operator>
</process>
