"How to carry out symbolic regression?"

mzn Member, University Professor Posts: 10
edited May 23 in Help
Are there any tutorials/examples on how to use RM to carry out symbolic regression?

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,682  RM Founder
    Hi @mzn
    Traditional approaches to symbolic regression often suffered from a phenomenon called feature bloat, which is why they are hardly used today.  They have been replaced by a combination of linear regression (for assigning coefficients) and automatic feature generation.  In RapidMiner you would combine the operators Generalized Linear Model and Automatic Feature Engineering for this.  The multi-objective optimization approach keeps the feature bloat in check and therefore reduces the risk of overfitting.  I have attached a small demo process below.
    I gave a presentation in London last week which also covered this to some degree.  For that discussion I used data similar to that in the example process mentioned above.  I attached a couple of relevant slides showing a simple linear regression model, a decision tree model, a GBT model, and a model consisting of linear regression combined with automatic feature engineering.  As in symbolic regression, the resulting formula can easily be read off; in this case it was prediction(y) = 10,550 * |x| + 7,565 * x * |x|^2 + 705 / |x| + 17,394.
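    For readers who want to see the idea outside RapidMiner, here is a minimal sketch in plain Python of "generated features plus linear regression for the coefficients". It is only an illustration and not what the Automatic Feature Engineering operator does internally: the feature set is fixed by hand to the three terms that appear in the formula above rather than searched for, and the coefficients in TRUE_W, the sample grid, and all function names are made-up values chosen for the example.

```python
# Sketch: hand-picked "generated" features, coefficients fitted by least squares.

def features(x):
    """Map a raw input x to the features |x|, x*|x|^2, 1/|x| and a constant 1."""
    ax = abs(x)
    return [ax, x * ax ** 2, 1.0 / ax, 1.0]

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit(xs, ys):
    """Ordinary least squares via the normal equations (F^T F) w = F^T y."""
    rows = [features(x) for x in xs]
    k = len(rows[0])
    ftf = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    fty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ftf, fty)

# Hypothetical ground-truth coefficients, used only to create noise-free demo data.
TRUE_W = [2.0, 0.5, 3.0, 1.0]
xs = [sign * k / 2 for k in range(1, 11) for sign in (-1, 1)]  # -5.0 ... 5.0, never 0
ys = [sum(w * f for w, f in zip(TRUE_W, features(x))) for x in xs]

weights = fit(xs, ys)
print("prediction(y) = {:.3f} * |x| + {:.3f} * x * |x|^2 + {:.3f} / |x| + {:.3f}".format(*weights))
```

    Because the demo data contains no noise, the fitted weights should match TRUE_W up to rounding error. The hard part of symbolic regression, deciding which features to generate in the first place, is exactly what this sketch leaves out.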
    Here are some relevant links:
    And finally the little demo process below.
    Hope this helps,
    Ingo
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="9.2.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="289">
            <parameter key="target_function" value="one variable non linear"/>
            <parameter key="number_examples" value="3000"/>
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="attributes_lower_bound" value="-25.0"/>
            <parameter key="attributes_upper_bound" value="25.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="1977"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="add_noise" compatibility="9.2.000" expanded="true" height="103" name="Add Noise" width="90" x="179" y="289">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="random_attributes" value="0"/>
            <parameter key="label_noise" value="0.01"/>
            <parameter key="default_attribute_noise" value="0.0"/>
            <list key="noise"/>
            <parameter key="offset" value="0.0"/>
            <parameter key="linear_factor" value="1.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data (2)" width="90" x="313" y="289">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="9.2.000" expanded="true" height="82" name="Generate ID" width="90" x="581" y="442">
            <parameter key="create_nominal_ids" value="false"/>
            <parameter key="offset" value="0"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply" width="90" x="447" y="187"/>
          <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.2.000" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="581" y="34">
            <parameter key="mode" value="feature selection and generation"/>
            <parameter key="balance for accuracy" value="1.0"/>
            <parameter key="show progress dialog" value="true"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="use optimization heuristics" value="false"/>
            <parameter key="maximum generations" value="100"/>
            <parameter key="population size" value="30"/>
            <parameter key="use multi-starts" value="true"/>
            <parameter key="number of multi-starts" value="5"/>
            <parameter key="generations until multi-start" value="10"/>
            <parameter key="use time limit" value="true"/>
            <parameter key="time limit in seconds" value="60"/>
            <parameter key="use subset for generation" value="false"/>
            <parameter key="maximum function complexity" value="6"/>
            <parameter key="use_plus" value="false"/>
            <parameter key="use_diff" value="false"/>
            <parameter key="use_mult" value="true"/>
            <parameter key="use_div" value="true"/>
            <parameter key="reciprocal_value" value="true"/>
            <parameter key="use_square_roots" value="true"/>
            <parameter key="use_exp" value="false"/>
            <parameter key="use_log" value="false"/>
            <parameter key="use_absolute_values" value="true"/>
            <parameter key="use_sgn" value="false"/>
            <parameter key="use_min" value="false"/>
            <parameter key="use_max" value="false"/>
            <process expanded="true">
              <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data" width="90" x="45" y="136">
                <enumeration key="partitions">
                  <parameter key="ratio" value="0.7"/>
                  <parameter key="ratio" value="0.3"/>
                </enumeration>
                <parameter key="sampling_type" value="automatic"/>
                <parameter key="use_local_random_seed" value="true"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model" width="90" x="179" y="34">
                <parameter key="family" value="AUTO"/>
                <parameter key="link" value="family_default"/>
                <parameter key="solver" value="AUTO"/>
                <parameter key="reproducible" value="false"/>
                <parameter key="maximum_number_of_threads" value="4"/>
                <parameter key="use_regularization" value="false"/>
                <parameter key="lambda" value="1.0"/>
                <parameter key="lambda_search" value="false"/>
                <parameter key="number_of_lambdas" value="0"/>
                <parameter key="lambda_min_ratio" value="0.0"/>
                <parameter key="early_stopping" value="true"/>
                <parameter key="stopping_rounds" value="3"/>
                <parameter key="stopping_tolerance" value="0.001"/>
                <parameter key="alpha" value="1.0"/>
                <parameter key="standardize" value="true"/>
                <parameter key="non-negative_coefficients" value="false"/>
                <parameter key="add_intercept" value="true"/>
                <parameter key="compute_p-values" value="false"/>
                <parameter key="remove_collinear_columns" value="false"/>
                <parameter key="missing_values_handling" value="MeanImputation"/>
                <parameter key="max_iterations" value="0"/>
                <parameter key="specify_beta_constraints" value="false"/>
                <list key="beta_constraints"/>
                <parameter key="max_runtime_seconds" value="0"/>
                <list key="expert_parameters"/>
              </operator>
              <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="380" y="136">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_regression" compatibility="9.2.000" expanded="true" height="82" name="Performance" width="90" x="514" y="136">
                <parameter key="main_criterion" value="root_mean_squared_error"/>
                <parameter key="root_mean_squared_error" value="true"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="prediction_average" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="example set source" to_op="Split Data" to_port="example set"/>
              <connect from_op="Split Data" from_port="partition 1" to_op="Generalized Linear Model" to_port="training set"/>
              <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Generalized Linear Model" from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance sink"/>
              <portSpacing port="source_example set source" spacing="0"/>
              <portSpacing port="sink_performance sink" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply (2)" width="90" x="715" y="34"/>
          <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.2.000" expanded="true" height="82" name="Apply Feature Set" width="90" x="849" y="187">
            <parameter key="handle missings" value="true"/>
            <parameter key="keep originals" value="false"/>
            <parameter key="originals special role" value="true"/>
          </operator>
          <operator activated="true" class="h2o:generalized_linear_model" compatibility="9.2.000" expanded="true" height="124" name="Generalized Linear Model (2)" width="90" x="983" y="187">
            <parameter key="family" value="AUTO"/>
            <parameter key="link" value="family_default"/>
            <parameter key="solver" value="AUTO"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_regularization" value="false"/>
            <parameter key="lambda_search" value="false"/>
            <parameter key="number_of_lambdas" value="0"/>
            <parameter key="lambda_min_ratio" value="0.0"/>
            <parameter key="early_stopping" value="true"/>
            <parameter key="stopping_rounds" value="3"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="standardize" value="true"/>
            <parameter key="non-negative_coefficients" value="false"/>
            <parameter key="add_intercept" value="true"/>
            <parameter key="compute_p-values" value="false"/>
            <parameter key="remove_collinear_columns" value="false"/>
            <parameter key="missing_values_handling" value="MeanImputation"/>
            <parameter key="max_iterations" value="0"/>
            <parameter key="specify_beta_constraints" value="false"/>
            <list key="beta_constraints"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.2.000" expanded="true" height="103" name="Multiply (3)" width="90" x="715" y="442"/>
          <operator activated="true" class="model_simulator:apply_feature_set" compatibility="9.2.000" expanded="true" height="82" name="Apply Feature Set (2)" width="90" x="849" y="340">
            <parameter key="handle missings" value="true"/>
            <parameter key="keep originals" value="false"/>
            <parameter key="originals special role" value="true"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="1117" y="340">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="9.2.000" expanded="true" height="82" name="Join" width="90" x="1251" y="442">
            <parameter key="remove_double_attributes" value="true"/>
            <parameter key="join_type" value="inner"/>
            <parameter key="use_id_attribute_as_key" value="true"/>
            <list key="key_attributes"/>
            <parameter key="keep_both_join_attributes" value="false"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
          <connect from_op="Add Noise" from_port="example set output" to_op="Split Data (2)" to_port="example set"/>
          <connect from_op="Split Data (2)" from_port="partition 1" to_op="Multiply" to_port="input"/>
          <connect from_op="Split Data (2)" from_port="partition 2" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply (3)" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Automatic Feature Engineering" to_port="example set in"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Apply Feature Set" to_port="example set"/>
          <connect from_op="Automatic Feature Engineering" from_port="feature set" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_op="Apply Feature Set" to_port="feature set"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Apply Feature Set (2)" to_port="feature set"/>
          <connect from_op="Apply Feature Set" from_port="example set" to_op="Generalized Linear Model (2)" to_port="training set"/>
          <connect from_op="Generalized Linear Model (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Multiply (3)" from_port="output 1" to_op="Apply Feature Set (2)" to_port="example set"/>
          <connect from_op="Multiply (3)" from_port="output 2" to_op="Join" to_port="right"/>
          <connect from_op="Apply Feature Set (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Join" to_port="left"/>
          <connect from_op="Apply Model (2)" from_port="model" to_port="result 1"/>
          <connect from_op="Join" from_port="join" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="315"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    RapidMiner Wisdom 2020
    February 11th and 12th 2020 in Boston, MA, USA

  • mzn Member, University Professor Posts: 10
    Thanks a lot Ingo. I am interested in the following:
    1. I have a set of data points (x1, x2, x3...) with a corresponding output (y1)
    2. I need to derive a relation (in the form of an equation) that links x1, x2, x3 to y1 such that I can predict the output for any input variables.
    3. Can I do this in RM? If yes, is there a simple example I/my graduate students can follow?
    4. Your youtube videos are very helpful! Thanks!
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,682  RM Founder
    Hi @mzn
    Thanks for your kind words :smile: 
    The process above is a cool example, but maybe not simple enough.  Pretty much all machine learning models in RapidMiner can be used for this task, but I would start with a simple linear regression.  The process below shows a simple example of this.  If you use the Model Simulator like I do in this example, the students can even play around with some of the inputs and see how the model reacts.  You can see the Simulator in this video (around minute 6:40): https://academy.rapidminer.com/learn/video/auto-model-classification
    More helpful videos on this can be found here: https://academy.rapidminer.com/catalog?label=search&value=regression
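    As a rough, tool-independent illustration of the same workflow (split the data 70/30, fit a linear regression, read off the equation, check it on the held-out part), here is a small plain-Python sketch. Everything in it is an assumption made for the example: the three inputs, the ground truth y = x1 + x2 + x3, the noise level, the seed, and the helper names. It is not a translation of the RapidMiner process.

```python
import random

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit_linear(rows, ys):
    """Least-squares fit of y = w1*x1 + ... + wk*xk + b; returns [w1, ..., wk, b]."""
    design = [r + [1.0] for r in rows]  # the trailing 1.0 carries the intercept
    k = len(design[0])
    gram = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    rhs = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(gram, rhs)

def predict(coef, row):
    """Apply the fitted equation to one input row."""
    return sum(c * v for c, v in zip(coef, row)) + coef[-1]

rng = random.Random(1992)  # arbitrary seed, fixed for repeatability
rows = [[rng.uniform(-10, 10) for _ in range(3)] for _ in range(1000)]
# Hypothetical ground truth: the sum of the inputs plus a little Gaussian noise.
ys = [sum(r) + rng.gauss(0, 0.1) for r in rows]

cut = int(0.7 * len(rows))  # first 70% for training, remaining 30% for testing
coef = fit_linear(rows[:cut], ys[:cut])
errors = [predict(coef, r) - y for r, y in zip(rows[cut:], ys[cut:])]
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5

terms = " + ".join(f"{c:.3f} * x{i + 1}" for i, c in enumerate(coef[:-1]))
print(f"y = {terms} {coef[-1]:+.3f}")
print(f"hold-out RMSE: {rmse:.3f}")
```

    The printed line is the equation being asked for; the hold-out RMSE indicates how far predictions on unseen rows are from the true outputs on average.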
    Hope this helps,
    Ingo
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="9.2.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
            <parameter key="target_function" value="sum"/>
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="5"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="add_noise" compatibility="9.2.000" expanded="true" height="103" name="Add Noise" width="90" x="179" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="random_attributes" value="5"/>
            <parameter key="label_noise" value="0.05"/>
            <parameter key="default_attribute_noise" value="0.0"/>
            <list key="noise"/>
            <parameter key="offset" value="0.0"/>
            <parameter key="linear_factor" value="1.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data" width="90" x="313" y="187">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="linear_regression" compatibility="9.2.000" expanded="true" height="103" name="Linear Regression" width="90" x="447" y="34">
            <parameter key="feature_selection" value="none"/>
            <parameter key="alpha" value="0.05"/>
            <parameter key="max_iterations" value="10"/>
            <parameter key="forward_alpha" value="0.05"/>
            <parameter key="backward_alpha" value="0.05"/>
            <parameter key="eliminate_colinear_features" value="true"/>
            <parameter key="min_tolerance" value="0.05"/>
            <parameter key="use_bias" value="true"/>
            <parameter key="ridge" value="1.0E-8"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="238">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="model_simulator:model_simulator" compatibility="9.2.000" expanded="true" height="103" name="Model Simulator" width="90" x="782" y="136"/>
          <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
          <connect from_op="Add Noise" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Linear Regression" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Linear Regression" from_port="exampleSet" to_op="Model Simulator" to_port="training data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Model Simulator" to_port="test data"/>
          <connect from_op="Apply Model" from_port="model" to_op="Model Simulator" to_port="model"/>
          <connect from_op="Model Simulator" from_port="simulator output" to_port="result 1"/>
          <connect from_op="Model Simulator" from_port="model output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="105"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
