"How to carry out symbolic regression?"

mzn Member, University Professor Posts: 10
edited May 2019 in Help
Are there any tutorials or examples on how to use RM to carry out symbolic regression?


  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,749  RM Founder
    Hi @mzn
    Traditional approaches to symbolic regression often suffered from a phenomenon called feature bloat, which is why they are hardly used any longer.  They have been replaced by a combination of linear regression (for assigning coefficients) and automatic feature generation.  In RapidMiner you would combine the operators Generalized Linear Models and Automatic Feature Engineering for this.  The multi-objective optimization approach keeps the feature bloat in check and therefore reduces the risk of overfitting.  I have attached a small demo process below.
    I gave a presentation in London last week which also covered this to some degree.  For that talk I used data similar to that in the example process mentioned above.  I attached a couple of relevant slides showing a simple linear regression model, a decision tree model, a GBT model, and a model consisting of linear regression combined with automatic feature engineering.  As in symbolic regression, the resulting formula is easy to read off; in this case it was prediction(y) = 10,550 * |x| + 7,565 * x * |x|^2 + 705 / |x| + 17,394.
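For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of "generated features plus a linear model for the coefficients". It is not the Automatic Feature Engineering operator: the candidate features are hand-picked rather than searched by multi-objective optimization, the fit is plain least squares via NumPy, and the data and its coefficients (2, 0.5, 3, 1) are made up for illustration.

```python
import numpy as np

def build_features(x):
    """Return hand-picked nonlinear candidate features of x and their names."""
    abs_x = np.abs(x)
    columns = [abs_x, x * abs_x**2, 1.0 / abs_x, x]
    names = ["|x|", "x*|x|^2", "1/|x|", "x"]
    return np.column_stack(columns), names

def fit_formula(x, y):
    """Least-squares fit of y on the candidate features plus an intercept."""
    features, names = build_features(x)
    design = np.column_stack([features, np.ones_like(x)])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return dict(zip(names + ["intercept"], coefs))

# Hypothetical noise-free data; x is kept away from 0 so 1/|x| stays finite.
rng = np.random.default_rng(0)
magnitude = rng.uniform(0.5, 5.0, size=200)
sign = rng.choice([-1.0, 1.0], size=200)
x = magnitude * sign
y = 2.0 * np.abs(x) + 0.5 * x * np.abs(x)**2 + 3.0 / np.abs(x) + 1.0

formula = fit_formula(x, y)
# Print the fitted model as a readable formula, one term per feature.
print("prediction(y) = " + " ".join(f"{c:+.3f}*[{n}]" for n, c in formula.items()))
```

Because the unused candidate `x` gets a coefficient close to zero here, the printed formula shows which generated features actually matter. With noisy data and many candidates, something has to keep the number of features in check, which is the role the multi-objective optimization plays in the post above.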
    Here are some relevant links:
    And finally the little demo process below.
    Hope this helps,
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  • mzn Member, University Professor Posts: 10
    Thanks a lot Ingo. I am interested in the following:
    1. I have a set of data points (x1, x2, x3...) with a corresponding output (y1)
    2. I need to derive a relation (in the form of an equation) that links x1, x2, x3 to y1 so that I can predict the output for any input variables.
    3. Can I do this in RM? If yes, is there a simple example that my graduate students and I can follow?
    4. Your youtube videos are very helpful! Thanks!
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,749  RM Founder
    Hi @mzn
    Thanks for your kind words :smile: 
    The process above is a cool example, but maybe not simple enough.  Pretty much all machine learning models in RapidMiner can be used for this task, but I would start with a simple linear regression.  The process below shows a simple example of this.  If you use the Model Simulator as I do in this example, the students can even play around with some of the inputs and see how the model reacts.  You can see the Simulator in this video (around minute 6:40): https://academy.rapidminer.com/learn/video/auto-model-classification
    More helpful videos on this can be found here: https://academy.rapidminer.com/catalog?label=search&value=regression
    Hope this helps,
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="9.2.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
            <parameter key="target_function" value="sum"/>
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="5"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="add_noise" compatibility="9.2.000" expanded="true" height="103" name="Add Noise" width="90" x="179" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="random_attributes" value="5"/>
            <parameter key="label_noise" value="0.05"/>
            <parameter key="default_attribute_noise" value="0.0"/>
            <list key="noise"/>
            <parameter key="offset" value="0.0"/>
            <parameter key="linear_factor" value="1.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.2.000" expanded="true" height="103" name="Split Data" width="90" x="313" y="187">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="linear_regression" compatibility="9.2.000" expanded="true" height="103" name="Linear Regression" width="90" x="447" y="34">
            <parameter key="feature_selection" value="none"/>
            <parameter key="alpha" value="0.05"/>
            <parameter key="max_iterations" value="10"/>
            <parameter key="forward_alpha" value="0.05"/>
            <parameter key="backward_alpha" value="0.05"/>
            <parameter key="eliminate_colinear_features" value="true"/>
            <parameter key="min_tolerance" value="0.05"/>
            <parameter key="use_bias" value="true"/>
            <parameter key="ridge" value="1.0E-8"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="238">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="model_simulator:model_simulator" compatibility="9.2.000" expanded="true" height="103" name="Model Simulator" width="90" x="782" y="136"/>
          <connect from_op="Generate Data" from_port="output" to_op="Add Noise" to_port="example set input"/>
          <connect from_op="Add Noise" from_port="example set output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Linear Regression" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Linear Regression" from_port="exampleSet" to_op="Model Simulator" to_port="training data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Model Simulator" to_port="test data"/>
          <connect from_op="Apply Model" from_port="model" to_op="Model Simulator" to_port="model"/>
          <connect from_op="Model Simulator" from_port="simulator output" to_port="result 1"/>
          <connect from_op="Model Simulator" from_port="model output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="105"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
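
For anyone following along without RapidMiner, the same pipeline (generate data, add noise, split 70/30, fit a linear regression, apply it) can be sketched in Python with NumPy. This is a rough analogue under stated assumptions, not a translation of the operators: the label noise is modelled as Gaussian with a standard deviation of 5% of the label range, which is only a guess at what `label_noise` = 0.05 means, and NumPy's random generator will not reproduce RapidMiner's data even with the same seed. The helper names `fit_linear` and `predict` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2001)

# Generate Data: 1000 examples, 5 attributes in [-10, 10], label = sum of attributes.
n_examples, n_attributes = 1000, 5
X = rng.uniform(-10.0, 10.0, size=(n_examples, n_attributes))
y_clean = X.sum(axis=1)

# Add Noise (assumption: Gaussian noise scaled to 5% of the label range).
noise_scale = 0.05 * (y_clean.max() - y_clean.min())
y = y_clean + rng.normal(0.0, noise_scale, size=n_examples)

# Split Data: shuffled 70/30 split into training and test indices.
order = rng.permutation(n_examples)
n_train = int(0.7 * n_examples)
train, test = order[:n_train], order[n_train:]

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, intercept)."""
    design = np.column_stack([X, np.ones(len(X))])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights[:-1], weights[-1]

def predict(X, coefs, intercept):
    """Apply the fitted equation to new inputs."""
    return X @ coefs + intercept

# Linear Regression on the training partition, written out as an equation.
coefs, intercept = fit_linear(X[train], y[train])
terms = " ".join(f"{c:+.3f} * x{i + 1}" for i, c in enumerate(coefs))
equation = f"y = {terms} {intercept:+.3f}"
print(equation)

# Apply Model on the test partition and report the error.
rmse = np.sqrt(np.mean((predict(X[test], coefs, intercept) - y[test]) ** 2))
print(f"test RMSE: {rmse:.3f}")
```

The printed equation is the relation asked for in the question: each coefficient should come out close to 1 because the label is the sum of the attributes. Calling `predict(np.array([[1.0, 2.0, 3.0, 4.0, 5.0]]), coefs, intercept)` with inputs of your choice is a crude stand-in for trying out values in the Model Simulator.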
