[SOLVED] Flexible Learner Replacement

GzF Member Posts: 11 Contributor II
edited November 2018 in Help
Hi there,

I'm fairly new to RapidMiner, but I had to get into it pretty fast due to my new job.
Currently I'm working on predictive analysis for maintenance issues.

My data set contains some ten thousand examples with about a thousand attributes.
Due to speed issues, I've designed a selective preprocessing step where I split the data into subsets of different attributes, run a forward selection analysis combined with cross-validation on each subset, and keep the attributes of each subset that have the biggest impact on the result. Then I join the results back together to do a final analysis. This process currently involves about 20 Learner Operators, e.g. SVM, Naïve Bayes or Decision Tree (the process is shown in the following post).
Switching from one Learner method to another is a dull and tiring task, since I have to replace all 20 Operators.

So I thought of some kind of macro-like two-component system for flexible Learner replacement.
These two components could look like this:
The first component is a nested Operator which contains the Learner to be used. It might also need an ID/name as a parameter, so that several of these container constructions can run in one process.
The second component is linked to the specified first component (via the ID/name). It simply retrieves the defined Learner Operator with all its parameters.
This would also come in really handy when you want to do some optimization on the whole process (which is my second point of the idea). For a Learner with 3 parameters and 2 choices per parameter, it would make a difference of (2^3)^20 combinations (with no container available, each of the 20 Learners has its own set of parameters) versus just 2^3 combinations when one Learner configuration is used throughout the whole process. This would not only save computation time. It would also save time designing the process, because I would have to choose and set only 3 parameter ranges instead of 60 in the optimization Operator (not to mention the debugging).
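
Spelled out (the only assumption being that, without a shared container, all 60 parameter ranges of the 20 Learners would sit in one joint optimization grid):

    \[
      \underbrace{(2^{3})^{20}}_{\text{20 independent Learner grids}} = 2^{60} \approx 1.15 \times 10^{18}
      \qquad \text{vs.} \qquad
      \underbrace{2^{3}}_{\text{one shared Learner configuration}} = 8
    \]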

I believe that there is some kind of workaround for the problem of too many combinations using macros, but this would probably make the design phase even more complicated and tiring.
Or maybe there is a cool solution using the XML code and a replace function instead of the GUI.

Thanks
Garlef

Answers

  • GzF Member Posts: 11 Contributor II
    And here is a very, very short version of the code, more like a fragment:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="Get important Attributes" width="90" x="313" y="30">
            <process expanded="true">
              <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="Periodical Log (4)" width="90" x="179" y="30">
                <process expanded="true">
                  <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="76" name="Multiply (22)" width="90" x="45" y="30"/>
                  <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="76" name="Min (2)" width="90" x="179" y="30">
                    <process expanded="true">
                      <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Get Min (2)" width="90" x="45" y="30">
                        <parameter key="attribute_filter_type" value="subset"/>
                        <parameter key="attributes" value="|Min|D_Min"/>
                      </operator>
                      <operator activated="true" class="optimize_selection_forward" compatibility="5.3.015" expanded="true" height="94" name="Forward Selection (50)" width="90" x="246" y="30">
                        <process expanded="true">
                          <operator activated="true" class="x_validation" compatibility="5.3.015" expanded="true" height="112" name="Validation (49)" width="90" x="45" y="30">
                            <process expanded="true">
                              <operator activated="true" class="naive_bayes_kernel" compatibility="5.3.015" expanded="true" height="76" name="Karnel Bayes (9)" width="90" x="178" y="30"/>
                              <connect from_port="training" to_op="Karnel Bayes (9)" to_port="training set"/>
                              <connect from_op="Karnel Bayes (9)" from_port="model" to_port="model"/>
                              <portSpacing port="source_training" spacing="0"/>
                              <portSpacing port="sink_model" spacing="0"/>
                              <portSpacing port="sink_through 1" spacing="0"/>
                            </process>
                            <process expanded="true">
                              <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model (51)" width="90" x="45" y="30">
                                <list key="application_parameters"/>
                              </operator>
                              <operator activated="true" class="performance" compatibility="5.3.015" expanded="true" height="76" name="Perf Min Forward (7)" width="90" x="283" y="30"/>
                              <connect from_port="model" to_op="Apply Model (51)" to_port="model"/>
                              <connect from_port="test set" to_op="Apply Model (51)" to_port="unlabelled data"/>
                              <connect from_op="Apply Model (51)" from_port="labelled data" to_op="Perf Min Forward (7)" to_port="labelled data"/>
                              <connect from_op="Perf Min Forward (7)" from_port="performance" to_port="averagable 1"/>
                              <portSpacing port="source_model" spacing="0"/>
                              <portSpacing port="source_test set" spacing="0"/>
                              <portSpacing port="source_through 1" spacing="0"/>
                              <portSpacing port="sink_averagable 1" spacing="0"/>
                              <portSpacing port="sink_averagable 2" spacing="0"/>
                            </process>
                          </operator>
                          <connect from_port="example set" to_op="Validation (49)" to_port="training"/>
                          <connect from_op="Validation (49)" from_port="averagable 1" to_port="performance"/>
                          <portSpacing port="source_example set" spacing="0"/>
                          <portSpacing port="sink_performance" spacing="0"/>
                        </process>
                      </operator>
                      <connect from_port="in 1" to_op="Get Min (2)" to_port="example set input"/>
                      <connect from_op="Get Min (2)" from_port="example set output" to_op="Forward Selection (50)" to_port="example set"/>
                      <connect from_op="Forward Selection (50)" from_port="example set" to_port="out 1"/>
                      <portSpacing port="source_in 1" spacing="0"/>
                      <portSpacing port="source_in 2" spacing="0"/>
                      <portSpacing port="sink_out 1" spacing="0"/>
                      <portSpacing port="sink_out 2" spacing="0"/>
                    </process>
                  </operator>
                  <connect from_port="in 1" to_op="Multiply (22)" to_port="input"/>
                  <connect from_op="Multiply (22)" from_port="output 1" to_op="Min (2)" to_port="in 1"/>
                  <connect from_op="Min (2)" from_port="out 1" to_port="out 1"/>
                  <portSpacing port="source_in 1" spacing="0"/>
                  <portSpacing port="source_in 2" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                  <portSpacing port="sink_out 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="in 1" to_op="Periodical Log (4)" to_port="in 1"/>
              <connect from_op="Periodical Log (4)" from_port="out 1" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    You could create some processes that contain only a learner, e.g. an SVM or a decision tree, connected to the process input and output, and call those processes via Execute Process in your main process. By using a macro that specifies the location of the process to be tested, you can even make it configurable. And you can even automate the complete thing by using the Loop Parameters operator to automatically try out several of the inner processes by varying the process_location parameter of Execute Process.
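
    For illustration, here is a minimal sketch of such an inner process containing only a Decision Tree. This fragment is not from the original post; the operator class and port names simply follow the 5.3 XML conventions shown above, so treat it as a template rather than a tested process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <!-- the only operator: the learner that Execute Process swaps in -->
          <operator activated="true" class="decision_tree" compatibility="5.3.015" expanded="true" height="76" name="Decision Tree" width="90" x="112" y="30"/>
          <!-- training data arrives at the process input, the model leaves via the result port -->
          <connect from_port="input 1" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>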

    I hope this makes sense to you!

    Best regards,
    Marius
  • GzF Member Posts: 11 Contributor II
    Hi Marius,

    I managed to create a little example using the Execute Process operator.
    Using different learners is nice and easy now. Thanks for that nice trick.

    In the following I will use the terms outer process (the one that sets the macros, loads the data and calls the Execute Process operator) and inner process (the one that contains the learner operator). The learner mentioned is the k-NN operator.

    I still have some questions on varying parameters in the inner process.
    Varying numerical values seems easy using macros, but it only returns the error message
    Optimize Parameters (Grid): Cannot evaluate performance for current parameter combination because of an error in one of the inner operators: A value for the parameter 'k' must be specified! Expected integer but found '30.0'.
    The parameter 'k' comes from the k-NN operator. I tried using a second macro in the inner process that does nothing but take the value of the macro from the outer process and parse the number, but I can't reference the macro name in the Parse Number operator.
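
    For reference, the setup described probably corresponds to something like the following fragment in the inner process. The macro name k_value is made up here; only the %{...} macro syntax and the k-NN parameter key are standard, so this is an illustration rather than a copy of the actual process:

        <operator activated="true" class="k_nn" compatibility="5.3.015" expanded="true" height="76" name="k-NN" width="90" x="112" y="30">
          <!-- 'k' expects an integer, but the grid in the outer process hands over "30.0" through the macro -->
          <parameter key="k" value="%{k_value}"/>
        </operator>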

    And I don't know how to modify, from the outside, parameters of the inner process that are set via a checkbox or a drop-down list, e.g. weighted vote or the measure types of the k-NN operator. Passing numerical or nominal values through via macros doesn't help here, on top of the problem described above. And setting them from an outer optimization operator is not possible at all, since it has no idea about the design of the inner process.

    Also, I can't use the Optimize Grid operator to modify macros that are created with the Macros operator. It looks like Optimize Grid can't look inside the menus used for creating multiple objects or subsets.

    Garlef
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Garlef,

    You could use an Optimize Parameters operator in the outer process to simply call different inner processes, and add another Optimize Parameters operator in the inner process to vary the learner parameters.
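
    As a rough sketch (not from the original thread), the grid of the outer Optimize Parameters (Grid) could simply enumerate the inner processes by their repository paths. The paths below are placeholders, and the connections and port wiring are omitted:

        <operator activated="true" class="optimize_parameters_grid" compatibility="5.3.015" expanded="true" height="94" name="Optimize Learner Choice" width="90" x="112" y="30">
          <list key="parameters">
            <!-- hypothetical repository locations, each pointing to an inner process with a single learner -->
            <parameter key="Execute Process.process_location" value="//Repo/learners/knn,//Repo/learners/svm,//Repo/learners/decision_tree"/>
          </list>
          <process expanded="true">
            <operator activated="true" class="execute_process" compatibility="5.3.015" expanded="true" height="76" name="Execute Process" width="90" x="45" y="30">
              <parameter key="process_location" value="//Repo/learners/knn"/>
            </operator>
            <!-- cross-validation, Apply Model and Performance would follow here and feed the performance port -->
          </process>
        </operator>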

    Best regards,
    Marius
  • GzF Member Posts: 11 Contributor II
    Hi Marius,

    I thought about that, but it might lead to different optimization parameter settings for each of my attribute subsets. And I'm not sure how applicable the results would be.
    On the other hand, I could save each intermediate result and load it later in the main process. But that's probably not too good for the computation time.

    Anyway, thanks a lot for the really fast and really good help

    And I still believe that my idea is worth implementing, but that's mostly because it took me a couple of hours to think of it. ;)

    Best Regards
    Garlef
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Well, you should store the settings and the model of each iteration, and when you finally have new data you simply load the best model and apply it :) (the Parameter Optimization operator has an output port for the parameter settings).
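
    A minimal sketch of that idea, with made-up repository paths (Store inside the optimization loop to keep each model, Retrieve plus Apply Model later when new data arrives):

        <!-- inside the optimization loop: keep the model of the current iteration -->
        <operator activated="true" class="store" compatibility="5.3.015" expanded="true" height="60" name="Store Model" width="90" x="179" y="30">
          <parameter key="repository_entry" value="//Repo/models/current_model"/>
        </operator>

        <!-- later, in the application process: load the chosen model and apply it to new data -->
        <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve Best Model" width="90" x="45" y="30">
          <parameter key="repository_entry" value="//Repo/models/best_model"/>
        </operator>
        <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
          <list key="application_parameters"/>
        </operator>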

    Best regards,
    Marius