Need recommendation for prediction/regression workflow

cyborghijacker · August 2017

Greetings one and all

I am getting acquainted with machine learning in RapidMiner, and I'm mainly concerned with a prediction problem. My CSV with about 5000 Examples contains 10 predictor attributes and 2 target attributes. I have a few queries with regard to RapidMiner's design:

1) I understand that only 1 target attribute (or prediction) can be set. Would I be able to predict both using a single process?

2) I am interested in using the new Deep Learning operator in performing the training. What are the recommended preprocessing steps? I can think of 1) filtering (missing values); 2) Normalizing. Do correlated attributes need to be removed manually?

3) For splitting of data into training, testing and validation, am I supposed to simply use Cross Validation with the Deep Learning operator nested within it? What about Split Validation? Do these operators split the original data into the 3 sets?

4) Can the Deep Learning operator handle a mixture of categorical and numerical attributes? Is no one-hot encoding necessary within RapidMiner, or do I need to preprocess using Nominal to Numerical (dummy coding)? For categorical variables, is the polynominal type suitable to describe them? I noticed there is also a 'text' type.

5) What does the 'reproducible' function do within the DL operator?

6) Is it possible to 'deploy' a trained DL model to an operational scenario?

7) When importing data using the Import Config. Wizard, could I skip defining the roles and instead use the Set Roles function in the designer?

My apologies for the many questions. I find RapidMiner to be a powerful and really user-friendly tool, and I would like to take the time to really understand it. Thank you very much.

Regards

Corse

jczogalla · August 2017

Hi Corse!

I'm not very familiar with the DL operator, but I will try to answer some of the other questions.

1) Yes, you can do something similar to this:


<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="7.5.003" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="target_function" value="polynomial"/>
        <description align="center" color="transparent" colored="false" width="126">Data with label</description>
      </operator>
      <operator activated="true" class="generate_aggregation" compatibility="7.5.003" expanded="true" height="82" name="Generate Aggregation" width="90" x="179" y="34">
        <parameter key="attribute_name" value="sum"/>
        <description align="center" color="transparent" colored="false" width="126">Add second label</description>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="sum"/>
        <parameter key="target_role" value="sumlabel"/>
        <list key="set_additional_roles"/>
        <description align="center" color="transparent" colored="false" width="126">Set role of second label to &quot;sumlabel&quot;</description>
      </operator>
      <operator activated="true" class="support_vector_machine" compatibility="7.5.003" expanded="true" height="124" name="SVM" width="90" x="447" y="34">
        <description align="center" color="transparent" colored="false" width="126">Predict for &quot;label&quot; attribute</description>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role (2)" width="90" x="45" y="238">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="polylabel"/>
        <list key="set_additional_roles">
          <parameter key="sum" value="label"/>
        </list>
        <description align="center" color="transparent" colored="false" width="126">Set first label to something, set second label to &quot;label&quot;</description>
      </operator>
      <operator activated="true" class="support_vector_machine" compatibility="7.5.003" expanded="true" height="124" name="SVM (2)" width="90" x="179" y="238">
        <description align="center" color="transparent" colored="false" width="126">Predict &quot;sum&quot; attribute</description>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.5.003" expanded="true" height="82" name="Apply Model" width="90" x="246" y="442">
        <list key="application_parameters"/>
        <description align="center" color="transparent" colored="false" width="126">Apply first model</description>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.5.003" expanded="true" height="82" name="Set Role (3)" width="90" x="380" y="442">
        <parameter key="attribute_name" value="prediction(label)"/>
        <parameter key="target_role" value="predictionold"/>
        <list key="set_additional_roles"/>
        <description align="center" color="transparent" colored="false" width="126">Set prediction label to something</description>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.5.003" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="391">
        <list key="application_parameters"/>
        <description align="center" color="transparent" colored="false" width="126">Apply second model</description>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Generate Aggregation" to_port="example set input"/>
      <connect from_op="Generate Aggregation" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="SVM" to_port="training set"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="SVM" from_port="exampleSet" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="SVM (2)" to_port="training set"/>
      <connect from_op="SVM (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="SVM (2)" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
</operator>
</process>


It's important to keep both labels as special roles (you can name them however you like), and before using a learner, just set the role of the label you want to predict to "label". After applying the first model, set the prediction attribute to another special role, because otherwise it might be discarded when the second model is applied. I hope the process helps you understand.
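To show the idea outside of RapidMiner, here is a minimal Python sketch of the same pattern: one model is trained per target, and each time a different column plays the part of "label" while the predictors stay the same. The column names (`x`, `target_a`, `target_b`), the toy data, and the simple one-predictor least-squares fit are all assumptions for illustration; they stand in for the two SVM learners in the process above.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: one predictor and two target columns.
rows = [
    {"x": 1.0, "target_a": 3.0, "target_b": -1.0},
    {"x": 2.0, "target_a": 5.0, "target_b": -2.0},
    {"x": 3.0, "target_a": 7.0, "target_b": -3.0},
    {"x": 4.0, "target_a": 9.0, "target_b": -4.0},
]
targets = ["target_a", "target_b"]

# Train one model per target; only the current target acts as the label.
models = {}
for target in targets:
    xs = [row["x"] for row in rows]
    ys = [row[target] for row in rows]
    models[target] = fit_line(xs, ys)

# Apply both models and keep both predictions under separate names,
# so the second prediction does not overwrite the first.
predictions = [
    {
        f"prediction({target})": slope * row["x"] + intercept
        for target, (slope, intercept) in models.items()
    }
    for row in rows
]
```

Giving each prediction column its own name is the counterpart of moving the first prediction to another special role before applying the second model.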

2) Both preprocessing steps make sense. Remember that normalization creates a preprocessing model that needs to be grouped with the learner model. I'm not sure if you have to remove correlated attributes, but there is an operator for that.

3) Both split validation and cross validation can be used. The data will be split into training and test sets accordingly, and the performance will be measured/validated over all splits. If you connect the model output port, it will generate the model over all examples.

6) Usually, models can be stored and used in another process so long as the new data has the same format for the regular attributes as the data you trained it on. It should not be different for the DL model.

7) Yes, you can skip the set roles during the configuration and just set the roles with the corresponding operator. Remember that up until then, all attributes are considered as regular and will be used as such e.g. when using a learner.

I hope this helps!

Jan

cyborghijacker · August 2017

Hi Jan

Thank you for your quick reply. Regarding your reply:

2) 'Remember that normalization creates a preprocessing model that needs to be grouped with the learner model' -> Yes, I noticed that certain operators generate a 'preprocessing model'. I am not sure what this is, or what it means that it needs to be grouped with the learner model. Do you mean placing the normalization operator, followed by the learner (e.g. Neural Net), both within the training process?

3) I assume Cross Validation would be the more recommended one?

Regards

Corse

sgenzer · August 2017

Hi @cyborghijacker - welcome to the RapidMiner Community! So glad to see you here.

I too am not an expert in the new DL operator but can at least point you to these resources that may help (if you have not already seen them):

https://rapidminer.com/resource/deep-learning-demo/

https://docs.rapidminer.com/studio/operators/modeling/predictive/neural_nets/deep_learning.html

http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/deep-learning-operator/td-p/35758

Scott

jczogalla · August 2017

Hi Corse

2) Yes, that's what I meant. You should put the normalization in the training process and use the Group Models operator to combine the normalization model and the learner model. This way the test data will be normalized the same way as the training data.
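As a rough illustration of what the preprocessing model does, here is a small Python sketch, assuming a plain z-score normalization with the population standard deviation (this is not a statement about RapidMiner's internals). The function names and the toy values are hypothetical. The point is that the parameters are learned once from the training data and then reused unchanged on the test data.

```python
from statistics import mean, pstdev


def fit_z_score(values):
    """Learn the normalization parameters from the training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for constant attributes
    return mu, sigma


def apply_z_score(values, params):
    """Normalize any values with parameters that were learned earlier."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

params = fit_z_score(train)               # plays the part of the preprocessing model
train_norm = apply_z_score(train, params)
test_norm = apply_z_score(test, params)   # test data reuses the training statistics
```

Grouping the normalization model with the learner model corresponds to always calling `apply_z_score` with the stored `params` before the learner sees new data.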

3) Cross Validation does several validations based on the number of folds you want to run. Split validation can be combined with a Loop operator around it to do several different validations if the sampling method uses a random element (i.e. shuffled or stratified sampling).
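The difference between the two schemes can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the fold assignment by stride and the function names are hypothetical and do not reproduce RapidMiner's sampling types.

```python
import random


def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; every example is tested exactly once."""
    indices = list(range(n_examples))
    for fold in range(k):
        test_idx = indices[fold::k]
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx


def repeated_split_indices(n_examples, train_ratio, repeats, seed=0):
    """Yield independent shuffled splits; an example may be tested several times or never."""
    rng = random.Random(seed)
    n_train = int(n_examples * train_ratio)
    for _ in range(repeats):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        yield sorted(indices[:n_train]), sorted(indices[n_train:])


folds = list(k_fold_indices(10, 5))               # like Cross Validation with 5 folds
splits = list(repeated_split_indices(10, 0.7, 3))  # like Split Validation inside a Loop
```

With k folds, the test sets partition the data. With repeated random splits, the test sets are drawn independently, so they can overlap between repetitions.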

Regards

Jan

cyborghijacker · August 2017

1) Just wondering, why should I not add the Normalize operator just before the Cross Validation operator (i.e. not nested within the Training process)?

2) Also, after training the model within the Optimize Parameters (Grid) operator, how could I use the model for prediction on new data, i.e. what do I connect the new data to? I only see 'per', 'par' and 'res' as outputs of the Optimize Parameters operator. How and where could I connect the Apply Model operator?

3) One small question: For regression task, do I set the role of my target attribute (which is a numerical value) as 'label' as well? Do I not select the 'prediction' role?

Thomas_Ott · August 2017

1) When doing training and testing in a validation operator, you want to put your normalization, or any other operator that learns from the training data, inside the validation operator. If you keep it outside, you can leak information from your test set into training (i.e. data snooping), which can distort the estimated accuracy of your model.
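A tiny Python sketch of the leakage, assuming a z-score normalization and hypothetical toy values: when the statistics are computed before the split, the single extreme test example changes how every training example is scaled.

```python
from statistics import mean, pstdev


def z_params(values):
    """Mean and population standard deviation used for a z-score."""
    return mean(values), (pstdev(values) or 1.0)


def normalize(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last example ends up in the test set
train, test = data[:4], data[4:]

leaky_params = z_params(data)    # normalization placed before the validation
clean_params = z_params(train)   # normalization placed inside the training side

# The training values differ depending on whether the test example was seen.
leaky_train = normalize(train, leaky_params)
clean_train = normalize(train, clean_params)
```

In the leaky variant the training data already carries information about the test example, so the validation no longer measures performance on truly unseen data.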

2) Connect the 'res' port to an Apply Model operator. The model that is delivered from Optimize Parameters is the optimized model.

3) Use the label role.

cyborghijacker · August 2017

Thank you for the concise clarification, Thomas. I have a few queries about normalization, if you don't mind:

1) I would assume you mean something like the process shown below. If I put the normalization on the training data, would the test data be similarly normalized?

2) What is the purpose of the Model output that is delivered by the Normalize operator? In what situation is it actually used?

3) Assuming my purpose is a NN prediction via Deep Learning, where should I use the De-normalize operator? I would like my output (i.e. the prediction) not to be a normalized value. I noticed that De-normalize is typically connected to the 'pre' output of the Normalize operator, and I am confused about what this does. Isn't it simply negating the effect of the Normalize operator?

4) For the Cross Validation, is the 'mod' output from the Apply Model operator necessary to be connected to something (I am not sure what) for the Cross Validation to deliver the 'mod' output?

5) In the Deep Learning operator, there is the option to 'standardize' -> could this be an inbuilt normalization parameter?

Regards

Ben
