How to "tell" RM what to use for training/testing

LG222PS Member Posts: 2 Newbie
Dear Miners,
Please help me to get the hang of this.
I have a decision tree model and a data set with about 60,000 rows that have the label attribute and 15,000 rows that do not. I wanted the rows with the label attribute to be the training data and the rows with a missing label attribute to be the test data (which I want to export at the end for external validation).
Now my export only has 5900 rows of data, and it seems not to use the "empty" rows for testing. Instead, it appears to replace missing values with the mean value (the default option) and split the whole data set into test and training sets.
I am wondering how to fix this issue without having to disassemble the entire design (which would be painful, since I have already incorporated the model outcome in my thesis draft).
Could you please help me?
Kind regards
A data science newbie


    varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019
    Hello @LG222PS

    Did you filter the data using the "Filter Examples" operator with the condition class "no_missing_labels"? This separates the examples that have labels from those with missing labels, so you can use the labeled data for training and the unlabeled data for testing. Below is the XML code; I have also attached a dataset with missing labels so you can test it. To try it, download the attached dataset, then copy the code below, paste it into the XML panel of your RapidMiner process, and click the green tick mark. A Read CSV operator will appear in the process; in its parameters, point the CSV file parameter to the dataset you downloaded. If you cannot find the XML panel, go to View --> Show Panel --> XML.

    The Filter Examples operator has two relevant output ports: "exa" returns the examples that match the filter (in our case, those with no missing labels), and "unm" returns the unmatched examples (in our case, those with missing labels).

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="9.3.001" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
            <parameter key="csv_file" value="F:\RM\titanic_missing_label.csv"/>
            <parameter key="column_separators" value=","/>
            <parameter key="trim_lines" value="false"/>
            <parameter key="use_quotes" value="true"/>
            <parameter key="quotes_character" value="&quot;"/>
            <parameter key="escape_character" value="\"/>
            <parameter key="skip_comments" value="false"/>
            <parameter key="comment_characters" value="#"/>
            <parameter key="starting_row" value="1"/>
            <parameter key="parse_numbers" value="true"/>
            <parameter key="decimal_character" value="."/>
            <parameter key="grouped_digits" value="false"/>
            <parameter key="grouping_character" value=","/>
            <parameter key="infinity_representation" value=""/>
            <parameter key="date_format" value=""/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information"/>
            <parameter key="read_not_matching_values_as_missings" value="true"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
            <parameter key="attribute_name" value="Survived"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="9.3.001" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
            <parameter key="parameter_expression" value=""/>
            <parameter key="condition_class" value="no_missing_labels"/>
            <parameter key="invert_filter" value="false"/>
            <list key="filters_list"/>
            <parameter key="filters_logic_and" value="true"/>
            <parameter key="filters_check_metadata" value="true"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="581" y="34">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="136">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Filter Examples" from_port="unmatched example set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Hope this helps. Please let me know if you need more help.

    Be Safe. Follow precautions and Maintain Social Distancing

    LG222PS Member Posts: 2 Newbie
    edited July 2019
    Hello Varun,
    thanks for taking the time. Unfortunately, your XML example does not solve my problem.
    I already use the "Filter Examples" operator to create my model. However, I cannot find out how to feed the "remaining" (filtered-out) data back into the model to test it and get the test set results in a separate file/report.
    Kind regards
    varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019
    Hello @LG222PS

    Did you try the Apply Model operator? It applies the trained model to the test dataset and returns the predictions. Can you share your XML code so I can check it?
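
    For the export to a separate file, one possible approach is to place a Write CSV operator after Apply Model. The fragment below is only a sketch that assumes the process posted earlier in this thread: the file path is a hypothetical placeholder, the x/y position values are arbitrary, and the separator is just one choice. It is meant to sit inside the inner process element, with its two connections replacing the existing connection from Apply Model to "result 1".

    ```xml
    <!-- Sketch only: add inside the inner <process> element, after the Apply Model operator. -->
    <!-- The csv_file value is a hypothetical placeholder; point it to your own location. -->
    <operator activated="true" class="write_csv" compatibility="9.3.001" expanded="true" height="82" name="Write CSV" width="90" x="849" y="136">
      <parameter key="csv_file" value="C:\path\to\predictions.csv"/>
      <parameter key="column_separator" value=";"/>
      <parameter key="write_attribute_names" value="true"/>
    </operator>
    <!-- These two connections replace the existing Apply Model -> result 1 connection. -->
    <connect from_op="Apply Model" from_port="labelled data" to_op="Write CSV" to_port="input"/>
    <connect from_op="Write CSV" from_port="through" to_port="result 1"/>
    ```

    With this wiring, the exported file should contain only the rows that came out of the "unm" port (the ones with missing labels), together with the prediction and confidence columns that Apply Model adds.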

