How to provide separate datasets for training and testing?

Gokz123 Member Posts: 6 Newbie
I am a new user of the RapidMiner tool. I watched a video about training and testing datasets with cross-validation, but it says that a single dataset can be used for both training and testing. How can I provide separate datasets for training and testing? Can anyone please explain how to do that?





Answers

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited March 2019
    Hello @Gokz123

    Here is a comprehensive explanation by @sgenzer on cross-validation (CV). 

    https://community.rapidminer.com/discussion/55112/cross-validation-and-its-outputs-in-rm-studio

    In short: when you connect a dataset to the CV operator, it divides the data into multiple subsets based on the number of folds value. In each iteration it uses one subset for testing and the remaining subsets for training.
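
To make the fold logic concrete, here is a minimal Python sketch of the idea described above. It is not RapidMiner's implementation: the function name `k_fold_indices` is made up for illustration, and it assumes a plain linear split into contiguous folds, whereas the CV operator may shuffle or stratify the data depending on its sampling settings.

```python
def k_fold_indices(n_examples, n_folds):
    """Yield (train_indices, test_indices) once per fold.

    Hypothetical helper for illustration: contiguous folds, no shuffling.
    """
    if not 2 <= n_folds <= n_examples:
        raise ValueError("n_folds must be between 2 and n_examples")
    indices = list(range(n_examples))
    # Spread the examples as evenly as possible; the first `remainder`
    # folds receive one extra example each.
    base, remainder = divmod(n_examples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < remainder else 0)
        test = indices[start:start + size]                 # held-out subset
        train = indices[:start] + indices[start + size:]   # all other subsets
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for number, (train, test) in enumerate(folds):
    print(f"fold {number}: test={test} train={train}")
```

Every example lands in the test subset exactly once, which is why a single dataset is enough for both training and testing when cross-validation is used.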

    EDIT: If you would like to provide separate datasets, connect the training data to the learner (a Decision Tree in this example), connect the resulting model to the Apply Model operator, and connect the test dataset to the Apply Model operator as well. This way the training and testing data are supplied separately. Sample XML below.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Training_Fold0" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Local Repository/data/CSEDM_Challenge_Data/Training_Fold0"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.2.000" expanded="true" height="103" name="Decision Tree" width="90" x="380" y="85">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="10"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.1"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve Test_Fold0" width="90" x="246" y="238">
            <parameter key="repository_entry" value="//Local Repository/data/CSEDM_Challenge_Data/Test_Fold0"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="187">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.2.000" expanded="true" height="82" name="Performance" width="90" x="782" y="187">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="true"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="true"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="false"/>
            <parameter key="squared_correlation" value="false"/>
            <parameter key="cross-entropy" value="false"/>
            <parameter key="margin" value="false"/>
            <parameter key="soft_margin_loss" value="false"/>
            <parameter key="logistic_loss" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <connect from_op="Retrieve Training_Fold0" from_port="output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Retrieve Test_Fold0" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
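
For readers who think in code, the same train / apply / evaluate flow can be sketched in Python. This is only an illustration under stated assumptions: it swaps the Decision Tree for a tiny 1-nearest-neighbour model to stay dependency-free, and the function names and the example data are made up, not taken from the CSEDM datasets used in the process.

```python
def train_model(training_set):
    """'Train' a 1-nearest-neighbour model by storing the labelled examples."""
    return list(training_set)


def apply_model(model, features):
    """Return the label of the stored example closest to `features`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = min(model, key=lambda example: squared_distance(example[0], features))
    return nearest[1]


def accuracy(model, test_set):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(apply_model(model, features) == label for features, label in test_set)
    return correct / len(test_set)


# Hypothetical data: the two sets are supplied separately, like the two Retrieve operators.
training_set = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((4.0, 4.2), "b"), ((3.8, 4.0), "b")]
test_set = [((0.9, 1.1), "a"), ((4.1, 3.9), "b"), ((2.0, 2.0), "b")]

model = train_model(training_set)          # training data -> model
test_accuracy = accuracy(model, test_set)  # model + test data -> performance
print(f"accuracy on the separate test set: {test_accuracy:.2f}")
```

The model only ever sees `training_set` while it is built; `test_set` is used solely for scoring, which mirrors how the test data enters the process only at the Apply Model step.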
    
    Feel free to ask if you need a different process.


    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • [Deleted User] Posts: 0 Learner III
    edited May 2019
    Hi
    Based on the points @varunm1 made: if we have labeled data, we don't need to separate the dataset into training and testing sets ourselves. RapidMiner's cross-validation splits it into training and test parts automatically, and for the test part it does not use the label the way it does for the training part.
    Are these points correct?
    Thank you
     
  • bedanta Member Posts: 1 Contributor I
    Hi @varunm1, is there any way to input training and test data separately into Auto Model?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @bedanta

    Once Auto Model has finished training, you can deploy the model and test it on new data. This is only possible after Auto Model has been trained.
    Regards,
    Varun
