Split the dataset

ikayunida123 · June 2018

Hello everyone!

I'm working on a text classification task right now, and I want to ask how to split the dataset into training data and testing data in RapidMiner. I know there are operators like Split Data and Split Validation, but it looks like they split the data automatically(?), so I don't know which examples end up as training data and which as testing data.

My teacher wants me to compare the result of the text classification that I'm doing manually with the result of my RapidMiner process, so I must make sure the training data and testing data are the same in both.

Please help me. Thank you!

rfuentealba · June 2018

Hi @ikayunida123

It depends on how you chose your data for training or testing.

Let's say you have 10 samples numbered 1 to 10. If you train your model manually with samples 1, 2 and 3 and test it with samples 4 to 10, you should choose "linear sampling" both in "Split Data" and when configuring "Split Validation". If, however, you specifically chose (e.g.) examples 1, 4 and 6 to train your model and the rest to test it, you might prefer not to use Split Data at all, but to create two datasets that match your choices and build the model without Split Validation.

There are three ways to create data samples in RapidMiner (well, there are four, but the fourth one is "automatically choose between stratified or shuffled depending on the data types you have"): "Linear" is 1, 2, 3... in the original order, "Shuffled" is random, and "Stratified" is shuffled but tries to keep the class proportions of the label the same in your training data and your testing data.

Regarding Split Data, Split Validation and DIY validation, here are some pictures showing each case:
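The three sampling types can be illustrated outside RapidMiner with a minimal Python sketch. The function names, the toy data, and the seed value are assumptions made for illustration; this is not how RapidMiner implements its operators, only the idea behind each option.

```python
import random
from collections import defaultdict


def linear_split(examples, train_ratio):
    """Linear sampling: keep the original order, the first part is training."""
    cut = round(len(examples) * train_ratio)
    return examples[:cut], examples[cut:]


def shuffled_split(examples, train_ratio, seed):
    """Shuffled sampling: put the examples in random order, then cut."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, train_ratio)


def stratified_split(examples, label_of, train_ratio, seed):
    """Stratified sampling: shuffle within each class and cut each class
    separately, so both parts keep roughly the same class proportions."""
    by_class = defaultdict(list)
    for example in examples:
        by_class[label_of(example)].append(example)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Ten toy examples as (id, label): ids 1-6 are "pos", ids 7-10 are "neg".
data = [(i, "pos" if i <= 6 else "neg") for i in range(1, 11)]

lin_train, lin_test = linear_split(data, 0.3)
shuf_train, shuf_test = shuffled_split(data, 0.3, seed=1992)
strat_train, strat_test = stratified_split(data, lambda ex: ex[1], 0.5, seed=1992)

print([i for i, _ in lin_train])  # [1, 2, 3]
```

With linear sampling you always know which rows are used for training, which is what matters when the result has to match a calculation done by hand.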

Split Validation:

[Image: Split Validation 01 - Setup.png. Caption: Split Validation - General View]

[Image: Split Validation 02 - Internals.png. Caption: Split Validation - Testing/Training]

Split Data:

[Image: Split Data.png] Notice that this is equivalent to performing the Split Validation, but harder to read in a larger model.

With DIY Validation, data splitting is your responsibility. Basically, your model looks much like the "Split Data" model, except that you have two Retrieve operators, one with your chosen training data and the other with your chosen testing data. TBH I was too lazy to build an example.

For the sake of completeness, I prefer to do Cross Validation whenever I have enough memory and processor to use it (or use it as a pretext to ask my boss to buy more memory and a better processor). It works like Split Validation, but repeated: let's say you have 100 examples and you want to split them into 5 folds. It then iterates: use examples 1 to 20 for testing and the rest for training, then examples 21 to 40 for testing, then 41 to 60 and so on... More folds means smaller test sets but more iterations. There is also the "leave one out" option, which tests on a single example per iteration, but with this option enabled the amount of computing power required is... quite high if your dataset is large.

[Image: Screen Shot 2018-06-08 at 01.02.22.png] This is the same as the split validation, but more powerful and more CPU- and memory-consuming.
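The fold arithmetic described above (100 examples, 5 folds) can be sketched in Python. The helper name `linear_folds` is hypothetical, and the sketch assumes folds are taken in the original row order as in the description; the Cross Validation operator's own sampling settings may assign examples to folds differently.

```python
def linear_folds(n_examples, n_folds):
    """Yield (train_ids, test_ids) for each fold, using 1-based example ids."""
    ids = list(range(1, n_examples + 1))
    fold_size = n_examples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder when the division is not exact.
        stop = n_examples if k == n_folds - 1 else start + fold_size
        test_ids = ids[start:stop]
        train_ids = ids[:start] + ids[stop:]
        yield train_ids, test_ids


folds = list(linear_folds(100, 5))
for train_ids, test_ids in folds:
    print(f"test {test_ids[0]}-{test_ids[-1]}, train on {len(train_ids)} examples")

# Leave-one-out is the extreme case: as many folds as there are examples,
# so the model is trained 100 times here instead of 5.
leave_one_out = list(linear_folds(100, 100))
```

Each example is used for testing exactly once, which is why the cost grows with the number of folds.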

Hope this helps,

Rodrigo.

ikayunida123 · June 2018

Hello Mr. @rfuentealba! Your answer definitely helped me understand the concept of validation in RapidMiner. Thank you for your help! But the option closest to what I need (from your explanation) is DIY validation. Can you give me an example of DIY validation? Any pictures or an XML process would be fine. Thank you and have a nice day!

Telcontar120 · June 2018

You can use the Log operator to capture whatever intermediate results you want from the training and the testing data inside a split validation. Just make sure you use the local random seed option so your results are reproducible.
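Why a fixed seed makes a shuffled split reproducible can be shown with a small Python sketch. The function name `seeded_split` and the seed value are arbitrary choices for illustration, and the sketch only mirrors the idea of a local random seed, not RapidMiner's random generator.

```python
import random


def seeded_split(ids, train_ratio, seed):
    """Shuffle with a generator local to this call, then cut into two parts."""
    rng = random.Random(seed)  # local generator, independent of global state
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


ids = list(range(1, 11))
first_run = seeded_split(ids, 0.7, seed=1992)
second_run = seeded_split(ids, 0.7, seed=1992)

# Same seed, same input: the same rows land in training and testing.
print(first_run == second_run)  # True
```

Once the split is reproducible, the logged training and testing rows can be copied out and reused for the manual calculation.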

rfuentealba · June 2018

Hi @ikayunida123,

If I understood it correctly, you picked some data by hand, you are performing the calculations manually, and your teacher wants you to see how performance differs between RapidMiner and what you do with paper and pencil (May Odin bless your patience for such a task, I failed COBOL twice at the Uni because I didn't have enough of it). So you want something that lets you define exactly which data is used for training and which for testing, am I right? In that case, the Log operator that @Telcontar120 suggested might not work. (However, that's a good catch! You might want to try it the other way round: run your process in RapidMiner, use the Log operator to see what your training data looks like, and then take paper and pencil with that data. That's much less hassle!)

Well, DIY means "Do It Yourself" (I had a joke with "Just Do It" but I just didn't... For reference, that is a slogan for a known sportswear company), and it means getting rid of all the help given to you by RapidMiner process blocks.

The first thing you have to get rid of is our good friend the Split Data operator; the examples must be split by yourself. Then, apply everything as you normally would, but... before applying the Performance block, you have to define which columns in your result have the label and prediction roles. The Performance operator takes these two to analyze how far your algorithm was from the truth, and that's transparent to you when using the Split Validation or the Split Data methods.

Finally, apply Performance as normal, et voilà: DIY Validation! Here is a screenshot.
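What this comparison of label and prediction amounts to can be sketched in Python. The toy data, the hand-picked ids, and the majority-class "model" are all assumptions standing in for a real classifier; the point is only that the split is fixed by you and that performance is computed from (label, prediction) pairs.

```python
from collections import Counter

# Toy labelled examples keyed by id (hypothetical data).
examples = {
    1: "pos", 2: "neg", 3: "pos", 4: "pos", 5: "neg",
    6: "pos", 7: "neg", 8: "pos", 9: "pos", 10: "neg",
}

# The split is chosen by hand: these ids train, everything else tests.
train_ids = [1, 4, 6]
test_ids = [i for i in examples if i not in train_ids]

# Stand-in "model": always predict the most common label in the training data.
majority_label = Counter(examples[i] for i in train_ids).most_common(1)[0][0]

# Each test row ends up with a true label and a prediction.
scored = [(examples[i], majority_label) for i in test_ids]

# Accuracy is the share of rows where label and prediction agree.
accuracy = sum(label == prediction for label, prediction in scored) / len(scored)
print(f"accuracy = {accuracy:.3f}")
```

Because the ids are written down explicitly, the same training and testing rows can be used for the calculation on paper.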

[Image: DIY Validation.png. Caption: DIY Validation!]

Hope this helps!

Cheers and have a nice weekend,

ikayunida123 · June 2018

Hello @Telcontar120, thank you for your suggestion!

But I still don't understand. Can you please give me an example of how to use the Log operator in my case? I'm sorry for asking so many questions.

This is my process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/Skripsi Ika/Dataset Skripsi"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="Label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="Text" value="regular"/>
          <parameter key="Label" value="label"/>
        </list>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="34">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="112" y="34">
            <list key="replace_dictionary">
              <parameter key="(https?|http)://[-a-zA-Z0-9+&amp;@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&amp;@#/%=~_|]" value="link"/>
              <parameter key="@" value="at "/>
              <parameter key="#" value="hashtag "/>
            </list>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="380" y="34"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="34">
            <parameter key="file" value="E:\KAMPUS\[SEMESTER 8]\SKRIPSI\DATASET\[2] PRE-PROCESSING\talastopwordsedit.txt"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="648" y="34"/>
          <connect from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="split_validation" compatibility="8.1.001" expanded="true" height="124" name="Validation" width="90" x="581" y="187">
        <process expanded="true">
          <operator activated="true" class="naive_bayes" compatibility="8.1.001" expanded="true" height="82" name="Naive Bayes" width="90" x="179" y="34"/>
          <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="8.1.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="8.1.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi (2)" width="90" x="45" y="340">
        <parameter key="repository_entry" value="//Local Repository/Skripsi Ika/Dataset Skripsi"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="179" y="340">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="340">
        <parameter key="attribute_name" value="Label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="Text" value="regular"/>
          <parameter key="Label" value="label"/>
        </list>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="447" y="340">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="581" y="340">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="715" y="340">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="112" y="34">
            <list key="replace_dictionary">
              <parameter key="(https?|http)://[-a-zA-Z0-9+&amp;@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&amp;@#/%=~_|]" value="link"/>
              <parameter key="@" value="at "/>
              <parameter key="#" value="hashtag "/>
            </list>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="246" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="380" y="34"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (2)" width="90" x="514" y="34">
            <parameter key="file" value="E:\KAMPUS\[SEMESTER 8]\SKRIPSI\DATASET\[2] PRE-PROCESSING\talastopwordsedit.txt"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (2)" width="90" x="648" y="34"/>
          <connect from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
          <connect from_op="Replace Tokens (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="8.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="849" y="289">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve Dataset Skripsi" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
      <connect from_op="Retrieve Dataset Skripsi (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Remove Duplicates (2)" to_port="example set input"/>
      <connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Thank you.

Telcontar120 · June 2018

Here is an example process which uses split validation to create a model and logs the training performance and the test performance separately.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.003" expanded="true" height="68" name="Retrieve Sonar" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="split_validation" compatibility="8.1.003" expanded="true" height="124" name="Validation" width="90" x="447" y="30">
        <parameter key="training_set_size" value="10"/>
        <parameter key="sampling_type" value="linear sampling"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.1.003" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="30"/>
          <operator activated="true" class="apply_model" compatibility="8.1.003" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="8.1.003" expanded="true" height="82" name="Performance (2)" width="90" x="179" y="187"/>
          <operator activated="true" class="log" compatibility="8.1.003" expanded="true" height="82" name="Log" width="90" x="313" y="187">
            <list key="log">
              <parameter key="train_perf" value="operator.Performance (2).value.performance"/>
            </list>
          </operator>
          <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Decision Tree" from_port="exampleSet" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Apply Model (2)" from_port="model" to_port="model"/>
          <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="through 1"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="8.1.003" expanded="true" height="82" name="Performance" width="90" x="179" y="30"/>
          <operator activated="true" class="log" compatibility="8.1.003" expanded="true" height="82" name="Log (2)" width="90" x="313" y="34">
            <list key="log">
              <parameter key="test_perf" value="operator.Performance.value.performance"/>
            </list>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_op="Log (2)" to_port="through 1"/>
          <connect from_op="Log (2)" from_port="through 1" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="source_through 2" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Sonar" from_port="output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Validation" from_port="training" to_port="result 2"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

hanaakarimah · April 25

Hi @rfuentealba, I want to ask whether the data splitting method you described (using Split Data, without cross-validation) has ever been mentioned in a journal. Or did you arrive at it by trying various methods in data processing yourself? I want to use that method, but my professor asked me to find a journal article that uses something similar. Meanwhile, what I found in local journals in my country mostly uses cross-validation. However, with cross-validation I cannot control how many data splits are performed (this will be tested later when I present), and cross-validation also takes a lot of time... hufftt
