Split data set and low quality operator
Hi people,
I'm new to RapidMiner, so please bear with me.
Two questions:
1. I would like to split my full data set into a training set (which I will also use for validation) and a test set. From my understanding, it is best practice to split off the test set straight away, before I do any exploratory data analysis or feature selection on the data.
Say I split my full data set with the "Split Data" operator, run the "Remove Correlated Attributes" operator (referred to as "corr." below) on the training set, and the corr. operator removes some attributes. At the end I store this final training set. Now my test set has more attributes than my training set - how do I remove from my test set the same attributes that the corr. operator removed from my training set? I don't want to run the corr. operator on the test set itself, because it could potentially remove fewer or different attributes. Is it possible to generate the test set in this automatic/dynamic way? Are there any other ways you would do this?
2. Is there a "low quality" operator in the RapidMiner design view, i.e. one that performs the same low-quality removal carried out in the Turbo Prep tab -> Cleanse -> Remove low quality?
Love to hear from you.
Best regards
Andy
Comments
You can do your attribute removal either inside the cross-validation (technically correct) or outside the operator as part of your preprocessing (probably more common although it can bias your subsequent performance estimate).
Note that you don't actually need to worry about attribute removal per se: if you build a model on a subset of attributes and then apply it to a dataset that has extra attributes, those extra attributes will simply be ignored if you apply the same model. But you do need to worry about attribute transformations, which is why you either need to do your data ETL beforehand (prior to all modeling) or do it inside the cross-validation and then pass all the transformations from the training data to the test data.
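If it helps to see the principle outside of RapidMiner, here is a rough Python/pandas sketch of "decide which attributes to drop using only the training data, then apply the identical drop list to the test set". The 0.95 threshold and the toy data are made-up placeholders for illustration, not anything RapidMiner uses internally:

# Decide the drop list on the TRAINING data only, then reuse it verbatim
# on the test set so both sets keep the same schema.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
data = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.99 + rng.normal(scale=0.01, size=200),  # nearly a copy of x1
    "x3": rng.normal(size=200),
})
train_df, test_df = data.iloc[:150], data.iloc[150:]

def correlated_columns(df, threshold=0.95):
    """Columns to drop so that no remaining pair exceeds |correlation| > threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

to_drop = correlated_columns(train_df)      # computed from the training data only
train_df = train_df.drop(columns=to_drop)   # identical drop list applied to both sets
test_df = test_df.drop(columns=to_drop)
print(to_drop)                              # -> ['x2']

The key point is that the drop list is computed once, from the training data, and then reused as-is, rather than re-running the correlation analysis on the test set.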
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
As @Telcontar120 mentioned above, cross-validation is better than a single split because we don't know which part of the data is actually representative. If you want to try it, here is an example using the Cross Validation operator. Just copy the XML code below into a new process's XML panel (View --> Show Panel --> XML), paste it there, and click the green check mark. You can then inspect the operators and their options. This is 5-fold CV, which means the data is divided into 5 splits and trained and tested 5 times. Please go through the RapidMiner tutorials for a more in-depth understanding.
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
CV divides the data into train and test sets (not a separate validation set). For example, in 5-fold CV it divides the data into 5 subsets (1,2,3,4,5), uses the first four (1,2,3,4) for training and the remaining one (5) for testing on unseen data, saves the performance metrics, discards the model, and then takes the next four folds (2,3,4,5) for training and fold 1 for testing. This continues until every subset has been used for both training and testing, and the stored performance metrics are aggregated into the final performance. This is the main reason CV is used: it gives a much more reliable performance estimate and guards against an overly optimistic (overfit) result. I am not sure about leakage.
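If it helps, here is roughly the same 5-fold rotation sketched in Python with scikit-learn, purely for illustration (the synthetic data and the logistic regression are arbitrary stand-ins, not what RapidMiner's operator does internally):

# 5-fold rotation: train on 4 folds, test on the held-out fold, repeat 5 times,
# then aggregate the per-fold scores into one final estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=5, random_state=2001)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=2001).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on 4 folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # score on the held-out fold

print("per-fold accuracy:", np.round(scores, 3))
print("aggregated (mean) accuracy:", round(float(np.mean(scores)), 3))

Each of the 5 models is scored on data it never saw during training, and only the aggregated metric is reported.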
I get what you are asking. If you want to test manually, split the data with the Split Data operator into 90 and 10 percent. Feed the 90% into Cross Validation, take the model output from the Cross Validation operator, connect it to an Apply Model operator, and connect the 10% you set aside to Apply Model as well. I see the Keras model operator has an option for a validation set percentage, but the others don't, so I think you need to split with the Split Data operator.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="generate_data" compatibility="9.1.000" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34"> <parameter key="target_function" value="simple non linear classification"/> <parameter key="number_examples" value="1000"/> <parameter key="number_of_attributes" value="5"/> <parameter key="attributes_lower_bound" value="-10.0"/> <parameter key="attributes_upper_bound" value="10.0"/> <parameter key="gaussian_standard_deviation" value="10.0"/> <parameter key="largest_radius" value="10.0"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="datamanagement" value="double_array"/> <parameter key="data_management" value="auto"/> </operator> <operator activated="true" class="split_data" compatibility="9.1.000" expanded="true" height="103" name="Split Data" width="90" x="246" y="85"> <enumeration key="partitions"> <parameter key="ratio" value="0.9"/> <parameter key="ratio" value="0.1"/> </enumeration> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> </operator> <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34"> <parameter key="split_on_batch_attribute" value="false"/> <parameter key="leave_one_out" value="false"/> <parameter key="number_of_folds" value="5"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="neural_net" compatibility="9.1.000" expanded="true" height="82" name="Neural Net" width="90" x="112" y="34"> <list key="hidden_layers"/> <parameter key="training_cycles" value="200"/> <parameter key="learning_rate" value="0.01"/> <parameter key="momentum" value="0.9"/> <parameter key="decay" value="false"/> <parameter key="shuffle" value="true"/> <parameter key="normalize" value="true"/> <parameter key="error_epsilon" value="1.0E-4"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> </operator> <connect from_port="training set" to_op="Neural Net" to_port="training set"/> <connect from_op="Neural Net" from_port="model" to_port="model"/> <portSpacing port="source_training set" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_binominal_classification" compatibility="9.1.000" expanded="true" height="82" 
name="Performance (2)" width="90" x="246" y="34"> <parameter key="main_criterion" value="first"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="false"/> <parameter key="kappa" value="true"/> <parameter key="AUC (optimistic)" value="false"/> <parameter key="AUC" value="true"/> <parameter key="AUC (pessimistic)" value="false"/> <parameter key="precision" value="true"/> <parameter key="recall" value="true"/> <parameter key="lift" value="false"/> <parameter key="fallout" value="false"/> <parameter key="f_measure" value="true"/> <parameter key="false_positive" value="false"/> <parameter key="false_negative" value="false"/> <parameter key="true_positive" value="false"/> <parameter key="true_negative" value="false"/> <parameter key="sensitivity" value="false"/> <parameter key="specificity" value="false"/> <parameter key="youden" value="false"/> <parameter key="positive_predictive_value" value="false"/> <parameter key="negative_predictive_value" value="false"/> <parameter key="psep" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_port="model" to_op="Apply Model" to_port="model"/> <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/> <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/> <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_test set results" spacing="0"/> <portSpacing port="sink_performance 1" spacing="0"/> <portSpacing port="sink_performance 2" spacing="0"/> </process> </operator> <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="514" y="187"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_binominal_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="648" y="136"> <parameter key="main_criterion" value="first"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="false"/> <parameter key="kappa" value="true"/> <parameter key="AUC (optimistic)" value="false"/> <parameter key="AUC" value="true"/> <parameter key="AUC (pessimistic)" value="false"/> <parameter key="precision" value="true"/> <parameter key="recall" value="true"/> <parameter key="lift" value="false"/> <parameter key="fallout" value="false"/> <parameter key="f_measure" value="false"/> <parameter key="false_positive" value="false"/> <parameter key="false_negative" value="false"/> <parameter key="true_positive" value="false"/> <parameter key="true_negative" value="false"/> <parameter key="sensitivity" value="false"/> <parameter key="specificity" value="false"/> <parameter key="youden" value="false"/> <parameter key="positive_predictive_value" value="false"/> <parameter key="negative_predictive_value" value="false"/> <parameter key="psep" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_op="Generate Data" from_port="output" to_op="Split Data" to_port="example set"/> <connect from_op="Split Data" from_port="partition 1" to_op="Cross Validation" to_port="example 
set"/> <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/> <connect from_op="Performance" from_port="performance" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process>
Thanks,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Scott