Choose Predictive Models
Hello,
I have a classification problem with a binary target attribute. All other attributes are numerical. RapidMiner has about 80 operators that can be used for classification, and it is nearly impossible to try all of them...
I found the ROC as a tool to choose which operator to use. I just don't understand how it works and how it can provide a "perfect" model for my problem. For example, if I put 10 operators into the Compare ROCs operator, they all run with their default settings. The result is a set of curves, and the curve that comes closest to the top-left corner is the best, so that operator is the best. But what happens when I change the parameters of the 10 operators? Then I get a totally different ROC. So it's just trial and error, right?
Is there any method to find the best operator for my problem? Or does it all come down to putting one operator inside Optimize Parameters and finding the best operator with the best accuracy that way?
I hope I explained my question well...
Thanks!
Best Answers
varunm1 Member Posts: 1,207 Unicorn
Hello @dome,
Did you try Auto Model? Auto Model can suggest algorithms that are suitable for your data. I would definitely try that first and then look at more models to see which one might do a better job.
As you said, there are a large number of predictive algorithms, which is why it is good to visualize your data with t-SNE to see whether any patterns can be identified from the data distribution. This is one way I narrow down my algorithms, but it needs some expertise.
You can use ROC curves, but as you said, they may change depending on the model settings.
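To make the point about settings concrete, here is a minimal sketch in plain Python (not RapidMiner code) of what a ROC comparison boils down to: the area under the curve depends only on how the model's confidences rank positives against negatives, so any parameter change that alters the confidences also alters the curve. The labels and scores below are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative one
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidences from the same learner under two parameter settings.
labels = [1, 1, 1, 0, 0, 0]
scores_default = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
scores_tuned = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

auc_default = roc_auc(labels, scores_default)  # one negative outranks two positives
auc_tuned = roc_auc(labels, scores_tuned)      # every positive outranks every negative
print(auc_default, auc_tuned)
```

So a ROC comparison with default settings ranks the learners as configured, not the learners at their best.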
@IngoRM might have more suggestions on this.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
tftemme Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
Hi @dome
As @varunm1 already pointed out, you could for example try Auto Model, which already compares different models with different parameters on your data set.
If you want to compare different models (different learner operators) on your own within an Optimize Parameters operator, there is a little trick to achieve this. You can use the Select Subprocess operator to create a separate subprocess for each learner. Then you optimize the 'select which' parameter of the Select Subprocess operator to get the best-performing model. Inside the subprocesses you can also use additional Optimize operators to tune each model's own parameters.
Don't forget that you need a validation scheme around all Optimization operators. I attached an example process for the Sonar data set and three models, but I think you will get the idea.
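For readers outside RapidMiner, the structure described above (learner choice as one more parameter, per-learner tuning inside, a hold-out around everything) can be sketched in plain Python. This is not the attached process: the two toy learners, the synthetic data, and every function name here are stand-ins invented for illustration. Only the 0.8 outer and 0.7 inner split ratios mirror the process settings.

```python
import random

def make_data(n, rng):
    """Toy binary data: class 1 is shifted along the first feature."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append(((rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0)), y))
    return rows

def fit_stump(train, feature):
    """Toy learner 1: threshold one feature halfway between the class means."""
    mean = {}
    for c in (0, 1):
        vals = [x[feature] for x, y in train if y == c]
        mean[c] = sum(vals) / len(vals)
    cut = (mean[0] + mean[1]) / 2.0
    high = 1 if mean[1] >= mean[0] else 0
    return lambda x: high if x[feature] > cut else 1 - high

def fit_knn(train, k):
    """Toy learner 2: k-nearest neighbours with a majority vote."""
    def predict(x):
        near = sorted(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))[:k]
        return 1 if 2 * sum(y for _, y in near) > len(near) else 0
    return predict

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def split(rows, ratio):
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

# Analogue of Select Subprocess: one entry per learner, each with its own grid.
CANDIDATES = {
    "stump": (fit_stump, [0, 1]),   # grid over the feature index
    "knn": (fit_knn, [1, 5, 15]),   # grid over k
}

def select_and_tune(train):
    """Pick the (learner, parameter) pair with the best inner validation accuracy."""
    inner_train, inner_val = split(train, 0.7)
    best = None
    for name, (fit, grid) in CANDIDATES.items():
        for value in grid:
            score = accuracy(fit(inner_train, value), inner_val)
            if best is None or score > best[0]:
                best = (score, name, value)
    return best

rng = random.Random(2001)
data = make_data(200, rng)
outer_train, outer_test = split(data, 0.8)

# Selection and tuning only ever see the outer training part.
inner_score, best_name, best_value = select_and_tune(outer_train)
final_model = CANDIDATES[best_name][0](outer_train, best_value)
outer_score = accuracy(final_model, outer_test)
print(best_name, best_value, round(inner_score, 3), round(outer_score, 3))
```

The inner score is the one the search optimised and tends to be optimistic; the outer score comes from data the search never touched, which is why the outer validation matters.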
Hope this helps,
Best regards,
Fabian
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Sonar" width="90" x="112" y="34"> <parameter key="repository_entry" value="//Samples/data/Sonar"/> </operator> <operator activated="true" class="nominal_to_binominal" compatibility="9.3.001" expanded="true" height="103" name="Nominal to Binominal" width="90" x="246" y="34"> <parameter key="return_preprocessing_model" value="false"/> <parameter key="create_view" value="false"/> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="class"/> <parameter key="attributes" value=""/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="nominal"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="file_path"/> <parameter key="block_type" value="single_value"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="single_value"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="true"/> <parameter key="transform_binominal" value="false"/> <parameter key="use_underscore_in_name" value="false"/> </operator> <operator activated="true" class="split_validation" compatibility="9.3.001" expanded="true" height="145" name="Validation" width="90" x="380" y="34"> <parameter key="create_complete_model" value="false"/> <parameter key="split"
value="relative"/> <parameter key="split_ratio" value="0.8"/> <parameter key="training_set_size" value="100"/> <parameter key="test_set_size" value="-1"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.3.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="112" y="34"> <list key="parameters"> <parameter key="Select Subprocess.select_which" value="[1.0;3;3;linear]"/> </list> <parameter key="error_handling" value="fail on error"/> <parameter key="log_performance" value="true"/> <parameter key="log_all_criteria" value="false"/> <parameter key="synchronize" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="select_subprocess" compatibility="9.3.001" expanded="true" height="124" name="Select Subprocess" width="90" x="380" y="34"> <parameter key="select_which" value="1"/> <process expanded="true"> <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.3.001" expanded="true" height="124" name="Optimize Parameters (Grid) (2)" width="90" x="112" y="34"> <list key="parameters"> <parameter key="Decision Tree.maximal_depth" value="[1;20;10;linear]"/> </list> <parameter key="error_handling" value="fail on error"/> <parameter key="log_performance" value="true"/> <parameter key="log_all_criteria" value="false"/> <parameter key="synchronize" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="split_validation" compatibility="9.3.001" expanded="true" height="124" name="Validation (2)" width="90" x="380" y="34"> <parameter key="create_complete_model" value="false"/> <parameter key="split" value="relative"/> <parameter key="split_ratio" 
value="0.7"/> <parameter key="training_set_size" value="100"/> <parameter key="test_set_size" value="-1"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="112" y="34"> <parameter key="criterion" value="gain_ratio"/> <parameter key="maximal_depth" value="10"/> <parameter key="apply_pruning" value="true"/> <parameter key="confidence" value="0.1"/> <parameter key="apply_prepruning" value="true"/> <parameter key="minimal_gain" value="0.01"/> <parameter key="minimal_leaf_size" value="2"/> <parameter key="minimal_size_for_split" value="4"/> <parameter key="number_of_prepruning_alternatives" value="3"/> </operator> <connect from_port="training" to_op="Decision Tree" to_port="training set"/> <connect from_op="Decision Tree" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="45" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_binominal_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34"> <parameter key="manually_set_positive_class" value="false"/> <parameter key="main_criterion" value="AUC"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="false"/> <parameter key="kappa" value="false"/> <parameter key="AUC (optimistic)" value="false"/> <parameter key="AUC" value="true"/> <parameter key="AUC 
(pessimistic)" value="false"/> <parameter key="precision" value="false"/> <parameter key="recall" value="false"/> <parameter key="lift" value="false"/> <parameter key="fallout" value="false"/> <parameter key="f_measure" value="false"/> <parameter key="false_positive" value="false"/> <parameter key="false_negative" value="false"/> <parameter key="true_positive" value="false"/> <parameter key="true_negative" value="false"/> <parameter key="sensitivity" value="false"/> <parameter key="specificity" value="false"/> <parameter key="youden" value="false"/> <parameter key="positive_predictive_value" value="false"/> <parameter key="negative_predictive_value" value="false"/> <parameter key="psep" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/> <connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <connect from_port="input 1" to_op="Validation (2)" to_port="training"/> <connect from_op="Validation (2)" from_port="model" to_port="model"/> <connect from_op="Validation (2)" from_port="averagable 1" to_port="performance"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> </process> <description align="center" color="transparent" colored="false" 
width="126">Optimize Decision Tree</description> </operator> <connect from_port="input 1" to_op="Optimize Parameters (Grid) (2)" to_port="input 1"/> <connect from_op="Optimize Parameters (Grid) (2)" from_port="performance" to_port="output 1"/> <connect from_op="Optimize Parameters (Grid) (2)" from_port="model" to_port="output 2"/> <connect from_op="Optimize Parameters (Grid) (2)" from_port="parameter set" to_port="output 3"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> <portSpacing port="sink_output 3" spacing="0"/> <portSpacing port="sink_output 4" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.3.001" expanded="true" height="124" name="Optimize Parameters (Grid) (3)" width="90" x="112" y="34"> <list key="parameters"> <parameter key="Random Forest.number_of_trees" value="[30;100.0;5;linear]"/> </list> <parameter key="error_handling" value="fail on error"/> <parameter key="log_performance" value="true"/> <parameter key="log_all_criteria" value="false"/> <parameter key="synchronize" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="split_validation" compatibility="9.3.001" expanded="true" height="124" name="Validation (3)" width="90" x="380" y="34"> <parameter key="create_complete_model" value="false"/> <parameter key="split" value="relative"/> <parameter key="split_ratio" value="0.7"/> <parameter key="training_set_size" value="100"/> <parameter key="test_set_size" value="-1"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" class="concurrency:parallel_random_forest" 
compatibility="9.3.001" expanded="true" height="103" name="Random Forest" width="90" x="112" y="34"> <parameter key="number_of_trees" value="100"/> <parameter key="criterion" value="gain_ratio"/> <parameter key="maximal_depth" value="10"/> <parameter key="apply_pruning" value="false"/> <parameter key="confidence" value="0.1"/> <parameter key="apply_prepruning" value="false"/> <parameter key="minimal_gain" value="0.01"/> <parameter key="minimal_leaf_size" value="2"/> <parameter key="minimal_size_for_split" value="4"/> <parameter key="number_of_prepruning_alternatives" value="3"/> <parameter key="random_splits" value="false"/> <parameter key="guess_subset_ratio" value="true"/> <parameter key="subset_ratio" value="0.2"/> <parameter key="voting_strategy" value="confidence vote"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="enable_parallel_execution" value="true"/> </operator> <connect from_port="training" to_op="Random Forest" to_port="training set"/> <connect from_op="Random Forest" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_binominal_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (3)" width="90" x="246" y="34"> <parameter key="manually_set_positive_class" value="false"/> <parameter key="main_criterion" value="AUC"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="false"/> <parameter key="kappa" value="false"/> <parameter key="AUC (optimistic)" value="false"/> 
<parameter key="AUC" value="true"/> <parameter key="AUC (pessimistic)" value="false"/> <parameter key="precision" value="false"/> <parameter key="recall" value="false"/> <parameter key="lift" value="false"/> <parameter key="fallout" value="false"/> <parameter key="f_measure" value="false"/> <parameter key="false_positive" value="false"/> <parameter key="false_negative" value="false"/> <parameter key="true_positive" value="false"/> <parameter key="true_negative" value="false"/> <parameter key="sensitivity" value="false"/> <parameter key="specificity" value="false"/> <parameter key="youden" value="false"/> <parameter key="positive_predictive_value" value="false"/> <parameter key="negative_predictive_value" value="false"/> <parameter key="psep" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_port="model" to_op="Apply Model (3)" to_port="model"/> <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/> <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/> <connect from_op="Performance (3)" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <connect from_port="input 1" to_op="Validation (3)" to_port="training"/> <connect from_op="Validation (3)" from_port="model" to_port="model"/> <connect from_op="Validation (3)" from_port="averagable 1" to_port="performance"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> </process> <description 
align="center" color="transparent" colored="false" width="126">Optimize Random Forest</description> </operator> <connect from_port="input 1" to_op="Optimize Parameters (Grid) (3)" to_port="input 1"/> <connect from_op="Optimize Parameters (Grid) (3)" from_port="performance" to_port="output 1"/> <connect from_op="Optimize Parameters (Grid) (3)" from_port="model" to_port="output 2"/> <connect from_op="Optimize Parameters (Grid) (3)" from_port="parameter set" to_port="output 3"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> <portSpacing port="sink_output 3" spacing="0"/> <portSpacing port="sink_output 4" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="concurrency:optimize_parameters_grid" compatibility="9.3.001" expanded="true" height="124" name="Optimize Parameters (Grid) (4)" width="90" x="112" y="34"> <list key="parameters"> <parameter key="Gradient Boosted Trees.number_of_trees" value="[10;100;5;linear]"/> </list> <parameter key="error_handling" value="fail on error"/> <parameter key="log_performance" value="true"/> <parameter key="log_all_criteria" value="false"/> <parameter key="synchronize" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="split_validation" compatibility="9.3.001" expanded="true" height="124" name="Validation (4)" width="90" x="380" y="34"> <parameter key="create_complete_model" value="false"/> <parameter key="split" value="relative"/> <parameter key="split_ratio" value="0.7"/> <parameter key="training_set_size" value="100"/> <parameter key="test_set_size" value="-1"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <process expanded="true"> <operator activated="true" 
class="h2o:gradient_boosted_trees" compatibility="9.3.001" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="179" y="34"> <parameter key="number_of_trees" value="100"/> <parameter key="reproducible" value="false"/> <parameter key="maximum_number_of_threads" value="4"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="maximal_depth" value="10"/> <parameter key="min_rows" value="10.0"/> <parameter key="min_split_improvement" value="0.0"/> <parameter key="number_of_bins" value="20"/> <parameter key="learning_rate" value="0.01"/> <parameter key="sample_rate" value="1.0"/> <parameter key="distribution" value="AUTO"/> <parameter key="early_stopping" value="false"/> <parameter key="stopping_rounds" value="1"/> <parameter key="stopping_metric" value="AUTO"/> <parameter key="stopping_tolerance" value="0.001"/> <parameter key="max_runtime_seconds" value="0"/> <list key="expert_parameters"/> </operator> <connect from_port="training" to_op="Gradient Boosted Trees" to_port="training set"/> <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model (4)" width="90" x="45" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_binominal_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance (4)" width="90" x="246" y="34"> <parameter key="manually_set_positive_class" value="false"/> <parameter key="main_criterion" value="AUC"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="false"/> <parameter key="kappa" 
value="false"/> <parameter key="AUC (optimistic)" value="false"/> <parameter key="AUC" value="true"/> <parameter key="AUC (pessimistic)" value="false"/> <parameter key="precision" value="false"/> <parameter key="recall" value="false"/> <parameter key="lift" value="false"/> <parameter key="fallout" value="false"/> <parameter key="f_measure" value="false"/> <parameter key="false_positive" value="false"/> <parameter key="false_negative" value="false"/> <parameter key="true_positive" value="false"/> <parameter key="true_negative" value="false"/> <parameter key="sensitivity" value="false"/> <parameter key="specificity" value="false"/> <parameter key="youden" value="false"/> <parameter key="positive_predictive_value" value="false"/> <parameter key="negative_predictive_value" value="false"/> <parameter key="psep" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_port="model" to_op="Apply Model (4)" to_port="model"/> <connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/> <connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/> <connect from_op="Performance (4)" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <connect from_port="input 1" to_op="Validation (4)" to_port="training"/> <connect from_op="Validation (4)" from_port="model" to_port="model"/> <connect from_op="Validation (4)" from_port="averagable 1" to_port="performance"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> 
<portSpacing port="sink_output 1" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">Optimize GBT</description> </operator> <connect from_port="input 1" to_op="Optimize Parameters (Grid) (4)" to_port="input 1"/> <connect from_op="Optimize Parameters (Grid) (4)" from_port="performance" to_port="output 1"/> <connect from_op="Optimize Parameters (Grid) (4)" from_port="model" to_port="output 2"/> <connect from_op="Optimize Parameters (Grid) (4)" from_port="parameter set" to_port="output 3"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> <portSpacing port="sink_output 3" spacing="0"/> <portSpacing port="sink_output 4" spacing="0"/> </process> </operator> <connect from_port="input 1" to_op="Select Subprocess" to_port="input 1"/> <connect from_op="Select Subprocess" from_port="output 1" to_port="performance"/> <connect from_op="Select Subprocess" from_port="output 2" to_port="model"/> <connect from_op="Select Subprocess" from_port="output 3" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="collect" compatibility="9.3.001" expanded="true" height="103" name="Collect" width="90" x="246" y="136"> <parameter key="unfold" value="false"/> </operator> <operator activated="true" class="remember" compatibility="9.3.001" expanded="true" height="68" name="Remember" width="90" x="365" y="136"> <parameter key="name" value="parameter sets"/> <parameter key="io_object" value="IOObjectCollection"/> <parameter key="store_which" value="1"/> <parameter key="remove_from_process" 
value="true"/> </operator> <connect from_port="training" to_op="Optimize Parameters (Grid)" to_port="input 1"/> <connect from_op="Optimize Parameters (Grid)" from_port="model" to_port="model"/> <connect from_op="Optimize Parameters (Grid)" from_port="parameter set" to_op="Collect" to_port="input 1"/> <connect from_op="Optimize Parameters (Grid)" from_port="output 1" to_op="Collect" to_port="input 2"/> <connect from_op="Collect" from_port="collection" to_op="Remember" to_port="store"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_binominal_classification" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"> <parameter key="manually_set_positive_class" value="false"/> <parameter key="main_criterion" value="first"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="false"/> <parameter key="kappa" value="false"/> <parameter key="AUC (optimistic)" value="false"/> <parameter key="AUC" value="false"/> <parameter key="AUC (pessimistic)" value="false"/> <parameter key="precision" value="false"/> <parameter key="recall" value="false"/> <parameter key="lift" value="false"/> <parameter key="fallout" value="false"/> <parameter key="f_measure" value="false"/> <parameter key="false_positive" value="false"/> <parameter key="false_negative" value="false"/> <parameter key="true_positive" value="false"/> <parameter key="true_negative" value="false"/> <parameter key="sensitivity" value="false"/> <parameter key="specificity" value="false"/> <parameter key="youden" value="false"/> <parameter 
key="positive_predictive_value" value="false"/> <parameter key="negative_predictive_value" value="false"/> <parameter key="psep" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> </operator> <connect from_port="model" to_op="Apply Model" to_port="model"/> <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/> <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/> <connect from_op="Performance" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> <portSpacing port="sink_averagable 3" spacing="0"/> </process> </operator> <operator activated="true" class="subprocess" compatibility="9.3.001" expanded="true" height="103" name="Subprocess" width="90" x="581" y="136"> <process expanded="true"> <operator activated="true" class="recall" compatibility="9.3.001" expanded="true" height="68" name="Recall" width="90" x="447" y="85"> <parameter key="name" value="parameter sets"/> <parameter key="io_object" value="IOObjectCollection"/> <parameter key="remove_from_store" value="true"/> </operator> <connect from_port="in 1" to_port="out 1"/> <connect from_op="Recall" from_port="result" to_port="out 2"/> <portSpacing port="source_in 1" spacing="0"/> <portSpacing port="source_in 2" spacing="0"/> <portSpacing port="sink_out 1" spacing="0"/> <portSpacing port="sink_out 2" spacing="0"/> <portSpacing port="sink_out 3" spacing="0"/> </process> </operator> <connect from_op="Retrieve Sonar" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/> <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Validation" to_port="training"/> <connect from_op="Validation" 
from_port="model" to_port="result 1"/> <connect from_op="Validation" from_port="training" to_port="result 2"/> <connect from_op="Validation" from_port="averagable 1" to_op="Subprocess" to_port="in 1"/> <connect from_op="Subprocess" from_port="out 1" to_port="result 3"/> <connect from_op="Subprocess" from_port="out 2" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn
Hi @dome
I suggest you start with an analysis of the underlying problem. What exactly is the data you are working with? Which metric is the most important from a practical (or business) point of view? Are there different misclassification costs for the positive and negative classes? All of this is very important for the model optimisation process.
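As a small illustration of why misclassification costs matter, the sketch below (plain Python, with invented error counts and invented costs) shows two models whose ranking flips once a false negative is priced higher than a false positive.

```python
def total_cost(errors, cost_fp, cost_fn):
    """Weighted misclassification cost for given false-positive/false-negative counts."""
    return errors["fp"] * cost_fp + errors["fn"] * cost_fn

# Hypothetical error counts for two models on the same 1000 test examples.
model_a = {"fp": 40, "fn": 10}  # 50 errors in total
model_b = {"fp": 10, "fn": 30}  # 40 errors in total, so higher accuracy

# Equal costs: this is the same as counting errors, and model B wins.
equal_a = total_cost(model_a, cost_fp=1, cost_fn=1)
equal_b = total_cost(model_b, cost_fp=1, cost_fn=1)

# Assumed business case: a missed positive is ten times worse than a false alarm.
skewed_a = total_cost(model_a, cost_fp=1, cost_fn=10)
skewed_b = total_cost(model_b, cost_fp=1, cost_fn=10)

print(equal_a, equal_b)    # 50 40
print(skewed_a, skewed_b)  # 140 310
```

With equal costs the more accurate model B looks better; with the skewed costs model A is far cheaper, so the metric has to be fixed before the models are compared.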
However, a couple of pieces of advice:
- Narrow down the list of possible models by using this online tool: http://mod.rapidminer.com/#app
- If you are sure that AUC is the metric you want to optimise in the first place, use the COMPARE ROCS operator, which helps you compare different models. You said you tried it, and I think it is completely fine as a first step to keep the default settings for all learners.
- After you have chosen a final model, make sure you understand its most important parameters and then use the OPTIMIZE PARAMETERS operator to find the best combination. Usually there is no need to cycle through them all; most models have just a few parameters that matter most.
hughesfleming68 Member Posts: 323 Unicorn
Keep in mind that you don't have a lot of data, so you will need to be very careful about how you validate your models. While you have a lot of choice among operators for binary classification, you can narrow them down quite significantly. Look at linear models first (SVM, GLM) and then trees (Random Forest and Gradient Boosted Trees). You should be able to get a good feel for your data and its predictability from these four.
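One common way to validate carefully on a small data set is stratified k-fold cross-validation, which uses every example for testing exactly once and reports the spread across folds. The sketch below is plain Python with a majority-class baseline as a stand-in learner. It is an assumption about what careful validation could look like here, not something the post prescribes, and all names are illustrative.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Assign example indices to k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)  # deal each class out round-robin
    return folds

def cross_validate(rows, labels, fit, k=5, seed=0):
    """Return mean and standard deviation of accuracy over k stratified folds."""
    scores = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train = [(rows[i], labels[i]) for i in range(len(rows)) if i not in held_out]
        model = fit(train)
        correct = sum(model(rows[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    mean = sum(scores) / k
    std = (sum((s - mean) ** 2 for s in scores) / k) ** 0.5
    return mean, std

def fit_majority(train):
    """Baseline learner: always predict the most frequent training label."""
    counts = {}
    for _, y in train:
        counts[y] = counts.get(y, 0) + 1
    top = max(counts, key=counts.get)
    return lambda x: top

# Made-up data: 30 positives, 70 negatives, one dummy feature.
labels = [1] * 30 + [0] * 70
rows = [[float(i)] for i in range(100)]
mean_acc, std_acc = cross_validate(rows, labels, fit_majority, k=5)
print(round(mean_acc, 3), round(std_acc, 3))
```

A baseline like this also gives a floor: any of the four suggested learners should beat it clearly, and a large standard deviation across folds is a hint that the data set is too small to separate close candidates.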
Answers