Process that worked under RapidMiner 4.4 is now giving JAVA.OUTOFMEMORY ERROR

Roberto Member Posts: 13 Contributor II
edited November 2018 in Help
Hello,

I'm trying to use feature selection with embedded validation and JMySVMLearner to select relevant features in a dataset.  The dataset is a CSV file with 28 examples, each containing 2000 attributes with values between 0 and 1, and a single label that is either true or false.  In the previous version of RapidMiner I had no problem doing this; it just took a lot of time.  Now, with 4.5, I get an out-of-memory error from Java within 45 minutes of starting the run.

Here's my code:

<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
        <parameter key="filename" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Survival with 3 yr followup top 2000- JUNE09.csv"/>
        <parameter key="label_column" value="2"/>
        <parameter key="id_column" value="1"/>
    </operator>
    <operator name="ExampleSetTranspose" class="ExampleSetTranspose">
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="Survival Beyond 3 yrs"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
        <parameter key="default" value="zero"/>
        <list key="columns">
        </list>
    </operator>
    <operator name="NominalNumbers2Numerical" class="NominalNumbers2Numerical">
    </operator>
    <operator name="WrapperXValidation" class="WrapperXValidation" expanded="yes">
        <operator name="FeatureSelection" class="FeatureSelection" expanded="yes">
            <parameter key="show_stop_dialog" value="true"/>
            <parameter key="show_population_plotter" value="true"/>
            <parameter key="plot_generations" value="1"/>
            <parameter key="keep_best" value="25"/>
            <operator name="XValidation (2)" class="XValidation" expanded="yes">
                <parameter key="average_performances_only" value="false"/>
                <parameter key="leave_one_out" value="true"/>
                <operator name="JMySVMLearner" class="JMySVMLearner">
                    <parameter key="calculate_weights" value="true"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <parameter key="keep_model" value="true"/>
                        <list key="application_parameters">
                        </list>
                        <parameter key="create_view" value="true"/>
                    </operator>
                    <operator name="ClassificationPerformance" class="ClassificationPerformance">
                        <parameter key="keep_example_set" value="true"/>
                        <parameter key="accuracy" value="true"/>
                        <parameter key="classification_error" value="true"/>
                        <parameter key="kappa" value="true"/>
                        <parameter key="weighted_mean_recall" value="true"/>
                        <parameter key="weighted_mean_precision" value="true"/>
                        <parameter key="spearman_rho" value="true"/>
                        <parameter key="kendall_tau" value="true"/>
                        <parameter key="absolute_error" value="true"/>
                        <parameter key="relative_error" value="true"/>
                        <parameter key="relative_error_lenient" value="true"/>
                        <parameter key="relative_error_strict" value="true"/>
                        <parameter key="correlation" value="true"/>
                        <list key="class_weights">
                        </list>
                    </operator>
                    <operator name="MinMaxWrapper" class="MinMaxWrapper">
                        <parameter key="minimum_weight" value="0.5"/>
                    </operator>
                </operator>
            </operator>
        </operator>
        <operator name="LibSVMLearner (2)" class="LibSVMLearner">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="cache_size" value="5000"/>
            <list key="class_weights">
            </list>
            <parameter key="calculate_confidences" value="true"/>
        </operator>
        <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier (2)" class="ModelApplier">
                <parameter key="keep_model" value="true"/>
                <list key="application_parameters">
                </list>
                <parameter key="create_view" value="true"/>
            </operator>
            <operator name="ClassificationPerformance (2)" class="ClassificationPerformance">
                <parameter key="keep_example_set" value="true"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="true"/>
                <parameter key="weighted_mean_recall" value="true"/>
                <parameter key="weighted_mean_precision" value="true"/>
                <parameter key="spearman_rho" value="true"/>
                <parameter key="kendall_tau" value="true"/>
                <parameter key="absolute_error" value="true"/>
                <parameter key="relative_error" value="true"/>
                <parameter key="relative_error_lenient" value="true"/>
                <parameter key="relative_error_strict" value="true"/>
                <parameter key="correlation" value="true"/>
                <list key="class_weights">
                </list>
            </operator>
            <operator name="ModelWriter" class="ModelWriter">
                <parameter key="model_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival OP with complete followup.mod"/>
            </operator>
        </operator>
    </operator>
</operator>

Any help would be appreciated!  Thanks!
Roberto

...A second, less pertinent question: the wrapper validation takes forever to process. In the past I have used just a weight-guided feature selection on the dataset, after an SVMWeighting operator that was not nested like this, and got 100% accuracy within a couple of hours.  Can I trust those results, or is the wrapper validation the way to go?  Again, thanks so much!

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    to answer the second question first: this depends on where you inserted the weighting. If it was inside the learning part of the XValidation, then it could not have used data it is not supposed to know, and the XValidation should return a valid performance estimate.

    To the first question: the FeatureSelection can use a very large amount of memory, and it may now use a little more than before because some of the underlying data structures have changed. Perhaps you can increase the maximum heap size?

    Greetings,
      Sebastian
  • Roberto Member Posts: 13 Contributor II
    Thanks for the reply Sebastian,

    As for the status of my problem: the computer died on me this morning, so I'll have to wait until I can build my new system to deal with the memory issue.  The new computer will have 48 GB of RAM, so I don't think it will have a problem.  In your opinion, will the 48 GB system be enough to handle a dataset with the same number of examples (28) but about 13,500 attributes?  Our 12 GB machine maxed out at a 24 × 3000 matrix of real values to train on, and our 8 GB machine maxed out at 24 × 2000.
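
    For scale, here is a rough back-of-the-envelope calculation of the raw size of these matrices. It is only a sketch: it assumes every value is held as an 8-byte double, and it says nothing about how RapidMiner actually stores example sets or what the feature selection keeps in memory. The function name is made up for illustration.

```python
# Assumption: each attribute value occupies one 8-byte double.
BYTES_PER_DOUBLE = 8


def raw_matrix_bytes(examples, attributes):
    """Bytes needed for a dense examples x attributes matrix of doubles."""
    return examples * attributes * BYTES_PER_DOUBLE


if __name__ == "__main__":
    # Matrix sizes mentioned in the thread.
    for examples, attributes in [(24, 2000), (24, 3000), (28, 13500)]:
        size = raw_matrix_bytes(examples, attributes)
        print(f"{examples} x {attributes}: {size} bytes ({size / 1024**2:.2f} MiB)")
```

    Under that assumption even the 28 × 13,500 matrix is only a few megabytes of raw values, so whatever exhausts the heap is more likely the intermediate structures created during the search than the data itself.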

    As for my second question, this is the algorithm that I used to do that analysis...

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="resultfile" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Nodal with best Cpgs Lo and Hi.res"/>
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Complete datasheet OP tumors.csv"/>
            <parameter key="label_column" value="2"/>
            <parameter key="id_column" value="1"/>
        </operator>
        <operator name="ExampleSetTranspose" class="ExampleSetTranspose">
        </operator>
        <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
            <parameter key="default" value="zero"/>
            <list key="columns">
            </list>
        </operator>
        <operator name="ChangeAttributeRole (2)" class="ChangeAttributeRole">
            <parameter key="name" value="Survival Beyond 3 yrs"/>
            <parameter key="target_role" value="label"/>
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
        </operator>
        <operator name="SVMWeighting" class="SVMWeighting">
        </operator>
        <operator name="WeightGuidedFeatureSelection" class="WeightGuidedFeatureSelection" expanded="yes">
            <parameter key="draw_dominated_points" value="false"/>
            <parameter key="population_criteria_data_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Survival Complete OP Training data pop criteria.cri"/>
            <parameter key="generations_without_improval" value="-1"/>
            <parameter key="use_absolute_weights" value="false"/>
            <operator name="XValidation (2)" class="XValidation" expanded="yes">
                <parameter key="create_complete_model" value="true"/>
                <parameter key="average_performances_only" value="false"/>
                <parameter key="leave_one_out" value="true"/>
                <operator name="LibSVMLearner" class="LibSVMLearner">
                    <parameter key="keep_example_set" value="true"/>
                    <parameter key="nu" value="0.1"/>
                    <parameter key="cache_size" value="5000"/>
                    <list key="class_weights">
                    </list>
                    <parameter key="calculate_confidences" value="true"/>
                </operator>
                <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                    <operator name="ModelApplier" class="ModelApplier">
                        <parameter key="keep_model" value="true"/>
                        <list key="application_parameters">
                        </list>
                        <parameter key="create_view" value="true"/>
                    </operator>
                    <operator name="ClassificationPerformance" class="ClassificationPerformance">
                        <parameter key="keep_example_set" value="true"/>
                        <parameter key="accuracy" value="true"/>
                        <parameter key="classification_error" value="true"/>
                        <parameter key="kappa" value="true"/>
                        <parameter key="weighted_mean_recall" value="true"/>
                        <parameter key="weighted_mean_precision" value="true"/>
                        <parameter key="spearman_rho" value="true"/>
                        <parameter key="kendall_tau" value="true"/>
                        <parameter key="absolute_error" value="true"/>
                        <parameter key="relative_error" value="true"/>
                        <parameter key="relative_error_lenient" value="true"/>
                        <parameter key="relative_error_strict" value="true"/>
                        <list key="class_weights">
                        </list>
                    </operator>
                    <operator name="ModelWriter" class="ModelWriter">
                        <parameter key="model_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival OP with complete DS model.mod"/>
                    </operator>
                </operator>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <list key="log">
                  <parameter key="gen" value="operator.WeightGuidedFeatureSelection.value.generation"/>
                  <parameter key="per" value="operator.WeightGuidedFeatureSelection.value.performance"/>
                </list>
            </operator>
        </operator>
        <operator name="PerformanceWriter" class="PerformanceWriter">
            <parameter key="performance_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival OP complete DS feature selection performance.per"/>
        </operator>
        <operator name="ResultWriter" class="ResultWriter">
            <parameter key="result_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Survival with complete CpGs.res"/>
        </operator>
    </operator>


    Now, if I'm understanding you correctly, you're telling me to place the SVMWeighting within the XValidation?  I'm a little confused about how to structure that.  The weight-guided feature selection relies on an initial set of weights provided by the SVMWeighting operator.  It is supposed to select features that improve a model provided by the LibSVMLearner, whose incremental performance is monitored by the XValidation and ClassificationPerformance operators, correct?  Then, when the maximal fitness or a certain number of generations without improvement is reached, the selected features are returned as a result along with the performance statistics?  So where exactly would I put the SVMWeighting, or is it fine as is?

    thanks so much!
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Roberto,
    the problem with your approach is that the weighting is calculated on the complete data set. To estimate the real performance you have to use unseen data, and if you compute the weights using all of your data, you don't have any unseen data left.
    So I would suggest including the complete weight-guided feature selection in an XValidation. This would increase the runtime, but would give you reliable results.
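
    The effect of selecting features on all of the data can be seen on pure noise. The following is a minimal sketch in plain Python, not RapidMiner code. It assumes random attributes, alternating labels, a simple mean-difference ranking in place of SVMWeighting, and a nearest-centroid classifier in place of the SVM; all function names are made up for illustration.

```python
import random


def class_means(rows, labels, j):
    """Mean of feature j for the positive and the negative class."""
    pos = [r[j] for r, y in zip(rows, labels) if y]
    neg = [r[j] for r, y in zip(rows, labels) if not y]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def top_k_features(rows, labels, k):
    """Rank features by |mean(pos) - mean(neg)| and keep the k best."""
    def score(j):
        p, n = class_means(rows, labels, j)
        return abs(p - n)
    return sorted(range(len(rows[0])), key=score, reverse=True)[:k]


def predict(train_rows, train_labels, features, x):
    """Nearest-centroid prediction using only the chosen features."""
    d_pos = d_neg = 0.0
    for j in features:
        p, n = class_means(train_rows, train_labels, j)
        d_pos += (x[j] - p) ** 2
        d_neg += (x[j] - n) ** 2
    return d_pos < d_neg


def loo_accuracy(rows, labels, k, select_inside):
    """Leave-one-out accuracy.

    select_inside=False: features are chosen once on ALL rows (leaky).
    select_inside=True:  features are chosen again on each training fold.
    """
    fixed = None if select_inside else top_k_features(rows, labels, k)
    hits = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        feats = top_k_features(train_rows, train_labels, k) if select_inside else fixed
        hits += predict(train_rows, train_labels, feats, rows[i]) == labels[i]
    return hits / len(rows)


def demo(seed=0, n=28, m=2000, k=10):
    """Noise data shaped like the thread's dataset: n examples, m attributes."""
    rng = random.Random(seed)
    rows = [[rng.random() for _ in range(m)] for _ in range(n)]
    labels = [i % 2 == 0 for i in range(n)]  # labels unrelated to the attributes
    return (loo_accuracy(rows, labels, k, select_inside=False),
            loo_accuracy(rows, labels, k, select_inside=True))


if __name__ == "__main__":
    leaky, honest = demo()
    print(f"selection on all data:      {leaky:.2f}")
    print(f"selection inside each fold: {honest:.2f}")
```

    Because the attributes carry no information about the label, the estimate with selection inside each fold should stay near chance, while selecting once on all of the data tends to look much better than it deserves.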

    On your computer problem: I'm not sure whether 48 GB will suffice, because I don't know whether the memory consumption increases linearly. But I do know that I could provide you with a plugin for forward and backward feature selection that supports a nearly unlimited number of attributes, for less than a tenth of the price of that computer :) Probably for less than the price of 32 GB of that computer's RAM...

    Greetings,
      Sebastian
  • Roberto Member Posts: 13 Contributor II
    The problem with using the weight-guided feature selection within an XValidation operator is that it does not return a Model as output, so the process fails. Even if I save the resulting model with ModelWriter and then load it using ModelUploader in the operator chain within XValidation, I still get an error.  And if I use WrapperXValidation, I get an error saying that the attribute weights are not being passed to the weight-guided feature selection operator. Can you suggest a workaround?

    This is the code I used to get the first error:

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="resultfile" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Nodal with best Cpgs Lo and Hi.res"/>
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="G:\Roberto Lleras\Machine Learning Algorithms\OP Survival best 5000.csv"/>
            <parameter key="label_column" value="2"/>
            <parameter key="id_column" value="1"/>
        </operator>
        <operator name="ExampleSetTranspose" class="ExampleSetTranspose">
        </operator>
        <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
            <parameter key="default" value="zero"/>
            <list key="columns">
            </list>
        </operator>
        <operator name="ChangeAttributeRole (2)" class="ChangeAttributeRole">
            <parameter key="name" value="Survival Beyond 3 yrs"/>
            <parameter key="target_role" value="label"/>
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
        </operator>
        <operator name="SVMWeighting" class="SVMWeighting">
        </operator>
        <operator name="ResultWriter" class="ResultWriter">
            <parameter key="result_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Survival with complete CpGs.res"/>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <operator name="WeightGuidedFeatureSelection" class="WeightGuidedFeatureSelection" expanded="yes">
                <parameter key="draw_dominated_points" value="false"/>
                <parameter key="population_criteria_data_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Survival Complete OP Training data pop criteria.cri"/>
                <parameter key="generations_without_improval" value="-1"/>
                <operator name="XValidation (2)" class="XValidation" expanded="yes">
                    <parameter key="create_complete_model" value="true"/>
                    <operator name="LibSVMLearner" class="LibSVMLearner">
                        <parameter key="keep_example_set" value="true"/>
                        <parameter key="nu" value="0.1"/>
                        <parameter key="cache_size" value="5000"/>
                        <list key="class_weights">
                        </list>
                        <parameter key="calculate_confidences" value="true"/>
                    </operator>
                    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                        <operator name="ModelApplier" class="ModelApplier">
                            <parameter key="keep_model" value="true"/>
                            <list key="application_parameters">
                            </list>
                            <parameter key="create_view" value="true"/>
                        </operator>
                        <operator name="ClassificationPerformance" class="ClassificationPerformance">
                            <parameter key="keep_example_set" value="true"/>
                            <parameter key="accuracy" value="true"/>
                            <parameter key="classification_error" value="true"/>
                            <parameter key="kappa" value="true"/>
                            <parameter key="weighted_mean_recall" value="true"/>
                            <parameter key="weighted_mean_precision" value="true"/>
                            <parameter key="spearman_rho" value="true"/>
                            <parameter key="kendall_tau" value="true"/>
                            <parameter key="absolute_error" value="true"/>
                            <parameter key="relative_error" value="true"/>
                            <parameter key="relative_error_lenient" value="true"/>
                            <parameter key="relative_error_strict" value="true"/>
                            <list key="class_weights">
                            </list>
                        </operator>
                        <operator name="ModelWriter" class="ModelWriter">
                            <parameter key="model_file" value="G:\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival model.mod"/>
                        </operator>
                        <operator name="ProcessLog" class="ProcessLog">
                            <list key="log">
                              <parameter key="gen" value="operator.WeightGuidedFeatureSelection.value.generation"/>
                              <parameter key="per" value="operator.WeightGuidedFeatureSelection.value.performance"/>
                            </list>
                        </operator>
                    </operator>
                </operator>
            </operator>
            <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier (2)" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance (2)" class="ClassificationPerformance">
                    <list key="class_weights">
                    </list>
                </operator>
                <operator name="PerformanceWriter" class="PerformanceWriter">
                    <parameter key="performance_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival OP complete DS feature selection performance.per"/>
                </operator>
            </operator>
        </operator>
    </operator>


    And here's the code that doesn't pass the attribute weights to the feature selection:

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="resultfile" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Nodal with best Cpgs Lo and Hi.res"/>
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="G:\Roberto Lleras\Machine Learning Algorithms\OP Survival best 5000.csv"/>
            <parameter key="label_column" value="2"/>
            <parameter key="id_column" value="1"/>
        </operator>
        <operator name="ExampleSetTranspose" class="ExampleSetTranspose">
        </operator>
        <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
            <parameter key="default" value="zero"/>
            <list key="columns">
            </list>
        </operator>
        <operator name="ChangeAttributeRole (2)" class="ChangeAttributeRole">
            <parameter key="name" value="Survival Beyond 3 yrs"/>
            <parameter key="target_role" value="label"/>
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
        </operator>
        <operator name="SVMWeighting" class="SVMWeighting">
        </operator>
        <operator name="WrapperXValidation" class="WrapperXValidation" expanded="yes">
            <operator name="WeightGuidedFeatureSelection" class="WeightGuidedFeatureSelection" expanded="yes">
                <parameter key="draw_dominated_points" value="false"/>
                <parameter key="population_criteria_data_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Survival Complete OP Training data pop criteria.cri"/>
                <parameter key="generations_without_improval" value="-1"/>
                <operator name="XValidation (2)" class="XValidation" expanded="yes">
                    <parameter key="create_complete_model" value="true"/>
                    <operator name="LibSVMLearner" class="LibSVMLearner">
                        <parameter key="keep_example_set" value="true"/>
                        <parameter key="nu" value="0.1"/>
                        <parameter key="cache_size" value="5000"/>
                        <list key="class_weights">
                        </list>
                        <parameter key="calculate_confidences" value="true"/>
                    </operator>
                    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                        <operator name="ModelApplier" class="ModelApplier">
                            <parameter key="keep_model" value="true"/>
                            <list key="application_parameters">
                            </list>
                            <parameter key="create_view" value="true"/>
                        </operator>
                        <operator name="ClassificationPerformance" class="ClassificationPerformance">
                            <parameter key="keep_example_set" value="true"/>
                            <parameter key="accuracy" value="true"/>
                            <parameter key="classification_error" value="true"/>
                            <parameter key="kappa" value="true"/>
                            <parameter key="weighted_mean_recall" value="true"/>
                            <parameter key="weighted_mean_precision" value="true"/>
                            <parameter key="spearman_rho" value="true"/>
                            <parameter key="kendall_tau" value="true"/>
                            <parameter key="absolute_error" value="true"/>
                            <parameter key="relative_error" value="true"/>
                            <parameter key="relative_error_lenient" value="true"/>
                            <parameter key="relative_error_strict" value="true"/>
                            <list key="class_weights">
                            </list>
                        </operator>
                        <operator name="ModelWriter" class="ModelWriter">
                            <parameter key="model_file" value="G:\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival model.mod"/>
                        </operator>
                        <operator name="ProcessLog" class="ProcessLog">
                            <list key="log">
                              <parameter key="gen" value="operator.WeightGuidedFeatureSelection.value.generation"/>
                              <parameter key="per" value="operator.WeightGuidedFeatureSelection.value.performance"/>
                            </list>
                        </operator>
                    </operator>
                </operator>
            </operator>
            <operator name="LibSVMLearner (2)" class="LibSVMLearner">
                <list key="class_weights">
                </list>
            </operator>
            <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier (2)" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance (2)" class="ClassificationPerformance">
                    <list key="class_weights">
                    </list>
                </operator>
            </operator>
        </operator>
        <operator name="PerformanceWriter" class="PerformanceWriter">
            <parameter key="performance_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\3 yr survival OP complete DS feature selection performance.per"/>
        </operator>
        <operator name="ResultWriter" class="ResultWriter">
            <parameter key="result_file" value="D:\Lab Projects\Roberto Lleras\Machine Learning Algorithms\Methylation machine\Survival with complete CpGs.res"/>
        </operator>
    </operator>


    As for the plugin, we'll have to see how the new computer performs; it's already all ordered and half of the parts are here.  If we can't do what we want to do with it, though, then my boss may just take you up on that offer.  If I understand how the feature selection operator works, though, memory consumption should be linear.

    Thanks for all your help Sebastian!
  • Roberto Member Posts: 13 Contributor II
    Sebastian,

    I talked to my boss about the plugin.  Could you send me the details?  How much would a single-seat license cost?  How much would it cost to host it on a local server for up to 10 users?
    If you could send me that information, that would be great!

    Roberto