CachedDatabaseExampleSource + Problems

Angela · February 2010

Hi

I want to process a lot of data with many simple calculations.
For this, RapidMiner suggests using the CachedDatabaseExampleSource.
I have created the process below. It works at first, but after a few minutes the process is interrupted.
I got the following message: Feb 15, 2010 11:48:43 AM WARNING: Caught exception in concurrent execution of FS (Optimize Selection): java.lang.OutOfMemoryError: Java heap space

What can I do?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Root">
<description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search direction from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and replace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description>
<process expanded="true" height="280" width="195">
<operator activated="true" class="stream_database" expanded="true" height="60" name="CachedDatabaseExampleSource" width="90" x="45" y="120">
<parameter key="define_connection" value="url"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/lai_gesamtzeitraum_aisa_baender"/>
<parameter key="username" value="root"/>
<parameter key="password" value="w3mYv/Z+Ew7XwAzejr7xJA=="/>
<parameter key="table_name" value="lai_gesamtzeitraum_aisa_baender"/>
<parameter key="recreate_index" value="true"/>
<parameter key="label_attribute" value="lai"/>
<parameter key="id_attribute" value="id"/>
</operator>
<operator activated="true" class="loop_batches" expanded="true" height="60" name="BatchProcessing" width="90" x="45" y="210">
<parameter key="parallelize_batch_process" value="true"/>
<process expanded="true" height="280" width="145">
<operator activated="true" class="generate_function_set" expanded="true" height="76" name="CompleteFeatureGeneration" width="90" x="45" y="30">
<parameter key="use_plus" value="true"/>
</operator>
<operator activated="true" class="rename_by_constructions" expanded="true" height="76" name="Construction2Names" width="90" x="45" y="120"/>
<operator activated="true" class="optimize_selection" expanded="true" height="94" name="FS" width="90" x="45" y="210">
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="keep_best" value="3"/>
<parameter key="maximum_number_of_generations" value="1"/>
<parameter key="local_random_seed" value="-1"/>
<process expanded="true" height="100" width="145">
<operator activated="true" class="bootstrapping_validation" expanded="true" height="112" name="BootstrappingValidation (2)" width="90" x="45" y="30">
<parameter key="local_random_seed" value="-1"/>
<process expanded="true" height="100" width="30">
<operator activated="true" class="linear_regression" expanded="true" height="76" name="LinearRegression (3)" width="90" x="-70" y="30"/>
<connect from_port="training" to_op="LinearRegression (3)" to_port="training set"/>
<connect from_op="LinearRegression (3)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="100" width="195">
<operator activated="true" class="apply_model" expanded="true" height="76" name="ModelApplier (3)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_regression" expanded="true" height="76" name="RegressionPerformance" width="90" x="95" y="30">
<parameter key="main_criterion" value="squared_correlation"/>
<parameter key="root_mean_squared_error" value="true"/>
<parameter key="squared_correlation" value="true"/>
</operator>
<connect from_port="model" to_op="ModelApplier (3)" to_port="model"/>
<connect from_port="test set" to_op="ModelApplier (3)" to_port="unlabelled data"/>
<connect from_op="ModelApplier (3)" from_port="labelled data" to_op="RegressionPerformance" to_port="labelled data"/>
<connect from_op="RegressionPerformance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="BootstrappingValidation (2)" to_port="training"/>
<connect from_op="BootstrappingValidation (2)" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<connect from_port="exampleSet" to_op="CompleteFeatureGeneration" to_port="example set input"/>
<connect from_op="CompleteFeatureGeneration" from_port="example set output" to_op="Construction2Names" to_port="example set input"/>
<connect from_op="Construction2Names" from_port="example set output" to_op="FS" to_port="example set in"/>
<portSpacing port="source_exampleSet" spacing="0"/>
</process>
</operator>
<operator activated="true" class="write_database" expanded="true" height="60" name="DatabaseExampleSetWriter" width="90" x="95" y="120">
<parameter key="define_connection" value="url"/>
<parameter key="database_url" value="jdbc:mysql://localhost:3306/lai_gesamtzeitraum_aisa_baender"/>
<parameter key="username" value="root"/>
<parameter key="password" value="w3mYv/Z+Ew7XwAzejr7xJA=="/>
<parameter key="table_name" value="output"/>
</operator>
<connect from_op="CachedDatabaseExampleSource" from_port="output" to_op="BatchProcessing" to_port="example set"/>
<connect from_op="BatchProcessing" from_port="example set" to_op="DatabaseExampleSetWriter" to_port="input"/>
<connect from_op="DatabaseExampleSetWriter" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

land · February 2010

Hi Angela,
this is not a problem of the cached database source, but of the feature selection operator. I would advise switching to RapidMiner 5.0, since it includes new feature selection operators that limit the memory consumption to a very small overhead. In 4.x these operators were only available to paying customers.

Greetings,
Sebastian

Angela · February 2010

Hello Sebastian,

many thanks for your answer. I have already changed the RapidMiner version from 4.6 to 5 in order to use the selection function in combination with the CachedDatabaseExampleSource. I have also already increased the parameters of the MySQL database.
At first it seemed to work, but then the process was aborted.
I think I have about 300 different variables, and I want to create different functions from them. That means it will create 300x300 new variables, even if I use only one new function (sum, difference between the variables).
So I think RapidMiner has a limit in handling so many variables.
If you have another idea, I would be interested in it.

best regards

Angela

land · February 2010

Hi Angela,
RapidMiner copes with 90,000 attributes very well. In fact, we have several problems where that many occur. But databases are limited to around one thousand columns, so you cannot store this data set in a table! If you are using a CachedDatabaseExampleSource (or the Stream Database operator of RapidMiner 5), it will always store the data in the database and hence crash.
Have you already tried the YAGGA operators? They construct new attributes in a more directed fashion using genetic approaches.

Greetings,
Sebastian
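
To make the numbers in this thread concrete, here is a rough back-of-the-envelope sketch in Python. It is not RapidMiner code. The helper names are hypothetical, the limit of 1000 columns simply stands in for the "around one thousand columns" mentioned above, and it assumes that one binary function is applied to every pair of distinct attributes, which may not match exactly what the feature generation operator does.

```python
# Rough count of generated attributes versus an assumed database column limit.
# Assumptions: helper names are hypothetical; 1000 stands in for the
# "around one thousand columns" mentioned in the thread.

ASSUMED_MAX_COLUMNS = 1000


def generated_attribute_count(n_attributes, commutative):
    """New attributes from one binary function over pairs of distinct attributes."""
    ordered_pairs = n_attributes * (n_attributes - 1)
    # For a commutative function (a + b == b + a) each unordered pair counts once.
    return ordered_pairs // 2 if commutative else ordered_pairs


def fits_in_table(n_columns, max_columns=ASSUMED_MAX_COLUMNS):
    """True if a table with n_columns stays within the assumed column limit."""
    return n_columns <= max_columns


n = 300                                                   # original attributes
upper_bound = n * n                                       # the 300x300 estimate
sums = generated_attribute_count(n, commutative=True)     # e.g. sum
diffs = generated_attribute_count(n, commutative=False)   # e.g. difference

print("upper bound 300x300:", upper_bound)
print("new attributes from sums:", sums)
print("new attributes from differences:", diffs)
print("original table fits:", fits_in_table(n))
print("table with sums fits:", fits_in_table(n + sums))
```

Even under the smaller count for a single commutative function, the resulting table is far wider than the assumed column limit, which is consistent with the explanation above that writing the generated attributes back to the database is what fails.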
