K-means in RapidMiner

venkat · April 2014

Hello,

I am clustering my dataset using k-means in RapidMiner.

It would be very helpful if anybody could answer my questions below:

1) When I try to import the Excel file into the local repository, the following error occurs. The Excel file contains 60,000 records. My machine has 4 GB of RAM, and I have allocated 2 GB to RapidMiner.

Exception in thread "ProgressThread" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.String.<init>(String.java:416)
at java.lang.String.<init>(String.java:481)
at jxl.biff.StringHelper.getUnicodeString(StringHelper.java:176)
at jxl.read.biff.SSTRecord.readStrings(SSTRecord.java:189)
at jxl.read.biff.SSTRecord.<init>(SSTRecord.java:123)
at jxl.read.biff.WorkbookParser.parse(WorkbookParser.java:576)
at jxl.Workbook.getWorkbook(Workbook.java:237)
at com.rapidminer.operator.nio.model.ExcelResultSet.<init>(ExcelResultSet.java:113)
at com.rapidminer.operator.nio.model.ExcelResultSetConfiguration.makeDataResultSet(ExcelResultSetConfiguration.java:323)
at com.rapidminer.operator.nio.model.ExcelResultSetConfiguration.makePreviewTableModel(ExcelResultSetConfiguration.java:341)
at com.rapidminer.operator.nio.AnnotationDeclarationWizardStep$2.run(AnnotationDeclarationWizardStep.java:83)
at com.rapidminer.gui.tools.ProgressThread$2.run(ProgressThread.java:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

2) I am able to import 10,000 records into the local repository, but a similar problem occurs when running the k-means clustering process. The error output is below.

Apr 03, 2014 11:43:52 AM com.rapidminer.operator.nio.model.WizardState readNow
INFO: Reading example set...
Apr 03, 2014 11:49:13 AM com.rapidminer.gui.ProcessThread run
SEVERE: Process failed: Java heap space
java.lang.OutOfMemoryError: Java heap space
at Jama.Matrix.getArrayCopy(Matrix.java:214)
at Jama.SingularValueDecomposition.<init>(SingularValueDecomposition.java:54)
at Jama.Matrix.svd(Matrix.java:797)
at com.rapidminer.operator.features.transformation.SVDReduction.doWork(SVDReduction.java:138)
at com.rapidminer.operator.Operator.execute(Operator.java:867)
at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
at com.rapidminer.operator.Operator.execute(Operator.java:867)
at com.rapidminer.Process.run(Process.java:949)
at com.rapidminer.Process.run(Process.java:873)
at com.rapidminer.Process.run(Process.java:832)
at com.rapidminer.Process.run(Process.java:827)
at com.rapidminer.Process.run(Process.java:817)
at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)

Apr 03, 2014 11:49:13 AM com.rapidminer.gui.ProcessThread run
SEVERE: Here: Root[1] (Process)
subprocess 'Main Process'
+- Retrieve[1] (Retrieve)
+- Nominal to Numerical[1] (Nominal to Numerical)
+- KMeans[1] (k-Means)
==> +- SVDReduction[1] (Singular Value Decomposition)

Please find the XML view:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Root">
    <description>In many cases, no target attribute (label) can be defined and the data should be automatically grouped. This procedure is called &quot;Clustering&quot;. RapidMiner supports a wide range of clustering schemes which can be used in just the same way like any other learning scheme. This includes the combination with all preprocessing operators. In this experimen, the well-known Iris data set is loaded (the label is loaded, too, but it is only used for visualization and comparison and not for building the clusters itself). One of the most simple clustering schemes, namely KMeans, is then applied to this data set. Afterwards, a dimensionality reduction is performed in order to better support the visualization of the data set in two dimensions. Just perform the process and compare the clustering result with the original label (e.g. in the plot view of the example set). You can also visualize the cluster model itself. </description>
    <parameter key="logverbosity" value="warning"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Local Repository/data/cat23"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="5.3.013" expanded="true" height="94" name="Nominal to Numerical" width="90" x="212" y="30">
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.3.013" expanded="true" height="76" name="KMeans" width="90" x="380" y="30">
        <parameter key="k" value="3"/>
      </operator>
      <operator activated="true" class="singular_value_decomposition" compatibility="5.1.004" expanded="true" height="94" name="SVDReduction" width="90" x="514" y="165">
        <parameter key="dimensions" value="2"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="KMeans" to_port="example set"/>
      <connect from_op="KMeans" from_port="cluster model" to_port="result 1"/>
      <connect from_op="KMeans" from_port="clustered set" to_op="SVDReduction" to_port="example set input"/>
      <connect from_op="SVDReduction" from_port="example set output" to_port="result 2"/>
      <connect from_op="SVDReduction" from_port="original" to_port="result 3"/>
      <connect from_op="SVDReduction" from_port="preprocessing model" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="72"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Thank you in advance.
Venkat

Marco_Boeck · April 2014

Hi,

The available RAM does not seem to be enough in your case. You can check how much memory RapidMiner Studio is actually using by activating the "System Monitor" view from the "View" -> "Show Views" top menu. There is not much else I can say at this point, I'm afraid; some tasks require more memory than others.
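
As a cross-check outside RapidMiner, a minimal Java sketch like the one below prints the heap limit a JVM was actually granted. It only uses the standard `Runtime` API. The class name `HeapReport` is made up for illustration, and it reports on whichever JVM runs it, so it would have to be started with the same Java installation and the same `-Xmx` value you believe RapidMiner is using.

```java
// Hypothetical helper (not part of RapidMiner): reports the heap limits of the JVM running it.
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    // Converts a byte count to whole mebibytes (rounded down).
    static long toMiB(long bytes) {
        return bytes / MIB;
    }

    // Upper bound the JVM will try to use for the heap (roughly the -Xmx value).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Heap currently in use: heap reserved so far minus the free part of it.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.println("max heap  (MiB): " + toMiB(maxHeapBytes()));
        System.out.println("used heap (MiB): " + toMiB(usedHeapBytes()));
    }
}
```

Compile it with `javac HeapReport.java` and run it with, for example, `java -Xmx2g HeapReport`. If the reported maximum is far below the value you configured, the setting is probably not reaching the JVM; one possible cause is a 32-bit Java installation, which cannot address a multi-gigabyte heap.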

Regards,
Marco

venkat · April 2014

Hi,

I have increased the RAM to 8 GB and allocated 6 GB to RapidMiner, but the same error still occurs. My input dataset contains 65K records.

Apr 03, 2014 5:03:22 PM com.rapidminer.gui.ProcessThread run
SEVERE: Process failed: Java heap space
java.lang.OutOfMemoryError: Java heap space
at com.rapidminer.example.table.DoubleArrayDataRow.ensureNumberOfColumns(DoubleArrayDataRow.java:72)
at com.rapidminer.example.table.MemoryExampleTable.addAttributes(MemoryExampleTable.java:209)
at com.rapidminer.operator.preprocessing.filter.NominalToNumericModel.applyOnDataDummyCoding(NominalToNumericModel.java:250)
at com.rapidminer.operator.preprocessing.filter.NominalToNumericModel.applyOnData(NominalToNumericModel.java:196)
at com.rapidminer.operator.preprocessing.PreprocessingModel.apply(PreprocessingModel.java:95)
at com.rapidminer.operator.preprocessing.PreprocessingOperator.apply(PreprocessingOperator.java:130)
at com.rapidminer.operator.AbstractExampleSetProcessing.doWork(AbstractExampleSetProcessing.java:116)
at com.rapidminer.operator.Operator.execute(Operator.java:867)
at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
at com.rapidminer.operator.Operator.execute(Operator.java:867)
at com.rapidminer.Process.run(Process.java:949)
at com.rapidminer.Process.run(Process.java:873)
at com.rapidminer.Process.run(Process.java:832)
at com.rapidminer.Process.run(Process.java:827)
at com.rapidminer.Process.run(Process.java:817)
at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)

Apr 03, 2014 5:03:22 PM com.rapidminer.gui.ProcessThread run
SEVERE: Here: Root[1] (Process)
subprocess 'Main Process'
+- Retrieve sample[1] (Retrieve)
+- Normalize[1] (Normalize)
+- Set Role[1] (Set Role)
+- Sample[1] (Sample)
==> +- Nominal to Numerical[1] (Nominal to Numerical)
+- KMeans[0] (k-Means)
+- SVDReduction[0] (Singular Value Decomposition)

My process XML is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Root">
    <description>In many cases, no target attribute (label) can be defined and the data should be automatically grouped. This procedure is called &quot;Clustering&quot;. RapidMiner supports a wide range of clustering schemes which can be used in just the same way like any other learning scheme. This includes the combination with all preprocessing operators. In this experimen, the well-known Iris data set is loaded (the label is loaded, too, but it is only used for visualization and comparison and not for building the clusters itself). One of the most simple clustering schemes, namely KMeans, is then applied to this data set. Afterwards, a dimensionality reduction is performed in order to better support the visualization of the data set in two dimensions. Just perform the process and compare the clustering result with the original label (e.g. in the plot view of the example set). You can also visualize the cluster model itself. </description>
    <parameter key="logverbosity" value="warning"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve sample" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//Local Repository/data/sample"/>
      </operator>
      <operator activated="true" class="normalize" compatibility="5.3.013" expanded="true" height="94" name="Normalize" width="90" x="246" y="300">
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
        <parameter key="attribute_name" value="content"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="sample" compatibility="5.3.013" expanded="true" height="76" name="Sample" width="90" x="380" y="165">
        <parameter key="sample" value="relative"/>
        <parameter key="sample_size" value="-1"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="nominal_to_numerical" compatibility="5.3.013" expanded="true" height="94" name="Nominal to Numerical" width="90" x="447" y="300">
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.3.013" expanded="true" height="76" name="KMeans" width="90" x="581" y="30">
        <parameter key="k" value="3"/>
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
        <parameter key="use_local_random_seed" value="true"/>
      </operator>
      <operator activated="true" class="singular_value_decomposition" compatibility="5.1.004" expanded="true" height="94" name="SVDReduction" width="90" x="715" y="120">
        <parameter key="dimensions" value="2"/>
      </operator>
      <connect from_op="Retrieve sample" from_port="output" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="KMeans" to_port="example set"/>
      <connect from_op="KMeans" from_port="cluster model" to_port="result 1"/>
      <connect from_op="KMeans" from_port="clustered set" to_op="SVDReduction" to_port="example set input"/>
      <connect from_op="SVDReduction" from_port="example set output" to_port="result 2"/>
      <connect from_op="SVDReduction" from_port="original" to_port="result 3"/>
      <connect from_op="SVDReduction" from_port="preprocessing model" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="72"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Please guide me on this.

Thanks,
Venkat

fras · April 2014

I don't have an answer to your problem yet, but two things come to mind:
1) _After_ you raised the RAM, your process fails earlier, in the "Nominal to Numerical" step, which completed successfully in your first post.
2) Why are you normalizing your (numeric) data _before_ all attributes are numeric?
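
To put rough numbers on point 1: your second stack trace fails in `applyOnDataDummyCoding`, and dummy coding adds roughly one numeric column per distinct nominal value. The sketch below estimates the resulting table size, assuming a dense table of 8-byte doubles and ignoring all per-row object overhead. The class name and the distinct-value count are hypothetical; the real counts depend on your data and are not given in this thread.

```java
// Hypothetical estimate (not RapidMiner code): size of a dense double table after dummy coding.
final class DummyCodingEstimate {
    private static final long BYTES_PER_DOUBLE = 8L;

    // Columns after dummy coding: numeric columns stay as they are, and each nominal
    // column becomes one column per distinct value (assumption: no comparison groups).
    static long columnsAfterCoding(long numericColumns, long[] distinctValuesPerNominal) {
        long columns = numericColumns;
        for (long distinct : distinctValuesPerNominal) {
            columns += distinct;
        }
        return columns;
    }

    // Lower bound on memory: rows * columns * 8 bytes, ignoring object overhead.
    static long denseBytes(long rows, long columns) {
        return rows * columns * BYTES_PER_DOUBLE;
    }

    public static void main(String[] args) {
        long rows = 65_000L;          // record count mentioned in the thread
        long[] distinct = {65_000L};  // assumed: one text-like attribute whose values are all unique
        long columns = columnsAfterCoding(0L, distinct);
        System.out.println("columns: " + columns);
        System.out.println("bytes:   " + denseBytes(rows, columns)); // 33800000000
    }
}
```

Under that deliberately extreme assumption, the table alone would need about 33.8 GB, far more than a 6 GB heap, which would be consistent with more RAM not helping. If an attribute such as `content` (the one named in your Set Role operator) really has that many distinct values, keeping it out of Nominal to Numerical may matter more than adding memory.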

MariusHelf · April 2014

venkat, posting the same question in at least 3 threads does not help to improve your reputation. Please do not double post.

I have already posted another answer here: http://rapid-i.com/rapidforum/index.php/topic,7816.0.html

~Marius
