Standard Data Sets - memory issue

svendeswan Member Posts: 8 Contributor II
edited November 2018 in Help

I am trying to run some standard data sets like 20 newsgroups or Reuters-21578, but unfortunately I run into memory problems. The Reuters set could be used for nearest neighbour but nothing else; the 20 newsgroups set didn't run at all... Maybe I am doing something wrong?!
I use RapidMiner 4.5.

Do you have some hints for me?



  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Sven,
    I guess you are using the TextPlugin, correct? Did you switch to sparse ExampleSet storage? It would help a lot if you could paste the process below.
    And another thing: how much memory does your RapidMiner use? Please take a look at the memory monitor and tell us...

  • svendeswan Member Posts: 8 Contributor II
    Dear Sebastian,

    thank you for the answer. I did not quite understand how to set up the process for the sparse storage.
    The memory is about 1.9 GB...


    And here is the process:

    <?xml version="1.0" encoding="MacRoman"?>
    <process version="4.5">

      <operator name="Root" class="Process" expanded="yes">
          <description text="#ylt#h3#ygt#Learning and storing a text classifier#ylt#/h3#ygt##ylt#p#ygt#This experiment shows how to learn and store a model on a set of texts.#ylt#/p#ygt##ylt#p#ygt#Most important to notice here is that the list of words used for learning must be stored if the model is to be applied to new texts.  This will ensure that new texts will be represented exactly in the same way as the texts used during training. #ylt#/p#ygt#"/>
          <parameter key="logverbosity" value="error"/>
          <parameter key="random_seed" value="2001"/>
          <parameter key="send_mail" value="never"/>
          <parameter key="process_duration_for_mail" value="30"/>
          <parameter key="encoding" value="SYSTEM"/>
          <operator name="TextInput" class="TextInput" expanded="yes">
              <list key="texts">
                <parameter key="acq" value="/20news-bydate/20news-bydate-train/alt.atheism"/>
                <parameter key="alum" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="bop" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="carcass" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="cocoa" value="/20news-bydate/20news-bydate-train/comp.sys.mac.hardware"/>
                <parameter key="coffee" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="copper" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="cotton" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="cpi" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="cpu" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="crude" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="dlr" value="/20news-bydate/20news-bydate-train/sci.crypt"/>
                <parameter key="dmk" value="/20news-bydate/20news-bydate-train/sci.electronics"/>
                <parameter key="earn" value="/20news-bydate/20news-bydate-train/sci.crypt"/>
                <parameter key="fuel" value="/20news-bydate/20news-bydate-train/sci.electronics"/>
                <parameter key="gas" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="gnp" value="/20news-bydate/20news-bydate-train/"/>
                <parameter key="gold" value="/20news-bydate/20news-bydate-train/soc.religion.christian"/>
                <parameter key="grain" value="/20news-bydate/20news-bydate-train/talk.politics.guns"/>
                <parameter key="heat" value="/20news-bydate/20news-bydate-train/talk.politics.mideast"/>
                <parameter key="housing" value="/20news-bydate/20news-bydate-train/talk.politics.misc"/>
                <parameter key="income" value="/20news-bydate/20news-bydate-train/talk.religion.misc"/>
              </list>
              <parameter key="default_content_type" value=""/>
              <parameter key="default_content_encoding" value=""/>
              <parameter key="default_content_language" value=""/>
              <parameter key="prune_below" value="-1"/>
              <parameter key="prune_above" value="-1"/>
              <parameter key="vector_creation" value="TFIDF"/>
              <parameter key="use_content_attributes" value="false"/>
              <parameter key="use_given_word_list" value="false"/>
              <parameter key="return_word_list" value="false"/>
              <parameter key="output_word_list" value="/RapidMinerWordProject/Traningsdaten/wordvectorList.txt"/>
              <parameter key="id_attribute_type" value="number"/>
              <list key="namespaces">
              </list>
              <parameter key="create_text_visualizer" value="false"/>
              <parameter key="on_the_fly_pruning" value="-1"/>
              <parameter key="extend_exampleset" value="false"/>
              <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
              </operator>
              <operator name="StringTokenizer" class="StringTokenizer">
              </operator>
          </operator>
          <operator name="NearestNeighbors" class="NearestNeighbors">
              <parameter key="keep_example_set" value="false"/>
              <parameter key="k" value="1"/>
              <parameter key="weighted_vote" value="false"/>
              <parameter key="measure_types" value="MixedMeasures"/>
              <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
              <parameter key="nominal_measure" value="NominalDistance"/>
              <parameter key="numerical_measure" value="EuclideanDistance"/>
              <parameter key="divergence" value="GeneralizedIDivergence"/>
              <parameter key="kernel_type" value="radial"/>
              <parameter key="kernel_gamma" value="1.0"/>
              <parameter key="kernel_sigma1" value="1.0"/>
              <parameter key="kernel_sigma2" value="0.0"/>
              <parameter key="kernel_sigma3" value="2.0"/>
              <parameter key="kernel_degree" value="3.0"/>
              <parameter key="kernel_shift" value="1.0"/>
              <parameter key="kernel_a" value="1.0"/>
              <parameter key="kernel_b" value="0.0"/>
          </operator>
          <operator name="ModelWriter" class="ModelWriter">
              <parameter key="model_file" value="/RapidMinerWordProject/NearestNeighbor.mod"/>
              <parameter key="overwrite_existing_file" value="true"/>
              <parameter key="output_type" value="Binary"/>
          </operator>
      </operator>
    </process>

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Sven,
    the TextInput operator always creates a sparse example set if you don't switch on extend_exampleset. In that case it would depend on the input example set.
    I have downloaded the data set and will try it myself. But I think I already know what the problem is: unlike the data set, KNN does not store the data in a sparse format. That causes the memory consumption to explode. Just think of a matrix of 45000x10000 entries at 4 bytes each to get an impression of how much data would have to be stored. Nearest Neighbors isn't a good idea on text data at all, especially with this many examples, and it becomes completely worthless if you don't switch the distance measure to cosine similarity.
    SVMs or NaiveBayes should cope with this amount of data much better and will give better performance anyway.
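Sebastian's back-of-the-envelope figure can be checked with a short sketch. The matrix dimensions come from his post; the average number of non-zero terms per document is an illustrative assumption, not a measurement of the data set:

```python
# Memory estimate for a dense vs. sparse document-term matrix.
# Dimensions follow the 45000x10000 figure from the post above;
# nonzero_per_doc is an assumed average, not measured from the data.
examples = 45_000        # documents
attributes = 10_000      # word features
bytes_per_value = 4      # e.g. a 32-bit float per TFIDF entry

dense_bytes = examples * attributes * bytes_per_value
print(f"dense:  {dense_bytes / 1024**3:.1f} GiB")   # dense:  1.7 GiB

# A sparse row stores only the non-zero entries (value + column index);
# text vectors are overwhelmingly zeros, so this is far smaller.
nonzero_per_doc = 100
sparse_bytes = examples * nonzero_per_doc * (bytes_per_value + 4)
print(f"sparse: {sparse_bytes / 1024**2:.1f} MiB")  # sparse: 34.3 MiB
```

The gap between the two numbers is exactly why a learner that materializes a dense copy of a sparse example set runs out of memory.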

  • svendeswan Member Posts: 8 Contributor II
    Dear Sebastian,

    thank you for your hints. Maybe I will wait for your results :-) I am trying to build some kind of performance matrix for this data set (and the Reuters one too) using different learners and preprocessing. In my experience kNN has worked well for big data sets in the past, but I never tried it with RapidMiner. Maybe you could paste the process then, so I have the chance to build my matrix by simply exchanging the learner operators :-)

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Sven,
    it just finished loading the data. The results are somewhat overwhelming: around 46,000 examples with 120,000 attributes. Stored in a standard, non-sparse array, this would consume around 36 GB of RAM. The standard kNN will not work on this. Never. Even if it stored the data in a sparse array, it would have to scan every one of the training examples to classify ONE new example, and each time it would have to compute the distance over all these attributes...
    So you should simply replace the kNN in the operator tree of your process with the LibSVM or the NaiveBayes operator. This should work then...
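To see why NaiveBayes scales where kNN does not: it only keeps per-class term totals, so its memory grows with classes × vocabulary, independently of the number of training documents, and classification needs no pass over the training set. A minimal multinomial Naive Bayes over sparse term counts, with a hypothetical toy corpus standing in for the newsgroup data:

```python
from collections import defaultdict
import math

def train(docs):
    """docs: list of (label, {term: count}) sparse vectors.
    The model is just per-class document and term totals --
    nothing proportional to the number of training examples."""
    class_docs = defaultdict(int)
    term_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for label, terms in docs:
        class_docs[label] += 1
        for t, c in terms.items():
            term_counts[label][t] += c
            vocab.add(t)
    return class_docs, term_counts, vocab

def classify(model, terms):
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, float("-inf")
    for label in class_docs:
        total_terms = sum(term_counts[label].values())
        # log prior + log likelihood with Laplace smoothing
        score = math.log(class_docs[label] / total_docs)
        for t, c in terms.items():
            p = (term_counts[label][t] + 1) / (total_terms + len(vocab))
            score += c * math.log(p)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical toy corpus (labels echo two newsgroup topics).
docs = [("crypt", {"key": 3, "cipher": 2}),
        ("crypt", {"key": 1, "rsa": 2}),
        ("mac", {"apple": 2, "disk": 1}),
        ("mac", {"apple": 1, "monitor": 2})]
model = train(docs)
print(classify(model, {"key": 2, "cipher": 1}))  # -> crypt
```

The same sparse (label, {term: count}) representation also feeds a linear SVM well, which is why both are the natural replacements for kNN here.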
