"FP-Growth fails with 4GB of memory"

RMSchwartz Member Posts: 2 Contributor I
edited May 2019 in Help

I can't get FP-growth to complete. I have allocated 4GB of memory to MAX_JAVA_MEMORY and that amount shows up in the system monitor within RapidMiner. I've put a small sample in the chain so that it has only about 150 cases to deal with. Nonetheless, it fails to execute to the end, exhausting 4GB of memory.

I'd welcome some assistance.

Thanks,

Bob Schwartz



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <description>Reads collections of text from a set of directories, assigning each directory to a class (as specified by parameter text_directories), and transforms them into a TF-IDF or other word vector. Finally, an SVM is applied to model the input texts.</description>
    <process expanded="true" height="476" width="547">
      <operator activated="true" class="retrieve" compatibility="5.1.006" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//NewLocalRepository/BPP Fishing/July25a"/>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.1.001" expanded="true" height="76" name="WordList to Data" width="90" x="112" y="120"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="3"/>
        <parameter key="prune_above_absolute" value="99"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="5.1.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="45" y="345">
        <parameter key="min" value="0.05"/>
        <parameter key="max" value="5.0"/>
      </operator>
      <operator activated="true" class="sample" compatibility="5.1.006" expanded="true" height="76" name="Sample" width="90" x="179" y="345">
        <parameter key="sample" value="probability"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
      </operator>
      <operator activated="true" class="fp_growth" compatibility="5.1.006" expanded="true" height="76" name="FP-Growth" width="90" x="246" y="255">
        <parameter key="min_number_of_itemsets" value="10"/>
        <parameter key="min_support" value="0.1"/>
      </operator>
      <operator activated="true" class="create_association_rules" compatibility="5.1.006" expanded="true" height="76" name="Create Association Rules" width="90" x="380" y="210"/>
      <connect from_op="Retrieve" from_port="output" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="item sets" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • dan_agape Member Posts: 106 Maven
    Hi,

    Your sampled dataset does not have many rows, but it very likely has a great many attributes, as is typical of data obtained from text preprocessing. The data source was not available, so I could not run the process, but your problem is most likely due to the following:

    In the Numerical to Binominal operator you set min=0.05 and max=5. Why? You should have set min=0 and max=0.

    With your settings of min and max, when that operator is executed, the words that occur in a document (each word is an attribute in the dataset) are assigned the value false in its word vector, and all the words that do not occur in the document (imagine how many!) are assigned the value true. This gives the FP-Growth operator an enormous amount of work: it has to combine all the words that were assigned true in order to find the frequent itemsets, so 4GB, or even far more, would not be enough.

    With min=0 and max=0, all the words not in the document are assigned false and all the words in the document are assigned true, and you have a chance of getting your results, provided you do some more preprocessing, such as filtering stopwords. Stopwords again increase the number of combinations exponentially when the frequent itemsets are computed, since each document can contain many of them and they recur across most, if not all, of the documents.
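
    The effect of the two settings can be sketched in a few lines of Python. This is only an illustration, not RapidMiner code: it assumes the operator maps values inside [min, max] to false and everything else to true, and the corpus (150 documents, a 1000-word vocabulary, 20 distinct words per document) is hypothetical. The bound printed at the end counts the subsets of a single row's items; how many of them are actually frequent depends on min_support.

```python
# Assumed rule for Numerical to Binominal: a value inside [min, max]
# becomes False, any other value becomes True.
def numerical_to_binominal(value, min_value, max_value):
    return not (min_value <= value <= max_value)


def true_items_per_row(rows, min_value, max_value):
    """Count, for each row, how many attributes end up True (i.e. become items)."""
    return [
        sum(numerical_to_binominal(v, min_value, max_value) for v in row)
        for row in rows
    ]


def itemset_upper_bound(k):
    """Number of non-empty itemsets that can be formed from k items."""
    return 2 ** k - 1


# Hypothetical corpus: 150 documents, 1000-word vocabulary,
# 20 distinct words per document, binary term occurrences (0.0 or 1.0).
DOCS, VOCAB, WORDS_PER_DOC = 150, 1000, 20
rows = []
for d in range(DOCS):
    present = {(d * 7 + j) % VOCAB for j in range(WORDS_PER_DOC)}
    rows.append([1.0 if w in present else 0.0 for w in range(VOCAB)])

inverted = true_items_per_row(rows, 0.05, 5.0)  # settings from the question
intended = true_items_per_row(rows, 0.0, 0.0)   # settings suggested above

print("true items per row with min=0.05, max=5:", inverted[0])
print("true items per row with min=0, max=0:", intended[0])
print("itemsets from one row, min=0/max=0:", itemset_upper_bound(intended[0]))
print("decimal digits in that bound, min=0.05/max=5:",
      len(str(itemset_upper_bound(inverted[0]))))
```

    With the question's settings every absent word becomes an item, so each row carries almost the whole vocabulary, and the number of candidate combinations grows accordingly.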

    Dan
  • haddock Member Posts: 849 Maven
    Hi Bob,

    It may be that your process will finish if you disable the Create Association Rules operator; the reasons for this are set out here:

    http://rapid-i.com/rapidforum/index.php/topic,3619.msg13530.html#msg13530

    Just a thought, good luck!
  • RMSchwartz Member Posts: 2 Contributor I
    Many thanks, Haddock and Dan. I'll reset Numerical to Binominal and dig into the link.

    Best,

    Bob
  • Kajan81 Member Posts: 1 Contributor I
    Hi,

    I have a dataset with 3 columns (Transaction ID, Product Description, Value) and approx. 1 million rows.

    I am trying to apply FP-Growth and Create Association Rules, but this keeps failing due to memory at the "Numerical to Binominal" stage of my process. I have allocated 56GB of RAM.

    "This process would need more than the maximum amount of available memory. You can either leave......"

    Am I doing something wrong here? I would have thought 56GB of RAM would be more than enough to cope with this.

    Any help will be much appreciated

    Thanks.
  • David_A Administrator, Moderator, Employee, RMResearcher, Member Posts: 297 RM Research
    It sounds as if you are using an older version of RapidMiner. With version 6.5, RapidMiner's license model changed, and it no longer imposes any memory constraints. The process below runs on my machine, with 10 GB of RAM allocated, in under 3 seconds:
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="7.0.001" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34">
            <parameter key="number_examples" value="1000000"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="numerical_to_binominal" compatibility="7.0.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">
            <parameter key="min" value="-10.0"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Numerical to Binominal" to_port="example set input"/>
          <connect from_op="Numerical to Binominal" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
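
    As a rough sanity check on why a table of this shape should fit comfortably in memory, here is a back-of-the-envelope estimate in Python. It assumes 8 bytes per numeric value and ignores whatever per-row and per-attribute overhead RapidMiner adds internally, so treat the figure as a lower bound rather than a measurement.

```python
# Raw size of the generated table, assuming 8-byte doubles and no overhead.
ROWS = 1_000_000          # number_examples in the process above
NUMERIC_ATTRIBUTES = 2    # number_of_attributes in the process above
BYTES_PER_DOUBLE = 8      # assumption: values are stored as 64-bit doubles

raw_bytes = ROWS * NUMERIC_ATTRIBUTES * BYTES_PER_DOUBLE
raw_megabytes = raw_bytes / 1_000_000

print(f"raw numeric data: {raw_bytes} bytes (about {raw_megabytes:.0f} MB)")
```

    Even with generous overhead, this is far below the allocations mentioned in this thread, which suggests that the number of rows alone is unlikely to be what exhausts the memory.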