[SOLVED] Memory consumption in RapidMiner

Mandar Member Posts: 8 Contributor II
edited November 2018 in Help
Hello Everyone,

I have been using RapidMiner to build models on 500K records, each with 1,000 attributes. I use the Read Database operator and then perform variable reduction with the 'Remove Useless Attributes' operator. I also have to perform additional operations to determine the importance of the attributes with respect to the target, but I have been running into memory constraints.

I have observed that after each operation RM reserves more memory, and it eventually fails with an error asking me to allocate more space to the software. I have a powerful machine with 16GB RAM and have allocated 13GB to RM, but since it does not free the memory I keep hitting the same issue. I have tried the 'Free Memory' operator, but it only frees memory held by unused objects, so it does not help in my case. Is there a way to tell RM explicitly to free memory after an operation has completed? I would appreciate any help in this matter.

Thanks,
Mandar

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Mandar,

    what do you mean exactly by operation? Does RapidMiner need more and more memory if you run the same process several times in a row? If that is the case, it is surely a bug, which we believe we have fixed for the next release.
    Or does the process need more and more memory as you add more operators to it?

    Best regards,
    Marius
  • Mandar Member Posts: 8 Contributor II
    Hi Marius,

    By operation I mean adding operators to the process. I am using the "Remove Useless Attributes" operator after "Read Database" to reduce attributes, and I apply "Weight by Chi Square" to determine the relevance of the attributes to the target. I have been watching the System Monitor: after each operator RM stacks up memory, and while executing the "Weight by Chi Square" operator it occupies more than 11GB and then gives me an error message saying the process requires more memory.

    Is there a way I can explicitly release copies of data I don't need in RM?
    Or is there another approach for the variable reduction phase on large data sets (~2GB in size) in RM, other than the one I am using?

    Thanks!

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Maybe you can split your process into several smaller ones: the first one determines the interesting attributes, and the second one retrieves only those attributes from the database. That keeps the memory footprint small right from the beginning.
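    Outside of RapidMiner, the two-step idea can be sketched in a few lines of Python (the table and column names here are made up for illustration): step one stores the names of the selected attributes, and step two queries only those columns, so the full 1,000-attribute table is never loaded.

    ```python
    # Step 1 (selection process): suppose the attribute weighting
    # left us with a short list of interesting attributes.
    selected = ["age", "income", "visits"]  # hypothetical attribute names

    # Step 2 (modeling process): retrieve only the surviving columns
    # instead of the whole table.
    query = "SELECT {} FROM customers".format(", ".join(selected))
    print(query)  # SELECT age, income, visits FROM customers
    ```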

    Best regards,
    Marius
  • Mandar Member Posts: 8 Contributor II
    Thank you Marius.
    This looks interesting. Can you suggest an operator which can determine the interesting attributes without bringing the entire database into memory? Another issue is that if I close my current process and open a new one, RM still holds the memory from the previous process. Is there a way to tell RM to release memory explicitly?

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Concerning your first question, there is no such operator. But you could load parts of your data and perform the analysis on that data.

    You have two options. The easier and more common one is to draw a random sample from the database, e.g. only 100,000 rows, and run the attribute selection on that sample. Using only part of such a large data set is usually sufficient and common practice.
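    As a rough illustration of the sampling route outside RapidMiner, here is a Python sketch. The data is random and the chi-square statistic is written out by hand for a single binary attribute against a binary label; a Weight by Chi Square operator would do the equivalent for every attribute at once.

    ```python
    import random

    def chi_square_weight(xs, ys):
        """Chi-square statistic for a binary attribute vs. a binary label."""
        n = len(xs)
        # Observed counts for the 2x2 contingency table.
        obs = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
        for x, y in zip(xs, ys):
            obs[(x, y)] += 1
        # Marginal totals.
        row = {a: obs[(a, 0)] + obs[(a, 1)] for a in (0, 1)}
        col = {b: obs[(0, b)] + obs[(1, b)] for b in (0, 1)}
        stat = 0.0
        for a in (0, 1):
            for b in (0, 1):
                expected = row[a] * col[b] / n
                if expected > 0:
                    stat += (obs[(a, b)] - expected) ** 2 / expected
        return stat

    # Draw a 100,000-row random sample instead of weighting all 500K rows.
    random.seed(0)
    rows = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(500_000)]
    sample = random.sample(rows, 100_000)
    xs, ys = zip(*sample)
    weight = chi_square_weight(xs, ys)
    ```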

    If you want to use all rows, you have to process your attributes iteratively. Load e.g. the first 100 attributes, run one of the Weight by XXX operators (be sure to switch off relative/normalized weights), and record the weights. Repeat this for all attributes.
    Then, when continuing the analysis, load only the x attributes with the highest weights.
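    The chunked weighting loop could look roughly like this in Python; the column loader and the weighting function are stand-ins for a partial Read Database and a Weight by XXX operator, and the toy data is invented for the example.

    ```python
    def top_attributes_in_chunks(names, load_columns, label, weigh, chunk=100, keep=10):
        """Weigh attributes in blocks so only `chunk` columns are in memory at once."""
        weights = {}
        for i in range(0, len(names), chunk):
            block = names[i:i + chunk]
            columns = load_columns(block)   # stand-in for loading only these columns
            for name in block:
                weights[name] = weigh(columns[name], label)
        # For the rest of the analysis, load only the `keep` best attributes.
        return sorted(weights, key=weights.get, reverse=True)[:keep]

    # Toy data: even-numbered attributes agree with the label, odd ones never do.
    names = ["att%d" % i for i in range(250)]
    data = {n: [(i + j) % 2 for j in range(8)] for i, n in enumerate(names)}
    label = [j % 2 for j in range(8)]
    agree = lambda col, y: sum(a == b for a, b in zip(col, y)) / len(y)

    best = top_attributes_in_chunks(
        names, lambda block: {n: data[n] for n in block}, label, agree,
        chunk=100, keep=5)
    # best holds the five highest-scoring attributes: att0, att2, att4, att6, att8
    ```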



    The second problem, RapidMiner not releasing the memory, has two reasons. The first is that RapidMiner releases memory only when it needs to, e.g. because it loads new data. To force freeing all releasable memory you can use the Free Memory operator.
    But the current release also has a bug that prevents RapidMiner from freeing some of the memory, even though in theory it would be releasable. That bug has already been fixed and will be included in the next release.

    Best regards,
    Marius
  • Mandar Member Posts: 8 Contributor II
    Thank you Marius for your suggestions!  :)
    I have figured out a way around the memory issue and it works fine for me at the moment.

    I look forward to the next release of RapidMiner where the issue of memory release has been resolved.
  • seshadotcom Member Posts: 33 Contributor II
    Hello Mandar,

    Could you please share the workaround you used for the memory issue? I face a similar issue in my process: I am trying to build association rules, which I connect to the output port of FP-Growth. FP-Growth itself does not have any problem; it is the association rule generation that takes time and crashes. I even used the Free Memory operator before Create Association Rules.

    Regards
    Sesha
  • Mandar Member Posts: 8 Contributor II
    Hi Sesha,

    I observed that each data pre-processing operator in RapidMiner generates a copy of the data for itself. This was the reason it was accumulating a lot of memory even though I had allocated 13GB of RAM on my machine. I decided to remove such operators and instead use the Weight by Chi Square operator to get a list of variables relevant to the target; I then chose the top 100 variables. This eliminated the data pre-processing step, and I was able to run my process to completion. If I understand your problem correctly, Create Association Rules takes up memory because it has to go through the entire data set to find associations among the different variables. I would suggest the following options based on my experience:

    1. Try to minimize the use of operators which generate copies of the data.
    2. If you are not using such operators, the only remaining option is to reduce the sample size. This can be done with the Loop Batches operator or by specifying an SQL query when reading from the database.
    3. Try using the Remember and Recall operators (as input to Create Association Rules) and enable the 'remove from store' option in the Recall operator.
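    The Loop Batches idea in option 2, i.e. never holding more than one batch of rows in memory, can be sketched in plain Python. The fetch function here is a stand-in for a database read with an OFFSET/LIMIT clause, and the "database" is just a list of numbers.

    ```python
    def loop_batches(fetch_batch, process, batch_size=50_000):
        """Fetch and process rows batch by batch; only one batch is alive at a time."""
        offset = 0
        while True:
            batch = fetch_batch(offset, batch_size)
            if not batch:
                break
            process(batch)
            offset += batch_size

    # Toy stand-in for the database: 120,000 "rows".
    rows = list(range(120_000))
    fetch = lambda offset, limit: rows[offset:offset + limit]

    seen = []
    loop_batches(fetch, lambda batch: seen.append(len(batch)), batch_size=50_000)
    # Three batches are processed: 50,000 + 50,000 + 20,000 rows.
    ```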

    I too have included the Free Memory operator, but it isn't very helpful, as it only frees memory held by unused objects. The selection of unused objects is done internally by the software, so there is not much it can do.

    Hope this helps.

    Regards
    Mandar