Heap Space / PAREN Plugin

skylar_sutton Member Posts: 3 Contributor I
edited November 2018 in Help
I'm trying to use the PAREN plugin to analyze a data model that has about 5,000 records with 25 attributes. I'm running Vista x64 and the latest version of RapidMiner / PAREN.

I keep getting a Heap Space / Out Of Memory error when I try to auto-analyze it with the LibSVM model. I'm baffled as to why though, as I've edited the startup script to include MAX_JAVA_MEMORY=2048 (also tried MAX_JAVA_MEMORY=3072).

The data model file is only 1 MB on disk... how in the world is this ballooning into an OOM error? Any thoughts on how to optimize the memory config?

Thanks!

Answers

  • skylar_sutton Member Posts: 3 Contributor I
    Bump. Any thoughts?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,645 RM Founder
    Hi,

    Well, the size on disk is usually far smaller than the size needed in memory. In addition to the normal "blow-up" from loading the data into data structures, you also get things like meta data, attribute objects, descriptions, etc., which add to the memory usage. And probably more important here: I don't know how the SVM is used within PAREN, but it could be that standard preprocessing steps, like a transformation from nominal to binominal and from there to numerical, are applied, which can literally make your data set explode. Don't get me wrong, this preprocessing is in general a good idea, but not for every data set, especially not for those containing nominal attributes with many different values.
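
    As a rough illustration of that nominal-to-binominal blow-up, the sketch below counts the columns produced when every distinct nominal value becomes its own numeric column. The attribute names and value counts are hypothetical, and this is only an assumption about the general technique, not a description of what PAREN or RapidMiner does internally.

```python
def dummy_coded_width(distinct_values_per_attribute):
    """Count numeric columns if each distinct nominal value gets its own column."""
    return sum(distinct_values_per_attribute.values())


# Hypothetical data set: 24 nominal attributes with 10 values each,
# plus one ID-like attribute with a distinct value for every example.
n_examples = 5000
attributes = {f"attr_{i}": 10 for i in range(24)}
attributes["serial_number"] = n_examples

columns = dummy_coded_width(attributes)
cells = columns * n_examples
mib_as_doubles = cells * 8 / 1024**2  # assumes 8 bytes per cell, no overhead

print(f"{columns} columns, {cells} cells, about {mib_as_doubles:.0f} MiB")
```

    Under these made-up numbers a single high-cardinality attribute accounts for almost all of the expanded width, which is why dropping such attributes is usually the first thing to try.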

    And last but not least: depending on the SVM parameter settings used within PAREN, a complete kernel matrix might be pre-calculated, which results in a quadratic number of matrix entries, in your case 5000 * 5000 entries. On top of the expanded data set, this might be too much.
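
    To put a number on the quadratic growth, here is a minimal sketch of the memory a fully pre-calculated kernel matrix would need. It assumes a dense n-by-n matrix with 8 bytes per entry (a Java double) and ignores all object overhead; whether PAREN actually stores the matrix this way is not known here.

```python
def kernel_matrix_bytes(n_examples, bytes_per_entry=8):
    """Bytes for a dense n x n kernel matrix, ignoring any object overhead."""
    return n_examples * n_examples * bytes_per_entry


entries = 5000 * 5000
as_doubles = kernel_matrix_bytes(5000)                    # 8-byte entries
as_floats = kernel_matrix_bytes(5000, bytes_per_entry=4)  # 4-byte entries

print(f"{entries} entries")
print(f"doubles: {as_doubles / 1024**2:.1f} MiB")
print(f"floats:  {as_floats / 1024**2:.1f} MiB")
```

    Under these assumptions the matrix alone stays well below a 2048 MB heap, so it would only become a problem in combination with other memory consumers such as an expanded data set.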

    In short: 5000 examples are certainly not too many, but depending on the properties of the data, it might be too much for a fully automatic approach like PAREN's. Taking a sample or applying user-defined preprocessing (like getting rid of nominal attributes with too many values) might help here in order to "support" PAREN in its work...

    Cheers,
    Ingo
  • skylar_sutton Member Posts: 3 Contributor I
    Thanks for the reply.

    I understand that disk space != heap space, but all 25 attributes are reals, so each one should only take up 8 bytes (assuming a real maps directly to a Java double), making each row a 200-byte object. As a former developer, maybe I'm making too many assumptions about the object model used to store a row?
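
    Written out, that back-of-the-envelope estimate looks like the sketch below. It assumes every value is stored as a bare 8-byte double with no per-row or per-attribute overhead, which is the idealized case rather than what any particular tool does.

```python
BYTES_PER_DOUBLE = 8  # size of a Java double

n_attributes = 25
n_examples = 5000

row_bytes = n_attributes * BYTES_PER_DOUBLE
raw_bytes = row_bytes * n_examples

print(f"{row_bytes} bytes per row")
print(f"{raw_bytes} bytes total, about {raw_bytes / 1024**2:.2f} MiB")
```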

    Thanks for the preprocessing suggestions. I don't have any nominal attributes; they're all reals resulting from scientific measurements. Any thoughts on how to preprocess that? (Sorry for the rookie questions, I'm new to this.) If it helps you understand my model, each example represents a snapshot of scientific instrument readings at a given moment in time (g-forces, magnetometer readings, etc.) and contains a nominal label that acts as a boolean true/false to tell me if everything was operational at the moment those readings were taken. I'm trying to build a formula that I can apply to new data which weeds out some of the non-events and alerts me to records that need some manual investigation.
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,645 RM Founder
    Hi,

    ...only take up 8 bytes (assuming a real maps directly to a Java double), making each row a 200 byte object.
    Yeah, in a perfect world maybe ;)
    Just kidding, but there is indeed much more overhead introduced by Java's internal data handling, and probably even more introduced by us to support fast calculations, data access, meta data propagation, statistics on the fly, and more. As you surely know, it's always a trade-off between performance, memory usage, and maintainability...

    I don't have any nominal attributes, they're all reals resulting from scientific measurements - any thoughts on how to preprocess that?
    Well, that should already be the "good" case, where hardly any preprocessing should be necessary at all. Are you sure that those numbers are indeed handled as numbers by RapidMiner? I mean, does the meta data show that those attributes are numerical (real, integer)? If the data was accidentally stored as nominal, the PAREN extension will transform it and memory usage will blow up.

    Can you learn an SVM model without PAREN?

    Cheers,
    Ingo