Working with large datasets

MuehliMan Member Posts: 85 Guru
edited November 2018 in Help
Dear all,

I am working with a really large dataset (>2.6 million examples, ~25 attributes, 1 polynominal ID).
After renaming some attributes and generating a new attribute from a basic mathematical calculation on another one, I wanted to apply a model to this large set to get predictions. Unfortunately, the process always crashes because it exceeds the memory limit. This happens even when I split the data into subsets of 1 million examples.

So my questions:

- Is there a smarter way to store this data (short array or some other option)?
- Would it be better to convert the ID into integer values?
- Interestingly, the workflow crashes when using Materialize Data and/or Free Memory.

Could you give me some tips for working with larger datasets?



  • keith Member Posts: 157 Guru
    The first questions I think of when I see memory-related issues in RM are: What operating system are you running RapidMiner on?  Is it 32-bit or 64-bit?  If 64-bit, is the Java version you are using also 64-bit?  And is your version of RM up to date?
  • MuehliMan Member Posts: 85 Guru
    I am running the newest version of RM on two systems:
    1) Windows XP 32-bit, quad core, 4 GB (nominal) RAM
    2) Linux OpenSUSE, newest version (Java), 2GB RAM

    The Windows system is able to process more of the dataset file than the Linux system, but it also crashes. Interestingly, I just want to apply a model to the large dataset, so it "just" needs to process each line.

    Another idea would have been to load the CSV in slices (looping over them) to go through the file in smaller pieces. But there is no row-range option in the Read CSV operator.
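The slicing idea can be sketched outside RapidMiner. Below is a minimal Python sketch, not a RapidMiner feature: `iter_csv_slices`, `score_file`, and the `predict` callback are hypothetical names introduced for illustration, and the sketch assumes the model can score one row at a time. Only one slice is held in memory at any moment.

```python
import csv
from itertools import islice
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def iter_csv_slices(path: str, slice_size: int) -> Iterator[List[Row]]:
    """Yield the rows of a CSV file in consecutive slices of at most slice_size rows."""
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls only the next slice_size rows from the reader
            batch = list(islice(reader, slice_size))
            if not batch:
                break
            yield batch


def score_file(in_path: str, out_path: str,
               predict: Callable[[Row], float],
               slice_size: int = 100_000) -> int:
    """Score a CSV slice by slice, writing each scored row to out_path.

    predict is a hypothetical stand-in for applying the trained model to one row.
    Returns the number of rows written.
    """
    written = 0
    with open(out_path, "w", newline="") as out:
        writer = None
        for batch in iter_csv_slices(in_path, slice_size):
            for row in batch:
                scored = dict(row, prediction=predict(row))
                if writer is None:
                    # Create the writer lazily so the header matches the input columns
                    writer = csv.DictWriter(out, fieldnames=list(scored))
                    writer.writeheader()
                writer.writerow(scored)
                written += 1
    return written
```

Because results are written out as each slice is processed, peak memory depends on `slice_size` rather than on the total number of examples.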

  • keith Member Posts: 157 Guru

    32-bit XP can only address about 3 GB of memory, so the 4th GB is wasted.  The operating system itself takes some memory overhead, plus memory for RapidMiner and any other apps you have running, before your data is loaded.  You probably only have around 2 GB of effective memory.  You can check the system monitor in RM to see what the "Max" addressable memory is.  If the system monitor shows you topping out near the max memory usage while your process is running, that may be your problem.  The extra cores on the processor won't help with the memory problem.

    Linux OpenSUSE comes in both 32-bit and 64-bit flavors, so it's not clear which you have, though I'm guessing 32-bit.  2 GB of RAM is addressable by both 32-bit and 64-bit Linux, but with OS overhead, you're probably dealing with 1.0-1.5 GB effectively available, even less than on Windows.  Again, checking the system monitor inside RM should give you an idea of when you're memory constrained.

    The polynominal ID is already represented internally as an integer, I believe, so recoding it to be an int won't save much space.
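To illustrate what "represented internally as an integer" can mean, here is a minimal Python sketch of dictionary encoding. This is a generic technique; whether RapidMiner stores nominal values exactly this way is an assumption based on the comment above, and `dictionary_encode` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[int], List[str]]:
    """Replace each distinct string with a small integer code.

    Returns the per-row codes and the label table that maps a code back to its string.
    """
    index_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index_of:
            # First time this label is seen: assign the next free code
            index_of[value] = len(index_of)
        codes.append(index_of[value])
    labels = list(index_of)  # dicts preserve insertion order, so labels[code] == value
    return codes, labels


codes, labels = dictionary_encode(["a", "b", "a", "c"])
```

In this scheme each row already holds only an integer code. Note that for an ID column where every value is distinct, the label table has one entry per example.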

    Materializing the data can blow up memory consumption, because it takes what was the base data with a virtual view of transformed data and instantiates it as real transformed data in memory.  This is especially true if your process keeps around the original example set and the transformed one.  Does the process get farther if you remove the Materialize Data/Free Memory operators?
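A rough back-of-the-envelope estimate shows why a second copy matters at this scale. The sketch below assumes each value is stored as an 8-byte double; that figure is an assumption, since the real per-value cost depends on RapidMiner's data management setting and on JVM overhead. `estimate_dense_size_gib` is a hypothetical helper.

```python
def estimate_dense_size_gib(n_examples: int, n_attributes: int,
                            bytes_per_value: int = 8) -> float:
    """Estimate the size of a dense table in GiB, ignoring all per-object overhead."""
    return n_examples * n_attributes * bytes_per_value / 1024 ** 3


# Figures from the question: about 2.6 million examples and 25 attributes
one_copy = estimate_dense_size_gib(2_600_000, 25)
# Keeping the original and a materialized copy at the same time
two_copies = 2 * one_copy

print(f"one copy:   {one_copy:.2f} GiB")
print(f"two copies: {two_copies:.2f} GiB")
```

Under these assumptions a single copy is about 0.48 GiB and two copies about 0.97 GiB, before counting the model, the nominal mappings, or RapidMiner itself. That is already close to the 1.0-1.5 GB estimated as available on the Linux machine.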

    One idea to try, at least for debugging purposes, is to save the example set to a file after the transform, then write a new process that reads in the already transformed data to apply the model.  That separates any extra memory used by the ETL step from the memory consumed by the model prediction itself.

    If this is typical of the data sizes you'll be working with, I'd highly recommend getting a 64-bit operating system (Windows or Linux) and having at least 4 GB of memory, preferably 8 GB or more.
