Very large dataset file size despite only 2000 examples

Ray_C Member Posts: 6 Learner I
edited November 2020 in Help
Hi, I am a newbie so apologies in advance if I'm missing something obvious.

I am working on a binary classifier for a large synthetic credit card fraud dataset, which I have split and sampled into training and testing datasets, both with balanced classes of 1,000 examples each. However, something seems to be going wrong along the way. The full dataset of 6.3M examples occupies 538MB, yet my training and test datasets are taking up 95.3MB when they should be only a tiny fraction of that size. They also behave like 100MB files, taking ages to open, and the training dataset caused Auto Model to crash. Can somebody tell me where I am going wrong, please? TIA, Ray.
 

Answers

  • Ray_C Member Posts: 6 Learner I
    Process XML attached, if that helps.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    Can you use a "Remove Unused Values" operator right before the Store? That could do it.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
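
For readers following along outside RapidMiner, the effect Martin is targeting can be sketched in pandas (an analogue only; the column names are loosely based on PaySim and the sizes are illustrative): a nominal column keeps its full value dictionary even after rows are filtered away, and dropping the unused entries is what actually shrinks the stored object.

    import pandas as pd

    # Sketch only: a categorical column whose dictionary lists every
    # value from the full dataset, loosely modelled on PaySim account IDs.
    full = pd.DataFrame({
        "nameOrig": pd.Categorical([f"C{i}" for i in range(100_000)]),
        "isFraud": [i % 2 for i in range(100_000)],
    })

    sample = full.sample(n=2_000, random_state=42)

    # The sample still carries the full 100,000-entry value dictionary...
    print(len(sample["nameOrig"].cat.categories))   # 100000

    # ...until the unused entries are dropped, which is roughly what
    # "Remove Unused Values" does before the Store.
    sample["nameOrig"] = sample["nameOrig"].cat.remove_unused_categories()
    print(len(sample["nameOrig"].cat.categories))   # 2000
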
  • Ray_C Member Posts: 6 Learner I
    Martin, many thanks for the suggestion.

    I have added that operator before the stores, but it appears just to have moved the bottleneck from Auto Model's non-handling of the pseudo-100MB dataset back to the data prep process itself. As I type, one of the "Remove Unused Values" operators has been in progress for the past 5 minutes, and I suspect it may not complete at all.

    I think maybe I could do with carrying out some research on the handling of very large datasets.

    Just in: RM has crashed with an OOM exception. I've got a Core i7 with 16GB RAM, so I need to change the methodology for sure.
  • kayman Member Posts: 662 Unicorn
    What if you move the Remove Unused Values operator further upstream? I typically use it right after a filter or, as in your case, a split operator.
    Try adding one on both split outputs, as the 'hidden' information will otherwise travel through. Also try ticking the 'include special attributes' option: given you use a role, the remove operation might have limited impact otherwise, since ID-like attributes with a role count as special and would be skipped. A sketch of this follows below.
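
As a rough sketch of kayman's advice in the same pandas analogue (the frame `balanced` below is a hypothetical stand-in for the balanced 2,000-example set): trim the value dictionaries on both outputs of the split, ID-style columns included, before storing.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical stand-in for the balanced 2,000-example dataset.
    balanced = pd.DataFrame({
        "nameOrig": pd.Categorical([f"C{i}" for i in range(2_000)]),
        "isFraud": [i % 2 for i in range(2_000)],
    })

    def trim_nominals(df: pd.DataFrame) -> pd.DataFrame:
        """Drop unused dictionary entries from every categorical column,
        special/ID attributes included."""
        out = df.copy()
        for col in out.select_dtypes("category").columns:
            out[col] = out[col].cat.remove_unused_categories()
        return out

    # Clean up *both* outputs of the 70:30 split before storing them.
    train, test = [trim_nominals(part) for part in
                   train_test_split(balanced, test_size=0.3, random_state=42)]
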
  • Ray_C Member Posts: 6 Learner I
    Thanks Kayman, I have moved the operator as suggested. The process now completes (albeit laboriously), but the resulting test and training datasets are still far too large at 31.2MB and 68.1MB respectively (sizes which, incidentally, mirror the 70:30 ratio of the Split Data operator).

    To be honest, I am not sure what's going on, specifically what the Remove Unused Values operator is doing or is supposed to do. This is a synthetic dataset with no missing values, so what would it be removing after the split?

    Also, I find it unusual that there is no out-of-the-box answer for this issue (no disrespect intended). Many, many people must have worked on this dataset before (kaggle.com/ntnu-testimon/paysim1), and many of them will surely have split the data into test and training sets using the Split Data operator within RM.

    Yet I can't find any references online to anybody else experiencing this kind of issue. I am not trying to achieve anything complex; I am barely off first base, with the only operation so far being the assignment of a label, which is required in order to obtain a balanced dataset. I just don't understand why the split datasets do not appear to be amenable to the sampling process.

    I keep asking myself whether there is something fundamentally wrong with my approach, but the responses to date (much appreciated) do not suggest that there is.
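
To Ray's question above: what the operator removes is not missing values but dictionary entries. Every subset produced by Split Data or Sample keeps the parent set's full list of nominal values, including the millions belonging to rows that are gone, and that dictionary is what the stored file size reflects. The same pandas sketch makes the byte accounting visible (hypothetical sizes):

    import pandas as pd

    # One high-cardinality nominal column dominates the size of any subset,
    # because its value dictionary travels with the subset intact.
    full = pd.DataFrame({
        "nameOrig": pd.Categorical([f"C{i}" for i in range(1_000_000)]),
        "amount": range(1_000_000),
    })

    subset = full.sample(n=2_000, random_state=1)
    print(subset.memory_usage(deep=True))   # nameOrig: tens of MB for 2,000 rows

    subset["nameOrig"] = subset["nameOrig"].cat.remove_unused_categories()
    print(subset.memory_usage(deep=True))   # nameOrig: now a few hundred KB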

  • Ray_C Member Posts: 6 Learner I
    Hi Martin, thanks for taking the time to set me right. You were dead right: there were two features of polynominal type with millions of unique values in there (rendering them useless as predictors for binary classification anyway). So I stripped these two out of the dataset as a first step; the succeeding processes completed quickly, resulting in training and test dataset sizes of 117KB each. Thanks again to yourself and Kayman.
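
For anyone who hits the same wall, the fix generalises: scan for near-unique nominal columns before any split or sample. A pandas sketch (the file name and the 100,000 threshold are assumptions for illustration, not from the thread):

    import pandas as pd

    df = pd.read_csv("paysim.csv")   # assumed local copy of the PaySim data

    # Flag nominal columns with a huge number of unique values; in PaySim
    # these are likely the account IDs (nameOrig / nameDest).
    id_like = [col for col in df.select_dtypes("object").columns
               if df[col].nunique() > 100_000]
    print(id_like)

    # Useless as predictors and enormous to store, so drop them first.
    df = df.drop(columns=id_like)
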
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Another operator you may want to check out is Replace Rare Values.
    It is helpful when you have nominal attributes with many different unique values, some of which might occur frequently enough to be useful, but most of which occur infrequently and are thus not useful. It lets you keep the most frequent values and remap all the others into a generic "Other" category, much more easily than the normal Map operator (which would require you to list them all out individually).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
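
A rough pandas equivalent of what Replace Rare Values automates (hypothetical data; the real operator works directly on nominal attributes inside a RapidMiner process):

    import pandas as pd

    def replace_rare(series: pd.Series, top_k: int = 20,
                     other: str = "Other") -> pd.Series:
        """Keep the top_k most frequent values; fold everything else
        into a generic 'Other' bucket."""
        keep = series.value_counts().nlargest(top_k).index
        return series.where(series.isin(keep), other)

    merchants = pd.Series(["M1"] * 50 + ["M2"] * 30 + ["M3", "M4"])
    print(replace_rare(merchants, top_k=2).value_counts())
    # M1: 50, M2: 30, Other: 2
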
  • Ray_C Member Posts: 6 Learner I
    @Telcontar120, belated thanks for the tip; yes, that sounds like a very useful operator indeed. Cheers.
