
HBOS memory issue in RapidMiner Studio 9.3.001

MaartenK Member Posts: 11 Contributor I
Hi all,

There was a previous thread about this issue, but that did not solve my problem.
I have a dataset with 13 features. I use HBOS from the anomaly detection extension, version 2.4.001.
If I sample my dataset down to 100 items and then apply HBOS, Studio still runs out of memory. It uses up to the maximum of 30 GB and then stops with an error after several minutes. Studio seems to spend more time on garbage collection than on the actual algorithm.
Any helpful suggestions are welcome.

Kind regards,

Maarten

Answers

  • MaartenK Member Posts: 11 Contributor I
    I did some more experimenting and I believe this is a bug in the HBOS component of the anomaly detection extension. If I add a fresh HBOS operator to the process, it runs once. After I apply any changes to the model, the behaviour described above occurs.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,203  RM Data Scientist
    Hi @MaartenK,
    I think there was a known issue with Date-Time attributes. Can you please check whether your data set contains dates?
    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • MaartenK Member Posts: 11 Contributor I
    Hi mschmitz, the dataset did contain date fields. However, I replaced them with numerical fields using DateToNumerical. The dataset now contains 1 label (polynominal), 3 integers, 2 polynominals and 5 reals. It contains 100 items.

    It seems something in the dataset is triggering a problem. Also, if I place a Select Attributes operator before HBOS and select 1 attribute, the HBOS operator still shows all 10 attributes when using the 'single' option. And if I remove attributes using the selector in HBOS and apply the changes, it once again uses all 10 attributes.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,203  RM Data Scientist
    Hi,
    can you please try adding a 'Materialize Data' operator right in front of HBOS? That may work.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • MaartenK Member Posts: 11 Contributor I
    Thanks for the swift response. This did not solve my issue. Meanwhile, I have asked for permission to share the dataset with you. It is an educational dataset. Once permission is granted, I can share the model and dataset with you for reproduction.
  • MaartenK Member Posts: 11 Contributor I
    I took some more steps to reproduce the issue. It seems the problem may be triggered by missing values in the dataset. Please find attached 2 models and 2 datasets. The sample with 100 items that contains missing values triggers the memory issue. The sample with 100 items that contains no missing values is processed in a split second.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 911   Unicorn
    Hi @MaartenK,

    Sorry, I'm just here to confirm that the bug is caused by the missing values, and that the only workaround is what you have already done: impute the missing values.

    Indeed, after some reflection, I think that this kind of algorithm (outlier detection) cannot natively handle missing values,
    so the best strategy here (and in general) is to impute the missing values, as you have done.
    Of course, the tricky part is finding the best algorithm or method (mean, median, etc.) to impute them.

    This is just my humble reflection on the subject. I'm not an expert in outlier/anomaly detection, and I will be happy if someone
    can add some thoughts and/or correct me if I'm wrong.

    Regards,

    Lionel
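
To illustrate the imputation workaround discussed above, here is a minimal sketch of median imputation in plain Python. It is not RapidMiner code and not part of the extension: the function names `impute_median` and `is_missing` are hypothetical, and the sketch assumes a small numeric table stored as a list of rows in which missing values appear as `None` or NaN.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_median(rows):
    """Return a copy of rows with missing values replaced by the column median."""
    result = [list(row) for row in rows]
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if not is_missing(row[col])]
        if not observed:
            raise ValueError(f"column {col} has no observed values to impute from")
        fill = median(observed)
        for row in result:
            if is_missing(row[col]):
                row[col] = fill
    return result


data = [[1.0, 10.0], [2.0, float("nan")], [None, 30.0], [4.0, 20.0]]
print(impute_median(data))
```

The median is used here only because it is less sensitive to the very outliers the detector is supposed to find; mean imputation or a model-based method would follow the same pattern. Inside RapidMiner Studio the equivalent step would be an imputation operator placed before HBOS.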
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,203  RM Data Scientist
    This is a clear bug. The extension is not from RM; it is open source. I tried for about 30 minutes to get it to build with Gradle and did not succeed, so it is tough for us to add the check for missing values.
    Is there any Java guru here who can help? Maybe @rfuentealba?
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
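
As a rough illustration of the "check for missings" mentioned above, the following sketch shows a simplified histogram-based outlier score with an explicit precondition that rejects missing values up front. This is not the extension's Java code: `check_no_missing` and `hbos_scores` are hypothetical names, and the scoring is a bare-bones static-bin-width variant written only to show where such a guard could sit.

```python
import math


def check_no_missing(rows):
    """Fail fast with a clear message instead of scoring data that has gaps."""
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None or (isinstance(value, float) and math.isnan(value)):
                raise ValueError(f"missing value at row {i}, column {j}")


def hbos_scores(rows, n_bins=5):
    """Simplified histogram-based outlier score with equal-width bins."""
    check_no_missing(rows)  # the precondition discussed in the thread
    scores = [0.0] * len(rows)
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
        bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)
        for i, b in enumerate(bins):
            # Rare bins (low count relative to the tallest bin) add a larger score.
            scores[i] += math.log(tallest / counts[b])
    return scores


print(hbos_scores([[1.0], [1.1], [1.2], [9.0]]))
```

With the guard in place, a dataset containing a missing value raises a `ValueError` that names the offending cell, rather than reaching the binning step at all.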
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 442   Unicorn
    Hello,

    My educated guess is that the condition for handling null values is not missing, but rather that the null check does not close where it should.

    Just to make sure, is the code on GitHub? I can take a look later today.

    All the best,

    Rod.
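
One general way a value check can exist and still let missing values through is worth sketching here, with the caveat that it is only an illustration and not a diagnosis of the extension's code. It assumes missing numeric values are represented as NaN. Every ordered comparison with NaN evaluates to false, so a guard written as "reject if out of range" silently accepts NaN. The function names below are hypothetical.

```python
import math


def in_range_naive(value, lo, hi):
    # Looks safe, but both comparisons are False for NaN, so NaN is accepted.
    return not (value < lo or value > hi)


def in_range_checked(value, lo, hi):
    # Test for NaN explicitly before relying on ordered comparisons.
    return not math.isnan(value) and lo <= value <= hi


nan = float("nan")
print(in_range_naive(nan, 0.0, 1.0))    # NaN slips through the naive guard
print(in_range_checked(nan, 0.0, 1.0))  # the explicit check rejects it
```

The same rule holds for Java's `double`, so an explicit `Double.isNaN` test would be the analogous guard there.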
  • MaartenK Member Posts: 11 Contributor I
    Thanks for the support. The source code seems to be here: https://github.com/Markus-Go/rapidminer-anomalydetection
    build.xml does indeed mention 2.4.001 as the version.
    Of course, I cannot be sure that this is the source the current extension was built from.
  • MaartenK Member Posts: 11 Contributor I
    In the meantime I have emailed Markus Goldstein. He let me know that he currently has a student working on new operators and will have him look into the HBOS preconditions afterwards.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,656  Community Manager
    @MaartenK if you could please connect me with Markus I would appreciate it!