Help interpreting outliers/anomalies when using Isolation Forest operator

kdafoe Member Posts: 20 Maven
edited January 2022 in Help
Hi. I'm really liking the Isolation Forest operator in the Anomaly Detection extension. With Trees = 100, Leaf Size = 2, and average path as the score calculation, I get a result where the first 5 outliers match exactly those from an R script using the Mahalanobis distance function. That is great for comparisons. But is there a calculation or rule of thumb that you suggest for the Trees parameter, or for the cutoff score? Using my R script comparison I can easily match the 5 lowest scores. Score-wise, is there a point or a calculation where outliers/anomalies end and the rest are not outliers? Thanks for any help.

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    Solution Accepted
    Great to hear that we produce the same output as R. I am the author of the operator and I have only compared it to sklearn.

    I think there is generally no real way to find the right parameters or cutoff for the anomaly_score. If you have a list of known anomalies, you may be able to calculate recall and precision on that set. But that's rather rare.
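
    If you do have such a list, the check could look roughly like the sketch below. It is plain Python, not something the operator provides. The row ids, scores, and cutoffs are made up for illustration, and it assumes that lower scores mean more anomalous (as with average path).

```python
def precision_recall_at_cutoff(scores, known_anomalies, cutoff):
    """Flag every row whose score is at or below `cutoff` and compare
    the flagged rows with a list of known anomaly ids."""
    flagged = {row_id for row_id, score in scores.items() if score <= cutoff}
    known = set(known_anomalies)
    true_pos = len(flagged & known)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(known) if known else 0.0
    return precision, recall

# Hypothetical scores keyed by row id (lower = more anomalous).
scores = {"a": 2.1, "b": 2.4, "c": 5.8, "d": 6.0, "e": 6.3, "f": 3.0}
known_anomalies = ["a", "b", "f"]

# Try a few candidate cutoffs and see how precision and recall trade off.
for cutoff in (2.5, 3.5, 6.0):
    p, r = precision_recall_at_cutoff(scores, known_anomalies, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f}, recall={r:.2f}")
```

    A cutoff that keeps both numbers high on the labeled set is a reasonable candidate, but it only tells you about the anomalies you already know.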

    For trees: I would suspect that more is better, but at some point the score should converge and more trees only cause more computation time.
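
    One way to see the convergence is to score the same data twice with different random seeds and watch how much the two runs disagree as the tree count grows. The sketch below is a deliberately simplified isolation forest in plain Python, not the operator's implementation: it uses no sub-sampling and no correction term for leaves, and the function names and toy data are assumptions for illustration only.

```python
import random

def build_tree(rows, leaf_size, rng):
    """Grow one isolation tree: random feature, random split value."""
    if len(rows) <= leaf_size:
        return None  # leaf
    feature = rng.randrange(len(rows[0]))
    values = [r[feature] for r in rows]
    split = rng.uniform(min(values), max(values))
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    if not left or not right:
        return None  # split separated nothing; treat as a leaf
    return (feature, split,
            build_tree(left, leaf_size, rng),
            build_tree(right, leaf_size, rng))

def path_length(tree, row):
    """Number of splits a row passes before it reaches a leaf."""
    depth = 0
    while tree is not None:
        feature, split, left, right = tree
        tree = left if row[feature] < split else right
        depth += 1
    return depth

def average_path(rows, n_trees, leaf_size=2, seed=0):
    """Average path length per row (lower = easier to isolate)."""
    rng = random.Random(seed)
    trees = [build_tree(rows, leaf_size, rng) for _ in range(n_trees)]
    return [sum(path_length(t, r) for t in trees) / n_trees for r in rows]

# Toy data: 50 points around the origin plus one obvious outlier (index 50).
data_rng = random.Random(42)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(50)]
rows.append((8.0, 8.0))

for n_trees in (10, 100, 500):
    run_a = average_path(rows, n_trees, seed=1)
    run_b = average_path(rows, n_trees, seed=2)
    spread = max(abs(x - y) for x, y in zip(run_a, run_b))
    print(f"trees={n_trees}: max difference between two runs = {spread:.3f}")
```

    Once the difference between runs is small compared to the gaps between your top scores, adding more trees mostly adds computation time.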

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany


  • kdafoe Member Posts: 20 Maven
    Thanks Martin. With my little sampling and testing I've found that tree count is less important than leaf size. Varying between 100 and 10,000 trees did little to change my top anomalies, but changing the leaf size from 1 to 2 consistently narrowed the top (meaning those with the lowest scores) to a match with my R script. Leaf size in a decision tree is easy enough to understand, and you can see the result of playing with it in the visualization, but I don't understand what leaf size does in an Isolation Forest, where the goal is finding anomalies and not distinctions (or impurities) in the decision-making process. Can you shed some light on this? Thanks again.