Help interpreting outliers/anomalies when using Isolation Forest operator
Hi. I'm really liking the Isolation Forest operator under the Anomaly Detection Extension. With Trees = 100, Leaf Size = 2, and average path as the score calculation, I get a result where the first 5 outliers match exactly with an R script using the Mahalanobis Distance function. That is great for comparisons. But is there a calculation or rule of thumb that you suggest for the Trees parameter, or for the cutoff score? Using my R script comparison I can easily match the 5 lowest scores. Score-wise, is there a point or a calculation that marks where the outliers/anomalies end and the remaining rows are not outliers? Thanks for any help.
MartinLiebig
Hi,

Great to hear that we produce the same output as R. I am the author of the operator, and I had only compared it to sklearn.

I think there is generally no real way to find the right parameters or the right cutoff for the anomaly_score. If you have a list of known anomalies, you may be able to calculate recall and precision on that set, but that's rather rare.

For trees: I would suspect that more is better, but at some point the score should converge, and beyond that more trees only cause more computation time.

Best,
Martin
- Head of Data Science Services at RapidMiner -
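
If you do have a list of known anomalies, the recall and precision check could look roughly like the sketch below. It is plain Python, not RapidMiner code: the function name `precision_recall_at_cutoff`, the row ids, and the scores are all made up for illustration. It assumes that lower scores mean more anomalous, as with the 5 lowest scores in the question.

```python
def precision_recall_at_cutoff(scores, known_anomalies, cutoff):
    """Flag every row whose score is <= cutoff and compare with known anomalies.

    scores: dict mapping row id -> anomaly score (assumption: lower = more
            anomalous, as with the average-path score in the question).
    known_anomalies: set of row ids confirmed to be anomalies.
    """
    flagged = {row for row, s in scores.items() if s <= cutoff}
    true_pos = len(flagged & known_anomalies)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(known_anomalies) if known_anomalies else 0.0
    return precision, recall


# Made-up scores for illustration only; not output from any real run.
scores = {"a": 2.1, "b": 2.4, "c": 3.0, "d": 5.8, "e": 6.1, "f": 6.3}
known_anomalies = {"a", "b", "d"}

# Try a few candidate cutoffs and see how the trade-off moves.
for cutoff in (2.5, 3.5, 6.0):
    p, r = precision_recall_at_cutoff(scores, known_anomalies, cutoff)
    print(f"cutoff={cutoff}: precision={p:.2f}, recall={r:.2f}")
```

Sweeping the cutoff like this only shows the trade-off on the labeled rows. Which point on that trade-off is acceptable still depends on the use case.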