Which outlier detection should I use?

LilC · July 2020

There are about 5000 rows of data.
Some of the stores' sales figures are outliers, either because they were entered incorrectly or for some other reason, but I don't know how many.

So before I run any other analysis (such as correlation), I need to run an outlier detection.
In other software or tools, I normally use a box/whisker plot to find the acceptable range of values and then apply a filter.
Since RM has outlier operators, I want to use one of them, but I am not sure which one I should use in this case.

It also feels like running a process with outlier detection takes a very long time. Is that right, or did I do something wrong?

jacobcybulski · July 2020

There are many outlier / anomaly detection tools built into RM. The box / whisker method is available in RM, where it is called Tukey Test (in the Operator Toolbox extension). However, the Tukey test is really applicable to a single attribute rather than to an example as a whole. An extension of this approach is Histogram Outlier Score (in the Anomaly Detection extension), which tests the distribution of each attribute to determine whether the attribute value can be considered an outlier across the attribute range. An outlier score is then calculated to combine the scores of all attributes. In fact, the Anomaly Detection extension has many useful operators, which are more efficient than the built-in ones. They also produce pre-processing models, which can be saved or applied to new data.

The anomaly detection approaches fall into two main categories: global and local. A global anomaly is an example far away from the "centre" of all examples. Local anomaly detection, by contrast, identifies groups of examples and then finds the examples that lie outside those groups, which could very well be hiding in between them (often right in the centre of all examples).

My favorite is k-NN Global Anomaly Score, which is lightning fast and which, as the name suggests, identifies the examples that are furthest away from their k neighbours. There are a number of k-NN based outlier detectors. Two very useful local outlier detection methods are the clustering-based and density-based approaches, which, as the names suggest, determine outliers in relation to clusters (out-of-cluster examples) and in the sparsely populated areas of the example space, respectively.

Finally, you should also consider a custom approach, where you look not for "absolute" outliers within your data set, but rather for outliers against the model you may be building. A classic way is to look for examples which generate large regression residuals.
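For readers who want to see the two simplest ideas outside of RM, here is a minimal sketch in Python with NumPy. It is not how the RapidMiner operators are implemented. The function names `tukey_outliers` and `knn_global_score` are made up for illustration, the fence multiplier 1.5 is the usual box-plot convention, and the score is assumed to be the mean distance to the k nearest neighbours, which is my reading of the operator's name rather than a confirmed description. The data is synthetic.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Flag values outside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def knn_global_score(X, k=5):
    """Mean Euclidean distance from each row to its k nearest other rows."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # full pairwise distance matrix
    np.fill_diagonal(dist, np.inf)            # a row is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Synthetic stand-in for two numeric sales attributes (assumption).
rng = np.random.default_rng(0)
sales = rng.normal(100.0, 10.0, size=(200, 2))
sales[0] = [400.0, 5.0]                       # planted data-entry error

flags = tukey_outliers(sales[:, 0])           # one attribute at a time
scores = knn_global_score(sales, k=5)         # whole examples at once
print(flags[0], int(np.argmax(scores)))
```

The Tukey check looks at one attribute at a time, while the k-NN score ranks whole examples. The pairwise distance array grows quadratically with the number of rows, so this sketch is meant for illustration on small samples only.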

jacobcybulski · July 2020

One more thing: if you find that outlier detection takes too long, I suggest the following. Split your large data set into two parts: a smaller one (e.g. 1000 examples), used to train the anomaly detection model and to identify the anomalies within that part, and a larger one (e.g. 4000 examples), to which you apply the pre-trained anomaly detection model to find the rest of the anomalies. As I mentioned before, for this you will need an outlier / anomaly detection operator which creates an anomaly model, e.g. k-NN Global Anomaly Score, which has an optional output for the anomaly model and an optional input that takes a pre-trained anomaly model. If you decide to go this way, you must ensure that all the pre-processing leading up to the training of the anomaly model is replicated exactly on the larger data set as well. By this I mean that the pre-processing models produced by operators such as Normalization or Nominal to Numerical must be captured (possibly saved) and applied when pre-processing your larger data set.
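The train-on-a-sample, apply-to-the-rest idea can be sketched in Python with NumPy as follows. This is only an illustration of the principle, not the RM operators themselves: `fit_anomaly_model` and `apply_anomaly_model` are hypothetical names, z-score normalisation stands in for whatever pre-processing you actually use, and the data is synthetic. The key point is that the normalisation parameters are learned once on the smaller part and reused unchanged on the larger part.

```python
import numpy as np

def fit_anomaly_model(train, k=5):
    """Learn normalisation parameters and keep the normalised training rows."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0                       # guard against constant attributes
    return {"mean": mean, "std": std, "ref": (train - mean) / std, "k": k}

def apply_anomaly_model(model, rows):
    """Score new rows by mean distance to their k nearest training rows."""
    # Reuse the training mean/std; do not re-fit them on the new rows.
    z = (np.asarray(rows, dtype=float) - model["mean"]) / model["std"]
    diff = z[:, None, :] - model["ref"][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = np.sort(dist, axis=1)[:, :model["k"]]
    return nearest.mean(axis=1)

# Synthetic data (assumption): 500 rows, 3 numeric attributes.
rng = np.random.default_rng(1)
data = rng.normal(50.0, 5.0, size=(500, 3))
data[450] = [500.0, 50.0, 50.0]               # planted anomaly in the larger part

train, rest = data[:100], data[100:]          # small training part, larger remainder
model = fit_anomaly_model(train)
rest_scores = apply_anomaly_model(model, rest)
print(int(np.argmax(rest_scores)) + 100)      # row index in the full data set
```

Scoring the training rows themselves with `apply_anomaly_model` would count each row as its own nearest neighbour at distance zero, so those rows would need a variant that skips self-matches.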

LilC · July 2020

Thanks, Jacob!! That was super helpful. Tukey Test is very easy to use and understand.

Srilatha · July 2021

I'm using the Clustering Based Multi Variate Outlier Detection model. When I apply the cluster model to the test data, it gives me the cluster label that each instance belongs to, based on the trained model. But I need an outlier score to determine whether an instance is an outlier or not. Please advise.

jacobcybulski · July 2021

The anomaly score comes from the anomaly / outlier operator, not from the clustering operator. You'd have to cluster the test set with the model constructed from the training data and then apply the anomaly model (again derived from the training data) to the clustered test set. Having said this, I personally have not had much luck applying cluster-based anomaly detection to a new data set (there seems to be a problem with the operator, which insists that the clustered data set be the same one that created the model). So if you are seeking similar behaviour, I suggest using local density-based anomaly detection, which will find anomalies in between dense groups of examples.
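To illustrate what a local density-based score does, here is a rough Python/NumPy sketch of the Local Outlier Factor (LOF). The thread does not name a specific operator, so treating LOF as the density-based method is my assumption, and the function name `local_outlier_factor` and the synthetic data are made up for the example. It is a sketch of the idea, not the RM implementation.

```python
import numpy as np

def local_outlier_factor(X, k=5):
    """LOF: ratio of the neighbours' local density to the row's own density."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                # exclude self-matches
    idx = np.argsort(dist, axis=1)[:, :k]         # k nearest neighbours per row
    rows = np.arange(n)[:, None]
    nn_dist = dist[rows, idx]                     # distances to those neighbours
    k_dist = nn_dist[:, -1]                       # distance to the k-th neighbour
    reach = np.maximum(nn_dist, k_dist[idx])      # reachability distances
    lrd = 1.0 / reach.mean(axis=1)                # local reachability density
    return lrd[idx].mean(axis=1) / lrd

# Two dense groups plus one example sitting in between them (synthetic).
rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 0.3, size=(60, 2))
group_b = rng.normal(10.0, 0.3, size=(60, 2))
between = np.array([[5.0, 5.0]])
points = np.vstack([group_a, group_b, between])

lof = local_outlier_factor(points, k=5)
print(int(np.argmax(lof)))                        # index of the most anomalous row
```

Scores close to 1 mean an example is about as dense as its neighbours, and much larger values mark examples in sparse regions. The sketch assumes there are no exact duplicate rows, since those would make a local density infinite.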
