Outlier detection taking lot of time and not giving any results

shubham_samant · May 2019

I am using the Auto Model feature in RapidMiner (educational license). My processor is a 6-core i7-8750H with 16 GB of RAM. The dataset has more than 30,000 rows and 12 columns, with both numeric and text data. I ran it with the best features (as detected by Auto Model) and the distance-based outlier method. The model has run for more than 18 hours and is still processing.
I have also installed the outlier detection extension, but it didn't help.
How can I solve this issue? Is it because the educational version uses only 1 core?
Kindly help.

hughesfleming68 · May 2019

There is no easy solution except to experiment with a smaller dataset. Before jumping in with 30,000 rows, try 10% of your data. At least you will be able to evaluate your results and get a sense of your computing time. This is not an uncommon situation, but if you need to solve problems of this size regularly, you will have to rethink how you do it. Not all processes scale well just by adding cores; assuming they do is a major misconception.
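As a rough illustration of the "try a fraction first" advice, here is a minimal Python sketch that times a brute-force distance-based outlier score on growing subsamples. It is not RapidMiner's implementation; the function names `knn_distance_scores` and `time_on_sample`, the synthetic data, and the choice of k are all assumptions made for the example. A brute-force score compares every row with every other row, so doubling the row count roughly quadruples the work, which is why timing a small sample first is informative.

```python
import math
import random
import time


def knn_distance_scores(rows, k=5):
    """Score each row by the distance to its k-th nearest neighbour (brute force)."""
    scores = []
    for i, a in enumerate(rows):
        # Distances from row i to every other row: O(n) per row, O(n^2) overall.
        dists = sorted(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    return scores


def time_on_sample(rows, fraction, k=5, seed=0):
    """Time the scoring on a random fraction of the rows; returns (n, seconds)."""
    rng = random.Random(seed)
    n = max(k + 1, int(len(rows) * fraction))
    sample = rng.sample(rows, n)
    start = time.perf_counter()
    knn_distance_scores(sample, k)
    return n, time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic stand-in data: 2000 rows, 12 numeric columns (an assumption).
    rng = random.Random(42)
    data = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(2000)]
    for fraction in (0.05, 0.1, 0.2):
        n, seconds = time_on_sample(data, fraction)
        print(f"{n:5d} rows: {seconds:.3f} s")
```

If the measured time grows about fourfold each time the sample doubles, you can extrapolate to the full data set before committing to an 18-hour run.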

jacobcybulski · May 2019

I have found that the methods in the anomaly detection extension are more efficient than those built into RM itself. The extension operators also create an outlier detection model, which lets you take a representative sample of your data, build the model on it, and then apply the model to the remaining data to identify all other outliers. It is the model building that takes so much time; applying the model is very fast. The same approach works when outlier detection used while training a predictive model needs to be deployed: you identify and eliminate anomalous examples in new data in exactly the same way as in your training data set.
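The sample-then-apply idea can be sketched outside RapidMiner in a few lines of Python. This is only an analogy, not the extension's algorithm: `KnnOutlierModel` is a hypothetical class that "trains" by storing a reference sample and scores any row by its distance to the k-th nearest reference point. The data, sample size, and k are assumptions chosen for illustration.

```python
import math
import random


class KnnOutlierModel:
    """Hypothetical stand-in for a trained outlier model: keeps a reference sample."""

    def __init__(self, k=5):
        self.k = k
        self.reference = []

    def fit(self, sample):
        # "Training" here is just remembering the representative sample.
        self.reference = [list(row) for row in sample]
        return self

    def score(self, rows):
        """Distance from each row to its k-th nearest reference point."""
        scores = []
        for row in rows:
            dists = sorted(math.dist(row, ref) for ref in self.reference)
            scores.append(dists[self.k - 1])
        return scores


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
    data.append([8.0, 8.0, 8.0, 8.0])  # one planted anomaly, for illustration
    sample = rng.sample(data[:-1], 200)  # fit on a 10% sample only
    model = KnnOutlierModel(k=5).fit(sample)
    scores = model.score(data)  # apply to everything, including unseen rows
    worst = max(range(len(data)), key=scores.__getitem__)
    print("highest-scoring row index:", worst)
```

Scoring costs one pass over the reference sample per row, so the expensive part is bounded by the sample size rather than by the full data set.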

shubham_samant · May 2019

Our data set has more than 100,000 records. I reduced the sample size to 30,000; if I reduce it further, to say 3,000, the sample is too small to be representative for model training.

I have tried running the full data set in Python with 2-3 different algorithms, and it gives me results successfully. When I run outlier detection models in Python, I do not get out-of-memory issues, but RapidMiner runs out of memory even with a relatively smaller data set. Why is Auto Model outlier detection failing on relatively mid-size data sets?

jacobcybulski · May 2019

I have just created a similar kind of a model using 130,000 wine reviews from Kaggle. See this process (also attached as RMP):

Note that it is all in the preparation of the data: make sure you have no missing values, prefer numeric values, and normalize your attributes (as I did). I split the data into two parts: a smaller data set of 40K examples used to create the anomaly model, and a larger data set of the remaining 90K examples, in which anomalies are found by applying the pre-trained anomaly model. To visualize the anomalies, I created a PCA model, marked examples with a high anomaly score as outliers, and then plotted both; see below (40K first and 90K second):

Image: https://us.v-cdn.net/6030995/uploads/editor/cu/e9jau68mmgcd.png

Image: https://us.v-cdn.net/6030995/uploads/editor/as/rtdzmooowtvt.png

It all took less than 5 minutes on my old laptop with 16 GB of RAM.

If it does not work for you, try upgrading your RM to version 9.2+, as it fixes some bugs that you may still have in an older version of RM.
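For readers who want to reproduce the preparation steps outside RapidMiner, here is a minimal Python sketch of the same idea: learn the pre-processing on the training part, reuse it unchanged on the rest, and pick an anomaly-score cutoff from the training scores. It is not a translation of the attached process. The function names, the choice of mean imputation and z-scoring, and the 1% default cutoff are all assumptions.

```python
import statistics


def fit_preprocessing(train_rows):
    """Learn per-column mean and standard deviation, ignoring missing values (None)."""
    params = []
    for col in zip(*train_rows):
        present = [v for v in col if v is not None]
        mean = statistics.fmean(present)
        std = statistics.pstdev(present) or 1.0  # guard against constant columns
        params.append((mean, std))
    return params


def apply_preprocessing(rows, params):
    """Replace missing values with the training mean, then z-score each column."""
    return [
        [((mean if v is None else v) - mean) / std for v, (mean, std) in zip(row, params)]
        for row in rows
    ]


def threshold_from_scores(train_scores, fraction=0.01):
    """Score at or above which roughly the top `fraction` of training rows fall."""
    ranked = sorted(train_scores)
    cut = min(len(ranked) - 1, int(len(ranked) * (1 - fraction)))
    return ranked[cut]


if __name__ == "__main__":
    train = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
    new = [[None, 20.0], [4.0, 40.0]]
    params = fit_preprocessing(train)  # learned once, on the training part
    print(apply_preprocessing(new, params))  # reused as-is on new data
```

Keeping `params` separate from the data is the point of the design: the same saved pre-processing has to be applied to any data scored later, otherwise the anomaly scores are not comparable.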

Good luck -- Jacob

P.S. This process can be used for finding anomalies in very large data sets (you'd have to join the two streams of processing) or for deploying an anomaly model for new data (the bottom part can be placed in a separate process, but you'd have to save your pre-processing models).

P.P.S. I have included the improved version of the process as V2. Unfortunately, I could not figure out how to delete an older attachment...

sgenzer · May 2019

@jacobcybulski I like this...can I add it to my list of processes to add to the Community Repo?

jacobcybulski · May 2019

@sgenzer I have updated the example to make it more complete and to show that the anomaly model can be trained and used either for large data sets or for deployment. Please feel free to add this example. And yes, it is lightning fast.

I could not figure out how to delete an older attachment though
