🦉 🎤   RapidMiner Wisdom 2020 - CALL FOR SPEAKERS   🦉 🎤

We are inviting all community members to submit proposals to speak at Wisdom 2020 in Boston.


Whether it's a cool RapidMiner trick or a use case implementation, we want to see what you have.
Form link is below and deadline for submissions is November 15. See you in Boston!

CLICK HERE TO GO TO ENTRY FORM

Outlier detection taking a lot of time and not giving any results

shubham_samant Member Posts: 2 Newbie
edited June 27 in Help
I am using the Auto Model feature in RapidMiner with an educational license. My processor is a 6-core i7-8750H with 16 GB of RAM. The dataset has more than 30,000 rows and 12 columns, with both numeric and text data. I ran it using the best features (as detected by Auto Model) with the distance-based outlier method. The model has run for more than 18 hours and is still processing.
I have also installed the outlier detection extension, but it didn't help.
How can I solve this issue? Is it because the educational version uses only 1 core?
Kindly help.

Answers

  • hughesfleming68 Member Posts: 249   Unicorn
    edited May 28
    There is no easy solution except to experiment with a smaller dataset. Before jumping in with 30,000 rows, try 10% of your data. At least you will be able to evaluate your results and get a sense of your computing time. This is not an uncommon situation, but if you need to solve problems of this size regularly, then you will have to rethink how you do it. Not all processes scale well just by adding cores; assuming that they do is a major misconception.
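A rough back-of-the-envelope sketch of why a 10% sample is so much cheaper. It assumes a brute-force distance-based method that compares every pair of rows; whether RapidMiner's operator works this way internally is an assumption here, not something confirmed in this thread. The function name `pairwise_distance_count` is hypothetical.

```python
def pairwise_distance_count(n_rows: int) -> int:
    """Number of distinct row pairs a brute-force distance method would compare.

    Assumption: every row is compared with every other row exactly once.
    """
    return n_rows * (n_rows - 1) // 2


full = pairwise_distance_count(30_000)   # the 30,000-row data set
sample = pairwise_distance_count(3_000)  # a 10% sample

print(full)                  # 449985000
print(sample)                # 4498500
print(round(full / sample))  # 100
```

Under that assumption, cutting the rows to 10% cuts the distance computations to roughly 1%, which is why timing a small sample first gives a useful estimate before committing to the full run.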
  • jacobcybulski Member, University Professor Posts: 83   Unicorn
    edited May 29
    I found that the methods included in the anomaly detection extension are more efficient than those built into RM itself. The extension operators also create an outlier detection model, which allows you to do the following: take a representative sample of your data, build an outlier detection model, and then apply the model to the remaining data to identify all other outliers. It is the model building that takes so much time; applying the model is very fast. You can use the same approach when outlier detection used in training a predictive model needs to be deployed: you would identify and eliminate anomalous examples in new data in exactly the same way as in your training data set.
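The "build on a sample, apply to the rest" idea can be sketched in plain Python. This is not the extension's algorithm: it swaps in a simple k-nearest-neighbour distance score for illustration, and the names `fit_knn_model` and `anomaly_score` are hypothetical. In this toy version, "building" the model only stores the sample, so it does not reproduce the build-time cost described above; it only shows the workflow.

```python
import heapq
import math
import random


def fit_knn_model(sample_rows):
    """'Train' on a representative sample (here: just store the reference rows)."""
    return [tuple(row) for row in sample_rows]


def anomaly_score(model, row, k=5):
    """Mean Euclidean distance from `row` to its k nearest rows in the model."""
    nearest = heapq.nsmallest(k, (math.dist(row, ref) for ref in model))
    return sum(nearest) / len(nearest)


# Synthetic data (assumption): 1,000 ordinary points plus one planted anomaly.
random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
data.append((8.0, 8.0))

sample, rest = data[:200], data[200:]   # small sample for the model, rest to score
model = fit_knn_model(sample)
scores = [anomaly_score(model, row) for row in rest]

# Index of the row in `rest` with the highest anomaly score.
top = max(range(len(rest)), key=scores.__getitem__)
print(rest[top], scores[top])
```

Each remaining row is compared only against the 200 stored sample rows rather than against every other row, so the scoring cost grows linearly with the amount of remaining data. The planted point far from the rest should receive the highest score.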

  • shubham_samant Member Posts: 2 Newbie

    Our data set has more than 100,000 records. I reduced the sample size to 30,000; if I reduce it further, to say 3,000, the sample is too small to be representative for model training.

    I have tried running the full data set in Python with 2-3 different algorithms, and it gives me results successfully. When I run outlier detection models in Python I do not get out-of-memory issues, but RapidMiner runs out of memory even with a relatively smaller data set. Why is Auto Model outlier detection failing on relatively mid-size data sets?


  • jacobcybulski Member, University Professor Posts: 83   Unicorn
    edited May 30
    I have just created a similar kind of a model using 130,000 wine reviews from Kaggle. See this process (also attached as RMP):


    Note that it is all in the preparation of the data, e.g. make sure you have no missing values, numeric values are preferred, and I have normalized my attributes. I have split all the data into two parts: the first is a smaller data set of 40K examples used for creating an anomaly model, and the second is a larger data set of the remaining 90K examples, where I found the anomalies by using the pre-trained anomaly model. To visualize anomalies, I created a PCA model, marked examples with a high anomaly score as outliers, and then plotted both, see below (40K first and 90K second):
     
     
    It all took less than 5 minutes on my old laptop with 16 GB of RAM.
    If it does not work for you, try upgrading your RM to version 9.2+, as it fixes some bugs which you may still have in an older version of RM.
    Good luck -- Jacob
    P.S. This process can be used for finding anomalies in very large data sets (you'd have to join the two streams of processing) or for deploying an anomaly model for new data (the bottom part can be placed in a separate process, but you'd have to save your pre-processing models).
    P.P.S. I have included the improved version of the process as V2; unfortunately I could not figure out how to delete an older attachment...
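The point about saving pre-processing models for deployment can be illustrated with a small Python sketch. It assumes z-score normalization, which the post does not specify, and it uses the largest absolute z-score as a stand-in anomaly score; that is a hypothetical simplification, not the score produced by the anomaly detection extension. The function names, the toy values, and the threshold of 3.0 are all made up for illustration.

```python
import statistics


def fit_zscore(rows):
    """Learn per-column mean and standard deviation from the training part only."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid division by zero.
    return [(statistics.fmean(c), statistics.pstdev(c) or 1.0) for c in columns]


def apply_zscore(params, rows):
    """Normalize rows with previously saved parameters (no refitting)."""
    return [tuple((v - m) / s for v, (m, s) in zip(row, params)) for row in rows]


def flag_outliers(scores, threshold):
    """Mark examples whose anomaly score exceeds the threshold."""
    return [score > threshold for score in scores]


# Toy values (assumption): a training part and two new examples.
train = [(10.0, 200.0), (12.0, 220.0), (11.0, 210.0), (13.0, 230.0)]
new = [(11.5, 215.0), (30.0, 900.0)]

params = fit_zscore(train)               # saved pre-processing "model"
new_scaled = apply_zscore(params, new)   # reused unchanged on new data

# Stand-in anomaly score: largest absolute z-score in each row.
scores = [max(abs(v) for v in row) for row in new_scaled]
print(flag_outliers(scores, threshold=3.0))  # [False, True]
```

The design choice worth noting is that the normalization parameters are learned once from the training part and then reused unchanged, so new data is scaled in exactly the same way as the data the anomaly model was built on.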


  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,518  Community Manager
    @jacobcybulski I like this...can I add it to my list of processes to add to the Community Repo?

  • jacobcybulski Member, University Professor Posts: 83   Unicorn
    edited May 29
    @sgenzer I have updated the example to make it more complete and to show that the anomaly model can be trained and used either for large data sets or for deployment. Please feel free to add this example. And yes, it is lightning fast :) I could not figure out how to delete an older attachment though :(
