
Outliers in a big dataset

Mirte Member Posts: 7 Learner III
edited December 2018 in Help

Hello, I'm a total newbie with RapidMiner.

 

I have a big dataset with a numeric target and many (34) attributes.

I have to estimate the value of the target, and I will use a linear regression.

 

Now I want to detect outliers but RM freezes whenever I do this.

What is the best way to tackle this? Do I need to downsize the dataset with the Sample operator?

Or should I use the "Remove Useless Attributes" operator and maybe also downsize the dataset?


Best Answer

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Sampling is always a good way to start exploring a problem without running into long runtimes or out-of-memory issues.

    I would highly recommend it.  "Remove Useless Attributes" will only take out attributes that are constant or missing so it probably isn't going to reduce your overall dataset size very much.

    I would also explore some of the weighting operators to understand which attributes are related to your target label. Weight by Correlation is a good starting point if you are thinking of using a linear regression.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
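
The sampling and correlation-weighting advice above can be sketched outside RapidMiner in plain Python. This is only an illustration of the idea, not what the Sample or Weight by Correlation operators do internally: the `pearson` and `rank_by_correlation` helpers, the toy columns (`a`, `noise`, `const`) and the sample size of 500 are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(rows, target, sample_size, seed=0):
    """Sample rows, then rank attributes by |correlation| with the target."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    ys = [row[target] for row in sample]
    weights = {
        name: abs(pearson([row[name] for row in sample], ys))
        for name in sample[0]
        if name != target
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: "a" drives the target, "noise" does not,
# and "const" is the kind of column Remove Useless Attributes would drop.
rng = random.Random(42)
rows = []
for _ in range(5000):
    a = rng.uniform(0, 10)
    rows.append({"a": a, "noise": rng.uniform(0, 10), "const": 1.0,
                 "target": 3 * a + rng.gauss(0, 1)})

ranking = rank_by_correlation(rows, target="target", sample_size=500)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Attributes near the bottom of such a ranking are candidates for removal before fitting the linear regression, though a low linear correlation does not rule out a nonlinear relationship.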

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @Mirte, welcome to the community! I'd recommend posting your XML process here (see "Read Before Posting" on the right when you reply) and attaching your dataset. This way we can replicate what you're doing and help you better.

     

    Scott

     

     

  • Mirte Member Posts: 7 Learner III

    This is an example of how I am doing it. My goal is to predict the target with a linear regression.

    Am I doing this the right way?

     

  • Mirte Member Posts: 7 Learner III

    I also have an additional question. If I want to estimate the value of the target attribute with linear regression, and there are so many attributes, what is the best way to identify the relevant attributes that influence the target variable, and how do I remove the other ones to make the dataset smaller?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi, thanks for posting. Detect Outliers is going to take disproportionately longer as the number of rows you are examining grows. Running your process with 1000 rows takes 4 seconds. Running with 2000 rows takes 18 seconds. Running with 3000 rows takes 76 seconds. You get the idea. It's a Big O thing.

     

    Scott
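
One rough way to read the three timings above is to fit a straight line to log(seconds) against log(rows); the slope then approximates the exponent k in a runtime of about rows^k. The sketch below does this with a plain least-squares fit. It assumes a power-law model, which three measurements cannot confirm, and it says nothing certain about how the operator is implemented; `loglog_slope` is a hypothetical helper.

```python
import math

# Timings reported in the thread: (rows, seconds).
timings = [(1000, 4.0), (2000, 18.0), (3000, 76.0)]


def loglog_slope(points):
    """Least-squares slope of log(seconds) against log(rows)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


exponent = loglog_slope(timings)
print(f"runtime grows roughly like rows^{exponent:.1f}")
```

The fitted exponent lands between 2 and 3, so doubling the row count costs well over four times the runtime. That is why running the outlier step on a sample makes such a large difference.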

     

     

     

     

     
