Outliers in a big dataset

Mirte Member Posts: 7 Contributor I
edited December 2018 in Help

Hello, I'm a total newbie with RapidMiner.

 

I have a big dataset with a numeric target and many (34) attributes.

I have to estimate the value of the target, and I will use a linear regression.

 

Now I want to detect outliers but RM freezes whenever I do this.

What is the best way to tackle this? Do I need to downsize the dataset with the Sample operator?

Or should I use the "Remove Useless Attributes" operator and maybe also downsize the dataset?

Best Answer

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Sampling is always a good way to start exploring a problem without running into long runtimes or out-of-memory issues.

    I would highly recommend it. "Remove Useless Attributes" will only take out attributes that are constant or missing, so it probably isn't going to reduce your overall dataset size very much.

    I would also explore some of the weighting operators to understand which attributes are related to your target label. Weight by Correlation is a good starting point if you are thinking of using a linear regression.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
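The two suggestions above (sample first, then weight attributes by their correlation with the label) can be sketched outside RapidMiner as well. The following is a minimal Python/NumPy illustration, not the RapidMiner operators themselves: the function names `sample_rows` and `correlation_weights` are hypothetical, the data is synthetic, and using the absolute Pearson correlation as the weight is an assumption about what a correlation-based weighting would reasonably look like.

```python
import numpy as np

def sample_rows(X, y, n_rows, seed=0):
    """Draw a random subset of rows without replacement (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_rows = min(n_rows, X.shape[0])
    idx = rng.choice(X.shape[0], size=n_rows, replace=False)
    return X[idx], y[idx]

def correlation_weights(X, y):
    """Absolute Pearson correlation of each column of X with y.

    Constant columns have undefined correlation; they are given weight 0.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.abs(np.nan_to_num(r))

# Synthetic stand-in for the dataset in the question: 34 numeric attributes,
# of which only attributes 0 and 5 actually drive the target (an assumption
# made purely so the example has a known answer).
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 34))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.5, size=5000)

# Explore on a 1000-row sample rather than the full table.
X_s, y_s = sample_rows(X, y, 1000)
weights = correlation_weights(X_s, y_s)

# Attribute indices ordered from most to least correlated with the target.
top = np.argsort(weights)[::-1][:5]
print("top attributes:", top)
print("their weights:", np.round(weights[top], 3))
```

Attributes with weights near zero are candidates for removal before fitting the regression, although correlation only captures linear, one-attribute-at-a-time relationships.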

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @Mirte, welcome to the community! I'd recommend posting your process XML here (see "Read Before Posting" on the right when you reply) and attaching your dataset. This way we can replicate what you're doing and help you better.

     

    Scott

     

     

  • Mirte Member Posts: 7 Contributor I

    This is an example of how I am doing it. My goal is to predict the target with a linear regression.

    Am I doing this the right way?

     

  • Mirte Member Posts: 7 Contributor I

    I also have an additional question. If I want to estimate the value of the target attribute with linear regression, and there are so many attributes, what is the best way to identify the relevant attributes that influence the target variable, and how do I remove the other ones to make the dataset smaller?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi, thanks for posting. Outlier detection is going to take exponentially longer as the number of rows you examine grows. Running your process with 1000 rows takes 4 seconds; with 2000 rows, 18 seconds; with 3000 rows, 76 seconds. You get the idea. It's a Big O thing.

     

    Scott
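The three timings quoted above can be checked with a little arithmetic. The sketch below is plain Python, not part of the posted process. It computes how much the runtime grows per extra 1000 rows, and the exponent k that a power law of the form time ∝ rows^k would need between each pair of measurements. Three data points are far too few to pin down a growth law, so treat the numbers as a rough indication only.

```python
import math

# Timings quoted in the reply above (rows examined, seconds of runtime).
rows = [1000, 2000, 3000]
seconds = [4.0, 18.0, 76.0]

# Growth factor in runtime for each additional 1000 rows.
ratios = [seconds[i + 1] / seconds[i] for i in range(len(seconds) - 1)]

# Exponent k that a power law time ~ rows**k would need between each pair:
# k = log(t2 / t1) / log(n2 / n1)
exponents = [
    math.log(seconds[i + 1] / seconds[i]) / math.log(rows[i + 1] / rows[i])
    for i in range(len(seconds) - 1)
]

for i in range(len(ratios)):
    print(
        f"{rows[i]} -> {rows[i + 1]} rows: "
        f"runtime x{ratios[i]:.2f}, implied power-law exponent {exponents[i]:.2f}"
    )
```

On these numbers each additional 1000 rows multiplies the runtime by a bit more than 4, and the implied exponent rises from one step to the next, which is why running outlier detection on a sample rather than the full dataset makes such a difference.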

     

     

     

     

     
