
Trimming a vectorset

kayman Member Posts: 662 Unicorn
edited November 2018 in Help

I have a vectorset containing a few thousand attributes, and I want to work with samples for testing purposes. It's easy enough to get the sample set, but I also get the full attribute set, while I am only interested in the attributes that are applicable to the sample. What is the easiest way to remove all irrelevant attributes?

 

Example:

 

value1 - true - true - false - true

value2 - true - false - false - false

value3 - false - true - false - true

 

In this oversimplified example it would mean I'd like to skip the 3rd attribute, as it is always false. What is the fastest way to achieve this (on a much bigger set)?
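
To make the intent concrete, here is roughly what I mean as a Python/pandas sketch (the attribute names are made up for illustration):

```python
import pandas as pd

# Toy version of the example above: three rows, four boolean attributes.
df = pd.DataFrame({
    "attr1": [True, True, False],
    "attr2": [True, False, True],
    "attr3": [False, False, False],   # constant within the sample -> irrelevant
    "attr4": [True, False, True],
}, index=["value1", "value2", "value3"])

# Keep only attributes that take more than one value in the sample.
trimmed = df.loc[:, df.nunique() > 1]
print(trimmed.columns.tolist())   # ['attr1', 'attr2', 'attr4']
```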

Answers

    JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    The Remove Useless Attributes operator would be my first thought from your example.  (I'm sure you have more complex cases, but it will certainly pick up attributes like the ones in your example.)

     

    See attached process to import into RM. 

    kayman Member Posts: 662 Unicorn

    I actually tried that one, but it turned out that my repositories were bigger after implementing this than before, and the number of attributes looked the same. I have to admit I did not check in detail, so maybe I'll try it myself on a smaller test set.

    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    The "remove useless attributes" operator should do exactly that--actually remove attributes that are all the same or that have deviation below the minimum that you set.  You can look at the summary statistics page to see whether attributes are really all the same or whether there are a small number of cases with different values that may be preventing them from being removed.

    A somewhat more sophisticated approach would be to run a principal component analysis (using the PCA operator), which analyzes all specified attributes and reduces them to a new, smaller set of components that together explain variance above a user-defined threshold.  This is more useful when you have attributes that are not strictly useless (meaning all the same values) but are highly correlated with each other or otherwise add minimal variance to the full dataset: PCA will reduce them to a smaller set of synthetic attributes that capture almost all the variance of the full set.  You can read more about PCA here: https://en.wikipedia.org/wiki/Principal_component_analysis
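
    Outside of RapidMiner, the same idea in scikit-learn looks roughly like this (a sketch only; the 95% threshold and the random example data are placeholders):

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    # Random data standing in for the real attribute set,
    # with one column made almost redundant on purpose.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:, 5] = 0.99 * X[:, 0] + rng.normal(scale=0.01, size=200)

    # Keep as many components as needed to explain 95% of the variance.
    pca = PCA(n_components=0.95)
    reduced = pca.fit_transform(X)
    print(X.shape[1], "attributes ->", reduced.shape[1], "components")
    ```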

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts