RapidMiner

Trimming a vectorset

Elite

I have a vectorset containing a few thousand attributes, and want to work with samples for testing purposes. It is easy enough to get the sample set, but I also get the full attribute set, while I am only interested in the attributes that are applicable to the sample. What is the easiest way to remove all irrelevant attributes?

 

Example:

value1 - true  - true  - false - true
value2 - true  - false - false - false
value3 - false - true  - false - true

 

In this oversimplified example, that would mean I'd like to skip the 3rd attribute, as it is always false. What is the fastest way to achieve this (on a much bigger set)?

3 REPLIES
Elite III

Re: Trimming a vectorset

The Remove Useless Attributes operator would be my first thought from your example. (I'm sure you have more complex cases, but it will certainly pick up all attributes like the one in your example.)

 

See attached process to import into RM. 

-- Training, Consulting, Sales in China, Hong Kong & Taiwan --
www.RapidMinerChina.com

Elite

Re: Trimming a vectorset

I actually tried that one, but it turned out that my repositories were bigger after applying it than before, and the number of attributes looked the same. I have to admit I did not check in detail, so maybe I'll try again with a smaller test set.

Elite III

Re: Trimming a vectorset

The "Remove Useless Attributes" operator should do exactly that: remove attributes whose values are all the same or whose deviation is below the minimum that you set. You can look at the summary statistics page to see whether the attributes really are constant, or whether a small number of cases with different values is preventing them from being removed.
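To make the idea concrete outside RapidMiner, here is a minimal Python sketch of the same filtering logic, applied to the three-row sample from the question. It is not the operator's actual implementation: the function name `remove_useless_attributes`, the `min_deviation` parameter, the attribute names `a1` to `a4`, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def remove_useless_attributes(rows, min_deviation=0.0):
    """Return copies of rows without attributes that are useless in this sample.

    An attribute is dropped when all of its values are identical, or when it
    is numeric and its (population) standard deviation is <= min_deviation.
    Hypothetical helper for illustration, not RapidMiner's own code.
    """
    if not rows:
        return []
    keep = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        if len(set(values)) <= 1:
            continue  # constant within the sample, so it carries no information
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if numeric and statistics.pstdev(values) <= min_deviation:
            continue  # varies, but less than the chosen threshold
        keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


# The sample from the question; the third attribute (a3) is always false.
sample = [
    {"id": "value1", "a1": True, "a2": True, "a3": False, "a4": True},
    {"id": "value2", "a1": True, "a2": False, "a3": False, "a4": False},
    {"id": "value3", "a1": False, "a2": True, "a3": False, "a4": True},
]

trimmed = remove_useless_attributes(sample)
print(list(trimmed[0]))  # ['id', 'a1', 'a2', 'a4']
```

Comparing the attribute count before and after on a small sample like this is also a quick way to confirm that the filter is actually removing something.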

A somewhat more sophisticated approach would be to run a principal component analysis (using the PCA operator), which analyzes all specified attributes and reduces them to a new, smaller set of components that together explain a user-defined share of the variance. This is somewhat more useful when you have variables that are not strictly useless (meaning all the same values) but are highly correlated with each other or otherwise add minimal variance to the full dataset: PCA will then reduce the dataset to a smaller set of synthetic attributes that capture almost all the variance of the full set. You can read more about PCA here: https://en.wikipedia.org/wiki/Principal_component_analysis
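For reference, the variance-threshold selection described above can be written out as follows. This is the standard textbook formulation, not necessarily how the PCA operator computes it internally; the symbols X, C, W_k and the threshold tau (for example 0.95) are notation introduced here for illustration.

```latex
% X: n x p data matrix with each attribute (column) centered to mean zero
C = \frac{1}{n-1} X^{\top} X
  \qquad \text{(sample covariance matrix)}

% Eigen-decomposition, eigenvalues sorted in decreasing order
C\, w_j = \lambda_j w_j ,
  \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% Keep the smallest number of components k whose cumulative
% explained variance reaches the chosen threshold \tau
k = \min \left\{ m \;:\; \frac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{p} \lambda_j} \ge \tau \right\}

% New synthetic attributes: project onto the first k eigenvectors
Z = X W_k , \qquad W_k = [\, w_1 \; w_2 \; \cdots \; w_k \,]
```

A constant attribute contributes zero variance, so it adds nothing to any retained component, while a group of highly correlated attributes tends to collapse into a single component.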

 

 

 

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts