I have a vector set containing a few thousand attributes, and want to work with samples for testing purposes. Getting the sample set is easy enough, but I also get the full attribute set, while I am only interested in the attributes that are applicable to the sample. What is the easiest way to remove all irrelevant attributes?
value1 - true  - true  - false - true
value2 - true  - false - false - false
value3 - false - true  - false - true
In this oversimplified example, I would like to skip the 3rd attribute, as it is always false. What is the fastest way to achieve this (on a much bigger set)?
The Remove Useless Attributes operator would be my first thought from your example. (I'm sure you have more complex cases, but it will certainly pick up attributes like the one in your example above.)
See attached process to import into RM.
I actually tried that one, but it turned out that my repository entries were bigger after applying it than before, and the number of attributes looked the same. I have to admit I did not check in detail, so maybe I'll try again with a smaller test set.
The "Remove Useless Attributes" operator should do exactly that: actually remove attributes whose values are all the same, or whose deviation is below the minimum that you set. You can look at the summary statistics page to see whether the attributes really are all identical, or whether a small number of cases with different values may be preventing them from being removed.
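Outside of RapidMiner, the same idea can be sketched in a few lines of Python with pandas (a hypothetical data frame mirroring the example above, not the poster's actual data): drop every column that has only one distinct value.

```python
import pandas as pd

# Hypothetical sample mirroring the example in the question:
# column "c" is constant (always False), so it counts as "useless".
df = pd.DataFrame({
    "a": [True, True, False],
    "b": [True, False, True],
    "c": [False, False, False],  # constant column
    "d": [True, False, True],
})

# Keep only columns with more than one distinct value
useful = df.loc[:, df.nunique() > 1]
print(list(useful.columns))  # ['a', 'b', 'd']
```

To also drop nearly constant numeric attributes (the "minimum deviation" setting), you could filter on `df.std()` against a small threshold instead of `nunique()`.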
A somewhat more sophisticated approach would be to run a principal component analysis (using the PCA operator), which analyzes all specified attributes and reduces them to a new, smaller set of components whose variances lie above a user-defined threshold. This is more useful because if you have attributes that are not strictly useless (meaning all the same value) but are highly correlated with each other or otherwise add minimal variance to the full dataset, PCA will reduce the dataset to a smaller set of synthetic attributes that capture almost all the variance of the full set. You can read more about PCA here: https://en.wikipedia.org/wiki/Principal_component_analysis