Trimming a vectorset
I have a vectorset containing a few thousand attributes and want to work with samples for testing purposes. Getting the sample set is easy enough, but it still carries the full attribute set, while I am only interested in the attributes that are relevant to the sample. What is the easiest way to remove all irrelevant attributes?
Example:
value1 - true - true - false - true
value2 - true - false - false - false
value3 - false - true - false - true
In this oversimplified example I would like to drop the third attribute, as it is always false. What is the fastest way to achieve this (on a much bigger set)?
Answers
The Remove Useless Attributes operator would be my first thought from your example. (I'm sure your real data is more complex, but it will certainly pick up all attributes like the one in your example.)
See attached process to import into RM.
I actually tried that one, but it turned out that my repositories were bigger after implementing this than before, and the number of attributes looked the same. I have to admit I did not check in detail, so maybe I'll try again myself with a smaller test set.
The "Remove Useless Attributes" operator should do exactly that: it removes attributes whose values are all the same or whose deviation is below the minimum that you set. You can look at the summary statistics page to see whether the attributes really are all the same, or whether a small number of cases with different values is preventing them from being removed.
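To make the idea concrete outside RapidMiner, here is a minimal Python sketch of the same logic applied to the example from the question. It is not the operator's actual implementation; the function name `remove_constant_attributes` and the attribute names `a1` to `a4` are made up for illustration, and it only handles the "all values identical" case, not a minimum-deviation threshold.

```python
def remove_constant_attributes(rows):
    """Drop every attribute that has the same value in all rows of the sample.

    `rows` is a list of dicts that all share the same keys (hypothetical layout).
    """
    if not rows:
        return []
    # Keep an attribute only if at least two distinct values occur in the sample.
    keep = [name for name in rows[0] if len({row[name] for row in rows}) > 1]
    return [{name: row[name] for name in keep} for row in rows]


# The three examples from the question; a3 is false everywhere.
sample = [
    {"a1": True, "a2": True, "a3": False, "a4": True},
    {"a1": True, "a2": False, "a3": False, "a4": False},
    {"a1": False, "a2": True, "a3": False, "a4": True},
]

trimmed = remove_constant_attributes(sample)
print(list(trimmed[0]))  # ['a1', 'a2', 'a4']
```

If the attribute count does not shrink on real data, the same check (counting distinct values per attribute) is a quick way to see whether a handful of deviating rows is keeping an attribute alive.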
A somewhat more sophisticated approach would be to run a principal components analysis (using the PCA operator), which analyzes all specified attributes and reduces them to a new, smaller set of components that together retain variance above a user-defined threshold. This is somewhat more useful because it also handles attributes that are not strictly useless (meaning all the same values) but are highly correlated with each other or otherwise add minimal variance to the full dataset: it reduces the dataset to a smaller set of synthetic attributes that capture almost all the variance of the full set. You can read more about PCA here: https://en.wikipedia.org/wiki/Principal_component_analysis
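As a rough illustration of what a variance threshold does in PCA, the following pure-Python sketch extracts components one at a time (power iteration with deflation) until the chosen share of total variance is covered. This is an assumption-laden toy, not how the RapidMiner PCA operator is implemented: the function names, the 0.95 threshold, and the sample data are invented for the example, it assumes numeric attributes and at least two rows, and for real work a numerical library would be the better choice.

```python
import math


def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (needs at least two rows)."""
    n = len(rows)
    columns = list(zip(*rows))
    centered = [[v - sum(col) / n for v in col] for col in columns]
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]


def top_eigenpair(matrix, iterations=200):
    """Power iteration on a symmetric matrix: (eigenvalue, unit eigenvector)."""
    vec = [float(i + 1) for i in range(len(matrix))]  # arbitrary fixed start
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        if norm < 1e-12:  # nothing left to explain
            return 0.0, vec
        vec = [p / norm for p in product]
    # Rayleigh quotient of the converged unit vector.
    eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec))
                     for v, row in zip(vec, matrix))
    return eigenvalue, vec


def principal_components(rows, variance_threshold=0.95):
    """Return (components, explained_ratio) covering `variance_threshold`."""
    cov = covariance_matrix(rows)
    size = len(cov)
    total = sum(cov[i][i] for i in range(size))  # total variance = trace
    components, explained = [], 0.0
    while total > 0 and explained / total < variance_threshold and len(components) < size:
        eigenvalue, vec = top_eigenpair(cov)
        if eigenvalue <= 1e-12:
            break
        components.append(vec)
        explained += eigenvalue
        # Deflation: remove the variance already captured by this component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    return components, (explained / total if total > 0 else 0.0)


# Second column is exactly twice the first; third column is constant.
data = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5], [5, 10, 5]]
components, ratio = principal_components(data)
print(len(components), round(ratio, 3))
```

In this toy data a single component is enough, because one attribute is a multiple of another and the third never varies, which is exactly the kind of redundancy described above.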
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts