Transform data into table with every attribute representation

moritz_moeller Member Posts: 5 Learner I
edited January 2019 in Help
Hey there,

Since my data set is too big to analyze with a clustering algorithm (and I don't want to wait as long as that would take), I want to reduce it to a smaller set.

My question is whether it is possible to transform it into a smaller data set in which every attribute value is still represented in a representative proportion. For example: I have a data set with 10 million rows and 3 columns, each of which can take 5 different values (i.e. 1-5). I want a data set that contains all 3 columns and all of their values, but only 100k rows, so that I can analyze it. Is there a way to do that automatically in RM? If not, I think I'll have to do it manually somehow.

Thanks and Greetings,

Moritz


Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,408 Unicorn
    You can use the Sample or Sample (Stratified) operator to get a smaller set for your initial testing. Stratified sampling preserves the relative distribution of a label. In your case you don't have a single label, but you could designate any one of your 3 columns as the label (use Set Role) and then, after sampling, check the distribution of all 3 columns against the original complete dataset. As long as the sample is large enough and your values are not extreme outliers, you should get a representative mix of all your possible values.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
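
For readers who want to see the idea outside of RapidMiner, here is a minimal Python sketch of the approach described above: stratify on one column, then compare the value distribution of every column in the sample against the full data. It is not the RapidMiner operator itself. The function names (`stratified_sample`, `distribution`, `max_deviation`), the synthetic data, and the 1% sampling fraction are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, label_index, fraction, seed=0):
    """Draw the same fraction of rows from each value of one column."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_index]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per value so no value disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def distribution(rows, index):
    """Relative frequency of each value in one column."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {value: counts[value] / total for value in sorted(counts)}


def max_deviation(full, sample, index):
    """Largest absolute gap in relative frequency for one column."""
    d_full = distribution(full, index)
    d_sample = distribution(sample, index)
    return max(abs(d_full[v] - d_sample.get(v, 0.0)) for v in d_full)


# Hypothetical stand-in data: 3 columns, each with values 1-5.
rng = random.Random(42)
data = [tuple(rng.randint(1, 5) for _ in range(3)) for _ in range(100_000)]

# Column 0 plays the role of the label; the other two are only checked.
subset = stratified_sample(data, label_index=0, fraction=0.01)
deviations = [max_deviation(data, subset, i) for i in range(3)]

for i, dev in enumerate(deviations):
    print(f"column {i}: max deviation from full data = {dev:.4f}")
```

Only the column used as the label is guaranteed to keep its proportions. The other columns are sampled at random within each group, so their distributions should be close for a large enough sample, but they are worth checking as suggested.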