Transform data into table with every attribute representation

moritz_moeller Member Posts: 5 Learner I
edited January 2019 in Help
Hey there,

My data set is too big to analyze with a clustering algorithm (and I don't want to wait as long as that would take), so I want to transform it into a smaller set.

My question is whether it is possible to reduce it to a data set in which the values of every attribute appear in representative proportions. For example: I have a data set with 3 columns, each with 5 possible values (i.e. 1-5), and 10 million rows. Now I want a data set that contains all 3 columns with all values present, but only 100k rows, so that I can analyze it. Is there an option to do that automatically in RM? If not, I think I will have to do it manually somehow.

Thanks and Greetings,

Moritz


Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You can use the Sample or Sample (Stratified) operator to get a smaller set for your initial testing. Stratified sampling preserves the relative distribution of a label. In your case you don't have a single label, but you could designate any one of your 3 columns as the label (use Set Role) and then, after sampling, check the distribution of all 3 columns against the original complete dataset. As long as the sample is large enough and your values are not extreme outliers, you should get a representative mix of all your possible values.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
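
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the approach described above: stratify on one column, then compare the value distribution of all 3 columns between the full set and the sample. It is only an illustration, not what the Sample (Stratified) operator does internally. The helper names `stratified_sample` and `distribution` are made up for this example, and the data is random toy data (100k rows instead of 10 million, to keep it quick).

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, strat_col, fraction, seed=0):
    """Draw `fraction` of the rows separately within each value of `strat_col`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[strat_col]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per value so that no value disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def distribution(rows, col):
    """Relative frequency of each value in column `col`."""
    counts = Counter(row[col] for row in rows)
    total = len(rows)
    return {value: counts[value] / total for value in sorted(counts)}


# Toy stand-in for the real data (assumption): 3 columns, values 1-5.
data_rng = random.Random(42)
full = [tuple(data_rng.randint(1, 5) for _ in range(3)) for _ in range(100_000)]

# Column 0 plays the role of the "label" used for stratification.
sample = stratified_sample(full, strat_col=0, fraction=0.01)

# Compare each column's distribution in the sample with the full data.
for col in range(3):
    full_dist = distribution(full, col)
    sample_dist = distribution(sample, col)
    max_gap = max(abs(full_dist[v] - sample_dist.get(v, 0.0)) for v in full_dist)
    print(f"column {col}: largest share difference = {max_gap:.4f}")
```

Only the stratification column is guaranteed to keep its proportions. The other two columns are sampled at random within each stratum, so their shares should come out close but are worth checking, which is what the final loop does.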