Aggregate Duplicates
Can you suggest a method to remove duplicate examples and add a "count" attribute to the remaining unique items?
I would like to do that to reduce the size of the dataset and then use this counter attribute with a k-NN operator. Is that even possible in RM?
Answers
The Aggregate operator is your friend - here's an example
regards
Andrew
If I understand correctly, you suggest aggregating duplicates using the aggregate operator and "group by" all attributes.
How can this be utilized to make a k-NN faster?
Having 20 million samples with 20 attributes but only 1 million possible attribute combinations will result in a dataset of 1 million examples with 21 attributes.
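To make the size reduction concrete, here is a minimal Python sketch of the same "group by all attributes and count" idea, outside RapidMiner. It is not the RapidMiner process from the answer above; the function name `aggregate_duplicates` and the toy rows are made up for illustration.

```python
from collections import Counter

def aggregate_duplicates(rows):
    """Collapse identical rows into (unique_row, count) pairs.

    Hypothetical helper: every attribute takes part in the grouping key,
    which mirrors "group by" on all attributes.
    """
    counts = Counter(tuple(row) for row in rows)
    return list(counts.items())

# Toy data (assumed): 6 examples, only 3 distinct attribute combinations.
rows = [
    (1, 0, "a"),
    (1, 0, "a"),
    (2, 1, "b"),
    (1, 0, "a"),
    (2, 1, "b"),
    (3, 1, "a"),
]

unique = aggregate_duplicates(rows)
for row, count in unique:
    print(row, count)
```

The counts always sum to the original number of examples, so no information about how often a combination occurred is lost.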
How will k-NN work on that (i.e. will it use the 21st attribute as a weight/count or something)?
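One way the count could act as a weight is sketched below in plain Python. This is an assumption about how a count-aware k-NN might behave, not a description of RapidMiner's k-NN operator: each unique example votes as many times as its count, until k votes have been collected. The function name `weighted_knn_predict`, the Euclidean distance, and the toy data are all illustrative, and the label is assumed to have been part of the grouping key.

```python
import math
from collections import defaultdict

def weighted_knn_predict(examples, query, k):
    """Majority vote among the k nearest *original* examples.

    examples: list of (features, label, count) for the unique rows.
    A unique row with count n stands in for n identical originals, so it
    may contribute up to n of the k votes. Requires k >= 1.
    """
    ranked = sorted(examples, key=lambda e: math.dist(e[0], query))
    votes = defaultdict(int)
    remaining = k
    for features, label, count in ranked:
        take = min(count, remaining)  # do not collect more than k votes
        votes[label] += take
        remaining -= take
        if remaining == 0:
            break
    return max(votes, key=votes.get)

# Toy data (assumed): one heavily duplicated "neg" point, two single "pos" points.
examples = [
    ((0.0, 0.0), "neg", 5),
    ((1.0, 1.0), "pos", 1),
    ((0.9, 0.9), "pos", 1),
]

print(weighted_knn_predict(examples, (0.8, 0.8), k=2))
print(weighted_knn_predict(examples, (0.8, 0.8), k=5))
```

With k=2 only the two nearby "pos" points vote. With k=5 the duplicated "neg" point fills the remaining three votes and wins, which is what k-NN on the full, non-aggregated data would do as well. Ignoring the count would change such predictions.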
I think k-NN would still work, but the new aggregation attribute would need to be carefully selected in order to ensure that unseen data is near to representative examples.
As always, an experiment is needed.
regards
Andrew