RapidMiner

Mini Batch K-means in RapidMiner

Learner III morita


Hi,
I have a huge dataset (4,000,000 records) of text data that I want to cluster.

Because of memory limits and the time cost of text pre-processing, I want to read small batches from the database, pre-process each batch, and then cluster the data with mini-batch K-means. How can I do mini-batch clustering in RapidMiner?
Thanks in advance for your answers.

2 REPLIES
Community Manager

Re: Mini Batch K-means in RapidMiner

Hi,

There are several Loop operators in RapidMiner.

You can implement this batching behaviour with a loop that uses a numeric counter and selects data from your database with LIMIT n OFFSET (i - 1) * n.

Here n is your preferred batch size and i is the current iteration number, starting at 1. Usually you need to calculate the offset yourself outside of the statement, e.g. with Generate Macro. Not all databases support the LIMIT ... OFFSET syntax, but most offer the same functionality under a different name.
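The paging arithmetic above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: it uses an in-memory SQLite database, and the table name `documents`, its columns, and the helper `read_batches` are hypothetical. The `ORDER BY` clause is added because paging with an offset only yields stable, non-overlapping batches when the row order is fixed.

```python
import sqlite3

def read_batches(conn, n):
    """Yield batches of up to n rows, paging with LIMIT n OFFSET (i - 1) * n."""
    i = 1  # iteration number, starting at 1
    while True:
        offset = (i - 1) * n  # offset computed outside the SQL statement
        rows = conn.execute(
            "SELECT id, body FROM documents ORDER BY id LIMIT ? OFFSET ?",
            (n, offset),
        ).fetchall()
        if not rows:
            return  # no rows left: stop looping
        yield rows
        i += 1

# Hypothetical demo data: 10 short text records in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [(f"text {k}",) for k in range(10)],
)

batch_sizes = [len(batch) for batch in read_batches(conn, n=4)]
print(batch_sizes)  # [4, 4, 2]
```

In a RapidMiner process, the loop counter and the computed offset would play the roles of `i` and `offset` here.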

 

Regards,

Balázs

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at
Learner III morita

Re: Mini Batch K-means in RapidMiner

Hi, thanks for your answer.

The mini-batch K-means algorithm takes a small batch of the dataset in each iteration. It assigns each data point in the batch to a cluster based on the current locations of the cluster centroids, and then updates the centroid locations using the new points from the batch.
How could I build a process like this?
The Loop operator builds new clusters for the current batch in each iteration and doesn't assign the new points to the previous clusters.
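To make the behaviour described above concrete, here is a minimal plain-Python sketch of the mini-batch update. It is not a RapidMiner operator or process. The key point is that `centroids` and `counts` live outside the loop and are carried from one batch to the next. The per-centroid learning rate of 1 / count is one common formulation and is an assumption here, as are the function names, the synthetic data, and the starting centroids.

```python
import random

def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for idx, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_dist:
            best, best_dist = idx, d
    return best

def partial_fit(centroids, counts, batch):
    """Update centroids in place from one batch and return the batch labels."""
    # Assign every point using the centroid positions from before this batch.
    labels = [nearest(x, centroids) for x in batch]
    for x, k in zip(batch, labels):
        counts[k] += 1
        eta = 1.0 / counts[k]  # per-centroid learning rate
        centroids[k] = [(1 - eta) * c + eta * v for c, v in zip(centroids[k], x)]
    return labels

# Hypothetical demo data: two well-separated 2-D blobs.
random.seed(0)
blob_a = [[random.gauss(0, 0.5), random.gauss(0, 0.5)] for _ in range(100)]
blob_b = [[random.gauss(10, 0.5), random.gauss(10, 0.5)] for _ in range(100)]
data = blob_a + blob_b
random.shuffle(data)

centroids = [[1.0, 1.0], [9.0, 9.0]]  # assumed starting guesses
counts = [0, 0]                       # points seen so far per centroid
batch_size = 20
for start in range(0, len(data), batch_size):
    # State persists across iterations, so no batch starts from scratch.
    partial_fit(centroids, counts, data[start:start + batch_size])

print(centroids, counts)
```

In a loop-based process, the equivalent would be to keep the centroid state between iterations instead of training a fresh clustering model on each batch.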

 

@BalazsBarany