I have a huge dataset (4,000,000 records) of text data and I want to do clustering.
Because of memory problems and the time complexity of text pre-processing, I want to read small batches from the database and, after pre-processing, use mini-batch K-means to cluster the data. But I wonder how to use mini-batch clustering in RapidMiner.
Thanks in advance for your answers.
There are different Loop operators in RapidMiner.
You can easily implement this batching behaviour by using a loop with a numeric counter and selecting data from your database with LIMIT n OFFSET (i - 1) * n.
Here n is your preferred batch size and i is the current iteration number, starting at 1. Usually you need to calculate the offset yourself outside of the statement, e.g. with Generate Macro. Not all databases support the LIMIT ... OFFSET syntax, but most have the functionality under a different name.
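To make the offset arithmetic concrete, here is a minimal sketch of the same paging loop in Python rather than as a RapidMiner process. It uses an in-memory SQLite database; the table name docs, its columns, and the helper iter_batches are made up for the illustration. The ORDER BY is there so that consecutive pages do not overlap or skip rows.

```python
import sqlite3

def iter_batches(conn, n):
    """Yield lists of rows, n at a time, via LIMIT n OFFSET (i - 1) * n."""
    i = 1  # iteration counter, starting at 1 as described above
    while True:
        offset = (i - 1) * n  # computed outside the SQL statement
        rows = conn.execute(
            # "docs", "id" and "body" are hypothetical names for this sketch
            "SELECT id, body FROM docs ORDER BY id LIMIT ? OFFSET ?",
            (n, offset),
        ).fetchall()
        if not rows:
            break  # no more data: stop looping
        yield rows
        i += 1

# Small demo table with 10 rows standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(f"text {k}",) for k in range(10)],
)

batches = list(iter_batches(conn, 4))
print([len(b) for b in batches])
```

With 10 rows and n = 4, the loop runs with offsets 0, 4 and 8 and stops when the fourth query returns nothing.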
Hi, thanks for your answer.
The mini-batch K-means algorithm takes a small batch of the dataset in each iteration. It assigns each data point in the batch to a cluster based on the previous locations of the cluster centroids, and then updates the centroid locations using the new points from the batch.
How could I build a process like this?
The problem is that the Loop operator creates new clusters for the current batch in each iteration and doesn't assign the new points to the previous clusters.
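For reference, the behaviour described here (centroids that persist across batches instead of being rebuilt) can be sketched in plain Python. This is not a RapidMiner process, only an illustration of the update step, assuming the text has already been turned into fixed-length numeric vectors. The per-centroid learning rate 1 / count is one common choice for mini-batch K-means, and the function names are made up for the sketch.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((c - v) ** 2 for c, v in zip(centroids[j], x)),
    )

def update_with_batch(centroids, counts, batch):
    """Assign batch points to the existing centroids, then move those centroids."""
    # Step 1: assignment uses the centroid locations from previous batches.
    labels = [nearest(centroids, x) for x in batch]
    # Step 2: pull each assigned centroid towards its new points.
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-centroid learning rate (assumed choice)
        centroids[j] = [(1 - eta) * c + eta * v for c, v in zip(centroids[j], x)]
    return labels

# State that lives OUTSIDE the loop, so it survives from batch to batch.
centroids = [[0.0, 0.0], [10.0, 10.0]]  # illustrative starting centroids
counts = [0, 0]

batches = [
    [[1.0, 1.0], [9.0, 9.0]],
    [[0.0, 2.0], [11.0, 9.0]],
]
all_labels = [update_with_batch(centroids, counts, b) for b in batches]
print(all_labels, centroids)
```

The key point is that centroids and counts are created once, before the loop, and every iteration only updates them. A loop that re-runs a full clustering on each batch has no such shared state, which is why it produces unrelated clusters per batch.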