Text clustering in RapidMiner Studio
I'm trying to do unsupervised clustering of text in RapidMiner Studio. The data is in a CSV file, and one attribute is a free-text field that I would like to cluster. I have set the file up as a data source in my repository, marking that field as type text and the id field as type id. I believe I need to create a word vector for each example in my set, and I think the operator for this is "Process Documents from Data". I have it set to create word vectors using TF-IDF.
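To make the word-vector step concrete, here is a rough sketch in plain Python of what a TF-IDF vector per example looks like. It is only an illustration of the idea, not RapidMiner's implementation: the function names `tokenize` and `tfidf_vectors` are made up for this example, and the weighting (relative term frequency times the log of inverse document frequency, with no normalization) is an assumption that may differ from what "Process Documents from Data" computes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters (a stand-in for tokenizing and case transformation).
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def tfidf_vectors(docs):
    # Hypothetical helper: returns the sorted vocabulary and one numeric vector per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vocab = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty text
        vectors.append(
            [(term_freq[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors

vocab, vectors = tfidf_vectors(["Red apple", "Red car"])
```

Each document becomes one row of numbers with one column per term. A term that occurs in every document ("red" here) gets weight zero under this weighting.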
Inside Process Documents from Data, I have a tokenizer, case transformer, stopword filter, stemmer, and n-gram builder in sequence. I wired the output of Process Documents from Data to the input of k-means clustering. The process runs for a while and then halts with an error saying that the example set contains non-numeric values in a column. Is there a way to restrict the clustering to only the attributes of interest (i.e. the terms generated by Process Documents from Data)? Or do I have to filter out the other attributes first?
I also tried switching the k-means measure type to mixed, but then I get an error saying that the example set has missing values.
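The kind of filtering I am asking about can be sketched in plain Python: keep only the columns that are numeric and have no missing values, and pass just those to the clusterer. This is a sketch under assumptions, not a RapidMiner recipe. The column names (`source`, `note_len`, `apple`, `car`) and the helpers `usable_columns` and `to_matrix` are hypothetical, and I am only guessing that leftover non-term columns and empty cells are what trigger the two errors.

```python
import math

def usable_columns(rows):
    # Keep columns where every value is a real number: no text, no None, no NaN.
    def is_number(value):
        return (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and not math.isnan(value)
        )
    columns = list(rows[0].keys()) if rows else []
    return [c for c in columns if all(is_number(r.get(c)) for r in rows)]

def to_matrix(rows, columns):
    # Build the numeric matrix a distance-based clusterer would consume.
    return [[float(r[c]) for c in columns] for r in rows]

# Hypothetical example set: an id, a leftover nominal column, two term weights,
# and a numeric column with a missing value.
rows = [
    {"id": "a1", "source": "web", "apple": 0.35, "car": 0.0, "note_len": None},
    {"id": "a2", "source": "mail", "apple": 0.0, "car": 0.35, "note_len": 12},
]
cols = usable_columns(rows)
matrix = to_matrix(rows, cols)
```

Here only the two term columns survive. The id, the nominal `source` column, and the column with a missing value are left out of the matrix.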
All of the articles I read on clustering text describe the process I'm using, but it doesn't work for me. Please help.