Process of X-means cluster with text data

Joanneyu · November 2019

Hi all,

I want to run X-Means clustering on text data, but I am very new to RapidMiner. I followed several different tutorials and ended up with this process.
My data looks like the Excel sheet on the left-hand side: I have only one column containing several single words.

It would be great if someone could confirm whether the process is right or wrong. I want to use X-Means because I want to find the ideal number of clusters. I am using TF-IDF, and inside "Process Documents from Data" there are Tokenize, Transform Cases, a stopword filter, and Stem (Porter). For X-Means, I set k min to 10 and k max to 60, with cosine similarity.
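For readers who want to see what these preprocessing steps compute, here is a minimal pure-Python sketch of tokenizing, lowercasing, stopword removal, TF-IDF weighting, and cosine similarity. It is not the RapidMiner implementation: the sample sentences, the tiny stopword list, and the function names are all made up for illustration, stemming is left out to keep the sketch dependency-free, and RapidMiner's exact TF-IDF formula may differ.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the single text column; not the poster's data.
docs = [
    "Cheap flights to Vienna",
    "cheap hotel in Vienna",
    "Clustering text with TF-IDF",
    "the clustering of text",
]

# Tiny illustrative stopword list (an assumption, far shorter than a real one).
STOPWORDS = {"the", "of", "to", "in", "with", "a", "an", "and"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(token_lists):
    """Return one sparse {term: weight} dict per document, L2-normalised."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vec = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)


vectors = tfidf_vectors([preprocess(d) for d in docs])
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing two terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
print(f"doc 0 vs doc 1: {sim_01:.3f}")
print(f"doc 0 vs doc 2: {sim_02:.3f}")
```

Under this weighting, two rows that share no terms always have a cosine similarity of exactly zero. If each row really holds only one or two words, most pairs of rows may end up in that situation, which leaves a clustering algorithm very little to separate them on.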

Image: https://scontent-vie1-1.xx.fbcdn.net/v/t1.15752-9/78434949_511896169396269_9040932357380505600_n.png?_nc_cat=105&_nc_ohc=Bq_oDVPohJ8AQmHGL4LyWeSnf7WThCILRs2SAg_gzQnYoAz0LZFuyn3OQ&_nc_ht=scontent-vie1-1.xx&oh=bb521ef990fa3fecfc27b7a1ef7d1aa3&oe=5E47935E

However, the results look odd to me because cluster 0 contains almost all the data. Also, I expected the results to tell me the ideal number of clusters. Or did I make a mistake in the process?
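On the question of the ideal number of clusters: X-Means searches between k min and k max and stops at the number its heuristic prefers, so the number of distinct cluster labels in the output is its answer. A small sketch, assuming the cluster assignments have been exported as one label per row (the `labels` list below is invented for illustration, not taken from the screenshot), shows how to read off both the cluster count and how lopsided the sizes are.

```python
from collections import Counter

# Hypothetical cluster labels, one per row, standing in for an exported result.
labels = ["cluster_0"] * 95 + ["cluster_1"] * 3 + ["cluster_2"] * 2


def summarize_clusters(labels):
    """Return (number of non-empty clusters, [(cluster, size, share), ...])."""
    sizes = Counter(labels)
    total = len(labels)
    # most_common() orders clusters from largest to smallest.
    rows = [(name, size, size / total) for name, size in sizes.most_common()]
    return len(sizes), rows


n_clusters, rows = summarize_clusters(labels)
print(f"clusters found: {n_clusters}")
for name, size, share in rows:
    print(f"{name}: {size} rows ({share:.0%})")
```

A very large share for the first cluster is a quick signal that the features are not separating the rows, independent of how many clusters were found.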

Image: https://scontent-vie1-1.xx.fbcdn.net/v/t1.15752-9/78890556_548710235706695_8703660597239087104_n.png?_nc_cat=108&_nc_ohc=sSX7LJoUVJEAQl_jIsaEQUdFdxNngKBn23v0CCf7Eg_kTHwdMh9WOLOcQ&_nc_ht=scontent-vie1-1.xx&oh=599ae7c52a6bdc38cf667e3faa07932d&oe=5E4AF338

Thank you in advance!!!

sgenzer · December 2019

Hi @Joanneyu, I can't see anything wrong with your process (although I must say using Auto Model is MUCH easier than what you're trying to do here with operators). Having one cluster with almost all the items is not unusual per se; it could be a very homogeneous group, or you may not be creating enough features, or the right ones, to find differences in your texts.
Again, I'd try Auto Model.

Scott
