What's the best way to determine the number of topics in the Extract Topics from Data (LDA) operator?

cmoten Member Posts: 2 Learner I
I have a dataset containing thousands of variations of how users have listed product names, for example Apple MacBook, MacBook, MacBookPro, etc. All sorts of products are included, and I'm trying to group the similar ways people have described each one into clusters. The Extract Topics from Data operator seems to do the trick, but I have to choose the number of groups manually. Is there a way to determine the number of groups based on similarity? I hope this makes sense.

Best Answer


  • cmoten Member Posts: 2 Learner I
    Thank you so much for the example. This helps a lot. It looks like you are splitting the text on commas and saving the pieces as columns. You then transpose the data so the columns become rows, and rename the last column to “text”. Finally, you append all the individual example sets into one.
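
For readers who want to see the same reshaping outside RapidMiner, here is a rough Python sketch. It is not the process from the example: it collapses the split, transpose, rename, and append steps into one loop, and the function name `to_text_rows` and the input strings are made up for illustration (the product names are the ones from the question).

```python
def to_text_rows(records):
    """Split each comma-separated record and return one {'text': value} row per piece."""
    rows = []
    for record in records:
        for piece in record.split(","):
            piece = piece.strip()      # drop the spaces left around each comma
            if piece:                  # skip empty pieces such as a trailing comma
                rows.append({"text": piece})
    return rows


# Illustrative input only: two records, one of them comma-separated.
raw_records = ["Apple MacBook, MacBook", "MacBookPro"]
rows = to_text_rows(raw_records)

for row in rows:
    print(row["text"])
```

Each product description ends up as its own row under a single "text" column, which is the shape a topic model expects as input.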

    The Optimize Parameters operator determines that the optimal number of topics is 6, but the number of topics shown on the Extract Topics from Data operator is still 10. The result of the optimization is passed through as a parameter to Extract Topics. I think I get how it works.
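
The parameter optimization described above is essentially a grid search: try each candidate number of topics, score the result, and hand the best value to the topic model in place of its default. Below is a minimal Python sketch of that idea, not RapidMiner's implementation. `best_topic_count` and `toy_score` are hypothetical names, and `toy_score` is a placeholder rigged to peak at 6 only so the output mirrors this thread; a real run would score each candidate with whatever performance measure the process optimizes, which the thread does not name.

```python
def best_topic_count(candidates, score):
    """Return (k, score) for the candidate topic count with the highest score."""
    best_k, best_score = None, float("-inf")
    for k in candidates:
        s = score(k)
        if s > best_score:
            best_k, best_score = k, s
    return best_k, best_score


def toy_score(k):
    # Placeholder, NOT a real quality measure: contrived to peak at k = 6.
    return -abs(k - 6)


default_topics = 10                      # value shown on the operator at design time
best_k, best_score = best_topic_count(range(2, 11), toy_score)
number_of_topics = best_k                # value actually used for the final model

print(f"default={default_topics}, optimized={number_of_topics}")
```

The default stays visible as 10 while the model is actually built with the optimized value, which matches the behaviour observed above.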

    I tried applying it to my dataset and initially received an error. I think the overall size was too large, so I took a sample of the data and it worked. The results didn’t get me what I was looking for, but I now have another process to add to my tool belt. I’ll keep experimenting with it. Thanks again for the help.