What's the best way to determine the number of topics in the Extract Topics from Data (LDA) operator?
I have a dataset made of thousands of ways users have listed product names. For example, Apple MacBook, MacBook, MacBookPro, etc. There are all sorts of products included, but I'm trying to group similar ways people have described them into clusters. The Extract Topics from Data operator seems to be doing the trick but I'm manually having to choose the number of groups. Is there a way to determine the number of groups based on similarity? I hope this makes sense.
Best Answer

lionelderkrikor (RapidMiner Certified Analyst, Member)
Hi @cmoten,
In RapidMiner, as a first approximation, I would suggest the following method (to be confirmed by @mschmitz, since the Extract Topics from Data (LDA) operator is Martin's baby):
Use an Optimize Parameters (Grid) operator and plot the "Perplexity" against the number of topics k.
The lower the perplexity, the better the model.
For instance, in my example the "optimal" number of topics k is 6.
The attached file contains an example process that finds the optimal number of topics using the Optimize Parameters (Grid) operator.
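Outside RapidMiner, the selection step described above can be sketched in a few lines of Python. This is only an illustration, not the operator's implementation: it assumes perplexity is defined as the exponential of the negative log-likelihood per token, and the function names, token count, and log-likelihood values are hypothetical placeholders for whatever the grid search actually logs.

```python
import math
from typing import Dict


def perplexity(total_log_likelihood: float, n_tokens: int) -> float:
    """Perplexity as exp(-log-likelihood per token); lower is better."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-total_log_likelihood / n_tokens)


def best_k(perplexity_by_k: Dict[int, float]) -> int:
    """Return the k with the lowest perplexity (ties go to the smaller k)."""
    if not perplexity_by_k:
        raise ValueError("no grid results given")
    # Iterating over sorted keys makes min() return the smallest k on ties.
    return min(sorted(perplexity_by_k), key=lambda k: perplexity_by_k[k])


# Hypothetical grid results: total log-likelihood per candidate k,
# for a corpus of 10,000 tokens. These numbers are made up for illustration.
N_TOKENS = 10_000
log_likelihood_by_k = {2: -72000.0, 4: -69500.0, 6: -68000.0, 8: -68400.0, 10: -69000.0}

scores = {k: perplexity(ll, N_TOKENS) for k, ll in log_likelihood_by_k.items()}
for k in sorted(scores):
    print(f"k={k:2d}  perplexity={scores[k]:.1f}")
print("selected k:", best_k(scores))
```

Picking the strict minimum is the simplest rule. If the curve flattens out, it may be more sensible to inspect the plot and take the smallest k after which perplexity stops improving noticeably.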
Regards,
Lionel
Answers
Hi @mschmitz,
first of all thank you for your contributions! That is a very interesting approach!
I am interested in the extent to which additional quality measures besides Perplexity can be considered in RapidMiner, in order to have a more holistic basis for deciding on the optimal number of topics. As you mentioned, optimization problems oftentimes do not have one single solution.
Thank you in advance for your feedback!
Best regards,
Fatih
Can somebody give feedback on the above-mentioned question regarding evaluation measures for determining the optimal number of topics?
Best regards,
Fatih