Problem with choosing K

m_keshavarz_com Member Posts: 28 Contributor I
edited November 2018 in Help

Hi, I'm looking for the best k for clustering with k-means.
I used the performance by distance operator to get the DB value as a result.
I know that the lower the DB is, the better the k, but I checked the maximization option.
So which DB value is better now: lower or higher?

I saw this thread:
https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/choose-best-cluster-number/m-p/44992#M29530
but it did not help.

Thanks a lot for any help.

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @m_keshavarz_com,

     

    If you have no idea of the optimal k, you can use the X-Means operator.

     

    Regards,

     

    Lionel

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hi,

    To let RapidMiner help you choose the best K, you might want to use the "Optimize Parameters" operator. I'm travelling and don't have my computer with me but I'm pretty sure I answered something similar yesterday.

    Hope this helps, I'll be back to you once I get my Mac back.

    Rodrigo.
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @m_keshavarz_com,

     

    I suggested applying the "Optimize Parameters" operator to find the best K. That is not feasible; sorry for misdirecting you, I don't know what I was thinking!

     

    To find the best K, you should look at your question first and work out which parameters to explore. An example might help:

     

    Let's suppose you have a list of customers buying bus seats but you don't know which ones buy normal seats and which ones buy premium seats. Then you should take a look at your data and see what parameters you have (ticket id, gender, age, origin, destination and type of seat). Your best bet would be to start with k = 2, as you have two types of seats. Then take your next variable, see how many distinct values it has, and multiply the current K by that number. Let's say that the next variable is the time required to travel between origin and destination. If you have numerical values that vary widely, you should consider discretizing them first (that helped me in the past). Rinse and repeat for each variable that makes sense to consider in a cluster.
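
The heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`seat_type`, `travel_minutes`) and the bin edges are invented for the example and do not come from any real dataset or from RapidMiner.

```python
from math import prod

# Hypothetical sample records; field names and values are invented.
tickets = [
    {"seat_type": "normal", "travel_minutes": 45},
    {"seat_type": "premium", "travel_minutes": 50},
    {"seat_type": "normal", "travel_minutes": 180},
    {"seat_type": "premium", "travel_minutes": 400},
    {"seat_type": "normal", "travel_minutes": 410},
]


def discretize(minutes, edges=(60, 240)):
    """Map a numeric travel time to a coarse bin (the edges are assumed)."""
    if minutes < edges[0]:
        return "short"
    if minutes < edges[1]:
        return "medium"
    return "long"


def candidate_k(records, columns):
    """Multiply the number of distinct values observed in each column."""
    return prod(len({r[c] for r in records}) for c in columns)


# Discretize the widely varying numeric attribute before counting its values.
binned = [{**t, "travel_bin": discretize(t["travel_minutes"])} for t in tickets]

k_start = candidate_k(binned, ["seat_type"])               # two seat types
k_next = candidate_k(binned, ["seat_type", "travel_bin"])  # times three bins
print(k_start, k_next)
```

The product grows quickly as variables are added, so the result is better read as an upper bound to explore than as a final answer.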

     

    Clustering will help you understand what your data looks like, but further analyses are required to get the most out of it. I remember that @mschmitz wrote an article on how to use Decision Trees to understand your clusters and I'm keen to recommend it, but I couldn't find it.

     

    All the best and sorry for my first post.

     

  • m_keshavarz_com Member Posts: 28 Contributor I

    Hello dear friends,
    I hope you are well. Thank you.

    rfuentealba,

    Yes, but using Optimize Parameters is time consuming and my computer hangs.
    Maybe I could decide based on the DB value I mentioned?
    Does a higher DB value mean a better k, or a lower one?
    I want to cluster tweets. In your opinion, which K would be better?

    I did not see the article you mentioned.
    Thank you
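
On the "higher or lower" question: assuming DB here means the Davies-Bouldin index, its standard definition averages, over all clusters, the worst ratio of within-cluster scatter to between-centroid distance, so a lower value indicates tighter, better-separated clusters. The sketch below is a plain-Python illustration of that definition on invented 2-D points. It is not RapidMiner's implementation, and how the maximization option affects the value RapidMiner reports is not covered here.

```python
from math import dist


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelled point set (lower is better).

    Needs at least two clusters with distinct centroids.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Centroid of each cluster: coordinate-wise mean of its points.
    centroids = {
        lab: tuple(sum(coord) / len(ps) for coord in zip(*ps))
        for lab, ps in clusters.items()
    }
    # Scatter of each cluster: mean distance of its points to the centroid.
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }
    # For each cluster, take the worst (largest) similarity to another cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Two obvious groups of invented points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]   # labels that follow the two groups
bad = [0, 1, 0, 1, 0, 1]    # labels that mix the two groups

db_good = davies_bouldin(points, good)
db_bad = davies_bouldin(points, bad)
print(db_good < db_bad)
```

Computing the index for several candidate values of k and keeping the one with the smallest value is one common way to compare clusterings, although it does not replace looking at whether the clusters make sense for the use case.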

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @m_keshavarz_com

     

    Do not use "Optimize Parameters". It was a mistake on my part. I don't know what your case is. The value for k comes from the kind of data you are clustering; that is what I tried to explain. If you explain your use case, we might be able to help. I sent you a PM.

     

    All the best,

     

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Awesome! Thank you, Martin! @m_keshavarz_com, there you have it.

    Have fun!
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    And just a friendly reminder, @m_keshavarz_com, that if your computer hangs when you start doing things like Optimize Parameters, you are likely pushing against one or more barriers such as single-core processing (for a free license). Upgrading your license will likely improve your performance a LOT.  :)

     

    Scott

     

  • m_keshavarz_com Member Posts: 28 Contributor I

    Hello dear friends,
    Thank you very much for helping me.

    Dear @mschmitz, I want to cluster tweets and find out which similar tweets are in each cluster.
    Is a decision tree able to find the best K?
    How do I do that? Sorry, I know I should not ask, but I am a beginner.
    Could you send me a sample process?
    If I use the performance by distance operator, which DB value is better? I know the DB should be low, but I checked the maximization option.

    Also, my system has five cores. How can I prevent it from hanging?

    Dear @rfuentealba, I want to cluster tweets.

    Thank you all so much.
