Clustering Dummy Variables

mario_sark Member Posts: 13 Contributor I
edited June 2019 in Help
Dear all,

I am working on segmenting a list of customers into clusters based on several variables, but some of these variables are dummy variables. Below is the list of variables that I will use for clustering:

Unpaid: Yes/No (dummy)
Deposit: continuous (some customers have zero deposits)
Term Deposits: continuous (some customers have zero term deposits)
Number of Returned Checks: discrete (some customers have zero)
Insurance Product: discrete (some customers have zero); this can be transformed into Yes/No
Credit Card Spending: continuous (some customers have zero since they don't hold credit cards)
Number of Products (Loans): the number of car loans, personal loans, housing loans, ... (some customers have zero)

What is the best algorithm in RapidMiner that I can use to cluster these customers into segments and highlight the less profitable group?

As far as I know, k-means can handle only continuous variables, and I am hesitant to normalize the dummy variables in the data set.

I hope you can help with this!

Thank you in advance, 
Mario



Answers

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    Hi Mario,

    I'm currently working on a similar use case. I'm using k-means with normalization and also dropping outliers. I think that as long as the variables have a similar range, they can be considered equally important for k-means. That includes discrete variables (note that the result of dummy coding is also discrete).

    Another option, if you don't have too many customers, is agglomerative clustering.

    I look forward to reading about other possibilities.
    Sebastian
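
To illustrate the point about similar ranges, here is a minimal sketch in plain Python (not a RapidMiner process). The customer values and the helper names `min_max_scale` and `euclidean` are made up for illustration. It shows how an unscaled deposit amount can dominate the Euclidean distance that k-means relies on, and how rescaling every column to 0-1 lets the dummy variable and the check count contribute as well.

```python
import math

def min_max_scale(rows):
    """Rescale every column of `rows` to the 0-1 range (constant columns become 0)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: [unpaid (0/1 dummy), deposit, returned checks]
customers = [
    [0, 50000.0, 0],
    [1, 50100.0, 4],
    [0, 0.0, 0],
]

# Before scaling, the deposit column decides almost everything.
raw_01 = euclidean(customers[0], customers[1])
raw_02 = euclidean(customers[0], customers[2])

# After scaling, the dummy and the check count carry comparable weight.
scaled = min_max_scale(customers)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01 < raw_02)        # customer 0 looks closest to customer 1 on raw data
print(scaled_01 > scaled_02)  # the ordering flips once all columns share a range
```

With these made-up numbers the nearest neighbour of the first customer changes after scaling, which is why the choice of normalization matters for the resulting clusters.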
  • mario_sark Member Posts: 13 Contributor I
    Hi SGolbert,

    Thank you for your reply. The list of customers that I am going to cluster contains around 70,000 customers.
    I was wondering if there is any algorithm other than k-means.

    I am also looking forward to reading about other possibilities.

    Thank you, 
    Mario
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,630 Unicorn
    You can definitely use k-means with dummy variables. You simply need to normalize all your other numerical variables to be in a similar range (e.g., 0-1) first so that they don't bias the distance calculations. You can also use nominal attributes and then use the mixed Euclidean distance measure (with the same proviso that all the numerical attributes are then in the same 0-1 range).
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
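
As a rough sketch of the mixed-distance idea mentioned above, the following plain Python function treats numerical attributes with a squared difference and nominal attributes with a 0/1 mismatch. This is an assumption about how such a measure is commonly defined, not RapidMiner's actual implementation; the function name `mixed_euclidean`, the attribute names, and the values are hypothetical, and the numerical values are assumed to be already normalized to 0-1.

```python
import math

def mixed_euclidean(a, b, nominal_keys):
    """Distance over dict records: squared difference for numbers, 0/1 mismatch for nominals."""
    total = 0.0
    for key in a:
        if key in nominal_keys:
            # Nominal attribute: contributes 1 when the values differ, 0 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
        else:
            # Numerical attribute: assumed to be scaled to 0-1 beforehand.
            total += (a[key] - b[key]) ** 2
    return math.sqrt(total)

# Hypothetical customers with two nominal and two normalized numerical attributes.
nominal = {"unpaid", "insurance"}
c1 = {"unpaid": "Yes", "insurance": "No", "deposit": 0.20, "card_spending": 0.00}
c2 = {"unpaid": "No", "insurance": "No", "deposit": 0.25, "card_spending": 0.10}

d = mixed_euclidean(c1, c2, nominal)
print(d)
```

Because a nominal mismatch contributes at most 1, a numerical attribute left in its original units (for example, deposits in the tens of thousands) would swamp it, which is the reason for the 0-1 proviso.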