# K-Means Clustering with Mixed Attributes

Hello Everyone,

I want to segment my customer base (13,000 customers) according to several attributes such as:

1. Total Deposits (numerical)

2. Total #Accounts (integer)

3. #Months Since Customer Acquisition (integer)

4. Has the client subscribed for Online Banking or not? (categorical)

I want to see what is common among my customers by splitting them into clusters.

I have mixed attributes in my data set (numerical and categorical).

The questions I have are:

1. What is the best distance measure in this case?

2. Do I need to transform any attribute?

3. Do I need to normalize any attribute?

4. What is the best way to set up the model?

Any help would be appreciated.

Thank You

## Answers

RM Data Scientist

A general tip: you can analyse your clusters by treating the cluster assignment as the label and using feature selection techniques (e.g. Weight by Gini Index or a Forward Selection). That gives you the most important attributes distinguishing the clusters.

If you use a one vs all strategy you can answer the question "What distinguishes cluster1 from the others?" which might be really helpful for interpreting results.
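The label idea can also be tried outside RapidMiner. The Python snippet below is only a rough sketch, not RapidMiner's Weight by Gini Index operator: it turns "belongs to cluster c" into a yes/no label (one vs all) and scores each attribute by the largest drop in Gini impurity that a single threshold split achieves. The customer rows, cluster assignments and function names are all made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def gini_gain(values, labels):
    """Best impurity reduction over all single-threshold splits of one attribute."""
    n = len(values)
    base = gini(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # candidate split: value <= t
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        split = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, base - split)
    return best


def one_vs_all_weights(rows, clusters, target):
    """Score every attribute for separating cluster `target` from the rest."""
    labels = [c == target for c in clusters]
    return {
        name: gini_gain([r[name] for r in rows], labels)
        for name in rows[0]
    }


# Hypothetical customers that have already been clustered.
rows = [
    {"deposits": 500, "accounts": 1, "online": 0},
    {"deposits": 700, "accounts": 2, "online": 1},
    {"deposits": 9000, "accounts": 1, "online": 0},
    {"deposits": 12000, "accounts": 2, "online": 1},
]
clusters = [0, 0, 1, 1]

weights = one_vs_all_weights(rows, clusters, target=1)
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {w:.3f}")
```

In this toy data only the deposits attribute separates cluster 1 from the rest, so it gets the highest weight while the other two attributes score zero.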

Second tip: if you use any distance other than Euclidean distance, you should not use k-Means but k-Medoids. Otherwise the algorithm might not converge.
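To make the difference concrete: k-Medoids only ever uses pairwise distances, because each cluster centre is an actual data point rather than an averaged one. Below is a minimal Python sketch of the alternating variant of k-medoids, assuming a naive initialisation with the first k points. It is not the RapidMiner operator, and the function names and sample points are hypothetical.

```python
def k_medoids(points, k, dist, max_iter=100):
    """Alternating k-medoids: cluster centres are always actual data points."""
    medoids = list(range(k))  # naive initialisation: indices of the first k points
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point joins its closest medoid.
        assignment = [
            min(range(k), key=lambda j: dist(p, points[medoids[j]]))
            for p in points
        ]
        # Update step: within each cluster, pick the member with the
        # smallest total distance to all other members.
        new_medoids = []
        for j in range(k):
            members = [i for i, a in enumerate(assignment) if a == j]
            if not members:
                new_medoids.append(medoids[j])
                continue
            best = min(
                members,
                key=lambda i: sum(dist(points[i], points[m]) for m in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assignment


def manhattan(a, b):
    """A non-Euclidean distance, used here only as an example."""
    return sum(abs(x - y) for x, y in zip(a, b))


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, assignment = k_medoids(points, k=2, dist=manhattan)
print([points[m] for m in medoids], assignment)
```

Any function that returns a distance between two records can be passed as `dist`, which is what makes this approach usable with mixed attribute types.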

Cheers,

Martin

Dortmund, Germany

Contributor II

Thank you for your reply.

As you might have guessed, I am new to RapidMiner.

It would be great if you could elaborate some more in terms of steps you would take first.

Of course, bringing in the data would be the first one.

But I am having problems in figuring out the sequence of other steps.

Q. Do you transform first and then normalize, or vice versa?

Q. How do you transform and what to transform and into what? Same with normalizing.

Q. How do you determine how good your model is and if you've clustered your data well?

It would be great if you could let me know.

I appreciate your help.

Thank You

RM Data Scientist

All normalization methods are only applicable to numerical data, so in general I would do the transformation first.

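I would use Nominal to Numerical with dummy coding. For a two-valued attribute such as Online Banking this creates two new attributes that carry the same information, so be sure to exclude the second newly created attribute using Select Attributes.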

For normalizing I would use either Z-transformation or range transformation in the Normalize operator.
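The order described above (nominal to numerical first, then normalize) can be sketched in plain Python. This is only an illustration of the arithmetic, not what the RapidMiner operators do internally. The column names and values are invented, and using the population standard deviation in the z-transformation is an assumption.

```python
from statistics import mean, pstdev


def dummy_code(values, drop_first=True):
    """Turn a nominal column into 0/1 columns, one per category.

    With drop_first=True one category is left out, because its column is
    fully determined by the remaining ones.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return {f"is_{c}": [1.0 if v == c else 0.0 for v in values] for c in categories}


def z_transform(column):
    """Rescale to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma if sigma > 0 else 0.0 for x in column]


def range_transform(column, low=0.0, high=1.0):
    """Rescale linearly so that the minimum maps to low and the maximum to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [low for _ in column]
    return [low + (x - lo) * (high - low) / (hi - lo) for x in column]


# Hypothetical customer data.
deposits = [500.0, 700.0, 9000.0, 12000.0]
online_banking = ["yes", "no", "no", "yes"]

# Step 1: nominal -> numerical (only the "yes" indicator is kept).
table = dummy_code(online_banking)
# Step 2: normalize the numerical attribute, shown here in both variants.
table["deposits_z"] = z_transform(deposits)
table["deposits_01"] = range_transform(deposits)

for name, column in table.items():
    print(name, [round(x, 3) for x in column])
```

Without normalization, an attribute like total deposits would dominate the distance simply because its values are orders of magnitude larger than a 0/1 indicator.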

The question of "how good" a model is is a really tricky one. That is maybe the biggest problem in unsupervised learning. You can either check whether the clusters make sense (maybe using the label approach I mentioned earlier) or look at the performance measures generated by the clustering performance operators. Those measures usually have the problem that a larger k results in better values. As I said, this is a really tricky problem.
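The "larger k looks better" effect is easy to reproduce. The sketch below runs a small, deterministic k-means on made-up one-dimensional data and reports the within-cluster sum of squares for every k. That measure is used here as a stand-in for a clustering performance value (an assumption, the RapidMiner operators may report something different), and the function names are hypothetical.

```python
def k_means_1d(data, k, max_iter=100):
    """Plain Lloyd-style k-means on 1-D data with a deterministic start."""
    xs = sorted(data)
    # Start from k evenly spaced points of the sorted data (an arbitrary choice).
    centroids = [xs[(i * (len(xs) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[nearest].append(x)
        updated = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return [c for c in clusters if c]


def within_cluster_ss(clusters):
    """Sum of squared distances of every point to its own cluster mean."""
    return sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)


# Two obvious groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

scores = {k: within_cluster_ss(k_means_1d(data, k)) for k in range(1, len(data) + 1)}
for k, score in scores.items():
    print(k, round(score, 3))
```

The score falls sharply from k=1 to k=2 and reaches zero once every point is its own cluster, so the raw value alone cannot tell you which k is right. Looking for the point where the improvement flattens out is one common heuristic.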

Do you use RM 6.3? Then the Wisdom of Crowds feature might really help you in choosing the parameters.

Cheers,

Martin

Dortmund, Germany

Contributor I

Hi Martin,

What would be the approach if there are more categorical variables, especially of nominal type?

For example, suppose khannadh's list of attributes also contained nominals such as: 5. Location of the customer, 6. Preferred marketing material.

What is the recommended approach for clustering in RM?

I use Gower distance for mixed datatypes in R.

Thanks!
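For readers who have not met it: Gower distance averages a per-attribute dissimilarity, using the range-scaled absolute difference for numerical attributes and simple matching (0 if equal, 1 otherwise) for nominal ones. A minimal Python sketch of that definition follows. It is not taken from R or RapidMiner, and the records, attribute names and function name are invented for illustration.

```python
def gower_distance(a, b, ranges, nominal):
    """Gower distance between two records given as dicts.

    ranges:  attribute name -> (max - min) over the whole data set
    nominal: set of attribute names compared by simple matching
    """
    total = 0.0
    for name in a:
        if name in nominal:
            total += 0.0 if a[name] == b[name] else 1.0
        else:
            r = ranges[name]
            total += abs(a[name] - b[name]) / r if r > 0 else 0.0
    return total / len(a)


# Hypothetical customers with numerical and nominal attributes.
customers = [
    {"deposits": 500.0, "accounts": 1, "online": "no", "location": "north"},
    {"deposits": 10500.0, "accounts": 3, "online": "yes", "location": "north"},
    {"deposits": 5500.0, "accounts": 2, "online": "yes", "location": "south"},
]
nominal = {"online", "location"}
ranges = {
    name: max(c[name] for c in customers) - min(c[name] for c in customers)
    for name in customers[0]
    if name not in nominal
}

d01 = gower_distance(customers[0], customers[1], ranges, nominal)
d12 = gower_distance(customers[1], customers[2], ranges, nominal)
print(d01, d12)
```

Because the result is a plain pairwise distance, it fits a medoid-based method rather than k-Means, in line with the earlier advice about non-Euclidean distances.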