How to process categorical data with an unsupervised algorithm in anomaly detection?
I have run into a problem in anomaly detection. Distance-based methods measure the distance between instances, but my dataset contains categorical data. I see three options. First, I could remove the categorical features, but I think they carry useful information. Second, I could convert the categorical data to numeric values with sklearn's LabelEncoder, but I don't think the resulting integer codes correspond to a meaningful distance measure. Third, I could use sklearn's OneHotEncoder on the categorical features, but that increases the number of feature dimensions, which I think will affect clustering.
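To make the concern about the second and third options concrete, here is a minimal sketch in plain Python. The category values are hypothetical, and the integer codes are assigned in sorted order, which mirrors how LabelEncoder numbers classes. It is only an illustration of the distance issue, not a recommended preprocessing step.

```python
import math

# Hypothetical values of a single categorical feature.
categories = ["blue", "green", "red"]
ordered = sorted(categories)

# Integer codes in sorted order (blue=0, green=1, red=2).
label_code = {c: i for i, c in enumerate(ordered)}

# One-hot vectors: one dimension per category.
one_hot = {c: [1.0 if c == other else 0.0 for other in ordered] for c in ordered}


def label_distance(a, b):
    """Distance between two categories under integer coding."""
    return abs(label_code[a] - label_code[b])


def one_hot_distance(a, b):
    """Euclidean distance between two categories under one-hot coding."""
    return math.dist(one_hot[a], one_hot[b])


# Integer coding makes "blue" look twice as far from "red" as from "green",
# although the categories have no natural order.
print(label_distance("blue", "green"), label_distance("blue", "red"))

# One-hot coding keeps every pair of distinct categories equally far apart,
# at the cost of one extra dimension per category.
print(one_hot_distance("blue", "green"), one_hot_distance("blue", "red"))
```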
Answers
The general preference is to one-hot encode. Yes, this increases the number of feature dimensions, but you can apply PCA to the encoded features to reduce them. If that does not work well, you can use k-modes in Python, which clusters categorical features; its k-prototypes extension, described in the same paper, handles both categorical and numeric features.
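A minimal sketch of the one-hot plus PCA route with scikit-learn follows. The synthetic data, the column counts, the choice of 4 components, and the use of IsolationForest as the unsupervised detector are all assumptions made for illustration; the thread does not say which detector or settings to use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic mixed data (assumption): 2 numeric and 2 categorical columns.
rng = np.random.default_rng(0)
n = 200
X_num = rng.normal(size=(n, 2))
X_cat = rng.choice(["a", "b", "c", "d"], size=(n, 2))

# Scale numeric columns so they are comparable with the 0/1 indicator columns.
X_num_scaled = StandardScaler().fit_transform(X_num)

# One-hot encode the categorical columns; the encoder returns a sparse matrix.
X_cat_encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(X_cat).toarray()

# Combine both parts, then reduce the dimensionality with PCA.
X = np.hstack([X_num_scaled, X_cat_encoded])
X_reduced = PCA(n_components=4).fit_transform(X)

# Unsupervised detector (assumption): -1 marks anomalies, 1 marks inliers.
labels = IsolationForest(random_state=0).fit_predict(X_reduced)
print(X.shape, X_reduced.shape, int((labels == -1).sum()))
```

How many components to keep is a judgment call; checking the explained variance ratio of the fitted PCA is one common way to choose.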
K-modes: http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf
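For intuition, the paper linked above replaces Euclidean distance on categorical attributes with a simple matching dissimilarity (the number of attributes that differ), and its k-prototypes variant adds this, weighted by a factor gamma, to the squared Euclidean distance on the numeric attributes. The sketch below only illustrates that measure in plain Python; the function name, the example records, and gamma=0.5 are hypothetical, and it is not a clustering implementation.

```python
def mixed_dissimilarity(x_num, x_cat, q_num, q_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma times
    the number of mismatching categorical attributes."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, q_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, q_cat))
    return numeric_part + gamma * categorical_part


# Hypothetical records: two numeric and two categorical attributes each.
d = mixed_dissimilarity(
    [1.0, 2.0], ["tcp", "http"],
    [1.0, 4.0], ["udp", "http"],
    gamma=0.5,
)
print(d)  # numeric part 4.0 plus 0.5 * 1 mismatch

# With no numeric attributes this reduces to simple matching, as in k-modes.
print(mixed_dissimilarity([], ["tcp", "http"], [], ["udp", "ftp"]))
```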
Thanks
Varun
https://www.varunmandalapu.com/