How to process categorical data with unsupervised algorithms in anomaly detection?
I have run into a problem in anomaly detection. Many unsupervised methods rely on a distance measured between instances, but my dataset contains categorical features. I see 3 options. First, I could remove the categorical features, but I think they carry useful information. Second, I could convert the categories to numeric values with sklearn's LabelEncoder, but I don't think the resulting integers are meaningful for a distance measure, since they impose an arbitrary order on the categories. Third, I could use sklearn's OneHotEncoder, but that increases the number of feature dimensions, and I think this would affect clustering.
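To make the concern about the second option concrete, here is a small sketch on a made-up `colors` column (the data and variable names are assumptions, not part of my dataset). It compares the pairwise distances that label encoding and one-hot encoding produce for three unrelated categories.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column with three unordered categories.
colors = np.array(["blue", "green", "red"])

# LabelEncoder assigns integers in sorted order: blue=0, green=1, red=2.
labels = LabelEncoder().fit_transform(colors).astype(float)
# Pairwise absolute differences between the integer codes.
label_dist = np.abs(labels[:, None] - labels[None, :])

# One-hot encoding gives each category its own axis.
onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()
# Pairwise Euclidean distances between the one-hot vectors.
onehot_dist = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=2)

print(label_dist)   # blue-red ends up twice as far apart as blue-green
print(onehot_dist)  # every pair of distinct categories is equally far apart
```

With label encoding, "blue" looks closer to "green" than to "red" only because of alphabetical order, whereas one-hot encoding keeps all distinct categories equidistant.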
The usual choice is one-hot encoding. Yes, it increases the number of feature dimensions, but you can apply PCA to the encoded features to reduce them again. If that does not work well, look at the kmodes package for Python: k-modes clusters purely categorical data, and its k-prototypes variant can take both categorical and numeric features.
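A minimal sketch of the one-hot plus PCA route, assuming scikit-learn and NumPy are available. The toy data, the column meanings, and the choice of 3 components are all made up for illustration; in practice you would pick the number of components from the explained variance on your own data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two categorical columns and one numeric column.
cat = np.array([
    ["tcp", "http"],
    ["tcp", "http"],
    ["udp", "dns"],
    ["tcp", "ssh"],
    ["udp", "dns"],
    ["icmp", "echo"],
])
num = np.array([[120.0], [130.0], [40.0], [300.0], [45.0], [9000.0]])

# One-hot encode the categorical columns (3 + 4 categories -> 7 columns).
# The encoder returns a sparse matrix by default; PCA needs a dense array.
cat_onehot = OneHotEncoder().fit_transform(cat).toarray()

# Scale the numeric column so it does not dominate the 0/1 indicators.
num_scaled = StandardScaler().fit_transform(num)

# Combine both parts into one feature matrix of shape (6, 8).
X = np.hstack([cat_onehot, num_scaled])

# Project onto a few principal components (3 is an arbitrary choice here).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)
```

`X_reduced` can then be passed to whichever distance-based clustering or anomaly detection method you are using.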