How to process categorical data with unsupervised algorithms in anomaly detection?

liming Member Posts: 13 Contributor I
edited June 2019 in Help
I have run into a problem in anomaly detection. Many unsupervised methods rely on distances measured between instances, but my dataset contains categorical features. I see three options. First, I could remove the categorical features, but I think they carry useful information. Second, I could convert the categorical data to numeric values with sklearn's LabelEncoder, but I don't think the resulting integers correspond to a meaningful distance measure. Third, I could process the categorical features with sklearn's OneHotEncoder, but that increases the number of feature dimensions, which I think affects clustering.
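To make the second and third options concrete, here is a minimal sketch in plain Python. The colour categories are hypothetical toy data, and the integer codes are assigned in sorted order, which is how sklearn's LabelEncoder orders its classes. It shows that integer codes make some category pairs look closer than others, while one-hot vectors keep every pair equally far apart.

```python
# Hypothetical toy feature with three unordered categories.
categories = sorted(["red", "green", "blue"])  # ['blue', 'green', 'red']

# Label encoding: each category gets an integer code in sorted order.
label = {c: i for i, c in enumerate(categories)}  # blue=0, green=1, red=2


def label_distance(a, b):
    """Distance between two categories after integer label encoding."""
    return abs(label[a] - label[b])


def one_hot(c):
    """One-hot vector for category c, one position per category."""
    return [1.0 if c == k else 0.0 for k in categories]


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    u, v = one_hot(a), one_hot(b)
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


# Label codes impose an artificial ordering on the categories.
print(label_distance("blue", "green"), label_distance("blue", "red"))

# One-hot encoding treats every pair of distinct categories the same.
print(one_hot_distance("blue", "green"), one_hot_distance("blue", "red"))
```

With label codes, blue ends up twice as far from red as from green purely because of alphabetical order. With one-hot vectors, all distinct pairs are the same distance apart, at the cost of one extra column per category.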


    varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @liming

    The general preference is to one-hot encode. Yes, this increases the number of feature dimensions, but you can apply PCA to the encoded features to reduce them. If that does not work well, you can try the kmodes package in Python: k-modes clusters categorical features, and the related k-prototypes algorithm handles a mix of categorical and numeric features.
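As a rough illustration of the one-hot plus PCA route, here is a minimal sketch using scikit-learn. The toy array and the choice of two components are assumptions made for the example, not recommendations; in practice the number of components would be chosen from the explained variance on real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns (colour, size).
X_cat = np.array([
    ["red", "small"],
    ["red", "small"],
    ["green", "small"],
    ["green", "large"],
    ["blue", "large"],
    ["blue", "small"],
])

# One-hot encode: 3 colours + 2 sizes give 5 binary columns.
# The default output is sparse, so convert it to a dense array for PCA.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(X_cat).toarray()

# Project the 5 one-hot columns onto 2 principal components
# (2 is an arbitrary choice for this sketch).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_onehot)

print(X_onehot.shape, X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The reduced matrix can then be passed to whichever distance-based clustering or anomaly detection method is being used.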



    yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Have you tried the Anomaly Detection extension from the RapidMiner Marketplace? As far as I know, the k-NN Global Anomaly Score operator can use nominal measures to calculate nearest-neighbor distances. The LOF outlier detector is similar. If you want to use PCA for anomaly scores, you will need to convert nominal attributes to numerical ones. Here is an example applied to the Titanic data.
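For readers working outside RapidMiner, the idea of scoring anomalies directly on nominal attributes can be sketched in plain Python. This is not the RapidMiner operator and not the Titanic process referred to above. It assumes a simple-matching nominal distance (the fraction of attributes that differ) and scores each record by its average distance to its k nearest neighbors. The records are made-up, Titanic-style rows, and the function names are hypothetical.

```python
def nominal_distance(a, b):
    """Fraction of attributes on which two records disagree (simple matching)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def knn_global_scores(rows, k=2):
    """Score each row by the mean nominal distance to its k nearest neighbors."""
    scores = []
    for i, row in enumerate(rows):
        # Distances from this row to every other row, smallest first.
        dists = sorted(
            nominal_distance(row, other)
            for j, other in enumerate(rows)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Made-up records: (sex, passenger class, port of embarkation).
rows = [
    ("male", "third", "Southampton"),
    ("male", "third", "Southampton"),
    ("male", "third", "Queenstown"),
    ("female", "third", "Southampton"),
    ("female", "first", "Cherbourg"),
]

scores = knn_global_scores(rows, k=2)
for row, score in zip(rows, scores):
    print(row, round(score, 3))
```

The last record shares the fewest attribute values with its neighbors, so it receives the highest score. No encoding step is needed because the distance is defined on the category values themselves.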
