"SOM clustering: Conversion of non-numeric attributes to numeric?"

ruserruser Member Posts: 40 Maven
edited May 2019 in Help
As you might have seen in my other posts, I'm trying to cluster data using the SOM method. I'm able to generate the clusters. However, I assume that to get meaningful clusters, I have to convert all non-numerical attributes (categorical, timestamp) to numerical attributes without changing what those attributes mean for the clustering. Is my understanding correct? What support does RapidMiner offer for this?
Which operators do I have to use to achieve that?


  • Options
    ruserruser Member Posts: 40 Maven
    I'm not sure whether this conversion is mandatory, or whether the SOM algorithm takes care of it automatically (not by ignoring the attributes, but by actually normalizing them). I doubt that SOM can really do this, because it cannot know the real distance between the values of a categorical attribute such as bad, good, excellent. Only we can tell the algorithm how to interpret them, for example:
    bad=0, good=1, excellent=2
    bad=1, good=2, excellent=3
    bad=0.5, good=1, excellent=2
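To make the difference between these three mappings concrete, here is a minimal Python sketch (not RapidMiner code) that compares the gaps each encoding puts between neighbouring values. The names `encodings` and `gaps` are made up for this illustration.

```python
# Three candidate encodings of the same ordinal attribute (taken from the post).
encodings = {
    "0/1/2": {"bad": 0.0, "good": 1.0, "excellent": 2.0},
    "1/2/3": {"bad": 1.0, "good": 2.0, "excellent": 3.0},
    "0.5/1/2": {"bad": 0.5, "good": 1.0, "excellent": 2.0},
}


def gaps(mapping):
    """Return (|good - bad|, |excellent - good|) for one encoding."""
    return (
        abs(mapping["good"] - mapping["bad"]),
        abs(mapping["excellent"] - mapping["good"]),
    )


for name, mapping in encodings.items():
    print(name, gaps(mapping))
```

The first two encodings differ only by a constant shift, so a distance-based method sees the same gaps for both. The third one places bad and good closer together than good and excellent, which is a modelling decision that only the user can make.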

    I ask because I found some articles on the web which suggest that SOM takes care of this (unlike K-means). Here is a sample link:

    Can anybody shed light on this?
  • Options
    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    In fact, this SOM implementation does not distinguish between value types. Instead, it assumes that everything is numerical, which will give you strange results if you have nominal values. To be precise: nominal values are replaced by their internal ID, which is just a counter...
    But you can transform these nominal values into binominal ones, where each value is represented by its own binominal (true or false) attribute.
    The operator is called "Nominal2Binominal".
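As a rough illustration of why the counter IDs give strange results and what the binominal transformation changes, here is a small Python sketch. It is not how RapidMiner implements either step. The helpers `internal_ids`, `one_hot` and `euclidean` and the colour values are assumptions made for the example.

```python
def internal_ids(values):
    """Assign each distinct value a counter in order of first appearance."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids))
    return ids


def one_hot(value, categories):
    """Represent one nominal value as a list of true/false (1.0/0.0) flags."""
    return [1.0 if value == category else 0.0 for category in categories]


def euclidean(a, b):
    """Plain Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


colors = ["red", "green", "blue"]
ids = internal_ids(colors)

# With counter IDs, red looks twice as far from blue as from green.
print(abs(ids["red"] - ids["green"]), abs(ids["red"] - ids["blue"]))

# With one flag per value, every pair of distinct values is equally far apart.
print(euclidean(one_hot("red", colors), one_hot("green", colors)))
print(euclidean(one_hot("red", colors), one_hot("blue", colors)))
```

The price is one extra attribute per distinct value, which is why columns with many values make the transformed table wide.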

    Whether a given implementation takes care of making nominal values binominal depends only on the programmer's decision and has nothing to do with the actual algorithm. A KMeans calculation could easily transform them beforehand, too. But your link is right when it states that KMeans will suffer from the large number of attributes, because calculating the distances becomes more expensive. KMedoids is then a good replacement, because it can perform this transformation implicitly by using a different distance measure. The only disadvantage is that it always has to use an example as the centroid. But with a growing number of examples, this converges to the KMeans solution...
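The two KMedoids properties mentioned here (a distance measure that works directly on nominal values, and a centre that is always one of the examples) can be sketched in a few lines of Python. The post does not name a distance measure, so the mismatch count below is an assumption, as are the function names and the sample rows.

```python
def mismatch_distance(a, b):
    """Count the attributes on which two nominal examples disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def medoid(rows, distance):
    """Return the example with the smallest total distance to all rows."""
    return min(rows, key=lambda candidate: sum(distance(candidate, r) for r in rows))


rows = [
    ("first", "bad"),
    ("first", "good"),
    ("second", "good"),
]

# The centre is picked from the examples themselves; no averaging is needed,
# so the nominal values never have to be turned into numbers.
print(medoid(rows, mismatch_distance))
```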

    If you don't want to convert every nominal attribute into binominals because ordinal information is available, you can transform it into a user-defined numerical variable by combining two operators. Here's how it works:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="NominalExampleSetGenerator" class="NominalExampleSetGenerator">
        </operator>
        <operator name="Mapping" class="Mapping">
            <parameter key="attributes" value=".*"/>
            <list key="value_mappings">
              <parameter key="0.1" value="value0"/>
              <parameter key="0.3" value="value1"/>
              <parameter key="0.9" value="value2"/>
              <parameter key="23" value="value3"/>
              <parameter key="-5" value="value4"/>
            </list>
        </operator>
        <operator name="NominalNumbers2Numerical" class="NominalNumbers2Numerical">
        </operator>
    </operator>
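For readers who want to see the two-step idea outside RapidMiner, here is a minimal Python sketch: first replace each label with a number written as text, then parse that text into a real number. It assumes each generated label (value0 to value4) is meant to become the number paired with it in the process. The function names are invented for the illustration and are not RapidMiner APIs.

```python
# Step 1 table: nominal label -> number written as text (still nominal).
value_mappings = {
    "value0": "0.1",
    "value1": "0.3",
    "value2": "0.9",
    "value3": "23",
    "value4": "-5",
}


def map_values(column, mappings):
    """Replace every label found in the table; leave unknown labels unchanged."""
    return [mappings.get(value, value) for value in column]


def nominal_numbers_to_numerical(column):
    """Step 2: parse the number-like strings into floats."""
    return [float(value) for value in column]


column = ["value0", "value3", "value4"]
mapped = map_values(column, value_mappings)
print(nominal_numbers_to_numerical(mapped))
```

Any label that is missing from the table stays a string after step 1, so step 2 fails loudly on it instead of silently inventing a number.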
  • Options
    ruserruser Member Posts: 40 Maven
    Thanks, it helps!

    The table I have is big, and it contains some columns with a lot of nominal values. Because of this, Nominal2Binominal takes a lot of time (more than 30 minutes). So I decided to use a mapping wherever possible. The following questions relate to that:

    One of the columns in the table takes one of the following 3 values:
    - first
    - second
    - NULL

    I can map 'first' to 1 and 'second' to 2. For my application, it does not make sense to map 'NULL' to '0', because in practice this column can only take 'first' or 'second'. The presence of 'NULL' indicates that the value is unknown. Hence, I do not want to map NULL to '0', as that might lead to wrongly formed clusters.

    Now, my question is whether the SOM algorithm really understands this if I just leave the value as NULL. As far as I know, the neural-network-based SOM algorithm should also work well when the data contains some noise.
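Whether RapidMiner's SOM operator treats NULL in any special way is not answered in this thread. As an illustration of the concern only, the Python sketch below compares two options when looking for the closest SOM unit: skipping the unknown component in the distance, or filling it with 0. The names `masked_distance` and `best_matching_unit` and the unit weights are hypothetical.

```python
import math


def masked_distance(sample, weights):
    """Euclidean distance over the components of sample that are not None."""
    terms = [(s - w) ** 2 for s, w in zip(sample, weights) if s is not None]
    return math.sqrt(sum(terms))


def best_matching_unit(sample, units):
    """Index of the unit whose weight vector is closest to the sample."""
    return min(range(len(units)), key=lambda i: masked_distance(sample, units[i]))


# Two hypothetical unit weight vectors: (first/second column, other attribute).
units = [[1.0, 0.0], [2.0, 1.0]]

unknown = [None, 0.8]      # first/second column left as unknown
zero_filled = [0.0, 0.8]   # same example with NULL forced to 0

print(best_matching_unit(unknown, units))
print(best_matching_unit(zero_filled, units))
```

In this small example the example lands on a different unit once NULL is forced to 0, which is the distortion the post worries about.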

    Please comment on this.

    There are some columns of type 'TIMESTAMP'/DateTime. Do I have to do any conversion for such columns?
    I assume it is not required.
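The thread does not settle whether a conversion is required for timestamp columns. If one is wanted, a common approach is to turn each timestamp into seconds since the epoch and rescale the result so it does not dominate the other attributes. The Python sketch below assumes a "YYYY-MM-DD HH:MM:SS" text format and UTC; both are assumptions, as are the function names.

```python
from datetime import datetime, timezone


def to_epoch_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string as UTC and return seconds since 1970-01-01."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def min_max_scale(values):
    """Rescale numbers linearly to the range 0..1 (all zeros if they are equal)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


stamps = ["2019-05-01 00:00:00", "2019-05-02 00:00:00", "2019-05-03 00:00:00"]
seconds = [to_epoch_seconds(s) for s in stamps]
print(min_max_scale(seconds))
```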