Interpreting SOMDimensionalityReduction results

lexusboy Member Posts: 22 Maven
Hello,

I am new to machine learning and its methods, so excuse me if my question sounds stupid. I have some data (750 rows in an Excel sheet) on which I first do some preprocessing (stopword filtering, stemming, etc.). After I get the output from these operators, I want to do SOM clustering so that I can look at the results and build a feature vector of the most important concepts in my documents. However, all I see in the result of the SOM operator is differently colored "dots" spread across the map, and I have no idea what they mean. If anyone can help me in this regard, I would be very grateful.

Here is the code from my experiment:

<operator name="Root" class="Process" expanded="yes">
   <operator name="ExcelExampleSource" class="ExcelExampleSource">
       <parameter key="excel_file" value="C:\Documents and Settings\Lexusboy\My Documents\TCV\Postings_short.xls"/>
       <parameter key="first_row_as_names" value="true"/>
       <parameter key="id_column" value="4"/>
   </operator>
   <operator name="StringTextInput" class="StringTextInput" expanded="yes">
       <parameter key="filter_nominal_attributes" value="true"/>
       <parameter key="remove_original_attributes" value="true"/>
       <parameter key="default_content_language" value="english"/>
       <parameter key="vector_creation" value="BinaryOccurrences"/>
       <parameter key="return_word_list" value="true"/>
       <parameter key="output_word_list" value="C:\Documents and Settings\Lexusboy\My Documents\RapidMiner\pre processing\word_list"/>
       <parameter key="id_attribute_type" value="short"/>
       <list key="namespaces">
       </list>
       <parameter key="create_text_visualizer" value="true"/>
       <operator name="StringTokenizer" class="StringTokenizer">
       </operator>
       <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
       </operator>
       <operator name="StopwordFilterFile" class="StopwordFilterFile" activated="no">
           <parameter key="file" value="C:\Documents and Settings\Lexusboy\My Documents\RapidMiner\pre processing\stopwords.txt"/>
       </operator>
       <operator name="GermanStemmer" class="GermanStemmer">
       </operator>
       <operator name="TokenLengthFilter" class="TokenLengthFilter">
           <parameter key="min_chars" value="3"/>
       </operator>
   </operator>
   <operator name="SOMDimensionalityReduction" class="SOMDimensionalityReduction">
       <parameter key="return_preprocessing_model" value="true"/>
       <parameter key="number_of_dimensions" value="1"/>
       <parameter key="training_rounds" value="50"/>
   </operator>
</operator>
Best Regards

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the SOM, or Self-Organizing Map, is essentially a map of vectors. These vectors represent prototypical examples and are arranged on the map during the self-organization phase.
    The background color is generated by functions that give insight into the high-dimensional vector space. The usual U matrix visualizes the distance between two neighbouring vectors: the higher the landscape, the larger the distance. The P matrix gives an impression of the empirical density around the local vector in the data space.
    The U* matrix combines both.

    The colored dots are probably your examples, each placed at the position of the map vector that is most similar to it.
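
    To make that concrete, here is a minimal Python sketch (not RapidMiner's actual code; the grid size, data, and Euclidean distance measure are made-up assumptions) of how an example is assigned to its best-matching node and how a U-matrix-style height could be computed from the neighbouring node vectors:

    import numpy as np

    # Hypothetical 15x15 grid of prototype ("codebook") vectors in a 25-dimensional word space;
    # the sizes are chosen only for illustration.
    net_size, n_dims = 15, 25
    rng = np.random.default_rng(0)
    codebook = rng.random((net_size, net_size, n_dims))

    def best_matching_unit(example):
        """Grid coordinates of the node whose prototype vector is closest to the example."""
        dists = np.linalg.norm(codebook - example, axis=2)       # distance to every node
        return np.unravel_index(np.argmin(dists), dists.shape)   # (row, col) of the winner

    def u_height(row, col):
        """U-matrix style value: mean distance from a node's vector to its direct neighbours."""
        neighbours = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
        return np.mean([np.linalg.norm(codebook[row, col] - codebook[r % net_size, c % net_size])
                        for r, c in neighbours])   # high value = far from neighbours = boundary

    example = rng.random(n_dims)         # one document vector
    print(best_matching_unit(example))   # where this example's "dot" would be drawn
    print(u_height(7, 7))                # how "high" the landscape is at node (7, 7)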

    Hope this could help a little bit?

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hi Sebastian,

    Thanks for your reply; it did clear up the concept of the different matrices in the SOM output of RapidMiner. I am somewhat familiar with how the SOM algorithm works, and I do understand that nodes which are closer to each other are more similar to each other. However, since the nodes are not labeled, I can't tell which concepts in my documents are similar to each other and which are not.

    In one research article that I read (http://faculty.cis.drexel.edu/~xlin/fulltext/ACM91.pdf), the nodes on the SOM map are actually labeled with the words they represent in the vector space, along with marked areas on the map (belonging to the words), where the size of an area reflects the frequency of the word (i.e. more frequent words get more space on the map).

    Again any help is appreciated :)
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    unfortunately the current implementation of the SOM doesn't support this. I built a better one during my diploma thesis, but I haven't found the time to include the code in RapidMiner...

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hi Sebastian,

    Thanks for your reply. There is one more thing I would like to have cleared up: on what basis are the nodes placed on the map?

    For example, say I use an input matrix of 140*25 (where 140 is the number of documents and 25 is the number of words), with a net size of 15, 100 training rounds, and the rest left at the defaults.

    I understand that the color coding here implies a value of 0 -> blue and 14 -> red, but these values that the SOM assigns during the training phase are surely not the positions of the vectors on the 2D grid, because there are blue and red dots in the middle of the map, which would be unlikely if they were the two extremes.

    I hope my question is not confusing, and I would appreciate any insight into this.

    Best Regards
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    do you refer to the color scheme of the background or the color scheme of the points?

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hi,

    I was referring to the color scheme of the points. From the bar on top of each map you can easily make out the color coding scheme: it runs in ascending order from blue to red (0 being blue, 14 being red).

    But here is what I don't understand. Let's say I use an input matrix of 140*25 (where 140 is the number of documents and 25 is the number of words) with a net size of 15 (I chose 15 because for a 2D SOM, 15*15 would give 225 neurons for 140 input rows, which I think is a good balance; please correct me if I am wrong), 100 training rounds, and the rest left at the defaults.
    The SOM then assigns values to my input vectors on a scale from 0-14 (e.g. 2, 5, 12, etc.). I initially thought these values were the coordinates of my input vectors on the 2D (x, y) map. However, when I see blue or red dots in the middle of the map, I know my interpretation is incorrect, since those colors would otherwise appear close to the map borders.
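
    For what it's worth, a legend like that usually just maps the value range linearly onto a colour gradient; a minimal sketch of the idea (this is an assumption about how such a legend works, not the plotter's actual code, and the real palette is probably a rainbow scale rather than a plain blue-red blend):

    def value_to_rgb(value, vmin=0, vmax=14):
        """Linearly map a value in [vmin, vmax] to a blue (low) -> red (high) colour."""
        t = (value - vmin) / (vmax - vmin)             # 0.0 at the minimum, 1.0 at the maximum
        return (int(255 * t), 0, int(255 * (1 - t)))   # (red, green, blue)

    print(value_to_rgb(0))    # (0, 0, 255)   pure blue
    print(value_to_rgb(14))   # (255, 0, 0)   pure red
    print(value_to_rgb(7))    # (127, 0, 127) midpoint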

    I would be grateful to you, if you could explain this to me.

    Best Regards,
    Bhavya
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Bhavya,
    the SOM is a non-linear dimensionality reduction. It is not necessarily true that the values of all dimensions are ordered from small to large; if you try to fold a high-dimensional space onto two dimensions, such an ordering is not possible anyway.
    Another important fact about the map is that it has no borders: each node at the right border is a direct neighbor of a node at the left border, and the same holds for the top and bottom nodes. So it would not make sense for the extreme values to lie at the borders of the map.
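
    A tiny sketch of this wrap-around neighbourhood (assuming a toroidal grid as described, with a hypothetical net size of 15): the neighbours of a border node are found by taking the coordinates modulo the net size.

    NET_SIZE = 15   # hypothetical grid size

    def neighbours(row, col, net_size=NET_SIZE):
        """Direct neighbours on a borderless (toroidal) SOM grid: coordinates wrap around."""
        return [((row - 1) % net_size, col),    # up
                ((row + 1) % net_size, col),    # down
                (row, (col - 1) % net_size),    # left
                (row, (col + 1) % net_size)]    # right

    # A node on the right border is a direct neighbour of a node on the left border:
    print(neighbours(7, 14))   # [(6, 14), (8, 14), (7, 13), (7, 0)]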


    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hi Sebastian,

    Thanks for your answer. Let me tell you what I am doing so you can understand my problem better. I am working on my thesis on semantic analysis of user-generated web content. I have Excel data of about 750 postings from a forum regarding a particular brand (e.g. Nike), which in the end I want to classify as "positive mentions of the brand" or "negative mentions" using a supervised learning algorithm. After the pre-processing stage (stopword filtering and stemming, with the "BinaryOccurrences" vector creation option) I have about 8000 words, which means the input matrix fed to the SOM is 750*8000.

    My motivation for using the SOM was to condense these 8000 words to a much smaller number, in order to get a feature vector whose features are descriptive of the input and can then easily be used in the supervised learning stage. In RapidMiner, after the learning phase the SOM operator assigns values to my input, and as you said these are in no particular order, but they should still tell me something about my data, shouldn't they? Rows with the same or similar values (the neurons representing them would typically form a cluster on the map) should mean the rows are similar in some way. However, when I checked these "similar" rows against my Excel input, I found that this wasn't the case. Isn't it possible to make the SOM cluster the negative and positive postings separately, and then build my feature vector by iteratively removing a percentage of the features from the input and using the SOM to generate a map similar to the initial map (with all the inputs)?

    I have been using RapidMiner for my research experiments so far, and I did all my pre-processing using the built-in operators. What I would also like to know from you is whether this is the optimal configuration of SOMDimensionalityReduction for my data:

    <operator name="SOMDimensionalityReduction" class="SOMDimensionalityReduction">
            <parameter key="return_preprocessing_model" value="true"/>
            <parameter key="number_of_dimensions" value="1"/>
            <parameter key="net_size" value="50"/>
            <parameter key="training_rounds" value="2500"/>
        </operator>

    or do you suggest some changes?

    Any help is appreciated.

    Best Regards
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I doubt that the current SOM implementation is suitable for your problem. Unfortunately you don't get any information about the node an example is assigned to besides its coordinate. You might try to use this coordinate as a kind of cluster label and try to extract information describing each cluster, but this isn't done automatically.
    Another point is that you are condensing the data from 8000 dimensions to just 1; it is simply impossible to keep the neighborhood information consistent. You might instead perform a clustering with 50 clusters, which is pretty much the same thing under these conditions. That would have the advantage that you could choose cosine similarity, which is much better suited for TF-IDF data.
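
    As a rough sketch of that alternative outside RapidMiner (assuming scikit-learn is available; the documents and parameters below are placeholders), L2-normalising the TF-IDF vectors makes ordinary k-means behave like clustering by cosine similarity, because for unit-length vectors the squared Euclidean distance equals 2 - 2*cos(a, b):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    # Placeholder documents; in the real setting these would be the ~750 forum postings.
    postings = ["I love this brand", "great shoes, will buy again",
                "terrible quality", "never buying this brand again"]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(postings)
    tfidf = normalize(tfidf)   # unit-length rows (TfidfVectorizer does this by default; made explicit here)

    # 50 clusters for the real data set; capped here only so the toy example still runs.
    n_clusters = min(50, tfidf.shape[0])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(tfidf)
    print(labels)              # one cluster id per posting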

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hi Sebastian,

    Thanks for the quick reply.

    >>You might try to use this coordinate as some sort of cluster information and try to extract informations, describing this cluster. But this isn't done automatically.

    I suppose you mean that, using the values the SOM gave the inputs during the learning phase (e.g. 21, 22, 24), I should take the inputs with similar values and try to extract information by looking them up in my Excel sheet to detect some patterns, right? If that's the case, I actually already did this and didn't find any meaningful patterns in my data :(

    >>You might perform a clustering with 50 clusters instead, it's pretty much the same under this conditions. This would have the advantage, that you might choose cosine similarity, which is much more suited for TFIDF data.

    I think you are talking about using other clustering techniques like k-means, but I don't understand the "clustering with 50 clusters" part. Also, could you suggest a clustering technique that would be suitable for my data?

    P.S: Just a short query

    >>Unfortunately you don't get information about the node an example is assigned beside it's coordinate.

    I know you explained a bit of this in a previous post, but I'm sorry, I didn't understand this point: how does the SOM assign the values (e.g. 21, 23, 25) to the inputs after the learning phase?

    What I understood from the article by Alfred Ultsch on ESOM maps is that the background depicts the distance values between the nodes as a landscape visualization, so "this value will be large in areas where no or few data points reside, creating mountain ranges for cluster boundaries. The sum will be small in areas of high densities, thus clusters are depicted as valleys". So the map is obviously in no particular order, because in some maps you see the blue dots on the same level as the red ones, and in others the blue at the top and the red at the bottom. What I am trying to say is that I understand the Self-Organizing Map is clearly not a 2D grid with the low values at the bottom and the high values at the top, so what is the reason behind this seemingly random placement of the neurons on the map? Is it just to comply with the background?

    Thanks a lot in advance !

    Best Regards
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    sorry, but I have to put an end to this discussion now. I really would like to talk with you about SOMs and their properties for hours and hours, but unfortunately I don't get paid for that. What I can do is point you to the original book by Kohonen:

    [Kohonen 1997] Kohonen, Teuvo (1997). Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

    It explains how everything works. The U matrix of Ultsch is only a nice visualization for getting an impression of the quality of the map. Don't get confused by this.

    Another definitive statement: you can never tell from the meta data of your examples what appropriate parameters would be; this needs intensive testing. But I would suggest that you use a higher number of dimensions for the dimensionality reduction. Currently you are using only 1 dimension with 50 nodes, which will squeeze everything into 50 nodes and mix everything up.

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hi,

    Thanks for pointing me in the right direction for collecting more information on SOMs; I have started reading. However, I would be grateful if you could explain one more point: when you choose 2 dimensions for your SOM with a net size of 50, how many neurons are actually used for the training?
    Is it simply 50*50 = 2500 neurons, or something else?

    Best Regards,
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    yes, a net size of 50 results in a grid with 50x50 = 2500 neurons. It's just that simple :)
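
    In other words (a trivial sketch, assuming a square grid with the same net size along every dimension):

    def neuron_count(net_size, number_of_dimensions):
        """Total nodes in a square SOM grid: net_size along each dimension."""
        return net_size ** number_of_dimensions

    print(neuron_count(50, 2))   # 2500
    print(neuron_count(50, 1))   # 50  -- the earlier one-dimensional setting
    print(neuron_count(15, 2))   # 225 -- the 15x15 example discussed above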

    Greetings,
      Sebastian