# SOM Reduction

Member Posts: 30 Guru
Hi,

When we do SOM reduction, do we get a reduced feature set (as with PCA or SVD reduction)? I thought SOM was an unsupervised classification scheme; am I missing something?

The ExampleSet returned does not retain the ID number of the input ExampleSet. Is there a way to retain it?

Regards,
Vijay

• Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi Vijay,
> When we do SOM reduction, do we get a reduced feature set (as with PCA or SVD reduction)? I thought SOM was an unsupervised classification scheme; am I missing something?

Actually, a SOM is a sort of combination of a dimensionality reduction and a clustering technique. Its first purpose is the reduction of high-dimensional data sets to (usually) two dimensions so that points which are close together in the input space are also close together in the transformed space. The underlying Kohonen net also allows for non-linear transformations (unlike, for example, PCA or SVD). The second purpose becomes visible if you draw a map beneath the transformed data points (available in the SOM data plotter: press "calculate" after selecting the SOM plotter for your data set). Mountains show regions with larger distances, so you also get a clustering in the transformed data space.

The plotter produces the reduced dimensions together with the map (clustering). The operator SOMDimensionalityReduction, as the name indicates, only produces the transformation and can therefore be used like other dimensionality reductions (PCA etc.). It can also produce a preprocessing model so that the same transformation can be applied to test data sets.
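
To illustrate the dimensionality-reduction view, here is a minimal sketch of a two-dimensional SOM in plain Python. It is not RapidMiner's implementation: the function names, the decay schedules, and the toy data are assumptions made for this example. It maps each example to the grid coordinates of its best matching unit, which play the role of the `SOM_0` and `SOM_1` attributes, and carries the ID along unchanged.

```python
import math
import random


def best_matching_unit(weights, row):
    """Return the grid cell whose weight vector is closest to row."""
    return min(weights,
               key=lambda cell: sum((w - x) ** 2
                                    for w, x in zip(weights[cell], row)))


def train_som(rows, net_size=3, epochs=50, seed=0):
    """Train a net_size x net_size map; returns {(i, j): weight vector}."""
    rng = random.Random(seed)
    dim = len(rows[0])
    # One randomly initialized weight vector per grid cell.
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(net_size) for j in range(net_size)}
    total_steps = epochs * len(rows)
    step = 0
    for _ in range(epochs):
        for row in rng.sample(rows, len(rows)):  # shuffled pass over the data
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac)  # learning rate decays linearly
            radius = max(net_size / 2.0 * (1.0 - frac), 0.5)  # shrinking neighborhood
            bmu = best_matching_unit(weights, row)
            for cell, w in weights.items():
                grid_dist2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius * radius))
                for k in range(dim):
                    w[k] += rate * influence * (row[k] - w[k])
            step += 1
    return weights


def transform(weights, examples):
    """Map each (id, features) pair to (id, SOM_0, SOM_1)."""
    return [(ex_id, *best_matching_unit(weights, row)) for ex_id, row in examples]


# Toy data (made up): four examples with four attributes each.
examples = [
    ("id_1", [0.1, 0.2, 0.1, 0.0]),
    ("id_2", [0.2, 0.1, 0.0, 0.1]),
    ("id_3", [0.9, 0.8, 0.9, 1.0]),
    ("id_4", [0.8, 0.9, 1.0, 0.9]),
]
weights = train_som([row for _, row in examples], net_size=3)
reduced = transform(weights, examples)
print(reduced)
```

Because the trained `weights` are kept separately from the data, the same `transform` call can be applied to unseen examples, which is the idea behind the preprocessing model mentioned above.
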
> The ExampleSet returned does not retain the ID number of the input ExampleSet. Is there a way to retain it?

Well, I just tried it and the example set still contains the ID. Here is an example:

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="logverbosity" value="status"/>
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\home\ingo\rm_workspace\sample\data\iris.aml"/>
        </operator>
        <operator name="SOMDimensionalityReduction" class="SOMDimensionalityReduction">
        </operator>
    </operator>

Could you please post your process (XML) so I can check why it might get lost? Please note that the plotters do not show the ID as a column but as a tooltip for the points when you move the mouse over them.

Cheers,
Ingo
• Member Posts: 30 Guru
Thanks for the info. All the features in the SOM output are between 0 and 29 for a net size of 30. Does that mean SOM reduction returns distances between input-space samples as features, and that these could be used as cluster numbers too? Does it mean that if I need 7-8 clusters, I should specify a net size of 7-8?

I may have forgotten to include the ID before; I'm not sure what I had done. But it does work now. Sorry for the false alarm.

Regards,
Vijay
• Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi Vijay,

The net size is only the size in one dimension. If you reduce it to, let's say, two dimensions, then you will get 30 values in each dimension, resulting in a total of 900 clusters. If you use two dimensions and want to come up with 7-8 clusters, I would suggest a net size of 3. Using more dimensions allows you to retain more of the original variance. If you want to come up with a single cluster number, you could use the AttributeMerge operator, as in this example:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="number_examples" value="200"/>
            <parameter key="number_of_attributes" value="3"/>
            <parameter key="target_function" value="gaussian mixture clusters"/>
        </operator>
        <operator name="SOMDimensionalityReduction" class="SOMDimensionalityReduction">
            <parameter key="net_size" value="3"/>
        </operator>
        <operator name="AttributeMerge" class="AttributeMerge">
            <parameter key="first_attribute" value="SOM_0"/>
            <parameter key="second_attribute" value="SOM_1"/>
        </operator>
        <operator name="ChangeAttributeName" class="ChangeAttributeName">
            <parameter key="new_name" value="cluster"/>
            <parameter key="old_name" value="SOM_0_SOM_1"/>
        </operator>
        <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
            <parameter key="name" value="cluster"/>
            <parameter key="target_role" value="cluster"/>
        </operator>
    </operator>

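
The arithmetic behind the net size can be checked with a short Python sketch. The helper names are hypothetical, and joining the two coordinates with an underscore is an assumption about how the merged value looks, chosen to mirror the merged attribute name `SOM_0_SOM_1` used in the process.

```python
def cell_count(net_size, dimensions):
    """Number of map cells: net_size values along each output dimension."""
    return net_size ** dimensions


def merge_coordinates(som_0, som_1, separator="_"):
    """Combine two map coordinates into one nominal cluster label (assumed format)."""
    return f"{som_0}{separator}{som_1}"


print(cell_count(30, 2))  # 900
print(cell_count(3, 2))   # 9

# Every distinct coordinate pair becomes one distinct cluster label.
labels = {merge_coordinates(i, j) for i in range(3) for j in range(3)}
print(len(labels))        # 9
```

A net size of 3 in two dimensions therefore gives at most 9 clusters; since some cells may receive no examples, ending up with 7-8 populated clusters is plausible.
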
> I may have forgotten to include the ID before; I'm not sure what I had done. But it does work now. Sorry for the false alarm.

Not a problem at all. We really appreciate each report of a possible bug. Better a false alarm than not knowing that there is something wrong....

Cheers,
Ingo
• Member Posts: 30 Guru
Thanks for the detailed explanation. So basically, if I set it to return one dimension, it returns as many clusters as the net size. But I wouldn't want just one dimension, for obvious reasons.

I pasted the XML code into the XML tab and then pressed the run button. I had to save the process before I could run it, so I chose "save as" test.xml.
However, the operator tree was not created, and even the saved file was empty, because I quit the application without clicking anywhere else.

If I click a neighboring tab such as new_operator, parameter or comment, the operator tree is generated. I have tried this 3-4 times now.
Shouldn't the tree be generated when I save the file?

Regards,
Vijay
• Moderator, Employee, Member Posts: 291 RM Product Management
Hi Vijay,

Thanks for pointing this out. This behaviour is indeed not intended: the operator tree should be generated not only when you change to the parameter or comment tab but also when you run or save the process.

I added this to our todo list. Unfortunately, there is currently plenty of work on that list, so we will not be able to fix this issue in the next few days. Until then, please use the workaround you described and apply the new XML process representation by clicking on another tab.

Regards,
Tobias
• Member Posts: 18 Maven
I am not entirely sure I understood everything correctly, but
as I see it, a SOM uses a kind of neural net to find a good representation of your
high-dimensional data.

There is a paper which suggests using SOMs for input-dimension reduction, so that
certain parameters can be excluded by comparing SOM component planes:
de Abajo, N.: ANN quality diagnostic models for packaging manufacturing: an industrial data mining case study.
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

"Particularly insightful are the so called component planes represented in figure 5, that provide us with a big picture of the input values distribution. Similar maps show an analogous behavior and, therefore, a redundancy in the information."

So my question is: Can RapidMiner provide this functionality too?

Also, if I push the calculate button in the "SOM chart view",
the chart changes its appearance completely. If that is the correct behaviour,
what information can I get from that chart?

• Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi,

Without having looked at the paper, I assume that the different attributes are used as the "label" for the SOM creation and that a (probably matrix-based) image comparison is then calculated to build groups of attributes containing redundant information, right?

So, the answer is quite easy: no. It is currently not possible to calculate a feature "similarity" based on SOM graph visualizations. The basic algorithms are all there (SOM creation, matrix comparisons, clustering), but you would probably have to create your own operator for this (or let us do this for you). I would also like to mention that I am not completely convinced that this calculation is very meaningful, but that's another question...
> If that is the correct behaviour, what information can I get from that chart?

Yes, this behaviour is desired. Since the SOM calculation is based on random initializations, the result is different for each repeated calculation of the SOM. And as you have probably noticed, the results can sometimes look completely different. This is exactly the reason why I am not convinced that a SOM-based feature similarity calculation is a good idea: the result would depend too much on the initialization. To get around this problem, you would have to repeat the SOM creation with the same initializations several times for all variables, which would hardly be feasible for large data sets, since SOMs are not exactly what I would call a very fast algorithm. Just my 2 cents.
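
The dependence on the initialization can be reproduced with a small one-dimensional SOM in plain Python. This is only a sketch under simplifying assumptions (a fixed neighborhood of one node on each side, a linearly decaying learning rate, made-up data) and not the code behind the plotter. With a fixed seed the result is repeatable; with other seeds the same points may land on different map positions, for example in mirrored order.

```python
import random


def som_1d(values, net_size=4, epochs=30, seed=0):
    """Train a 1-D SOM on scalar values; return each value's node index."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    # Random initialization: this is what makes repeated runs differ.
    nodes = [rng.uniform(lo, hi) for _ in range(net_size)]
    total_steps = epochs * len(values)
    step = 0
    for _ in range(epochs):
        for v in rng.sample(values, len(values)):  # shuffled pass over the data
            rate = 0.5 * (1.0 - step / total_steps)
            bmu = min(range(net_size), key=lambda n: abs(nodes[n] - v))
            for n in range(net_size):
                if abs(n - bmu) <= 1:  # the winner and its direct neighbors
                    factor = 1.0 if n == bmu else 0.5
                    nodes[n] += rate * factor * (v - nodes[n])
            step += 1
    return [min(range(net_size), key=lambda n: abs(nodes[n] - v)) for v in values]


# Made-up data: three loose groups of values.
data = [0.1, 0.15, 0.2, 0.5, 0.55, 0.9, 0.95, 1.0]

for seed in (1, 2, 3):
    print(seed, som_1d(data, seed=seed))
```

Fixing the seed corresponds to repeating the SOM creation "with the same initializations"; comparing maps across attributes would only be meaningful under such a controlled setup.
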

Cheers,
Ingo