"Weight, Clustering and Decision Tree"
mariozupan
Member Posts: 15 Contributor II
I have 300 companies that I want to divide into clusters according to their financial performance indicators. Then I want to describe each cluster with a Decision Tree.
So, I have a few questions:
1. My attributes (financial indicators) are not normally distributed; I checked this with some statistical tests. Does that matter?
2. My attributes have different ranges. Do I need a normalization operator?
3. Do I need a select-by-weight operator to choose the indicators that are significant, or does k-means build the clusters according to the attribute weights?
As you can see from the questions above, I have tried a few things, but I did not get clusters that I could describe as "good", "better", and "the best". I need an answer as soon as possible: a small example, or even a data miner who is willing to cluster my data for a decent fee.
Answers
If you have further questions, feel free to reply to this message.
Good luck!
~Marius
It may seem that I am asking a lot, but I just need to know whether it is possible to shape the clusters in that way ("stars", "losers", etc.). If R extensions are necessary, I'm ready for them; a fuzzy genetic algorithm, for example. I can see you are very generous with your help, so I would not mind at all if you asked a fee for your knowledge.
You can run this algorithm using the Weka plugin.
You can also automatically rescale your attributes to best fit some hold-out set.
As far as I recall, there is an easy way to do this.
Best regards,
Wessel
But how do I relate the vector that results from normalizing the attributes back to the attributes themselves?
The questions above apply to k-means; I will try the nearest neighbour operator. I still need to study how nearest neighbour works, but do you mean that for financial indicator attributes nearest neighbour is better suited than k-means and self-organizing maps?
There is no easy solution to the rescaling or normalization issue.
You should understand how both the k-nearest neighbors and k-means clustering algorithms work.
Attributes with a large range tend to get more weight in the distance calculation, because the maximum possible distance computed on these attributes is bigger. Similarly, nominal attributes are also weighted disproportionately, because a single nominal attribute counts as at least two numerical attributes after dummy coding (i.e. converting it to binary attributes).
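To make the point about ranges concrete, here is a minimal Python sketch (standard library only, not a RapidMiner process). The company figures, the attribute names and the sector list are made up for illustration, and z-score scaling is just one possible normalization.

```python
import math
import statistics


def euclidean(a, b):
    """Plain Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def z_score_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def dummy_code(value, categories):
    """Turn one nominal value into one binary column per category."""
    return tuple(1.0 if value == c else 0.0 for c in categories)


# Hypothetical companies: (revenue in EUR, return on assets as a fraction).
companies = [
    (1_000_000.0, 0.02),  # A
    (1_050_000.0, 0.30),  # B: similar revenue to A, very different ROA
    (5_000_000.0, 0.03),  # C: similar ROA to A, very different revenue
]

# On the raw data the revenue column decides almost everything:
# A-C is roughly 80 times farther apart than A-B.
raw_ab = euclidean(companies[0], companies[1])
raw_ac = euclidean(companies[0], companies[2])

# After z-scoring, both attributes contribute on a comparable scale.
scaled = z_score_columns(companies)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

# A nominal attribute after dummy coding: two companies in different
# sectors differ in two binary columns, so the gap is sqrt(2), more than
# the maximum gap of 1 for a single attribute scaled to the range [0, 1].
sectors = ["retail", "energy", "banking"]
nominal_gap = euclidean(dummy_code("retail", sectors), dummy_code("energy", sectors))

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
print(nominal_gap)
```

Whether z-scores, range scaling or something else is appropriate still depends on the data; the sketch only shows why an unscaled attribute can dominate the distance.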
Why don't you try different lazy learners and see which one performs best?
If K* gives far superior performance to Euclidean distance then you know you should be worried.
Best regards,
Wessel
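The "try different learners and compare" advice can be sketched in a few lines of Python. This is not K* (which is a Weka learner) and not a RapidMiner process. It only compares leave-one-out 1-nearest-neighbour accuracy for Euclidean distance on raw attributes versus range-scaled attributes. The data, the labels and the function names are invented for the example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def loo_1nn_accuracy(rows, labels, distance):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(others, key=lambda j: distance(row, rows[j]))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(rows)


# Hypothetical data: (revenue in thousands, profitability score in [0, 1]).
# The label depends only on the small-range second attribute.
rows = [
    (100.0, 0.10), (110.0, 0.90),
    (200.0, 0.20), (210.0, 0.80),
    (300.0, 0.15), (310.0, 0.85),
]
labels = ["low", "high", "low", "high", "low", "high"]

raw_accuracy = loo_1nn_accuracy(rows, labels, euclidean)
scaled_accuracy = loo_1nn_accuracy(min_max_scale(rows), labels, euclidean)

print("raw:", raw_accuracy)
print("scaled:", scaled_accuracy)
```

On this toy data the raw distance is driven entirely by the first attribute, so the nearest neighbour always has the wrong label, while the scaled version gets every case right. A large gap like that between two distance treatments is the kind of warning sign meant above.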
I have one more sub-question about clustering. I watched the Statistica video tutorial on Kohonen SOM clustering. I tried the SOM operator in RapidMiner, but I didn't get clusters, only dimensions, so I put a k-means operator after SOM, like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="random_seed" value="-1"/>
    <process expanded="true" height="665" width="710">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="loop_parameters" compatibility="5.2.008" expanded="true" height="112" name="Loop Parameters" width="90" x="179" y="120">
        <list key="parameters">
          <parameter key="Clustering.k" value="[2.0;20;10;linear]"/>
        </list>
        <process expanded="true" height="400" width="582">
          <operator activated="true" class="self_organizing_map" compatibility="5.2.008" expanded="true" height="94" name="SOM" width="90" x="112" y="71">
            <parameter key="number_of_dimensions" value="3"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="45" y="255">
            <parameter key="k" value="20"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="5.2.008" expanded="true" height="94" name="Performance" width="90" x="246" y="255"/>
          <operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="76" name="Log" width="90" x="380" y="300">
            <list key="log">
              <parameter key="DaviesBouldin" value="operator.Performance.value.DaviesBouldin"/>
              <parameter key="avg_within_distance" value="operator.Performance.value.avg_within_distance"/>
              <parameter key="k" value="operator.Clustering.parameter.k"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="SOM" to_port="example set input"/>
          <connect from_op="SOM" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="SOM" from_port="original" to_port="result 1"/>
          <connect from_op="SOM" from_port="preprocessing model" to_port="result 2"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
      <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
      <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
      <connect from_op="Loop Parameters" from_port="result 3" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Now k-means works with the new dimensions, which are derived from my 6 attributes. Does that make sense? The Statistica and Matlab SOM toolbox tutorials show that it is easy to interpret the connection between attributes and SOM clusters. I can't find a way to do the same with RapidMiner.
Here is what I want to get:
http://www.google.hr/url?sa=t&rct=j&q=using%20self%20organizing%20maps%20to%20cluster%20stocks%20and%20financial%20ratios&source=web&cd=2&ved=0CCwQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.124.3253%26rep%3Drep1%26type%3Dpdf&ei=uWKCULTIK5DMsgbxioHIDg&usg=AFQjCNFY_aKuPeGVf7y2vGP2YJqja7KaSw
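One generic way to relate clusters back to the original attributes, whichever operator produced the cluster labels, is to average the original (untransformed) attributes per cluster and read the clusters off those profiles. The following Python sketch (standard library only) assumes the labels have already been exported next to the original rows. The attribute names, values and cluster labels are hypothetical, and this is not how RapidMiner or the SOM toolboxes do it internally.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(rows, labels, attribute_names):
    """Return {cluster label: {attribute name: mean of that attribute}}."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    profiles = {}
    for label, members in grouped.items():
        columns = zip(*members)  # one tuple of values per attribute
        profiles[label] = {
            name: mean(column) for name, column in zip(attribute_names, columns)
        }
    return profiles


# Hypothetical original attributes, one row per company.
attribute_names = ["roa", "current_ratio", "debt_ratio"]
rows = [
    (0.12, 1.8, 0.30),
    (0.10, 2.0, 0.35),
    (-0.05, 0.7, 0.80),
    (-0.03, 0.9, 0.75),
]
# Hypothetical labels, e.g. from k-means run on the SOM dimensions.
labels = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"]

profiles = cluster_profiles(rows, labels, attribute_names)

# Rank the clusters by one indicator to attach names such as
# "stars" or "losers" to them.
ranked_by_roa = sorted(profiles, key=lambda c: profiles[c]["roa"], reverse=True)

for label in ranked_by_roa:
    print(label, profiles[label])
```

Because the profiles are computed on the original attributes rather than on the SOM dimensions or the normalized values, they stay in the units the indicators are usually discussed in.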