
"Weight, Clustering and Decision Tree"

mariozupan Member Posts: 15 Contributor II
edited June 2019 in Help
I have 300 companies that I want to divide into clusters according to financial performance indicators. Then I want to describe every cluster with a Decision Tree.
So, I have a few questions:
1. My attributes (financial indicators) are not normally distributed; I checked this with some statistical tests. Does it matter?
2. My attributes have different ranges. Do I need a normalization operator?
3. Do I need a Select by Weights operator to choose the significant indicators, or does k-means build clusters according to attribute weights?

As you can see from the questions above, I have tried a few things, but I didn't get clusters that I can describe as "good", "better", and "best". I need an answer as soon as possible. A small example would help, or even a data miner who is willing to build the clusters on my data for a decent fee.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi, even with decent and modern tools, data analysis is always more than following a recipe that says "100 g of butter and one pound of flour". It is always a process of design, evaluation, and design adaptation. However, for many things there are best practices:
    mariozupan wrote:

    I have 300 companies that I want to divide into clusters according to financial performance indicators. Then I want to describe every cluster with a Decision Tree.
    So, I have a few questions:
    1. My attributes (financial indicators) are not normally distributed; I checked this with some statistical tests. Does it matter?
    No.
    2. My attributes have different ranges. Do I need a normalization operator?
    Yes, you always need it for k-Means clustering.
    3. Do I need a Select by Weights operator to choose the significant indicators, or does k-means build clusters according to attribute weights?
    What I said above applies here: there is no single answer; this depends heavily on the data and the use case. However, I would probably try to reduce the feature set, e.g. as you mentioned, with Weight by XXX and Select by Weights. Just try it out and see if your results improve.
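To illustrate the answer to question 2, here is a minimal sketch in Python (not a RapidMiner process) of a z-transformation, which rescales every attribute to mean 0 and standard deviation 1 so that no attribute dominates the k-Means distance just because of its units. The company figures are made up, and using the population standard deviation is an assumption; RapidMiner's own normalization may differ in detail.

```python
from statistics import mean, pstdev


def z_transform(rows):
    """Scale every column to mean 0 and standard deviation 1.

    Assumes no column is constant (a constant column would give a
    standard deviation of 0 and a division by zero).
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    scaled = [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]
    return scaled, means, stds


# Made-up companies: (revenue in EUR, current ratio)
companies = [
    (1_200_000, 0.8),
    (950_000, 2.1),
    (4_300_000, 1.4),
    (2_700_000, 1.1),
]

scaled, means, stds = z_transform(companies)
for row in scaled:
    print(tuple(round(value, 3) for value in row))
```

After the transformation both attributes vary on the same scale, so a difference in the current ratio counts as much as a difference in revenue.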

    If you have further questions, feel free to reply to this message.

    Good luck!
    ~Marius
  • mariozupan Member Posts: 15 Contributor II
    Thank you for your generous help. You really helped me a lot, but one question still bothers me even after a lot of reading and experimenting: how do I get meaningful clusters? Please show me how to look for clusters that make sense to a human. I tried selecting attributes by weight, I even added the SOM operator before k-Means, and I even got over 95% model accuracy, but what I cannot get is a cluster description. The model I mentioned (Normalization, Selection by Weight, SOM, k-Means) classifies the enterprises correctly, but every cluster ends up described as something like: average profitability, average liquidity, low activity, all kinds of investibility. Naturally, I want clusters such as "stars", "liquid but unprofitable", or "ready for bankruptcy". How can I adjust the clusters so that I can separate "good", "better", and "best" in some area of financial performance? Which operators do I have to use? Would Association Rules describe the clusters better than a Decision Tree? How do I get clear clusters in a Decision Tree? I got a tree, but it shows only a few attributes.
    It may seem that I am asking a lot, but I just need to know whether it is possible to shape the clusters that way ("stars", "losers", etc.). If R extensions are necessary, I'm ready for them; a fuzzy genetic algorithm, for example. You seem very generous, so I would not mind if you asked a fee for your knowledge.
  • wessel Member Posts: 537 Maven
    Marius wrote:

    Yes, you always need to rescale your attributes when using k-Means clustering.
    Alternatively you can use the Weka Kstar nearest neighbor algorithm, which uses a distance measure based on entropy.
    You can run this algorithm using the Weka plugin.

    You can also automatically rescale your attributes to best fit some hold out set.
    As far as I recall, there is an easy way to do this?

    Best regards,

    Wessel
  • mariozupan Member Posts: 15 Contributor II
    By "rescale", do you mean normalizing to a range (e.g. from -1 to +1)? I have already normalized all attributes to the range -1 to +1. None of the tutorials and scientific papers on clustering and cluster interpretation that I read mentioned procedures for adjusting clusters or explaining them logically. If 20 companies are in the same cluster, it means that they are fairly close to the cluster center with respect to, for example, five attributes. Every row is a five-dimensional vector, right?
    But how do I relate the vector that results from attribute normalization back to the original attributes?

    The questions above apply to k-means; I will try the nearest neighbour operator. I need to study how nearest neighbour works, but do you mean that for financial indicator attributes nearest neighbour will be better suited than k-means and self-organizing maps?
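On the question of relating normalized values back to the attributes: a range transformation to [-1, +1] is invertible, so a cluster center can be mapped back to the original units and read as ordinary financial ratios. The following is a minimal Python sketch under stated assumptions; the attribute names, the observed minima and maxima, and the cluster center are all hypothetical, and the functions are illustrative rather than anything RapidMiner provides.

```python
def to_range(value, lo, hi):
    """Map a value from [lo, hi] to [-1, +1]."""
    return 2 * (value - lo) / (hi - lo) - 1


def from_range(scaled, lo, hi):
    """Inverse of to_range: map a value from [-1, +1] back to [lo, hi]."""
    return lo + (scaled + 1) / 2 * (hi - lo)


# Hypothetical attributes with the (min, max) observed before normalization.
attribute_ranges = {
    "return_on_assets": (-0.10, 0.30),
    "current_ratio": (0.5, 3.0),
}

# Hypothetical cluster center in normalized space, one value per attribute.
center_scaled = {"return_on_assets": 0.5, "current_ratio": -0.2}

# The same center expressed in the original units of each attribute.
center_original = {
    name: from_range(center_scaled[name], *attribute_ranges[name])
    for name in center_scaled
}

for name, value in center_original.items():
    print(f"{name}: {value:.3f}")
```

Reading the centers in original units is one way to check whether a cluster deserves a label such as "liquid but unprofitable".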
  • wessel Member Posts: 537 Maven
    I don't fully understand your comments.

    There is no easy solution to the rescaling or normalization issue.
    You should understand how both the k-nearest neighbors and k-means clustering algorithms work.

    Attributes with a large scale tend to get more weight in the distance calculation because the maximum possible distance computed on these attributes is bigger. Similarly, nominal attributes are also weighted disproportionately, because a single nominal attribute counts as at least two numerical attributes after dummy coding (i.e. converting to binary attributes).

    Why don't you try different lazy learners and see which one performs best?
    If K* gives far superior performance to Euclidean distance then you know you should be worried.

    Best regards,

    Wessel
  • mariozupan Member Posts: 15 Contributor II
    Ok. I think I understand. I need to study different clustering methods and measure their performance, rescaling and normalizing the attributes along the way, to get meaningful clusters that satisfy my needs. Do you use a specific operator for rescaling?

    I have one more sub-question about clustering. I watched the Statistica video tutorial on Kohonen SOM clustering. I tried the SOM operator in RapidMiner, but I didn't get clusters, only dimensions, so I put a k-means operator after SOM, like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="random_seed" value="-1"/>
        <process expanded="true" height="665" width="710">
          <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="loop_parameters" compatibility="5.2.008" expanded="true" height="112" name="Loop Parameters" width="90" x="179" y="120">
            <list key="parameters">
              <parameter key="Clustering.k" value="[2.0;20;10;linear]"/>
            </list>
            <process expanded="true" height="400" width="582">
              <operator activated="true" class="self_organizing_map" compatibility="5.2.008" expanded="true" height="94" name="SOM" width="90" x="112" y="71">
                <parameter key="number_of_dimensions" value="3"/>
              </operator>
              <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="45" y="255">
                <parameter key="k" value="20"/>
              </operator>
              <operator activated="true" class="cluster_distance_performance" compatibility="5.2.008" expanded="true" height="94" name="Performance" width="90" x="246" y="255"/>
              <operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="76" name="Log" width="90" x="380" y="300">
                <list key="log">
                  <parameter key="DaviesBouldin" value="operator.Performance.value.DaviesBouldin"/>
                  <parameter key="avg_within_distance" value="operator.Performance.value.avg_within_distance"/>
                  <parameter key="k" value="operator.Clustering.parameter.k"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="SOM" to_port="example set input"/>
              <connect from_op="SOM" from_port="example set output" to_op="Clustering" to_port="example set"/>
              <connect from_op="SOM" from_port="original" to_port="result 1"/>
              <connect from_op="SOM" from_port="preprocessing model" to_port="result 2"/>
              <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
              <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
              <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
              <portSpacing port="sink_result 3" spacing="0"/>
              <portSpacing port="sink_result 4" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
          <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
          <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
          <connect from_op="Loop Parameters" from_port="result 3" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
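The process above logs the Davies-Bouldin index for each value of k. As a rough guide to reading that log, here is a minimal Python sketch of the textbook definition: for every cluster, take the worst ratio of combined within-cluster scatter to the distance between centers, then average those ratios. Lower values mean more compact, better separated clusters. The toy points are made up, and the sign and scaling that RapidMiner's Performance operator reports may differ from this definition.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes at least two clusters with distinct centers.
    """
    centers = [centroid(points) for points in clusters]
    # Scatter: average distance of a cluster's points to its own center.
    scatter = [
        sum(math.dist(point, center) for point in points) / len(points)
        for points, center in zip(clusters, centers)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(worst_ratios)


# Two made-up clusterings: same cluster shapes, different separation.
well_separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_together = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]

db_separated = davies_bouldin(well_separated)
db_close = davies_bouldin(close_together)
print(db_separated, db_close)
```

Comparing the index across the logged values of k is one way to pick a cluster count, though it says nothing about whether the clusters are easy to describe in business terms.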

    Now k-means works with the new dimensions that are derived from my 6 attributes. Does that make sense? The Statistica and Matlab SOM Toolbox tutorials show that it is easy to interpret the connection between the attributes and the SOM clusters. I can't find a way to do the same with RapidMiner.
    Here is what I want to get:
    http://www.google.hr/url?sa=t&rct=j&q=using%20self%20organizing%20maps%20to%20cluster%20stocks%20and%20financial%20ratios&source=web&cd=2&ved=0CCwQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.124.3253%26rep%3Drep1%26type%3Dpdf&ei=uWKCULTIK5DMsgbxioHIDg&usg=AFQjCNFY_aKuPeGVf7y2vGP2YJqja7KaSw