"advice on which clustering/classification operators to use"

wahoo_pa Member Posts: 3 Contributor I
edited May 2019 in Help
I'm looking for some recommendations on which operators might be best for my task. My task is as follows: I have an example set that consists of a text field and a label. There are 4 possible values for the label field, and each text has already been assigned a label by a human being. The catch is that there is concern that the labels are either not being assigned carefully, or that some items are deliberately being assigned incorrect labels.

I went through all of the normal document processing, tokenizing, filtering out stop words, etc. My first thought was to use k-nn to see how well the predicted labels would match up with the pre-assigned labels, then I could perhaps create an exception set of instances where k-nn thought the text might be misclassified. However, I'm not crazy about the lack of output/diagnostics from k-nn. I would prefer to have some additional information about how certain the algorithm is about the label it has assigned.

So, I started to look at some unsupervised methods. I tried k-means but it doesn't seem to offer that much more in diagnostics or output than k-nn. I'm looking at the Expectation Maximization Clustering but it seems to hang and not complete. It sounds like some sort of fuzzy clustering is what I want, but it doesn't sound like there are any operators like that right now for RapidMiner.

So, are there any operators or extensions that offer fuzzy clustering or something similar? What I'm looking for is either a supervised method that returns some info on the certainty of each label assignment, or an unsupervised method that provides info on the certainty of each assignment, plus info on the characteristics of each cluster.

Any help would be much appreciated, thanks in advance!


  • dan_agape Member Posts: 106 Maven

    Fuzzy clustering is still unavailable in RM. EM clustering is more computationally intensive than k-means, and since your dataset has a very high dimensionality (and perhaps many rows, if there are many labelled texts), this may explain why the computation didn't finish in a reasonable amount of time. k-means, on the other hand, is fast and widely used in text mining.

    In general, depending on the clustering technique you apply, you may get some information on the characteristics of the clusters directly (for instance centroids, squared errors, cardinality of the clusters, etc.). In addition, by changing the role of the cluster attribute to that of label (in the clustered example set) and applying a supervised learning technique such as a decision tree, you may obtain useful information on the profiles of the clusters from the rules of the tree. If you then run a (supervised learning) model evaluation, the resulting performance indicators (such as accuracy) can give you valuable information about the quality of the initial clustering.
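    A sketch of that cluster-to-pseudo-label trick using scikit-learn (an assumed stack, since the thread itself is about RapidMiner; the synthetic blobs stand in for your word vectors and all parameters are illustrative):

```python
# Cluster, relabel the examples with their cluster id, then learn a decision
# tree on the cluster ids: the tree's accuracy indicates how learnable (i.e.
# how well-separated) the clusters are, mirroring the RapidMiner recipe above.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in for the word-vector example set.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=42)

# 1. Cluster with k-means (k = 4, like the four label values).
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# 2. Treat the cluster id as the label and train a small decision tree on it.
X_tr, X_te, c_tr, c_te = train_test_split(X, clusters, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, c_tr)

# 3. High accuracy here means the clusters have simple, describable profiles.
acc = accuracy_score(c_te, tree.predict(X_te))
print(f"tree reproduces clusters with accuracy {acc:.2f}")
```

    On real word vectors the tree's split attributes are the characteristic words of each cluster, which is exactly the profiling information described above.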

    The above is a general technique, as said. In your case you may experiment by first choosing binary term occurrence when you build the word vectors, then clustering with k-means (k = 4). Inspect the cardinality of the clusters (in general k-means is more adequate when the expected clusters are balanced in their number of examples). Then you may build a decision tree as explained, so that you get some characteristic words (shown in the non-terminal nodes) relevant to particular clusters (shown in the leaves). The accuracy of the model is an indication of the quality of both the tree and the initial clustering. If (depending on your data) you do not get anything encouraging with the decision tree, convert your dataset (Numerical to Binominal, and Nominal to Binominal) and apply the FP-Growth and Create Association Rules operators to get association rules. You would then select those rules having the cluster attribute in the consequence (the premises will show you characteristic words for those clusters).
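    To make "rules with the cluster attribute in the consequence" concrete, here is a toy plain-Python sketch (brute-force counting of single-word premises rather than a real FP-Growth, and entirely made-up documents, just to show the shape of the output):

```python
# Mine rules "word -> cluster" from binary term occurrences: a rule is kept
# when the word co-occurs with the cluster often enough (support) and mostly
# with that cluster (confidence). The premise word characterises the cluster.
from collections import Counter

# Each entry: the set of words in a document, plus the cluster it landed in.
docs = [
    ({"goal", "match", "team"}, "cluster_0"),
    ({"goal", "score", "team"}, "cluster_0"),
    ({"match", "goal"},         "cluster_0"),
    ({"rate", "bank", "loan"},  "cluster_1"),
    ({"bank", "loan"},          "cluster_1"),
    ({"goal", "loan"},          "cluster_1"),
]

def word_to_cluster_rules(docs, min_support=2, min_confidence=0.7):
    word_count = Counter()
    pair_count = Counter()
    for words, cluster in docs:
        for w in words:
            word_count[w] += 1
            pair_count[(w, cluster)] += 1
    rules = []
    for (w, c), n in pair_count.items():
        confidence = n / word_count[w]
        if n >= min_support and confidence >= min_confidence:
            rules.append((w, c, n, confidence))  # premise, consequence, support, confidence
    return sorted(rules)

for w, c, sup, conf in word_to_cluster_rules(docs):
    print(f"{w} -> {c}  (support={sup}, confidence={conf:.2f})")
```

    In RapidMiner the FP-Growth + Create Association Rules operators do this properly (including multi-word premises); the sketch only illustrates what to look for in their output.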

    For determining the clustering quality with an alternative supervised learning algorithm, you may wish to learn a model using k-nn instead of the decision tree (you get a less explanatory model, but it works well in general and its training requires less sophisticated tuning), and evaluate its accuracy. You may also wish to cluster your dataset again after repeating the text-to-word-vector conversion (considering term frequency or TF-IDF for the word vector construction; I would avoid the term occurrences option for this application), and then re-evaluate the clustering quality via supervised learning evaluation, as explained. This may seem like trying several operations and making a number of attempts, but in the end that is what data mining work is about.

    Once you get a clustering you are happy with, you can "match" the clusters against the human-assigned labels/classes. The theory recommends, among others, the entropy or Gini impurity measures (larger values, less match), or the Gamma (correlation) statistic, Rand statistic and Jaccard coefficient, all computed from the ideal cluster similarity matrix and the ideal class similarity matrix (larger values, more match). However, I would choose a Chi-square test here, using RapidMiner's Weight by Chi Squared Statistic operator, to test whether or not there is a dependence between the labels and the clusters. If the weight you get from this operator is more than 16.92 (the 95th percentile of the Chi-square distribution with 9 degrees of freedom, where the degrees of freedom are obtained by multiplying the number of clusters minus 1 by the number of classes minus 1, that is 3 × 3), then there is a statistically significant dependency between clusters and labels. Otherwise the data is consistent with the hypothesis that the clusters are independent of the human-assigned labels (which may suggest that many labels were assigned quite randomly). [By the way, with the light extension of the Weight by Chi Squared Statistic operator that I suggested at http://rapid-i.com/rapidforum/index.php/topic,3954.0.html , this kind of independence testing between attributes (with all its benefits) would be immediate in RapidMiner.]
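    The arithmetic behind that 16.92 threshold can be written out in plain Python (the contingency table below is made up for illustration):

```python
# Chi-square test of independence between clusters and human-assigned labels.
# With 4 clusters and 4 labels, df = (4 - 1) * (4 - 1) = 9, and the 95th
# percentile of the Chi-square distribution with 9 df is 16.92.

def chi_square_stat(table):
    """Chi-square statistic for a contingency table (rows: clusters, cols: labels)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

CRITICAL_95_DF9 = 16.92

# A strongly diagonal table: each cluster lines up with one label.
table = [[30, 5, 5, 5],
         [5, 30, 5, 5],
         [5, 5, 30, 5],
         [5, 5, 5, 30]]

stat = chi_square_stat(table)
print(f"chi-square = {stat:.2f}, dependent: {stat > CRITICAL_95_DF9}")
```

    A diagonal-heavy table like this gives a statistic far above 16.92 (clusters and labels dependent), while a uniform table gives a statistic near 0 (consistent with independence, i.e. with near-random labelling).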

    Finally, I would say that your initial idea of using k-nn to check how carefully the labels were assigned would be a useful direction for this analysis too. It may work very well under the following natural assumptions:

    1. very similar documents would very likely receive the same expected label if labels were assigned ideally correctly by the human being;
    2. a significant proportion of labels (say at least half) were assigned correctly in reality.

    With these assumptions, a k-nn model (say k = 9) would likely predict the correct label for an example from the example set composed of word vectors and human-assigned labels. It is not possible to estimate the probability with which the predicted label would be correct; this largely depends on the proportion mentioned in (2) above, which is not known.
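    That caveat concerns the true probability of correctness; what k-NN can still report per example is the fraction of the k neighbours voting for each class, which is the kind of certainty score the original question asked for. A scikit-learn sketch (an assumed stack with synthetic data; RapidMiner similarly shows confidence attributes after Apply Model):

```python
# With uniform weights, KNeighborsClassifier.predict_proba returns the
# neighbour vote fractions: votes / k for each class, a per-example
# confidence score. Examples where the vote is split are the ones worth
# reviewing for possible mislabelling.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Overlapping synthetic blobs stand in for the labelled word vectors.
X, y = make_blobs(n_samples=200, centers=4, cluster_std=2.5, random_state=1)
knn = KNeighborsClassifier(n_neighbors=9).fit(X, y)

proba = knn.predict_proba(X[:5])        # rows: examples, columns: classes
for row in proba:
    print(["%.2f" % p for p in row])    # each entry is a multiple of 1/9
```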

    The level of accuracy (and, even better, the kappa statistic) in a process like the one below can indicate the extent to which the labels were not assigned randomly. In particular, if the kappa statistic is very close to 0, it is likely that the labels were assigned quite randomly by the human being.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <process expanded="true" height="335" width="480">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="neg" value="C:\Users\D\Desktop\DMDB_work\datasets\review_polarity\txt_sentoken\neg"/>
              <parameter key="pos" value="C:\Users\D\Desktop\DMDB_work\datasets\review_polarity\txt_sentoken\pos"/>
            </list>
            <parameter key="vector_creation" value="Term Frequency"/>
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prune_below_percent" value="2.0"/>
            <parameter key="prune_above_percent" value="98.0"/>
            <parameter key="prune_below_absolute" value="20"/>
            <parameter key="prune_above_absolute" value="1970"/>
            <process expanded="true" height="305" width="480">
              <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.1.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="120">
                <parameter key="min_chars" value="2"/>
                <parameter key="max_chars" value="35"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="120"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.1.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="75"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="x_validation" compatibility="5.1.006" expanded="true" height="112" name="Validation" width="90" x="179" y="30">
            <parameter key="sampling_type" value="shuffled sampling"/>
            <process expanded="true" height="335" width="218">
              <operator activated="true" class="k_nn" compatibility="5.1.006" expanded="true" height="76" name="k-NN" width="90" x="63" y="62">
                <parameter key="k" value="9"/>
              </operator>
              <connect from_port="training" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="341" width="218">
              <operator activated="true" class="apply_model" compatibility="5.1.006" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="5.1.006" expanded="true" height="76" name="Performance (2)" width="90" x="45" y="165">
                <parameter key="kappa" value="true"/>
                <parameter key="correlation" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="training" to_port="result 1"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    If the accuracy and kappa levels are good, and a domain expert manually verifies part (2) of the assumptions above on a data sample, it would be quite reasonable to use the model learned in the process above to re-score the whole dataset with labels that, in general, would be more suitable.
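    For reference, the kappa statistic that the Performance operator reports can be computed by hand from the confusion matrix. A plain-Python sketch with made-up counts:

```python
# Cohen's kappa: agreement beyond what chance alone would produce.
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, cols: predicted)."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (n * n)
    return (observed - expected) / (1 - expected)

# Example: 70% raw accuracy, but only moderate agreement beyond chance.
confusion = [[20, 5],
             [10, 15]]
print(f"kappa = {cohen_kappa(confusion):.2f}")
```

    This is why kappa near 0 signals near-random labelling even when raw accuracy looks respectable: a lopsided label distribution can make accuracy high by chance alone.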

  • wahoo_pa Member Posts: 3 Contributor I

    Thanks, this is very helpful!
  • dan_agape Member Posts: 106 Maven
    Hi wahoo_pa,

    You are welcome. Did you also compute the kappa statistic? It would be interesting to see what value you got for it (and for the accuracy of the same model).
