Clustering with labels?

Fred12 Member Posts: 344 Unicorn
edited November 2018 in Help

Hi,

Is there any way to do clustering with labels so that I can check performance (as in classification)? Which operator can I use for that (e.g. with k-means)?

Also, is there some way to cluster the data with "help" from the labels when the class is known? I mean clustering based on the given labels (e.g. finding out which class labels are clustered together, then getting the centroid of that local cluster, and so on).

Is there an existing operator that uses labels for clustering? I just want to find out more about the properties of my dataset and my classes (e.g. local cluster labels, centroid tables, etc.).

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Solution Accepted

    Did you try Map Clustering on Labels, followed by the performance operators?

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • dang Member Posts: 11 Contributor II

    If you have labeled data, clustering is, most of the time, like bringing owls to Athens...

    Of course, you can use 'Set Role' to turn the label column into a regular attribute and pretend not to have any label information. Once the data no longer has the special attribute 'label', you can do any clustering you want.

    Hope that makes sense...

  • Fred12 Member Posts: 344 Unicorn

    I know the purpose of clustering, but I want to compare the clusters that were found with the labeled "clusters", if you know what I mean, to measure the "goodness" of the clusters against some ground truth.

    Is there any sophisticated way to do so? Any ideas?

  • Fred12 Member Posts: 344 Unicorn

    Yeah, thanks, that seemed to work, but I still don't know how that operator works.

    How does it choose which cluster gets which label?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist

    Mh, good question. The important code is in ClusterToPrediction.java - but it's quite a chunk.

     

    @Override
    public void doWork() throws OperatorException {
    ExampleSet exampleSet = exampleSetInput.getData(ExampleSet.class);
    ClusterModel model = clusterModelInput.getData(ClusterModel.class);

    // generate the predicted attribute
    Attribute labelAttribute = exampleSet.getAttributes().getLabel();
    PredictionModel.createPredictedLabel(exampleSet, labelAttribute);
    Attribute predictedLabel = exampleSet.getAttributes().getPredictedLabel();

    HashMap<Integer, String> intToClusterMapping = new HashMap<Integer, String>();
    int[][] mappingTable = new int[model.getNumberOfClusters()][model.getNumberOfClusters()];

    // count the occurrence of each label with every cluster
    int a = 0;
    for (int i = 0; i < model.getNumberOfClusters(); i++) {
    HashMap<String, Integer> labelOccurrence = new HashMap<String, Integer>();
    for (Example example : exampleSet) {
    String label = example.getValueAsString(labelAttribute);
    if (!labelOccurrence.containsKey(label)) {
    labelOccurrence.put(label, 0);
    if (i == 0) {
    intToClusterMapping.put(a, label);
    a++;
    }
    }
    if (example.getValue(example.getAttributes().getCluster()) == i) {
    labelOccurrence.put(label, labelOccurrence.get(label) + 1);
    }
    }

    if (i == 0 && model.getNumberOfClusters() != labelOccurrence.size()) {
    throw new UserError(this, 943, labelOccurrence.size(), model.getNumberOfClusters());
    }

    for (int j = 0; j < mappingTable[i].length; j++) {
    String clusterName = intToClusterMapping.get(j);
    int occ = labelOccurrence.get(clusterName);
    mappingTable[i][j] = occ;
    }
    }
    /*
    * Munkres-algorithm or the hungarian method
    */
    // find the maximum
    int maxValue = -1;
    for (int i = 0; i < mappingTable.length; i++) {
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (mappingTable[i][j] > maxValue) {
    maxValue = mappingTable[i][j];
    }
    }
    }

    // compute the new (inverted) table (and column-minima)
    for (int i = 0; i < mappingTable.length; i++) {
    int minimum = Integer.MAX_VALUE;
    for (int j = 0; j < mappingTable[i].length; j++) {
    mappingTable[i][j] = maxValue - mappingTable[i][j];
    if (mappingTable[i][j] < minimum) {
    minimum = mappingTable[i][j];
    }
    }
    // subtract the column-minima
    if (minimum > 0) {
    for (int j = 0; j < mappingTable[i].length; j++) {
    mappingTable[i][j] = mappingTable[i][j] - minimum;
    }
    }
    }
    // compute and subtract the row-minima
    for (int i = 0; i < mappingTable[0].length; i++) {
    int minimum = Integer.MAX_VALUE;
    for (int j = 0; j < mappingTable.length; j++) {
    if (mappingTable[j][i] < minimum) {
    minimum = mappingTable[j][i];
    }
    }
    // subtract the row-minima
    if (minimum > 0) {
    for (int j = 0; j < mappingTable.length; j++) {
    mappingTable[j][i] = mappingTable[j][i] - minimum;
    }
    }
    }
    while (!assignmentAvailable(mappingTable)) {
    Vector<Integer> markedRows = new Vector<Integer>();
    Vector<Integer> markedColumns = new Vector<Integer>();

    // mark all rows which have no marked zero (start labeling)
    for (int i = 0; i < mappingTable[0].length; i++) {
    boolean markedZero = false;
    for (int j = 0; j < mappingTable.length; j++) {
    if (mappingTable[j][i] == Integer.MIN_VALUE) {
    markedZero = true;
    break;
    }
    }
    if (!markedZero) {
    markedRows.add(i);
    }
    }

    boolean newMarked = true;
    while (newMarked) {
    newMarked = false;
    // mark all columns with a slashed zero in a marked row
    for (int i = 0; i < mappingTable.length; i++) {
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (mappingTable[i][j] == Integer.MAX_VALUE) {
    if (markedRows.contains(j) && !markedColumns.contains(i)) {
    newMarked = true;
    markedColumns.add(i);
    }
    }
    }
    }
    // mark all rows with a marked zero in a marked column
    for (int i = 0; i < mappingTable[0].length; i++) {
    for (int j = 0; j < mappingTable.length; j++) {
    if (mappingTable[j][i] == Integer.MIN_VALUE) {
    if (markedColumns.contains(j) && !markedRows.contains(i)) {
    newMarked = true;
    markedRows.add(i);
    }
    }
    }
    }
    } // end while (newMarked)

    // inverting of the marked columns
    for (int i = 0; i < mappingTable.length; i++) {
    if (!markedColumns.contains(i)) {
    markedColumns.add(i);
    } else {
    markedColumns.removeElement(i);
    }
    }

    // find the minimum in the marked range
    int minimum = Integer.MAX_VALUE;
    for (int i = 0; i < markedRows.size(); i++) {
    for (int j = 0; j < markedColumns.size(); j++) {
    if (mappingTable[markedColumns.get(j)][markedRows.get(i)] < minimum) {
    minimum = mappingTable[markedColumns.get(j)][markedRows.get(i)];
    }
    }
    }
    // subtract the minimum from all elements in the marked range
    for (int i = 0; i < markedRows.size(); i++) {
    for (int j = 0; j < markedColumns.size(); j++) {
    mappingTable[markedColumns.get(j)][markedRows.get(i)] = mappingTable[markedColumns.get(j)][markedRows.get(i)] - minimum;
    }
    }

    // add the minimum to all elements which are neither marked in a row nor in a column
    for (int i = 0; i < mappingTable.length; i++) {
    if (!markedColumns.contains(i)) {
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (!markedRows.contains(j)) {
    mappingTable[i][j] = mappingTable[i][j] + minimum;
    }
    }
    }
    }
    // reset the Integer.MIN_VALUE and Integer.MAX_VALUE to zero
    for (int i = 0; i < mappingTable.length; i++) {
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (mappingTable[i][j] == Integer.MAX_VALUE) {
    mappingTable[i][j] = 0;
    }
    if (mappingTable[i][j] == Integer.MIN_VALUE) {
    mappingTable[i][j] = 0;
    }
    }
    }
    } // end while(!assignmentAvailable)

    // compute the mapping (there must be a possible assignment)
    HashMap<Integer, String> clusterToPrediction = new HashMap<Integer, String>();
    for (int i = 0; i < mappingTable.length; i++) {
    int result = -1;
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (mappingTable[i][j] == Integer.MIN_VALUE) {
    result = j;
    break;
    }
    }
    String resultCluster = intToClusterMapping.get(result);
    clusterToPrediction.put(i, resultCluster);
    }

    // insert the result in the predicted attribute
    HashMap<String, Integer> predictionToCluster = new HashMap<String, Integer>();
    // set the predictedLabel in the example table and compute the cluster for each prediction
    int i = 0;
    Attribute clusterAttribute = exampleSet.getAttributes().getCluster();
    for (Example example : exampleSet) {
    String resultLabel = clusterToPrediction.get((int) example.getValue(example.getAttributes().getCluster()));
    example.setValue(predictedLabel, resultLabel);
    if (predictionToCluster.size() < model.getNumberOfClusters()) {
    if (!predictionToCluster.containsKey(example.getValueAsString(example.getAttributes().getPredictedLabel()))) {
    String clusterNumber = example.getValueAsString(clusterAttribute).replaceAll("[^\\d]+", "");
    try {
    int number = Integer.parseInt(clusterNumber);
    predictionToCluster.put(example.getValueAsString(example.getAttributes().getPredictedLabel()),
    number);
    } catch (NumberFormatException e) {
    throw new UserError(this, 145, clusterAttribute.getName());
    }
    }
    }
    i++;
    }

    // set the confidence in the example table
    i = 0;
    for (Example example : exampleSet) {
    if (model.getClass() == FlatFuzzyClusterModel.class) {
    FlatFuzzyClusterModel fuzzyModel = (FlatFuzzyClusterModel) model;
    for (int j = 0; j < clusterToPrediction.size(); j++) {
    String label = clusterToPrediction.get(j);
    example.setConfidence(label,
    fuzzyModel.getExampleInClusterProbability(i, predictionToCluster.get(label)));
    }
    } else {
    example.setConfidence(clusterToPrediction.get((int) example.getValue(example.getAttributes().getCluster())),
    1);
    }
    i++;
    }

    exampleSetOutput.deliver(exampleSet);
    clusterModelOutput.deliver(model);
    }

    /* Returns true if there is a solution available. */
    private boolean assignmentAvailable(int[][] mappingTable) {
    int markedZeros = 0;
    boolean modificationDone = true;

    while (modificationDone) {
    while (modificationDone) {
    modificationDone = false;
    // column by column
    for (int i = 0; i < mappingTable.length; i++) {
    int position = -1;
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (mappingTable[i][j] == 0) {
    if (position == -1) {
    position = j;
    } else {
    position = -1;
    break;
    }
    }
    }
    if (position != -1) {
    modificationDone = true;
    mappingTable[i][position] = Integer.MIN_VALUE; // marked zero
    for (int k = 0; k < mappingTable.length; k++) {
    if (mappingTable[k][position] == 0) {
    mappingTable[k][position] = Integer.MAX_VALUE; // slashed zeros
    }
    }
    markedZeros++;
    }
    }
    if (markedZeros == mappingTable.length) {
    return true;
    }

    // line by line
    for (int i = 0; i < mappingTable[0].length; i++) {
    int position = -1;
    for (int j = 0; j < mappingTable.length; j++) {
    if (mappingTable[j][i] == 0) {
    if (position == -1) {
    position = j;
    } else {
    position = -1;
    break;
    }
    }
    }
    if (position != -1) {
    modificationDone = true;
    mappingTable[position][i] = Integer.MIN_VALUE;// marked zero
    for (int k = 0; k < mappingTable[0].length; k++) {
    if (mappingTable[position][k] == 0) {
    mappingTable[position][k] = Integer.MAX_VALUE; // slashed zeros
    }
    }
    markedZeros++;
    }
    }
    if (markedZeros == mappingTable.length) {
    return true;
    }
    }
    // modificationDone is here always false
    // ambiguous zeros
    int aktMarkedZeros = markedZeros;
    for (int i = 0; i < mappingTable.length; i++) {
    for (int j = 0; j < mappingTable[i].length; j++) {
    if (mappingTable[i][j] == 0) {
    mappingTable[i][j] = Integer.MIN_VALUE;// marked zero
    for (int k = j + 1; k < mappingTable[i].length; k++) {
    if (mappingTable[i][k] == 0) {
    mappingTable[i][k] = Integer.MAX_VALUE; // slashed zeros in the same
    // column
    }
    }
    for (int k = 0; k < mappingTable.length; k++) {
    if (mappingTable[k][j] == 0) {
    mappingTable[k][j] = Integer.MAX_VALUE; // slashed zeros
    }
    }
    modificationDone = true;
    markedZeros++;
    break;
    }
    }
    if (aktMarkedZeros != markedZeros) {
    break;
    }
    }
    if (markedZeros == mappingTable.length) {
    return true;
    }
    }

    return false;
    }
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
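
For readers who find the listing above hard to follow, here is a minimal sketch of just its first stage: counting how often each label occurs in each cluster, then inverting the counts into a cost table so that a minimum-cost assignment corresponds to maximum agreement. The class and method names (ClusterLabelTable, countTable, toCost) and the toy data are hypothetical, chosen for illustration; this is not the RapidMiner code, and it leaves out the Hungarian-method steps entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not part of the RapidMiner API.
class ClusterLabelTable {

    // counts[c][l] = number of examples in cluster c whose label has index l.
    // labelOrder records labels in order of first appearance.
    static int[][] countTable(int[] clusters, String[] labels, List<String> labelOrder, int k) {
        int[][] counts = new int[k][k];
        for (int n = 0; n < clusters.length; n++) {
            int l = labelOrder.indexOf(labels[n]);
            if (l < 0) {
                // The square table only works with as many label classes as clusters.
                if (labelOrder.size() == k) {
                    throw new IllegalArgumentException("more label classes than clusters");
                }
                labelOrder.add(labels[n]);
                l = labelOrder.size() - 1;
            }
            counts[clusters[n]][l]++;
        }
        return counts;
    }

    // cost[c][l] = max - counts[c][l]: large agreement becomes low cost.
    static int[][] toCost(int[][] counts) {
        int max = 0;
        for (int[] row : counts) {
            for (int v : row) {
                max = Math.max(max, v);
            }
        }
        int[][] cost = new int[counts.length][];
        for (int c = 0; c < counts.length; c++) {
            cost[c] = new int[counts[c].length];
            for (int l = 0; l < counts[c].length; l++) {
                cost[c][l] = max - counts[c][l];
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 0, 1, 1, 2, 2, 2};
        String[] labels = {"a", "a", "b", "b", "b", "c", "c", "a"};
        List<String> order = new ArrayList<>();
        int[][] counts = countTable(clusters, labels, order, 3);
        System.out.println(Arrays.deepToString(counts));         // [[2, 1, 0], [0, 2, 0], [1, 0, 2]]
        System.out.println(Arrays.deepToString(toCost(counts))); // [[0, 1, 2], [2, 0, 2], [1, 2, 0]]
    }
}
```

In the toy table, the zeros on the diagonal of the cost table already suggest the mapping cluster 0 to "a", cluster 1 to "b", and cluster 2 to "c"; the remaining steps of the listing exist to find such an assignment when it is not this obvious.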
  • student_compute Member Posts: 73 Contributor II

    Hi, how should I use this code in the program? Where should I copy it to in order to use it?
    Thank you, and sorry for asking.

  • student_compute Member Posts: 73 Contributor II

    Hi, sorry, please help me. Thanks.

  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hi @mschmitz,

    One further question in this connection: which classification model does the "Map Clustering on Labels" operator use for the subsequent calculation of performance values?

    Thank you in advance for your response! 

    Best regards!
  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    The Map Clustering on Labels "model" simply chooses a cluster for each class and maps to that, by minimizing the total number of errors produced by the mapping.  Assignments by cluster are exclusive. It then calculates the performance metrics by looking at "predictions" (based on the mapped clusters) and the "actual" (the label).  You need to have the same number of clusters as you have label classes for this operator to work.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
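
The mapping described above can be sketched as follows, assuming a small number of clusters. Instead of the Hungarian method used in the source listing earlier in this thread, this sketch simply tries every one-to-one assignment of clusters to labels and keeps the one with the most matches (equivalently, the fewest errors), then computes accuracy from it. All names (BestClusterMapping, bestMapping, accuracy) and the count table are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch; names are hypothetical and this is not the RapidMiner implementation.
class BestClusterMapping {

    // Returns mapping[c] = label index assigned to cluster c.
    // counts[c][l] = number of examples in cluster c with label l (square table).
    static int[] bestMapping(int[][] counts) {
        int k = counts.length;
        int[] best = new int[k];
        int[] bestScore = {-1};
        search(counts, 0, new boolean[k], new int[k], 0, best, bestScore);
        return best;
    }

    // Exhaustive search over all exclusive assignments (k! of them).
    private static void search(int[][] counts, int cluster, boolean[] used,
                               int[] current, int score, int[] best, int[] bestScore) {
        if (cluster == counts.length) {
            if (score > bestScore[0]) {
                bestScore[0] = score;
                System.arraycopy(current, 0, best, 0, current.length);
            }
            return;
        }
        for (int label = 0; label < counts.length; label++) {
            if (used[label]) {
                continue; // each label may be assigned to only one cluster
            }
            used[label] = true;
            current[cluster] = label;
            search(counts, cluster + 1, used, current,
                    score + counts[cluster][label], best, bestScore);
            used[label] = false;
        }
    }

    // Fraction of examples whose mapped cluster label equals the true label.
    static double accuracy(int[][] counts, int[] mapping) {
        int correct = 0;
        int total = 0;
        for (int c = 0; c < counts.length; c++) {
            for (int v : counts[c]) {
                total += v;
            }
            correct += counts[c][mapping[c]];
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] counts = {{1, 4, 0}, {5, 0, 1}, {0, 1, 6}};
        int[] mapping = bestMapping(counts);
        System.out.println(Arrays.toString(mapping)); // [1, 0, 2]
        System.out.println(accuracy(counts, mapping)); // 15 of 18 examples match
    }
}
```

Exhaustive search grows factorially with the number of clusters, so it is only reasonable for a handful of classes; it is used here because it makes the "exclusive assignment with the fewest errors" idea easy to see.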