Some Questions Regarding Clustering

wufutura Member Posts: 38 Contributor I
Hi everyone! I hope all are safe, healthy and happy this evening. I have several apparently atypical questions about three "newer" clustering methods. I wish to use them on polynomial data imported from an Excel spreadsheet with approximately 300 rows, 45 columns and lots and lots of missing values.
1. The Confusion Matrix Cluster - assume the number of points to be clustered is known and is roughly 190 in total. I have been told that the current techniques tend to introduce some bias. This technique claims to be the "gold standard" by combining a "confusion matrix" with a "k-means" cluster. The difference is then "somehow" (emphasize "somehow") computed to yield the important, unbiased, clustered difference. QUESTION(S):
(a) What minimum number of operators, in what order, would I choose in  the design window?
(b) What operator would I attach to show, on a statistical / performance basis, that I had accomplished my sought-after goal?
2. The Silhouette Coefficient - the use of two operators in 4 different ways: (a) a K-means operator, (b) another, separate K-means operator (identical kind or no?), (c) average the distances between the results yielded by the clusterings in (a) and (b), and finally (d) assume that the low values are outliers and the high values are well clustered and an "optimal" number. QUESTION(S):
(a & b) Are these the exact same K-means operators, and how are they minimally arranged in the design view?
(c) Is the "averaging" done with some particular operator?
(d) What exact operator(s) determine the statistical output that shows the outlier (low scoring) vs well-clustered (high scoring) differences? How are these diagrammed?
3. The Mutual Interaction Information Cluster - the unspecified measurement of how much information is shared between a clustering operator and a "ground truth" classifier. The relationship is meant to detect "non-linear" similarities that effectively reduce bias in the resulting cluster. QUESTION(S):
(a) What is meant by "unspecified measurement", and can it be achieved with a RapidMiner operator, and if so, how?
(b) what is meant by a "ground truth" classifier? I am unfamiliar with the term. What would we call it if it's in inventory?
(c) how would we use our operators to both detect and measure "non-linear" similarities?
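For readers following question 3: "Mutual Interaction Information Cluster" is not a standard name, so the sketch below assumes it refers to the mutual information between cluster assignments and a set of known, trusted labels (the "ground truth"). It is plain Python rather than a RapidMiner process, and the function names and toy labels are made up for illustration. Mutual information depends only on how the two labelings co-occur, so it can pick up relationships that are not linear.

```python
from collections import Counter
from math import log

def mutual_information(cluster_labels, truth_labels):
    """Mutual information (in nats) between two labelings of the same rows."""
    if len(cluster_labels) != len(truth_labels):
        raise ValueError("both label lists must cover the same rows")
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, truth_labels))  # contingency table
    cluster_counts = Counter(cluster_labels)
    truth_counts = Counter(truth_labels)
    mi = 0.0
    for (c, t), n_ct in joint.items():
        # p(c, t) * log( p(c, t) / (p(c) * p(t)) )
        mi += (n_ct / n) * log(n_ct * n / (cluster_counts[c] * truth_counts[t]))
    return mi

def entropy(labels):
    """Shannon entropy (in nats) of one labeling."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())

def normalized_mutual_information(cluster_labels, truth_labels):
    """MI divided by the mean of the two entropies, so the result lies in [0, 1]."""
    h = (entropy(cluster_labels) + entropy(truth_labels)) / 2
    if h == 0:
        return 0.0  # both labelings are constant: nothing to share (a convention chosen here)
    return mutual_information(cluster_labels, truth_labels) / h

# Hypothetical toy data: cluster ids from any clustering step, plus known labels.
clusters = [0, 0, 1, 1, 2, 2]
truth = ["low", "low", "high", "high", "mid", "mid"]
print(normalized_mutual_information(clusters, truth))  # close to 1: the clusters match the labels
```

A score near 1 means the clusters reproduce the known labels; a score near 0 means the two are unrelated. The cluster ids themselves do not need to match the label names, only the grouping matters.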

Please include many, many simple diagrams / screenshots for my simple mind.    Thank you and have a great evening.    Talk tomorrow, I hope & trust.   Richard

Best Answer

  • jacobcybulski Member, University Professor Posts: 376   Unicorn
    edited September 2020 Solution Accepted
    @wufutura I am not exactly sure about the rest of the questions. However, if you are getting errors reading in this World Bank Excel file, make sure you deal with the junk lines at the top. You will need to specify the valid range of cells to read, i.e. A4:AR268, and then the position of the header row, which is 4. It will read it in!
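To make the cell-range advice concrete, here is a minimal plain-Python sketch (not a RapidMiner process) of what a range such as A4:AR268 with header row 4 selects. The helper names `column_index`, `parse_range` and `crop` are invented for this illustration, and the sheet is assumed to be already loaded as a list of rows.

```python
import re

def column_index(letters):
    """Convert Excel column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def parse_range(cell_range):
    """Split a range such as 'A4:AR268' into (first_col, first_row, last_col, last_row)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a cell range: {cell_range!r}")
    c1, r1, c2, r2 = match.groups()
    return column_index(c1), int(r1), column_index(c2), int(r2)

def crop(sheet_rows, cell_range, header_row):
    """Return (header, data_rows); sheet_rows is a list of lists with sheet row 1 first."""
    first_col, first_row, last_col, last_row = parse_range(cell_range)
    if not first_row <= header_row <= last_row:
        raise ValueError("header row must lie inside the range")
    block = [row[first_col - 1:last_col] for row in sheet_rows[first_row - 1:last_row]]
    offset = header_row - first_row
    return block[offset], block[offset + 1:]

first_col, first_row, last_col, last_row = parse_range("A4:AR268")
print(last_col - first_col + 1)  # 44 columns (A through AR)
print(last_row - 4)              # 264 data rows below header row 4
```

So the suggested range skips the three junk lines at the top and yields 44 columns and 264 data rows. If you read the file with pandas instead, `read_excel` has `skiprows`, `usecols` and `nrows` parameters that should achieve the same cropping.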


  • wufutura Member Posts: 38 Contributor I
    ATTENTION! I just discovered that Ingo has done a brief video on this very subject of unbiased clustering, and that the operator exists in the Operators area under the title "Agglomerative Clustering." I have six questions:

    1. given that I have now apparently found a suitable in-house Operator for the needed task what are the preferred settings of the three as currently offered?
    2. what is a simple diagram someone can provide me with that will allow me to do this kind of clustering without much fuss?
    3. do we have an Operator that will allow me to draw a Lorenz Curve from the resultant, newly clustered data-points?
    4. How do I get the "Impute Missing Values" Operator to work, with the proper settings, since it always seems to malfunction, usually offering up the same complaints?
    5. How do I properly load a dataset? I know this sounds like a stupid question, but it's always hit-and-miss for me.
    6. do I need any special output evaluative Operator to use last in line here to make sure that the proper clustering really happened?
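On question 6 (and the earlier silhouette question): in its standard definition the silhouette coefficient is computed from a single clustering, not from two K-means runs. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). Below is a minimal plain-Python sketch of that definition, assuming numeric rows with no missing values and Euclidean distance. It is not a RapidMiner operator, and the names and toy points are illustrative only.

```python
from math import dist  # Python 3.8+

def silhouette_scores(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) for one clustering."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score is defined as 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

# Hypothetical toy data: two tight, well separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
scores = silhouette_scores(points, [0, 0, 1, 1])
print(sum(scores) / len(scores))  # mean silhouette, close to 1 for this labeling
```

Scores near 1 mean a point sits well inside its cluster, scores near 0 mean it lies between clusters, and negative scores suggest it was assigned to the wrong cluster. Comparing the mean score across several cluster counts is a common way to pick a reasonable number of clusters.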

    Thanks everyone! Richard
  • wufutura Member Posts: 38 Contributor I
    Agglomerative Clustering Problem
    • OK, having problems for sure, but it's time for bed. In the meantime, maybe someone in a different timezone can take a quick peek at this snip file AND the accompanying data file as well? Question: what is going on? Can someone send me a simple diagram of how I would properly set up the operators and their settings to get what I'm looking for? Excel file (to be clustered) attached at bottom...

  • wufutura Member Posts: 38 Contributor I
    Please notice that I sent my entire data file and just want to know three things:
    1. How do I properly load it without getting an error?
    2. How do I use the Agglomerative Clustering Operator with proper settings?
    3. Could I have a simple, simple snippet of a diagram so I can set this up in the design view?
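For anyone wondering what agglomerative clustering does under the hood, here is a minimal plain-Python sketch, not the RapidMiner operator itself. It starts with one cluster per row and repeatedly merges the two closest clusters until the requested number remains. The `linkage` argument covers the usual single, complete and average variants, which I assume are the "three settings" mentioned earlier in the thread. The function name and toy points are illustrative.

```python
from math import dist  # Python 3.8+

def agglomerative(points, n_clusters, linkage="average"):
    """Naive bottom-up clustering; returns one cluster label per point."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start: every point alone

    def cluster_distance(a, b):
        pairwise = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairwise)    # closest pair of members
        if linkage == "complete":
            return max(pairwise)    # farthest pair of members
        return sum(pairwise) / len(pairwise)  # average over all pairs

    while len(clusters) > n_clusters:
        # find the two clusters that are closest under the chosen linkage
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: cluster_distance(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[x].extend(clusters[y])  # merge y into x
        del clusters[y]

    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical toy data: two tight pairs and one far-away point.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 50)]
print(agglomerative(points, n_clusters=3))  # [0, 0, 1, 1, 2]
```

This naive version recomputes every distance on each pass, which is fine for a few hundred rows but slow beyond that. It also needs complete numeric rows, so missing values have to be imputed or the affected rows dropped before clustering.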
