Some Questions Regarding Clustering
Hi Everyone! - Hope all are safe, healthy and happy this evening. I have several "apparently" atypical questions regarding 3 "newer" clustering methods. I wish to use them on polynomial data imported from an Excel spreadsheet with approximately 300 rows, 45 columns and lots and lots of missing values.
1. The Confusion Matrix Cluster - assume one has a known number of points to be clustered, approximately 190 in total. I have been told that the current techniques tend to introduce some bias. This technique claims to be the "gold standard" by combining a "confusion matrix" with a "k-means" cluster. The difference is then "somehow" (emphasize "somehow") computed to yield the important & unbiased & clustered difference. QUESTION(S):
(a) What minimum number of operators, in what order, would I choose in the design window?
(b) What operator would I want to attach to show that I had accomplished my sought-after goal on a statistical / performance basis?
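For comparison outside RapidMiner, one common way to set cluster assignments against known labels is a contingency table (the clustering analogue of a confusion matrix), summarized by a score such as purity. The sketch below is only my guess at what such a technique might compute, not the method described above; the function names and the toy data are hypothetical.

```python
from collections import Counter

def contingency_table(true_labels, cluster_ids):
    """Count how often each (known label, cluster id) pair occurs."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("both label lists must have the same length")
    return Counter(zip(true_labels, cluster_ids))

def purity(true_labels, cluster_ids):
    """Fraction of points that carry the majority label of their cluster."""
    table = contingency_table(true_labels, cluster_ids)
    best_per_cluster = {}
    for (label, cluster), count in table.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
    return sum(best_per_cluster.values()) / len(true_labels)

# Hypothetical toy data: six points with known labels and k-means cluster ids.
known = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
table = contingency_table(known, clusters)
score = purity(known, clusters)
print(dict(table), score)
```

A purity of 1.0 would mean every cluster contains a single known label; lower values mean the clusters mix labels.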
2. The Silhouette Coefficient - the use of two operators in 4 different ways: (a) a K-means operator; (b) another, separate K-means operator (identical kind or no?); (c) averaging the distances between the points clustered by (a) and (b); and finally (d) assuming that the low values are outliers and the high values are well clustered & an "optimal" number. QUESTION(S):
(a & b) are these using the exact same K-means operators and how are they minimally arranged in the design view?
(c) is the "averaging" done with the use of some particular operator?
(d) what exact operator(s) determines the statistical output that shows the outlier (low scoring) vs well-clustered (high scoring) differences? How are these diagrammed?
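For reference, the textbook silhouette value is computed from a single clustering result rather than from two separate K-means runs: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the value is (b - a) / max(a, b). Values near 1 suggest a well-placed point, and values near or below 0 suggest a poorly placed one. Below is a minimal plain-Python sketch of that definition, assuming Euclidean distance; the function name and toy points are hypothetical, and this is not how any RapidMiner operator is implemented.

```python
import math

def silhouette_values(points, labels):
    """Return one silhouette value per point for a single clustering."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean Euclidean distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # usual convention for a one-point cluster
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values

# Hypothetical toy data: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_values(points, [0, 0, 1, 0])   # last point misassigned
print(good, bad)
```

Averaging the per-point values gives one score per choice of k, which is how the coefficient is commonly used to compare candidate numbers of clusters.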
3. The Mutual Interaction Information Cluster - the unspecified measurement of how much information is shared between a clustering operator and a "ground truth" classifier. The relationship is meant to detect "non-linear" similarities that effectively reduce bias in the resulting cluster. QUESTION(S):
(a) what is meant by "unspecified measurement" and can it be achieved by use of a RapidMiner operator, and if so, how?
(b) what is meant by a "ground truth" classifier? I am unfamiliar with the term. What would we call it if it's in inventory?
(c) how would we use our operators to both detect and measure "non-linear" similarities?
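For context, "ground truth" usually means a set of known, trusted reference labels, and mutual information measures how much knowing a point's cluster tells you about its reference label (and vice versa). The following is a minimal plain-Python sketch of the standard mutual information between two labelings, in nats; it assumes this is the quantity the description refers to, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same points."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi

# Hypothetical toy data: four points with known reference labels.
truth = [0, 0, 1, 1]
same_grouping = [1, 1, 0, 0]  # same groups, different cluster names
unrelated = [0, 1, 0, 1]      # clusters carry no information about the labels
mi_same = mutual_information(truth, same_grouping)
mi_unrelated = mutual_information(truth, unrelated)
print(mi_same, mi_unrelated)
```

Because the measure depends only on how the points are grouped, not on the cluster names or on any linear relationship, it is sometimes described as capturing "non-linear" agreement between a clustering and the reference labels.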
Please include many, many simple diagrams / screenshots for my simple mind. Thank you and have a great evening. Talk tomorrow, I hope & trust. Richard