Some Questions Regarding Clustering
Hi everyone! I hope all of you are safe, healthy and happy this evening. I have several apparently atypical questions about three "newer" clustering methods. I wish to use them on polynomial data imported from an Excel spreadsheet with approximately 300 rows, 45 columns and a great many missing values.
1. The Confusion Matrix Cluster: assume one has a known number of points to be clustered, approximately 190 in total. I have been told that the current techniques tend to introduce some bias. This technique claims to be the "gold standard" by combining a "confusion matrix" with a "k-means" clustering. The difference is then "somehow" (emphasis on "somehow") computed to yield the important, unbiased, clustered difference. QUESTIONS:
(a) What minimum number of operators, in what order, would I choose in the design window?
(b) What operator would I attach to show, on a statistical / performance basis, that I had accomplished my sought-after goal?
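For question 1, one common way to compare a k-means result against known labels is a cluster-versus-label contingency table, which plays the role of a confusion matrix for clusterings. The sketch below is plain Python rather than a RapidMiner process, and it is not necessarily the method described above. The function names `contingency_table` and `purity` and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def contingency_table(true_labels, cluster_ids):
    """Count how often each (cluster, known label) pair occurs."""
    table = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        table[cluster][label] += 1
    return table


def purity(true_labels, cluster_ids):
    """Fraction of points that carry the majority label of their cluster."""
    table = contingency_table(true_labels, cluster_ids)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(true_labels)


# Hypothetical toy data: six points with known labels and k-means cluster ids.
known = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]

table = contingency_table(known, clusters)
score = purity(known, clusters)

for cluster, counts in sorted(table.items()):
    print(cluster, dict(counts))
print("purity:", round(score, 3))
```

A purity close to 1 means each cluster is dominated by one known label; it says nothing about whether the number of clusters is sensible.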
2. The Silhouette Coefficient: the use of two operators in four different ways: (a) a k-means operator; (b) another, separate k-means operator (the identical kind or not?); (c) averaging the distances between the results yielded by the clusterings in (a) and (b); and finally (d) assuming that the low values are outliers and the high values are well clustered and point to an "optimal" number. QUESTIONS:
(a & b) Are these the exact same k-means operators, and how are they minimally arranged in the design view?
(c) Is the "averaging" done with some particular operator?
(d) What exact operator(s) determine the statistical output that shows the difference between outliers (low scoring) and well-clustered points (high scoring)? How are these diagrammed?
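For question 2, it may help to see the standard silhouette definition worked out, since it needs only a single clustering result rather than two. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the lowest mean distance to the points of any other cluster, and the silhouette value is `(b - a) / max(a, b)`. The sketch below is plain Python, assumes Euclidean distance and at least two clusters, and uses made-up points. It is not a RapidMiner process.

```python
import math


def silhouette_values(points, labels):
    """Return one silhouette value per point (needs at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # usual convention for a single-point cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values


# Hypothetical toy data: two tight groups plus one point sitting between them.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (2.5, 3)]
labels = [0, 0, 1, 1, 0]

values = silhouette_values(points, labels)
average = sum(values) / len(values)
```

Values near 1 indicate points that sit well inside their cluster, values near 0 indicate points between clusters, and negative values suggest a point may be in the wrong cluster. Comparing the average across several choices of k is one common way to pick the number of clusters.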
3. The Mutual Interaction Information Cluster: the unspecified measurement of how much information is shared between a clustering operator and a "ground truth" classifier. The relationship is meant to detect "nonlinear" similarities that effectively reduce bias in the resulting clusters. QUESTIONS:
(a) What is meant by "unspecified measurement", and can it be achieved with a RapidMiner operator? If so, how?
(b) What is meant by a "ground truth" classifier? I am unfamiliar with the term. What would we call it if it's in inventory?
(c) How would we use our operators to both detect and measure "nonlinear" similarities?
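For question 3, "ground truth" usually just means the known, trusted labels that a result is compared against. Mutual information between two labelings is the sum over label pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))). The sketch below computes it in plain Python from label counts. The function name `mutual_information` and the toy labelings are illustrative assumptions, and the result is in nats (natural logarithm).

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)

    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy data: known labels versus two candidate clusterings.
truth = ["x", "x", "y", "y"]
matching = [0, 0, 1, 1]    # clusters line up with the known labels
unrelated = [0, 1, 0, 1]   # clusters carry no information about the labels

mi_matching = mutual_information(truth, matching)
mi_unrelated = mutual_information(truth, unrelated)
```

Because the measure depends only on how often labels co-occur, it does not care what the cluster ids are called and does not assume any linear relationship between the two labelings.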
Please include many, many simple diagrams / screenshots for my simple mind. Thank you and have a great evening. Talk tomorrow, I hope & trust. Richard
Best Answer

jacobcybulski (Member, University Professor)
@wufutura I am not sure about the rest of the questions. However, if you are getting errors reading in this World Bank Excel file, make sure you deal with the junk lines at the top. You will need to specify the valid range of cells to read, i.e. A4:AR268, and then the position of the header row, which is 4. It will then read in!
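As a quick sanity check on a cell range like the one above, the small sketch below works out how many rows and columns the range covers, so the result can be compared with what shows up after import. It is plain Python, not a RapidMiner feature, and the helper names `column_number` and `range_shape` are made up for illustration.

```python
import re


def column_number(letters):
    """Convert Excel column letters to a 1-based index (A -> 1, AR -> 44)."""
    number = 0
    for ch in letters.upper():
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


def range_shape(cell_range):
    """Return (rows, columns) covered by a range such as 'A4:AR268'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a cell range: {cell_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    rows = int(last_row) - int(first_row) + 1
    cols = column_number(last_col) - column_number(first_col) + 1
    return rows, cols


rows, cols = range_shape("A4:AR268")
data_rows = rows - 1  # the first row of the range (row 4) is the header
print(rows, cols, data_rows)
```

If the imported table has a different number of examples or attributes than this suggests, the range or the header row setting is the first thing to recheck.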
Answers
1. Given that I have now apparently found a suitable in-house Operator for the needed task, what are the preferred settings of the three as currently offered?
2. What simple diagram can someone provide that will allow me to do this kind of clustering without much fuss?
3. Do we have an Operator that will allow me to draw a Lorenz curve from the resulting, newly clustered data points?
4. How do I get the "Impute Missing Values" Operator to work with the proper settings? It always seems to malfunction, usually offering up the same complaints.
5. How do I properly load a dataset? I know this sounds like a stupid question, but it's always hit-and-miss for me.
6. Do I need any special evaluative output Operator, last in line, to make sure that the proper clustering really happened?
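On question 3, whichever tool ends up drawing it, the points of a Lorenz curve are simple to compute: sort the values in ascending order and plot the cumulative share of the total against the cumulative share of the cases. The sketch below is plain Python and assumes non-negative values with a positive total. The function name `lorenz_points` and the sample values are illustrative only.

```python
def lorenz_points(values):
    """Return (population share, cumulative value share) pairs, from (0, 0) to (1, 1)."""
    ordered = sorted(values)
    if not ordered or ordered[0] < 0:
        raise ValueError("values must be non-empty and non-negative")
    total = sum(ordered)
    if total <= 0:
        raise ValueError("values must have a positive total")

    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / len(ordered), running / total))
    return points


# Hypothetical values, e.g. one numeric attribute taken from a single cluster.
curve = lorenz_points([4, 1, 2, 1])
for share_of_cases, share_of_total in curve:
    print(share_of_cases, share_of_total)
```

The resulting pairs can be exported and plotted as an ordinary line chart; the further the curve sags below the diagonal, the more unequal the distribution.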
Thanks everyone! Richard