"clustering atomic data files? [UPDATE]"
So I'm a total noob at data mining. I have never used such programs and thought I'd post a quick question about the clustering options of the tool.
I have a dataset of between 1000 and 10^6 lines; each line has 6 values, namely x y z vx vy vz (for the interested: coordinates and velocities). The data in 2D looks like this:
So the dataset consists of several clusters, ranging in size from a single point up to maybe 100-250 points (or so ...). For my studies I created an artificial dataset with 1000 points: two big groups of points and 10-20 single dots (for testing the program, obviously).
The goal as an output would be at least two histograms like:
"number of clusters of size N (the grey lines in the picture)" vs "size N (black dots inside)" or
"number of clusters of size N" vs "center of mass velocity"
I tried a few operators: Learner -> Unsupervised -> Clustering -> EM_Clustering or W-XMeans worked best, but I have to give initial values like the number of clusters, which in the above example would be 8 (every single dot counts as a single "cluster"). If the program groups each of the big ones correctly, it also lumps the single ones together as one kind :-(. So n=6 would give the best result (the big ones, plus one kind of single-dot clusters). So that's not what I want.
For the histograms: I would need a new column (which I get from the clustering: cluster 0, cluster 1, ...) and then a sort of if-condition to sum up the important columns (velocities, ...), and then again do some analysis with the new columns, and so on.
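To make the intended post-processing concrete, here is a minimal sketch in plain Python (outside RapidMiner), assuming a cluster label per line is already available from some clustering step. The function name `cluster_summary` is made up for illustration, and equal particle masses are assumed, so the center-of-mass velocity reduces to the mean velocity of the cluster.

```python
from collections import Counter, defaultdict


def cluster_summary(rows, labels):
    """rows: (x, y, z, vx, vy, vz) tuples; labels: one cluster id per row."""
    # Number of points in each cluster.
    sizes = Counter(labels)

    # Sum the velocity components per cluster (the "if-condition" step).
    vel_sum = defaultdict(lambda: [0.0, 0.0, 0.0])
    for row, lab in zip(rows, labels):
        for k in range(3):
            vel_sum[lab][k] += row[3 + k]

    # ASSUMPTION: equal masses, so centre-of-mass velocity is the plain mean.
    v_com = {lab: tuple(s / sizes[lab] for s in vel_sum[lab]) for lab in sizes}

    # Histogram: cluster size N -> number of clusters of that size.
    size_hist = Counter(sizes.values())
    return sizes, v_com, size_hist
```

`size_hist` gives the first histogram directly ("number of clusters of size N" vs "size N"); the second one would come from binning the values in `v_com`.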
Is something like this possible? I read, or tried to read, the manual, but it's 600 pages and searching for clustering just leads me to the clusterwrite/read operator.
Maybe I'm applying the operators wrong? I mean, I cannot give a fixed number of clusters, because a huge dataset has 10^5 lines and maybe 1000 groups, all of different sizes. Also, I didn't get the thing with the attributes of the dataset.
But the distance in coordinates, say less than a certain value, would qualify some atoms as belonging to a cluster. Can I give a cutoff radius or some other criterion to the clustering algorithm?
grateful for every hint,
Edit: the way I did this was: new Operator -> IO -> Examples -> ExampleSource, then New Operator -> Unsupervised -> Clustering -> ...
Edit2: the ideal cluster criterion would be: are there points inside a specific radius R_0 of particle n? If so, they count toward cluster c1. Then: is there another particle within R_0? If yes, go on; if not, move on to the next particle. If a particle has no neighbours inside R_0, it counts as an independent cluster c2, etc.
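The criterion in Edit2 is what is usually called friends-of-friends clustering: two particles belong to the same cluster if they are linked by a chain of neighbours, each step no longer than R_0. That is equivalent to single-linkage clustering cut at distance R_0, and no number of clusters has to be given in advance. A minimal sketch in plain Python follows (not a RapidMiner operator). The function name `friends_of_friends` and the grid-based neighbour search are illustrative choices, and only the coordinates x y z are assumed to enter the distance.

```python
from collections import defaultdict
from itertools import product
from math import floor


def friends_of_friends(points, r0):
    """Return one cluster id per point; points within r0 of each other are linked."""
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Bin points into cubic cells of edge r0, so only adjacent cells need checking.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(c / r0) for c in p)].append(idx)

    dim = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dim))
    r0_sq = r0 * r0
    for cell, members in cells.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j:  # test each pair only once
                        d_sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                        if d_sq <= r0_sq:
                            union(i, j)

    # Relabel union-find roots as consecutive integers 0, 1, 2, ...
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

A particle with no neighbour inside R_0 ends up as its own cluster of size 1, as asked for above. The returned list can be used as the new cluster column for the histograms.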