clustering atomic data files? [UPDATE]

06-03-2009 09:07 AM

hi,

So I'm a total noob at data mining. I have never used such programs, and I thought I'd post a quick question about the clustering options of the tool.

I have a dataset of between 1000 and 10^6 lines; each line has 6 values, namely x y z vx vy vz (for those interested: coordinates and velocities). The data in 2D looks like this:

So the dataset consists of several clusters, ranging in size from 1 single point up to maybe 100-250 points (or so ...). For my studies I created an artificial one with 1000 points: two big groups of points and 10-20 single dots (for testing the program, obviously).

The desired output would be at least two histograms, such as:

"number of clusters of size N (the grey lines in the picture)" vs "size N (black dots inside)" or

"number of clusters of size N" vs "center of mass velocity"

I tried a few operators: Learner -> Unsupervised -> Clustering -> EM_Clustering or W-XMeans worked best, but I have to give initial values such as the number of clusters, which in the above example would be 8 (every single dot counts as its own "cluster"). If the program groups each of the big ones correctly, it also lumps all the single ones together as one kind :-(. So n=6 would give the best result (the big ones, plus one kind of single clusters). That's not what I want.

For the histograms: I would need a new column (which I get from the clustering (cluster 0, cluster 1, ...)), then a sort of if-condition to sum up the important columns (velocities, ...), and then again do some analysis with the new columns, and so on.
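As a rough illustration of the first histogram ("number of clusters of size N" vs "size N"), here is a minimal Python sketch that works outside RapidMiner. It assumes the cluster column has already been exported as a plain list of labels; the function name `cluster_size_histogram` is made up for this example.

```python
from collections import Counter


def cluster_size_histogram(labels):
    """Map each cluster size N to the number of clusters that have size N."""
    # First count: cluster label -> number of points carrying that label.
    sizes = Counter(labels)
    # Second count: size N -> number of clusters with exactly N points.
    return dict(sorted(Counter(sizes.values()).items()))


# Example: one cluster of 3 points, one of 2 points, two single dots.
labels = ["cluster0"] * 3 + ["cluster1"] * 2 + ["cluster2", "cluster3"]
histogram = cluster_size_histogram(labels)
```

The returned dictionary can be plotted directly as a bar chart, with the keys on the "size N" axis and the values on the "number of clusters" axis.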

Is something like this possible? I read, or tried to read, the manual, but it's 600 pages, and searching for clustering just leads me to the clusterwrite/read operator.

Maybe I'm applying the operators wrong? I mean, I cannot give a certain number of clusters, because a huge dataset has 10^5 lines and maybe 1000 groups, all of different sizes. Also, I didn't get the thing with the attributes of the dataset.

But the distance in coordinates, say less than a certain value, would qualify some atoms as belonging to a cluster. Can I give a cutoff radius or some other criterion to the cluster algorithm?

grateful for every hint,

Stever

Edit: the way I did this was: New Operator -> IO -> Examples -> ExampleSource, then New Operator -> Unsupervised -> Clustering -> ...

Edit2: the ideal cluster criterion would be: are there points inside a specific radius R_0 of particle n? If so, they count towards cluster c1. Then: is there another particle within R_0 of those? If yes, go on; if not, move on to the next particle. If a particle has no neighbours inside R_0, it counts as an independent cluster c2, etc.
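The criterion in Edit2 amounts to chaining together all particles that are closer than R_0 to each other. A minimal sketch of that idea in plain Python follows; it is one possible reading of the description, not a RapidMiner operator. The name `cluster_by_cutoff`, the cell-list binning, and the union-find bookkeeping are choices made for this example only.

```python
from collections import defaultdict
from itertools import product
from math import floor


def cluster_by_cutoff(points, r0):
    """Label points so that two points share a label whenever they are
    linked by a chain of neighbours, each closer than r0 to the next."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Cell list: bin points into cubes of side r0 so that only the
    # 27 surrounding cells have to be searched for neighbours.
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cells[(floor(x / r0), floor(y / r0), floor(z / r0))].append(idx)

    r0_sq = r0 * r0
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = cells.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbours:
                    if j <= i:
                        continue  # visit each pair only once
                    dist_sq = sum((a - b) ** 2
                                  for a, b in zip(points[i], points[j]))
                    if dist_sq < r0_sq:
                        parent[find(i)] = find(j)  # merge the two clusters

    # Relabel the roots as consecutive cluster ids 0, 1, 2, ...
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


# Three chained particles, one isolated particle, one close pair.
points = [(0, 0, 0), (0.5, 0, 0), (1.0, 0, 0),
          (5, 5, 5), (10, 0, 0), (10.4, 0, 0)]
labels = cluster_by_cutoff(points, r0=0.6)
```

Only the coordinates x y z are used here; the velocities would be carried along separately. An isolated particle ends up as its own cluster of size 1, as asked for in Edit2.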

2 REPLIES

06-04-2009 04:52 AM

Hi,

First of all: nothing you mentioned seems to be impossible, but it will need a more or less complex process construction. If I understood you correctly, you should take a look at the operators named Aggregation and AttributeConstruction. You will probably need them to build the data for the histograms.

The algorithm you proposed for clustering (cutting off above a distance) is similar to the behavior of the DBScan clustering, which uses a density measure to cluster and returns all outlying data points as noise (first cluster).
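To make the relation between the cutoff idea and DBSCAN concrete, here is a small textbook-style DBSCAN in plain Python. It is a sketch for illustration, not RapidMiner's DBScan operator; the parameter names `eps` and `min_points` and the use of -1 as the noise label are assumptions of this example, and the brute-force neighbour search is only suitable for small test sets.

```python
def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""

    def region(i):
        # Brute-force neighbourhood query; the point itself is included.
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region(i)
        if len(neighbours) < min_points:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
        cluster += 1
    return labels


points = [(0, 0), (0.5, 0), (1.0, 0), (5, 5), (10, 0), (10.4, 0)]
with_noise = dbscan(points, eps=0.6, min_points=2)
no_noise = dbscan(points, eps=0.6, min_points=1)
```

With `min_points=2` the isolated point is reported as noise. With `min_points=1` every point counts as a core point, so the isolated point becomes its own cluster and the result matches a pure cutoff-radius grouping.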

One final hint on clustering performance: you can't really say whether a clustering is good or bad, because usually you don't have any objective criterion against which to measure the cluster assignment. So there are two ways to assess a clustering: using a heuristic measure like the Davies-Bouldin index (see ClusterCentroidEvaluator), or having real labels available, which is usually not the case in real-world applications. In the latter case you could use Cluster2Prediction and afterwards evaluate the clustering like a classification algorithm.

Greetings,

Sebastian

06-05-2009 05:30 AM

hi,

Yes, everything works great; I just had to find the right cluster algorithm. As you said, DBScan works best for me. Still, there is one problem left which I haven't figured out how to solve:

The cluster algorithm adds a new column to my data, which I convert to a number ("cluster1" -> 1.0, "cluster87" -> 87.0) to do some awk stuff later on. The table has a few entries like

cluster id x y z ...

...

I would like to calculate an average over the x (y, z, ...) columns, but separately for each cluster: it should sum up all values in the x column for cluster1 only and then give me the average, then do the same for cluster2, and so on. I haven't found the right operator for doing this. Any hints?
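For comparison, the per-cluster average described above is a group-by-and-mean operation. A minimal Python sketch, assuming each row is a tuple that starts with the numeric cluster id followed by the value columns (the function name `cluster_means` is hypothetical):

```python
from collections import defaultdict


def cluster_means(rows):
    """Average every value column separately for each cluster id.

    rows: iterable of (cluster_id, x, y, z, ...) tuples.
    Returns {cluster_id: (mean_x, mean_y, mean_z, ...)}.
    """
    sums = {}
    counts = defaultdict(int)
    for cluster_id, *values in rows:
        if cluster_id not in sums:
            sums[cluster_id] = [0.0] * len(values)
        for k, value in enumerate(values):
            sums[cluster_id][k] += value  # running sum per column
        counts[cluster_id] += 1
    return {c: tuple(s / counts[c] for s in sums[c]) for c in sums}


rows = [
    (1.0, 0.0, 2.0, 4.0),
    (1.0, 2.0, 4.0, 6.0),
    (87.0, 5.0, 5.0, 5.0),
]
means = cluster_means(rows)
```

Applied to the velocity columns vx vy vz, the same function gives a per-cluster mean velocity, which equals the centre-of-mass velocity only if all particles have the same mass.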

best wishes,

Stever
