"Cluster analysis of unknown data"

shaihulud Member Posts: 20 Maven
edited May 2019 in Help
Hello Community,

I want to read data from CSV files.
Each line represents an instance with a name and a couple of attributes. The attributes AND the attribute values are mostly strings, and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances in the data without knowing the groups (in other words, I don't know which classes/concepts/clusters the instances need to be mapped to because, as I said, the data can be arbitrary).
I need to narrow the mass of instances down to representatives as well as I can.
Even though the data is arbitrary, many instances have similar or equal attribute values.

I think cluster analysis is the right approach.

I have already experimented with some clustering methods, and the results look promising. Nevertheless, I would love to know whether you already have experience with such a scenario, so that you can give me a heads-up on which cluster analysis method(s) to focus on or start with.

Another question: do you have ideas other than cluster analysis for solving this problem?

I would appreciate any help on the topic.



  • shaihulud Member Posts: 20 Maven
    OK, I've read a bunch of stuff today and kind of have an idea of what I need to do.
    It's impossible for me to avoid using several algorithms in the cluster-creation cycle.

    So if anybody has specific insight into that scenario, I would appreciate any help or cooperation, but the groundwork is pretty clear to me.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Just a hint: if you are going to process arbitrary texts, I would recommend using the Text Processing Extension to build a word vector from the texts before clustering them. Otherwise there is no information about the distance between two arbitrary strings.

  • shaihulud Member Posts: 20 Maven
    Hi Sebastian,

    Thanks for the hint, but I don't quite understand it. Why is preparing vectors from the attribute values different from just using the values for clustering? Can you elaborate, or direct me to a paper or article that does?

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    The problem is that there is no numerical distance defined between the strings "Hallöchen" and "Hi", apart from the fact that they aren't equal.
    If you want a good clustering of texts, you need a distance measure that somehow captures how similar two texts are. This is usually done by forming a bag of words. You can search for that term and you will probably find many sources.
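
For readers who want to see the bag-of-words idea concretely, here is a minimal sketch in plain Python rather than RapidMiner. It is not what the Text Processing Extension does internally. The tokenizer, the helper names `bag_of_words` and `cosine_distance`, the choice of cosine distance, and the sample rows are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two word-count vectors."""
    # Words missing from b count as 0, so only shared words add to the dot product.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty text is maximally distant from everything
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical attribute values joined into one string per instance.
rows = [
    "red steel bolt 10mm",
    "red steel bolt 12mm",
    "blue plastic cap",
]
vectors = [bag_of_words(row) for row in rows]

# Print the pairwise distances between all instances.
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        distance = cosine_distance(vectors[i], vectors[j])
        print(f"{rows[i]!r} vs {rows[j]!r}: {distance:.2f}")
```

With raw strings, the first two rows are simply "not equal". As word-count vectors they share three of their four words, so they end up much closer to each other than to the third row, and that graded distance is what a clustering algorithm needs.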
