"Cluster analysis of unknown data"

shaihulud · November 2010

Hello Community,

I want to read data from CSV files.
Each line represents an instance with a name and a couple of attributes. The attributes AND the attribute values are mostly strings, and they can be ARBITRARY.
I need to find a way to identify some representatives for each "group" of instances in the data without knowing the groups in advance (in other words: I don't know which classes/concepts/clusters the instances need to be mapped to because, as I said, the data can be arbitrary).
I need to narrow the mass of instances down to representatives as well as I can.
Even though the data is arbitrary, many instances share similar or equal attribute values.

I think cluster analysis is the right approach.

I have already experimented with some clustering methods, and the results look promising. Nevertheless, I would love to know whether you already have experience with such a scenario, so that you can give me a heads-up on which cluster analysis method(s) to focus on or start from.

Another question: do you have ideas other than cluster analysis for solving this problem?

I would appreciate any help on the topic.

greetings

shaihulud · November 2010

OK, I've read a bunch of stuff today and kind of have an idea of what I need to do.
It's impossible for me to avoid using several algorithms in the cluster creation cycle.

So if anybody has specific insight into that scenario, I would appreciate any help or cooperation, but the groundwork is pretty clear to me.

land · November 2010

Hi,
just a hint: if you are going to process arbitrary texts, I would recommend using the Text Processing Extension to build a word vector from the texts before clustering them. Otherwise there is no information about the distance between two arbitrary strings.
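To illustrate what building a word vector means outside of RapidMiner, here is a minimal Python sketch using only the standard library. It is not how the Text Processing Extension works internally. The sample data, the column names, and the helpers `row_to_word_vector` and `load_word_vectors` are all made up for illustration, and prefixing each word with its attribute name is just one possible design choice.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the real files and column names are unknown.
SAMPLE_CSV = """name,color,shape,note
a1,red,round,small red ball
a2,red,round,large red ball
b1,blue,square,blue wooden box
"""

def row_to_word_vector(row, name_column="name"):
    """Turn one CSV row into a bag of words (token -> count)."""
    tokens = []
    for attribute, value in row.items():
        # Skip the instance name and empty cells.
        if attribute == name_column or not value:
            continue
        for word in value.lower().split():
            # Prefix with the attribute name so that 'red' in 'color'
            # stays distinct from 'red' in 'note' (an assumption, not a rule).
            tokens.append(f"{attribute}={word}")
    return Counter(tokens)

def load_word_vectors(csv_text, name_column="name"):
    """Map each instance name to its word vector."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[name_column]: row_to_word_vector(row, name_column)
            for row in reader}

vectors = load_word_vectors(SAMPLE_CSV)
```

Each instance is then a sparse vector of counts, which gives a clustering algorithm something numeric to compare instead of opaque strings.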

Greetings,
Sebastian

shaihulud · November 2010

Hi Sebastian

Thanks for the hint, but I don't quite understand it. Why is preparing vectors from the attribute values different from just taking the values for clustering? Can you elaborate or point me to a paper/article that explains it?

Thanks
shai

land · November 2010

Hi,
the problem is that there is no numerical distance defined between the strings "Hallöchen" and "Hi", beyond the fact that they aren't equal.
If you want a good clustering of texts, you need a distance measure that somehow captures the similarity of texts. This is usually done by forming a bag of words. You can google for that and you will probably find many sources.
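As a rough sketch of that idea in plain Python: each text becomes a bag of words, cosine distance compares two bags, and the instance with the smallest total distance to the others in a group (the medoid) can serve as its representative. Cosine distance and the medoid are assumptions chosen for illustration, not something prescribed by RapidMiner, and the example texts are made up.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity of two bags; 1.0 if either bag is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def medoid(bags):
    """Index of the bag with the smallest total distance to all others."""
    totals = [sum(cosine_distance(b, other) for other in bags) for b in bags]
    return totals.index(min(totals))

# Hypothetical texts, purely for illustration.
texts = ["red round ball", "red round balloon", "small red ball", "blue square box"]
bags = [bag_of_words(t) for t in texts]

near = cosine_distance(bags[0], bags[1])   # two of three words shared
far = cosine_distance(bags[0], bags[3])    # no words shared
representative = texts[medoid(bags[:3])]   # medoid of the three "red" texts
```

Note that two single-word strings with no word in common, such as "Hallöchen" and "Hi", still end up at the maximum distance; the bag of words only helps once the texts share vocabulary.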

Greetings,
Sebastian

Altair RapidMiner Community
