
[SOLVED] Clustering(K-Means) data from database

Ruca Member Posts: 13 Contributor II
edited November 2018 in Help
Hi all,
Sorry if this problem has already been solved; I'm a newbie and was not able to find a similar one.
My problem is the following:
I have a table with the following columns: doc_id, term, weight. For each document there are several term occurrences, with a weight associated with each term. This means that each document is characterized by a set of (term, weight) pairs.
Example:
Doc_id  term      weight
Doc1    color     0,45
Doc1    height    0,22
Doc1    weight    0,05
Doc2    altitude  0,04
Doc2    weight    0,35
I intend to run a k-means clustering analysis to see which documents are most similar to each other, given a predefined number of clusters k.
When I connect the "Read Database" operator to the "Clustering" operator, an error message appears saying that clustering doesn't accept polynominal attributes. It's not my intention to convert both the "doc_id" and "term" attributes to numerical ones. The result I'm expecting should be something similar to:
Cluster_0 (Doc1, Doc32, Docx,...), Cluster_1(Doc_2, Doc45, Docy,...), etc.
Has anyone come across such a problem?
Thank you for your support.

Best regards,

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Ruca,

    First of all, you have to De-Pivot your data with the operator of the same name to get a data set that contains exactly one document per row, like this:

    Doc_id  color  height  weight  altitude
    Doc1    0,45   0,22    0,05    0
    Doc2    0      0       0,35    0,04
    Then define Doc_id as Id with Set Role, and apply the clustering. That's it :)

    Best, Marius
  • Ruca Member Posts: 13 Contributor II
    Thank you Marius for your support. It worked like a charm.
    I've used the PIVOT operator instead of the DE-PIVOT.
    Regards,
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Ruca wrote:
    I've used the PIVOT operator instead of the DE-PIVOT.
    Oh sorry, of course you have to use Pivot oO

    Happy Mining!

    ~Marius
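
For readers who want to see the idea outside RapidMiner, here is a minimal plain-Python sketch of the approach the thread settles on: pivot the (doc_id, term, weight) rows into one row per document, treat doc_id as an identifier rather than a feature, and then run k-means. This is not how RapidMiner implements its operators. The function names `pivot` and `kmeans`, the use of 0.0 for missing terms, and seeding the centroids with the first k documents are all assumptions made for illustration, and the decimal commas from the post are written as decimal points.

```python
from collections import defaultdict

# Long-format rows as in the question: (doc_id, term, weight).
rows = [
    ("Doc1", "color", 0.45),
    ("Doc1", "height", 0.22),
    ("Doc1", "weight", 0.05),
    ("Doc2", "altitude", 0.04),
    ("Doc2", "weight", 0.35),
]


def pivot(rows):
    """Turn (doc_id, term, weight) rows into one weight vector per document."""
    terms = sorted({term for _, term, _ in rows})
    by_doc = defaultdict(dict)
    for doc_id, term, weight in rows:
        by_doc[doc_id][term] = weight
    # Assumption: a term that never occurs in a document gets weight 0.0.
    table = {doc: [vals.get(t, 0.0) for t in terms] for doc, vals in by_doc.items()}
    return terms, table


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(table, k, iterations=20):
    """Tiny k-means; doc_id is used only as a label, never as a feature."""
    doc_ids = list(table)
    if k > len(doc_ids):
        raise ValueError("k must not exceed the number of documents")
    # Assumption: seed the centroids with the first k documents (deterministic).
    centroids = [list(table[d]) for d in doc_ids[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        new_assignment = {
            d: min(range(k), key=lambda c: squared_distance(table[d], centroids[c]))
            for d in doc_ids
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [table[d] for d in doc_ids if assignment[d] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = defaultdict(list)
    for d, c in assignment.items():
        clusters[c].append(d)
    return dict(clusters)


terms, table = pivot(rows)
clusters = kmeans(table, k=2)
for cluster_id, docs in sorted(clusters.items()):
    print(f"Cluster_{cluster_id} ({', '.join(docs)})")
```

With only the two example documents and k=2, each document simply ends up in its own cluster; on a real table the same two steps would produce groups in the Cluster_0 (Doc1, Doc32, ...) shape asked for in the question.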