"distance measures of text attributes"

shaihulud Member Posts: 20 Contributor II
edited May 2019 in Help
Hi

I've read that the distance measures used by most cluster analysis algorithms merely check whether the text attributes of two objects a and b are the same. In other words, they count how many text attributes have the same value. Do they not take string similarity into account? For example: if object a has an attribute x with the value "car" and object b has the attribute x with the value "cars", are they evaluated as a match?

By the way, is this the right section for this kind of question?

Thanks for the help.

Answers

  • shaihulud Member Posts: 20 Contributor II
    Hi Guys

    I would really love to read some answers to my question. Furthermore, I would like to know whether there are distance measures for cluster analysis that take semantics into account, so that, for example, an attribute value 'car' would be matched with an attribute value 'automobile'.

    Guys, I would really appreciate any help you can give me on these distance measurement topics.

    greez
  • el_chief Member Posts: 63 Contributor II
    Generally, what you want to do is calculate the TF-IDF score of each term in a document. This tells you how important a term is within the document it appears in, weighted against how common the same term is across the rest of your documents.

    Then, you would calculate the distance between documents, based on their TF-IDF term scores, generally using the cosine similarity measure.
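
    To make the two steps above concrete, here is a minimal sketch in plain Python rather than RapidMiner. The function names, the whitespace tokenizer, and the tf * log(N / df) weighting are assumptions chosen for illustration; real tools often use smoothed variants and proper tokenization.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Assumed weighting: relative term frequency times log(N / df).
    """
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a vector with no weight is similar to nothing
    return dot / (norm_a * norm_b)


docs = ["the car is red", "the car is fast", "a dog barks loudly"]
vectors = tf_idf_vectors(docs)
sim_cars = cosine_similarity(vectors[0], vectors[1])   # share "the", "car", "is"
sim_other = cosine_similarity(vectors[0], vectors[2])  # share no terms
```

    Note that this sketch treats "car" and "cars" as two unrelated terms; they would only be merged if the text were stemmed or lemmatized first.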

    But if you're trying to calculate the distance between terms, and not documents, then I would look into the Levenshtein Edit Distance, which, I believe, is not (yet) implemented in RapidMiner.
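
    For the term-level case, a minimal sketch of the Levenshtein edit distance in plain Python follows. It is the standard dynamic-programming formulation, written here only as an illustration; the function name is an assumption and this is not a RapidMiner operator.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


distance = levenshtein("car", "cars")  # one insertion
```

    With this measure, 'car' and 'cars' are one edit apart, so they can be treated as near matches. It only compares spelling, though, so it will not bring 'car' and 'automobile' close together.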