
interpreting the sum of TF-IDF scores of words across documents

LindsayKelevra Member Posts: 5 Newbie
edited June 2020 in Help
Hi guys! After running k-means clustering on a list of documents, I would like to analyze the words in each cluster (in order to correlate them with other attributes). To do this I summed the TF-IDF values for each word, but I think this approach may be wrong. Would it be more correct to use term frequency? Thanks in advance.

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi,
    I am not sure what exactly you are asking. Can you elaborate a bit?

    Also, LDA might be a good fit for you. It usually does a better job of detecting groups in text.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • LindsayKelevra Member Posts: 5 Newbie
    Hi! I ran k-means clustering on an attribute that contains one article per record. Having applied TF-IDF, I now have a matrix of words and their TF-IDF weights. I am trying to analyze the words contained in each cluster. Since I have many attributes, is it valid to sum the TF-IDF values for each word? Alternatively, I thought of using the average. Would that be more correct?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    This is what I usually do to understand my clusters: https://towardsdatascience.com/understanding-clustering-cf0117148ef4#b7ae
    That should also work on TF-IDF.

    ~Martin
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Fundamentally, you probably don't want to add TF-IDF values, because TF-IDF is not designed to be additive (e.g., it doesn't have consistent scaling, since each term frequency is multiplied by the log of the inverse document frequency).
    If you want to use your word vector values directly, you should use one of the metrics that is inherently additive, such as term occurrences, which is just a raw count of terms, or term frequency, which is just the unadjusted percentage of total terms that a particular term accounts for.
    But I also agree with Martin that this is not the most intuitive way to understand your clusters. You can use some of the methods he describes, or you can just look at the centroid values directly (one of the outputs of the cluster operators) and find the values that differ most from one cluster to another (the graph visualization is helpful for this).
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
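
For readers who want to try the additive approach described above outside of RapidMiner, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the toy documents, the cluster labels, and the helper names `term_frequency_by_cluster` and `distinctive_terms` are all hypothetical and are not RapidMiner operators. It sums raw term occurrences per cluster, converts them to term frequencies, and then ranks terms by how much their frequency in one cluster exceeds the average in the other clusters.

```python
from collections import Counter

# Hypothetical toy data: tokenized documents and the cluster label
# that k-means is assumed to have assigned to each of them.
docs = [
    ["stock", "market", "price", "market"],
    ["market", "price", "bank"],
    ["match", "goal", "team", "goal"],
    ["team", "goal", "price"],
]
labels = [0, 0, 1, 1]


def term_frequency_by_cluster(docs, labels):
    """Sum raw term occurrences per cluster, then express each term
    as a share of all tokens in that cluster (term frequency)."""
    counts = {}
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)
    freqs = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        freqs[label] = {term: n / total for term, n in counter.items()}
    return counts, freqs


def distinctive_terms(freqs, cluster, top_n=3):
    """Rank the terms of `cluster` by how far their frequency exceeds
    the mean frequency of the same term in the other clusters."""
    others = [f for label, f in freqs.items() if label != cluster]
    scores = {}
    for term, value in freqs[cluster].items():
        rest = sum(f.get(term, 0.0) for f in others) / len(others) if others else 0.0
        scores[term] = value - rest
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


counts, freqs = term_frequency_by_cluster(docs, labels)
for cluster in sorted(freqs):
    print(cluster, distinctive_terms(freqs, cluster))
```

Because term occurrences are plain counts, adding them across the documents of a cluster is meaningful, which is not the case for TF-IDF weights. The difference-from-other-clusters score is just one simple way to approximate "most distinct" and could be swapped for a ratio or another measure.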