Select column with non-zero value

ElenaVet Member Posts: 9 Learner I
Hi everybody!
I've calculated TF-IDF with "Process Documents from Data" and got a matrix with one word per column and one document body per row, where each cell contains the TF-IDF value. Now I filter by cluster (created with k-means), and I want to see only the columns that contain non-zero values. My first idea was to sum each column (with Aggregate) and keep only the columns whose sum is greater than zero, but I suspect that summing TF-IDF values is a mistake that would distort the analysis. Can you please suggest a way to keep only the columns with at least one value different from zero?
Thank you so much!

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Have you tried looking at the cluster centroid output? It essentially gives you the average value of each attribute for each cluster, which should be easier to filter.
    If you don't want to use that approach, you would need to loop over each cluster, run an Aggregate with the maximum function, and remove the attributes whose maximum is zero. Because TF-IDF values are never negative, a maximum of zero means the whole column is zero within that cluster.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • ElenaVet Member Posts: 9 Learner I
    Hi @Telcontar120,
    thank you for your answer! I found the cluster centroid output, as you suggested, but I don't really understand what the value in each cell means. Could you explain it, please? I have attached a screenshot of my results.
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Cluster centroids show the average value of the word vector metric (whichever one you selected, such as TF-IDF) for each attribute within each cluster. You can see, for instance, that the cluster with the highest value for the token "aapl" is cluster 12. By sorting and filtering, you can use this to understand which attributes are most dominant for any particular cluster. You can also compute differences between clusters if you like.
    I noticed you have a lot of clusters, which can make interpretation difficult. You should probably think about whether you really need this many distinct clusters, or try an approach other than k-means, such as LDA.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,524 RM Data Scientist
    Hi,
    to add one more thought: the operator Extract Cluster Centroid gives you that table as an example set to work with.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
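
For readers who want to check the logic outside RapidMiner, the two ideas from the answers (dropping attributes whose per-cluster maximum is zero, and reading centroids as per-cluster averages) can be sketched in plain Python. This is only an illustration under assumptions: the rows, the token names other than "aapl", the cluster labels, and the helper names `nonzero_columns_by_cluster` and `centroids` are all made up for the example and are not RapidMiner operators or output from the thread.

```python
from collections import defaultdict


def nonzero_columns_by_cluster(rows, cluster_key="cluster"):
    """Map each cluster to the columns holding at least one non-zero value.

    Hypothetical helper: it mirrors "aggregate with max per cluster, then
    drop attributes whose max is zero". abs() is used so that a negative
    value would also count as non-zero (TF-IDF itself is never negative).
    """
    col_max = defaultdict(dict)
    for row in rows:
        maxima = col_max[row[cluster_key]]
        for col, value in row.items():
            if col == cluster_key:
                continue
            maxima[col] = max(maxima.get(col, 0.0), abs(value))
    return {
        cluster: sorted(col for col, m in maxima.items() if m > 0)
        for cluster, maxima in col_max.items()
    }


def centroids(rows, cluster_key="cluster"):
    """Map each cluster to the mean value of every column (its centroid)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        cluster = row[cluster_key]
        counts[cluster] += 1
        for col, value in row.items():
            if col != cluster_key:
                sums[cluster][col] += value
    return {
        cluster: {col: total / counts[cluster] for col, total in cols.items()}
        for cluster, cols in sums.items()
    }


# Invented TF-IDF rows: one dict per document, plus its cluster label.
rows = [
    {"cluster": "cluster_0", "aapl": 0.8, "bank": 0.0, "rate": 0.0},
    {"cluster": "cluster_0", "aapl": 0.4, "bank": 0.0, "rate": 0.2},
    {"cluster": "cluster_1", "aapl": 0.0, "bank": 0.5, "rate": 0.0},
    {"cluster": "cluster_1", "aapl": 0.0, "bank": 0.7, "rate": 0.0},
]

kept = nonzero_columns_by_cluster(rows)
cents = centroids(rows)

# Cluster whose centroid has the highest average value for the token "aapl".
top_cluster_for_aapl = max(cents, key=lambda c: cents[c]["aapl"])

print(kept)
print(top_cluster_for_aapl)
```

In this toy data, "rate" is kept for cluster_0 even though only one document uses it, which is the behaviour the original question asks for: a column survives if any single value in the cluster is non-zero, and no TF-IDF values are summed.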