[SOLVED] How to count 0 values in large set of attributes?

mikeb1029 Member Posts: 2 Contributor I
edited November 2018 in Help
I am working with TF-IDF data, and I want to count the number of terms (attributes) for each document (example) that have a TF-IDF value of 0. At this time we have approximately 8000 examples and 4000 attributes. There seems to be no way to count numeric values directly, so I've tried several types of conversions. I have tried converting the 0 values to missing values and counting those. I have tried converting the numeric TF-IDF scores to polynominal values and counting those. I've also tried discretizing the TF-IDF values into 2 bins (0 and >0). After each of these conversions, I use an Extract Macro operator inside Loop Attributes to do the counting, with the parameters macro type: statistics, statistics: count, attribute name: %{loop_attribute}, and the attribute value appropriate to the conversion done earlier in the process. This is an adaptation of Haddock's Missing Value Count workflow (http://www.myexperiment.org/workflows/1292.html). All of these conversions eventually give the results I want, but a sample set of 850 examples and 165 attributes takes 5 minutes or more to process, so I know the full data set would be far too slow for our needs.

What am I doing wrong? Counting values like this seems like something RapidMiner should be able to do in the blink of an eye.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    what you could do is convert the non-zeros to missing values (as you did before), and then use the Aggregate operator with count (no missings) as the default aggregation.

    Best regards,
    Marius
  • mikeb1029 Member Posts: 2 Contributor I
    Thanks for your help, Marius.  Your suggestion made me realize that I could do this without resorting to the Loop Attributes operator, and that ended up speeding things up tremendously.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    :-) Nice to hear that!
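
For anyone who wants to sanity-check the counts outside RapidMiner, here is a minimal plain-Python sketch of the idea discussed above: mark every non-zero value as missing, then count the non-missing values that remain for each document. It is not a RapidMiner process and does not use any RapidMiner API; the function name `zeros_per_document` and the sample rows are made up for illustration.

```python
from typing import List, Optional


def zeros_per_document(tfidf_rows: List[List[float]]) -> List[int]:
    """Count the zero-valued TF-IDF entries in each row (document).

    Hypothetical helper for illustration only; it mirrors the
    "non-zeros to missing, then count non-missing" idea from the thread.
    """
    counts = []
    for row in tfidf_rows:
        # Step 1: keep zeros, replace every non-zero value with None ("missing").
        masked: List[Optional[float]] = [v if v == 0 else None for v in row]
        # Step 2: count the values that are not missing, i.e. the zeros.
        counts.append(sum(1 for v in masked if v is not None))
    return counts


# Made-up sample data: 3 documents, 4 terms each.
rows = [
    [0.0, 0.31, 0.0, 0.12],
    [0.0, 0.0, 0.0, 0.0],
    [0.45, 0.2, 0.07, 0.9],
]

print(zeros_per_document(rows))  # [2, 4, 0]
```

Each value is visited once in a single pass over the table, with no separate statistics computation per attribute, which is presumably why dropping the per-attribute loop made such a difference in the thread.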