
prune values from TF-IDF vectors

RucaRuca Member Posts: 13 Contributor II
edited November 2018 in Help
Hi all,

I'm processing a set of documents from files in order to generate a TF-IDF vector for each document.
For each document I'm getting several scores below 1%.
Is there any way to prune all values below 1%?
I've tried the prune by ranking method, but unfortunately I'm not getting the results I expect. Is there another way to work around this?
Thank you.

Regards,

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Ruca,

    what exactly do you want to do? If the TF-IDF value for one document is below 1%, it's probably 90% for another document. That means you can't remove the whole column, because then you would also remove the high value of the other document. So what should happen in that case?

    Best,
    Marius
  • RucaRuca Member Posts: 13 Contributor II
    Hello Marius,

    Thank you for your reply.
    Probably I was not clear enough with my explanation. Sorry for that.
    Let's assume that I'm processing something like 10 docs and getting a TF-IDF vector for each document.
    As you mention, and you are 100% right, there are some words that are not relevant for a particular document (let's say below 5%) but may be very relevant for other documents (around 50%).
    But I can also come across words whose values range between 0.0 and 0.04, which means that the maximum score such a word can have is 4% for any document.
    My question is: how can I eliminate such words, which have a minimal impact on all documents?
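
The rule described in the last post (drop a word only when its highest TF-IDF score across all documents stays under some cutoff) can be sketched outside RapidMiner in plain Python. This is only an illustration under assumptions: the function name `prune_low_impact_terms`, the dict-per-document layout, the sample scores, and the 0.05 cutoff are all hypothetical and do not correspond to any RapidMiner operator or parameter.

```python
def prune_low_impact_terms(vectors, threshold=0.05):
    """Drop every term whose highest score across ALL documents is below threshold.

    vectors: list of dicts mapping term -> TF-IDF score, one dict per document
             (hypothetical layout chosen for this sketch).
    """
    # Find the maximum score each term reaches in any document.
    max_score = {}
    for vec in vectors:
        for term, score in vec.items():
            max_score[term] = max(max_score.get(term, 0.0), score)

    # Keep a term if at least one document scores it at or above the threshold.
    keep = {term for term, score in max_score.items() if score >= threshold}

    # Rebuild each document vector with only the kept terms.
    return [{t: s for t, s in vec.items() if t in keep} for vec in vectors]


# Made-up scores for two documents.
docs = [
    {"mining": 0.50, "data": 0.03, "the": 0.01},
    {"mining": 0.02, "data": 0.04, "the": 0.00},
]
pruned = prune_low_impact_terms(docs, threshold=0.05)
print(pruned)
```

In this sketch "mining" survives in both documents, even though it scores only 0.02 in the second one, because it reaches 0.50 in the first. "data" and "the" never reach the cutoff in any document, so their columns are removed everywhere. This addresses the concern raised in the first answer: a column is dropped only when no document gives it a high value.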