What is a good threshold for CosineSimilarity Measure?

mrcmrc Member Posts: 2 Newbie
Hi RM community,

I'm using the Cosine Similarity measure in the Cross Distances operator to determine how relevant each document in a corpus of 5000 documents is to a reference document. I'm getting results ranging from 0.8 to 1.6, without any significant breakpoint between relevant and not-so-relevant documents. How can I determine a mathematically sound threshold, so that documents below it can be categorized as relevant and those above it as not relevant? In short, how does one determine a threshold for cosine similarity measures with the Cross Distances operator?

Thanks so much, any insights will be greatly appreciated as I'm very new to this!
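
For readers following along outside RapidMiner, here is a minimal sketch of the textbook cosine similarity between a reference vector and each document vector. It is not the Cross Distances operator's implementation; the term-count vectors and the names `reference`, `corpus`, and `cosine_similarity` are made up for illustration. Note that the textbook measure is bounded in [-1, 1], and in [0, 1] for non-negative term vectors, with higher values meaning more similar.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term-count vectors over a 4-word vocabulary.
reference = [2, 1, 0, 1]
corpus = {
    "doc_a": [4, 2, 0, 2],  # same direction as the reference
    "doc_b": [0, 0, 3, 0],  # shares no terms with the reference
    "doc_c": [1, 0, 1, 1],  # partial overlap
}

# Score every document against the reference, then rank most similar first.
scores = {name: cosine_similarity(reference, vec) for name, vec in corpus.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.4f}")
```

Because the measure depends only on direction, `doc_a` (the reference scaled by 2) scores as a perfect match regardless of its length.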



  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,952  Community Manager
    cc @yyhuang ???
  • mrcmrc Member Posts: 2 Newbie
    I do not have an answer yet, but since posting this I've used the Normalize operator to rescale the results to the range 0 to 1. I am now trying to decide what threshold makes sense, and I'm leaning towards 0.25 or 0.5. I'd like to justify my threshold choice with a mathematically sound argument, but so far I have not come across one. Any insights to help?

    Thanks much!
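
    The normalize-then-threshold step described above can be sketched as follows. This assumes min-max scaling to [0, 1] (whether the Normalize operator was configured that way is an assumption), uses made-up raw values inside the reported 0.8 to 1.6 range, and follows the thread's convention that lower values mean more relevant. The names `min_max_normalize` and `relevant_below` are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def relevant_below(raw_values, threshold):
    """Return the raw values whose normalized score is at or below the threshold."""
    normalized = min_max_normalize(raw_values)
    return [r for r, n in zip(raw_values, normalized) if n <= threshold]

# Hypothetical raw results spanning the range reported in the thread.
raw = [0.8, 0.9, 1.2, 1.6]

normalized = min_max_normalize(raw)
selected = relevant_below(raw, 0.25)

print(normalized)
print(selected)
```

    One caveat worth keeping in mind: min-max scaling depends on the smallest and largest values in this particular corpus, so a cut-off such as 0.25 is relative to those extremes and would select different documents if the corpus changed.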
