What is a good threshold for CosineSimilarity Measure?
I'm using the Cosine Similarity measure in the Cross Distances operator to determine how relevant each document in a corpus of 5000 documents is to a reference document. The results range from 0.8 to 1.6, with no clear breakpoint between relevant and not-so-relevant documents. How can I determine a mathematically sound threshold, so that documents below it can be categorized as relevant and those above it as not relevant? In short, how does one determine a threshold for cosine similarity measures with the Cross Distances operator?
Thanks so much! Any insights will be greatly appreciated, as I'm very new to this.
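To make the setup concrete, here is a minimal Python sketch of what I think is being computed. Plain cosine similarity always lies in [-1, 1], so values between 0.8 and 1.6 cannot be raw similarities. One possible reading, which I have not confirmed against the RapidMiner documentation, is that the operator reports the angle in radians (the arccosine of the similarity), which runs from 0 to about 1.57 for non-negative term vectors. The function names and the percentile cutoff below are illustrative assumptions of mine, not RapidMiner's API, and a percentile cutoff is only a heuristic rather than a mathematically derived threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def angular_distance(a, b):
    """Angle in radians between a and b (assumed reading of the reported values)."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    s = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(s)


def percentile_threshold(distances, fraction):
    """Nearest-rank cutoff: the distance at or below which `fraction` of the values fall."""
    ordered = sorted(distances)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]


# Toy term-frequency vectors (hypothetical data, not from the corpus in question).
reference = [2, 1, 0, 0]
corpus = [[2, 1, 0, 0], [1, 1, 1, 0], [0, 0, 3, 1], [0, 1, 0, 2]]

distances = [angular_distance(reference, doc) for doc in corpus]
cutoff = percentile_threshold(distances, 0.5)
relevant = [i for i, d in enumerate(distances) if d <= cutoff]
```

With no natural breakpoint in the distances, a cutoff like this only ranks documents and keeps the closest fraction. Checking it against a small set of documents labelled relevant or not relevant by hand would be the way to justify a specific value.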