"Clustering and similarity of text documents"
I have recently been working with methods for extracting keyphrases from text. Now I would like to solve another problem: clustering documents and measuring the similarity between them.
The setup is as follows. Suppose we have security documents from various sources, and I would like to examine and cluster them. Sometimes several sources publish a document about the same topic, device, or problem. The goal is to find these 'overlapping' documents and put them in one cluster. Such documents have the following features: the structure may differ and some words may be added, but the key phrases stay the same, mainly a number that identifies a report, or other key phrases that appear repeatedly. Any suggestions about the model? I've tried several clustering parameters and metrics, but the results were poor. An approach based on the frequency of common words would fail because of the specific structure of the documents. Thanks in advance for any suggestions.
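
To make the goal concrete, here is a minimal sketch of the kind of grouping I have in mind. It is only an illustration under assumptions, not a working solution for my data: the identifier pattern (`ID_PATTERN`), the Jaccard threshold, the function names, and the sample documents are all made up for this example. It extracts identifier-like key phrases from each document and links documents whose key sets overlap enough.

```python
import re
from itertools import combinations

# Hypothetical report-identifier pattern (e.g. "CVE-2021-12345").
# This is an assumption; the real identifier format may differ.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{4}-\d+\b")

def key_set(text):
    """Return the set of identifier-like key phrases found in a document."""
    return set(ID_PATTERN.findall(text))

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.5):
    """Group document indices whose key sets overlap by at least `threshold`.

    Uses single linkage via union-find: two documents end up in the same
    cluster if a chain of sufficiently similar pairs connects them.
    """
    keys = [key_set(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(keys[i], keys[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up sample documents, for illustration only.
docs = [
    "Advisory CVE-2021-12345: buffer overflow in router firmware.",
    "Vendor note. The router firmware issue (CVE-2021-12345) has been patched.",
    "Unrelated report on CVE-2020-00001 affecting a camera.",
]
print(cluster(docs))  # [[0, 1], [2]]
```

The first two sample documents share the same identifier and land in one cluster even though their wording and structure differ, which is the behaviour I am after. What I am unsure about is how to extend this to key phrases that are not simple identifiers.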