Best Practices for Term Reduction

mdc Member Posts: 58 Maven
Hi,

I'm new to text mining and I'm trying to cluster technical documents or papers using RM. I've tried a lot of things already but couldn't get good-quality clusters. Now I'm starting to realize that to get good clustering results, I need to provide a clean list of terms (attributes). So far, this is what I've tried:
1. Limit the terms to a list of Domain keywords - Ideally, I think this is the best but harder to generate. It requires a lot of manual work to have a list of domain keywords.
2. Stop Word Lists (English and domain specific)
3. Use a small part of the document for indexing
    - Title only - this results in fewer terms, but I don't know if it affects the quality of clustering
    - Abstract only - I find that it generates lots of noisy terms
4. Use n-grams - OK, but it multiplies the number of terms
5. Stemmer (on/off)
6. InteractiveAttributeWeighing operator - I think this is OK but requires manual work, considering I always get thousands of attributes.

Is there anything else available in RM that I'm missing? I've heard of POS tagging and sentence analysis (?), but they are not available in the Word Vector Plugin. Is this something I should be considering?

I know there are experts here in text mining of technical papers. Could you please give me some guidance or ideas?  :-\

thanks,


Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Unfortunately, there is no recipe that always leads to the best results, but I'll at least try to comment on your ideas and add some further remarks where applicable:

    1. Limit the terms to a list of Domain keywords - Ideally, I think this is the best but harder to generate. It requires a lot of manual work to have a list of domain keywords.
    This can help, but for texts it is usually not really an option. And especially for clustering, where success is hard to measure, I would not encode too much information in the representation: it is very likely that you will then just produce a model that fulfills your expectations rather than one that reflects the structure actually hidden in the data.

    2. Stop Word Lists (English and domain specific)
    Removing stop words usually helps for text clustering. Some domain-specific preprocessing usually also helps, e.g. keeping short words that are definitely important, or ensuring that tokenization does not split words that should stay whole.

    3. Use a small part of the document for indexing
        - Title only - this results in fewer terms, but I don't know if it affects the quality of clustering
        - Abstract only - I find that it generates lots of noisy terms
    I would be suspicious of the title, but the abstract could work: at least the number of dimensions is likely to be reduced to the necessary minimum. People usually only use words in their abstracts if they are really important for describing the topics of the text, so you should at least give the abstract a try. The problem with the title is that authors sometimes care more about finding a "catchy" title than one that actually describes the contents.

    4. Use n-grams - OK, but it multiplies the number of terms
    Character n-grams are usually more important for shorter texts where the probability of typos is higher (e.g. forum posts). For long, well-written texts, character n-grams usually become less important. But you could still try term n-grams if those are important in your domain.

    5. Stemmer (on/off)
    You have to try both. And you apparently did  ;)

    6. InteractiveAttributeWeighing operator - I think this is OK but requires manual work, considering I always get thousands of attributes.
    Same as for 1 - not really an option.
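
As a rough illustration of points 2 and 4, here is a minimal Python sketch (outside RapidMiner, standard library only). The stop word list, the protected terms, the sample sentence, and all function names are assumptions made up for this example; they are not RapidMiner operators or settings.

```python
import re

# Assumed, illustrative lists; a real project would load much fuller ones.
STOP_WORDS = {"the", "a", "of", "for", "and", "in", "is", "we", "to", "on", "with"}
PROTECTED_TERMS = ["k-means", "c++"]  # domain terms that tokenization must not split


def tokenize(text, protected=PROTECTED_TERMS):
    """Lowercase the text and split it into tokens, keeping protected terms whole."""
    # Protected terms are tried first (longest first), then plain alphanumeric runs.
    alternatives = [re.escape(t) for t in sorted(protected, key=len, reverse=True)]
    alternatives.append(r"[a-z0-9]+")
    return re.findall("|".join(alternatives), text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def term_ngrams(tokens, n):
    """Word-level n-grams, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(token, n):
    """Character-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


tokens = remove_stop_words(tokenize("We cluster the papers with k-means in C++"))
bigrams = term_ngrams(tokens, 2)

# A typo changes only some character trigrams, so the two spellings still overlap.
shared = set(char_ngrams("clustering", 3)) & set(char_ngrams("clustreing", 3))

print(tokens)
print(bigrams)
print(sorted(shared))
```

Note that every n-gram becomes an additional attribute, which is why n-grams multiply the number of terms.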

    Of course, you can also try automatic feature selection, but usually this is not too helpful for texts. For text classification, especially with SVMs, it has actually been shown that feature selection decreases accuracy in most cases.


    The most important question besides preprocessing is which distance measure to use. I would always try cosine similarity first, since it often works quite well on texts. Euclidean distance often works less well, so KMeans, which relies on it, is not really an option here.
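
To make the difference between the two measures concrete, here is a small Python sketch (standard library only, not RapidMiner code). The toy documents and function names are assumptions chosen for illustration, and raw term counts stand in for whatever weighting is actually used. It compares a short document with a longer one on the same topic and with an equally short one on a different topic.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two term-count vectors (Counters)."""
    # Counter returns 0 for terms that are missing from one of the documents.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


# Hypothetical toy documents, already tokenized.
short_doc = Counter(["text", "mining", "cluster"])
long_doc = Counter(["text", "mining", "cluster"] * 5)    # same topic, five times longer
other_doc = Counter(["protein", "folding", "cluster"])   # different topic, same length

print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))
print(euclidean_distance(short_doc, long_doc), euclidean_distance(short_doc, other_doc))
```

In this toy case, cosine similarity ranks the longer same-topic document as the closer one, while Euclidean distance ranks the unrelated document as closer simply because it has a similar length.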

    Hope that helps,
    Ingo
  • mdc Member Posts: 58 Maven
    Hi Ingo,

    Thanks, that helps me focus on fewer things.

    Matthew