Best practices for text mining an academic text

yoram_schaffer Member Posts: 3 Contributor I
edited December 2018 in Help

I have long, complex texts which I want to classify into categories such as psychology, history, etc.

Which processes would you recommend using, e.g. tokenization, n-grams, etc.?

Thank you


  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @yoram_schaffer


    Your question might seem a bit too general, as text categorizing is a pretty big topic :) 

    I might cite my own answer on the very same subject from another discussion some time ago:




    Maybe this could help with some ideas for your problem as well. 

  • yoram_schaffer Member Posts: 3 Contributor I

    Thank you @kypexin for taking the time to reply to me.

    I read your other reply thoroughly. Did you ever try using some of the other processes, like stemming or part-of-speech (POS) tagging?

    The texts I'm analyzing are academic in nature, i.e. I'm not trying to analyze client behavior, nor am I trying to find a dependency between different factors (e.g. weather against purchase habits).

    My intention is to categorize texts according to the topic they are dealing with. The texts are usually 100-300 words.

    I understand this may be beyond your experience. Can you suggest a resource that might be helpful for this?

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @yoram_schaffer


    Well, I did more or less the same task: categorizing site contents (that is, text data) into separate predefined categories. I used all the standard steps (such as tokenization and stemming) in my process, see screenshot #2. The one thing I didn't use was n-grams, as they would be pretty memory-consuming. Otherwise your problem is actually VERY similar, so I would recommend that you begin by re-creating the process setup I described and look at the results (believe me, it really works! :)). I think one crucial thing here is to have a good training set, meaning a corpus of manually categorized documents (how hard this part is depends on how many unique categories and total documents you have).
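
    For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of tokenization, stemming, and training on a manually categorized corpus. It is not the process from the screenshot: the suffix-stripping `toy_stem`, the Naive Bayes scoring, the four-document `TRAINING_SET`, and every function name are assumptions made up for illustration. A real setup would use a proper stemmer and a much larger training set.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, then stem every token."""
    return [toy_stem(t) for t in tokenize(text)]


def train(labeled_docs):
    """Collect per-category term counts from (text, category) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        term_counts[category].update(preprocess(text))
    vocabulary = {t for counts in term_counts.values() for t in counts}
    return term_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest multinomial Naive Bayes log score."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Terms never seen in training carry no information, so they are skipped.
    tokens = [t for t in preprocess(text) if t in vocabulary]
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        counts = term_counts[category]
        # Laplace (add-one) smoothing keeps unseen terms from zeroing a score.
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(doc_counts[category] / total_docs)
        for t in tokens:
            score += math.log((counts[t] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical, hand-labeled training set (far too small for real use).
TRAINING_SET = [
    ("Cognitive biases shape memory and perception in experiments", "psychology"),
    ("Therapy outcomes depend on patient anxiety and behavior", "psychology"),
    ("The empire collapsed after decades of war and revolution", "history"),
    ("Medieval kings signed treaties to end the wars", "history"),
]

model = train(TRAINING_SET)
print(classify("Anxiety affects memory in patients", model))          # psychology
print(classify("The treaty ended the war between the kings", model))  # history
```

    Note that the toy stemmer maps "treaties" to "treat" but leaves "treaty" alone, so the two never match. That kind of miss is why a real stemmer, and above all a well-labeled training set, matters more than the choice of classifier here.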


    In a more general sense, text mining is one of the most popular topics, so you can find a lot of posts on this forum if you search for 'text mining' and similar terms. Also look at the operator descriptions from the Text Mining RM extension; basically everything is built around it. Google also turns up many different resources for 'text mining rapidminer', including some tutorial videos.



  • yoram_schaffer Member Posts: 3 Contributor I

    Thank you very much @kypexin!
    I will try the different settings, using your illustration as a source of inspiration. Yes, I have quite good samples, as I have been working on this for a long time (I actually started with RapidMiner after seeing, to my surprise, how limited Amazon ML is in terms of applying different processes).

    I will report back and share with the community once I have some insights about what brings better results, at least for academic texts.
