"Text Mining Generate n-Grams giving me bad results"

pbailey Member Posts: 1 Learner I
edited May 2019 in Help

First-time user of RapidMiner, so be gentle.


I have a file of support call notes that I'm trying to text mine to get the most used 2-word phrases. I've watched a couple of videos and read a couple of posts on how to do this, so I think I have everything set up correctly (but maybe not, since it's not working). Before using Generate n-Grams, the process returns single words just fine. After I add Generate n-Grams with a max length of 2, the results look bad. The screen caps below give a peek into my setup and results.


The Process:


The results:


Any help or direction would be greatly appreciated.


  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @pbailey - welcome to the community. Don't worry...we are actually very gentle here! 


    In general, the best way for us to help is for you to post your process XML and at least a little bit of your data (if it's sensitive, people use "dummy" data). This way we can actually run the process, tweak it, and share it with others. You can find instructions on how to do this here.


    Looking at your images, I honestly think that it is working. Why do you think it isn't? My hunch is that you have a lot of "junk" tokens, like "aaacds" and "aaba", that you'll probably want to filter out in order to get better results. That's easy to do: just use the "Filter Tokens (by Content)" operator. You may want to play around with the parameters and use the "matches" method with regular expressions. For example:


    Screen Shot 2018-11-02 at 10.19.30 AM.png


    This will filter OUT any token that starts with the letters "aa". Regular expressions are VERY helpful in text mining. :)
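
    For anyone who wants to see the idea outside of RapidMiner, here is a rough Python sketch of the same two steps: drop tokens that match a regular expression, then build 2-word phrases from what is left. It is not the RapidMiner operators themselves. The function names, the sample tokens, and the pattern "aa.*" are made up for illustration, and joining the two words with an underscore is just an assumption about how the phrases get labeled.

```python
import re
from collections import Counter


def filter_tokens(tokens, pattern, invert=True):
    """Filter tokens by a regex that must match the whole token.

    With invert=True, matching tokens are dropped (filtered OUT);
    with invert=False, only matching tokens are kept.
    """
    rx = re.compile(pattern)
    if invert:
        return [t for t in tokens if not rx.fullmatch(t)]
    return [t for t in tokens if rx.fullmatch(t)]


def bigrams(tokens):
    """Pair each token with its successor to form 2-word phrases."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical tokens from one support note, including two junk tokens.
tokens = ["aaacds", "reset", "password", "aaba", "reset", "password"]

# Drop every token that starts with "aa".
clean = filter_tokens(tokens, r"aa.*")

# Count the 2-word phrases in the cleaned token list.
counts = Counter(bigrams(clean))
print(counts.most_common(3))
```

    Note that filtering before building the phrases makes the words on either side of a removed token adjacent, so a pairing like "password_reset" above only exists because "aaba" was dropped in between. That is worth keeping in mind when deciding where the filter goes in the process.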


    Good luck!





