Text Mining: Generate n-Grams giving me bad results

pbailey Member Posts: 1 Learner I
edited November 9 in Help

First time user of RapidMiner so be gentle.     

 

I have a file of support call notes that I'm trying to text mine to get the most-used 2-word phrases. I've watched a couple of videos and read a couple of posts on how to do this, so I think I have everything set up correctly (but maybe not, since it's not working). Before I add Generate n-Grams, the process returns single words just fine. After I add Generate n-Grams with a max length of 2, the results look bad. The screen caps below give a peek into my setup and results.

 

The Process:

 https://photos.app.goo.gl/ue98yXSvkeMKbuzq9

The results:

 https://photos.app.goo.gl/nJADtcHLDeMH2wMk9

Any help or direction would be greatly appreciated.


Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager Posts: 1,832 Community Manager

    Hello @pbailey - welcome to the community. Don't worry...we are actually very gentle here! 

     

    In general, the best way for us to help is for you to post your process XML and at least a little bit of your data (if it's sensitive, people use "dummy" data). That way we can actually run the process, tweak it, and share it with others. You can find instructions on how to do this here.

     

    Looking at your images, I honestly think that it is working. Why do you think it isn't? My hunch is that you have a lot of "junk" tokens, like "aaacds" and "aaba", that you'll probably want to filter out in order to get better results. That's easy to do: just use the "Filter Tokens (by Content)" operator. You may want to play around with the parameters and use the "matches" condition with regular expressions. For example:

     

    Screen Shot 2018-11-02 at 10.19.30 AM.png

     

    This will filter OUT any token that starts with the letters "aa". Regular expressions are VERY helpful in text mining. :)
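
    For anyone who wants to see the same idea outside of RapidMiner, here is a minimal Python sketch of the two steps: drop junk tokens with a regular expression, then build 2-word phrases and count them. This is not what the operators do internally. The pattern `aa.*`, the sample notes, and the helper names `filter_tokens` and `bigrams` are assumptions made up for illustration, and joining the two words with an underscore is just a convention chosen for this sketch.

```python
import re
from collections import Counter

# Assumed pattern for illustration: any token that starts with "aa".
JUNK = re.compile(r"aa.*")

def filter_tokens(tokens, pattern=JUNK):
    """Keep only tokens that do NOT fully match the junk pattern."""
    return [t for t in tokens if not pattern.fullmatch(t)]

def bigrams(tokens):
    """Pair each token with its successor, joined by an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

# Made-up "support call notes" standing in for the real data.
notes = [
    "aaacds printer offline reset password",
    "aaba reset password again",
]

counts = Counter()
for note in notes:
    tokens = filter_tokens(note.lower().split())
    counts.update(bigrams(tokens))

# Most frequent 2-word phrase across all notes.
top = counts.most_common(1)
print(top)
```

    Filtering before building the pairs matters: if the junk tokens stay in, they end up glued to real words and show up as noisy 2-word phrases.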

     

    Good luck!

     

    Scott

     

     

     
