Working with and editing wordlist in RM

svtorykh Member Posts: 35 Guru
edited December 2018 in Help

Hello people,

 

I have a wordlist that was generated and stored after text processing. The wordlist contains N-grams as well as single words. I'm using this wordlist as the WOR input to my next text processing operator, but I only need to keep the N-grams (terms that contain _). There is a Wordlist to Data operator that I could use to filter it, but there is no reverse Data to Wordlist operator. Are there any other ways for me to filter the wordlist?

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi @svtorykh,

    I am afraid there is none. Wouldn't it be okay in your case to filter the attributes at the end?

     

    BR,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • svtorykh Member Posts: 35 Guru

    Hi Martin,

     

    Ultimately, what I'm trying to solve is how I can customize a wordlist outside of RapidMiner and use it as the WOR input for the Process Documents operator. I think it's pretty important, as it helps tremendously with filtering for the proper content while processing documents.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    As @mschmitz says, there is not really a way to edit wordlists today. I agree it would be helpful, though, so please go ahead and submit it as a product idea!

    In the meantime, as long as the generated wordlist contains a superset of the words you actually need, there is no real functional problem in RapidMiner.  It will simply generate attributes for words that you don't care about, which can be ignored or filtered out later once they are attributes.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @svtorykh interesting application. I'm curious, how do you plan on using the customized wordlist? Could you use the Filter Tokens (by Region) operator to automatically filter for '_' and then do 1 before and after?

  • svtorykh Member Posts: 35 Guru

    Yeah, I've figured out how to filter for _ using Filter Tokens (by Content). The wordlist must contain both N-grams and single words, though. It's easy for me to decide which of the single words should be included in the wordlist, and I can then merge the N-gram output with the single words outside of RapidMiner.

    Can you guys elaborate on post-process attribute filtering? Both the wordlist and the final attribute list will contain thousands of attributes, so I'm not sure how complicated it would be to filter thousands from thousands in post-processing.
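
    To make the post-process filtering idea concrete, here is a minimal Python sketch of the selection rule discussed in this thread: keep every attribute whose name contains _ (the N-grams), plus a hand-picked list of single words. This is not a RapidMiner API; the function name `select_attributes` and the sample attribute names are assumptions made up for illustration.

```python
def select_attributes(attribute_names, keep_words):
    """Keep N-gram attributes (names containing '_') plus hand-picked single words."""
    keep = set(keep_words)  # set lookups stay cheap even with thousands of words
    return [name for name in attribute_names if "_" in name or name in keep]


# Hypothetical attribute names, standing in for the output of document processing.
attributes = ["business", "process", "business_process", "data", "data_mining"]
selected = select_attributes(attributes, keep_words=["data"])
print(selected)
```

    Because the single-word list is held in a set, each attribute needs only one membership check, so filtering thousands of attributes against thousands of words is a single pass over the attribute names.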

  • svtorykh Member Posts: 35 Guru

    On another note! One of the benefits of being able to import a customized wordlist is the ability to actually generate N-grams better.

    E.g., I'm looking into business skills and have a repository of skills, many of which are 3 words or more. In this case, for "Business Process Optimization" using N-grams of 3, the results will contain business, process, optimization, business_process, process_optimization, and business_process_optimization. If I could instead just replace the spaces in Excel and have business_process_optimization as the wordlist input, I wouldn't see the noise of all the other N-grams generated. Makes sense? :) Consider thousands of possible skill combinations, and the scalability of filtering attributes becomes a problem.
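
    The "Business Process Optimization" example can be sketched in a few lines of Python. This only illustrates the counting argument and is not how RapidMiner generates N-grams internally; `ngrams_up_to` and the `skills` set are hypothetical names introduced for this sketch.

```python
def ngrams_up_to(tokens, max_n):
    """Return every n-gram of length 1..max_n, with tokens joined by '_'."""
    terms = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[start:start + n]))
    return terms


tokens = "business process optimization".split()
all_terms = ngrams_up_to(tokens, 3)
print(all_terms)  # six terms, five of which are noise for this use case

# Hypothetical curated skill list, e.g. exported from Excel with spaces replaced by '_'.
skills = {"business_process_optimization"}
kept = [term for term in all_terms if term in skills]
print(kept)
```

    A single three-word skill produces six terms when N-grams up to length 3 are generated, while the curated list keeps only one of them, which is the noise reduction described above.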
