"Text pattern identification"

ratheesan Member Posts: 68 Maven
edited May 2019 in Help
Hello,
I have text documents related to insurance. The data contains phrases such as "No alcohol content" and "alcohol content". When I process these documents, RM counts all occurrences of "alcohol" together. How can I count the occurrences of "alcohol" that have the neighboring term "no"?

Thanks
Ratheesan

Answers

  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    Hello Ratheesan,

    You can use the RapidMiner text preprocessing operator TermNGramGenerator to count not only individual words but also word pairs and other multi-word terms. Alternatively, or in addition, you can use a TokenReplace operator before the StringTokenizer to map multi-word terms like "no alcohol" to single-word tokens:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
            </list>
            <list key="namespaces">
            </list>
            <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
            </operator>
            <operator name="Replace 'no alcohol' by 'noalcohol' to count it as one new word" class="TokenReplace">
                <list key="replace_dictionary">
                  <parameter key="no alcohol" value="noalcohol"/>
                </list>
            </operator>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="Consider pairs of words in addition to individual words" class="TermNGramGenerator">
            </operator>
        </operator>
    </operator>
    Cheers,
    Ralf
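For readers who want to see the idea outside RapidMiner, the effect of lowercasing, replacing "no alcohol" by "noalcohol", and then tokenizing can be sketched in plain Python. This is only an illustration of the technique, not RapidMiner's implementation; the function name `count_terms` and the sample text are made up for this example.

```python
import re
from collections import Counter

def count_terms(text, replacements):
    """Lowercase the text, merge multi-word terms into single tokens, count tokens.

    Hypothetical helper for illustration; it mimics the order of the operators
    in the process above (lowercase, replace, tokenize) but is not RapidMiner code.
    """
    text = text.lower()
    for phrase, token in replacements.items():
        # \b restricts the match to whole words, so "piano alcohol" is left alone
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", token, text)
    # Simple tokenizer: every run of letters is one token
    tokens = re.findall(r"[a-z]+", text)
    return Counter(tokens)

sample = "No alcohol content was found. Alcohol content was high. No alcohol content."
counts = count_terms(sample, {"no alcohol": "noalcohol"})
print(counts["noalcohol"], counts["alcohol"])  # prints: 2 1
```

Because the replacement runs before tokenization, "noalcohol" and plain "alcohol" end up as separate terms with separate counts.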
  • ratheesan Member Posts: 68 Maven
    Hello Ralf ,
    I really appreciate your help. It is working fine. I now get all combinations of words: single words, 2-word terms, 3-word terms, and so on. The operator only lets me control the maximum number of words, but I need to extract only combinations of 3 or more words. How can I achieve this?

    Thanks
    Ratheesan
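The thread does not say whether the n-gram operator has a minimum-length setting. One possible workaround is to filter the generated terms afterwards by their word count. The Python sketch below illustrates that idea under stated assumptions: the helper names `ngrams` and `keep_long_terms` are hypothetical, and the "_" separator between the words of an n-gram is an assumption about the term format, so adjust it to whatever the actual output uses.

```python
def ngrams(tokens, min_n, max_n, sep="_"):
    """Build all n-grams with min_n <= n <= max_n from a list of tokens."""
    result = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(sep.join(tokens[i:i + n]))
    return result

def keep_long_terms(terms, min_words, sep="_"):
    """Keep only the terms that consist of at least min_words words."""
    return [t for t in terms if len(t.split(sep)) >= min_words]

tokens = ["no", "alcohol", "content", "found"]

# Everything up to length 3, as an n-gram generator with only a maximum would produce
all_terms = ngrams(tokens, 1, 3)

# Post-filter: drop single words and pairs, keep terms of 3 or more words
long_terms = keep_long_terms(all_terms, 3)
print(long_terms)  # prints: ['no_alcohol_content', 'alcohol_content_found']
```

The same filter applies to a list of term names exported from a word vector, as long as the separator matches and the individual tokens do not contain it themselves.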