
Problems with n-gram and POS Tags operators

DF Member Posts: 3 Contributor I
edited November 2018 in Help
Hi everyone,

I'm working on my MA dissertation and I'm having trouble getting results from some of the operators. I'm extracting terminology from .txt files using the Process Documents from Files operator. Inside it, I use the following operators: Tokenize (non-letters), Filter Stopwords (English), Transform Cases (lower case), Filter Tokens (by Length), Filter Tokens (by POS Tags) for English (with the expression FW.*|JJ.*|JJR.*|JJS.*|NN.*|NNS.*|RB.*|RP.*|VB.*|VBD.*|VBP.*|VBZ.*|VBG.*|VBN.*) and Generate n-Grams (Terms) with up to 5 terms. When I execute the process, I get no compound words, and I don't know where the POS tags should appear, because I don't see any tags anywhere. Here is the code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
        <list key="text_directories">
          <parameter key="hack" value="C:\Users\Dya\Documents\Univ. Catolica\Semestre 4\Tesis\Tesis 2014\Capitulos\Metodologia\Corpus\Corpus RapidMiner\Txt\Hack"/>
        </list>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="165"/>
          <operator activated="true" class="text:filter_tokens_by_pos" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by POS Tags)" width="90" x="313" y="30">
            <parameter key="expression" value="FW.*|JJ.*|JJR.*|JJS.*|NN.*|NNS.*|RB.*|RP.*|VB.*|VBD.*|VBP.*|VBZ.*|VBG.*|VBN.*"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="313" y="165"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="20"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="447" y="165"/>
          <connect from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
          <connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Please, if anyone can help me figure out if there's any problem, I would appreciate it! I'm running out of time and options. Thanks!

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Move the tokenize operator to the beginning of the chain inside the Process Documents operator

    regards

    Andrew
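
A rough Python sketch of why the operator order matters. It is not RapidMiner code: the function names, the tiny stopword list, and the sample sentence are made up for illustration, and it assumes that an untokenized document behaves like one single token and that term n-grams are joined with an underscore.

```python
import re

# Tiny illustrative stopword list (an assumption, not RapidMiner's English list).
STOPWORDS = {"the", "of", "a", "is", "and", "to", "in", "from"}

def tokenize(text):
    """Split on anything that is not a letter, similar to a 'non letters' mode."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def term_ngrams(tokens, max_n):
    """Return the tokens plus all n-grams of 2..max_n consecutive tokens, joined with '_'."""
    result = list(tokens)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append("_".join(tokens[i:i + n]))
    return result

text = "Text mining extracts terminology from the corpus"

# Tokenize last (as in the posted process): the n-gram step only sees one big token,
# so there is nothing to combine.
untokenized = term_ngrams([text], max_n=2)

# Tokenize first: the later steps work on individual words.
tokens = [t.lower() for t in tokenize(text)]
tokens = [t for t in tokens if t not in STOPWORDS]
tokenized = term_ngrams(tokens, max_n=2)

print(untokenized)
print(tokenized)
```

With tokenization first, the second list contains compound terms such as `text_mining`; without it, the list is just the unchanged document text.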
  • DF Member Posts: 3 Contributor I
    Hi Andrew,

    Thanks! I tried it that way, and I get compound words, though I had to split my texts into two different processes because apparently it exceeded the memory capacity. Because of that same reason I can't export the results into an Excel sheet, so I guess I'll have to read and extract the desired results directly from the Software.

    However, I still don't get the POS Tags. Any idea?

    Best regards,

    Dya
  • bkrieverbkriever RapidMiner Certified Analyst, Member Posts: 11 Contributor II
    I don't believe the POS tags should be "appearing", as you are just filtering for them, not creating flags for them. 

    If you want to create a flag for them you could loop each value:
    FW.*
    JJ.*
    JJR.*
    etc.

    In the loop you can then use Filter Tokens (by POS Tags) where you only filter the current looped value (for example JJ.*) and then create a flag for that loop (Generate attributes: JJ = 1).
    There may be an easier way, but that should work.
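
The loop-and-flag idea can be sketched in plain Python. This is only an illustration under stated assumptions: the (word, tag) pairs are hand-written Penn Treebank examples rather than tagger output, `flag_tokens` is a hypothetical helper, and it assumes the whole tag has to match the expression.

```python
import re

# Hand-written (word, tag) pairs for illustration; a real tagger would produce these.
TAGGED = [("fast", "JJ"), ("faster", "JJR"), ("servers", "NNS"),
          ("run", "VBP"), ("the", "DT")]

# One entry per looped value.
PATTERNS = ["JJ.*", "NN.*", "VB.*"]

def flag_tokens(tagged, patterns):
    """For each pattern, keep the words whose tag matches and label them with it."""
    rows = []
    for pattern in patterns:
        regex = re.compile(pattern)
        for word, tag in tagged:
            if regex.fullmatch(tag):  # assumption: the entire tag must match
                rows.append({"word": word, "WordType": pattern})
    return rows

rows = flag_tokens(TAGGED, PATTERNS)
for row in rows:
    print(row)
```

Under regular-expression semantics, `JJ.*` already matches `JJR` and `JJS` (and `NN.*` matches `NNS`, `VB.*` matches `VBD`, `VBZ`, and so on), so listing the longer tags separately in one alternation is redundant; if both were looped separately, the same word would be flagged twice.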
  • DF Member Posts: 3 Contributor I
    Hi,

    Can you elaborate a little more on how to flag words with POS tags? In Filter Tokens (by POS Tags) I entered the expression (FW*, NN*, etc.). However, I don't know how to link those tags to my extracted words. I've been trying to fill in the attribute name and the corresponding function expression in the Generate Attributes operator, but it doesn't recognize the attributes. I'm a bit lost here, as this is the first time I've used this operator.

    Thanks!
  • bkriever RapidMiner Certified Analyst, Member Posts: 11 Contributor II
    Sorry I didn't respond sooner.
    You will need to output a separate word list for each POS tag (with a label you create for them) and then append to get your full data set.
    Here is an example process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.1.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="6.1.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="112" y="30">
            <list key="attribute_values">
              <parameter key="VB.*" value="1"/>
              <parameter key="FW.*" value="1"/>
              <parameter key="JJ.*" value="1"/>
              <parameter key="NNS.*" value="1"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="loop_attributes" compatibility="6.1.000" expanded="true" height="94" name="Loop Attributes" width="90" x="246" y="30">
            <process expanded="true">
              <operator activated="true" class="text:create_document" compatibility="6.1.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
                <parameter key="text" value="this is test data for which to parse out part of speech tags after it has been created"/>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_pos" compatibility="6.1.000" expanded="true" height="60" name="Filter Tokens (by POS Tags)" width="90" x="179" y="165">
                <parameter key="expression" value="%{loop_attribute}"/>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="6.1.000" expanded="true" height="94" name="Process Documents" width="90" x="313" y="165">
                <parameter key="create_word_vector" value="false"/>
                <parameter key="keep_text" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
                  <connect from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:wordlist_to_data" compatibility="6.1.000" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="165"/>
              <operator activated="true" class="generate_attributes" compatibility="6.1.000" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="165">
                <list key="function_descriptions">
                  <parameter key="WordType" value="&quot;%{loop_attribute}&quot;"/>
                </list>
              </operator>
              <connect from_op="Create Document" from_port="output" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
              <connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/>
              <connect from_op="WordList to Data" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="36"/>
              <portSpacing port="sink_result 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="6.1.000" expanded="true" height="76" name="Append" width="90" x="380" y="30"/>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Loop Attributes" to_port="example set"/>
          <connect from_op="Loop Attributes" from_port="result 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • student_compute Member Posts: 73 Contributor II

    Hello
    I pasted the submitted process into the application, but the Process Documents operator is red and disabled.
    Can someone send me the .rmp file for this process?
    Thank you to anyone who can help. I need it very much.
    Thanks

  • student_compute Member Posts: 73 Contributor II
    Please help
    Thankful
    Sorry
  • kayman Member Posts: 662 Unicorn

    If you are serious about POS and tagging, I would recommend using the Python NLTK package. It is much more robust than the built-in POS options, and a whole lot faster too (developers, take this as a hint ;-))

    The attached example is not exactly what you need, but there are plenty of examples on the internet of how to work with NLTK.

    The sample is something I use a lot myself to separate nouns from verbs, or to look for combined strings (noun or verb phrases, for instance), and it's pretty modular.


    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="8.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="85">
    <list key="attribute_values">
    <parameter key="content" value="&quot;I love this product, the price was really cheap for these types of headphones, and they don't hurt my ears too much after listening to music for hours on end! I ordered with Amazon prime, and it came the next day, I was very pleased.&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    <description align="center" color="transparent" colored="false" width="126">Simple string</description>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="85">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="content"/>
    <description align="center" color="transparent" colored="false" width="126">ensure the string is text before we start conversion</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="chuncker (2)" width="90" x="313" y="85">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="get POS phrases" width="90" x="112" y="34">
    <parameter key="script" value="import nltk, re&#10;from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize, wordpunct_tokenize&#10;from nltk.chunk import *&#10;from nltk.chunk.util import *&#10;from nltk.chunk.regexp import *&#10;from nltk import untag&#10;&#10;from nltk.stem import PorterStemmer, WordNetLemmatizer&#10;from nltk.stem.lancaster import LancasterStemmer&#10;from nltk.stem.snowball import SnowballStemmer&#10;&#10;&quot;&quot;&quot;&#10;The GetPOS class contains any type of POS combination you might be interested in, and allows for relatively easy&#10;addition of different types based on the Part Of Speech attributes.&#10;&#10;Note that the examples below are for demo purposes only, and may need to be modified to get better results,&#10;depending on the given datasets.&#10;&#10;&quot;&quot;&quot;&#10;&#10;class GetPOS:&#10;&#10;    def __init__(self,txt):&#10;        self.txt = txt&#10;&#10;    def get_noun_phrases(self):&#10;        self.chunk_rule = ChunkRule(&quot;&lt;JJ.*&gt;&lt;NN.*&gt;+|&lt;JJ.*&gt;*&lt;NN.*&gt;&lt;CC&gt;*&lt;NN.*&gt;+|&lt;CD&gt;&lt;NN.*&gt;&quot;, &quot;Simple noun phrase&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;    def get_adverb_phrases(self):&#10;        self.chunk_rule = ChunkRule(&quot;&lt;JJ.*&gt;&lt;CC&gt;&lt;JJ.*&gt;|&lt;JJ.*&gt;&lt;TO&gt;*&lt;VB.*&gt;&lt;TO&gt;*&lt;NN.*&gt;+&quot;, &quot;adjective phrase&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;    def get_adverbs_adjectives(self):&#10;        self.chunk_rule = ChunkRule(&quot;&lt;RB.*&gt;&lt;JJ.*&gt;|&lt;VB.*&gt;+&lt;RB.*&gt;&quot;, &quot;Adverb - Adjectives&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;    def get_verbs_adjectives(self):&#10;        self.chunk_rule = ChunkRule(&quot;&lt;VB.*&gt;(&lt;JJ.*&gt;|&lt;NN.*&gt;)+&quot;, &quot;verbs - Adjectives&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;    def get_nouns(self):&#10;        self.chunk_rule = ChunkRule(&quot;(&lt;WRB&gt;&lt;.*&gt;+)?&lt;NN.*&gt;+&quot;, &quot;Nouns&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;    def get_verbs(self):&#10;        self.chunk_rule = ChunkRule(&quot;&lt;VB.*&gt;+&quot;, &quot;Verbs&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;    def get_verbs_lemma(self):&#10;        stopwords=(['be', 'do', 'have', 'am'])&#10;        lm=nltk.WordNetLemmatizer()&#10;        self.chunk_rule = ChunkRule(&quot;&lt;VB.*&gt;+&quot;, &quot;Verbs&quot;)&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join([word for word in nltk.word_tokenize(' '.join(set(lm.lemmatize(w, 'v') for w in self.tags))) if word.lower() not in stopwords])&#10;&#10;    def return_tags(self):&#10;        self.tags = chunckMe(self.txt,[self.chunk_rule])&#10;        return ', '.join(set(self.tags))&#10;&#10;&quot;&quot;&quot;&#10;chunckMe will chunk the provided string, and return only the tokens (words) that apply to the given rule&#10;&#10;&quot;&quot;&quot;&#10;&#10;def chunckMe(txt,rule):&#10;&#10;    np=[]&#10;    chunk_parser = RegexpChunkParser(rule, chunk_label='LBL')&#10;    sentences= sent_tokenize(txt)&#10;&#10;    for sent in sentences:&#10;        d_words=nltk.word_tokenize(sent)&#10;        d_tagged=nltk.pos_tag(d_words)&#10;        chunked_text = chunk_parser.parse(d_tagged)&#10;&#10;        tree = chunked_text&#10;        for subtree in tree.subtrees():&#10;            if subtree.label() == 'LBL': np.append(&quot; &quot;.join(untag(subtree)).lower())&#10;&#10;    return np&#10;&#10;&quot;&quot;&quot;&#10;The rm_main def is the base as used by RapidMiner. The dataframe (called data by default but can be changed to whatever)&#10;will be defined by the incoming port, the output will be what is returned to the process.&#10;&#10;It is perfectly possible to run a python module without retrieving or returning anything, in that case leave the attributes&#10;blank.&#10;&#10;In this example we use some lambda functions to call whatever type of POS we want to add to the dataframe / recordset.&#10;So we will have our original dataframe, and add n new series to return for further use within the workflow.&#10;&#10;&quot;&quot;&quot;&#10;&#10;def rm_main(data):&#10;&#10;    body = data['content']&#10;&#10;    data['noun_phrases'] = body.apply(lambda x: GetPOS(x).get_noun_phrases())&#10;    data['adverb_phrases'] = body.apply(lambda x: GetPOS(x).get_adverb_phrases())&#10;    data['nouns'] = body.apply(lambda x: GetPOS(x).get_nouns())&#10;    data['verbs_lemma'] = body.apply(lambda x: GetPOS(x).get_verbs_lemma())&#10;&#10;    return data"/>
    <description align="center" color="transparent" colored="false" width="126">Apply python (NLTK) to get POS tags and some other magic</description>
    </operator>
    <connect from_port="in 1" to_op="get POS phrases" to_port="input 1"/>
    <connect from_op="get POS phrases" from_port="output 1" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="50" resized="false" width="570" x="87" y="243">https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</description>
    </process>
    <description align="center" color="transparent" colored="false" width="126">use python to set some POS logic for key phrases</description>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="chuncker (2)" to_port="in 1"/>
    <connect from_op="chuncker (2)" from_port="out 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • student_compute Member Posts: 73 Contributor II

    Hello, thank you very much.
    I ran your code. The output looks like this:
    [Attached image: pos.JPG]
    Now, if I want to separate and display the attributes and constraints, how should I write it? I could not get it to run anyway!
    Is that possible too?
    Also, I want to combine the extraction and selection of POS tags with sentiment analysis using WordNet.
    Am I able to connect to the WordNet operator after extracting the nouns, verbs, adverbs and adjectives?
    Is that possible in Python?
    Sorry, I am a beginner.
    Thank you.
    Have a nice day.
