Filter Tokens by POS Tags slow

AndyKir Member, University Professor Posts: 3

I have Filter Tokens by POS Tags inside a loop and it's slow. My guess is that on each iteration the tagger reloads some data (a dictionary or model?) from disk. Any tips on how to improve the performance? I see the same question was asked four years ago and was never answered.
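
If the reload guess is right, the usual workaround in plain NLTK is to load the tagger model once, outside the loop, and reuse it for every document. A minimal sketch, where the documents list of token lists is hypothetical and stands in for whatever the loop iterates over:

    import nltk
    from nltk.tag.perceptron import PerceptronTagger

    # nltk.download('averaged_perceptron_tagger')  # one-time model download
    tagger = PerceptronTagger()  # loads the model once, outside the loop

    documents = [["Filtering", "tokens", "is", "slow"],
                 ["Tagging", "should", "be", "fast"]]  # hypothetical input

    for tokens in documents:
        tagged = tagger.tag(tokens)  # reuses the in-memory model
        nouns = [w for w, t in tagged if t.startswith("NN")]
        print(nouns)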


Answers

  • 781194025 Member Posts: 32 Contributor I

    Try to pre-process the data as much as possible so the filter operation doesn't have to work as hard (see the sketch at the end of this post).

     

    I'm experimenting with disabling CPU hyper-threading; maybe you could try that. Another tip is to raise the maximum amount of memory RapidMiner is allowed to use in its settings.

     

    Otherwise, I dunno, some processes are SLOW! 
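
    A rough illustration of the pre-processing idea, as a sketch only (the stopword list and the prefilter helper are made up for this example). Note the trade-off: dropping tokens before tagging removes context the tagger would otherwise use, so it buys speed at some cost in accuracy.

    import string

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to"}  # toy list

    def prefilter(tokens):
        # Drop punctuation, one-character tokens, and stopwords so the
        # POS tagger has fewer tokens to process on each iteration.
        return [t for t in tokens
                if t not in string.punctuation
                and len(t) > 1
                and t.lower() not in STOPWORDS]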

  • kayman Member Posts: 662 Unicorn

    Use Python's NLTK instead if that's an option. It's much more flexible with regard to POS tagging, and much faster.

    Or use R; that's also an option.
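
    For the common case the Filter Tokens by POS Tags operator covers (keep only tokens whose tags match a pattern), a minimal NLTK sketch could look like the following; the filter_by_pos helper and the tag whitelist are illustrative, not a fixed API:

    import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages

    KEEP = ("NN", "JJ")  # keep nouns and adjectives

    def filter_by_pos(text, keep=KEEP):
        tokens = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs
        return [w for w, t in tagged if t.startswith(keep)]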

     

    Below is something I created a while ago to produce different outputs based on POS combinations; maybe it can help you further.

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="get noun phrases (2)" width="90" x="246" y="34">
    <parameter key="script" value="import nltk, re&#10;from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize, wordpunct_tokenize&#10;from nltk.chunk import *&#10;from nltk.chunk.util import *&#10;from nltk.chunk.regexp import *&#10;from nltk import untag&#10;&#10;from nltk.stem import PorterStemmer, WordNetLemmatizer&#10;from nltk.stem.lancaster import LancasterStemmer&#10;from nltk.stem.snowball import SnowballStemmer&#10;&#10;def chunckMe(str,rule):&#10;&#10; np=[]&#10; chunk_parser = RegexpChunkParser(rule, chunk_label='LBL')&#10; sentences= sent_tokenize(str)&#10;&#10; for sent in sentences:&#10; d_words=nltk.word_tokenize(sent)&#10; d_tagged=nltk.pos_tag(d_words)&#10; chunked_text = chunk_parser.parse(d_tagged)&#10;&#10; tree = chunked_text&#10; for subtree in tree.subtrees():&#10; if subtree.label() == 'LBL': np.append(&quot; &quot;.join(untag(subtree)).lower())&#10; &#10; return np;&#10;&#10;def rm_main(data):&#10;&#9;&#10;&#9;np_all=[]&#10;&#9;ap_all=[]&#10;&#9;aa_all=[]&#10;&#9;vj_all=[]&#10;&#9;vb_all=[]&#10;&#9;nn_all=[]&#10;&#10;&#9;&#10;&#9;stopwords_dt=(['the','a','this','that','an','another','these','some','every','any'])&#10;&#10;&#9;lm=nltk.WordNetLemmatizer()&#10;&#9;&#10;&#9;for index,row in data.iterrows():&#10;&#10;&#9;&#9;str=row[&quot;case_details&quot;]&#10;&#10;&#9;&#9;chunk_rule = ChunkRule(&quot;&lt;JJ.*&gt;&lt;NN.*&gt;+|&lt;JJ.*&gt;*&lt;NN.*&gt;&lt;CC&gt;*&lt;NN.*&gt;+|&lt;CD&gt;&lt;NN.*&gt;&quot;, &quot;Simple noun phrase&quot;)&#10;&#9;&#9;tags = chunckMe(str,[chunk_rule])&#10;&#9;&#9;np_all.append(', '.join(set(tags)))&#10;&#10;&#9;&#9;chunk_rule = ChunkRule(&quot;&lt;JJ.*&gt;&lt;CC&gt;&lt;JJ.*&gt;|&lt;JJ.*&gt;&lt;TO&gt;*&lt;VB.*&gt;&lt;TO&gt;*&lt;NN.*&gt;+&quot;, &quot;adjective phrase&quot;)&#10;&#9;&#9;tags = chunckMe(str,[chunk_rule])&#10;&#9;&#9;ap_all.append(', '.join(set(tags)))&#10;&#10;&#9;&#9;chunk_rule = ChunkRule(&quot;&lt;RB.*&gt;&lt;JJ.*&gt;|&lt;VB.*&gt;+&lt;RB.*&gt;&quot;, &quot;Adverb - Adjectives&quot;)&#10;&#9;&#9;tags = chunckMe(str,[chunk_rule])&#10;&#9;&#9;aa_all.append(', '.join(set(tags)))&#10;&#10;&#9;&#9;chunk_rule = ChunkRule(&quot;&lt;VB.*&gt;(&lt;JJ.*&gt;|&lt;NN.*&gt;)+&quot;, &quot;verbs - Adjectives&quot;)&#10;&#9;&#9;tags = chunckMe(str,[chunk_rule])&#10;&#9;&#9;vj_all.append(', '.join(set(tags)))&#9;&#10;&#10;&#9;&#9;chunk_rule = ChunkRule(&quot;&lt;WRB&gt;&lt;.*&gt;+&lt;NN&gt;+&quot;, &quot;Nouns&quot;)&#10;&#9;&#9;tags = chunckMe(str,[chunk_rule])&#10;&#9;&#9;nn_all.append(', '.join(set(tags)))&#9;&#10;&#9;&#9;&#10;&#9;&#9;stopwords=(['be','do','have'])&#10;&#9;&#9;chunk_rule = ChunkRule(&quot;&lt;VB.*&gt;&quot;, &quot;Verbs&quot;)&#10;&#9;&#9;tags = chunckMe(str,[chunk_rule])&#10;&#9;&#9;vb_all.append(', '.join([word for word in nltk.word_tokenize(' '.join(set(lm.lemmatize(w, 'v') for w in tags))) if word.lower() not in stopwords]))&#10;&#10;&#10;&#9;data['noun_phrases']=np_all&#10;&#9;data['adjective_phrases']=ap_all&#10;&#9;data['adverb_phrases']=aa_all&#10;&#9;data['verb_phrases']=vj_all&#10;&#9;data['verbs']=vb_all&#10;&#9;data['nouns']=nn_all&#10;&#9;return data&#10;"/>
    <description align="center" color="transparent" colored="false" width="126">Apply python (NLTK) to get POS tags and some other magic</description>
    </operator>
    <connect from_port="input 1" to_op="get noun phrases (2)" to_port="input 1"/>
    <connect from_op="get noun phrases (2)" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
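
    If the escaped script in the XML is hard to read, here is a distilled, standalone version of its core idea (one chunk rule run through RegexpChunkParser; the rule and the sample sentence are just for illustration):

    import nltk
    from nltk import untag
    from nltk.chunk.regexp import ChunkRule, RegexpChunkParser

    # One rule: optional adjectives followed by one or more nouns.
    rule = ChunkRule("<JJ.*>*<NN.*>+", "simple noun phrase")
    parser = RegexpChunkParser([rule], chunk_label="LBL")

    sentence = "The quick brown fox jumps over the lazy dog"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    for subtree in parser.parse(tagged).subtrees():
        if subtree.label() == "LBL":
            print(" ".join(untag(subtree)).lower())

    For context: rm_main(data) is the entry point that RapidMiner's Execute Python operator calls, and the example set arrives as a pandas DataFrame, which is why the script iterates with data.iterrows().
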
  • AndyKir Member, University Professor Posts: 3

    That's what I do for my research, but for teaching I use RapidMiner...
