Stem Completion

carl · December 2016

Is there a "stem completion" operator that does something similar to stemCompletion in R? For example, stemming converts service, servicing, services, serviced etc. to servic, but I can't see an operator which then returns the stem to a meaningful form, e.g. service, based on some parameters, e.g. shortest form, longest form etc.
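To make the request concrete, here is a minimal sketch of what such a completion step could look like, written in plain Python. Everything in it is an assumption for illustration: `crude_stem` is a toy suffix stripper standing in for a real stemmer, and `build_completion_table` and `complete_stem` are hypothetical helper names, not an existing operator or library API.

```python
from collections import defaultdict


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (hypothetical)."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_completion_table(vocabulary, stem=crude_stem):
    """Group the known full words by the stem they reduce to."""
    table = defaultdict(list)
    for word in vocabulary:
        table[stem(word)].append(word)
    return table


def complete_stem(stem_value, table, mode="shortest"):
    """Map a stem back to a full word seen in the vocabulary."""
    candidates = table.get(stem_value)
    if not candidates:
        # Unknown stem: nothing to complete it with, so return it unchanged.
        return stem_value
    if mode == "shortest":
        return min(candidates, key=lambda w: (len(w), w))
    if mode == "longest":
        return max(candidates, key=lambda w: (len(w), w))
    raise ValueError("mode must be 'shortest' or 'longest'")


vocabulary = ["service", "servicing", "services", "serviced"]
table = build_completion_table(vocabulary)
print(complete_stem("servic", table, "shortest"))  # service
print(complete_stem("servic", table, "longest"))   # servicing
```

The completion can only return words that occur in the vocabulary it was built from, so the result is a representative of the stem group and not necessarily the word that was originally in the text.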

IngoRM · December 2016

Hi Carl,

Nope, there is no such operator. I must also admit that this might be a bit "dangerous", since you would never know whether the completion is actually close to the original word or not. I guess it might still be nice for visualization purposes, though.

Of course you could call the R function from the R Scripting operator (https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_r_scripting) which should be relatively easy.

Cheers,

Ingo

kayman · January 2017

That would be lemmatization as opposed to stemming, and I would love to see that supported in RM as well :-)

I've done this myself using an 'execute python' operator together with the NLTK toolkit, which has a good lemmatizer.

One of the main complexities is that you need to know the part of speech in order to get the best lemma, so to get the best results you need to run quite a bit of the text processing logic in Python. Not a real dealbreaker, but it makes the RM workflow less clear.
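As a rough illustration of the part-of-speech step: NLTK's `pos_tag` returns Penn Treebank tags (such as `VBD` or `JJ`), while `WordNetLemmatizer.lemmatize` expects a WordNet code such as `'n'`, `'v'`, `'a'` or `'r'`. The helper name `penn_to_wordnet` below is my own and not part of NLTK; it is just a sketch of the mapping.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code.

    Hypothetical helper: anything that is not a verb, adjective or
    adverb falls back to noun, which is also the lemmatizer's default.
    """
    prefix = tag[:2].upper()  # first two characters identify the word class
    if prefix == "VB":
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if prefix == "JJ":
        return "a"  # adjectives: JJ, JJR, JJS
    if prefix == "RB":
        return "r"  # adverbs: RB, RBR, RBS
    return "n"


# Intended use with NLTK (requires the tagger and WordNet data):
#   lemma = lemmatizer.lemmatize(word, penn_to_wordnet(tag))
print(penn_to_wordnet("VBD"))
```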

If you are familiar with Python and have the NLTK toolkit installed, the quick and dirty operator below does work, but you will have to modify the script a bit so that it accepts actual data from an example set instead of the inline test string. It's not the fastest or most elegant approach, but at least it's an option.

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.2.000" expanded="true" height="82" name="get lemma" width="90" x="447" y="34">
        <parameter key="script" value="import pandas as pd&#10;import nltk&#10;from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize&#10;from nltk.stem import WordNetLemmatizer&#10;from nltk.tokenize import wordpunct_tokenize&#10;&#10;lm=nltk.WordNetLemmatizer()&#10;&#10;def rm_main(data):&#10;&#10;&#9;text=&quot;I find the findings of the founding fathers a bit beyond of what I found&quot;&#10;&#9;words = [i for i in wordpunct_tokenize(text)]&#10;&#9;#words = [i for i in wordpunct_tokenize(data)]&#10;&#9;pos = nltk.pos_tag(words)&#10;&#9;lemmas = []&#10;&#9;&#10;&#9;for w, p in pos:&#10;&#9;    # get first 2 tokens of pos tag&#10;&#9;    p = p[:2].lower()   &#10;&#9;    # verbs&#10;&#9;    if p=='vb':  &#10;&#9;        lemmas.append(lm.lemmatize(w, 'v'))&#10;&#9;    # Adjectives&#10;&#9;    elif p=='jj':&#10;&#9;        lemmas.append(lm.lemmatize(w, 'a'))&#10;&#9;    # Adverbs&#10;&#9;    elif p=='rb':&#10;&#9;        lemmas.append(lm.lemmatize(w, 'r'))&#10;&#9;    # default (noun)&#10;&#9;    else:&#10;&#9;        lemmas.append(lm.lemmatize(w))&#10;&#10;&#9;data['lemmas']=&quot; &quot;.join(lemmas)&#10;&#10;&#9;return data&#10;"/>
      </operator>
      <connect from_port="input 1" to_op="get lemma" to_port="input 1"/>
      <connect from_op="get lemma" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
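The modification mentioned above (reading from the example set instead of the inline test string) could look roughly like the sketch below. It assumes the example set arrives as a pandas DataFrame with a text attribute called `text`; that column name, the `TEXT_COLUMN` constant and the `lemmatize_text` helper are my assumptions and need to be adapted to your data. It also assumes the NLTK tagger and WordNet data have already been downloaded.

```python
TEXT_COLUMN = "text"  # assumed attribute name, change to match your example set


def lemmatize_text(text, tokenize, pos_tag, lemmatize):
    """Return `text` with every token replaced by its lemma."""
    # Penn Treebank tag prefix -> WordNet POS code; everything else is a noun.
    wordnet_pos = {"vb": "v", "jj": "a", "rb": "r"}
    tagged = pos_tag(tokenize(text))
    return " ".join(
        lemmatize(word, wordnet_pos.get(tag[:2].lower(), "n"))
        for word, tag in tagged
    )


def rm_main(data):
    # Imported here so lemmatize_text stays usable without NLTK installed.
    import nltk
    from nltk.tokenize import wordpunct_tokenize

    lemmatizer = nltk.WordNetLemmatizer()
    # Lemmatize each row of the text column and store it in a new attribute.
    data["lemmas"] = data[TEXT_COLUMN].astype(str).apply(
        lambda text: lemmatize_text(
            text, wordpunct_tokenize, nltk.pos_tag, lemmatizer.lemmatize
        )
    )
    return data
```

Passing the tokenizer, tagger and lemmatizer in as arguments keeps the per-row logic separate from NLTK, so it can be tried out with simple stand-ins before running it inside RM.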

Of course the same can be achieved with R as well, but I am less familiar with that one. Just look at it as an alternative way to get external logic working with RM.
