RapidMiner

HELP please-Regular expressions (Replace tokens)

Contributor II

I want to find all tokens that are #hashtags and replace them with the word "mention", but I want to leave a certain subset of those hashtags unchanged.

Example: if I have the words #apple #juice #tree #dog #table, I want to replace #apple and #juice with the word "mention" and leave the tokens #tree, #dog and #table as they are.
 
How can I do that with the Replace Tokens operator?

I would really appreciate any help...

2 REPLIES
Community Manager

Re: HELP please-Regular expressions (Replace tokens)

To drop the "#", you could select with a pattern like #(.*) and then replace by $1.
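
As a rough illustration of that capture-group idea outside RapidMiner, here is a minimal Python sketch. The token list is made up for the example, and Python writes the back-reference as \1 where the operator uses $1.

```python
import re

# Hypothetical tokens, handled one at a time as a token-level operator would see them.
tokens = ["#apple", "#juice", "tree", "#dog"]

# "#(.*)" captures everything after the "#"; "\1" puts the captured part back.
stripped = [re.sub(r"#(.*)", r"\1", token) for token in tokens]

print(stripped)
```

Tokens without a "#" are left untouched because the pattern does not match them.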

 

If you want to select #apple and replace it with "mention", you could use #apple as the selection and mention as the replacement. This could get very messy if you have a lot of words you want to replace.

 

What I would suggest is to use the Replace Dictionary operator and pass it a list of the words you want to change to "mention". Everything needs to be in a nominal data format first, and then you have to convert it to text so that Process Documents from Data works. In essence, you do the token replacement before you do the text processing.
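
The dictionary idea can be sketched in Python as follows. This only illustrates the approach and is not how the Replace Dictionary operator is implemented; the `replacements` mapping and the function name `replace_from_dictionary` are assumptions made for the example.

```python
import re

# Hypothetical dictionary: hashtag to replace -> replacement text.
replacements = {"#apple": "mention", "#juice": "mention"}

def replace_from_dictionary(text, mapping):
    # Build one alternation from the dictionary keys; (?!\w) keeps "#apple"
    # from matching inside a longer hashtag such as "#applesauce".
    alternatives = "|".join(re.escape(key) for key in mapping)
    pattern = re.compile(r"(?:" + alternatives + r")(?!\w)")
    # Look up each matched hashtag in the dictionary.
    return pattern.sub(lambda match: mapping[match.group(0)], text)

text = "words #apple #juice #tree #dog #table"
result = replace_from_dictionary(text, replacements)
print(result)
```

Hashtags that are not in the dictionary (#tree, #dog, #table) pass through unchanged, so the list of words to replace lives in data rather than in one long regular expression.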

Regards,
Thomas
LinkedIn: Thomas Ott
Blog: Neural Market Trends
Elite II

Re: HELP please-Regular expressions (Replace tokens)

Hi,

 

What you're trying to do calls for a so-called "negative lookahead", an advanced regular expression concept.

 

Take a look at this process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
        <list key="attribute_values">
          <parameter key="example1" value="&quot;words #apple #juice #tree #dog #table  i want to replace&quot;"/>
          <parameter key="example2" value="&quot;other words like #apple, #ibm, #microsoft, #rapidminer, #dog, whatever&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
        <parameter key="replace_what" value="\#(?!(tree|dog|table))(\w+)"/>
        <parameter key="replace_by" value="mention"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

It seems to do what you want.

The hashtags you don't want to replace are listed inside the negative lookahead (?!...) of this expression: \#(?!(tree|dog|table))(\w+)
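
To try the expression outside RapidMiner, here is a small Python sketch that applies it to the two example strings from the process above. Python's `re` module is only a stand-in here for the Java regular expressions RapidMiner uses, and the `strict_pattern` variant is my own addition, not part of the process: it adds a word boundary so that only the exact hashtags are kept.

```python
import re

# The expression from the process: match "#word" unless the word starts with tree, dog or table.
pattern = re.compile(r"\#(?!(tree|dog|table))(\w+)")

example1 = "words #apple #juice #tree #dog #table  i want to replace"
example2 = "other words like #apple, #ibm, #microsoft, #rapidminer, #dog, whatever"

result1 = pattern.sub("mention", example1)
result2 = pattern.sub("mention", example2)
print(result1)
print(result2)

# Assumed variant: \b makes the lookahead reject only the whole words,
# so "#treehouse" is replaced while "#tree" is still kept.
strict_pattern = re.compile(r"\#(?!(?:tree|dog|table)\b)\w+")
loose = pattern.sub("mention", "#tree #treehouse")
strict = strict_pattern.sub("mention", "#tree #treehouse")
print(loose)
print(strict)
```

Note that the original expression also keeps any hashtag that merely begins with one of the listed words (for example #treehouse). Whether that matters depends on your data.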

 

Regards,

Balázs

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at