HELP please-Regular expressions (Replace tokens)

happy_neid Member Posts: 10 Contributor I
edited November 2018 in Help

I want to find all tokens that are #hashtags and replace them with the word "mention", but I want to leave a certain subset of those hashtags unchanged.

Example: If I have the words #apple #juice #tree #dog #table, I want to replace #apple and #juice with the word "mention" and leave the tokens #tree, #dog and #table as they are.

How can I do that with the Replace Tokens operator?

I would really appreciate any help...

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    To drop the "#", you could match with a pattern like #(.*) and then replace it with $1.

    If you want to select #apple and replace it with "mention", you could match #apple and replace it with mention. This could get very messy if you have a lot of words you want to replace.

    What I would suggest is to use the Replace Dictionary operator and pass it a list of the words you want to change to "mention". Everything needs to be in a nominal data format first, and then you have to convert it to text so that Process Documents from Data works. In essence, you do the token replacement before the text processing.
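
The dictionary idea above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the approach, not the Replace Dictionary operator itself; the `replacements` mapping and the `replace_listed` helper are hypothetical names chosen for this example.

```python
import re

# Hypothetical dictionary: hashtag word -> replacement text.
replacements = {"apple": "mention", "juice": "mention"}

# One alternation built from the dictionary keys. The trailing \b keeps
# "#apple" from matching inside a longer tag such as "#applesauce".
pattern = re.compile(r"#(" + "|".join(map(re.escape, replacements)) + r")\b")

def replace_listed(text: str) -> str:
    """Replace only the hashtags that appear in `replacements`."""
    return pattern.sub(lambda m: replacements[m.group(1)], text)

print(replace_listed("#apple #juice #tree #dog #table"))
```

Hashtags that are not in the dictionary pass through unchanged, so the list of words to replace can grow without touching the pattern logic.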

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi,

     

    What you're trying to do calls for a so-called "negative lookahead", an advanced regular expression concept.

     

    Take a look at this process:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <list key="attribute_values">
    <parameter key="example1" value="&quot;words #apple #juice #tree #dog #table i want to replace&quot;"/>
    <parameter key="example2" value="&quot;other words like #apple, #ibm, #microsoft, #rapidminer, #dog, whatever&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
    <parameter key="replace_what" value="\#(?!(tree|dog|table))(\w+)"/>
    <parameter key="replace_by" value="mention"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    It seems to do what you want.

    The hashtags you don't want to replace are listed inside the (?!...) group of this expression: \#(?!(tree|dog|table))(\w+)

     

    Regards,

    Balázs
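
The lookahead pattern from the reply above can be tried outside RapidMiner with a short Python sketch. One thing to be aware of: as written, the lookahead protects any hashtag that merely starts with tree, dog or table. The `exact` variant below adds a `\b` word boundary so that only the exact tags are protected; that variant is an assumption about the intended behaviour, not part of the original process.

```python
import re

# Pattern from the process above: any hashtag not starting with tree, dog or table.
original = re.compile(r"\#(?!(tree|dog|table))(\w+)")

# Variant with a word boundary, so only the exact tags are left alone
# (an assumption about intent, not taken from the thread).
exact = re.compile(r"\#(?!(?:tree|dog|table)\b)(\w+)")

text = "#apple #juice #tree #dog #table #treehouse"

print(original.sub("mention", text))  # "#treehouse" is left alone
print(exact.sub("mention", text))     # "#treehouse" is replaced
```

Which of the two is right depends on whether tags such as #treehouse should count as part of the protected subset.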

  • Mustafa_AVDAN Member Posts: 34 Contributor I

    Hey, I have the same problem. I did it as you said, but the result is not what I want. Please look at my screen and help me :\

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Are you looking to do something like this?

    [Screenshot: 2017-12-01_9-34-33.png]

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
    <parameter key="connection" value="Twtter Test"/>
    <parameter key="query" value="Windows"/>
    <parameter key="locale" value="en"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.6.002" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    <parameter key="replace_what" value="#(\w+)"/>
    <parameter key="replace_by" value="$1"/>
    </operator>
    <connect from_op="Search Twitter" from_port="output" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
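
The pattern choice in the process above matters: `#(\w+)` matches each hashtag on its own, whereas the `#(.*)` mentioned earlier in the thread is greedy and runs from the first `#` to the end of the line. A small Python sketch shows the difference. It is an illustration outside RapidMiner, and Python writes the back-reference as `\1` where the Replace operator in the process uses `$1`.

```python
import re

text = "words #apple #juice #tree"

# Greedy: a single match spanning "#apple #juice #tree",
# so only the first "#" is removed.
greedy = re.sub(r"#(.*)", r"\1", text)

# Word-bounded: each hashtag is matched separately,
# so every "#" is removed.
per_tag = re.sub(r"#(\w+)", r"\1", text)

print(greedy)
print(per_tag)
```

Swapping the `\1` for a fixed string such as "mention" replaces each hashtag with that word instead of just stripping the "#".
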
  • Mustafa_AVDAN Member Posts: 34 Contributor I

    Oh, thanks, sir!

    When I changed $1 to "myword", it worked successfully. Thanks to the RapidMiner family :D
