How to clean tweets from hashtags and @

baran Member Posts: 5 Contributor I
edited November 2018 in Help
Hi everybody
I have been trying for 3 days to clean hashtags and @ mentions out of tweets, but I couldn't. Can anybody help?

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,749  RM Founder

    Hi,

     

    Do you mean just getting rid of the symbols "@" and "#", or do you also want to remove what follows them, e.g. should "@ingomierswa" and "#datascience" be removed completely?

     

    Both are easily possible with the "Replace" operator and a simple regular expression. Below is a small sample process showing how this is done.

     

    Hope this helps,

    Ingo

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <list key="attribute_values">
    <parameter key="sample_tweet" value="&quot;This is just a sample tweet from @ingomierswa on #datascience - end of tweet.&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Only remove symbols" width="90" x="380" y="34">
    <parameter key="replace_what" value="@|#"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Complete entities removed" width="90" x="380" y="136">
    <parameter key="replace_what" value="@[a-zA-Z]*|#[a-zA-Z]*"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Only remove symbols" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Complete entities removed" to_port="example set input"/>
    <connect from_op="Only remove symbols" from_port="example set output" to_port="result 1"/>
    <connect from_op="Complete entities removed" from_port="example set output" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="84"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
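
For readers who want to check the two expressions outside RapidMiner, here is a minimal Python sketch. It assumes that Python's `re` module handles these simple patterns the same way as the Replace operator does, and that the operator replaces each match with an empty string, since the process sets no replacement value. The variable names are only illustrative.

```python
import re

# Sample tweet taken from the process above.
tweet = "This is just a sample tweet from @ingomierswa on #datascience - end of tweet."

# Same expression as the "Only remove symbols" operator.
symbols_only = re.sub(r"@|#", "", tweet)

# Same expression as the "Complete entities removed" operator.
entities_removed = re.sub(r"@[a-zA-Z]*|#[a-zA-Z]*", "", tweet)

print(symbols_only)
print(entities_removed)
```

The second expression leaves the spaces around each removed entity in place, so the cleaned text contains double spaces where the mention and the hashtag used to be.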
  • baran Member Posts: 5 Contributor I
    Yes, exactly. Thank you! I will try it tomorrow and then edit this post.
  • Hyram Member Posts: 39 Contributor II
    Hi @IngoRM. This worked, thank you, but the pattern only removes the letters after the @ or #, so other characters are left behind. For example, I had @g_smug: it removed only @g and stopped at the underscore. Any suggestions?

    Thanks 
  • kayman Member Posts: 509   Unicorn

    Extend your regex a bit, like this:

    (@|#)[^\s.,]+

    It looks a bit ugly, but it basically means: find any 'word' that starts with either @ or #, and select everything up to the next whitespace, dot or comma. Replace this with nothing and it's gone.
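
To see the difference on the handle from the question above, here is a minimal Python sketch. It assumes Python's `re` module treats these patterns the same way as RapidMiner's Replace operator. The sample text is made up, and only `@g_smug` comes from the thread. The delimiter-based pattern is written here without a leading `\b`, because a word boundary never occurs between a space and `@` or `#`, so a pattern starting with `\b` would skip tags that follow a space.

```python
import re

# Hypothetical sample text; only the handle @g_smug appears in the thread.
tweet = "Thanks @g_smug, loving #data_science2018. More soon"

# Letters-only pattern: stops at the first character that is not a-z or A-Z.
with_letters_only = re.sub(r"@[a-zA-Z]*|#[a-zA-Z]*", "", tweet)

# Delimiter-based pattern: runs up to the next whitespace, dot or comma.
with_delimiters = re.sub(r"(@|#)[^\s.,]+", "", tweet)

# Same pattern with a leading \b: nothing matches after a space.
with_boundary = re.sub(r"\b(@|#)[^\s.,]+", "", tweet)

print(with_letters_only)  # Thanks _smug, loving _science2018. More soon
print(with_delimiters)    # Thanks , loving . More soon
print(with_boundary)      # unchanged
```

The delimiter-based pattern also removes digits and underscores, but the dot or comma that ended the tag stays in the text and may need a separate clean-up step.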
