How to clean tweets from hashtags and @

baranbaran Member Posts: 5 Contributor II
edited November 2018 in Help
Hi everybody
I have been trying for 3 days to remove hashtags and @ from tweets, but I couldn't. Can anybody help?

Answers

  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Hi,

     

    Do you just want to get rid of the symbols "@" and "#", or do you also want to remove what follows them, e.g. should "@ingomierswa" and "#datascience" be removed completely?

     

    Both are easily possible with the operator "Replace" and a simple regular expression. Below is a small sample process showing how this is done.

     

    Hope this helps,

    Ingo

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.3.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <list key="attribute_values">
    <parameter key="sample_tweet" value="&quot;This is just a sample tweet from @ingomierswa on #datascience - end of tweet.&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="34"/>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Only remove symbols" width="90" x="380" y="34">
    <parameter key="replace_what" value="@|#"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.001" expanded="true" height="82" name="Complete entities removed" width="90" x="380" y="136">
    <parameter key="replace_what" value="@[a-zA-Z]*|#[a-zA-Z]*"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Only remove symbols" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Complete entities removed" to_port="example set input"/>
    <connect from_op="Only remove symbols" from_port="example set output" to_port="result 1"/>
    <connect from_op="Complete entities removed" from_port="example set output" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="84"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
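To check the two expressions outside RapidMiner, here is a minimal sketch in Python. It is not part of the process above, and it assumes that Python's `re` module treats these simple patterns the same way the Replace operator does. The variable names are made up for illustration.

```python
import re

# Same sample text as in the "Generate Data by User Specification" operator.
tweet = "This is just a sample tweet from @ingomierswa on #datascience - end of tweet."

# Counterpart of "Only remove symbols": replace_what = @|#
only_symbols = re.sub(r"@|#", "", tweet)

# Counterpart of "Complete entities removed": replace_what = @[a-zA-Z]*|#[a-zA-Z]*
entities_removed = re.sub(r"@[a-zA-Z]*|#[a-zA-Z]*", "", tweet)

print(only_symbols)      # the words stay, only "@" and "#" are gone
print(entities_removed)  # the whole mention and hashtag are gone
```

Removing a whole entity leaves the spaces on both sides of it behind, so the second result contains double spaces where the mention and the hashtag used to be.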
  • baranbaran Member Posts: 5 Contributor II
    Yes, exactly. Thank you! I will try it tomorrow and then edit this post.
  • HyramHyram Member Posts: 39 Contributor II
    Hi @IngoRM. This worked, thank you, but I'm left with characters other than letters: the expression removes the letters after the @ or # but stops at any other character. For example, I had @g_smug and it only removed @g and stopped at the underscore. Any suggestions?

    Thanks 
  • kaymankayman Member Posts: 662 Unicorn

    Extend your regex a bit, like this:

    (@|#)[^.\s,]+

    It looks a bit ugly, but it basically means: find any 'word' that starts with either @ or #, and select everything up to the next whitespace, dot or comma. Replace this with nothing and it's gone.
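The "everything up to the next whitespace, dot or comma" idea can be sketched in Python as follows. This is only an illustration under assumptions: the pattern `[@#][^\s.,]+` is one possible spelling of that idea, the helper name `strip_entities` and the sample text are hypothetical, and Python's `re` is assumed to behave like RapidMiner's regex engine for these patterns. A leading `\b` is deliberately left out, because there is no word boundary between a space and `@`, so `\b@` would not match a mention that follows a space.

```python
import re

# One possible spelling of "starts with @ or #, runs until whitespace, dot or comma".
MENTION_OR_TAG = re.compile(r"[@#][^\s.,]+")


def strip_entities(text: str) -> str:
    """Remove @mentions and #hashtags, including underscores and digits."""
    return MENTION_OR_TAG.sub("", text)


sample = "hi @g_smug"

# The letters-only pattern stops at the underscore.
letters_only = re.sub(r"@[a-zA-Z]*|#[a-zA-Z]*", "", sample)

# The extended pattern removes the complete handle.
extended = strip_entities(sample)

# With a leading \b nothing is removed, since "@" follows a space.
with_boundary = re.sub(r"\b[@#][^\s.,]+", "", sample)

print(repr(letters_only))
print(repr(extended))
print(repr(with_boundary))
```

Because dot and comma end the match, punctuation that directly follows a handle, as in "thanks @g_smug, see you", stays in the text.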
