Remove or replace URL and RT from Twitter dataset

ikayunida123ikayunida123 Member Posts: 17 Contributor II
edited December 2018 in Help

Hello everyone!

So right now I'm trying to do a data cleaning phase on text classification using Twitter dataset. But I have a problem about how to replace (or maybe remove) the URL, RT and @ character. I've read some post on the forum but I didn't understand anything :catsad:

For the URL on the dataset, I want to change the format from "https:" or "http:" to "link" (I don't know why it can't have a null value like " "). But after I executed my process using Replace operator, the result from "http://blablabla" didn't change into "link" only, but the result come out like this "linkblablabla". Maybe it has something to do with the RegEx? :catsad: I know what's RegEx but I don't how how to use and write it :catsad:

I'm really confused right now. Please help me.

This's my RapidMiner process :

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi" width="90" x="45" y="34">
<parameter key="repository_entry" value="Dataset Skripsi"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<parameter key="attribute_name" value="Label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
<parameter key="condition_class" value="no_missing_attributes"/>
<list key="filters_list"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="replace" compatibility="8.1.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
<parameter key="replace_what" value="(https://)"/>
<parameter key="replace_by" value="link"/>
</operator>
<connect from_op="Retrieve Dataset Skripsi" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

I need your help. Thank you!

Tagged:

Best Answer

Answers

  • David_ADavid_A Administrator, Moderator, Employee, RMResearcher, Member Posts: 297 RM Research

    Woah great solution and very detailed.

    I took the liberty to re-use it to answer the same question on Stack Overflow.

  • ikayunida123ikayunida123 Member Posts: 17 Contributor II

    @rfuentealba Oh my god, thank you so much! It works nicely on my process :catvery-happy:

  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Glad it helped. However, I was reading my answer again and found that I made a mistake. Not a serious one unless you are parsing thousands of URL's (in that case, every saved flops counts):

     

    https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

     

    This is the final regular expression you should use. Using (http|https?) at the end is redundant (like asking if it's http or it's http or it's https), because s? means that the content might or might not have the character s at the end.

     

    Also, for future reference, I've found that on this implementation of regular expressions there is no need to escape the / character. That's a behaviour I acquired from using UNIX command line tools such as vim or sed.

     

  • AmosGHAmosGH Member Posts: 7 Learner I
    I also tried (https|http)(.*) for my URL and it worked
  • kaymankayman Member Posts: 662 Unicorn
    If you want a bit more 'readability' you could also change the A-Za-z0-9_ with \w\d which covers every word character and digit. 
Sign In or Register to comment.