Remove or replace URL and RT from Twitter dataset

ikayunida123 Member Posts: 17 Contributor II
edited December 2018 in Help

Hello everyone!

I'm currently cleaning a Twitter dataset for text classification, but I can't work out how to replace (or maybe remove) the URLs, "RT" and the @ character. I've read some posts on the forum but I didn't understand them :catsad:

For the URLs in the dataset, I want to replace anything that starts with "https:" or "http:" with "link" (I don't know why the replacement can't be an empty value like " "). But after I ran my process with the Replace operator, "http://blablabla" didn't become just "link"; the result came out as "linkblablabla". Maybe it has something to do with the RegEx? I know what RegEx is, but I don't know how to use or write it :catsad:
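
The "linkblablabla" result can be reproduced outside RapidMiner. The sketch below uses Python's `re` module as a stand-in for the Replace operator (an assumption: RapidMiner is Java-based and its regex flavor may differ in details, but these simple patterns should behave the same way). The sample tweet is made up for illustration. A pattern that matches only the scheme replaces only the scheme, so the rest of the URL is left behind.

```python
import re

# Made-up sample tweet, for illustration only.
tweet = "RT @someone: check this http://blablabla now"

# Matching only the scheme leaves the rest of the URL in place.
scheme_only = re.sub(r"https?://", "link", tweet)
print(scheme_only)  # RT @someone: check this linkblablabla now

# Extending the pattern with \S* (any run of non-whitespace characters)
# consumes the whole URL, so all of it is replaced.
whole_url = re.sub(r"https?://\S*", "link", tweet)
print(whole_url)  # RT @someone: check this link now
```

In other words, the pattern has to describe the entire URL, not just its beginning.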

I'm really confused right now. Please help me.

This is my RapidMiner process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi" width="90" x="45" y="34">
<parameter key="repository_entry" value="Dataset Skripsi"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<parameter key="attribute_name" value="Label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
<parameter key="condition_class" value="no_missing_attributes"/>
<list key="filters_list"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="replace" compatibility="8.1.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
<parameter key="replace_what" value="(https://)"/>
<parameter key="replace_by" value="link"/>
</operator>
<connect from_op="Retrieve Dataset Skripsi" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

I need your help. Thank you!

Best Answer

Answers

  • David_A Administrator, Moderator, Employee, RMResearcher, Member Posts: 246  RM Research

    Whoa, great solution, and very detailed.

    I took the liberty of reusing it to answer the same question on Stack Overflow.

  • ikayunida123 Member Posts: 17 Contributor II

    @rfuentealba Oh my god, thank you so much! It works nicely on my process :catvery-happy:

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 512   Unicorn

    Glad it helped. However, I was reading my answer again and found that I made a mistake. It's not a serious one unless you are parsing thousands of URLs (in which case every saved flop counts):

     

    https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

     

    This is the final regular expression you should use. The (http|https?) alternation is redundant (it's like asking whether it's http, or http, or https), because s? already means that the s at the end may or may not be present.

     

    Also, for future reference, I've found that this implementation of regular expressions doesn't require escaping the / character. Escaping it is a habit I picked up from UNIX command-line tools such as vim and sed.

     

  • AmosGH Member Posts: 7 Learner I
    I also tried (https|http)(.*) for my URL and it worked
  • kayman Member Posts: 506   Unicorn
    If you want a bit more 'readability' you could also replace A-Za-z0-9_ with \w, which covers every word character: letters, digits and the underscore (so adding \d is harmless but redundant).
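
To compare the patterns suggested in this thread side by side, here is a minimal sketch, using Python's `re` module as a stand-in for RapidMiner's Replace operator (an assumption: RapidMiner's regex engine is Java's, which should agree with Python for these patterns). The sample text and the variable names are made up for illustration.

```python
import re

# Made-up sample text; the URLs are followed by punctuation on purpose.
text = "Read https://example.com/a?b=1, then reply to http://foo.bar/x."

# Final pattern from the thread. Its last character class leaves out
# ? ! : , . ; so trailing punctuation is not swallowed by the match.
final = r"https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
cleaned = re.sub(final, "link", text)
print(cleaned)  # Read link, then reply to link.

# (http|https?) matches the same schemes as https?, so it adds nothing.
for scheme in ("http", "https"):
    assert re.fullmatch(r"https?", scheme)
    assert re.fullmatch(r"(http|https?)", scheme)

# Escaping / is harmless but unnecessary: both spellings match the same text.
escaped = final.replace("://", r":\/\/")
assert re.sub(escaped, "link", text) == cleaned

# (https|http)(.*): the greedy .* runs to the end of the line, so the
# text after the first URL is replaced too.
greedy = re.sub(r"(https|http)(.*)", "link", text)
print(greedy)  # Read link

# \w stands in for a-zA-Z0-9_ inside the character classes. Note that in
# Python \w also matches non-ASCII letters unless re.ASCII is passed.
with_w = r"https?://[-\w+&@#/%?=~|!:,.;]*[-\w+&@#/%=~|]"
assert re.sub(with_w, "link", text, flags=re.ASCII) == cleaned
```

The greedy variant is fine when the URL is the last thing in the text, but it discards anything that follows a URL, which matters if the remaining words are needed for classification.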