How can I extract the unique URLs in a set of tweets for each user in a Twitter data set with RapidMiner?

ramzanzadeh72 Member Posts: 14 Contributor I
edited June 2019 in Help

hi

I have a Twitter data set, and I want to extract the URLs from the tweets and count the unique URLs for each user. Can I do this in RapidMiner? How?

I am sharing the tweets sent by one user from my data set.

 

Thank you.

data.csv 116.3K

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @ramzanzadeh72,

     

    To extract URLs from tweets, you can use the Extract Entities operator from the Aylien extension (download it from the Marketplace; you also have to obtain an API key on the Aylien site).

    However, in your case, you have to purchase a paid license, because the free license is limited to 1,000 examples per day.

    Then you can use the Aggregate operator to count the unique URLs per user.

    And be patient: the Extract Entities operator takes a long time to compute.

     

    Here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Extract_URL\data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="text.true.polynominal.attribute"/>
    <parameter key="1" value="user_id.true.real.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range" width="90" x="246" y="34">
    <parameter key="first_example" value="40"/>
    <parameter key="last_example" value="100"/>
    </operator>
    <operator activated="true" class="com.aylien.textapi.rapidminer:aylien_entities" compatibility="0.2.000" expanded="true" height="68" name="Extract Entities" width="90" x="447" y="34">
    <parameter key="connection" value="Aylien_dkk"/>
    <parameter key="input_attribute" value="text"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="8.2.000" expanded="true" height="82" name="Aggregate" width="90" x="581" y="34">
    <list key="aggregation_attributes">
    <parameter key="url" value="count"/>
    </list>
    <parameter key="group_by_attributes" value="user_id"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Entities" to_port="Example Set"/>
    <connect from_op="Extract Entities" from_port="Example Set" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

     

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you don't want to pay for the Aylien plan, you could also try to extract URLs with specific regular expressions. Search the forum for several examples of how to do this (it has been discussed in a couple of other threads). The manual method is a bit more cumbersome, but it should be able to extract any URL in the standard format of http://..., https://..., or www....
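
    For illustration, here is a minimal Python sketch of that manual approach (the pattern and helper name are assumptions for the example, not a tested recipe); the same expression can be reused in any RapidMiner operator that accepts a regular expression:

    import re

    # An illustrative pattern covering the standard formats mentioned above:
    # http://..., https://..., and bare www.... Not an exhaustive URL grammar.
    URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

    def extract_urls(tweet):
        """Return every URL-like substring found in one tweet."""
        return URL_PATTERN.findall(tweet)

    print(extract_urls("new post: https://example.com/p/1 via www.example.org"))
    # ['https://example.com/p/1', 'www.example.org']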

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • ramzanzadeh72 Member Posts: 14 Contributor I
    @Telcontar120
    My problem is that some tweets contain two or more URLs. In that case, what can I do? I need to first store the URLs of each user and then count the unique URLs. Is this possible in RapidMiner?
  • ramzanzadeh72 Member Posts: 14 Contributor I

    My problem is that some tweets contain two or more URLs, and when I extract the URLs from these tweets and then use Aggregate, only the first URL is considered. What can I do?
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    @ramzanzadeh72,

     

    As mentioned by @Telcontar120, a free method to extract URLs is to use specific regular expressions.

    However, I don't know if it is possible to perform what you want to do with RapidMiner's native operators.

    So I propose a process with two branches, using two Python scripts:

     - one branch to extract all the URLs:

    Extract_URL.png

     - one branch to extract the URLs and count them:

    Extract_URL_2.png

    In your dataset, the URLs seem to be very simple, so I chose a simple regex to extract them (but you can look up a better pattern and set it in the Set Macro operator):

    Extract_URL_3.png

    Here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.2.000" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Extract_URL\data.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="text.true.polynominal.attribute"/>
    <parameter key="1" value="user_id.true.real.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="set_macro" compatibility="8.2.000" expanded="true" height="82" name="URL_Pattern" width="90" x="179" y="34">
    <parameter key="macro" value="urlPattern"/>
    <parameter key="value" value="r'(https?://\S+)'"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34"/>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Extract URLs" width="90" x="514" y="34">
    <parameter key="script" value="import pandas as pd&#10;import re&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; URLPATTERN = %{urlPattern}&#10;&#10; #data['urlcount'] = data.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()&#10; data['url'] = data.text.apply(lambda x: re.findall(URLPATTERN, x))&#10;&#10; #data.groupby('user_id').sum()['urlcount']&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Extract and count unique URL" width="90" x="514" y="187">
    <parameter key="script" value="import pandas as pd&#10;import re&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10; URLPATTERN = %{urlPattern}&#10;&#10; data['urlcount'] = data.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()&#10; &#10;&#10; #data.groupby('user_id').sum()['urlcount']&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="8.2.000" expanded="true" height="82" name="Aggregate" width="90" x="648" y="187">
    <list key="aggregation_attributes">
    <parameter key="urlcount" value="sum"/>
    </list>
    <parameter key="group_by_attributes" value="user_id"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="URL_Pattern" to_port="through 1"/>
    <connect from_op="URL_Pattern" from_port="through 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Extract URLs" to_port="input 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Extract and count unique URL" to_port="input 1"/>
    <connect from_op="Extract URLs" from_port="output 1" to_port="result 1"/>
    <connect from_op="Extract and count unique URL" from_port="output 1" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    To execute this process, you need to : 

     - install Python on your computer.

     - install the Execute Python operator (from the Marketplace).
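
    Note that the two scripts above count every URL occurrence (and Aggregate then sums the counts), so a URL posted twice is counted twice. If you need strictly unique URLs per user, here is a minimal pandas sketch of a variant of the second script; it assumes the columns are named text and user_id, as in the Read CSV step above:

    import re
    import pandas as pd

    URL_PATTERN = r'https?://\S+'  # the same simple pattern as in the Set Macro

    def rm_main(data):
        # Build one row per (user_id, url) pair from every tweet.
        rows = []
        for _, row in data.iterrows():
            for url in re.findall(URL_PATTERN, str(row['text'])):
                rows.append({'user_id': row['user_id'], 'url': url})
        pairs = pd.DataFrame(rows)
        # nunique() counts each distinct URL once per user, so a URL
        # repeated within or across tweets is only counted once.
        counts = pairs.groupby('user_id')['url'].nunique()
        return counts.reset_index(name='unique_url_count')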

     

    I hope it helps,

     

    Regards,

     

    Lionel

     

     

     

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @ramzanzadeh72,

     

    As @Telcontar120 and @lionelderkrikor mentioned, you may want to use regular expressions to identify your matches. A few days ago I wrote about identifying and removing URLs through regular expressions here. Long story short, you can use the Replace operator to apply a regular expression. This was the final expression:

     

    https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

     

    However, I've been playing with the most common patterns I know for at least 30 minutes now, and I couldn't find a way to match everything that isn't this pattern (so that you could remove the rest and keep only the URLs). It appears that in Java (and hence in RapidMiner) you can't use negative matching this way: the idea is to create the pattern you want matched and then either replaceAll("") the matches or find() the next one and do something with it (among other methods).
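
    One way around that limitation is to flip the problem: instead of matching everything that isn't a URL and removing it, match the URLs directly and keep only them. A quick Python sketch using the exact expression above (the helper name is made up for the example):

    import re

    # The expression from above, unchanged.
    URL_RE = re.compile(r'https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]')

    def urls_only(text):
        """Keep the matches instead of deleting the non-matches."""
        return URL_RE.findall(text)

    print(urls_only("see https://rapidminer.com/docs, then http://example.org!"))
    # ['https://rapidminer.com/docs', 'http://example.org']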

     

    Sorry I couldn't come up with a solution, but at least you know that regular expressions in pure RapidMiner might not be the right place to look for what you want to do (and by the way, this looks like a nice-to-have feature, doesn't it?).

     

    All the best,

     

  • jozeftomas_2020 Member Posts: 40

    Hello,
    How can I get improved spelling of words in RapidMiner? For example:

    meeseg -> message
    veeeery gooood -> very good

    Does anyone know?





  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    This last post should be in a new thread. 

    You can use "replace token" to swap a misspelling for a correct one.
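
    For elongated words that are not fixed (like "veeeery"), a common heuristic is to collapse character runs with a regex before applying a dictionary or spell-checker. This is only a sketch in Python, not a built-in operator:

    import re

    def collapse_repeats(text):
        """Shrink runs of 3+ identical characters down to two,
        e.g. 'veeeery gooood' -> 'veery good'. A dictionary or
        spell-checker pass is still needed to fix 'veery' -> 'very'."""
        return re.sub(r'(.)\1{2,}', r'\1\1', text)

    print(collapse_repeats("veeeery gooood"))  # 'veery good'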

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • jozeftomas_2020 Member Posts: 40

    Hello. I know, but my words are not fixed; those were just examples. Is there no other way?

  • atamulewicz Member Posts: 21 Contributor II

    Hi there @jozeftomas_2020

     

    Please search first, as there are a few posts in the Community on replacing text.

     

    Can you please post this question in a new thread under the Getting Started forum? That way, others who have the same question will be able to find it at a later date.

     

    Thanks, 

    Allie Tamulewicz 
