Read full article RSS feeds with RapidMiner and a free API

SGolbert RapidMiner Certified Analyst, Member Posts: 341   Unicorn
edited November 2018 in Knowledge Base

 Hi RapidMiners!

 

I wanted to share a process that I use to get full articles out of RSS feeds. It uses Python's Beautiful Soup and a free web API, Postlight's Mercury Web Parser.

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="Read RSS Feed" width="90" x="112" y="34">
<parameter key="url" value="https://www.presseportal.de/rss/polizei/laender/9.rss2"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="514" y="34">
<parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;import requests&#10;&#9;from bs4 import BeautifulSoup&#10;&#9;import json&#10;&#9;&#10;&#9;headers = {&quot;Content-Type&quot;: &quot;application/json&quot;,&#10;&#9; &quot;x-api-key&quot;: &quot;GET YOUR OWN!&quot;&#10;&#9; }&#10;&#9;&#10;&#9;results = []&#10;&#9;for address in data.Link:&#10;&#9;&#9;# pass the article address as a query parameter so requests URL-encodes it&#10;&#9;&#9;params = {'url': address}&#10;&#9;&#9;&#10;&#9;&#9;# retry up to 10 times on network errors&#10;&#9;&#9;response = None&#10;&#9;&#9;for dummy in range(10):&#10;&#9;&#9;&#9;try:&#10;&#9;&#9;&#9;&#9;response = requests.get('https://mercury.postlight.com/parser', params = params, headers = headers)&#10;&#9;&#9;&#9;&#9;break&#10;&#9;&#9;&#9;except requests.exceptions.RequestException:&#10;&#9;&#9;&#9;&#9;continue&#10;&#9;&#9;&#10;&#9;&#9;# keep an empty text if every attempt failed, so rows stay aligned&#10;&#9;&#9;if response is None:&#10;&#9;&#9;&#9;results.append('')&#10;&#9;&#9;&#9;continue&#10;&#9;&#9;&#10;&#9;&#9;html = json.loads(response.content)&#10;&#9;&#9;# the content field can be missing or empty for some pages&#10;&#9;&#9;html = html.get('content') or ''&#10;&#9;&#9;&#10;&#9;&#9;soup = BeautifulSoup(html, &quot;lxml&quot;)&#10;&#9;&#9;text = soup.get_text()&#10;&#9;&#9;text = text.replace('\n', ' ')&#10;&#9;&#9;results.append(text)&#10;&#9;&#10;&#9;data['main_text'] = results&#10;&#9;return data"/>
</operator>
<connect from_op="Read RSS Feed" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
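
A side note on the request URL: the script hands each article link to the parser in the `url` query parameter. Links that carry their own query string (with `?` and `&`) should be percent-encoded there, otherwise part of the address may get lost. Below is a minimal sketch using only the standard library. The helper name `build_parser_url` and the example link are made up for illustration; the endpoint is the one used in the process above.

```python
from urllib.parse import urlencode

# Endpoint taken from the process above
PARSER_ENDPOINT = 'https://mercury.postlight.com/parser'

def build_parser_url(article_url):
    """Return the parser request URL with the article link percent-encoded.

    build_parser_url is a hypothetical helper, not part of any library.
    """
    return PARSER_ENDPOINT + '?' + urlencode({'url': article_url})

# Example with a made-up article link that has its own query string
print(build_parser_url('https://example.com/a?b=1&c=2'))
```

With `requests`, passing `params={'url': address}` to `requests.get` should give the same encoding without building the string by hand.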

Considering that there are commercial products that do the same, I think it is a valuable resource! The number of API calls is limited, however, so take that into account. Its speed is also much lower than that of web scraping alternatives. I hope you enjoy it!
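
Since the API calls are limited, it can help to avoid requesting the same link twice and to pause between calls. Here is a rough sketch of that idea, not the code from the process: `fetch_all`, the pause length and the number of attempts are all assumptions, and `fetch` stands in for whatever function performs the real API request.

```python
import time

def fetch_all(addresses, fetch, pause_seconds=1.0, max_attempts=3):
    """Call fetch(address) once per distinct address, pausing between calls.

    fetch is a placeholder for the real API request and should return
    the article text. Failed addresses end up with an empty string.
    """
    cache = {}
    for address in addresses:
        if address in cache:
            continue  # duplicate link: no extra API call
        text = ''
        for attempt in range(max_attempts):
            try:
                text = fetch(address)
                break
            except Exception:
                # wait a little longer after each failed attempt
                time.sleep(pause_seconds * (attempt + 1))
        cache[address] = text
        time.sleep(pause_seconds)  # stay well below the call limit
    # one result per input row, in the original order
    return [cache[address] for address in addresses]
```

Inside `rm_main` this could be used as `data['main_text'] = fetch_all(list(data.Link), my_fetch)`, where `my_fetch` is your own function wrapping the request and the Beautiful Soup step.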


Comments

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,581  Community Manager

    this is GREAT, @SGolbert! Can I put this on the community repo (with full credit to you of course)?

  • SGolbert RapidMiner Certified Analyst, Member Posts: 341   Unicorn

    Yes, sure!

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,581  Community Manager

    DONE! You can find the process here.

     

    Scott

     

  • SGolbert RapidMiner Certified Analyst, Member Posts: 341   Unicorn
    Little update on the process: the code for the Mercury API has been open-sourced!

    You can find it under

    and use it on your own server, possibly making it a lot faster.

    Regards,
    Sebastian
