
Read full article RSS feeds with RapidMiner and a free API

SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
edited April 2020 in Knowledge Base

 Hi RapidMiners!


I wanted to share a process that I use to get full articles out of RSS feeds. It uses Python's Beautiful Soup and a web API called Mercury, by Postlight.

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="Read RSS Feed" width="90" x="112" y="34">
<parameter key="url" value="https://www.presseportal.de/rss/polizei/laender/9.rss2"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="514" y="34">
<parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;import requests&#10;&#9;from bs4 import BeautifulSoup&#10;&#9;import json&#10;&#9;&#10;&#9;headers = {&quot;Content-Type&quot;: &quot;application/json&quot;,&#10;&#9; &quot;x-api-key&quot;: &quot;GET YOUR OWN!&quot;&#10;&#9; }&#10;&#9;&#10;&#9;results = []&#10;&#9;for address in data.Link:&#10;&#9;&#9;url = 'https://mercury.postlight.com/parser?url=' + address&#10;&#9;&#9;&#10;&#9;&#9;# retry up to 10 times; response stays None if every attempt fails&#10;&#9;&#9;response = None&#10;&#9;&#9;for dummy in range(10):&#10;&#9;&#9;&#9;try:&#10;&#9;&#9;&#9;&#9;response = requests.get(url, headers = headers)&#10;&#9;&#9;&#9;&#9;break&#10;&#9;&#9;&#9;except requests.RequestException:&#10;&#9;&#9;&#9;&#9;continue&#10;&#9;&#9;&#10;&#9;&#9;# keep the row with an empty text when the article could not be fetched&#10;&#9;&#9;if response is None:&#10;&#9;&#9;&#9;results.append('')&#10;&#9;&#9;&#9;continue&#10;&#9;&#9;&#10;&#9;&#9;html = json.loads(response.content)&#10;&#9;&#9;html = html['content']&#10;&#9;&#9;&#10;&#9;&#9;# strip the HTML markup and flatten the text to a single line&#10;&#9;&#9;soup = BeautifulSoup(html, &quot;lxml&quot;)&#10;&#9;&#9;text = soup.get_text()&#10;&#9;&#9;text = text.replace('\n', ' ')&#10;&#9;&#9;results.append(text)&#10;&#9;&#10;&#9;data['main_text'] = results&#10;&#9;return data"/>
</operator>
<connect from_op="Read RSS Feed" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
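The escaped script inside the process is hard to read, so here is a minimal sketch of its retry step as a standalone Python function. The names `fetch_with_retries` and `fetch` are my own and not part of the process; `fetch` stands in for the `requests.get` call so the sketch runs without network access. It returns `None` when every attempt fails, which lets the caller decide what to store for that article.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], attempts: int = 10) -> Optional[T]:
    """Call fetch() up to `attempts` times and return the first successful result.

    Returns None if every attempt raises, so the caller can tell nothing was fetched.
    """
    for _ in range(attempts):
        try:
            return fetch()
        except Exception:
            # Any failure simply triggers the next attempt.
            continue
    return None
```

Inside `rm_main` this could be called as `fetch_with_retries(lambda: requests.get(url, headers=headers))`.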

Considering that there are commercial products that do the same, I think it is a valuable resource! The number of API calls is limited, however, so take that into account. Its speed is also much lower than that of web scraping alternatives. I hope you enjoy it!
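Since the API calls are limited, it may help to avoid requesting the same article twice. Below is a minimal sketch, assuming a hypothetical helper `cached_fetch` and a plain dictionary as the cache (neither is part of the process above); `fetch_text` stands in for whatever function calls the API for a single link.

```python
from typing import Callable, Dict


def cached_fetch(link: str, cache: Dict[str, str], fetch_text: Callable[[str], str]) -> str:
    """Return the article text for `link`, calling fetch_text only on a cache miss."""
    if link not in cache:
        # Only links that were not seen before cost an API call.
        cache[link] = fetch_text(link)
    return cache[link]
```

The dictionary could be saved to disk between runs (for example with `json.dump`), so that a feed that repeats older items does not use up calls again.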


    sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    this is GREAT, @SGolbert! Can I put this on the community repo (with full credit to you of course)?

    SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Yes, sure!

    sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    DONE! You can find the process here.




    SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    A small update on the process: the code for the Mercury API has been open-sourced!

    You can find it under

    and run it on your own server, which could make it a lot faster.
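To switch between the hosted API and a server of your own, the endpoint can be made a parameter. This is only a sketch: it assumes a self-hosted instance accepts the same `parser?url=` query as the hosted API, and `http://localhost:3000/parser` in the usage note is a placeholder address, not a documented one. The helper name `build_parser_url` is mine. Unlike the script in the process, it percent-encodes the article link, which keeps links that contain `&` or `?` intact.

```python
from urllib.parse import urlencode

# Endpoint used by the script in the process.
HOSTED_ENDPOINT = "https://mercury.postlight.com/parser"


def build_parser_url(article_url: str, endpoint: str = HOSTED_ENDPOINT) -> str:
    """Build the request URL, percent-encoding the article link as a query parameter."""
    return endpoint + "?" + urlencode({"url": article_url})
```

For example, `build_parser_url(address, "http://localhost:3000/parser")` would point the same request at a local server.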

