
Read full article RSS feeds with RapidMiner and a free API

SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
edited April 2020 in Knowledge Base

 Hi RapidMiners!

 

I wanted to share a process that I use to get full articles out of RSS feeds. It uses Python's Beautiful Soup and Postlight's Mercury web API.

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="Read RSS Feed" width="90" x="112" y="34">
<parameter key="url" value="https://www.presseportal.de/rss/polizei/laender/9.rss2"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="514" y="34">
<parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;import requests&#10;&#9;from bs4 import BeautifulSoup&#10;&#9;import json&#10;&#9;&#10;&#9;headers = {&quot;Content-Type&quot;: &quot;application/json&quot;,&#10;&#9; &quot;x-api-key&quot;: &quot;GET YOUR OWN!&quot;&#10;&#9; }&#10;&#9;&#10;&#9;results = []&#10;&#9;for address in data.Link:&#10;&#9;&#9;url = 'https://mercury.postlight.com/parser?url=' + address&#10;&#9;&#9;&#10;&#9;&#9;for dummy in range(10):&#10;&#9;&#9;&#9;try:&#10;&#9;&#9;&#9;&#9;response = requests.get(url, headers = headers)&#10;&#9;&#9;&#9;&#9;break&#10;&#9;&#9;&#9;except:&#10;&#9;&#9;&#9;&#9;continue&#10;&#9;&#9;&#10;&#9;&#9;html = json.loads(response.content)&#10;&#9;&#9;html = html['content']&#10;&#9;&#9;&#10;&#9;&#9;soup = BeautifulSoup(html, &quot;lxml&quot;)&#10;&#9;&#9;text = soup.get_text()&#10;&#9;&#9;text = text.replace('\n', ' ')&#10;&#9;&#9;results.append(text)&#10;&#9;&#10;&#9;data['main_text'] = results&#10;&#9;return data"/>
</operator>
<connect from_op="Read RSS Feed" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
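For readability, here is roughly what the embedded script does as a standalone Python sketch, assuming the same Mercury endpoint and headers as in the process above (the API key is a placeholder you have to replace, and lxml must be installed for Beautiful Soup):

import json

import requests
from bs4 import BeautifulSoup


# rm_main is the entry point that RapidMiner's Execute Python operator calls;
# it receives the RSS items as a pandas DataFrame with a "Link" column.
def rm_main(data):
    headers = {
        "Content-Type": "application/json",
        "x-api-key": "GET YOUR OWN!",  # replace with your own Mercury API key
    }

    results = []
    for address in data.Link:
        url = 'https://mercury.postlight.com/parser?url=' + address

        # Retry a few times to ride out transient network errors.
        response = None
        for _ in range(10):
            try:
                response = requests.get(url, headers=headers)
                break
            except requests.RequestException:
                continue

        # Mercury returns JSON; the cleaned-up article HTML sits in "content".
        html = json.loads(response.content)['content']

        # Strip the remaining tags and flatten the text to a single line.
        soup = BeautifulSoup(html, "lxml")
        results.append(soup.get_text().replace('\n', ' '))

    data['main_text'] = results
    return data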

Considering that there are commercial products that do the same, I think it is a valuable resource! The number of API calls is limited, however, so take that into account. It is also much slower than pure web-scraping alternatives. I hope you enjoy it!

Comments

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    this is GREAT, @SGolbert! Can I put this on the community repo (with full credit to you of course)?

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Yes, sure!

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    DONE! You can find the process here.

     

    Scott

     

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    Little update on the process: the code for the Mercury API has been open-sourced!

    You can find the source on GitHub and run it on your own server, possibly making it a lot faster.
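
    For illustration only: if you put the open-sourced parser behind your own HTTP endpoint (the URL below is hypothetical and depends entirely on how you host it), the Execute Python script only needs the base URL changed and the API key header dropped:

    import json

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical self-hosted endpoint; adjust to wherever you deploy the parser.
    PARSER_BASE = 'http://localhost:3000/parser?url='

    def fetch_article_text(address):
        # No x-api-key header is needed against your own instance.
        response = requests.get(PARSER_BASE + address)
        html = json.loads(response.content)['content']
        return BeautifulSoup(html, "lxml").get_text().replace('\n', ' ')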

    Regards,
    Sebastian
