SOLVED: RSS feeds & MySQL - 100 Records Only!

dudester Member Posts: 15 Maven
edited November 2018 in Help
I'll try to be brief: basically I have an issue trying to scrape complete RSS feeds into a MySQL database. Largely it works OK; for some reason that I can't decipher, it will only read 100 entries into MySQL, and lately it has been freezing my computer, likely due to memory constraints. (I speculate that this may be due to recent extension additions: Image Processing, IDA?)
Anyway, according to the log, the RSS feed is pulled in less than 5 seconds, then it hangs while it tries to display results. The system monitor shows available memory down to zip. I believe I have the MySQL settings correct; the example set in RapidMiner never pulls more than 100 entries at a time, even though I've got the batch size at 10,000. I need another pair of eyes...

So, here's the code for the process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <parameter key="logverbosity" value="all"/>
   <process expanded="true" height="466" width="797">
     <operator activated="true" class="web:read_rss" compatibility="5.2.000" expanded="true" height="60" name="Read RSS Feed" width="90" x="45" y="30">
       <parameter key="url" value="http://some random feed=rss"/>
       <parameter key="random_user_agent" value="true"/>
       <parameter key="connection_timeout" value="100000"/>
       <parameter key="read_timeout" value="100000"/>
     </operator>
     <operator activated="true" class="write_database" compatibility="5.2.003" expanded="true" height="60" name="Write Database" width="90" x="246" y="75">
       <parameter key="connection" value="dbconnectionvalue"/>
       <parameter key="use_default_schema" value="false"/>
       <parameter key="schema_name" value="schema1"/>
       <parameter key="table_name" value="tablename1"/>
       <parameter key="overwrite_mode" value="append"/>
       <parameter key="batch_size" value="10000"/>
     </operator>
     <connect from_op="Read RSS Feed" from_port="output" to_op="Write Database" to_port="input"/>
     <connect from_op="Write Database" from_port="through" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Why the magic number of only 100 items pulled? I don't see a limit set anywhere, either here or in the RapidMiner preferences.
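
One quick way to isolate the cause would be to pull the feed URL directly and count the items, bypassing RapidMiner and MySQL entirely. A minimal diagnostic sketch in Python (the URL is a placeholder, like the one in the process XML):

# Diagnostic sketch: count the <item> elements in the raw feed to see
# whether the 100-record cap comes from the feed itself rather than
# from RapidMiner or the MySQL write.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://some.random.feed?format=rss"  # placeholder

with urllib.request.urlopen(FEED_URL, timeout=100) as resp:
    tree = ET.parse(resp)

# RSS 2.0 nests entries under channel/item
items = tree.findall(".//channel/item")
print(f"Feed returned {len(items)} items")

If this already tops out at 100 items, the limit sits upstream of RapidMiner.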

Answers

  • dudester Member Posts: 15 Maven
    Oops, my bad... nothing to do with either RapidMiner or MySQL.

    Apparently Yahoo Pipes limits the amount of data you can scrape at a time to 100 items. There is a workaround of sorts, but it's best to either use another online mashup, or perhaps a desktop variety, for later input into DM.

    From http://pipes.yqlblog.net/.

    RSS pagination.
    "Initial RSS output is now limited to the first 100 items. Each paginated page is limited to 100 items as well. To access each subsequent page add parameter &page=2…etc. to the pipe.run url to retrieve more items." 