[SOLVED] Crawl Web not producing any results!

stringer_bell Member Posts: 2 Contributor I
edited November 2018 in Help
I am trying to crawl and save every boxscore page linked from http://www.pro-football-reference.com/years/2007/games.htm

The process produces no results: it starts and finishes in 0 s. Here is the process XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
   <process expanded="true" height="190" width="279">
     <operator activated="true" class="web:crawl_web" compatibility="5.2.003" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="75">
       <parameter key="url" value="http://www.pro-football-reference.com/years/2007/games.htm"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_url" value=".*boxscores/2007.*"/>
         <parameter key="store_with_matching_url" value=".*boxscores/2007.*"/>
       </list>
       <parameter key="output_dir" value="C:\Users\Stringer Bell\Desktop\scrape"/>
       <parameter key="extension" value="html"/>
       <parameter key="max_depth" value="3"/>
       <parameter key="obey_robot_exclusion" value="false"/>
       <parameter key="really_ignore_exclusion" value="true"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

If anyone can help it would be appreciated. I have spent hours on this and cannot figure it out.

Answers

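  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Hi, you have to increase the max_page_size parameter of the Crawl Web operator. The start page is probably larger than the current limit, so it gets skipped and no links are followed.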

    Best, Marius
  • stringer_bell Member Posts: 2 Contributor I
    Thank you, Marius! :)
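
For later readers, the fix Marius describes could look like the following in the process XML. This is only a sketch: the value 10000 is an illustrative assumption, not a number given in this thread, and the limit is, as far as I recall, expressed in kilobytes, so pick something comfortably larger than the pages you expect to fetch. Everything else is copied unchanged from the original process.

```xml
<operator activated="true" class="web:crawl_web" compatibility="5.2.003" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="75">
  <parameter key="url" value="http://www.pro-football-reference.com/years/2007/games.htm"/>
  <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value=".*boxscores/2007.*"/>
    <parameter key="store_with_matching_url" value=".*boxscores/2007.*"/>
  </list>
  <parameter key="output_dir" value="C:\Users\Stringer Bell\Desktop\scrape"/>
  <parameter key="extension" value="html"/>
  <parameter key="max_depth" value="3"/>
  <!-- Raised page-size limit. 10000 is an assumed example value, not taken from the thread. -->
  <parameter key="max_page_size" value="10000"/>
  <parameter key="obey_robot_exclusion" value="false"/>
  <parameter key="really_ignore_exclusion" value="true"/>
</operator>
```

The same setting should also be available in the Parameters panel of the Crawl Web operator, so editing the XML by hand is optional.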