[Solved] Web Crawler Operator: Empty folder and results

Kate_Strydom Member Posts: 19 Contributor II
edited November 2018 in Help
Hi,

I have followed all the instructions in http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html, but my web crawler output folder is empty and the process times out at 42 s. What am I doing wrong? Has anyone had this problem after changing the crawling rules to .+auburnbigdata.+? My process XML is below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="447" y="75">
       <parameter key="url" value="http://auburnbigdata.blogspot.com"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_url" value=".+auburnbigdata.+"/>
         <parameter key="store_with_matching_url" value=".+auburnbigdata.+"/>
       </list>
       <parameter key="output_dir" value="C:\Users\cec045\Desktop\CrawlData"/>
       <parameter key="max_depth" value="10"/>
       <parameter key="max_threads" value="2"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
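
For readers comparing against the blog post: the two crawling rules are regular expressions matched against each page's full URL, so .+auburnbigdata.+ matches any URL containing "auburnbigdata", including every page on http://auburnbigdata.blogspot.com. A minimal sketch of a tighter, anchored rule set is below; the pattern itself is illustrative and not from the original post:

<list key="crawling_rules">
  <!-- follow only links on the blog's own domain, and store every page that matches -->
  <parameter key="follow_link_with_matching_url" value="http://auburnbigdata\.blogspot\.com/.+"/>
  <parameter key="store_with_matching_url" value="http://auburnbigdata\.blogspot\.com/.+"/>
</list>

Note that an empty output folder can also mean no crawled page passed the store rule, so it is worth testing the regex against a known page URL before running the full crawl.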

Answers

  • Kate_Strydom Member Posts: 19 Contributor II
    :) I am so excited to create data outside of the DWH. It gives new meaning to data mining.

    We do not really know what happened, but it now works on our virtual machine setup, although there still seems to be a problem with RM on my PC.

    An SA RM user suggested that we change the default max page size to 500.

    Our server expert experimented further, and we then changed max threads to 4. Perhaps the crawler operator needs more threads, as my PC is limited to 2 threads. (A sketch of these parameter changes follows at the end of this reply.)

    We then tested it on a different website, and I cannot wait to continue learning the text processing side of RM.

    I noticed that leaving max pages blank means the crawler pulls everything, so we first tested with max pages set to 20.
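
    For reference, here is a sketch of the parameter changes described above, using the same style as the process XML in the question. The max_pages and max_page_size keys are assumed to be the XML names of the "max pages" and "max page size" parameters in the Web Mining extension, and the values are simply the ones we tested with, not recommended defaults:

    <!-- inside the Crawl Web operator -->
    <parameter key="max_pages" value="20"/>
    <parameter key="max_page_size" value="500"/>
    <parameter key="max_threads" value="4"/>
    <parameter key="max_depth" value="10"/>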