
[Solved] Web Crawler Operator: Empty folder and results

Kate_Strydom Member Posts: 19 Contributor II
edited November 2018 in Help
Hi,

I have followed all the instructions in http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html, but my web crawler output folder is empty and the process returns no results. The system times out at 42 s. What am I doing wrong? Has anyone had this problem after changing the crawling rules to .+auburnbigdata.+? My process XML is below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="447" y="75">
       <parameter key="url" value="http://auburnbigdata.blogspot.com"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_url" value=".+auburnbigdata.+"/>
         <parameter key="store_with_matching_url" value=".+auburnbigdata.+"/>
       </list>
       <parameter key="output_dir" value="C:\Users\cec045\Desktop\CrawlData"/>
       <parameter key="max_depth" value="10"/>
       <parameter key="max_threads" value="2"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
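
Since an empty output folder usually means the crawling rules never matched anything, it can help to sanity-check the rule regex against a few concrete URLs outside RapidMiner. Here is a minimal Python sketch; the sample URLs are mine, and I am assuming the operator requires the whole URL to match, which the wrapping .+ in the tutorial's pattern suggests:

import re

# Crawling rule taken from the process XML above.
rule = r".+auburnbigdata.+"

# Hypothetical sample URLs: the seed page, a post on the blog,
# and an off-site link that should NOT be followed or stored.
test_urls = [
    "http://auburnbigdata.blogspot.com",
    "http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html",
    "http://www.blogger.com/profile/12345",
]

for url in test_urls:
    # re.fullmatch mirrors Java's Pattern.matches(): the whole URL must match.
    ok = re.fullmatch(rule, url) is not None
    print(("MATCH   " if ok else "no match") + " " + url)

If the seed URL or the links you expect to crawl print "no match", the rules are the problem rather than the connection.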

Answers

    Kate_Strydom Member Posts: 19 Contributor II
    :) I am so excited to create data outside of the data warehouse (DWH). It gives new meaning to data mining.

    We do not really know what happened, but it now works on our virtual machine setup, although there still seems to be a problem with RapidMiner on my PC.

    A South African RapidMiner user suggested that we change the default max page size to 500 (sketched in XML below).

    Our server expert experimented with the settings, and we then changed max threads to 4. Perhaps the crawler operator needs more threads; my PC is limited to 2.

    We then tested it on a different website, and I cannot wait to continue learning the text processing part of RM.

    I noticed that leaving max pages blank means the crawler pulls everything, so we first tested with max pages set to 20.
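
    For reference, the tuning suggestions above would look roughly like this as parameters on the Crawl Web operator. This is a sketch only: the max_page_size and max_pages keys are my best guess at the operator's parameter names, so verify them against your own process XML.

    <!-- Sketch: parameter keys assumed from the operator's parameter
         names ("max page size", "max pages"); values are the ones we tried. -->
    <parameter key="max_page_size" value="500"/>
    <parameter key="max_threads" value="4"/>
    <parameter key="max_pages" value="20"/>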