Web crawling only works for some sites

Aqil Member Posts: 1 Contributor I
edited November 2018 in Help
I'm trying to get data from various websites (mostly ones with ads), so I'm trying out RapidMiner's web-crawler function. For practice I've successfully downloaded pages from wikipedia.org, google.com, and a few others, but there seem to be many sites from which I can't get any data. For example, I can't get RapidMiner to crawl gumtree.com/property-for-sale.

I know web crawling is disliked by many site operators, despite my good intentions, so I first suspected robot exclusions. But as you can see in the process below, that was not the problem. I also changed the user-agent name to "Firefox" and played around with the other parameters.

When it works, the task finishes in seconds or minutes and generates a set of neatly arranged txt files in the specified folder. When it doesn't work I don't get any error, just a message declaring that "New results were created". However, it finishes in 0 seconds, and no files are to be seen anywhere. Why doesn't the web crawler work for some sites (the ones with the juicy data), and how can I make it work?
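One thing worth ruling out first: whether the crawling rules themselves are filtering everything out. A quick sanity check of the rule `.+t.+` from the process below (this is just a standalone regex test, not how RapidMiner evaluates rules internally) shows it does match the target URL, so the rules are not what is blocking the crawl:

```python
import re

# The crawling rule used in the process below: store/follow any URL
# matching ".+t.+" (at least one character, a "t", then at least one more).
rule = re.compile(r".+t.+")

# The target URL contains several "t"s, so the rule matches it.
url = "http://www.gumtree.com/property-for-sale"
print(bool(rule.match(url)))  # True
```

Since the rule matches, the empty result is more likely caused by the site itself (blocking, redirects, or page-size limits) than by the rule configuration.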
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="145" width="212">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.gumtree.com/property-for-sale"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+t.+"/>
          <parameter key="follow_link_with_matching_url" value=".+t.+"/>
        </list>
        <parameter key="output_dir" value="/home/aqil/RapidMiner/rapidminer/repository/Test"/>
        <parameter key="max_depth" value="1"/>
        <parameter key="user_agent" value="Firefox"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


    nelsonthekinger Member Posts: 5 Contributor II
    I had the same problem, and it's a size issue: increase the value of your "max page size" parameter so that the crawler can actually store your pages. :)
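    For example, raising the limit in the Crawl Web operator's XML might look like the sketch below. The exact parameter key and its unit are assumptions here (check the operator's parameter panel for the precise name; the limit is typically given in KB), and the value 5000 is just an illustrative choice:

    ```xml
    <operator activated="true" class="web:crawl_web" compatibility="5.1.004" name="Crawl Web">
      <parameter key="url" value="http://www.gumtree.com/property-for-sale"/>
      <!-- Assumed parameter key: raises the per-page size limit so large
           pages are not silently skipped by the crawler. -->
      <parameter key="max_page_size" value="5000"/>
    </operator>
    ```

    Pages bigger than the limit are skipped without an error, which would explain a run that "succeeds" in 0 seconds and writes no files.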
