Web crawling only works for some sites

Aqil Member Posts: 1 Contributor I
edited November 2018 in Help
I'm trying to get data from various websites (mostly ones with ads), so I'm trying out RapidMiner's web-crawler function. For practice I've successfully downloaded pages from wikipedia.org, google.com, and a few others, but there seem to be many sites from which I can't get any data. For example, I can't get RapidMiner to crawl gumtree.com/property-for-sale.

I know web crawling is disliked by many site operators, despite my good intentions, so I first suspected robot exclusions. But as you can see in the process below, that was not the problem. I also changed the user-agent name to "Firefox" and played around with the other parameters.

When it works, the task finishes in seconds or minutes and generates a set of neatly arranged txt files in the specified folder. When it doesn't work I don't get any error, just a message declaring that "New results were created". However, it finishes in 0 seconds, and no files are to be seen anywhere. Why doesn't the web crawler work for some sites (the ones with the juicy data), and how can I make it work?
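One thing worth ruling out first: whether the crawling rules themselves are filtering everything out. A quick sanity check of the rule `.+t.+` from the process below (this is just a standalone regex test, not how RapidMiner evaluates rules internally) shows it does match the target URL, so the rules are not what is blocking the crawl:

```python
import re

# The crawling rule used in the process below: store/follow any URL
# matching ".+t.+" (at least one character, a "t", then at least one more).
rule = re.compile(r".+t.+")

# The target URL contains several "t"s, so the rule matches it.
url = "http://www.gumtree.com/property-for-sale"
print(bool(rule.match(url)))  # True
```

Since the rule matches, the empty result is more likely caused by the site itself (blocking, redirects, or page-size limits) than by the rule configuration.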
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="145" width="212">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.gumtree.com/property-for-sale"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+t.+"/>
          <parameter key="follow_link_with_matching_url" value=".+t.+"/>
        </list>
        <parameter key="output_dir" value="/home/aqil/RapidMiner/rapidminer/repository/Test"/>
        <parameter key="max_depth" value="1"/>
        <parameter key="user_agent" value="Firefox"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


    nelsonthekinger Member Posts: 5 Contributor II
    I had the same problem, and it's a size issue: increase the value of your "max page size" parameter so that the crawler can actually store your pages. :)
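    For example, raising the limit in the Crawl Web operator's XML might look like the sketch below. The exact parameter key and its unit are assumptions here (check the operator's parameter panel for the precise name; the limit is typically given in KB), and the value 5000 is just an illustrative choice:

    ```xml
    <operator activated="true" class="web:crawl_web" compatibility="5.1.004" name="Crawl Web">
      <parameter key="url" value="http://www.gumtree.com/property-for-sale"/>
      <!-- Assumed parameter key: raises the per-page size limit so large
           pages are not silently skipped by the crawler. -->
      <parameter key="max_page_size" value="5000"/>
    </operator>
    ```

    Pages bigger than the limit are skipped without an error, which would explain a run that "succeeds" in 0 seconds and writes no files.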
