Webcrawling - problem with storing sites

blint Member Posts: 1 Contributor I
Hey,

I'm working on a web crawling project in RapidMiner 5, analysing projects from various crowdfunding sites via text mining. I already have a working text analyser, but I'm stuck on the web crawling part: the crawler does crawl through the requested sites, but it doesn't store the pages. I have experimented with max page size, depth and the like, yet the process still just skips those pages, so the problem is probably in my storing rules. For Kickstarter they look like this:

Follow with matching URL:
.+kickstarter.+
Store with matching URL:
https://www\.kickstarter\.com\/projects.+
http://www\.kickstarter\.com\/projects.+
(?i)http.*://www\.kickstarter\.com\/projects.+
An example URL that would need to be stored is:
http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
(no advertising intended)
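
For reference, the patterns themselves do seem to match that URL when tested in isolation. Here is a rough Python sketch (an assumption on my part that RapidMiner applies each rule as a full match, like Java's String.matches()); it reports a match for the http:// and (?i) variants, and the https:// variant fails only because the example URL uses http:

import re

url = ("http://www.kickstarter.com/projects/corvuse/"
       "bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight")

store_rules = [
    r"https://www\.kickstarter\.com\/projects.+",
    r"http://www\.kickstarter\.com\/projects.+",
    r"(?i)http.*://www\.kickstarter\.com\/projects.+",
]

for rule in store_rules:
    # fullmatch requires the whole URL to satisfy the pattern,
    # mirroring Java's String.matches() semantics
    print(rule, "->", bool(re.fullmatch(rule, url)))

So if the rules are at fault, it is presumably in how the crawler applies them rather than in the patterns themselves.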

And the log looks like the following:
Mar 12, 2014 11:50:37 AM INFO: Following link http://www.kickstarter.com/projects/corvuse/bhaloidam-an-indie-tabletop-storytelling-game?ref=spotlight
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/post/12036057734/todays-project-of-the-day-is-bhaloidam-an-indie
Mar 12, 2014 11:50:37 AM INFO: Following link http://kickstarter.tumblr.com/tagged/bhaloidam
Mar 12, 2014 11:50:38 AM INFO: Discarded page "http://kickstarter.tumblr.com/post/79165806431/do-you-like-coloring-and-also-have-questions" because url does not match filter rules.
As you can see, the crawler follows these links and then simply skips them: the pages are never stored, and for most of them there is not even a "Discarded page ... because url does not match filter rules" message, so I'm not sure those links are being compared against the rules at all. The log shows many lines starting with "Following link ..." but very few starting with "Discarded page ...". Does that mean only a few pages are actually checked against the rules, or just that not every discarded page is logged?

Thanks in advance!
Cheers

Answers

  • MH Member Posts: 3 Contributor I
    I have had the same issue. 

    My parameters:
    url: http://connect.jems.com/profiles/blog/list?tag=EMS
    store with url, follow with url: .+blog.+
    output directory: C:\Program Files\Rapid-I\myfiles\webcrawl
    extension: html
    max pages: 20
    max depth: 20
    domain: web
    delay: 500
    max threads: 2
    max page size: 500
    obey robot exclusion: T

    The output only provides the first page. This is contrary to the SimaFore instructions and example
    http://www.simafore.com/blog/bid/112223/text-mining-how-to-fine-tune-job-searches-using-web-crawling
    which state that the last file is stored.

    I also tried to follow the Vancouver Data blogspot tutorial
    https://www.youtube.com/watch?v=zMyrw0HsREg#t=13
    and duplicate its result. In all of my runs only the first page is stored, although the log shows that the crawler obeys the follow-link rule.

    Any help would be greatly appreciated!  I am getting really frustrated with this.

    My process XML is below:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
            <parameter key="url" value="http://connect.jems.com/profiles/blog/list?tag=EMS"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+blog.+"/>
              <parameter key="follow_link_with_matching_url" value=".+blog.+"/>
            </list>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Program Files\Rapid-I\myfiles\webcrawl"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="20"/>
            <parameter key="max_depth" value="20"/>
            <parameter key="delay" value="500"/>
            <parameter key="max_threads" value="2"/>
            <parameter key="max_page_size" value="500"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MH Member Posts: 3 Contributor I
    I was able to resolve my issue (and you may be able to resolve yours) by working on the user agent name. Using the site recommended by Vancouver Data (whatismyuseragent), it reported a long string with punctuation (parentheses, semicolons, etc.). I revised it to plain text with only periods, and it worked fine after that.
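
    In case it helps anyone check the same thing outside RapidMiner, here is a rough Python sketch (only an assumption about why the change helped, not how RapidMiner itself sends the request). It requests the seed page once with the long browser string from the XML above and once with a plain, punctuation-free string, so you can see whether the server responds differently:

    import urllib.error
    import urllib.request

    URL = "http://connect.jems.com/profiles/blog/list?tag=EMS"

    def fetch(user_agent):
        # request the page with the given User-Agent; return (status, bytes read)
        req = urllib.request.Request(URL, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status, len(resp.read())
        except urllib.error.HTTPError as err:
            return err.code, 0

    # the long browser string used in the process XML above
    long_ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
    # a hypothetical plain string with only letters and periods
    simple_ua = "rapidminer.webcrawl.test"

    print("long UA:  ", fetch(long_ua))
    print("simple UA:", fetch(simple_ua))

    If both requests come back the same, the difference was probably in how the parameter string was handled inside RapidMiner rather than on the server side.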