"Web Crawler problem"

mmarag Member Posts: 35 Maven
edited May 2019 in Help
Hi all,

I am facing a serious bug when using the web crawler or the Process Documents from Web operator. I am trying to run a simple opinion mining experiment on the http://www.opengov.gr/ website, which, according to its robots.txt file, allows every agent freely.

However, nothing happens, and nothing appears in my log either. For your information, I did not use any crawling rules. Kind regards,

mmarag

Answers

  • haddock Member Posts: 849 Maven
    Hi there Mmarag,

    For future reference, pasting the XML of your process makes it easier to check. For now, the following process appears to work, so I wonder where the "serious bug" really lies.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true" height="454" width="812">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="111" y="242">
            <parameter key="url" value="http://www.opengov.gr/"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value=".*gr.*"/>
              <parameter key="store_with_matching_url" value=".*gr.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Administrator.KNOWLEDG-P6715Y\My Documents"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="62" y="117">
            <parameter key="url" value="http://www.opengov.gr/home/"/>
            <list key="query_parameters"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <connect from_op="Get Page" from_port="output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • mmarag Member Posts: 35 Maven
    Dear Sir,

    thank you very much for the rapid response.

    Mmarag
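
A note on the crawling rules in the process posted above: the pattern ".*gr.*" matches any URL that contains the letters "gr", so the crawler may also follow and store pages outside opengov.gr. Below is a minimal sketch of a narrower rule list. It reuses the two rule keys from the posted process and assumes Java-style regular expressions; the pattern itself is an illustrative assumption and has not been tested against the site.

```xml
<!-- Hypothetical variant of the crawling_rules list from the process above:
     only URLs containing "opengov.gr" are followed and stored. -->
<list key="crawling_rules">
  <parameter key="follow_link_with_matching_url" value=".*opengov\.gr.*"/>
  <parameter key="store_with_matching_url" value=".*opengov\.gr.*"/>
</list>
```

This list would take the place of the crawling_rules element inside the Crawl Web operator; the other parameters stay unchanged.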