"problem with web crawling"

platanas20platanas20 Member Posts: 22 Contributor II
edited May 2019 in Help
Hello all,

we want to take some comments(only text) from a website using xpath.we tried a lot of differents commands but we cant find what goes wrong.Can anyone help?

platanas20

our xml code is:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true" height="603" width="880">
      <operator activated="true" class="web:process_web" compatibility="5.1.000" expanded="true" height="60" name="Process Documents from Web" width="90" x="246" y="165">
        <parameter key="url" value="http://www.opengov.gr/ypes/?p=877#comments"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*page.*"/>
          <parameter key="follow_link_with_matching_url" value=".*page.*|.*.gr.*"/>
        </list>
        <parameter key="max_pages" value="10"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24"/>
        <process expanded="true" height="485" width="979">
          <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information (2)" width="90" x="210" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="comment" value="//div[@class=&amp;quot;comment even thread-even depth-1&quot;]/p/h:/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
          <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Tagged:

Answers

  • el_chiefel_chief Member Posts: 63 Contributor II
    make sure you test with google spreadsheets first (or a similar program) so that you can see if it works

    this xpath seemed to work:

    //ul[@class='comment_list']/li/div[2]/p/text()

    remember, in rapidminer, you have to preceed tagnames with "h:", so it should be

    //h:ul[@class='comment_list']/h:li/h:div[2]/h:p/text()

    see

    http://vancouverdata.blogspot.com/2011/02/how-to-web-scraping-xpath-html-google.html
  • platanas20platanas20 Member Posts: 22 Contributor II
    Hello Neil,
    Thank you very much.This xpath command works for our project.
    But now we use the operator "crawl web" and we want the pages from http://www.opengov.gr/ypes/?p=877#comments and we dont have results.Do you know what is the problem?
    Because with other websites this project works fine (with necessary changes in parameter keys of course).

    My xml code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.004">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true" height="603" width="880">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="210">
            <parameter key="url" value="http://www.opengov.gr/ypes/?p=877#comments"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".*page.*"/>
              <parameter key="follow_link_with_matching_url" value=".*page.*|.*.gr.*"/>
            </list>
            <parameter key="output_dir" value="C:\Users\elenious\Desktop\diplomatiki\newresults\temp"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.100 Safari/534.30"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
Sign In or Register to comment.