
"problem with web crawling"

platanas20 Member Posts: 22 Contributor II
edited May 2019 in Help
Hello all,

We want to extract some comments (text only) from a website using XPath. We have tried a lot of different queries, but we can't find what is going wrong. Can anyone help?

platanas20

Our XML code is:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true" height="603" width="880">
      <operator activated="true" class="web:process_web" compatibility="5.1.000" expanded="true" height="60" name="Process Documents from Web" width="90" x="246" y="165">
        <parameter key="url" value="http://www.opengov.gr/ypes/?p=877#comments"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*page.*"/>
          <parameter key="follow_link_with_matching_url" value=".*page.*|.*.gr.*"/>
        </list>
        <parameter key="max_pages" value="10"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24"/>
        <process expanded="true" height="485" width="979">
          <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information (2)" width="90" x="210" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="comment" value="//div[@class=&quot;comment even thread-even depth-1&quot;]/p/h:/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
          <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • el_chief Member Posts: 63 Contributor II
    Make sure you test the query with Google Spreadsheets first (or a similar program) so that you can see whether it works.

    This XPath seemed to work:

    //ul[@class='comment_list']/li/div[2]/p/text()

    Remember, in RapidMiner you have to precede tag names with "h:", so it should be

    //h:ul[@class='comment_list']/h:li/h:div[2]/h:p/text()

    See

    http://vancouverdata.blogspot.com/2011/02/how-to-web-scraping-xpath-html-google.html
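
    The effect of the "h:" prefix can be reproduced outside RapidMiner. The following is only a minimal sketch in Python, not the RapidMiner implementation: the XHTML snippet is a hypothetical stand-in for the real page, the prefix "h" is bound by hand to the XHTML namespace, and because the standard library's ElementTree does not support text(), the text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet that mimics the structure the XPath targets;
# the real page markup may differ.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul class="comment_list">
      <li><div>author one</div><div><p>first comment</p></div></li>
      <li><div>author two</div><div><p>second comment</p></div></li>
    </ul>
  </body>
</html>"""

# Assumption: "h" stands for the XHTML namespace, bound here explicitly.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Unprefixed names only match elements without a namespace, so nothing is found.
plain = root.findall(".//ul[@class='comment_list']/li/div[2]/p")

# Prefixed names match the namespaced XHTML elements.
prefixed = root.findall(".//h:ul[@class='comment_list']/h:li/h:div[2]/h:p", NS)

# ElementTree has no text() step, so take the text of each matched <p>.
comments = [p.text for p in prefixed]

print(len(plain), comments)
```

    The unprefixed query returns nothing because every element of the parsed document lives in the XHTML namespace, which is the same reason the prefixed form is needed in the query above.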
  • platanas20 Member Posts: 22 Contributor II
    Hello Neil,
    Thank you very much. This XPath query works for our project.
    But now we are using the "Crawl Web" operator to fetch the pages from http://www.opengov.gr/ypes/?p=877#comments, and we get no results. Do you know what the problem is?
    With other websites this process works fine (with the necessary changes to the parameter keys, of course).

    My XML code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.004">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true" height="603" width="880">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="210">
            <parameter key="url" value="http://www.opengov.gr/ypes/?p=877#comments"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".*page.*"/>
              <parameter key="follow_link_with_matching_url" value=".*page.*|.*.gr.*"/>
            </list>
            <parameter key="output_dir" value="C:\Users\elenious\Desktop\diplomatiki\newresults\temp"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.100 Safari/534.30"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
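
    One thing that may be worth checking is how the two crawling rules behave on the start URL. The sketch below tests the patterns from the process above in Python. It assumes that a rule applies only when the pattern matches the whole URL, which may not be exactly how the Crawl Web operator evaluates it, and the helper name check is made up for this example.

```python
import re

# Patterns and start URL copied from the process XML above.
STORE_RULE = r".*page.*"
FOLLOW_RULE = r".*page.*|.*.gr.*"
START_URL = "http://www.opengov.gr/ypes/?p=877#comments"


def check(url):
    """Report which rules match the whole URL (assumed matching semantics)."""
    return {
        "store": re.fullmatch(STORE_RULE, url) is not None,
        "follow": re.fullmatch(FOLLOW_RULE, url) is not None,
    }


result = check(START_URL)
print(result)
```

    Under that assumption the start URL satisfies the follow rule but not the store rule, since it does not contain "page". Whether any of the linked pages satisfy the store rule would have to be checked against the actual links on the site.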