Not getting any results for "Process Documents from Web"

RionArisu Member Posts: 13 Contributor I
I'm trying to perform web scraping on a URL using the "Process Documents from Web" operator, and have set up an XPath query using the "Extract Information" operator. I have tested the XPath query with the Google Sheets "importxml" function and it seemed to work fine. However, when I run the process in RapidMiner, it does not return any results.
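
The Google Sheets test was something along these lines (same URL and XPath as in the process below; the exact formula may have differed slightly):

=IMPORTXML("https://en.wikipedia.org/wiki/List_of_Running_Man_episodes_(2020)", "//*[@id='mw-content-text']/div[1]/table[2]/tbody/tr[2]/td[2]/i")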

What could be the reason?
I would really appreciate it if anyone could help me :smile:

My XML:
<?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:process_web_modern" compatibility="9.3.001" expanded="true" height="68" name="Process Documents from Web" width="90" x="112" y="85">
        <parameter key="url" value="https://en.wikipedia.org/wiki/List_of_Running_Man_episodes_(2020)"/>
        <list key="crawling_rules"/>
        <parameter key="max_crawl_depth" value="2"/>
        <parameter key="retrieve_as_html" value="false"/>
        <parameter key="enable_basic_auth" value="false"/>
        <parameter key="add_content_as_attribute" value="false"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="delay" value="200"/>
        <parameter key="max_concurrent_connections" value="100"/>
        <parameter key="max_connections_per_host" value="100"/>
        <parameter key="user_agent" value="rapidminer-web-mining-extension-crawler"/>
        <parameter key="ignore_robot_exclusion" value="false"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="text:extract_information" compatibility="9.3.001" expanded="true" height="68" name="Extract Information" width="90" x="112" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Title" value="//*[@id=&amp;quot;mw-content-text&quot;]/div[1]/table[2]/tbody/tr[2]/td[2]/i"/>
            </list>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>



Best Answer

  • kayman Member Posts: 662 Unicorn
    edited January 2021 Solution Accepted
    Yeah, this is a bit tricky because of namespaces. RapidMiner by default needs h: to define the (X)HTML namespace. Namespaces are one of those required things that make XML life overly complex.

    The right syntax to use would therefore be something like this:

    //*[@id="JSID_cwCompanyNews"]/h:div/h:div/h:div/h:div[1]/h:ul/h:li[1]/h:span/h:span[1]/h:span[2]/h:a/text()

    Note the h: in front of every element; this allows RapidMiner to parse correctly, as it now knows it's dealing with HTML. I also added text() at the end, and now it returns

    アナリストが予想する22年3月期の業績急改善企業

    as the Title attribute. In order to get this you need to convert your document to an example set.
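
    As a rough sketch (untested), the same fix applied to the Wikipedia query from your first process would look something like:

    //*[@id="mw-content-text"]/h:div[1]/h:table[2]/h:tbody/h:tr[2]/h:td[2]/h:i/text()

    Only the element steps get the h: prefix; the wildcard * and attribute tests like @id stay as they are.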

    Note that you could bypass the namespace problem if you were working with properly parsed XHTML; then you could disable the 'assume html' option and work with XPath 'the easy way'. Your website isn't proper XHTML, so to get there you would first have to use the HTML to XML converter to ensure it's parseable, and then remove the namespaces. You could do this with a regex that simply replaces everything up to the <html> tag, since that's where the namespaces are declared.

    Something like (?s)^.*?<html.*?> replaced with <html>. Now there are no more namespaces, so you can use the 'standard' notation. Google Docs does this for you behind the scenes, which is fine for HTML but makes it unusable for any other XML, and that's where RapidMiner offers more options.
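
    For illustration (what the converted document actually starts with will vary), that replacement would turn something like

    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head>...

    into

    <html><head>...

    after which the plain //div/table/... style paths work without any prefixes.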

Answers

  • kayman Member Posts: 662 Unicorn
    You didn't set any crawling rules as far as I can see, which is why it isn't doing anything.
    Since you only need to get one page, you might be better off using the Get Page operator followed by Extract Information to apply your XPath.
  • RionArisu Member Posts: 13 Contributor I
    Thanks, I will try to figure out the crawling rules.
    I have also tried using Get Page followed by Extract Information, but that doesn't seem to return any results either.

    My XML:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="9.3.001" expanded="true" height="68" name="Get Page" width="90" x="179" y="187">
            <parameter key="url" value="https://www.nikkei.com/nkd/company/news/?scode=7203&amp;ba=1&amp;DisplayType=1"/>
            <parameter key="random_user_agent" value="false"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="none"/>
            <parameter key="cookie_scope" value="global"/>
            <parameter key="request_method" value="GET"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <parameter key="override_encoding" value="false"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="keep_sensitive_headers" value="false"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="9.3.001" expanded="true" height="68" name="Extract Information (2)" width="90" x="313" y="187">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Title" value="//*[@id=&amp;quot;JSID_cwCompanyNews&quot;]/div/div/div/div[1]/ul/li[1]/span/span[1]/span[2]/a"/>
            </list>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Extract Information (2)" to_port="document"/>
          <connect from_op="Extract Information (2)" from_port="document" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    


  • RionArisu Member Posts: 13 Contributor I
    Thanks a lot, that solved my issue!