Not getting any results for "Process Documents from Web"

RionArisu Member Posts: 13 Contributor I
I'm trying to perform web scraping on a URL using the "Process Documents from Web" operator, and have set up an XPath query using the "Extract Information" operator. I have tested the XPath query with the Google Sheets "importxml" function and it seemed to work fine. However, when I run the process in RapidMiner, it does not return any results.
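
The Google Sheets test was something along these lines (same URL and XPath as in the process below; the exact formula may have differed slightly):

=IMPORTXML("https://en.wikipedia.org/wiki/List_of_Running_Man_episodes_(2020)", "//*[@id='mw-content-text']/div[1]/table[2]/tbody/tr[2]/td[2]/i")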

What could be the reason?
I would really appreciate it if anyone could help me :smile:

My XML:
<?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:process_web_modern" compatibility="9.3.001" expanded="true" height="68" name="Process Documents from Web" width="90" x="112" y="85">
        <parameter key="url" value="https://en.wikipedia.org/wiki/List_of_Running_Man_episodes_(2020)"/>
        <list key="crawling_rules"/>
        <parameter key="max_crawl_depth" value="2"/>
        <parameter key="retrieve_as_html" value="false"/>
        <parameter key="enable_basic_auth" value="false"/>
        <parameter key="add_content_as_attribute" value="false"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="delay" value="200"/>
        <parameter key="max_concurrent_connections" value="100"/>
        <parameter key="max_connections_per_host" value="100"/>
        <parameter key="user_agent" value="rapidminer-web-mining-extension-crawler"/>
        <parameter key="ignore_robot_exclusion" value="false"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="text:extract_information" compatibility="9.3.001" expanded="true" height="68" name="Extract Information" width="90" x="112" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Title" value="//*[@id=&amp;quot;mw-content-text&quot;]/div[1]/table[2]/tbody/tr[2]/td[2]/i"/>
            </list>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>



Best Answer

  • kayman Member Posts: 662 Unicorn
    edited January 2021 Solution Accepted
    Yeah, this is a bit tricky because of namespaces. RapidMiner by default needs h: to define the (X)HTML namespace. Namespaces are one of those required things that make XML life overly complex.

    The right syntax to use would therefore be something like this:

    //*[@id="JSID_cwCompanyNews"]/h:div/h:div/h:div/h:div[1]/h:ul/h:li[1]/h:span/h:span[1]/h:span[2]/h:a/text()

    Note the h: in front of every element; this allows RapidMiner to parse correctly, as it now knows it's dealing with HTML. I also added text() at the end, and now it returns

    アナリストが予想する22年3月期の業績急改善企業

    as the Title attribute. In order to get this you need to convert your document to an example set.
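
    As a rough sketch (untested), the same fix applied to the Wikipedia query from your first process would look something like:

    //*[@id="mw-content-text"]/h:div[1]/h:table[2]/h:tbody/h:tr[2]/h:td[2]/h:i/text()

    Only the element steps get the h: prefix; the wildcard * and attribute tests like @id stay as they are.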

    Note that you could bypass the namespace problem if you were working with properly parsed XHTML; then you could disable the 'assume html' option and work with XPath 'the easy way'. Your website isn't proper XHTML, so to get there you would first have to use the HTML to XML converter to ensure it's parseable, and then remove the namespaces. You could do this with a regex that simply replaces everything up to the <html> tag, since that's where the namespaces are declared.

    Something like (?s)^.*?<html.*?> replaced with <html>. Now there are no more namespaces, so you can use the 'standard' notation. Google Docs does this for you behind the scenes, which is fine for HTML but makes it unusable for any other XML, and that's where RapidMiner offers more options.
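
    For illustration (what the converted document actually starts with will vary), that replacement would turn something like

    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head>...

    into

    <html><head>...

    after which the plain //div/table/... style paths work without any prefixes.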

Answers

  • kayman Member Posts: 662 Unicorn
    You didn't set any crawling rules as far as I can see, which is why it isn't doing anything.
    Since you only need to get one page, you might be better off using the Get Page operator followed by Extract Information to apply your XPath.
  • RionArisu Member Posts: 13 Contributor I
    Thanks, I will try to figure out the crawling rules.
    I have also tried using Get Page followed by Extract Information, but that doesn't seem to return any results either.

    My XML:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="9.3.001" expanded="true" height="68" name="Get Page" width="90" x="179" y="187">
            <parameter key="url" value="https://www.nikkei.com/nkd/company/news/?scode=7203&amp;ba=1&amp;DisplayType=1"/>
            <parameter key="random_user_agent" value="false"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="none"/>
            <parameter key="cookie_scope" value="global"/>
            <parameter key="request_method" value="GET"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <parameter key="override_encoding" value="false"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="keep_sensitive_headers" value="false"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="9.3.001" expanded="true" height="68" name="Extract Information (2)" width="90" x="313" y="187">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Title" value="//*[@id=&amp;quot;JSID_cwCompanyNews&quot;]/div/div/div/div[1]/ul/li[1]/span/span[1]/span[2]/a"/>
            </list>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Extract Information (2)" to_port="document"/>
          <connect from_op="Extract Information (2)" from_port="document" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    


  • RionArisu Member Posts: 13 Contributor I
    Thanks a lot, that solved my issue!