Problem with Xpath query? Processing documents from web

pix123 Member Posts: 27 Contributor I
edited August 10 in Help
Hi there,

I am trying to extract documents from a film review site. When I run the process below I get 0 results and can't figure out why. Can anyone help? Thanks.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="313" y="238">
        <parameter key="number_of_iterations" value="10"/>
        <process expanded="true">
          <operator activated="true" class="web:process_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Process Documents from Web" width="90" x="179" y="85">
            <parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/"/>
            <list key="crawling_rules"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="246" y="34">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="seg" value="//h:table[@class='table table-striped']/h:tr"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
                <process expanded="true">
                  <connect from_port="segment" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="text" value="//h:p/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Web" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>





Answers

  • kayman Member Posts: 368   Unicorn
    Are you sure about the page or the path you used?

    The reviews are not in a table but in a div. The query looks for a table that does not exist: 'table-striped' cannot be found in the page source.

    This is how a review is stored, in a div with class 'the_review':

    <div class="the_review">
    This is a lovely, funny, wonderfully acted film. The big problem is, it's an 80-minute movie that takes two hours. By the time you get to the real story, you're out of gas.
    </div>

    So try:
    <parameter key="seg" value="//h:div[@class='the_review']"/>

    It's untested, so don't take it for granted :-)

    What could have happened is that you tested the site during an A/B test, or that the page code differs depending on the user agent RapidMiner sends.
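
The difference between the two queries can be checked outside RapidMiner. Below is a minimal Python sketch, assuming the `h` prefix stands for the XHTML namespace and using a made-up, heavily simplified page (the `row` class is hypothetical; only `the_review` comes from the post above). It uses the standard library's ElementTree, which supports only a subset of XPath, so it illustrates the idea rather than reproducing what RapidMiner does internally.

```python
import xml.etree.ElementTree as ET

# Assumption: "h" is bound to the XHTML namespace, as in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical, simplified page; the real markup is much larger.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="row">
      <div class="the_review">A lovely, funny film.</div>
    </div>
    <div class="row">
      <div class="the_review">Runs out of gas.</div>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# The original query: table rows that are not present in this markup.
table_rows = root.findall(".//h:table[@class='table table-striped']/h:tr", NS)

# The suggested query: one div per review.
review_divs = root.findall(".//h:div[@class='the_review']", NS)
review_texts = [div.text.strip() for div in review_divs]

print(len(table_rows))
print(review_texts)
```

On this sample the table query matches nothing, while the div query returns one element per review, which is consistent with the 0 results described in the question.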

  • pix123 Member Posts: 27 Contributor I
    edited December 2018
    @kayman Thank you, this helped a lot. It had been a while since I had worked with the process.

    I've now got most of the XPath attributes up and running but cannot retrieve the score for each review. The Score attribute shows a question mark when I run the process; all other attributes are OK. Any ideas?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="179" y="34">
            <parameter key="number_of_iterations" value="10"/>
            <process expanded="true">
              <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
                <parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"/>
                <list key="query_parameters"/>
                <list key="request_properties"/>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
                <process expanded="true">
                  <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="85">
                    <parameter key="query_type" value="XPath"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries"/>
                    <list key="regular_region_queries"/>
                    <list key="xpath_queries">
                      <parameter key="Review" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div[1]/text() "/>
                      <parameter key="Date Posted" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[1]/text()"/>
                      <parameter key="Publisher" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[1]/h:div[3]/h:a[2]/h:em/text()"/>
                      <parameter key="Score" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[2]/h:div[2]/h:div[2]/h:div[2]/text"/>
                      <parameter key="Critic Name" value="/h:html/h:body/h:div[5]/h:div[4]/h:div[2]/h:section/h:div/h:div/h:div[2]/h:div[4]/h:div[1]/h:div[1]/h:div[3]/h:a[1]/text() "/>
                    </list>
                    <list key="namespaces"/>
                    <list key="index_queries"/>
                    <list key="jsonpath_queries"/>
                  </operator>
                  <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="447" y="85">
                    <parameter key="query_type" value="XPath"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries"/>
                    <list key="regular_region_queries"/>
                    <list key="xpath_queries">
                      <parameter key="Feedback_text" value="//h:div[@class='the_review']/text()"/>
                    </list>
                    <list key="namespaces"/>
                    <list key="index_queries"/>
                    <list key="jsonpath_queries"/>
                    <process expanded="true">
                      <connect from_port="segment" to_port="document 1"/>
                      <portSpacing port="source_segment" spacing="0"/>
                      <portSpacing port="sink_document 1" spacing="0"/>
                      <portSpacing port="sink_document 2" spacing="0"/>
                    </process>
                  </operator>
                  <connect from_port="document" to_op="Extract Information" to_port="document"/>
                  <connect from_op="Extract Information" from_port="document" to_op="Cut Document" to_port="document"/>
                  <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="example set" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="store" compatibility="9.0.003" expanded="true" height="68" name="Store" width="90" x="1251" y="85">
            <parameter key="repository_entry" value="New Output of Web Pages/RT Reviews"/>
          </operator>
          <connect from_op="Loop" from_port="output 1" to_op="Store" to_port="input"/>
          <connect from_op="Store" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
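
One detail that stands out in the process above: the Score query ends in `/text`, while every other query ends in `/text()`. In XPath, a final step of `text` is a name test for a child element called `<text>`, not a request for the text node, so it may simply match nothing, which could explain the question mark. The Python sketch below illustrates this on a made-up review row (the `row` and `score` class names and the score string are hypothetical). ElementTree does not implement `text()`, so `.text` stands in for it here.

```python
import xml.etree.ElementTree as ET

# Assumption: "h" is bound to the XHTML namespace, as in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical review row; only 'the_review' is taken from the thread.
ROW = """<div xmlns="http://www.w3.org/1999/xhtml" class="row">
  <div class="the_review">Runs out of gas.</div>
  <div class="score">Original Score: 3/4</div>
</div>"""

row = ET.fromstring(ROW)

# A final step of "text" asks for a child element named <text>.
# No such element exists, so nothing is found.
as_element = row.find("h:div[@class='score']/text", NS)

# What text() selects in full XPath is the div's text content.
# ElementTree has no text() step, so the .text attribute is read instead.
score_div = row.find("h:div[@class='score']", NS)
as_text = score_div.text

print(as_element)
print(as_text)
```

If this is indeed the cause, changing the end of the Score query to `/text()` would be the first thing to try; that is untested against the live page.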






  • pix123 Member Posts: 27 Contributor I
    @kayman Thank you, that has been really helpful.
  • cbaslan Member Posts: 6 Contributor II
    I really wonder: can't I apply Cut Document (using XPath) directly to the output of Get Page? Why do I need all the hassle? I was trying to apply an XPath selector to a web page and couldn't get it working until I tried your solution, but I want to understand the logic. Should I convert all HTML files to XML first? Isn't there a way to select part of a web page directly?
  • kayman Member Posts: 368   Unicorn
    edited July 23
    XPath expects proper XML to work with. If your source (the Get Page output) is proper XHTML and doesn't use too many namespaces, you can do this directly. In reality, though, most websites take a very loose approach to XHTML and have doctypes in all the wrong places, so it is always safer to do some cleaning in advance. From experience, only a small number of websites have truly valid XML in their source.

    Now, if you are pretty familiar with XPath and XSLT, I'd suggest using the Process XSLT operator instead. Just insert your XSLT (v1.0) in a document and convert your page any way you like, like a pro...
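
The point about XPath needing proper XML can be illustrated with a strict XML parser. This is a minimal Python sketch with two made-up snippets: typical "tag soup" HTML, and the same content written as well-formed XHTML. The helper name `is_well_formed` is hypothetical, and the check only mirrors the idea behind cleaning a page first, not RapidMiner's actual behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: loose HTML vs. the same content as well-formed XHTML.
TAG_SOUP = "<div class='the_review'>Great film<br><p>Too long</div>"
XHTML = "<div class='the_review'>Great film<br/><p>Too long</p></div>"


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


soup_ok = is_well_formed(TAG_SOUP)
xhtml_ok = is_well_formed(XHTML)

print(soup_ok)
print(xhtml_ok)
```

The unclosed `<br>` and `<p>` tags are fine for a browser but make the first snippet unparseable as XML, so there is no tree for an XPath query to run against until the markup has been cleaned up.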