
[Solved] XPath queries are empty

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Hi there, I am trying to extract text from http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html using the Get Page and Process Documents operators, with Extract Information as the subprocess.

The query result, however, is empty no matter what I try. Does anyone have an idea?

Here is the process code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
       <parameter key="url" value="http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html"/>
       <parameter key="random_user_agent" value="true"/>
       <list key="query_parameters"/>
       <list key="request_properties"/>
     </operator>
     <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="380" y="30">
       <process expanded="true">
         <operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries"/>
           <list key="regular_expression_queries"/>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="xpath1" value="//div[@class='postBody']"/>
             <parameter key="xpath2" value="//div[@class='postBody']/text()"/>
             <parameter key="xpath3" value="//div[@class='postBody']/p[not(*)][text()]"/>
           </list>
           <list key="namespaces"/>
           <list key="index_queries"/>
         </operator>
         <connect from_port="document" to_op="Extract Information" to_port="document"/>
         <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
     <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Thank you very much in advance.  ;D

Answers

  • Legacy User Member Posts: 0 Newbie
    Does anyone have an idea, please? I have the feeling I am very close to the solution, but I am missing something.

    My problem seems to be quite similar to the one discussed here: http://rapid-i.com/rapidforum/index.php/topic,7753.0.html but I just can't get it working for me.

    ???
  • fras Member Posts: 93 Contributor II
    XPath expressions in RapidMiner need an additional namespace prefix, "h:".
    So change
    //div[@class='postBody']
    to
    //h:div[@class='postBody']
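
The reason for the prefix can be illustrated outside RapidMiner. The sketch below uses Python's standard-library `xml.etree.ElementTree`, not RapidMiner's own XPath engine, so it is only an analogy. It assumes a small hand-written XHTML snippet that declares the default namespace `http://www.w3.org/1999/xhtml`: an unprefixed `div` then matches nothing, while a prefix bound to that namespace does. The prefix name `h` mirrors the one used above; the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XHTML snippet; the real page is much larger.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="postBody"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Bind the prefix "h" to the XHTML namespace (the prefix name itself is arbitrary).
NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}

# Without a prefix the query looks for "div" in no namespace and finds nothing.
unprefixed = root.findall(".//div[@class='postBody']")

# With the prefix the same query matches the namespaced element.
prefixed = root.findall(".//h:div[@class='postBody']", NAMESPACES)

print(len(unprefixed), len(prefixed))
```

Every element of an XHTML document lives in the XHTML namespace, so each element step in a query needs the prefix, not just the first one.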
  • Legacy User Member Posts: 0 Newbie
    Thank you very much for your reply.

    It seems like I am getting closer to my goal.
    Now I think the only remaining problem is that my XPath query is not quite right.

    With the query: //h:div[@class='postBody'][not(contains(.,'http://www.'))]

    I get the following output:
    <div xmlns="http://www.w3.org/1999/xhtml" class="postBody">
      <div id="pst_adm_9020974" />
      <div id="top_adm_1487815" />
      <div id="usr_adm_tgienger" />
      <p>Just a quick comment on Alamo car rentals. Was just out there last week and had a convertible from Alamo. Got it thru priceline and was really concerned, based on all the "bad press" that Alamo has gotten on here. To my surprise, had zero problems with Alamo. Got a nearly-new Sebring conv, with 3000 miles on it.</p>
      <p />
      <p />
      <p />
      <p>Dreaded the waiting-in-line, but had no problems there either. In and out in short-order. Probably less than 10 minutes either day.</p>
      <p />
      <p />
      <p />
      <p>I did have a problem with the Sebring, but it had nothing to do with Alamo. Seems that Chrysler, in their infinite wisdom, decided to have the conv top take up space in the trunk. That works fine when the trunk is empty. But one night we forgot the beach chairs in the trunk. And when we put the top down the next morning, it shattered the back glass! Seems to me that Chrysler could have done a better job designing the conv!</p>
      <p />
      <p />
      <p />
      <p>Still waiting to hear back from Alamo what that's going to cost me (and my insurance)...</p>
    </div>
    This is already a very good result. But how do I get rid of the remaining HTML tags? And why exactly do I have to add the namespace prefix?


    The XML now is:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
            <parameter key="url" value="http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html"/>
            <parameter key="random_user_agent" value="true"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="76" name="Multiply" width="90" x="179" y="75"/>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents (2)" width="90" x="380" y="75">
            <process expanded="true">
              <operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information (2)" width="90" x="380" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries">
                  <parameter key="extract" value="&lt;p&gt;+.&lt;/p&gt;"/>
                </list>
                <list key="xpath_queries">
                  <parameter key="xpath1" value="//h:div[@class='postBody'][not(contains(.,'http://www.'))]"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
              <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents (2)" to_port="documents 1"/>
          <connect from_op="Process Documents (2)" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Again, thank you very much for your help.
  • Legacy User Member Posts: 0 Newbie
    I just found the solution! :)

    Thank you for your help.

    The XPath query has to be: string(//h:div[@class='postBody'][not(contains(.,'http://www.'))])
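
Wrapping the query in `string()` works because, in XPath 1.0, `string()` applied to a node-set returns the string value of its first node, which for an element is the concatenated text of all its descendants, so the tags disappear. The following is a rough Python sketch of the same idea, not what RapidMiner runs internally. `xml.etree.ElementTree` implements neither `string()` nor `contains()`, so both are emulated in plain Python; the helper name `string_value` and the shortened markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical markup, shortened from the output quoted earlier in the thread.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="postBody">
  <div id="pst_adm_9020974" />
  <p>Just a quick comment on Alamo car rentals.</p>
  <p />
  <p>Dreaded the waiting-in-line, but had no problems there either.</p>
</div>
</body></html>"""


def string_value(element):
    """Emulate XPath string(): concatenate the text of all descendant nodes."""
    return "".join(element.itertext())


root = ET.fromstring(XHTML)
posts = root.findall(".//h:div[@class='postBody']", NAMESPACES)

# Emulate the predicate [not(contains(., 'http://www.'))] in plain Python.
matching = [p for p in posts if "http://www." not in string_value(p)]

# string() only looks at the first node of the node-set.
text = string_value(matching[0]) if matching else ""

# Collapse the indentation whitespace left over from the markup.
clean = " ".join(text.split())
print(clean)
```

Since `string()` only uses the first matching node, a page with several `postBody` blocks would yield the text of the first one only.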