Extracting Information With XPath

el_chief Member Posts: 63 Contributor II
edited November 2018 in Help
Hello,

I am having trouble extracting a value from an HTML document using XPath. This is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
   <process expanded="true" height="505" width="415">
     <operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
       <parameter key="text" value="&lt;html&gt;&#13;&#10;&lt;head&gt;&#13;&#10;&lt;title&gt;hello&lt;/title&gt;&#13;&#10;&lt;/head&gt;&#13;&#10;&lt;body&gt;&#13;&#10;&lt;div class=&quot;class1&quot;&gt;goodbye&lt;/div&gt;&#13;&#10;&lt;/body&gt;&#13;&#10;&lt;/html&gt;"/>
     </operator>
     <operator activated="true" class="text:extract_information" compatibility="5.0.6" expanded="true" height="60" name="Extract Information" width="90" x="179" y="165">
       <parameter key="query_type" value="XPath"/>
       <list key="string_machting_queries"/>
       <list key="regular_expression_queries"/>
       <list key="regular_region_queries"/>
       <list key="xpath_queries">
         <parameter key="some_value" value="/html/head/title"/>
       </list>
       <list key="namespaces"/>
       <list key="index_queries"/>
     </operator>
     <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
     <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
The resulting value is "?".

What am I doing wrong?

It works when I change the XPath to this:
/h:html/h:head/h:title/text()
Is there a way to get rid of that "h:" prefix?

I suspect it has something to do with the namespace.

Thanks

Neil

Answers

  • colocolo Member Posts: 236 Maven
    Hi el chief,

    you almost answered your question yourself. The different behaviour (with or without "h:") does indeed depend on the namespace. The "Extract Information" operator has an expert parameter, "assume html", which makes parsing more tolerant of element nesting than strict XML allows. The parser "repairs" the document by adding missing tags, producing valid XML-like code. At the same time, the HTML elements are bound to the (X)HTML namespace, which is assigned the identifier "h". Compare the operator documentation on namespaces: "Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h."
    If you uncheck the "assume html" parameter, no namespace is bound automatically and your process works as you posted it above. If you like, you can define your own namespaces and identifiers in the "namespaces" parameter list. For plain XML-like code with custom elements you don't need to define a namespace, as long as you are happy to accept all elements without checking them against one.

    Regards,
    Matthias
  • el_chief Member Posts: 63 Contributor II
    Sounds good. I will try it without "assume html" and without the "h:" prefix.
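
The namespace effect discussed in this thread can be reproduced outside RapidMiner. The following is a minimal sketch using Python's standard xml.etree.ElementTree module as a stand-in; it is not the parser RapidMiner uses, and it supports only a subset of XPath, so the paths are written relative to the root element rather than as /html/head/title. The XHTML namespace URI and the helper name title_text are assumptions made for illustration. The sketch suggests why an unprefixed query finds nothing once the elements are bound to a namespace, and why it works again when no namespace is involved.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard XHTML namespace URI, bound here to the prefix "h"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same page as in the question, with every element bound to the XHTML
# namespace (roughly the situation when "assume html" is checked)
namespaced = ET.fromstring(
    f'<html xmlns="{XHTML_NS}">'
    "<head><title>hello</title></head>"
    '<body><div class="class1">goodbye</div></body>'
    "</html>"
)

# Same page without any namespace (roughly the situation when
# "assume html" is unchecked)
plain = ET.fromstring(
    "<html><head><title>hello</title></head>"
    '<body><div class="class1">goodbye</div></body></html>'
)


def title_text(root, path, namespaces=None):
    """Hypothetical helper: return the text of the first match, or None."""
    node = root.find(path, namespaces)
    return None if node is None else node.text


# Unprefixed names only match elements that are in no namespace
unprefixed = title_text(namespaced, "head/title")

# Prefixed names match once the prefix is mapped to the namespace URI
prefixed = title_text(namespaced, "h:head/h:title", {"h": XHTML_NS})

# Without a namespace on the document, the unprefixed query matches
plain_title = title_text(plain, "head/title")

print(unprefixed, prefixed, plain_title)
```

The prefix itself is arbitrary: any identifier works as long as it is mapped to the same namespace URI, which is what the "namespaces" parameter list mentioned in the answer lets you configure.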