Extracting Information With XPath

el_chief · July 2010

Hello,

I am having trouble getting a value from an HTML document using XPath. This is my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="505" width="415">
      <operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
        <parameter key="text" value="&lt;html&gt;&#13;&#10;&lt;head&gt;&#13;&#10;&lt;title&gt;hello&lt;/title&gt;&#13;&#10;&lt;/head&gt;&#13;&#10;&lt;body&gt;&#13;&#10;&lt;div class=&quot;class1&quot;&gt;goodbye&lt;/div&gt;&#13;&#10;&lt;/body&gt;&#13;&#10;&lt;/html&gt;"/>
      </operator>
      <operator activated="true" class="text:extract_information" compatibility="5.0.6" expanded="true" height="60" name="Extract Information" width="90" x="179" y="165">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="some_value" value="/html/head/title"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
      <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The resulting value is "?".

What am I doing wrong?

It works when I change the XPath to this:

/h:html/h:head/h:title/text()

Is there a way to get rid of that "h:"?

Something to do with the namespace, I suspect.

Thanks

Neil

colo · July 2010

Hi el chief,

you almost answered your question yourself. The difference in behaviour (with or without "h:") does indeed come down to the namespace. The "Extract Information" operator has an expert parameter, "assume html". When it is enabled, the parser is more tolerant of element nesting than strict XML would be: it "repairs" the document by adding missing tags, producing valid XML-like code. At the same time, the HTML elements are bound to the (X)HTML namespace, and the identifier "h" is assigned to it. Compare the operator documentation on namespaces: "Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h."
If you uncheck the "assume html" parameter, no namespace is bound automatically, and your process works as you posted it above. If you like, you can define your own namespaces and identifiers in the "namespaces" parameter list. For plain XML-like code with custom elements, you don't need to define a namespace, as long as you want to accept all elements without checking them against a namespace.
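
To see the same namespace effect outside RapidMiner, here is a minimal sketch using Python's standard xml.etree.ElementTree. This is not the parser RapidMiner uses. It only assumes that the repaired document ends up in the standard XHTML namespace (http://www.w3.org/1999/xhtml), and the helper title_text is a made-up name for illustration. Note that ElementTree paths are relative to the root element, so "head/title" here plays the role of /html/head/title.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard XHTML namespace is what gets bound to "h".
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same document twice: once inside the XHTML namespace, once without any.
namespaced = ET.fromstring(
    f'<html xmlns="{XHTML_NS}"><head><title>hello</title></head>'
    '<body><div class="class1">goodbye</div></body></html>'
)
plain = ET.fromstring(
    '<html><head><title>hello</title></head>'
    '<body><div class="class1">goodbye</div></body></html>'
)

# Prefix-to-namespace map, comparable to the "namespaces" parameter list.
ns = {"h": XHTML_NS}


def title_text(root, path, namespaces=None):
    """Hypothetical helper: return the text of the first match, or None."""
    node = root.find(path, namespaces)
    return None if node is None else node.text


# Unprefixed names do not match elements that live in a namespace.
unprefixed_on_namespaced = title_text(namespaced, "head/title")

# With the prefix bound, the query matches.
prefixed_on_namespaced = title_text(namespaced, "h:head/h:title", ns)

# Without any namespace on the document, the plain path works.
unprefixed_on_plain = title_text(plain, "head/title")

print(unprefixed_on_namespaced)  # None
print(prefixed_on_namespaced)    # hello
print(unprefixed_on_plain)       # hello
```

The first case corresponds to the "?" result with "assume html" checked, the second to the h: query that worked, and the third to what should happen once "assume html" is unchecked.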

Regards,
Matthias

el_chief · July 2010

Sounds good. Will try it without "assume html", and with no h: prefix.
