basic xpath problem

Kintaro Member Posts: 7 Contributor II
edited November 2018 in Help

I'm trying to extract data from an HTML page with XPath.

I have:
Create Document => Extract Information

Create Document:
<html>
<head>
<title>TITLE</title>
</head>
<body>BODY</body>
</html>

Extract Information, configured with:
query type: xpath
attribute type: nominal
xpath queries: //title
namespace: n/a
ignore CDATA: true
assume html: true

attribute name: ?

What am I doing wrong?  >:(


    Kintaro Member Posts: 7 Contributor II
    I'm asking because the same query works without any problem in an online XPath tester, so I don't know why it fails in RapidMiner.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.3.000">
      <operator activated="true" class="process" compatibility="6.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.1.000" expanded="true" height="60" name="Create Document" width="90" x="112" y="255">
            <parameter key="text" value="&lt;html&gt;&#10;&lt;head&gt;&#10;&lt;title&gt;TITLE&lt;/title&gt;&#10;&lt;/head&gt;&#10;&lt;body&gt;BODY&lt;/body&gt;&#10;&lt;/html&gt;"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="6.1.000" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries">
              <parameter key="nome" value="&lt;title&gt;.&lt;/title&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="nome" value="//title"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Kintaro Member Posts: 7 Contributor II

    I can't use a path like this; I have to use, for example:


    text() to extract only the text from the title tag

    and I have to use the h: prefix because it's HTML, right?
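
    For illustration, here is a minimal sketch of how the xpath_queries list from the process above might look with both changes applied. It assumes that Extract Information binds the h: prefix to the XHTML namespace when assume html is enabled, which should be checked against the operator's help text. The attribute name nome is taken from the process above; the query value itself is an assumption, not something confirmed in this thread.

    ```xml
    <!-- Sketch only: assumes the "h" prefix is bound to the XHTML namespace
         when "assume html" is true -->
    <list key="xpath_queries">
      <!-- text() selects the text node inside the title element
           rather than the element itself -->
      <parameter key="nome" value="//h:title/text()"/>
    </list>
    ```

    If that assumption holds, the unprefixed //title would match nothing because the parsed elements live in a namespace, which would fit the ? shown for the attribute.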