basic xpath problem

Kintaro Member Posts: 7 Contributor II
edited November 2018 in Help
Hello,

I'm trying to extract data with XPath from an HTML page.

I have:
Create Document => Extract Information

Create Document:

<html>
<head>
<title>TITLE</title>
</head>
<body>BODY</body>
</html>
Extract Information configured with:
query type: xpath
attribute type: nominal
xpath queries: //title
namespace: n/a
ignore CDATA: true
assume html: true

Result:
attribute name: ?

What am I doing wrong?  >:(

Answers

  • Kintaro Member Posts: 7 Contributor II
    I'm asking because when I try the same query in an online XPath tester it works without any problem, so I don't know why RapidMiner doesn't. Here is my process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="6.1.000" expanded="true" height="60" name="Create Document" width="90" x="112" y="255">
            <parameter key="text" value="&lt;html&gt;&#10;&lt;head&gt;&#10;&lt;title&gt;TITLE&lt;/title&gt;&#10;&lt;/head&gt;&#10;&lt;body&gt;BODY&lt;/body&gt;&#10;&lt;/html&gt;"/>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="6.1.000" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries">
              <parameter key="nome" value="&lt;title&gt;.&lt;/title&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="nome" value="//title"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Kintaro Member Posts: 7 Contributor II
    Solved

    I can't use a plain path like //title. I have to use, for example:

    //h:title/text()

    text() extracts only the text from the title tag,

    and I have to use the h: prefix because the document is HTML, right?
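
The behaviour above is what you would expect if the HTML is turned into XHTML before the query runs: every element then lives in the XHTML namespace, so an unprefixed //title matches nothing. That RapidMiner binds the prefix h to this namespace when "assume html" is enabled is an assumption drawn from the working query in this thread, not something verified here. The effect itself can be reproduced outside RapidMiner; the following is a minimal sketch using Python's standard library, where the xmlns declaration is added by hand to imitate the conversion and ElementTree's .text stands in for text(), which ElementTree does not support.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace URI; binding it to the prefix "h" mirrors the
# query from the thread (assumed to be what RapidMiner does internally).
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same document as in the question, with the XHTML namespace declared
# by hand to imitate an HTML-to-XHTML conversion.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>TITLE</title>
</head>
<body>BODY</body>
</html>"""

root = ET.fromstring(doc)

# An unprefixed name only matches elements in no namespace: no result.
unprefixed = root.findall(".//title")

# With "h" bound to the XHTML namespace, the element is found.
prefixed = root.findall(".//h:title", {"h": XHTML_NS})

# Rough equivalent of /text(): take the text content of each match.
titles = [el.text for el in prefixed]

print(unprefixed)
print(titles)
```

The prefix name is arbitrary in this sketch; what matters is that it maps to the XHTML namespace URI. Online XPath testers usually parse the input without adding a namespace, which would explain why //title works there.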