Extracting Information With XPath
Hello,
I am having trouble getting a value from an HTML document using XPath. The resulting value is "?". This is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
<process expanded="true" height="505" width="415">
<operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
<parameter key="text" value="&lt;html&gt; &lt;head&gt; &lt;title&gt;hello&lt;/title&gt; &lt;/head&gt; &lt;body&gt; &lt;div class=&quot;class1&quot;&gt;goodbye&lt;/div&gt; &lt;/body&gt; &lt;/html&gt;"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="5.0.6" expanded="true" height="60" name="Extract Information" width="90" x="179" y="165">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="some_value" value="/html/head/title"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
What am I doing wrong?
It works when I change the XPath to this:
/h:html/h:head/h:title/text()
Is there a way to get rid of that "h:"?
I suspect it has something to do with the namespace.
Thanks
Neil
Answers
You almost answered your question yourself. The different behaviour (with or without "h:") does indeed depend on the namespace. The "Extract Information" operator has an expert parameter, "assume html", which makes parsing more tolerant of element nesting than strict XML would be. The parser "repairs" the document by adding missing tags and producing valid XML-like code. At the same time, the HTML elements are bound to the (X)HTML namespace, and that namespace is assigned the identifier "h". Compare the operator documentation on the namespaces parameter: "Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h."
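The effect of such a namespace binding can be reproduced outside RapidMiner. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, not RapidMiner's own parser. The sample document and the helper name `find_text` are made up for illustration, and it assumes the repaired document ends up in the XHTML namespace, as the operator documentation suggests.

```python
import xml.etree.ElementTree as ET

# Namespace URI for XHTML; "h" below is just the identifier bound to it.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative document whose elements all live in the XHTML namespace.
NAMESPACED_DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>hello</title></head>"
    '<body><div class="class1">goodbye</div></body>'
    "</html>"
)


def find_text(xml_source, path, namespaces=None):
    """Return the text of the first element matching path, or None."""
    root = ET.fromstring(xml_source)
    node = root.find(path, namespaces)
    return None if node is None else node.text


# Paths are relative to the root <html> element.
# Unprefixed names refer to elements in no namespace, so nothing matches.
without_prefix = find_text(NAMESPACED_DOC, "head/title")

# With "h" bound to the XHTML namespace, the prefixed path matches.
with_prefix = find_text(NAMESPACED_DOC, "h:head/h:title", {"h": XHTML_NS})

print(without_prefix)  # None
print(with_prefix)     # hello
```

An unprefixed element name in an XPath query only matches elements that are in no namespace, which is consistent with the "?" result in the original process.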
If you uncheck the "assume html" parameter, no namespace is bound automatically and your process works as you posted it above. If you like, you can define your own namespaces and identifiers in the namespaces parameter list. For plain XML-like code with custom elements, you don't need to define a namespace if you want to accept all elements without checking them against one.
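Both options can be illustrated with a small sketch in Python's standard `xml.etree.ElementTree`. This is an analogy for the general XPath behaviour rather than a description of RapidMiner internals. The two sample documents and the identifier "x" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Document without any namespace, comparable to parsing the text as plain XML.
PLAIN_DOC = "<html><head><title>hello</title></head></html>"

# The same document with its elements bound to the XHTML namespace.
NAMESPACED_DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>hello</title></head>"
    "</html>"
)

# Case 1: no namespace on the elements, so the unprefixed path matches.
plain_root = ET.fromstring(PLAIN_DOC)
plain_title = plain_root.find("head/title").text

# Case 2: a self-chosen identifier ("x" instead of "h") bound to the namespace.
# Only the namespace URI matters; the identifier itself is arbitrary.
custom_namespaces = {"x": "http://www.w3.org/1999/xhtml"}
namespaced_root = ET.fromstring(NAMESPACED_DOC)
custom_title = namespaced_root.find("x:head/x:title", custom_namespaces).text

print(plain_title)   # hello
print(custom_title)  # hello
```

In both cases the query and the document agree on whether the elements carry a namespace, which is the condition for a match.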
Regards,
Matthias