[Solved] XPath queries are empty
Legacy User
Member Posts: 0 Newbie
Hi there, I am trying to extract text from http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html using the Get Page and Process Documents operators, with Extract Information as the subprocess.
The query result, however, is empty no matter what I try. Does anyone have an idea?
Here is the process code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
        <parameter key="url" value="http://www.tripadvisor.com/ShowTopic-g29220-i86-k1487815-Alamo-Maui_Hawaii.html"/>
        <parameter key="random_user_agent" value="true"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="380" y="30">
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="xpath1" value="//div[@class='postBody']"/>
              <parameter key="xpath2" value="//div[@class='postBody']/text()"/>
              <parameter key="xpath3" value="//div[@class='postBody']/p[not(*)][text()]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Thank you very much in advance. ;D
Answers
My problem seems to be quite similar to the one discussed here: http://rapid-i.com/rapidforum/index.php/topic,7753.0.html, but I just can't get it working for me.
So change
//div[@class='postBody']
to
//h:div[@class='postBody']
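A likely explanation for why the prefix is needed, sketched with a made-up page fragment (the markup below is illustrative, not the actual TripAdvisor source): when the page is parsed as XHTML, every element ends up in the http://www.w3.org/1999/xhtml namespace. In XPath 1.0 an unprefixed name such as div only matches elements that are in no namespace, so //div[...] finds nothing. Assuming the operator binds the prefix h to the XHTML namespace for HTML input, //h:div[...] does match.

```xml
<!-- Illustrative fragment only; not the real page source. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <!-- This div is in the XHTML namespace because of the xmlns declaration above. -->
    <!-- //div[@class='postBody']   matches nothing (it looks for a div in no namespace) -->
    <!-- //h:div[@class='postBody'] matches, if h is bound to http://www.w3.org/1999/xhtml -->
    <div class="postBody">
      <p>Some post text</p>
    </div>
  </body>
</html>
```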
It seems like I am getting closer to my goal.
Now I think only my XPath query is not quite right yet.
With the query: //h:div[@class='postBody'][not(contains(.,'http://www.'))]
I get the following output: This is already a very good result. But how do I get rid of the last bits of HTML tags? And why exactly do I have to add the namespace prefix?
The XML now is:
Again, thank you very much for your help.
Thank you for your help.
The XPath query has to be: string(//h:div[@class='postBody'][not(contains(.,'http://www.'))])
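For reference, a sketch of how this query might sit in the Extract Information operator from the first post. Only the xpath_queries list is shown; the key name xpath1 is taken from the original process, and everything else is assumed to be unchanged. In XPath 1.0, string() applied to a node-set returns the string-value of the first node in document order, i.e. the concatenated text of all its descendants, which is why the remaining tags disappear.

```xml
<list key="xpath_queries">
  <!-- string(...) yields the text content of the first matching div, without markup -->
  <parameter key="xpath1" value="string(//h:div[@class='postBody'][not(contains(.,'http://www.'))])"/>
</list>
```

One consequence of that behaviour: only the first matching post body is returned, so extracting every post would presumably need a different approach.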