RapidMiner

Problems with Xpath queries

Contributor II

I'm trying to crawl the Dell website and extract information about their laptops.

 

I have 2 problems:

 

1:

I'm not having any success trying to extract the Processor info in RapidMiner. 

 

This is an example laptop page I need to extract from: http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop

 

I'm trying to get the first Processor data (7th Generation AMD A9-9400 Processor with Radeon™ R5 Graphics). 

 

I figured out the correct XPATH query in Google Chrome to extract it, but I can't get it to work in RapidMiner.

 

In the Chrome console, this finds it: $x("string(//span[contains(.,'Processor')]/../../../../following-sibling::div/div/div/div/div/span)")

 

I have tried this: string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)

and others, with no success in RapidMiner5 or RapidMiner7. 

 

Does anyone know what is wrong with my XPATH query syntax for RapidMiner?
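
For readers comparing the two queries above: the only difference between the Chrome version and the RapidMiner version is the h: prefix on each element name. Below is a minimal sketch of why such a prefix can matter. It assumes (this is not confirmed anywhere in this thread) that the page is handled as XHTML, so that every element lives in the http://www.w3.org/1999/xhtml namespace. It uses Python's standard xml.etree.ElementTree on a made-up fragment, not the real Dell markup, and the names XHTML and NS are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified XHTML fragment; NOT the real Dell page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><span>Processor</span></div>
    <div><span>7th Generation AMD A9-9400 Processor</span></div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Assumed prefix-to-namespace binding, mirroring the h: prefix in the query.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# An unprefixed name only matches elements that are in no namespace.
unprefixed = root.findall(".//span")

# A prefixed name matches elements in the bound namespace.
prefixed = root.findall(".//h:span", NS)

print(len(unprefixed), len(prefixed))
print(prefixed[1].text)
```

If the parsed document really is namespaced, an unprefixed step such as span matches nothing while h:span matches, which is why a query copied straight from the browser may need prefixes. A wildcard step such as //* is not affected, because * matches elements in any namespace.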

 

2:

The XPATH queries: normalize-space(//*[@id='sharedPdPageProductTitle']/text()) 

normalize-space(//*[@id='starting-price']/text())

both work in RapidMiner5 but not in RapidMiner7. 

Is there something different with the XPATH syntax between RapidMiner5 and RapidMiner7? 
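
One property of these two queries that is worth keeping in mind when comparing results: in XPath 1.0, normalize-space() applied to a node-set uses only the first node in document order, so normalize-space(//*[@id='x']/text()) looks at the first text node child of the element and nothing else. The sketch below emulates that rule in plain Python on a made-up fragment. The ids title and price, the helper names, and the markup are all hypothetical, and this is not a diagnosis of the Dell page or of either RapidMiner version.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: one value sits directly in the element,
# the other is wrapped in a child element.
DOC = """<div>
  <p id="title">  Inspiron 15
     5000 </p>
  <p id="price">
    <span>$379.99</span>
  </p>
</div>"""

def first_text_node(elem):
    """Return the first text node child of elem, or '' if there is none."""
    # Text node children are elem.text, then the tail of each child element.
    for text in [elem.text] + [child.tail for child in elem]:
        if text:
            return text
    return ""

def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(re.split(r"[ \t\r\n]+", s.strip(" \t\r\n")))

root = ET.fromstring(DOC)
title = root.find(".//*[@id='title']")
price = root.find(".//*[@id='price']")

# Emulates normalize-space(//*[@id='...']/text()).
title_first = normalize_space(first_text_node(title))
price_first = normalize_space(first_text_node(price))

# Emulates normalize-space(//*[@id='price']), i.e. the full string value.
price_full = normalize_space("".join(price.itertext()))

print(repr(title_first), repr(price_first), repr(price_full))
```

In this fragment the first text node of the price element is only whitespace, so the /text() form yields an empty string while the string value of the whole element yields the price. If the markup delivered to the two versions differs in this way, the same query could give different results without any change in XPath syntax.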

 

 

Here are my Processes in XML form:

 

RapidMiner5:

 

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="75">
<parameter key="url" value="http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*/?productdetails/.*"/>
<parameter key="store_with_matching_url" value=".*/?productdetails/.*"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="max_pages" value="1000"/>
<parameter key="domain" value="server"/>
<parameter key="delay" value="2000"/>
<parameter key="max_threads" value="4"/>
<parameter key="max_page_size" value="10000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="30">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="514" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Name" value="normalize-space(//*[@id='sharedPdPageProductTitle']/text())"/>
<parameter key="Unit Purchase Price" value="normalize-space(//*[@id='starting-price']/text())"/>
<parameter key="Processor" value="string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

 

 

RapidMiner7:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="85">
<parameter key="url" value="http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*/?productdetails/.*"/>
<parameter key="store_with_matching_url" value=".*/?productdetails/.*"/>
</list>
<parameter key="max_crawl_depth" value="2"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="max_pages" value="1000"/>
<parameter key="max_page_size" value="10000"/>
<parameter key="delay" value="2000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="85">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Name" value="normalize-space(//*[@id='sharedPdPageProductTitle']/text())"/>
<parameter key="Unit Purchase Price" value="normalize-space(//*[@id='starting-price']/text())"/>
<parameter key="Processor" value="string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

5 REPLIES
Elite III

Re: Problems with Xpath queries

I have actually had a similar problem where older XPath queries I created stopped working. I assumed it was because something on the web page had changed, and I didn't bother to track it down. Based on this post, I now wonder whether it was instead caused by a change in the implementation of XPath in RapidMiner. Hopefully one of the developers can provide some insight on this topic.

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Elite III

Re: Problems with Xpath queries

@Thomas_Ott any chance we could ask one of the developers to take a look at this?   Thanks.

Brian T., Lindon Ventures - www.lindonventures.com
Moderator

Re: Problems with Xpath queries

Sure. I pinged 'em. 

Contributor II

Re: Problems with Xpath queries

Any response about this issue yet? Thank you for reaching out to a developer about this!

RMStaff

Re: Problems with Xpath queries

Hi Trevor,

 

Sorry for the delay. I just got confirmation that no changes have been made to the XPath implementation.

 

Nevertheless, I would like to investigate this issue thoroughly and try to find a solution.

 

Best,

Edin