Problems with XPath queries

trevor_davis Member Posts: 4 Contributor I
edited November 2018 in Help

I'm trying to crawl the Dell website and extract information about their laptops.

I have two problems:

1.

I can't extract the processor information in RapidMiner.

This is an example laptop page that I need to extract from: http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop

I'm trying to get the first processor entry (7th Generation AMD A9-9400 Processor with Radeon™ R5 Graphics).


I worked out an XPath query that extracts it in Google Chrome, but I can't get it to work in RapidMiner.

In the Chrome console, this finds it:

$x("string(//span[contains(.,'Processor')]/../../../../following-sibling::div/div/div/div/div/span)")

In RapidMiner I have tried this:

string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)

and other variants, with no success in either RapidMiner 5 or RapidMiner 7.

Does anyone know what is wrong with my XPath query syntax for RapidMiner?
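
One thing that can be checked outside RapidMiner is how the h: prefix behaves. In XPath 1.0, an unprefixed name such as span only matches elements in no namespace, so in a namespaced XHTML document it matches nothing, while h:span matches only if h is bound to the document's namespace. The sketch below shows this with Python's standard library. The markup is a made-up stand-in for the product page, and the binding of h to http://www.w3.org/1999/xhtml is an assumption (the namespaces list in the processes in this post is empty). ElementTree supports only a subset of XPath, so the full query with contains() and following-sibling is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a product page after
# conversion to XHTML; the real Dell markup is not reproduced here.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><span>Processor</span></div>
    <div><span>7th Generation AMD A9-9400 Processor</span></div>
  </body>
</html>"""

# Assumed binding: the h: prefix stands for the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Unprefixed names look for elements in no namespace, so nothing matches.
unprefixed = root.findall(".//span")

# Prefixed names match once the prefix is bound to the XHTML namespace.
prefixed = root.findall(".//h:span", NS)

print(len(unprefixed), len(prefixed))
print([s.text for s in prefixed])
```

If the prefixed form also returns nothing for the real page, the next thing to inspect would be the actual document structure that the query is evaluated against, since it may differ from what Chrome shows after scripts have run.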

 

2.

These XPath queries:

normalize-space(//*[@id='sharedPdPageProductTitle']/text())

normalize-space(//*[@id='starting-price']/text())

both work in RapidMiner 5 but not in RapidMiner 7.

Is there a difference in XPath syntax between RapidMiner 5 and RapidMiner 7?
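
One XPath 1.0 detail may be worth ruling out for both queries: when a string function such as normalize-space() is given a node-set, it uses only the first node in document order. If the element's first direct text node is just whitespace and the visible text sits in a child element, the /text() form yields an empty string. The sketch below imitates this in Python with the standard library. The nested span is a hypothetical structure, not the actual Dell markup, and normalize_space is an illustrative helper that only approximates the XPath function.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the id comes from the query above, but the nested
# span around the price is an assumption made for illustration.
SNIPPET = """<div id="starting-price">
  <span>$499.00</span>
</div>"""


def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join((s or "").split())


el = ET.fromstring(SNIPPET)

# Comparable to normalize-space(//*[@id='starting-price']/text()):
# only the first direct text node is used, and here it is whitespace.
first_text_node = normalize_space(el.text)

# Comparable to normalize-space(//*[@id='starting-price']):
# the string value concatenates the text of all descendants.
string_value = normalize_space("".join(el.itertext()))

print(repr(first_text_node))
print(repr(string_value))
```

If the two results differ for the real page, dropping the trailing /text() from the query would be one variant to try.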

 

 

Here are my processes in XML form:

RapidMiner 5:

 

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="75">
<parameter key="url" value="http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*/?productdetails/.*"/>
<parameter key="store_with_matching_url" value=".*/?productdetails/.*"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="max_pages" value="1000"/>
<parameter key="domain" value="server"/>
<parameter key="delay" value="2000"/>
<parameter key="max_threads" value="4"/>
<parameter key="max_page_size" value="10000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="30">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="514" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Name" value="normalize-space(//*[@id='sharedPdPageProductTitle']/text())"/>
<parameter key="Unit Purchase Price" value="normalize-space(//*[@id='starting-price']/text())"/>
<parameter key="Processor" value="string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)"/>
</list>
<list key="namespaces"/>
<parameter key="ignore_CDATA" value="false"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

 

 

RapidMiner 7:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="85">
<parameter key="url" value="http://www.dell.com/en-us/work/shop/productdetails/inspiron-15-5565-laptop"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*/?productdetails/.*"/>
<parameter key="store_with_matching_url" value=".*/?productdetails/.*"/>
</list>
<parameter key="max_crawl_depth" value="2"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="max_pages" value="1000"/>
<parameter key="max_page_size" value="10000"/>
<parameter key="delay" value="2000"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="85">
<parameter key="create_word_vector" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Name" value="normalize-space(//*[@id='sharedPdPageProductTitle']/text())"/>
<parameter key="Unit Purchase Price" value="normalize-space(//*[@id='starting-price']/text())"/>
<parameter key="Processor" value="string(//h:span[contains(.,'Processor')]/../../../../following-sibling::h:div/h:div/h:div/h:div/h:div/h:span)"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Crawl Web" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I have had a similar problem: older XPath queries that I created stopped working. I assumed something on the web page had changed and didn't try to track it down, but based on this post I wonder whether it was instead caused by a change in RapidMiner's XPath implementation. Hopefully one of the developers can provide some insight on this topic.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @Thomas_Ott any chance we could ask one of the developers to take a look at this? Thanks.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Sure. I pinged 'em. 

  • trevor_davis Member Posts: 4 Contributor I

    Any response about this issue yet? Thank you for reaching out to a developer about this!

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Hi Trevor,

    Sorry for the delay. I just got confirmation that no changes have been made to the XPath implementation.

    Nevertheless, I would like to investigate this issue thoroughly and try to find a solution.

     

    Best,

    Edin
