Writing an XPATH query to retrieve text within quotes

a_heavey2 Member Posts: 5 Contributor I
edited December 2018 in Help

I'm having trouble retrieving text within double quotes from a webpage using information extraction. I already have a number of XPath queries that work as expected (every query in the XML process code below works apart from the last one). Does anyone know what the terminology is for retrieving text that is inside double quotes?

 

The following XPath works fine in Google Docs but not in RapidMiner. Google Docs still retrieves the text even though it's within quotes; RapidMiner returns blank values.

<parameter key="TEST" value="//*[@class=&quot;single-review&quot;]/text()"/>
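
For anyone unsure what the escaped quotes in that parameter do: they are XML escaping for the attribute value, not part of the page text. Below is a minimal Python sketch (standard library only; the helper name `decoded_xpath` is made up for illustration) that prints the XPath string an XML parser sees after decoding. For comparison it also decodes a variant where the first quote is escaped twice, which leaves a literal `&quot;` inside the XPath.

```python
import xml.etree.ElementTree as ET

# The TEST parameter with each double quote escaped once.
escaped_once = (
    '<parameter key="TEST" '
    'value="//*[@class=&quot;single-review&quot;]/text()"/>'
)

# A variant where the first quote has been escaped twice.
escaped_twice = (
    '<parameter key="TEST" '
    'value="//*[@class=&amp;quot;single-review&quot;]/text()"/>'
)


def decoded_xpath(parameter_xml: str) -> str:
    """Return the value attribute after the XML parser decodes entities."""
    return ET.fromstring(parameter_xml).get("value")


once = decoded_xpath(escaped_once)
twice = decoded_xpath(escaped_twice)

print(once)   # the XPath with plain double quotes around the class name
print(twice)  # still contains a literal &quot; before the class name
```

If the second form ever ends up in a saved process file, the query is no longer a valid attribute comparison, so it is worth checking the raw XML when a query behaves unexpectedly.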

Overall process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="34">
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files (2)" width="90" x="246" y="34">
<list key="text_directories">
<parameter key="all" value="C:\Users\heaveya\Desktop\Text-Mining\project_1"/>
</list>
<parameter key="use_file_extension_as_type" value="false"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="999"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information (2)" width="90" x="246" y="34">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="Game Title" value="//*[@class=&quot;id-app-title&quot;]/text()"/>
<parameter key="Date of First Review" value="//*[@class=&quot;review-date&quot;]/text()"/>
<parameter key="Description" value="//*[@jsname=&quot;C4s9Ed&quot;]/text()"/>
<parameter key="No:OfReviews" value="//*[@class=&quot;reviews-num&quot;]/text()"/>
<parameter key="Overall Average Rating" value="//*[@class=&quot;score&quot;]/text()"/>
<parameter key="Game Makers" value="//*[@class=&quot;document-subtitle primary&quot;]/h:span/text()"/>
<parameter key="No. of Downloads" value="//*[@itemprop=&quot;numDownloads&quot;]/text()"/>
<parameter key="Last Updated" value="//*[@itemprop=&quot;datePublished&quot;]/text()"/>
<parameter key="What's new" value="//*[@class=&quot;recent-change&quot;]/text()"/>
<parameter key="What's new 1" value="//h:div[2][contains(@class,'recent-change')]/text()"/>
<parameter key="What's new 2" value="//h:div[3][contains(@class,'recent-change')]/text()"/>
<parameter key="What's new 3" value="//h:div[4][contains(@class,'recent-change')]/text()"/>
<parameter key="TEST" value="//*[@class=&quot;single-review&quot;]/text()"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Answers

  • a_heavey2 Member Posts: 5 Contributor I

    Hi again,

    Let me try to explain my problem a little more. As you can see in the screenshot below, the phrase is inside double quotes, and because of that I can't get it to appear in my results by simply appending /text() as I have been doing for the other queries. If anyone knows the syntax to retrieve the text within the quotes, I should be OK. Even if it only works in Google Docs, I might be able to figure it out through trial and error.

     

    Capture.PNG

    Thanks for reading,

    Aidan

  • Pavithra_Rao Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist

    Hi,

     

    Would it be possible to share sample data here (specifically, the part of the data that matches the XPath query "TEST") so that I can try to recreate the error and resolve it?

     

    Cheers,

  • a_heavey2 Member Posts: 5 Contributor I

    Hi Pavithra,

    My process starts with a Process Documents from Files operator (using a repo; the attached file is one such file).

    Inside it I have an Extract Information operator with nominal and XPath chosen. I also have "extract text only" (content type: txt) and "assume html" ticked on.

    The data is from this page (which is also the text file): https://play.google.com/store/apps/details?id=com.squareenixmontreal.hitmansniperandroid&hl=enl

    I then used Inspect Element on the review to find the class that's in my TEST parameter:

    Capture.PNG

     

    Thanks for looking into it,

    Regards,

    Aidan

  • a_heavey2 Member Posts: 5 Contributor I

    Hi Pavithra,

     

    Thanks a lot, your post was very helpful. The Amazon review site is very similar to mine, but I still haven't been able to figure it out. If anyone else has any ideas, great! If not, I can have another go tomorrow with a clear head.

     

    Rgds,

    Aidan

  • a_heavey2 Member Posts: 5 Contributor I

    Hi again,

    I've played around with this again and haven't been able to get it to work.

    The games that have review content come up blank, whereas games without review content come up with a question mark. So I believe the query is matching; it just isn't returning any content for me. I've also tried //h:div[@class='review-text']/descendant::text(), the same with //h* at the start, and //h:div[1][contains(@class,'review-text')]/text(). These seem to be the correct syntax but don't display values.

    Also, //h:span[@class='review-title']/text() displays values that are not in quotes (only a very small number of them).

    Would anyone have any more suggestions for me? 
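
For readers comparing the /text() and /descendant::text() queries tried above, here is a small Python sketch of how the two differ in XPath. It is only an illustration under stated assumptions: the markup is hypothetical (not copied from the Google Play page), and because Python's standard `xml.etree.ElementTree` does not support the `text()` node test, the two helper functions `direct_text` and `descendant_text` (names made up here) emulate the two selections. It does not explain why RapidMiner returned blanks for both forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT taken from the real page: the review text sits
# inside a nested element rather than directly under "single-review".
SAMPLE = """
<body>
  <div class="single-review">
    <div class="review-body"><span class="review-title">Great</span> Lots of fun to play.</div>
  </div>
</body>
"""


def direct_text(element):
    """Emulate `element/text()`: text nodes that are direct children only."""
    parts = [element.text or ""]
    parts += [child.tail or "" for child in element]
    return [p.strip() for p in parts if p.strip()]


def descendant_text(element):
    """Emulate `element/descendant::text()`: text nodes at any depth."""
    return [t.strip() for t in element.itertext() if t.strip()]


root = ET.fromstring(SAMPLE)
review = root.find(".//*[@class='single-review']")

direct = direct_text(review)
nested = descendant_text(review)

print(direct)  # only whitespace sits directly under the matched element
print(nested)  # the title and body text come from nested elements
```

In markup shaped like this sample, a query ending in /text() matches the element but finds no non-whitespace text directly beneath it, while the descendant axis reaches the nested text. Whether the real page is structured this way would need to be checked against the saved file.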
