How to use XPath for extracting multiple review data from a single webpage

subhasisdasgupt Member Posts: 15 Contributor II
edited November 2018 in Help
I am new to XPath, but I need to extract multiple reviews from a single webpage. My objective is to extract each reviewer's name, the date of the review, the rating, and the entire review text. Each reviewer should be a separate record in my example set. Is there any way to do that? I am working with review pages from epinion.com.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    You could download the complete page with Get Page or Get Pages and pass it into a Process Documents operator. Inside that operator, place a Cut Document operator that splits the page into segments, and inside Cut Document run an Extract Information operator on each segment.

    Best regards,
    Marius
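
The cut-then-extract pattern described above can be illustrated outside RapidMiner. The following is a minimal Python sketch, not the RapidMiner implementation: it assumes a small, well-formed, namespace-free sample page (`PAGE`, invented for illustration) whose class names mirror the XPath queries posted in this thread, and it relies on the limited XPath subset supported by the standard library's `xml.etree.ElementTree`. The names `extract_reviews` and `text_of` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample page (assumption: well-formed, no XHTML namespace).
PAGE = """
<html>
  <body>
    <div class="fclear fk-review fk-position-relative line">
      <a profile_name="alice">Alice</a>
      <div class="date line fk-font-small">12 Mar 2013</div>
      <div class="fk-stars-small" title="4 stars"></div>
      <p class="line bmargin10">Good phone for the price.</p>
    </div>
    <div class="fclear fk-review fk-position-relative line">
      <a profile_name="bob">Bob</a>
      <div class="date line fk-font-small">15 Mar 2013</div>
      <div class="fk-stars-small" title="2 stars"></div>
      <p class="line bmargin10">Battery drains too fast.</p>
    </div>
  </body>
</html>
"""

# "Cut" query: one match per review block.
REVIEW_BLOCK = ".//div[@class='fclear fk-review fk-position-relative line']"


def text_of(element):
    """Return the stripped text of an element, or None if it was not found."""
    if element is None:
        return None
    return (element.text or "").strip()


def extract_reviews(page_source):
    """Cut the page into review blocks, then extract fields from each block."""
    root = ET.fromstring(page_source.strip())
    records = []
    for block in root.findall(REVIEW_BLOCK):
        # "Extract" queries are evaluated relative to the current block only.
        rating = block.find(".//div[@class='fk-stars-small']")
        records.append({
            "Reviewer": text_of(block.find(".//a[@profile_name]")),
            "Review date": text_of(block.find(".//div[@class='date line fk-font-small']")),
            "Rating": rating.get("title") if rating is not None else None,
            "Review": text_of(block.find(".//p[@class='line bmargin10']")),
        })
    return records


for record in extract_reviews(PAGE):
    print(record)
```

Each review block yields one record, which corresponds to one example per review in the resulting example set. Real pages are rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.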
  • subhasisdasgupt Member Posts: 15 Contributor II
    Thanks, Marius. Your suggestion helped me extract what I wanted. However, I am facing a second problem. Because I am using the Extract Information operator, all my extracted values end up as special attributes, and I want to do further analysis on them, such as tokenization and clustering. One of these attributes is "Review", which holds the consumer reviews. I don't know how to select only this special attribute for regular text processing while keeping the other special attributes intact. I tried the Select Attributes operator, but it did not show the special attributes in subset selection mode. I also tried exporting the output to a CSV file, but RapidMiner could not read the exported file back in. So I am a bit stuck here. Can you suggest a solution?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    You can use Set Role to set the role back to "regular". If you want to do text analysis on the attribute you probably also need the Nominal to Text operator.

    Best regards,
    Marius
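
Once the review text is available as a regular text attribute, the goal stated earlier in the thread (process only the review text while leaving the other fields untouched) amounts to something like the following. This is a minimal Python sketch under assumed names (`tokenize`, `tokenize_review_only`, and the sample record are all illustrative); it does not reproduce what RapidMiner's text operators do internally.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tokenize_review_only(records, text_field="Review"):
    """Return copies of the records with a token list added for one field.

    All other fields are copied through unchanged.
    """
    processed = []
    for record in records:
        enriched = dict(record)  # shallow copy keeps the original intact
        enriched[text_field + "_tokens"] = tokenize(record.get(text_field) or "")
        processed.append(enriched)
    return processed


# Hypothetical sample record, invented for illustration.
records = [
    {"Reviewer": "Alice", "Rating": "4 stars", "Review": "Good phone, great battery."},
]
result = tokenize_review_only(records)
print(result[0]["Review_tokens"])
```

Only the chosen text field is processed; the reviewer and rating values pass through as they were.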
  • subhasisdasgupt Member Posts: 15 Contributor II
    Dear Marius,

    I am attaching the process XML so that you can have a look. I am unable to save the output in a proper .csv format that I can import again later. I want to do some analysis on the customer reviews, but the "Review" attribute does not appear in the Select Attributes operator, nor in the Set Role operator once it is connected.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.005">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="S3" value="D:\Web_crawl_S3"/>
              <parameter key="Advance S" value="D:\Web Data 1"/>
            </list>
            <parameter key="extract_text_only" value="false"/>
            <parameter key="create_word_vector" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="5.3.000" expanded="true" height="60" name="Cut Document (2)" width="90" x="246" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Start" value="//h:div[@class='fclear fk-review fk-position-relative line']"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <process expanded="true">
                  <operator activated="true" class="text:remove_document_parts" compatibility="5.3.000" expanded="true" height="60" name="Remove Document Parts (2)" width="90" x="112" y="30">
                    <parameter key="deletion_regex" value="(&lt;br clear=&quot;none&quot; /&gt;)"/>
                  </operator>
                  <operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information (5)" width="90" x="246" y="30">
                    <parameter key="query_type" value="XPath"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries"/>
                    <list key="regular_region_queries"/>
                    <list key="xpath_queries">
                      <parameter key="Reviewer" value="//h:a[@profile_name]/text()"/>
                      <parameter key="Review date" value="//h:div[@class='date line fk-font-small']/text()"/>
                      <parameter key="Rating" value="//h:div[@class='fk-stars-small']/@title"/>
                      <parameter key="Review" value="//h:p[@class='line bmargin10']/text()"/>
                    </list>
                    <list key="namespaces"/>
                    <list key="index_queries"/>
                  </operator>
                  <connect from_port="segment" to_op="Remove Document Parts (2)" to_port="document"/>
                  <connect from_op="Remove Document Parts (2)" from_port="document" to_op="Extract Information (5)" to_port="document"/>
                  <connect from_op="Extract Information (5)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document (2)" to_port="document"/>
              <connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Can you help me in this regard?
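
The thread does not establish why the exported CSV could not be read back. One frequent cause with free-text fields is unquoted commas, quotation marks, or line breaks inside the review text; that this applies here is an assumption. As a hedged illustration only, the Python sketch below writes records with every field quoted and reads them back unchanged. The field list matches the attribute names in the process above; the function names and the sample record are hypothetical.

```python
import csv
import io

FIELDS = ["Reviewer", "Review date", "Rating", "Review"]


def write_reviews(records, handle):
    """Write records as CSV, quoting every field so free text stays intact."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(records)


def read_reviews(handle):
    """Read the CSV back into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(handle)]


# Hypothetical record whose review text contains a comma, quotes, and a line break.
records = [{
    "Reviewer": "Alice",
    "Review date": "12 Mar 2013",
    "Rating": "4 stars",
    "Review": 'Good phone, "great" battery.\nWould buy again.',
}]

buffer = io.StringIO(newline="")  # in-memory stand-in for a file on disk
write_reviews(records, buffer)
buffer.seek(0)
restored = read_reviews(buffer)
print(restored == records)
```

When importing such a file into another tool, the reader's quote character and separator settings need to match the ones used for writing.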
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    If the attribute is in the data but you can't see it in subsequent operators, you can simply type its name into the corresponding parameter field, even if it doesn't show up in the list.

    Best regards,
    Marius