Advanced web crawling question

SteveMM · January 2017

I am working on a project that requires me to web crawl macys.com. What I would like to do is bring back reviews from this website, but have it group the reviews based on similar topics. Is this something I can do using web crawling along with some possible grouping functionality? Just wondering if RapidMiner can give me that level of detail.

Thank you

Thomas_Ott · January 2017

Yes. You'll need the Web and Text Mining extension and will probably scrape those reviews using XPath.

kayman · January 2017

Perfectly possible, but since the website is very dynamic and uses very dirty code, it is not very straightforward. Going through the whole process would take a bit too much time, but this should get you started:

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="get page" width="90" x="45" y="85">
        <parameter key="url" value="http://www1.macys.com/shop/mens-clothing/shop-all-mens-footwear?id=55822"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="connection_timeout" value="50000"/>
        <parameter key="read_timeout" value="50000"/>
        <parameter key="accept_cookies" value="all"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="313" y="187">
        <parameter key="text" value="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&#10;&lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;&gt;&#10;&lt;xsl:output method=&quot;xml&quot; version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt;&#10;&lt;xsl:template match=&quot;/&quot;&gt;&#10;&lt;root&gt;&#10;&lt;xsl:for-each select=&quot;//li[.//div[@class='pdpreviews']]&quot;&gt;&#10;&lt;row &#10;model=&quot;{.//div[@class='shortDescription']/a/normalize-space(.)}&quot; &#10;rating=&quot;{.//div[@class='pdpreviews']/span[@class='rating']/span/@style}&quot;&#10;reviewqty=&quot;{.//div[@class='pdpreviews']/span[2]/normalize-space(.)}&quot;&#10;productpage=&quot;{concat('http://www1.macys.com',.//div[@class='fullColorOverlayOff']/a/@href)}&quot;&#10;/&gt;&#10;&lt;/xsl:for-each&gt;&#10;&lt;/root&gt;&#10;&lt;/xsl:template&gt;&#10;&lt;/xsl:stylesheet&gt;"/>
      </operator>
      <operator activated="true" class="text:replace_tokens" compatibility="7.3.000" expanded="true" height="68" name="Replace Tokens" width="90" x="179" y="85">
        <list key="replace_dictionary">
          <parameter key="(?sm)^.*(&lt;ul id=&quot;thumbnails&quot;.*?&lt;/ul&gt;).*$" value="&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;$1&lt;/body&gt;&lt;/html&gt;"/>
          <parameter key="(?sm)(&lt;img[^&gt;].*?)/?&gt;" value="$1/&gt;"/>
          <parameter key="(?sm)(&lt;input[^&gt;].*?)/?&gt;" value="$1/&gt;"/>
          <parameter key="&amp;" value="&amp;amp;"/>
          <parameter key="(?sm)(&lt;meta[^&gt;].*?)/?&gt;" value="$1/&gt;"/>
          <parameter key="&lt;br&gt;" value="&lt;br/&gt;"/>
          <parameter key="&lt;!--\s.*?\s--&gt;(.)?" value="$1"/>
        </list>
      </operator>
      <operator activated="true" class="text:combine_documents" compatibility="7.3.000" expanded="true" height="82" name="Combine Documents" width="90" x="313" y="85"/>
      <operator activated="true" class="text:process_xslt" compatibility="7.3.000" expanded="true" height="82" name="Process Xslt" width="90" x="447" y="187"/>
      <operator activated="true" class="text:cut_document" compatibility="7.3.000" expanded="true" height="68" name="Cut Document" width="90" x="45" y="340">
        <parameter key="query_type" value="Regular Region"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries">
          <parameter key="row" value="&lt;row./&gt;"/>
        </list>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.3.000" expanded="true" height="68" name="Extract Information" width="90" x="112" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="model" value="//@model"/>
              <parameter key="rating" value="//@rating"/>
              <parameter key="reviewqty" value="//@reviewqty"/>
              <parameter key="productpage" value="//@productpage"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="segment" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="340">
        <parameter key="text_attribute" value="pages"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="340">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="pages|query_key"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <connect from_op="get page" from_port="output" to_op="Replace Tokens" to_port="document"/>
      <connect from_op="Create Document" from_port="output" to_op="Process Xslt" to_port="xslt document"/>
      <connect from_op="Replace Tokens" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
      <connect from_op="Combine Documents" from_port="document" to_op="Process Xslt" to_port="document"/>
      <connect from_op="Process Xslt" from_port="document" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

What does this do?

-> It starts with a single URL and cleans the dirty code so that only the relevant page blocks remain. Normally you would use the HTML to XML converter, but given the quality of the source code that is not a good option here. The cleanup works fine for this specific page; you may need to add more rules if you test other pages.

-> Next, a simple XSLT is applied to get all the products that have reviews, and the document data is converted to an example set. In the sample it contains the product name, product page, review quantity and rating, but of course you can add whatever you want using the same logic.
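
For anyone who wants to see these two steps outside RapidMiner, here is a minimal Python sketch of the same idea. It is not the process itself, only an illustration under a few assumptions: the container id and class names (thumbnails, pdpreviews, shortDescription, fullColorOverlayOff) are taken from the process above, the sample markup is invented, and the function names clean_listing_html and extract_products are hypothetical. The ampersand rule is a small variation on the Replace Tokens entry: it leaves existing entities alone instead of escaping every ampersand.

```python
import re
import xml.etree.ElementTree as ET

# Substitutions modelled on the Replace Tokens operator, applied in order.
CLEANUP_RULES = [
    # Keep only the product list and wrap it in a minimal HTML skeleton.
    (r'(?s)^.*(<ul id="thumbnails".*?</ul>).*$',
     r'<html><head></head><body>\1</body></html>'),
    # Self-close void elements so the fragment becomes well-formed XML.
    (r'(?s)(<(?:img|input|meta)\b[^>]*?)/?>', r'\1/>'),
    (r'<br>', '<br/>'),
    # Escape bare ampersands (variation: existing entities are left alone).
    (r'&(?!(?:amp|lt|gt|quot|apos|#\d+);)', '&amp;'),
    # Drop HTML comments.
    (r'(?s)<!--.*?-->', ''),
]


def clean_listing_html(raw_html):
    """Reduce a listing page to its product list and make it parseable as XML."""
    cleaned = raw_html
    for pattern, replacement in CLEANUP_RULES:
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned


def _text(element):
    """Whitespace-normalised text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_products(cleaned_xml, base_url="http://www1.macys.com"):
    """Return one dict per <li> that contains a reviews block."""
    rows = []
    for li in ET.fromstring(cleaned_xml).iter("li"):
        reviews = li.find(".//div[@class='pdpreviews']")
        if reviews is None:
            continue  # products without reviews are skipped, as in the XSLT
        stars = reviews.find("span[@class='rating']/span")
        spans = reviews.findall("span")
        link = li.find(".//div[@class='fullColorOverlayOff']/a")
        rows.append({
            "model": _text(li.find(".//div[@class='shortDescription']/a")),
            "rating": stars.get("style", "") if stars is not None else "",
            "reviewqty": _text(spans[1]) if len(spans) > 1 else "",
            "productpage": base_url + link.get("href", "") if link is not None else "",
        })
    return rows


# Invented sample markup, only loosely shaped like a listing page.
sample = """<html><body><div>menu</div>
<ul id="thumbnails">
<li><div class="fullColorOverlayOff"><a href="/shop/product/x?ID=1"><img src="a/b.png"></a></div>
<div class="shortDescription"><a href="/shop/product/x?ID=1">  Brand & Co
 Loafer </a></div>
<div class="pdpreviews"><span class="rating"><span style="width:80%"></span></span><span>(12)</span></div></li>
<li><div class="shortDescription"><a href="/y">No reviews yet</a></div></li>
</ul><div>footer</div></body></html>"""

for row in extract_products(clean_listing_html(sample)):
    print(row)
```

The regex cleanup is as fragile here as it is in the process: it works for markup shaped like the sample, and other pages will likely need extra rules.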

From here onwards it is basically a matter of looping over the examples. For each URL you apply the same logic: open the page, clean out all the rubbish so you are left with only the reviews, and so on. The example here focuses on a single collection page, but you can also retrieve the number of pages for a given category using XPath and use that in a loop to travel through all the pages.

So your final flow could look like this:

-> create a csv with starting pages (your categories)

-> loop through these one by one, get the number of pages for each category and use this as a loop variable to get all the product collection pages.

-> for every product that has reviews, get the final page (the actual single product page)

-> Get the reviews, store them and play around

-> back to start
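
The flow above boils down to a few nested loops. The following is only an outline in Python under stated assumptions: crawl_reviews, fetch, count_pages, page_url, list_products and extract_reviews are hypothetical names, and the callables would have to be written for the real site. The fake_site dictionary stands in for the web so the example runs offline.

```python
def crawl_reviews(category_urls, fetch, count_pages, page_url,
                  list_products, extract_reviews):
    """Yield (product_url, review) pairs for every category, page and product."""
    for category_url in category_urls:
        # The first request only tells us how many listing pages exist.
        total_pages = count_pages(fetch(category_url))
        for page_number in range(1, total_pages + 1):
            listing = fetch(page_url(category_url, page_number))
            # list_products should return only products that have reviews.
            for product_url in list_products(listing):
                for review in extract_reviews(fetch(product_url)):
                    yield product_url, review


# Toy stand-in for the website: URL -> page content (all invented).
fake_site = {
    "cat": "pages=2",
    "cat?page=1": "p1 p2",
    "cat?page=2": "p3",
    "p1": "great fit|runs small",
    "p2": "",
    "p3": "comfortable",
}

pairs = list(crawl_reviews(
    category_urls=["cat"],
    fetch=fake_site.__getitem__,
    count_pages=lambda html: int(html.split("=")[1]),
    page_url=lambda url, n: f"{url}?page={n}",
    list_products=lambda html: html.split(),
    extract_reviews=lambda html: [r for r in html.split("|") if r],
))

for product_url, review in pairs:
    print(product_url, review)
```

Passing the page-specific logic in as callables keeps the loop structure separate from the messy parsing, which is the part that will change from site to site.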

Hope this helps

Telcontar120 · January 2017

Wow @kayman, great sample process!

The only thing to add here is that you can enhance this process either by using "get pages" with a txt file of the specific links that you want to retrieve, or by using the "crawl web" or "process documents from web" operators and specifying a set of automatic crawling rules. That should make it a bit easier to cycle through a lot of different categories/pages in the site.

kayman · January 2017

True, but it might not work very well with this specific website given the high level of dynamic code and redirections behind the scenes. The structure of the page makes my eyes bleed tbh :-)

Therefore the risk is pretty high that the original poster would get lost or get nothing when relying on crawling rules for this site. But of course for any 'normal' site these are the first operators to look at.
