"Web Mining crawling prices of an internet page"

luiz_vidal · January 2018

Guys,

I am trying to build a process that crawls the pages of a site in order to collect the prices of a variety of products. I created a loop because I want to fetch the site page by page and save each page to disk. After that I want to read the saved HTML and extract only, for example, the product name and the price, but I haven't been able to do that. Could you please help me?
I was able to fetch the pages in sequence, but I can't keep them on disk because each page overwrites the previous one.

First I want to collect the pages:

https://www.buscape.com.br/cerveja?pagina=1

https://www.buscape.com.br/cerveja?pagina=2

...

https://www.buscape.com.br/cerveja?pagina=200

My process is below:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="179" y="34">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
            <parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="cerveja"/>
            </list>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="add_content_as_attribute" value="true"/>
            <parameter key="write_pages_to_disk" value="true"/>
            <parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
            <list key="function_descriptions">
              <parameter key="page" value="%{page}"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
          <connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop" from_port="output 1" to_port="result 2"/>
      <connect from_op="Loop" from_port="output 2" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
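
For readers who want to see the intended loop outside RapidMiner, here is a minimal Python sketch of "fetch page by page and save each page under its own name". It is only an illustration under stated assumptions: the URL pattern comes from the post, while the file naming scheme (`cerveja_001.html`), the helper names, and the `fake_fetch` stand-in (used so the sketch runs without network access) are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

# URL pattern taken from the post; {page} is filled in per iteration.
BASE_URL = "https://www.buscape.com.br/cerveja?pagina={page}"


def page_url(page: int) -> str:
    """Build the URL of one result page."""
    return BASE_URL.format(page=page)


def save_pages(first: int, last: int, out_dir: Path,
               fetch: Callable[[str], str]) -> List[Path]:
    """Fetch pages first..last and write each one to its own file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page in range(first, last + 1):
        html = fetch(page_url(page))
        # The page number is part of the file name (hypothetical scheme),
        # so a later page never overwrites an earlier one.
        target = out_dir / f"cerveja_{page:03d}.html"
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    return f"<html><body>{url}</body></html>"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in save_pages(1, 3, Path(tmp), fake_fetch):
            print(path.name)
```

The key design point is that the loop counter appears both in the URL and in the output file name; replacing `fake_fetch` with a real download function would be the next step.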

Once all the pages are collected, I tried to use XPath to extract only the fields I need from the HTML.

But when I copy and paste the XPath from Google Chrome, it doesn't work.

Could you please help me create a simple example process?

Thanks in advance.

luiz_vidal · January 2018

Ugh,

After almost giving up, I was able to retrieve the piece of data I want. The problem is that it returns only the first match it finds.

I need to find a way to fetch all the product names and prices:

//*[@name="priceProduct"]

//*[@name="productName"]
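
The difference between "first match" and "all matches" can be sketched in Python with the standard library's ElementTree. This is not how RapidMiner evaluates the query; the sample markup below is made up and far simpler than the real page, and only the two `name` attributes are taken from the post.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample; the real page markup will differ.
SAMPLE = """
<html><body>
  <div><span name="productName">Beer A 600 ml</span><span name="priceProduct">14,99</span></div>
  <div><span name="productName">Beer B 355 ml</span><span name="priceProduct">3,49</span></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# find() stops at the first matching element ...
first_name = root.find(".//*[@name='productName']")

# ... while findall() returns every match, in document order.
all_names = root.findall(".//*[@name='productName']")
all_prices = root.findall(".//*[@name='priceProduct']")

# Pair names and prices by position; this assumes every product has both.
rows = [(n.text, p.text) for n, p in zip(all_names, all_prices)]

if __name__ == "__main__":
    print(first_name.text)
    for name, price in rows:
        print(name, price)
```

The same XPath expression matches every product; whether you get one value or a list depends on how the result is consumed.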

miner · January 2018

Hi Luiz-Vidal,

I came across that issue a few days ago.

Simply copying and pasting the XPath from Google Chrome won't work because of the namespace.

Google gives

//*[@id="product_383527"]/div/div[1]/div[3]/div[1]/a/span for the first product: Paulistânia Puro Malte Premium Lager Garrafa 600 ml 1 Unidade and

//*[@id="product_383527"]/div/div[2]/div[1]/div[1]/a/span for the price 14,99

In RM you have to use //*[@id="product_383527"]/h:div/h:div[1]/h:div[3]/h:div[1]/h:a/h:span

and //*[@id="product_383527"]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span

See the discussion here: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Extracting-Information-With-XPath/td-p/9883

Cheers

miner
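
The namespace effect described above can be reproduced in a few lines of Python. This is a sketch with ElementTree rather than RapidMiner, and the sample document is made up: it only assumes that the parsed page lives in the XHTML namespace, which is why unprefixed element names stop matching. The prefix `h` is bound by hand here to mirror the `h:` used in the post.

```python
import xml.etree.ElementTree as ET

# Made-up XHTML sample; only the id "product_383527" comes from the thread.
XHTML = """
<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <div id="product_383527"><div><div><a><span>Sample Lager 600 ml</span></a></div></div></div>
</body></html>
"""

root = ET.fromstring(XHTML)

# Bind the prefix "h" to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Unprefixed steps look for elements in no namespace, so nothing matches.
plain = root.findall(".//*[@id='product_383527']/div/div/a/span")

# Prefixed steps match the namespaced elements.
prefixed = root.findall(".//*[@id='product_383527']/h:div/h:div/h:a/h:span", NS)

if __name__ == "__main__":
    print(len(plain))
    print([e.text for e in prefixed])
```

Note that `*` with an attribute test still matches the namespaced element, which is why only the element-name steps after it need the prefix.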

luiz_vidal · January 2018

Hey,

Thanks for your reply

But I still can't get it to work.

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="false" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="179" y="34">
        <parameter key="iteration_macro" value="page"/>
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
            <parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="cerveja"/>
            </list>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="add_content_as_attribute" value="true"/>
            <parameter key="write_pages_to_disk" value="true"/>
            <parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja\"/>
          </operator>
          <operator activated="true" class="rename_file" compatibility="8.0.001" expanded="true" height="82" name="Rename File" width="90" x="514" y="238">
            <parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja\0.txt"/>
            <parameter key="new_name" value="%{page}.txt"/>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
            <list key="function_descriptions">
              <parameter key="page" value="%{page} + 1"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Crawl Web" from_port="example set" to_op="Rename File" to_port="through 1"/>
          <connect from_op="Rename File" from_port="through 1" to_port="output 2"/>
          <connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="187">
        <list key="text_directories">
          <parameter key="cerveja" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
        </list>
        <parameter key="encoding" value="ISO-8859-1"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="45" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="product" value="//*[@id=&amp;quot;product_383527&quot;]/h:div/dh:iv[1]/h:div[3]/h:div[1]/h:a/h:span"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="85">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="product" value="//*[@id=&amp;quot;product_383527&quot;]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Any idea what I am doing wrong?

miner · January 2018

I'm not quite sure.

The website is using product-id for reference.

For the first product I picked, it was //*[@id="product_383527"]. Assuming the id changes for every product, that XPath only works for this specific product.

Then you would have to go up the tree to get a "non-id-related" node and then pick the detail from there.

That would be /html/body/main/div[3]/div/div[3]/section?
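
The idea of starting from a node that does not depend on one product id, then reading the details relative to each product, might look like this in Python. It is a sketch under assumptions: the sample markup, the `class` names, and the second id (`product_111111`) are invented for illustration, and the real page structure has to be checked in the browser.

```python
import xml.etree.ElementTree as ET

# Made-up sample: a container holding one block per product plus unrelated content.
SAMPLE = """
<section>
  <div id="product_383527"><span class="name">Sample Lager 600 ml</span><span class="price">14,99</span></div>
  <div id="product_111111"><span class="name">Sample Pilsner 355 ml</span><span class="price">3,49</span></div>
  <div id="banner">ad</div>
</section>
"""

root = ET.fromstring(SAMPLE)

products = []
# Step 1: walk the children of the id-independent container.
for node in root.findall("./div"):
    # Keep only product blocks, whatever their numeric id is.
    if not node.get("id", "").startswith("product_"):
        continue
    # Step 2: read the details with paths relative to this product node.
    name = node.findtext("./span[@class='name']")
    price = node.findtext("./span[@class='price']")
    products.append({"id": node.get("id"), "name": name, "price": price})

if __name__ == "__main__":
    for product in products:
        print(product)
```

This two-step pattern (split the page into one segment per product, then extract fields inside each segment) is roughly the role that Cut Document and the nested Extract Information operator play in the process posted above.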

luiz_vidal · January 2018

Sorry,

I know nothing about XPath, and I've been trying all day to get it working.

I try and try, and the extraction only returns true, false, or ?

With //*[@name="productName"], it returns TRUE or FALSE, but what I want is the value of productName and of priceProduct, which will probably have to be returned as a list, or as one huge string to be split. I don't know yet.
A victory would be just getting one value returned.

miner · January 2018

Hi @luiz_vidal

xpath can be a mess...

A good way to test XPath strings is a Google Docs spreadsheet: you can quickly copy the XPath from Chrome into the spreadsheet and check the result. This is much faster than testing the structure in RM.

On YouTube you can find a lot of tutorials on XPath and Google Docs.

My recommendation is the video by community member el chief; find it here: https://www.youtube.com/watch?v=UG6223p9fZE

Cheers

miner

luiz_vidal · February 2018

Overall,

It was a matter of learning how to use XPath and configuring it correctly across the operators.

Thanks for your help

Thomas_Ott · February 2018

"xpath can be a mess..."

Definitely agree, but it's powerful when it works.

canh99alex · July 2018

Help me please. Which currency is best to mine? At https://en.bitcoinwiki.org/wiki/Web_mining it is written that experts advise "monero".

SGolbert · July 2018

Hi,

Trying the XPaths in a shell environment can make things faster.

A simple command line tool is XML Shell:

http://www.xmlsh.org/CommandXPath

You can also find the same functionality in Python's Scrapy, but it is overkill for what you need here.

Regards,

Sebastian

Altair RapidMiner Community